Knowledge distillation [Hinton et al., 2015] is a method that transfers domain-specific knowledge from a large teacher model to a small student model. This approach lets us build small, task-specific language models that achieve performance comparable to much larger models, without the need to annotate thousands of examples by hand.
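For reference, the original formulation in [Hinton et al., 2015] trains the student to match the teacher's temperature-softened output distribution, blended with the usual hard-label loss. Below is a minimal PyTorch sketch of that loss; the temperature and mixing weight are illustrative defaults, not values prescribed by the paper or by our setup.

```python
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target distillation loss in the style of Hinton et al. (2015)."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: standard cross-entropy against the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    # Blend the two terms; alpha controls how much weight the teacher gets.
    return alpha * kd + (1.0 - alpha) * ce
```

The approach described next differs in that the teacher's knowledge reaches the student through synthetic training data rather than through direct logit matching.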
We use the teacher to generate synthetic data and train the student model on it with a loss function aligned with the user's task. In the process, the student learns to emulate the teacher's target skills or domain knowledge, effectively acquiring similar capabilities.
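As a concrete illustration of this pipeline, here is a minimal sketch using PyTorch and Hugging Face Transformers. The checkpoint names, prompts, and hyperparameters are placeholders for whatever teacher, student, and task you actually use; this is a sketch of the idea, not a production training loop.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints; substitute your own teacher and student models.
TEACHER_NAME = "large-teacher-model"  # hypothetical large teacher checkpoint
STUDENT_NAME = "gpt2"                 # small student to fine-tune

teacher_tok = AutoTokenizer.from_pretrained(TEACHER_NAME)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_NAME).eval()

student_tok = AutoTokenizer.from_pretrained(STUDENT_NAME)
student = AutoModelForCausalLM.from_pretrained(STUDENT_NAME)

# 1) Generate synthetic training data: the teacher answers task prompts.
prompts = [
    "Classify the sentiment of: 'The battery life is fantastic.'",
    "Classify the sentiment of: 'The screen cracked after a week.'",
]
synthetic_texts = []
with torch.no_grad():
    for prompt in prompts:
        inputs = teacher_tok(prompt, return_tensors="pt")
        output_ids = teacher.generate(**inputs, max_new_tokens=32, do_sample=False)
        # Keep the prompt plus the teacher's completion as one training sequence.
        synthetic_texts.append(teacher_tok.decode(output_ids[0], skip_special_tokens=True))

# 2) Fine-tune the student on the teacher-generated data with a standard
#    next-token cross-entropy loss (the task-aligned objective here).
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)
student.train()
for text in synthetic_texts:
    batch = student_tok(text, return_tensors="pt")
    outputs = student(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice you would generate far more examples, batch and pad them, and add the usual training machinery (multiple epochs, evaluation on held-out task data), but the core loop is the same: teacher produces labeled data, student is trained on it.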
[Hinton et al., 2015]: https://arxiv.org/abs/1503.02531