Microsoft Research · 323K
Published January 17, 2020, 18:47
Sequence-level knowledge distillation (KD) -- training a student model on targets decoded from a pre-trained teacher model -- is widely used in sequence generation applications such as model compression, non-autoregressive translation (NAT), and low-resource translation. However, the reasons behind its success have so far remained unclear. In this talk, we try to understand KD in two scenarios: (1) learning a weak student from a strong teacher while keeping the same parallel data used to train the teacher; (2) learning a student from a teacher of equal size, with targets generated from additional monolingual data.
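As a rough illustration of how the student's training data might be built in the two scenarios, here is a minimal sketch. It is not taken from the talk: the function names, the `toy_teacher` stand-in, and the reading of "monolingual data" as source-side sentences are all assumptions made for this example.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: a teacher is any callable that decodes a
# source sentence into a target sentence (e.g. via beam search).
Decoder = Callable[[str], str]


def distill_parallel(teacher_decode: Decoder,
                     parallel: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Scenario (1): keep the sources of the teacher's parallel data and
    replace each human reference with the teacher's decoded output."""
    return [(src, teacher_decode(src)) for src, _reference in parallel]


def distill_monolingual(teacher_decode: Decoder,
                        monolingual: Iterable[str]) -> List[Tuple[str, str]]:
    """Scenario (2): decode additional monolingual sentences (assumed here
    to be source-side) to create new training pairs for the student."""
    return [(src, teacher_decode(src)) for src in monolingual]


def toy_teacher(src: str) -> str:
    """Stand-in for a real pre-trained model; it only upper-cases the input."""
    return src.upper()


parallel_data = [("guten morgen", "good morning")]
student_data_1 = distill_parallel(toy_teacher, parallel_data)
student_data_2 = distill_monolingual(toy_teacher, ["danke"])
```

In both cases the student never sees the original references; it is trained only on pairs whose target side came from the teacher.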
Talk slides: microsoft.com/en-us/research/u...
See more on this and other talks at Microsoft Research: microsoft.com/en-us/research/v...