Microsoft Research
Published August 17, 2016, 3:00
Modern approaches to machine translation are data-driven. Statistical translation models are trained using parallel text, which consists of sentences in one language paired with their translations into another language. One advantage of statistical translation models is that they are language-independent, meaning that they can be applied to any language for which we have training data. Unfortunately, for most of the world's languages, we do not have sufficient training data. In this talk, I will detail my experiments using Amazon's Mechanical Turk to create crowd-sourced translations for 'low-resource' languages that lack training data. I will discuss a variety of quality-control strategies that allow non-expert translators to produce translations approaching the level of professional translators, at a fraction of the cost. I'll analyze how the quality of training data affects the performance of the statistical translation model trained on it, and ask the question: should we even bother with quality control? I'll present feasibility studies to see which low-resource languages it is possible to collect data for, and volume studies to see how much data we can expect to create in a short period. Finally, I will discuss the implications of inexpensive, high-quality translations for applications including national defense, disaster response, research, and online translation systems.