Natural Language Processing in Multiple Domains: Linking the Unknown to the Known

Published August 17, 2016, 20:54
The key to creating scalable, robust natural language processing (NLP) systems is to exploit correspondences between known and unknown linguistic structure. NLP has experienced tremendous success over the past two decades, but our most successful systems are still limited to the domains and languages for which we have large amounts of hand-annotated data. Unfortunately, these domains and languages represent a tiny portion of the world's linguistic data. No matter the task, we inevitably encounter unknown features, such as words or clickthrough statistics, that we never observed when estimating our models. This talk is about linking these unknown features to known features through correspondence representations.

The first part describes a technique for learning lexical correspondences for domain adaptation of sentiment analysis systems, which predict the general attitude of an essay toward a particular topic. Here, words that are highly predictive in one domain may not appear at all in another. We show how to build a correspondence representation between words in different domains using projections to low-dimensional, real-valued spaces. Unknown words are projected onto this representation and related directly to known features via Euclidean distance (a sketch of this idea appears below). The correspondence representation allows us to train significantly more robust models in new domains, and we achieve a 40% relative reduction in error due to adaptation over a state-of-the-art system.

In the second part of the talk, I will describe a technique for learning a web search ranking function using queries from multiple languages. Many non-English queries pose difficult ranking problems for search engines because features that are reliable for English, such as clickthrough and static rank, change significantly across languages. For a significant number of these queries, however, we can use machine translation and an English ranker to find a better ranking. We show how to simultaneously learn a cross-lingual similarity function and a ranking function that depends on this similarity. Our method works by learning a ranking function on pairs of documents, one from each language (see the second sketch below). We show how to convert this ranking into a monolingual ranking, and we demonstrate, using publicly available Chinese and English query logs, that our bilingual ranking technique leads to significant improvements over a state-of-the-art monolingual model.
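To make the first idea concrete, here is a minimal sketch in Python. It assumes the correspondence representation comes from a truncated SVD of a word-by-pivot co-occurrence matrix, and that unknown words are matched to known ones by Euclidean distance in the projected space; the vocabulary, the pivot counts, and all function names are hypothetical illustrations, not the exact method presented in the talk.

import numpy as np

def build_correspondence(cooc, k=3):
    # cooc: (n_words, n_pivots) counts of how often each word co-occurs
    # with "pivot" features that appear in both domains. A truncated SVD
    # projects every word into a k-dimensional, real-valued space.
    u, s, _ = np.linalg.svd(cooc, full_matrices=False)
    return u[:, :k] * s[:k]            # one k-dim vector per word

def nearest_known(vecs, vocab, known_words, unknown_word):
    # Relate an unknown word to the closest known word by Euclidean
    # distance in the correspondence space.
    q = vecs[vocab[unknown_word]]
    return min(known_words, key=lambda w: np.linalg.norm(vecs[vocab[w]] - q))

# Toy data: these counts are random, so the nearest neighbor here is
# arbitrary; with real counts, words that behave alike across domains
# land close together.
vocab = {"reliable": 0, "sturdy": 1, "boring": 2, "predictable": 3}
rng = np.random.default_rng(0)
cooc = rng.poisson(2.0, size=(len(vocab), 20)).astype(float)
vecs = build_correspondence(cooc, k=3)
print(nearest_known(vecs, vocab, ["reliable", "sturdy", "boring"], "predictable"))

With real co-occurrence statistics, a target-domain-only word such as "predictable" would land near source-domain words like "reliable" that behave similarly across the shared pivot features, so a sentiment model trained on the source domain can reuse its weights for the unknown word.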
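The second part can likewise be illustrated with a small sketch. The version below jointly learns a bilinear cross-lingual similarity and a pairwise logistic (RankNet-style) ranking function over (Chinese, English) document pairs, then converts the pair scores into a monolingual ranking by taking, for each Chinese document, its best score over the available English counterparts. The feature dimensions, the data, the loss, and the max-over-pairs conversion are all illustrative assumptions rather than the exact formulation from the talk.

import numpy as np

rng = np.random.default_rng(1)
d = 8                                    # features per document

# Hypothetical data for one query: Chinese documents with relevance
# labels, and English documents obtained via machine translation.
zh = rng.normal(size=(5, d))
en = rng.normal(size=(4, d))
rel = np.array([3, 2, 2, 1, 0])          # relevance of each Chinese doc

w = np.zeros(2 * d + 1)                  # weights on [zh feats, en feats, sim]
S = 0.1 * rng.normal(size=(d, d))        # bilinear cross-lingual similarity
                                         # (small random init so the similarity
                                         # feature is not stuck at zero)

def pair_score(x_zh, x_en):
    sim = x_zh @ S @ x_en                # learned similarity feature
    return w @ np.concatenate([x_zh, x_en, [sim]])

# Pairwise training: prefer (zh_i, en_k) over (zh_j, en_k) whenever
# doc i is more relevant than doc j, under a logistic ranking loss.
lr = 0.05
for _ in range(200):
    for i in range(len(zh)):
        for j in range(len(zh)):
            if rel[i] <= rel[j]:
                continue
            for k in range(len(en)):
                margin = pair_score(zh[i], en[k]) - pair_score(zh[j], en[k])
                g = -1.0 / (1.0 + np.exp(np.clip(margin, -30.0, 30.0)))
                fi = np.concatenate([zh[i], en[k], [zh[i] @ S @ en[k]]])
                fj = np.concatenate([zh[j], en[k], [zh[j] @ S @ en[k]]])
                w -= lr * g * (fi - fj)  # ranking-weight update
                S -= lr * g * w[-1] * np.outer(zh[i] - zh[j], en[k])

def mono_score(x_zh):
    # Convert the bilingual pair ranking into a monolingual score:
    # the best pair score over the available English documents.
    return max(pair_score(x_zh, x_en) for x_en in en)

order = sorted(range(len(zh)), key=lambda i: -mono_score(zh[i]))
print("ranked Chinese docs:", order)     # trained to recover the rel order

Learning S jointly with w, in the spirit of the abstract's "simultaneously learn" formulation, tunes the cross-lingual similarity for the ranking objective instead of fixing it in advance.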