On Evaluating Language Technologies

The Text REtrieval Conference (TREC) is an ongoing series of evaluation workshops that has standardized and validated the use of test collections as a research tool for improving document retrieval. A retrieval test collection is a carefully calibrated abstraction of a retrieval task that Sparck Jones called the "core competency" of search: a task that is necessary, but not sufficient, for user retrieval tasks. The abstraction facilitates research by controlling for some sources of variability, which increases the power of experiments that compare system effectiveness while reducing their cost. We have extended the test collection paradigm to other language technologies, including question answering, summarization, and textual entailment, by defining abstracted evaluation tasks for them. This talk will provide a brief history of the NIST evaluations for these tasks, examining both the validity of an evaluation (Are the conclusions of the evaluation true?) and the utility of an evaluation (Are the conclusions helpful?).