Unsupervised Transcription of Historical Documents

Published July 27, 2016, 2:21
Printing-press era documents are difficult for OCR systems to transcribe because these documents are extremely noisy. However, the noise originates from processes that are causally understood. For example, thickened glyphs are caused by over-inking, and vertical offset is caused by slop in a mechanical baseline. We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our approach gives state-of-the-art results on two datasets of historical document images.
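As a rough illustration of the idea of a generative rendering model, the toy sketch below treats each glyph image as a noisy rendering of a template, with a vertical offset and an inking level as latent variables, and transcribes by picking the most likely combination. Everything in it is an assumption made for illustration: the 5x3 templates, the three inking levels, the pixel probabilities, and the function names are hypothetical. It also omits the model of the document's text and uses fixed glyph shapes, whereas the system described above learns font structure without supervision.

```python
import math
import random

# Hypothetical 5x3 glyph templates (1 = ink); not taken from the described system.
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "T": ["111", "010", "010", "010", "010"],
}
GLYPH_H, GLYPH_W, BOX_H = 5, 3, 7   # the box is taller so glyphs can shift vertically
INKINGS = (0.1, 0.5, 0.9)            # assumed discrete inking levels


def ink_probs(char, offset, inking):
    """Per-pixel probability of ink for one glyph at a given offset and inking level."""
    ink = [[False] * GLYPH_W for _ in range(BOX_H)]
    for r, row in enumerate(TEMPLATES[char]):
        for c, bit in enumerate(row):
            ink[r + offset][c] = bit == "1"
    probs = []
    for r in range(BOX_H):
        row = []
        for c in range(GLYPH_W):
            if ink[r][c]:
                row.append(0.98)
                continue
            # Over-inking: blank pixels next to ink darken with the inking level.
            near = any(
                0 <= rr < BOX_H and 0 <= cc < GLYPH_W and ink[rr][cc]
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            )
            row.append(inking if near else 0.02)
        probs.append(row)
    return probs


def render(char, offset, inking, rng):
    """Sample a binary glyph image from the generative model."""
    return [[1 if rng.random() < p else 0 for p in row]
            for row in ink_probs(char, offset, inking)]


def log_likelihood(image, char, offset, inking):
    """Log-probability of an observed image under one latent configuration."""
    total = 0.0
    for img_row, p_row in zip(image, ink_probs(char, offset, inking)):
        for pixel, p in zip(img_row, p_row):
            total += math.log(p if pixel else 1.0 - p)
    return total


def decode(image):
    """Return the (char, offset, inking) triple that best explains the image."""
    candidates = [
        (ch, off, level)
        for ch in TEMPLATES
        for off in range(BOX_H - GLYPH_H + 1)
        for level in INKINGS
    ]
    return max(candidates, key=lambda h: log_likelihood(image, *h))


if __name__ == "__main__":
    noisy = render("L", 1, 0.9, random.Random(0))
    print(decode(noisy))
```

Because the offset and inking level are inferred together with the character, a thickened or shifted glyph can still be matched to the right template instead of being mistaken for a different character.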