OCR for bilingual documents using language modeling
A. Ray, S. Rajeswar,
Published in IEEE Computer Society
2015
Volume: 2015-November
   
Pages: 1256 - 1260
Abstract
Script-based features are highly discriminative for text segmentation and recognition, so they are widely used in Optical Character Recognition (OCR) problems. However, the use of script-dependent features prevents such architectures from being adapted directly to another script. Script-independent systems solve this problem to a certain extent for monolingual documents, but the problem is aggravated for multilingual documents, because it is very difficult for a single classifier to learn many scripts. Generally, a script identification module identifies text segments, and the script-dependent classifier is selected accordingly. This paper presents a unified framework of a language model and multiple preprocessing hypotheses for word recognition from bilingual document images. Prior to text recognition, preprocessing steps such as binarization and segmentation are required for ease of recognition, but these steps introduce a large combinatorial error that propagates to the final recognition accuracy. In this paper we use multiple preprocessing routines as alternate hypotheses and use a language model to verify each alternative and choose the best recognized sequence. We test this architecture for word recognition on Kannada-English and Telugu-English bilingual documents and achieve better recognition rates than single methods using the same classifier. © 2015 IEEE.
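The selection step described in the abstract (several preprocessing routines produce alternative recognized sequences, and a language model picks the most plausible one) can be illustrated with a minimal sketch. This is not the authors' implementation: the character-bigram model with add-one smoothing, the function names `train_bigram_lm`, `log_prob`, and `pick_best`, the preprocessing labels, and the toy word list are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram_lm(words):
    """Count character bigrams over a word list, with ^ and $ as boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, len(vocab)


def log_prob(word, lm):
    """Length-normalised, add-one-smoothed log-score of a word under the bigram model."""
    bigrams, contexts, vocab_size = lm
    chars = ["^"] + list(word) + ["$"]
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        # Add-one smoothing keeps unseen bigrams from producing log(0).
        total += math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
    return total / (len(chars) - 1)


def pick_best(hypotheses, lm):
    """Return the (preprocessing_label, recognized_text) pair the model scores highest."""
    return max(hypotheses, key=lambda h: log_prob(h[1], lm))


# Hypothetical example: one word recognized after three different
# preprocessing routines (labels and outputs are invented).
lm = train_bigram_lm(["recognition", "recognise", "record", "region"])
hypotheses = [
    ("routine_a", "rec0gnit1on"),
    ("routine_b", "recognition"),
    ("routine_c", "reccgnltion"),
]
best = pick_best(hypotheses, lm)
print(best)
```

Length normalisation is used here so that hypotheses of different lengths, which differing segmentations can produce, are compared on a per-character basis. A real system would train the model on a large corpus for each language and would likely combine the language-model score with the classifier's own confidence.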
About the journal
Journal: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Publisher: IEEE Computer Society
ISSN: 1520-5363