Searching OCR'ed text: An LDA based approach
E. Hassan, V. Garg, S.K.M. Haque, , M. Gopal
Pages: 1210 - 1214
Indexing and retrieval performance over a digitized document collection depends significantly on the accuracy of the available Optical Character Recognition (OCR). The paper presents a novel document indexing framework that accounts for document digitization errors during indexing to improve overall retrieval accuracy. The proposed framework is based on topic modeling using Latent Dirichlet Allocation (LDA). The OCR's confidence in correctly recognizing a symbol is propagated into the topic learning process so that the semantic grouping of word examples carefully distinguishes between commonly confused words. We present a novel application of Lucene with topic modeling for document indexing. The experimental evaluation of the proposed framework is presented on a collection of documents in Devanagari script. © 2011 IEEE.
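The abstract does not say how OCR confidence enters topic learning. One possible reading, shown here only as a minimal sketch and not as the authors' method, is to let each recognized word contribute its confidence rather than a full count of 1 to the term statistics a topic model is trained on. The function name `weighted_term_counts` and the sample tokens and confidence values are hypothetical.

```python
from collections import defaultdict


def weighted_term_counts(tokens):
    """Sum OCR confidences per word instead of counting each occurrence as 1.

    `tokens` is an iterable of (word, confidence) pairs, with confidence
    assumed to lie in [0, 1]. This weighting scheme is an illustrative
    assumption, not the scheme described in the paper.
    """
    counts = defaultdict(float)
    for word, confidence in tokens:
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range for {word!r}: {confidence}")
        counts[word] += confidence
    return dict(counts)


# Hypothetical OCR output for one document: (recognized word, confidence).
ocr_tokens = [
    ("भारत", 1.0),
    ("भारत", 0.75),
    ("भरत", 0.25),  # low confidence: possibly a misrecognition of "भारत"
    ("नगर", 0.5),
]

counts = weighted_term_counts(ocr_tokens)
for word, weight in counts.items():
    print(word, weight)
```

Under this assumption, a doubtful reading such as the low-confidence token above carries less weight than the confidently recognized words, so it would influence the learned topics less than a plain occurrence count would.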
About the journal
Journal: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR