OCR-VQA: Visual question answering by reading text in images
A. Mishra, S. Shekhar, A.K. Singh, A. Chakraborty
Published by IEEE Computer Society
Pages: 947–952
The problem of answering questions about an image is popularly known as visual question answering (VQA). It is a well-established problem in computer vision. However, current VQA methods do not utilize the text often present in images. This 'text in images' provides additional useful cues and facilitates better understanding of the visual content. In this paper, we introduce a novel task of visual question answering by reading text in images, i.e., by optical character recognition (OCR). We refer to this problem as OCR-VQA. To facilitate a systematic study of this new problem, we introduce a large-scale dataset, OCR-VQA-200K. This dataset comprises 207,572 images of book covers and contains more than 1 million question-answer pairs about these images. We judiciously combine well-established techniques from the OCR and VQA domains to present a novel baseline for OCR-VQA-200K. The experimental results and rigorous analysis demonstrate various challenges present in this dataset, leaving ample scope for future research. We are optimistic that this new task, along with the compiled dataset, will open up many exciting research avenues for both the document image analysis and VQA communities. © 2019 IEEE.
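The abstract describes combining OCR with question answering: text is first recognized from the image, then used to answer questions about it. A minimal, hypothetical sketch of such an OCR-then-answer pipeline is below. It is not the authors' baseline: the OCR stage is simulated as a dictionary of already-recognized cover fields (the field names `title` and `author` and the function `answer_question` are assumptions for illustration), and the answering stage is simple keyword matching rather than a learned model.

```python
# Hypothetical sketch of an OCR + QA pipeline on book covers.
# A real system would run an OCR engine on the cover image; here the
# OCR output is simulated as a dict of recognized text fields.

def answer_question(ocr_fields: dict, question: str) -> str:
    """Answer simple template questions from OCR'd book-cover fields.

    `ocr_fields` maps assumed field names (e.g. 'title', 'author')
    to text strings recognized from the cover image.
    """
    q = question.lower()
    if "author" in q or "wrote" in q:
        return ocr_fields.get("author", "unknown")
    if "title" in q:
        return ocr_fields.get("title", "unknown")
    return "unknown"

# Usage: fields an OCR stage might extract from one cover image.
fields = {"title": "Digital Image Processing", "author": "R. C. Gonzalez"}
print(answer_question(fields, "Who is the author of this book?"))
```

In the dataset's setting, the question types (author, title, year, genre) are templated over book-cover metadata, which is what makes even a rule-based pairing of OCR output with question keywords a plausible starting point before stronger learned baselines.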
About the journal
Journal: Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
Publisher: IEEE Computer Society