Model-guided segmentation and layout labelling of document images using a hierarchical conditional random field

Santanu Chaudhury; M. Jindal; S. Dutta Roy

doi:10.1007/978-3-642-11164-8_61




Published in: 2009


Volume: 5909 LNCS

Pages: 375 - 380

Abstract

We present a model-guided segmentation and document layout extraction scheme based on hierarchical Conditional Random Fields (CRFs). Common methods for classifying each pixel of a document image as text, background or image are noisy and error-prone, and often require heuristic post-processing. The input to the system is a pixel-wise classification produced by a Fisher classifier applied to the outputs of a set of Globally Matched Wavelet (GMW) filters. The system extracts features that encode the contextual information and spatial configurations of a given document image, and learns relations between these layout entities using hierarchical CRFs. The hierarchical CRF enables learning at three levels: (1) local features for text, background and image areas; (2) contextual features for further classifying region blocks as title, author block, heading, paragraph, etc.; and (3) a probabilistic layout model that encodes global relations between these blocks for a particular class of documents. Although the work was motivated by an automated layout analyser and machine translator for technical papers, it can also be used for other applications such as search, indexing and information retrieval. © 2009 Springer-Verlag Berlin Heidelberg.
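The abstract's point that per-pixel labels are noisy and benefit from contextual modelling can be illustrated with a much simpler stand-in than the paper's hierarchical CRF. The sketch below is not the authors' method: it assumes a flat 4-connected grid with a Potts-style pairwise penalty and uses iterated conditional modes (ICM) for inference. The function name `icm_smooth`, the parameter `beta`, the label order and the toy costs are all hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: ICM on a flat Potts-style grid model.
# This is NOT the hierarchical CRF described in the abstract.
LABELS = ("text", "background", "image")  # label order is an assumption


def icm_smooth(unary, beta=1.0, max_sweeps=10):
    """Smooth a noisy pixel labelling.

    unary[r][c][k] is the cost of giving label k to pixel (r, c),
    e.g. a negative log-probability from a per-pixel classifier.
    beta is the penalty for each 4-neighbour carrying a different label.
    """
    rows, cols = len(unary), len(unary[0])
    n_labels = len(unary[0][0])

    # Start from the per-pixel minimum-cost label (the noisy classification).
    labels = [[min(range(n_labels), key=lambda k: unary[r][c][k])
               for c in range(cols)] for r in range(rows)]

    for _ in range(max_sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                # Current labels of the 4-connected neighbours inside the grid.
                neighbours = [labels[rr][cc]
                              for rr, cc in ((r - 1, c), (r + 1, c),
                                             (r, c - 1), (r, c + 1))
                              if 0 <= rr < rows and 0 <= cc < cols]

                def cost(k):
                    disagreements = sum(1 for n in neighbours if n != k)
                    return unary[r][c][k] + beta * disagreements

                best = min(range(n_labels), key=cost)
                if best != labels[r][c]:
                    labels[r][c] = best
                    changed = True
        if not changed:  # converged to a local minimum
            break
    return labels


if __name__ == "__main__":
    # Toy 3x3 patch: every pixel prefers label 0 except a weakly
    # mislabelled centre pixel that prefers label 2.
    unary = [[[0.0, 1.0, 1.0] for _ in range(3)] for _ in range(3)]
    unary[1][1] = [0.5, 1.0, 0.0]
    print(icm_smooth(unary, beta=1.0))
```

With `beta=0.0` the pairwise term vanishes and the isolated centre label is kept; a positive `beta` lets the surrounding context override weak local evidence. This corresponds only loosely to the first, local level of the hierarchy described above; the block-level and layout-level models are not represented here.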

Topics: Document layout analysis (66%), Conditional random field (56%), Search engine indexing (54%), CRFs (53%) and Classifier (UML) (51%)


About the journal

Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
ISSN: 0302-9743

Authors (1)

Santanu Chaudhury
- Department of Computer Science & Engineering
