TK376 : Segmentation of Complex Persian Documents into Text, Graphics and Table Blocks
Thesis > Central Library of Shahrood University > Electrical Engineering > MSc > 2014
Authors:
Mostafa Golzadeh Hamzkanloo [Author], Hossein Khosravi [Supervisor]
Abstract: OCR systems play a major role in realizing e-government and in reducing the volume of paper and digital archives. These systems have three main parts: preprocessing, text recognition and post-processing. An error in the preprocessing stage is naturally irreversible. For example, if the skew angle of the document is identified incorrectly, text lines will be crooked and the recognition process will fail. One of the important parts of preprocessing is layout analysis, which identifies which parts of a document are text, which are tables and which are images. Any error in this step introduces further errors into the OCR process. In this thesis we introduce an algorithm for the analysis of multi-column Persian documents. Three approaches are common in this field. The bottom-up approach starts from pixels and merges and grows them into larger areas. The top-down approach, such as the XY cut method, first divides the image into several sections and then breaks each region down into smaller regions. A combination of the two is known as a hybrid approach. We use a hybrid approach that is mostly based on the bottom-up approach. It uses adaptive thresholding, component labeling, morphological operations and the Hough transform in a heuristic algorithm, and introduces specific rules for combining small areas without merging dissimilar areas, in order to decompose the document into text, table and image parts. The proposed method was tested on several multi-column documents with artistic or graphical backgrounds, and it outperformed leading OCR software such as OmniPage and FineReader. Numerically, the algorithm correctly identifies 72 percent of Persian text, 75 percent of figures and 92 percent of tables, and 88 percent of the Persian documents are segmented almost correctly.
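The bottom-up idea described in the abstract (label connected components, then merge nearby ones into larger blocks) can be illustrated with a minimal sketch. This is not the thesis's algorithm: the abstract does not state its merging rules, so the single distance rule and the `max_gap` parameter below are assumptions made for illustration, and adaptive thresholding, morphological operations and the Hough transform are left out. The input is assumed to be an already binarized page given as a list of rows, with 1 for ink and 0 for background.

```python
from collections import deque


def label_components(img):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink regions."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] == 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def close_enough(a, b, max_gap):
    """True if boxes a and b are at most max_gap background pixels apart on both axes."""
    v_gap = max(a[0], b[0]) - min(a[2], b[2]) - 1  # negative when they overlap vertically
    h_gap = max(a[1], b[1]) - min(a[3], b[3]) - 1  # negative when they overlap horizontally
    return v_gap <= max_gap and h_gap <= max_gap


def merge_boxes(boxes, max_gap=1):
    """Repeatedly merge any two boxes that are close enough (hypothetical rule)."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if close_enough(boxes[i], boxes[j], max_gap):
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


# Toy page: two small blobs one column apart, and a third blob far away.
page = [
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
]
components = label_components(page)
blocks = merge_boxes(components, max_gap=1)
```

On the toy page, the two neighbouring blobs end up in one block while the distant blob stays separate. A real system would also need rules that compare component size or type, so that dissimilar areas such as text and graphics are not merged, which is the part the thesis addresses with its own rules.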
Keywords:
#Document layout analysis #Page segmentation #bounding box #connected components
Keeping place: Central Library of Shahrood University