Layout Analysis and Detection of Text Zones in Historical Handwritten Documents

Postgraduate Thesis uoadl:1324833

Unit:
Track / specialization: Signal and Information Processing and Learning (ΕΜΠ)
Informatics
Deposit date:
2016-11-21
Year:
2016
Author:
Kaddas Panagiotis
Supervisors info:
Theodoridis Sergios, Professor, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens (NKUA)
Gatos Vasileios, Researcher A, Institute of Informatics and Telecommunications, NCSR Demokritos
Original Title:
Ανάλυση Δομής και Εντοπισμός Περιοχών Κειμένου σε Ιστορικά Χειρόγραφα
Languages:
Greek
Translated title:
Layout Analysis and Detection of Text Zones in Historical Handwritten Documents
Summary:
Historical documents are an important source of information, both for a thorough knowledge of history and for an understanding of cultural heritage itself. Technological progress creates the need to adapt research on such documents to a new environment. New techniques are therefore used to facilitate the optical processing needed to access, recognize and digitize their content. Many collections consist of historical handwritten documents. Analyzing these collections is harder than analyzing machine-printed documents because of their complexity and low quality. This thesis aims to develop a new method for the layout analysis of historical handwritten documents and the detection of the text areas they contain. The results of the proposed method matter because they serve as input to an optical character recognition system. To extract the layout of historical handwritten document images, a combined technique is used that takes into account both separator lines and the image's background. The results of this technique lead to a grid that corresponds to the different zones of the image and approximates the geometric layout of the document. The technique applies to single-page or double-page handwritten documents, unlike most layout analysis techniques, which process only single-page documents. Based on the document's grid, text regions are detected by extracting the text components, with no need to classify them into letter, word or number classes. Thus, page segmentation techniques are not mandatory. Text regions are formed by connecting these components using distance criteria. The method is evaluated with an existing technique in which the resulting regions are matched against manually created ground-truth regions. The image corpus consists of 600 historical handwritten document images.
A similar state-of-the-art technique is also evaluated on the same corpus, and comparative results are reported. Finally, the proposed method is combined with a page skew correction technique. Experimental results are encouraging and confirm the method's effectiveness over a large variety of historical handwritten documents.
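As an illustration of forming text regions by connecting components under a distance criterion, the following is a minimal Python sketch. It is not the method of the thesis: the bounding-box representation, the gap measure, the `max_gap` threshold and the function names `box_gap` and `group_components` are all assumptions made for this example.

```python
from itertools import combinations


def box_gap(a, b):
    """Gap between two boxes given as (x0, y0, x1, y1).

    Assumed distance criterion: the larger of the horizontal and
    vertical gaps, or 0 if the boxes touch or overlap.
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def group_components(boxes, max_gap):
    """Merge component boxes whose gap is at most max_gap into regions.

    Uses union-find so that chains of nearby components end up in the
    same region. Returns the sorted bounding boxes of the regions.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of components that satisfies the distance criterion.
    for i, j in combinations(range(len(boxes)), 2):
        if box_gap(boxes[i], boxes[j]) <= max_gap:
            parent[find(i)] = find(j)

    # Accumulate one enclosing bounding box per group.
    regions = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        root = find(i)
        if root in regions:
            r = regions[root]
            regions[root] = (min(r[0], x0), min(r[1], y0),
                             max(r[2], x1), max(r[3], y1))
        else:
            regions[root] = (x0, y0, x1, y1)
    return sorted(regions.values())


if __name__ == "__main__":
    # Hypothetical component boxes: two close together, one far away.
    components = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
    print(group_components(components, max_gap=5))
```

The pairwise comparison is quadratic in the number of components, which is acceptable for a sketch; a spatial index would be the usual choice for full page images.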
Main subject category:
Technology - Computer science
Keywords:
document layout analysis, page segmentation, separator lines, whitespace analysis, text regions
Index:
Yes
Number of index pages:
2
Contains images:
Yes
Number of references:
32
Number of pages:
103
File:
File access is restricted only to the intranet of UoA.

KaddasPanagiotis-M1324.pdf
7 MB