Layout Analysis and Detection of Text Zones in Historical Handwritten Documents

Kaddas Panagiotis
Ανάλυση Δομής και Εντοπισμός Περιοχών Κειμένου σε Ιστορικά Χειρόγραφα
Historical documents are an important source of information, not only for a thorough knowledge of history but also for an understanding of cultural heritage itself. Technological evolution creates the need to adapt research on such documents to a new environment. New techniques are therefore used to facilitate the optical processing required for accessing, recognizing and digitizing their content. Many collections consist of historical handwritten documents. Analyzing these collections is more difficult than analyzing machine-printed documents because of their complexity and low quality. This thesis aims to develop a new method that focuses on the layout analysis of historical handwritten documents and the detection of the text areas they contain. The results of the proposed method are important because they serve as input to an optical character recognition system. To extract the layout of historical handwritten document images, a combined technique is used that takes into account both the separator lines and the image’s background. The results of this technique lead to the creation of a grid, which corresponds to the different zones of the image and approximates the geometric layout of the document. The technique applies to single-page or double-page handwritten documents, unlike most layout analysis techniques, which process single-page documents. Based on the document’s grid, text regions are detected by extracting the text components, with no need to classify them into letter, word or number classes. Thus, page segmentation techniques are not mandatory. Text regions are formed by connecting these components using distance criteria. The method is evaluated with an existing technique in which the resulting regions are matched against manually created ground-truth regions. The image corpus consists of 600 historical handwritten document images.
In addition, a similar state-of-the-art technique is evaluated on the same corpus, and comparative results are reported. Finally, the proposed method is combined with a page skew correction technique. The experimental results are encouraging and confirm the method’s efficiency across a large variety of historical handwritten documents.
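The step of connecting text components by distance criteria can be illustrated with a minimal sketch. It is not the thesis's implementation: the bounding-box representation, the function names `box_gap` and `group_components`, and the thresholds `max_dx` and `max_dy` are all assumptions made for illustration. Components whose boxes lie within the given horizontal and vertical gaps are merged into one region with a simple union-find.

```python
from typing import Dict, List, Tuple

# Hypothetical representation of a text component: (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


def box_gap(a: Box, b: Box) -> Tuple[int, int]:
    """Horizontal and vertical gaps between two boxes (0 when they overlap on that axis)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return dx, dy


def group_components(boxes: List[Box], max_dx: int, max_dy: int) -> List[Box]:
    """Merge components that are close to each other into region bounding boxes.

    max_dx and max_dy are assumed distance thresholds, not values from the thesis.
    """
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the set representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of components that satisfies both distance criteria.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            dx, dy = box_gap(boxes[i], boxes[j])
            if dx <= max_dx and dy <= max_dy:
                parent[find(i)] = find(j)

    # The bounding box of each linked group becomes one text region.
    regions: Dict[int, Box] = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        root = find(i)
        if root in regions:
            rx0, ry0, rx1, ry1 = regions[root]
            regions[root] = (min(rx0, x0), min(ry0, y0), max(rx1, x1), max(ry1, y1))
        else:
            regions[root] = (x0, y0, x1, y1)
    return sorted(regions.values())
```

In this sketch the pairwise comparison is quadratic in the number of components, which is acceptable for illustration; a real system would likely restrict comparisons to components within the same grid zone.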
Technology - Computer science
document layout analysis, page segmentation, separator lines, whitespace analysis, text regions
