Automatic character extraction from handwritten scanned documents to build large scale database

Munir U., Ozturk M.

2019 Scientific Meeting on Electrical-Electronics and Biomedical Engineering and Computer Science, EBBT 2019, İstanbul, Türkiye, 24-26 April 2019


Text extraction is an important phase in document recognition systems. To differentiate text from non-text objects, all possible text regions in the document must be detected. This article proposes an efficient algorithm that detects all handwritten characters on a page written specifically for preparing a large-scale dataset of handwritten alphabet characters, to be used in the training phase of machine learning algorithms. The text line extraction algorithm uses a series of steps to obtain the text region. A sequence of histogram projection and recovery is then applied to obtain the line-segmented region of the text. Text line positions are detected using a horizontal histogram projection. A vertical histogram projection is then applied to each text line to find the positions of the characters in that line. In post-processing, noise, which consists mostly of small black spots, is removed using a moving median filter. Histogram projections are then applied once more to detect all characters after noise removal. The character detection rate is 98.9% on documents prepared for dataset creation and 98% on normal handwritten documents. The text line detection rate on 100 images from the IAM database is 99.47%.
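The projection-based segmentation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes an already binarized page (1 = ink, 0 = background), a 3x3 median window, and a minimum ink count of 1 per row or column, none of which are stated in the abstract. The function names `runs`, `median_filter3`, and `segment_characters` are hypothetical. For brevity, the sketch denoises first and then runs a single projection pass, which corresponds to the final detection pass after noise removal.

```python
import numpy as np

def runs(profile, min_count=1):
    """Return (start, end) pairs, end exclusive, where profile >= min_count."""
    mask = profile >= min_count
    # Pad with False so every run has both a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[0::2].tolist(), edges[1::2].tolist()))

def median_filter3(binary):
    """3x3 moving median on a 0/1 image (assumed window size, edge-padded)."""
    padded = np.pad(binary, 1, mode="edge")
    h, w = binary.shape
    # Stack the nine shifted views so the median is taken per pixel.
    windows = np.stack([padded[r:r + h, c:c + w]
                        for r in range(3) for c in range(3)])
    return np.median(windows, axis=0).astype(binary.dtype)

def segment_characters(binary):
    """Return (top, bottom, left, right) boxes, one per detected character."""
    clean = median_filter3(binary)  # remove small isolated spots
    boxes = []
    # Horizontal projection: ink count per row locates the text lines.
    for top, bottom in runs(clean.sum(axis=1)):
        line = clean[top:bottom]
        # Vertical projection within the line locates the characters.
        for left, right in runs(line.sum(axis=0)):
            boxes.append((top, bottom, left, right))
    return boxes

# Synthetic page: two 4x4 "characters" on one line plus one noise pixel.
page = np.zeros((12, 20), dtype=np.uint8)
page[2:6, 2:6] = 1
page[2:6, 9:13] = 1
page[9, 16] = 1
print(segment_characters(page))
```

On real scans, touching or slanted characters would defeat a plain vertical projection, so a threshold above 1 or a minimum gap width would likely be needed; those choices are outside what the abstract specifies.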