Automatic character extraction from handwritten scanned documents to build large scale database

Munir U., Ozturk M.

2019 Scientific Meeting on Electrical-Electronics and Biomedical Engineering and Computer Science, EBBT 2019, İstanbul, Turkey, 24 - 26 April 2019

  • Publication Type: Conference Paper / Full Text
  • Volume:
  • Doi Number: 10.1109/ebbt.2019.8741984
  • City: İstanbul
  • Country: Turkey
  • Karadeniz Technical University Affiliated: Yes


Text extraction is an important phase in document recognition systems. To differentiate text from non-text objects, it is necessary to detect all possible text regions in the document. In this article, an efficient algorithm is proposed to detect all handwritten characters on a page written specifically for preparing a large-scale dataset of handwritten letters, to be used in the training phase of machine learning algorithms. The text line extraction algorithm uses a series of steps to obtain the text region. Next, a sequence of histogram projection and recovery steps is proposed to obtain the line-segmented region of the text. Text line positions are detected using a horizontal histogram projection. A vertical histogram projection is then applied to each individual text line to find the positions of the letters in that line. In post-processing, noise, which consists mostly of small black spots, is removed using a moving median filter. Histogram projections are then applied once more to detect all letters after noise removal. The letter detection rate is 98.9% on documents prepared specifically for dataset creation and 98% on ordinary handwritten documents. The text line detection rate on 100 images from the IAM database is 99.47%.
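The projection-based pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the page is already binarized into a list of rows with 1 for ink and 0 for background, and the function names (projection_runs, median_filter, segment_characters), the zero threshold, and the 3 x 3 window size are all hypothetical choices made for illustration. The "recovery" steps mentioned in the abstract are not specified there and are omitted.

```python
from statistics import median


def projection_runs(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, where profile > threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value > threshold and start is None:
            start = i                      # a run of ink begins
        elif value <= threshold and start is not None:
            runs.append((start, i))        # the run ended just before i
            start = None
    if start is not None:                  # run reaches the end of the profile
        runs.append((start, len(profile)))
    return runs


def median_filter(image, size=3):
    """Moving median over a size x size window (assumed window size).

    Pixels outside the image are treated as background (0).
    """
    height, width = len(image), len(image[0])
    r = size // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[yy][xx] if 0 <= yy < height and 0 <= xx < width else 0
                for yy in range(y - r, y + r + 1)
                for xx in range(x - r, x + r + 1)
            ]
            out[y][x] = int(median(window))
    return out


def segment_characters(image):
    """Return (top, bottom, left, right) boxes, bottom and right exclusive."""
    boxes = []
    # Horizontal projection: ink count per row locates the text lines.
    row_profile = [sum(row) for row in image]
    for top, bottom in projection_runs(row_profile):
        line = image[top:bottom]
        # Vertical projection within the line: ink count per column.
        col_profile = [sum(col) for col in zip(*line)]
        for left, right in projection_runs(col_profile):
            boxes.append((top, bottom, left, right))
    return boxes
```

Under these assumptions, the order given in the abstract corresponds to calling segment_characters(image) once, cleaning the page with median_filter(image), and calling segment_characters on the cleaned page again. An isolated one-pixel spot produces a spurious box in the first pass and disappears in the second, because the median of its window is background. The same filter also erodes or bridges strokes that are only one or two pixels wide, so the window size would need tuning against real scans.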