Enhancing Predictability of Handwritten Document Content using HTR and Word Substitution
Varshini Prakash1, Keshav Moorthy2, Jasmin T Jose3
1Varshini Prakash*, Computer Science, Vellore University of Technology, Vellore, India.
2Keshav Moorthy, Computer Science, Vellore University of Technology, Vellore, India.
3Jasmin T Jose, Computer Science, Vellore University of Technology, Vellore, India.
Manuscript received on May 04, 2020. | Revised Manuscript received on May 12, 2020. | Manuscript published on May 15, 2020. | PP: 15-18 | Volume-6, Issue-7, May 2020. | Retrieval Number: G1240056720/2020©BEIESP | DOI: 10.35940/ijisme.G1240.056720
© The Authors. Published By: Blue Eyes Intelligence Engineering and Sciences Publication (BEIESP). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Abstract: The accuracy of Handwritten Text Recognition (HTR) degrades sharply when documents are damaged by smudges, blemishes, and blurs, which makes recognition of such documents a challenging task. We therefore propose a system to identify handwritten textual content in documents on which state-of-the-art Optical Character Recognition (OCR) performs with low accuracy. The system introduces word substitution, which uses character and distance analysis against a word corpus to perform spell checking and word completion. This improved our prediction results, especially in cases where the OCR is prone to predicting false positives, predominantly on smudged areas. Blur detection is also applied to every word before segmentation, and words detected as blurred are substituted with suitable words rather than being passed on as false positive results. This methodology is more convenient and reliable, since even state-of-the-art HTR technologies do not exceed 71% accuracy. The accuracy of the predicted text is measured using a text similarity metric, the Fuzzy Token Set Ratio (FTSR).
Keywords: Damaged documents, Fuzzy Token Set Ratio, Handwritten Documents, Spell Check, Word Replacement.
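As an illustration of the two ideas named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes that "distance analysis" can be approximated by Levenshtein edit distance and that FTSR behaves like the token set ratio popularised by fuzzy string matching libraries, here approximated with the standard library's `difflib.SequenceMatcher`. The function names, the `max_distance` threshold, and the sample corpus are all hypothetical.

```python
from difflib import SequenceMatcher


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def substitute(word: str, corpus: list[str], max_distance: int = 2) -> str:
    """Return the closest corpus word within max_distance (an assumed
    threshold); otherwise leave the recognised word unchanged."""
    if word in corpus:
        return word
    best = min(corpus, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_distance else word


def _ratio(a: str, b: str) -> int:
    """Similarity of two strings as an integer percentage."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_ratio(s1: str, s2: str) -> int:
    """Approximate token set ratio: compare the shared tokens against each
    string's shared-plus-unique tokens and keep the best score."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(t1 & t2))
    combined1 = (common + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (common + " " + " ".join(sorted(t2 - t1))).strip()
    return max(_ratio(common, combined1),
               _ratio(common, combined2),
               _ratio(combined1, combined2))


# Hypothetical OCR output with two misrecognised words.
corpus = ["the", "damaged", "document"]
recognised = "the damagd documnt"
corrected = " ".join(substitute(w, corpus) for w in recognised.split())
score = token_set_ratio(corrected, "the damaged document")
```

In this sketch, a misrecognised word is replaced only when a corpus word lies within the chosen edit distance, so words with no close match are left unchanged rather than forced onto an unrelated substitute. Because the metric compares sets of tokens, it is insensitive to word order and repeated words.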