OCR using a deep learning algorithm

DEVANSHI JOSHI
6 min read · Dec 27, 2020



What is OCR?

OCR stands for Optical Character Recognition. It is a field of research spanning Artificial Intelligence (AI), Computer Vision, and pattern recognition. A computer performing handwriting recognition can acquire and detect characters in paper documents, pictures, touch-screen devices, and other sources and convert them into machine-encoded form. OCR is a widespread technology for recognizing text inside images, such as scanned documents and photos, and can convert virtually any kind of image containing written text (typed, handwritten, or printed) into machine-readable text data. It is one of the earliest computer vision tasks to be addressed: OCR became popular in the early 1990s through attempts to digitize historic newspapers, and there were OCR implementations well before the deep learning boom of 2012, some dating back to 1914.

What is the need for OCR?

Everything we express, whether spoken or written, carries a huge amount of information. Humans can easily understand and analyze it, but for computers it is hard to extract the meaning. Consider a person with a large stack of documents who needs only a summary of the useful information: doing this manually takes a long time, but if we use OCR to extract the text from the images and then summarize it with Natural Language Processing, the task becomes much faster and requires far less human effort.

Working of the Handwriting Recognition algorithm

I have used the IAM Handwriting Database, which is a collection of handwritten passages by several writers. There are a total of 1,066 forms produced by approximately 400 different writers, and a total of 82,227 word instances out of a vocabulary of 10,841 words occur in the collection.

Dataset Link: http://www.fki.inf.unibe.ch/databases/iam-handwriting-database

I have done this project using a Convolutional Recurrent Neural Network (CRNN) to recognize handwritten text images. I used the CTC loss function for training, along with preprocessing techniques to enhance accuracy. I have used a pre-trained model, and the code is available here.

First, why use Deep Learning algorithms instead of Machine Learning algorithms?

Machine Learning requires explicit feature extraction first, followed by classification on those features. Deep Learning, in contrast, works as a “black box” that performs feature extraction and classification on its own.

https://i.pinimg.com/originals/19/41/9c/19419ca47404d8712f5ac4cd26b58c61.png

The above figure shows the task of classifying whether a given image is a dog or not. In the case of machine learning, features of the image such as edges, color, and shape must be extracted before classification, and classification is then performed on those features. In deep learning, the model extracts features and performs classification on its own. For example, when the image is given to a Convolutional Neural Network, each layer of the CNN learns features, and finally the Fully Connected layer performs the classification.

The main reason to choose Deep Learning is that it extracts features with deep neural networks and classifies by itself. Compared to traditional algorithms, its performance increases with the amount of data.

https://www.researchgate.net/profile/Alessio_Zappone/publication/330913061/figure/fig3/AS:723400796426241@1549483598453/Classical-and-Deep-learning-vs-training-set-size.png

Detailed architecture of the project:

https://raw.githubusercontent.com/sushant097/Handwritten-Line-Text-Recognition-using-Deep-Learning-with-Tensorflow/master/images/ArchitectureDetails.png

I have used a Convolutional Recurrent Neural Network (CRNN) to extract the important features from the handwritten text image. The CNN output, taken before the Fully Connected (FC) layer, is passed to a BLSTM, which handles sequence dependency and time-sequence operations. The output of the BLSTM is 100×80, which is nothing but 100 timesteps by 80 characters (including the CTC blank).
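The CNN-into-BLSTM pipeline can be sketched in Keras. The filter sizes and layer counts below are illustrative assumptions, not the repository's exact configuration; the only constraint kept is the shape of the output matrix, 100 timesteps × 80 characters:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_crnn(img_w=800, img_h=64, n_chars=80):
    """Minimal CRNN sketch: CNN feature extractor -> BLSTM -> per-timestep softmax."""
    inp = layers.Input(shape=(img_h, img_w, 1))
    x = inp
    # Three conv+pool stages (hypothetical sizes); each pooling halves height and width.
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=2)(x)
    # Collapse the height axis so the width axis becomes the time axis:
    # (h/8, w/8, 256) -> (w/8, h/8 * 256) = 100 timesteps of feature vectors.
    x = layers.Permute((2, 1, 3))(x)
    x = layers.Reshape((img_w // 8, (img_h // 8) * 256))(x)
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    # 80 output classes per timestep: the character set plus the CTC blank.
    out = layers.Dense(n_chars, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = build_crnn()
model.output_shape  # (None, 100, 80): 100 timesteps x 80 characters
```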

Then CTC loss is used to train the RNN; it removes the alignment problem in handwriting, since every writer's handwriting has a different alignment. We just give it the ground-truth text of the image and the BLSTM output, and it calculates the loss. The aim during training is to minimize the negative log-likelihood.

After that, CTC sums over all the possible alignment paths for the given labels. The CTC loss for an (X, Y) pair is:

https://raw.githubusercontent.com/sushant097/Handwritten-Line-Text-Recognition-using-Deep-Learning-with-Tensorflow/master/images/CtcLossFormula.png
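In the notation of Hannun's CTC article (reference [1]), the formula in the image above is the negative log of the total probability of all alignments that collapse to the label sequence:

$$\mathcal{L}_{\mathrm{CTC}}(X, Y) = -\log \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X)$$

where $\mathcal{A}_{X,Y}$ is the set of valid alignments (character sequences with blanks and repeats that collapse to $Y$), $T$ is the number of timesteps, and $p_t(a_t \mid X)$ is the network's probability for character $a_t$ at timestep $t$.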

And at the end, CTC decoding is used to decode the output during prediction.

Accuracy improvement

There are several ways to improve the accuracy of the model. I have applied some preprocessing and post-processing techniques that help improve the model output.

1. Image thresholding:

I have used Otsu's thresholding to convert the RGB images into black-and-white images. This algorithm returns a single intensity threshold that separates pixels into two classes: foreground and background. It gives much better output compared to other thresholding algorithms.

https://www.learnopencv.com/wp-content/uploads/2015/02/opencv-threshold-tutorial-1024x341.jpg
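For illustration, Otsu's threshold can be computed in pure NumPy by maximizing the between-class variance (in practice OpenCV does this in one call: `cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)`). A minimal sketch with a toy bimodal image:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the intensity threshold maximizing between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                 # cumulative class probability
    mu = np.cumsum(prob * np.arange(256))   # cumulative mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

# toy image with dark ink (~10) on a bright background (~200)
img = np.array([[10, 12, 200], [11, 210, 205]], dtype=np.uint8)
t = otsu_threshold(img)          # falls between the two intensity clusters
binary = (img > t).astype(np.uint8) * 255
```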

2. Image Enhancement:

For image enhancement there are several techniques, but in this project I have used contrast stretching, which expands the range of intensity levels in an image.

https://blog.faradars.org/wp-content/uploads/2019/08/Image-processing-in-matlab-fig37.jpg
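Contrast stretching is just a linear rescaling of pixel intensities to the full output range; a minimal NumPy sketch:

```python
import numpy as np

def contrast_stretch(gray, out_min=0, out_max=255):
    """Linearly map the image's [min, max] intensity range onto [out_min, out_max]."""
    lo, hi = int(gray.min()), int(gray.max())
    stretched = (gray.astype(float) - lo) * (out_max - out_min) / (hi - lo) + out_min
    return stretched.astype(np.uint8)

# a low-contrast image occupying only intensities 100..160
img = np.array([[100, 120], [140, 160]], dtype=np.uint8)
contrast_stretch(img)  # values now span the full 0..255 range
```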

3. Increase line width:

I have used this preprocessing step because sometimes the text written in the image is too faint to be visible, so I apply a morphological operation, dilation, to increase the thickness of the text strokes.

https://www.researchgate.net/publication/305375221/figure/fig2/AS:393507714945028@1470830958360/The-dilation-of-an-object-by-a-structuring-element.png
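Dilation slides a structuring element over the binary image and takes the local maximum, which thickens strokes. A pure-NumPy sketch with a square kernel (in practice `cv2.dilate(img, np.ones((3, 3), np.uint8))` does this efficiently):

```python
import numpy as np

def dilate(binary, k=3):
    """Dilate a binary image with a k x k square structuring element."""
    pad = k // 2
    padded = np.pad(binary, pad)            # zero-pad the border
    out = np.zeros_like(binary)
    h, w = binary.shape
    for i in range(h):
        for j in range(w):
            # pixel becomes 1 if any pixel under the kernel is 1
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

stroke = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=np.uint8)
dilate(stroke)  # the single-pixel stroke grows to fill the 3x3 neighborhood
```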

Post-processing Techniques:

1. Decoding techniques:

1. BestPath:

This decoding technique concatenates the most probable character per timestep, which yields the best path. Then the encoding is undone by first removing duplicate characters and then removing all blanks. This gives us the recognized text.
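Best-path (greedy) decoding is simple enough to sketch in a few lines of NumPy; the charset and probability matrix below are toy values for illustration:

```python
import numpy as np

def best_path_decode(probs, charset, blank=0):
    """Greedy CTC decoding: argmax per timestep, collapse repeats, drop blanks."""
    best = np.argmax(probs, axis=1)  # most probable class index per timestep
    collapsed = [best[0]] + [c for prev, c in zip(best, best[1:]) if c != prev]
    return "".join(charset[c] for c in collapsed if c != blank)

# toy example: 5 timesteps over the alphabet {blank, 'a', 'b'}
charset = ["", "a", "b"]
probs = np.array([
    [0.10, 0.80, 0.10],  # 'a'
    [0.10, 0.80, 0.10],  # 'a' again -> collapsed as a repeat
    [0.90, 0.05, 0.05],  # blank     -> separates the two 'a's
    [0.10, 0.80, 0.10],  # 'a'       -> survives thanks to the blank
    [0.10, 0.10, 0.80],  # 'b'
])
best_path_decode(probs, charset)  # → "aab"
```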

2. WordBeamSearch:

Word beam search is a Connectionist Temporal Classification (CTC) decoding algorithm used for sequence recognition tasks such as Handwritten Text Recognition (HTR) and Automatic Speech Recognition (ASR). The following image illustrates an HTR system with its Convolutional Neural Network layers, Recurrent Neural Network layers, and the final CTC (loss and decoding) layer. Word beam search decoding is placed right after the RNN layers to decode the output; see the red dashed rectangle in the illustration.

https://raw.githubusercontent.com/githubharald/CTCWordBeamSearch/master/doc/context.png

2. Spell Correction Technique:

Using a spell checker helps avoid common writing mistakes that are often repeated across different texts. It transforms natural writing into clear, professional output, and the corrected text improves the impression the writing makes.

I have used various spelling correction libraries that are available in Python.

1. TextBlob

2. Autocorrect

3. PySpellChecker

4. Spello

5. Spacy
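The libraries above wrap the same core idea: generate candidate words within a small edit distance and pick one that appears in a known vocabulary. A minimal illustrative corrector (the tiny vocabulary here is a toy assumption, not what the project used):

```python
import string

VOCAB = {"handwritten", "recognition", "text", "image"}  # toy vocabulary

def edits1(word):
    """All strings one edit (delete, replace, insert, transpose) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    return set(deletes + replaces + inserts + transposes)

def correct(word):
    """Return the word itself if known, else a known word one edit away."""
    if word in VOCAB:
        return word
    candidates = edits1(word) & VOCAB
    return min(candidates) if candidates else word  # min() for determinism

correct("recogniton")  # → "recognition"
```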

Applications of OCR

1. Data entry for business documents, e.g. cheques, passports, invoices, bank statements, and receipts.

2. Automatic number plate recognition.

3. In airports, for passport recognition and information extraction.

4. Automatic insurance documents key information extraction.

5. Traffic sign recognition.

Further Improvement

1. For further improvement, we can add line segmentation for full-paragraph text recognition.

2. Better image preprocessing, such as reducing background noise, to handle real-world images more accurately.

3. Better decoding techniques to improve accuracy.

4. A self-made spell-correction algorithm.

Conclusion

In this blog, we have discussed how a CRNN (CNN + LSTM) is able to recognize text in images, along with its detailed architecture. The architecture consists of 7 CNN layers and 2 LSTM layers and outputs a character-probability matrix. This matrix is used for CTC loss calculation and decoding. We have also discussed how data pre-processing techniques increase accuracy and make a real-time handwriting recognition system possible, with detailed code. Finally, further improvements of this system were given.

References

[1] A. Hannun, Sequence Modeling with CTC (2017), Distill

[2] https://github.com/githubharald/CTCWordBeamSearch/tree/master/data/iam

[3] https://github.com/sushant097/Handwritten-Line-Text-Recognition-using-Deep-Learning-with-Tensorflow

[4] T. Bluche, J. Louradour, and R. Messina, Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention (2016)
