Why a specialized approach to OCR is important in health insurance

At Qantev we aim to improve the business workflows of health insurance companies using advanced Artificial Intelligence techniques. Currently, the internal process of health insurers is mostly manual: they receive hundreds of claims per day, and operators partially transcribe these documents into their systems. This approach leads to two problems. First, it is time consuming. Second, to save time, only the few fields required for claims adjudication get transcribed, so a large amount of valuable information is lost in the process.
Qantev solves this problem with its Document Intelligence pipeline that is able to automate the information extraction process for the most common documents in the health insurance claims process. Today we will focus on pre-approval/pre-authorisation forms and medical invoices.
The pre-approval forms contain information about the patient along with the initial medical diagnosis made by the doctor before further exams. The medical invoice offers an itemised list of all the procedures and services provided to the patient that need to be considered for payment by the insurer.
This article deals with the first step of the document intelligence pipeline that we apply to this kind of document: how to apply computer vision techniques to them, and the challenges that come with doing so.
How is the computer able to read a document?
According to [1], at around three years old, a child is able to start recognising letters and by the age of six the child is normally able to read. However, can a computer read? If we give a document to a computer it is only able to read pixels, not letters, so how can we make it understand the text inside this image document?
This is a classical problem in computer vision known as Optical Character Recognition (OCR), where the goal is to detect and transcribe the text in an image. Early approaches relied on traditional computer vision techniques, such as contour detection and SIFT [2]. Nowadays, many deep learning algorithms achieve much better results on the OCR task.
An OCR model is composed of two parts: a text detector and a text recogniser. The text detector finds where text appears in the image and outputs a bounding box for each location. The text recogniser takes an image containing text and transcribes it into a character string. With these two parts, the pipeline is very simple: we detect where the text is in the image and, for each detected piece of text, we crop the image using the bounding box location and then transcribe the crop using the text recogniser.
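The detect, crop and recognise loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the MoustacheOCR implementation: the names `BoundingBox`, `crop`, `run_ocr`, `toy_detector` and `toy_recogniser` are hypothetical, and the two toy functions are trivial stand-ins for real detection and recognition models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


@dataclass(frozen=True)
class BoundingBox:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop(image: Image, box: BoundingBox) -> Image:
    """Return the sub-image covered by the bounding box."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def run_ocr(
    image: Image,
    detect_text: Callable[[Image], List[BoundingBox]],
    recognise_text: Callable[[Image], str],
) -> List[Tuple[BoundingBox, str]]:
    """Detect text regions, crop each one, and transcribe the crop."""
    results = []
    for box in detect_text(image):
        results.append((box, recognise_text(crop(image, box))))
    return results


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_detector(image: Image) -> List[BoundingBox]:
    # Treat every row that contains a non-zero pixel as one line of text.
    boxes = []
    for y, row in enumerate(image):
        columns = [x for x, value in enumerate(row) if value]
        if columns:
            boxes.append(BoundingBox(columns[0], y, columns[-1] + 1, y + 1))
    return boxes


def toy_recogniser(cropped: Image) -> str:
    # Pretend each pixel value is a character code.
    return "".join(chr(value) for row in cropped for value in row)
```

Passing the detector and the recogniser as arguments keeps the two parts independent, so either one could be swapped or fine-tuned without touching the other.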

There are many open source OCR projects available online, like pytesseract [3], EasyOCR [4] and PaddleOCR [5]. While these tools are easy to use, they did not give good enough results in our tests. We found that they have two main problems:
- They manage to detect the text most of the time, but they have trouble recognising/transcribing it. If the text is handwritten, the result is even worse.
- When it comes to reading documents in other languages, only EasyOCR was able to tackle the problem in our tests. With as many different languages to deal with as we have at Qantev, none of these tools achieves good enough results.
For these reasons, at Qantev, we developed our own OCR tool called MoustacheOCR. We realised that these hugely popular open source projects, although driven by deep learning models, are not able to achieve what we want: they still rely on older techniques that are far from the performance of the current state of the art.
In our solution, we take advantage of recent state-of-the-art Transformer techniques to read the claims documents. Transformers were first proposed in 2017 [6] for Natural Language Processing and in 2021 they were expanded to the Computer Vision field. For us, the biggest advantage of Transformers over previous models like CNNs and RNNs is how well they lend themselves to pre-training. Using self-supervised learning techniques, we can pre-train a Transformer network without any labels, and then we only need to fine-tune it on a small set of labelled data.
For the text detection part, we use an Object Detection Transformer network that is first pre-trained on tens of thousands of unlabelled documents and then fine-tuned on documents that were manually annotated. With this approach, we have a text detector that performs really well on most of the documents that health insurers deal with. Following that, for each customer, we label some of their data and fine-tune our model again so it can capture the specificities of the customer’s data.
For the text recognition part, which is the biggest weakness of the open source tools, we use a multimodal Transformer architecture in which a Vision Transformer [7] encodes the image and a multilanguage Transformer model decodes the text inside this image. We decided to use a multilanguage model by default, since we work with many customers all around the world.
To train our model, we again take advantage of state-of-the-art self-supervised pre-training techniques, first pre-training the two components separately. The Vision Transformer encoder is pre-trained on hundreds of thousands of relevant pieces of text in different languages, and the multilanguage Transformer decoder is pre-trained on millions of medical text samples available on the internet. After that, we fine-tune the full model on a collection of multi-language annotated data, and we fine-tune it a second time once we have our client’s data.
Results
To analyse the results of our algorithm, MoustacheOCR, compared to the available open source tools, we focus on the text recognition part, which is the hardest one. For this part, we report a common metric, the Character Error Rate (CER).
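For readers unfamiliar with the metric, CER is commonly defined as the character-level edit distance (insertions, deletions and substitutions) between the predicted text and the ground truth, divided by the length of the ground truth. The sketch below follows that standard definition; the function names are ours, and we are not claiming this is the exact evaluation script used for the results in this article.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, transcribing "amoxicillin" as "amoxicilin" is one missing character over 11, so the CER is about 9%. Lower is better, and the value can exceed 100% when the prediction is much longer than the reference.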
We collected a set of documents containing printed and handwritten text in English and compared our algorithm against pytesseract, EasyOCR and PaddleOCR. Note that neither the open source algorithms nor ours were fine-tuned on this dataset; we used their pre-trained versions.

Out of curiosity, when we fine-tuned MoustacheOCR on 50% of the above-mentioned dataset and tested it on the other half, our algorithm achieved 2.5% CER on printed data and 12% CER on handwritten data.
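A half-and-half split like the one above can be made reproducible with a seeded shuffle. The helper below is a minimal sketch under our own assumptions (the name `split_in_half` and the default seed are hypothetical); it only shows how to keep the fine-tuning half and the test half disjoint.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_in_half(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed, then return (fine-tuning half, test half)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]
```

Using a local `random.Random(seed)` instead of the global generator means the same documents end up in the same half on every run, which keeps the reported CER comparable between experiments.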
Below you can see the output of our OCR on a document created internally at Qantev.

Conclusion
In this article we explained the structure of our OCR pipeline and the different techniques applied to achieve the results that we want. Today, the algorithm developed internally at Qantev manages to detect and transcribe text in multiple languages. The levels of accuracy and performance that we achieved allow us to deliver highly robust decision making and automation to our customers all around the world.
We saw that, without fine-tuning, our internally developed OCR tool achieves up to 6 times lower CER on printed data and 3 times lower CER on handwritten data compared to other open source OCR tools.
In the next articles we will discuss the next steps of our document intelligence pipeline, which use the output of this OCR to retrieve specific information from the document and to extract tables from the documents into CSV format.
References
[1] https://smallerscholarshouston.com/at-what-age-should-a-child-know-the-alphabet/
[2] https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
[3] https://github.com/madmaze/pytesseract
[4] https://github.com/JaidedAI/EasyOCR
[5] https://github.com/PaddlePaddle/PaddleOCR
[6] https://arxiv.org/pdf/1706.03762.pdf
[7] https://arxiv.org/pdf/2010.11929v2.pdf
