Complexities of table extraction on medical and claims documents
This is the third article in our series where we explain how the document intelligence pipeline at Qantev works. The goal of our document intelligence pipeline is to automate the extraction of information from the documents received by our insurance clients.
In the first article [Qantev OCR for Health Insurance] we described our OCR algorithm, which is able to read a scanned document in different languages such as English, French, Thai and more! In the second article [Qantev Information Extraction for Health Insurance] we described our information extraction algorithm, which extracts information such as the patient name, policy number, provider name and every other important field from a health insurance pre-approval document. Today, we will discuss our table extraction pipeline, which extracts tables from scanned documents and generates a csv file.
Insurers can receive thousands of reimbursement claims per day, each containing at least one table with key information about the claim. One of the main tables that needs to be extracted is the treatment table, which aggregates all the procedures performed by the hospital during the member's stay.
In the case of a broken left leg, for example, every procedure is listed in the treatment table, from the first consultation with the doctor to the surgery, including all the drugs and equipment that were used.
Treatment tables can be large, and transcribing one in full means the claims handler has to type everything in manually. For efficiency's sake, claims handlers often transcribe only the total amount of all procedures and the main reason for the claim, losing most of the other precious information present in the document.
With our table extraction pipeline, the treatment table is automatically extracted in the format of a csv table containing well separated columns and rows. Below we will dive deep into our pipeline to explain how we are able to achieve this.
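As a toy illustration of that output format (not Qantev's actual code), once a grid of cell texts has been recovered, writing the csv is straightforward with Python's standard library. The field names and values below are invented for the example:

```python
import csv
import io

# Invented example rows mirroring a treatment table's cell grid
rows = [
    ["Date", "Code", "Description", "Qty", "Unit Price", "Total"],
    ["2023-01-05", "XR01", "Left leg X-ray", "1", "85.00", "85.00"],
    ["2023-01-05", "SG12", "Fracture surgery", "1", "1200.00", "1200.00"],
]

# Serialise the cell grid to csv, one table row per line
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

In production the same idea applies, except the rows come from the extraction pipeline rather than being hard-coded.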
How to extract tables from an image into csv?
To explain our algorithm, let's take a look at a real-life situation we have encountered at Qantev. As previously mentioned, we deal mainly with treatment tables, which contain information about all the procedures performed on a member during their time at the hospital. Usually, the treatment table contains fields such as Date, Code, Description, Quantity, Amount Per Unit, Total Amount… Below is an example of an anonymised treatment table image:
As you can imagine, transcribing this table manually from scratch would be very time consuming. At Qantev, we fully automate this using AI. During the early phases of deployment, the only thing the claims handlers need to do is pick a few random documents, check the resulting table for inconsistencies, apply corrections, and then approve the result.
To achieve this we have built several capabilities. Given a scanned document containing a treatment table, we start by detecting the table's location. We do this with an in-house model similar to open-source pre-trained algorithms like CascadeTabNet [1] or TableNet [2], but with targeted fine-tuning that enables it to perform extremely well on health insurance documents.
After detecting where the table is, we crop the image to the table region and apply our OCR model. Below you can see the result of our algorithm:
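As a minimal sketch of the cropping step — assuming the detector returns an (x, y, w, h) bounding box, and using plain NumPy slicing in place of our actual pipeline — the table region can be cut out with a small padding so border lines are not lost. All coordinates below are invented:

```python
import numpy as np

def crop_table(page, box, pad=5):
    """Crop the detected table region from the page image.

    `box` is an (x, y, w, h) bounding box as produced by a table
    detector; `pad` keeps a small margin so ruling lines survive.
    """
    x, y, w, h = box
    h_img, w_img = page.shape[:2]
    x0, y0 = max(0, x - pad), max(0, y - pad)
    x1, y1 = min(w_img, x + w + pad), min(h_img, y + h + pad)
    return page[y0:y1, x0:x1]

# Toy 100x200 grayscale "page" and an invented bounding box
page = np.zeros((100, 200), dtype=np.uint8)
table = crop_table(page, (50, 20, 60, 30))
print(table.shape)  # (40, 70)
```

The crop is then what gets passed to the OCR model.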
The challenge is how to properly extract the columns and the rows!
This is actually a very hard problem. Some open-source algorithms such as CascadeTabNet [1] try to use deep learning object detection to segment each cell, but in our tests their performance was lacking, even with fine-tuning on our data.
Although we love leveraging deep learning techniques at Qantev, we focus on the best tool for the job. We therefore went with a traditional computer vision approach that uses pixel densities projected along each axis to extract the rows and columns. This is the csv table extracted by our model:
Our technique outperformed the alternative deep learning approaches: it is template agnostic and much faster, and it generalises to different table structures using only a few pre-processing steps.
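To make the idea concrete, here is a minimal sketch of projection-profile splitting — not our production code — using only NumPy on a toy binarised image, where 1 marks an ink pixel. The blank runs in the row and column profiles are exactly the separators between table rows and columns:

```python
import numpy as np

def find_gaps(profile, max_density=0.0):
    """Return (start, end) index runs where the ink profile is empty.

    These blank runs along one axis are the separators between
    consecutive rows (or columns) of the table.
    """
    gaps, start = [], None
    for i, v in enumerate(profile):
        if v <= max_density:
            if start is None:
                start = i
        elif start is not None:
            gaps.append((start, i))
            start = None
    if start is not None:
        gaps.append((start, len(profile)))
    return gaps

# Toy binarised table crop: a 2x2 grid of "cells" with text ink
img = np.zeros((8, 8), dtype=int)
img[1:3, 1:3] = img[1:3, 5:7] = 1
img[4:6, 1:3] = img[4:6, 5:7] = 1

row_gaps = find_gaps(img.mean(axis=1))  # blank horizontal bands
col_gaps = find_gaps(img.mean(axis=0))  # blank vertical bands
print(row_gaps)  # [(0, 1), (3, 4), (6, 8)]
print(col_gaps)  # [(0, 1), (3, 5), (7, 8)]
```

On real scans the `max_density` threshold has to tolerate noise rather than requiring perfectly blank gaps, but the splitting logic is the same.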
Conclusion
In this article we have shared an overview of how Qantev's table extraction pipeline works. Our proprietary techniques have proven effective at helping insurers across the world dramatically reduce the time their employees spend on manual tasks, while also improving their data!
If you haven’t checked them out yet, please take a look at the previous articles in this series, where we explained the inner workings of our OCR in the first part [Qantev OCR for Health Insurance] and described the information extraction pipeline we use to retrieve specific fields from scanned documents [Qantev Information Extraction for Health Insurance].
[1] CascadeTabNet: https://arxiv.org/pdf/2004.12629.pdf
[2] TableNet: https://arxiv.org/pdf/2001.01469.pdf