PDF and JPG document processing: recognition, tagging, text extraction

There are many RPA platforms available that can automate almost any interaction with working applications and relieve humans of routine tasks.
Editorial Committee, Qualified.One

However, many companies face problems automating the processing of unstructured documents in PDF or JPG format. These could be agreements or invoices from contractors, or applications from clients.

You need to extract valuable data from each incoming document: the date of the invoice, the final amount of the agreement, the sender's return address, and the name of the contracting party.

This data then has to be entered into the electronic document management system, reconciled with data from internal systems, or the expenditure items calculated and compared against the final amount.

The data in such documents is not structured. In addition, different clients and counterparties use many different formats for this data.

A strictly rule-based approach to handling such documents requires enormous development, support and subsequent expansion of automation.

In such cases, machine learning technologies are essential to make process automation 'intelligent'. In other words, an Intelligent Automation platform (IA platform) is needed.

Automating the processing of unstructured documents takes three components, each with its own stack of technologies and approaches. In this article, we describe how we use them at IBA IT in the operation of our RPA Chancellor Platform.

Component #1: recognition

There are a large number of Optical Character Recognition (OCR) engines on the market, both paid and open source. We decided to investigate which technologies are the most mature and to choose the ones that best fit our requirements. It was important that the license permits commercial use and that recognition quality on scanned documents is good.

We chose Tesseract OCR, originally developed at HP and later open-sourced, with development sponsored by Google. We also looked at PaddleOCR, an ambitious project that often produced the best results, and implemented it into the platform as an alternative engine.

To improve OCR quality we pre-process documents using ImageMagick. For most of the documents we encounter in practice, basic preprocessing is sufficient: changing the DPI of the image, bleaching the background, correcting a possible tilt of the document, and removing transparency.

ImageMagick is installed with the following commands.

For CentOS, Fedora:


yum -y install ImageMagick

For Debian, Ubuntu:


apt-get -y install imagemagick

For example, suppose we have an image named "input.png". To preprocess it, we use a command with the following ImageMagick parameters:


magick convert input.png -units PixelsPerInch -resample 350 -density 350 -quality 100 -background white -deskew 40% -alpha remove -flatten output.png
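In an automated pipeline the same command can be wrapped in a small helper; below is a minimal Python sketch, assuming ImageMagick 7 is on the PATH (the function names are ours, not part of any library):

```python
import subprocess

def build_preprocess_cmd(src: str, dst: str, dpi: int = 350) -> list[str]:
    """Build the ImageMagick preprocessing command as an argument list."""
    return [
        "magick", "convert", src,
        "-units", "PixelsPerInch",
        "-resample", str(dpi),
        "-density", str(dpi),
        "-quality", "100",
        "-background", "white",
        "-deskew", "40%",
        "-alpha", "remove",
        "-flatten",
        dst,
    ]

def preprocess(src: str, dst: str, dpi: int = 350) -> None:
    """Run the preprocessing step, raising if ImageMagick reports an error."""
    subprocess.run(build_preprocess_cmd(src, dst, dpi), check=True)
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names containing spaces.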

Here you can see that we usually resample to 350 DPI. The official Tesseract documentation mentions that the engine works better on images with "at least 300 DPI" but does not say how to determine the best value. It does, however, cite some interesting experiments showing that the right DPI is best derived from the height of the capital letters in the image being processed.

We repeated these experiments and reached an unexpected conclusion about the optimal capital-letter size in pixels. It would seem that the bigger the letters, the fewer OCR errors there should be. In practice this turned out not to be so: Tesseract version 4.0.0 produces the fewest errors when the height of capital letters is between 20 and 35 pixels.


This, of course, depends on the font used in the document. However, most official documents use similar fonts in roughly the same sizes. Therefore we suggest 350 DPI as a base setting, which usually brings the height of capitals to about 20-35 pixels.
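The arithmetic connecting font size, DPI and pixel height is straightforward; a small sketch (the 7 pt cap height assumed for a 10 pt font is typical but font-dependent):

```python
def capital_height_px(cap_height_pt: float, dpi: int) -> float:
    """Height in pixels of a capital letter, given its height in points (1 pt = 1/72 inch)."""
    return cap_height_pt / 72 * dpi

# A 10 pt body font typically has capitals around 7 pt tall; at 350 DPI
# that is about 34 px, within the 20-35 px sweet spot described above.
print(capital_height_px(7, 350))
```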

Here's an example of a pre-processed document where the slant has been corrected, contrast has been increased and only black and white colours have been used:

[Image: example of a pre-processed document]

Component #2: tagging

To prepare a training set of documents we developed a special kind of manual task for our platform, where a person selects text with the mouse on the original document, thereby specifying the location of the desired data.

For the UI implementation we chose ReactJS, the most popular library for creating user interfaces. ReactJS is also flexible and, because the application is split into components, the code stays readable and is easier to maintain and debug.

Most often we receive a PDF document, which we convert to an image using ImageMagick and Ghostscript. Suppose we get an image like this:

[Image: sample scanned document]

Next, we send the document to OCR. The result of the OCR is the following HTML structure:

[Image: fragment of the hOCR output]

We see a hierarchy like this: each document is divided into pages ("ocr_page"), each page can have columns ("ocr_carea"), a column contains paragraphs ("ocr_par"), and so on down to the smallest structure, the word ("ocrx_word"). The attributes of each block contain its location relative to the original document. For example, "bbox 2215 236 2443 288" means that the top-left corner of the word is at X=2215, Y=236 and the bottom-right corner is at X=2443, Y=288.
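This hOCR markup can be parsed with nothing but the Python standard library. A minimal sketch (the fragment and its values are illustrative, not the full Tesseract output):

```python
from html.parser import HTMLParser

# A tiny hOCR fragment in the style Tesseract produces (illustrative values).
HOCR = '''
<div class="ocr_page" title="bbox 0 0 2480 3508">
  <p class="ocr_par" title="bbox 2215 236 2443 288">
    <span class="ocrx_word" title="bbox 2215 236 2443 288; x_wconf 96">[email protected]</span>
  </p>
</div>
'''

class WordExtractor(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word element."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            # The title attribute looks like "bbox 2215 236 2443 288; x_wconf 96".
            bbox_part = a["title"].split(";")[0]
            self._bbox = [int(v) for v in bbox_part.split()[1:]]

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

parser = WordExtractor()
parser.feed(HOCR)
print(parser.words)
```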

So in the UI we have the picture of the original document and the OCR result: the extracted text together with its coordinates.

Then we display the document picture in a regular <img> tag, without any additional layers, and start monitoring the JS events for mouse movement over the image and for selection. When the user selects an area of the image, the event handler gives us the coordinates of the selected region.

Now we have to determine which areas of the OCR result are affected by the selection. Mathematically, this is a simple intersection of one rectangle with another. But the operation has to be as fast as possible: while the user drags the mouse, selecting an ever larger area, we must recalculate the intersections on every change. To speed this up, we decided to convert the HTML to JSON while preserving the nesting structure. The HTML from the example above then takes the following JSON form:


{
  "pages":
  [
      {
          "id": "page0",
          "properties":
          {
              "bbox": [0.0, 0.0, 1.0, 1.0]
          },
          "areas":
          [
              {
                  "id": "page0_area0",
                  "properties":
                  {
                      "bbox": [0.703844931680966, 0.05722003929273085, 0.9532888465204957, 0.07465618860510806]
                  },
                  "paragraphs":
                  [
                      {
                          "id": "page0_area0_paragraph0",
                            "properties":
                          {
                              "bbox": [0.703844931680966, 0.05722003929273085, 0.9532888465204957, 0.07465618860510806]
                          },
                          "lines":
                          [
                              {
                                  "id": "page0_area0_paragraph0_line0",
                                    "properties":
                                  {
                                        "bbox": [0.703844931680966, 0.05722003929273085, 0.9532888465204957, 0.07465618860510806],
                                  },
                                    "words":
                                  [
                                      {
                                            "id": "page0_area0_paragraph0_line0_word0",
                                          "properties":
                                          {
                                                "bbox": [0.703844931680966, 0.05795677799607073, 0.7762948840165237, 0.07072691552062868],
                                                "x_wconf": 96.0
                                          },
                                            "text": "[email protected]"
                                      }
                                  ]
                              }
                          ]
                      }
                  ]
              }
          ]
      }
  ]
}

Now, by getting the coordinates of the custom selection, we start to walk through the OCR-JSON structure deep into the tree instead of going through the coordinates of each extracted word. For example, at the paragraph level we can immediately see whether to go down a level to the lines or to go to the next paragraph to find the occurrence.
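A sketch of this pruning walk in Python (field names follow the OCR-JSON structure shown earlier; all bboxes are relative coordinates [x0, y0, x1, y1]):

```python
def intersects(a: list, b: list) -> bool:
    """True if two bboxes [x0, y0, x1, y1] overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def words_in_selection(page: dict, selection: list) -> list:
    """Walk the OCR-JSON tree, pruning whole branches whose bbox misses the selection."""
    hits = []
    for area in page["areas"]:
        if not intersects(area["properties"]["bbox"], selection):
            continue  # skip every paragraph, line and word inside this area
        for par in area["paragraphs"]:
            if not intersects(par["properties"]["bbox"], selection):
                continue
            for line in par["lines"]:
                if not intersects(line["properties"]["bbox"], selection):
                    continue
                for word in line["words"]:
                    if intersects(word["properties"]["bbox"], selection):
                        hits.append(word)
    return hits
```

Because a parent bbox always encloses its children, rejecting an area or paragraph rejects everything inside it in one comparison, which keeps recalculation cheap during mouse drags.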

If we determine that a region of a word falls within the user's selection, we create an element with a semi-transparent background with an absolute position to mimic the selection. This gives the user the impression that they have highlighted the word in the document itself, even though technically it's just a picture that doesn't contain a text layer.

[Image: selected words highlighted on the document image]

We also use JSON as output, which contains information about the extracted entities and their relative coordinates. Such data carries the main value and is further used to train the Machine Learning (ML) model:


{
  "entities":
  [
      {
          "content": "8603163534",
          "name": "Invoice Number",
          "words":
          [
              {
                    "content": "8603163534",
                  "bbox":
                  [
                        0.7562758182395932,
                        0.2087426326129666,
                      0.8433428662217985,
                      0.2180746561886051
                  ],
                  "id": "page0_area6_paragraph0_line0_word0",
                  "page": 0
              }
          ]
      }
  ]
}

Component #3: extraction

For automatic text extraction we chose the spaCy ML library, which supports 60+ languages, trains quickly, and offers many pre-trained neural-network models.

Our platform processes documents from different areas: bank statements, insurance applications, retail invoices, questionnaires for HR departments, and so on. We therefore train a separate model for each document type of a particular client, which gives the best results. Thanks to the tagging component described above, generating a training set requires no specialized ML knowledge. Usually this is done by Subject-Matter Experts (SMEs): client employees who extracted this data manually before automation and who know best which text should be highlighted in a particular document.
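For illustration, the tagging output shown earlier might be mapped to spaCy's training annotation format roughly like this (the helper and sample text are ours, and real documents need extra care when the same string occurs more than once):

```python
def to_spacy_examples(doc_text: str, tagging: dict) -> list:
    """Convert tagging output ({"entities": [{"content": ..., "name": ...}]})
    into spaCy's format: (text, {"entities": [(start_char, end_char, label)]})."""
    spans = []
    for entity in tagging["entities"]:
        start = doc_text.find(entity["content"])
        if start == -1:
            continue  # tagged text not found in the OCR text; skip it
        spans.append((start, start + len(entity["content"]), entity["name"]))
    return [(doc_text, {"entities": spans})]

text = "Invoice No. 8603163534 dated 2021-03-01"
tagging = {"entities": [{"content": "8603163534", "name": "Invoice Number"}]}
print(to_spacy_examples(text, tagging))
```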

One of the main questions when automating document processing is how many historical documents are required to train the ML model. We have found that 50 to 100 documents are enough for a quick start: a first version of the model that is immediately useful. To determine the optimal number, we ran training experiments in which only the size of the training set changed, calculating Accuracy, Precision, Recall and F1 Score for each model. The optimal number turned out to be approximately 500 documents; adding more brings no significant improvement in accuracy.
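The metrics themselves are the standard entity-level formulas; for example, with 90 correctly extracted entities, 10 spurious ones and 20 missed:

```python
def prf1(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from entity-level true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 90 correct, 10 spurious, 20 missed: precision 0.9, recall ~0.818, F1 ~0.857
print(prf1(90, 10, 20))
```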

[Chart: model quality versus training-set size]

Once the model is trained, it can be embedded in the business process and will start tagging documents automatically instead of a person. If the automatic validation rules require a manual check, we send the ML model's output to our tagging component and perform the reverse operation: highlighting the extracted areas on the document. This way a person can see where the data was extracted from:

[Image: extracted data highlighted on the document]

In this way, intelligent automation combines the capabilities of robotic processes and machine learning. Careful research into available open-source solutions and thorough work on training ML models allowed us to create an RPA platform that handles unstructured documents reliably. The user, meanwhile, feels as if they are simply selecting words with the mouse.