How to read PDF files using Azure Form Recognizer

This is a continuation post: we have already shown how to read PDFs using AI Builder, which is itself built on top of Azure OCR. A few prerequisites must be completed before using the Azure services.

Azure works on a pay-as-you-go subscription model. When a user creates a new account, certain services are free for the first 12 months, and all services are available to try for the first 30 days.

Setting up Azure Form Recognizer and Storage Account

In Azure, each service is called a resource, and related resources can be grouped under a resource group.

  • Create a resource group called “OCR_FormRec”
  • Create a new Form Recognizer service under the “OCR_FormRec” resource group and select the pricing tier “Free F0”.
    1. Form Recognizer is available under “Applied AI services”. It is built on top of Azure Computer Vision (OCR) and ships with some prebuilt models that we can make use of.
  • Create a “Storage account” and a “Container” inside it to hold the PDF documents.
    1. The container holds the PDF documents, the model data and the processed results.
  • Edit the “Container” and generate the “Shared Access Signature (SAS)” URL. Make sure to select all the permissions and provide the start and expiry dates (the expiry date must be later than the start date).
    1. SAS is simply a tokenized URL that allows other applications to connect to the blob container.
//Form recognizer end point URL looks like
https://365formrec.cognitiveservices.azure.com/

//Generated SAS url
https://ocrstorageaccount365.blob.core.windows.net/365blobcontainer?sp=racwdli&st=2021-11-24T15:30:20Z&se=2021-11-25T23:30:20Z&spr=https&sv=2020-08-04&sr=c&sig=%2Fk%2FrOeajdJ5cqd%2BJQ%2BL5oM1uvAHcYxCPac8RSRovjow%3D
  • Navigate to the “Storage account” and search for CORS. Select all the methods, set Allowed origins, Allowed headers and Exposed headers to “*”, and set Max age to 200.
    1. Anyone who has worked with APIs will be aware of “Cross-Origin Resource Sharing”. It is an HTTP-based security mechanism that restricts access from other domains, so we need to allow the domains explicitly.
  • Navigate to the “Container” and upload a minimum of 5 PDF files to train the model.
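The SAS URL is just the container URL plus a signed query string, where each query parameter encodes one aspect of the grant. As a quick sanity check, it can be decomposed with Python's standard library. This is a minimal sketch using the sample URL generated above (whose signature has long since expired):

```python
from urllib.parse import urlsplit, parse_qs

# Sample SAS URL generated above (token expired; for illustration only)
sas_url = ("https://ocrstorageaccount365.blob.core.windows.net/365blobcontainer"
           "?sp=racwdli&st=2021-11-24T15:30:20Z&se=2021-11-25T23:30:20Z"
           "&spr=https&sv=2020-08-04&sr=c"
           "&sig=%2Fk%2FrOeajdJ5cqd%2BJQ%2BL5oM1uvAHcYxCPac8RSRovjow%3D")

parts = urlsplit(sas_url)
token = parse_qs(parts.query)

# sp = granted permissions, st/se = start/expiry, sr = resource type ('c' = container)
print("container:  ", parts.path.lstrip("/"))
print("permissions:", token["sp"][0])
print("valid:      ", token["st"][0], "->", token["se"][0])
print("resource:   ", token["sr"][0])
```

If any of these pieces is missing (for example, an expired `se` or too few permissions in `sp`), the labelling tool's connection in the next section will fail.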

Train the model

Azure provides a free labelling tool to train the model, and it also offers prebuilt models that show results instantly.

  • Create a new connection to access the blob container: enter the SAS URL here and save the connection.
  • Click “New project”, enter the project name, select the connection, enter the Form Recognizer endpoint URL and pass the API key. Failing to complete any of the prerequisites mentioned above will throw an error.
  • Once the project is created, it will load all the documents uploaded to the container.
  • Next we need to create tags, which map to the fields that should be captured from the PDF. To capture table data, a table tag is available in two types; based on the documents, we can map the fields.
  • Draw the area that needs to be captured and map it to the tags.
  • Once tagging is completed for all the documents, run the layout on all the documents.
  • The next step is to train the model. Enter the model name and start training; after a few minutes it will show the results with accuracy scores, as in the JSON below. If the result is not up to the mark, we can go back and improve the tagging.
{
   "modelInfo":{
      "modelId":"f3e789be-fb16-459c-bd4f-9e06ca2af1ff",
      "modelName":"365OCRModel",
      "attributes":{
         "isComposed":false
      },
      "status":"ready",
      "createdDateTime":"2021-11-24T16:25:04Z",
      "lastUpdatedDateTime":"2021-11-24T16:25:06Z"
   },
   "trainResult":{
      "averageModelAccuracy":0.986,
      "trainingDocuments":[
         {
            "documentName":"Adatum 1.pdf",
            "pages":1,
            "status":"succeeded"
         },
         {
            "documentName":"Adatum 2.pdf",
            "pages":1,
            "status":"succeeded"
         },
         {
            "documentName":"Adatum 3.pdf",
            "pages":1,
            "status":"succeeded"
         },
         {
            "documentName":"Adatum 4.pdf",
            "pages":1,
            "status":"succeeded"
         },
         {
            "documentName":"Adatum 5.pdf",
            "pages":1,
            "status":"succeeded"
         }
      ],
      "fields":[
         {
            "fieldName":"Address1",
            "accuracy":0.995
         },
         {
            "fieldName":"Address2",
            "accuracy":0.995
         },
         {
            "fieldName":"CompanyName",
            "accuracy":0.995
         },
         {
            "fieldName":"Date",
            "accuracy":0.995
         },
         {
            "fieldName":"Email",
            "accuracy":0.995
         },
         {
            "fieldName":"InvoiceNo",
            "accuracy":0.995
         },
         {
            "fieldName":"Table",
            "accuracy":0.9
         },
         {
            "fieldName":"Table: Description",
            "accuracy":0.867
         },
         {
            "fieldName":"Table: Line",
            "accuracy":0.867
         },
         {
            "fieldName":"Table: Quantity",
            "accuracy":0.995
         },
         {
            "fieldName":"Table: Unit",
            "accuracy":0.867
         }
      ],
      "errors":[
         
      ]
   }
}
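The per-field accuracy numbers in this response are what you iterate on. As a small sketch (against a trimmed copy of the trainResult above, with a hypothetical 0.9 cut-off), the fields that need more labelled samples can be flagged programmatically:

```python
import json

# Trimmed copy of the trainResult payload shown above
train_result = json.loads("""
{
  "averageModelAccuracy": 0.986,
  "fields": [
    {"fieldName": "InvoiceNo",          "accuracy": 0.995},
    {"fieldName": "Table",              "accuracy": 0.9},
    {"fieldName": "Table: Description", "accuracy": 0.867},
    {"fieldName": "Table: Unit",        "accuracy": 0.867}
  ]
}
""")

THRESHOLD = 0.9  # hypothetical cut-off; tune per project
weak = [f["fieldName"] for f in train_result["fields"] if f["accuracy"] < THRESHOLD]
print("fields to re-tag:", weak)  # ['Table: Description', 'Table: Unit']
```

In this run, the plain form fields scored well while the table sub-fields lagged behind, which matches the usual experience: tables are the first place to spend extra tagging effort.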
  • The last step is to test the model: upload a PDF and run the analysis. Note the model ID in the response; it will be used in the HTTP request.

The hard part is done. The whole configuration process can feel laggy at times, and some actions need to be repeated throughout. These sample documents are all in the same format, so it was easy to tag them and get the expected accuracy.

But in reality, in our experience, the original PDF documents we get from the business will not be so uniform. We may get scanned documents, there may be noise in the document, and the PDF quality can be very low. Training the model for such scenarios takes more time, and a minimum accuracy of 70% is mandatory.

As an aside, not related to the current flow: we mentioned some prebuilt models earlier with which we can see results instantly, without any training.

Pass the Form Recognizer endpoint URL and secret key; after uploading the file, we can see the results. The labels and values are mapped automatically based on the form type we selected.

Capturing the results

Here comes our favorite part: we are going to use Power Automate to get the results from this model.

  • For this example we have used a manually triggered flow to upload the documents, but in real time you can set the trigger based on the requirements.
  • An “HTTP Request” action should be added with the method set to POST, because we are passing the document to get the results. If the request succeeds, it returns a 202 (Accepted) status and a URL in the “Operation-Location” header, where we can get the results. The secret key from Form Recognizer should be passed in the headers.
  • A delay of a minute should be added; the model takes some time to produce the result.
//URL Split up
https://<Endpoint>/formrecognizer/v2.1/custom/models/<model ID>/analyze

https://365formrec.cognitiveservices.azure.com/formrecognizer/v2.1/custom/models/f3e789be-fb16-459c-bd4f-9e06ca2af1ff/analyze?includeTextDetails=True

//Output URL from Post request
Output URL : @{outputs('Pass_PDF')['headers']?['Operation-Location']}
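The same POST-then-poll pattern can be sketched in code. The following is a minimal Python sketch using only the standard library; the endpoint and model ID are the sample values from this post, the key is a placeholder, and none of them will work as-is. The URL construction is the part worth getting right:

```python
import json
import time
import urllib.request

ENDPOINT = "https://365formrec.cognitiveservices.azure.com"  # sample endpoint from this post
MODEL_ID = "f3e789be-fb16-459c-bd4f-9e06ca2af1ff"            # sample model ID from this post
API_KEY = "<your Form Recognizer secret key>"                # placeholder

def analyze_url(endpoint: str, model_id: str) -> str:
    """Build the v2.1 custom-model analyze URL used in the HTTP Request action."""
    return (f"{endpoint}/formrecognizer/v2.1/custom/models/"
            f"{model_id}/analyze?includeTextDetails=True")

def analyze_pdf(pdf_path: str) -> dict:
    """POST the PDF, then poll the Operation-Location URL until the result is ready."""
    with open(pdf_path, "rb") as f:
        req = urllib.request.Request(
            analyze_url(ENDPOINT, MODEL_ID),
            data=f.read(),
            headers={"Ocp-Apim-Subscription-Key": API_KEY,
                     "Content-Type": "application/pdf"},
            method="POST",
        )
    with urllib.request.urlopen(req) as resp:          # expect 202 Accepted
        result_url = resp.headers["Operation-Location"]
    while True:
        time.sleep(5)  # the flow above used a one-minute delay; polling also works
        poll = urllib.request.Request(
            result_url, headers={"Ocp-Apim-Subscription-Key": API_KEY})
        with urllib.request.urlopen(poll) as resp:
            result = json.load(resp)
        if result["status"] in ("succeeded", "failed"):
            return result
```

The `Operation-Location` header read here is the same one the flow captures with the `outputs('Pass_PDF')` expression above.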
  • Next we need an “HTTP Request” action with the GET method to fetch the results from that URL. Again, the secret key from Form Recognizer should be passed.
  • We then parse the JSON returned by the GET “HTTP Request”.
  • Next, use a “Create CSV” action for the form fields and another “Create CSV” action for the table fields, then merge both outputs in a Compose action.
//Form fields
Construct Record: body('Parse_JSON')?['analyzeResult']?['documentResults']
Item: item()?['fields']?['InvoiceNo']?['text']


//Table fields
Construct Table: body('Parse_JSON')?['analyzeResult']?['documentResults']?[0]?['fields']?['Table']?['valueArray']
item: item()?['valueObject']?['Quantity']?['text']
  • The output should then be passed to a “Create file” action (OneDrive or SharePoint), sent directly to an email address, or written to a SharePoint list or Dataverse table.
Merge CSV and Create a file
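The two expressions above walk the same analyzeResult structure, once for the record fields and once for the table rows. The equivalent extraction in Python, run against a minimal hand-made sample of the v2.1 response shape (field names and values are illustrative), looks like this:

```python
# Minimal hand-made sample of the v2.1 analyzeResult shape used by the expressions above
result = {
    "analyzeResult": {
        "documentResults": [{
            "fields": {
                "InvoiceNo": {"text": "INV-001"},
                "Table": {"valueArray": [
                    {"valueObject": {"Description": {"text": "Widget"},
                                     "Quantity": {"text": "3"}}},
                    {"valueObject": {"Description": {"text": "Gadget"},
                                     "Quantity": {"text": "5"}}},
                ]},
            }
        }]
    }
}

doc = result["analyzeResult"]["documentResults"][0]

# Form fields: item()?['fields']?['InvoiceNo']?['text']
invoice_no = doc["fields"]["InvoiceNo"]["text"]

# Table fields: item()?['valueObject']?['Quantity']?['text']
rows = [(r["valueObject"]["Description"]["text"],
         r["valueObject"]["Quantity"]["text"])
        for r in doc["fields"]["Table"]["valueArray"]]

print(invoice_no)  # INV-001
print(rows)        # [('Widget', '3'), ('Gadget', '5')]
```

Each tagged form field sits under `fields` with a `text` value, while the table tag carries a `valueArray` of row objects; that is why the flow needs one “Create CSV” for the record and another for the table.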

There is heavy use of JSON manipulation via expressions here; we will create a separate post on working with expressions.

There is still more: we have done the same thing using Automation Anywhere integrated with Power Platform, and it will be posted in the near future.

Please post your queries in the comment section. Happy Building šŸ™‚
