Intelligent Document Processing - what to consider before you start
There is now a plethora of tools that will intelligently read semi-structured and even unstructured documents and turn them into data. The marketing and sales spiel might make you believe in the magic of AI, but it's just tech, and tech always needs a proper implementation. The purpose of this post is to point out some approaches you may find useful when implementing a document processing project and wrapping it in RPA for the data entry process. What this blog is not is a critique of any particular vendor's product. There is a good selection of appropriate tools, and selection depends on too many criteria to cover in a blog post such as this. In this post we'll use the example of invoices for our project. Invoices are a good use case to start with, as they are possibly the easiest use case for most vendors.
A word about the vendors and vendor fit.
Although I said I am not going to compare different technologies, it is worth pointing out that it may not actually be possible to pick one tool to suit all possibilities. The tools are all very different, and some are stronger in some domains or document types than others. Performing tests on all your possible use cases will take forever, so here is a thinking-outside-the-box suggestion. These days you can get annual contracts on the technology, and all of the products will be improving over time. So why not prove the ROI for a single use case (or similar ones, e.g. invoices and purchase orders) and avoid buying a contract sized to include all possible use cases (e.g. by number of pages)? I would bet you would have it implemented and be monetising the business benefits before you could select a solution for all departments in the company.
One other thing about ROI. The different vendor products will have different price ranges. Many have a cost per page in packs and one could focus on that and make a decision influenced by that alone. I would argue differently. The true ROI will be based on total cost of ownership and that needs to be reflected. For example:
You have the cost to train it and set it up, but you also have the cost of validating documents that aren't straight through processed, and there will be additional document training over the period, so ensure you include those, e.g.
Total cost = Cost of Pages Pack + Set up cost + Cost to validate documents + Cost for additional document training
Also bear in mind that first year ROI will be different from following years.
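As a rough illustration of that sum, here is a minimal Python sketch. The class name, the field names and every figure are assumptions made up for the example; they are not real vendor pricing.

```python
from dataclasses import dataclass


@dataclass
class YearCosts:
    """Hypothetical cost lines for one year of a document processing contract."""
    page_pack: float            # cost of the page pack(s) bought
    setup: float                # set up and initial training (mostly year one)
    validation: float           # staff time validating documents that were not straight through processed
    additional_training: float  # training new or changed document layouts

    def total(self) -> float:
        """Total cost of ownership for the year."""
        return self.page_pack + self.setup + self.validation + self.additional_training


def roi(benefit: float, costs: YearCosts) -> float:
    """Simple ROI: net benefit divided by total cost of ownership."""
    tco = costs.total()
    return (benefit - tco) / tco


# Illustrative numbers only.
year_one = YearCosts(page_pack=10_000, setup=15_000, validation=8_000, additional_training=2_000)
year_two = YearCosts(page_pack=10_000, setup=0, validation=4_000, additional_training=1_000)

print(f"Year 1 TCO: {year_one.total():,.0f}  ROI: {roi(60_000, year_one):.0%}")
print(f"Year 2 TCO: {year_two.total():,.0f}  ROI: {roi(60_000, year_two):.0%}")
```

With the same page pack price in both years, the made-up year two looks very different simply because set up has gone and validation effort has dropped, which is why the per-page price alone is a poor basis for a decision.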
To establish these, you will need to run a decent Proof of Concept or Pilot to understand the effort involved. Make sure you run the POC/Pilot yourself rather than leaving it to the vendor (or at least work with them along the way so you know everything).
Lastly on vendors, be careful about what is charged and what isn't. In many cases all pages processed carry a charge, even those used when setting up and training. Also, in some products classification uses up one page count and extraction then uses up another.
First principles - your samples and training
Training in batches
To start getting early success, you may choose to do the top 100 vendors first, then do the next 100 and so on. As long as you are able to route the invoices to either the as-is process or the bot, this can be a great approach. Trying to deliver the project for all (say) 1500 vendors in one go is a massive chore and can actually make things more difficult. What you may also find is that as you get into the medium vendors the system has learnt sufficiently to extract those vendors with little or no teaching. It doesn’t hurt to load each new vendor and see how it does anyway.
Training set
It cannot be emphasised enough that if you start wrong, it will take much longer. This is mainly dependent on the samples you use. Each product has a different amount of AI in it. Let's drop the AI term and call it what it is: it’s machine learning plain and simple, i.e. it only learns based on the quality and quantity of your samples. Some of the platforms have "less" ML in them, which isn't necessarily bad, but it means they are more rules based, so again samples are critical.
Vendors will claim you only have to give their tool a very small number of samples for it to intelligently learn. While this may be true, it is better to take it with a pinch of salt. There is no golden rule on it that I have managed to find. Even received invoices can vary massively from one customer to another, as it depends on how their suppliers produce their invoices. Personally, I view it as "sh** in sh** out". My suggestion is to approach it statistically: analyse the frequency of invoices from suppliers and build your sample most frequent first. This is a common sense approach focussing on ROI, as it doesn't really matter if a once a year supplier's invoice requires review/validation.
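The frequency analysis can be as simple as counting invoices per supplier and looking at the cumulative share. Here is a minimal sketch, assuming you can export one vendor name per invoice received; the vendor names, volumes and batch size below are invented for illustration.

```python
from collections import Counter


def rank_vendors(invoice_vendors):
    """Return (vendor, count, cumulative_share) tuples, most frequent vendor first."""
    counts = Counter(invoice_vendors)
    total = sum(counts.values())
    ranked = []
    running = 0
    for vendor, count in counts.most_common():
        running += count
        ranked.append((vendor, count, running / total))
    return ranked


def batches(ranked, size):
    """Split the ranked vendor list into training batches of `size` vendors."""
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


# Hypothetical history: one entry per invoice received, holding the vendor name.
history = ["Acme"] * 50 + ["Globex"] * 30 + ["Initech"] * 15 + ["Umbrella"] * 4 + ["Hooli"]

ranked = rank_vendors(history)
for vendor, count, share in ranked:
    print(f"{vendor:<10} {count:>3} invoices  cumulative {share:.0%}")

# The first batch to train: the two most frequent vendors in this toy data.
first_batch = batches(ranked, size=2)[0]
```

In this toy data the first two vendors already cover 80% of invoice volume, which is the kind of picture that tells you where to spend your sampling effort first.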
The other advice is don't just "chuck 'em in" and see what happens, there is a tedious bit that is really important. You need to read the documents from each supplier to see how their structure varies. What you want is variation, e.g. number of pages (of invoice data, not stuff like T's and C's), different layouts (for one vendor), even different labels for key fields. An example of the labels thing happened to me only the other day where my sample had a label of “Rechnung Nbr” for invoice number and test documents had “Rechnung KOPIE Nbr” so these went (correctly in this case) to validation/review, but you might have wanted them not to, so they need to be in the training set.
So far, I have found that 3-5 documents per vendor, with a good range of variation, is not a bad starting position for a training set, but you may have to add more.
Evaluation set
You need a larger evaluation set from the same vendors. Evaluation/Test sets are an important part of training and evaluating any Machine Learning model and some products actually ask for these, but some don't. I would advise that even if the product doesn't require an evaluation set, use one.
BTW, your evaluation set is not the UAT set. When you do UAT you will be testing with a new set.
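Once you have hand-picked your sets, a quick sanity check helps keep them honest. The sketch below assumes each set is a dictionary of vendor name to a list of file names; the function name, the threshold and the sample data are all hypothetical. It flags vendors with too few training documents, evaluation sets that are not larger than the training set, and documents reused across sets.

```python
def check_sample_sets(train, evaluation, uat, min_train=3):
    """Return a list of problems found in three {vendor: [file names]} dictionaries."""
    problems = []
    for vendor in sorted(set(train) | set(evaluation) | set(uat)):
        t = set(train.get(vendor, []))
        e = set(evaluation.get(vendor, []))
        u = set(uat.get(vendor, []))
        if len(t) < min_train:
            problems.append(f"{vendor}: only {len(t)} training documents")
        if len(e) <= len(t):
            problems.append(f"{vendor}: evaluation set is not larger than training set")
        # A document must live in exactly one of the three sets.
        reused = (t & e) | (t & u) | (e & u)
        if reused:
            problems.append(f"{vendor}: documents reused across sets: {sorted(reused)}")
    return problems


# Made-up file names for illustration.
train = {"Acme": ["a1", "a2", "a3"], "Globex": ["g1"]}
evaluation = {"Acme": ["a4", "a5", "a6", "a7"], "Globex": ["g1", "g2"]}
uat = {"Acme": ["a8"], "Globex": ["g3"]}

for problem in check_sample_sets(train, evaluation, uat):
    print(problem)
```

This only checks counts and overlap. It does not replace reading the documents to make sure the training set really contains the variation you need.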
Classification
Many systems offer document classification, some offer multiple methods i.e. you can classify in RPA or you can use Machine Learning to learn the classification. This is useful to confirm the document is an invoice before processing it.
Classification doesn't need to use ML, it is possible to build it in a bot with simple rules, however if your project is accepting multiple document types and then using different models to extract the data, then you may need ML based classification. That example is common in mail room scenarios where all incoming documents are scanned into a single file each day. The main message here really, is use classification if you need to, but if you know the documents are almost always invoices it isn't worth paying per page for ML based classification.
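To show how little a rules-based check can need, here is a minimal sketch. It assumes the bot already has text from the first page (for example from a cheap OCR pass), and the keyword lists are hypothetical and would need tuning for your own suppliers and languages.

```python
import re

# Hypothetical keyword rules; extend them for the languages your suppliers use.
INVOICE_PATTERNS = [
    re.compile(r"\binvoice\b", re.IGNORECASE),
    re.compile(r"\brechnung\b", re.IGNORECASE),
    re.compile(r"\bfacture\b", re.IGNORECASE),
]
# Document types that often mention invoices without being one.
NOT_INVOICE_PATTERNS = [
    re.compile(r"\bstatement of account\b", re.IGNORECASE),
    re.compile(r"\bremittance advice\b", re.IGNORECASE),
]


def looks_like_invoice(first_page_text: str) -> bool:
    """Cheap rule-based check run on text from the first page."""
    if any(p.search(first_page_text) for p in NOT_INVOICE_PATTERNS):
        return False
    return any(p.search(first_page_text) for p in INVOICE_PATTERNS)


print(looks_like_invoice("INVOICE No. 123"))                    # True
print(looks_like_invoice("Remittance advice for invoice 123"))  # False
```

Anything the rules reject can be routed to the as-is process or a human, so a few false rejections cost far less than paying per page to classify everything.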
Training and setting up
Most of the products have the same approach: you load your training set, you teach it where the fields are, and on some you then hit a train button so the ML can learn.
You then run your training set through to catch any errors where you may not have taught it properly. You iterate over that process until it is performing properly, i.e. it extracts where it can and, where it cannot, it passes the document to validation/review.
Once you have trained the training set, then you test the evaluation set. Don't be disappointed if the number of documents that require validation/review is much higher. This is good. It means there is more training to do, and your evaluation set is a good one. How you improve the training is specific to each product, but it may be that you need to add those documents to your training set and replace them with new ones in your evaluation set. It might also be that the product learns when you validate/review the documents, however it may take time to learn so try this and then re-run your evaluation set to check the results.
Rinse - Repeat. A lot!
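To keep track of that loop, it helps to measure each run the same way. A minimal sketch, assuming you can export a pass or fail per document from each run; the function names and document names are hypothetical, and whether you move failed documents into training depends on how your product learns.

```python
def straight_through_rate(results):
    """results maps document name -> True if it was extracted without review."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def promote_failures(results, training_set, evaluation_set):
    """Move documents that needed review from the evaluation set into training."""
    failed = [doc for doc, ok in results.items() if not ok]
    new_training = training_set + [d for d in failed if d not in training_set]
    new_evaluation = [d for d in evaluation_set if d not in failed]
    return new_training, new_evaluation


# Made-up results from one evaluation run.
run = {"e1": True, "e2": False, "e3": True, "e4": False}
print(f"Straight through: {straight_through_rate(run):.0%}")

training, evaluation = promote_failures(run, ["t1"], ["e1", "e2", "e3", "e4"])
print(training)    # ['t1', 'e2', 'e4']
print(evaluation)  # ['e1', 'e3'], so two fresh documents are needed to top it back up
```

Recording the rate after every iteration gives you a simple trend to show whether the extra training is actually paying off.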
Validation rules and scripting
Some products have sophisticated capabilities to evaluate the values extracted and perform all kinds of checks on them. This is often achieved with some inbuilt scripting in something like JavaScript or Python. Some products come with a complete set of rules built into a pre-trained model. In other cases, you write these rules in your RPA product. However you do this, keep it simple. It is possible in some products to perform database lookups and match/enhance the data before output. My personal view is that you should only create rules and implement scripting in order to reduce the number of documents sent to review/validation by a human. If you overcomplicate it, more and more documents will get sent to a human for review.
Many times, I have been told that the required field list is more than the fields required by the accounting system, so pare these back to what you really need. Also consider having a few optional fields that may help with the data entry process. As an example, the Vendor name in the invoice may not be the same as that in the accounting system, so you could grab an optional field in order to help find the vendor in the ERP system (if it can be searched with that field) e.g. VAT number, post code etc.
Be careful of false positives. Imagine we want to capture currency and we set a validation rule that it must be one of "EUR", "USD", "GBP" etc. We will get some invoices where there is no currency, or where it is only a symbol. We could write a rule to translate a symbol found in another field, e.g. Total, to the currency. This will probably be good. We could also write a rule that checks the vendor address and sets the currency based on it. This may not be good, because vendors don't always invoice in the currency of their country of origin. The bottom line is that if you write scripts that change data, check them thoroughly and, if in doubt, let the system send the document to validation/review so a human can decide.
The big no-no for me is overdoing scripting, e.g. doing database lookups on product codes when that is not needed for validation checks. Architecturally, this kind of job is often better done in RPA after extraction.
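The currency example can be sketched like this. It is a minimal illustration, not any product's scripting API: the function name, the allowed list and the symbol table are assumptions. Note that "$" is deliberately left out of the symbol table because it is used by several currencies, so those documents go to a human.

```python
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}
# Only symbols that map to exactly one allowed currency. "$" is left out on
# purpose: it is ambiguous, so those documents go to review instead.
SYMBOL_TO_CURRENCY = {"€": "EUR", "£": "GBP"}


def resolve_currency(currency_field, total_field):
    """Return (currency, needs_review) from the extracted Currency and Total fields."""
    value = (currency_field or "").strip().upper()
    if value in ALLOWED_CURRENCIES:
        return value, False
    # Fall back to an unambiguous symbol found in either field.
    searchable = f"{currency_field or ''} {total_field or ''}"
    matches = {code for symbol, code in SYMBOL_TO_CURRENCY.items() if symbol in searchable}
    if len(matches) == 1:
        return matches.pop(), False
    # No currency, an ambiguous symbol, or conflicting symbols: let a human decide.
    return None, True


print(resolve_currency("eur", "100.00"))     # ('EUR', False)
print(resolve_currency(None, "€ 1.200,00"))  # ('EUR', False)
print(resolve_currency("$", "$100.00"))      # (None, True)
```

The rule only ever fills in a value when there is exactly one plausible answer, which is the "if in doubt, send it to review" principle in code.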
One last point: some systems will validate things like dates as dates regardless of the format and then output the text as it is in the document. For this reason, especially if outputting to CSV, I always look at the output in a text editor, not Excel. Excel will transform the display of the data, and this can catch you out. I prefer to transform the various date formats into the one required by the ERP system in RPA post-processing. I was caught the other day in an RPA project where the raw data suddenly started having leading zeros on an invoice number and the accounting system was rejecting them. Because I kept looking at the data in Excel, I missed it for a bit.
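Both points can be handled in post-processing. The sketch below is a minimal example under stated assumptions: the list of date formats, the column names and the ERP target format are hypothetical, and all the formats are assumed to be day-first. A vendor that writes month-first dates would need its own list.

```python
import csv
import io
from datetime import datetime

# Hypothetical formats seen in the extracted data, all assumed to be day-first.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]


def to_erp_date(raw, target="%Y-%m-%d"):
    """Return the date in the ERP format, or None so the document goes to review."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime(target)
        except ValueError:
            continue
    return None


# csv.DictReader keeps every value as text, so leading zeros survive.
raw_csv = "invoice_number,invoice_date\n000123,31.01.2023\n"
for row in csv.DictReader(io.StringIO(raw_csv)):
    print(repr(row["invoice_number"]), to_erp_date(row["invoice_date"]))
```

Printing with `repr` shows the invoice number exactly as it sits in the file, which is the view Excel hides from you.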
Architecting your RPA
There are some things to think about when building the RPA to drive the Document Processing product and complete the data entry task. It is worth bearing in mind that this process is asynchronous.
Before and after document processing
Try to think about this in steps such as the following (you don't have to have all these)
The key here is deciding whether you need Pre-processing and Data Enhancement/Enrichment, and how you drive the validation/review flow. Take some time to flesh these out.
Pre-processing can be important. One example is you get a ton of poor-quality invoices from one vendor and the OCR doesn't read the documents well. Can you enhance these before submission using graphical libraries? Another example is page splitting/invoice splitting. Do some vendors send one file with multiple invoices in it? A lot of vendors send invoices with pages of T's and C's in each invoice and you really don't want to pay to process those pages and they can actually have a detrimental effect on your document training. It isn't too hard to build RPA page splitters or bots that remove T's and C's.
Another reason for pre-processing can be to avoid using paid classification to weed out the odd document that isn't an invoice. Why pay per page on all documents to do this, when it may be possible to catch these before submission with some fairly simple Bot logic.
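The page-filtering part of a pre-processing bot can be quite small. This is a minimal sketch, assuming you already have the text of each page from a PDF or OCR library of your choice; the marker phrases and the 200 character window are hypothetical and would need tuning per vendor.

```python
def pages_to_submit(page_texts, markers=("terms and conditions", "terms & conditions"), window=200):
    """Return the zero-based indexes of pages worth sending for extraction."""
    keep = []
    for index, text in enumerate(page_texts):
        # Collapse whitespace and look only at the top of the page, because real
        # invoice pages often mention the T's and C's in a footer.
        head = " ".join(text.split())[:window].lower()
        if not head:
            continue  # blank page
        if any(marker in head for marker in markers):
            continue  # page headed as T's and C's
        keep.append(index)
    return keep


pages = [
    "Invoice 123\nTotal 100.00",
    "TERMS AND CONDITIONS\n1. Payment is due within 30 days",
    "   ",
]
print(pages_to_submit(pages))  # [0]
```

The returned indexes can then drive whatever page splitter you use, so only the pages that carry invoice data are submitted and charged for.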
Overall process
The key here is that the process is asynchronous. Any single document may not be extracted immediately, as it is in a queue, and a document might be with a business user for validation, so the data may not arrive for a few days if the validator is out of the office. Therefore, it is important to design the RPA to take this into account. You may also wish to design this process using the RPA platform's work item queues/transactions for traceability. Some platforms offer mini workflow capabilities where you can assign tasks to users, who then hand the document back to a bot when they have finished the review. You might even have some kind of BPM system you want to use for this part of the process. In many cases an email with links to the validation/review screen is just as effective.
The point here is that in your design you need to allow for human-in-the-loop, and you need to plan for data not coming out of the document extraction product immediately. Additionally, you need to plan for a minimum set of outcomes such as: Processed; Needs review; Reviewed and corrected; Reviewed and rejected (e.g. not an invoice); Rejected as too few fields extracted, so it needs retraining; Not an invoice; Unreadable.
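One way to picture this is a bot that makes a polling pass on each run rather than waiting for any one document. The sketch below is only an illustration: `get_outcome` stands in for whatever status call your extraction product offers, and the next steps are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    PROCESSED = "processed"
    NEEDS_REVIEW = "needs review"
    REVIEWED_CORRECTED = "reviewed and corrected"
    REVIEWED_REJECTED = "reviewed and rejected"
    NEEDS_RETRAINING = "too few fields extracted"
    NOT_AN_INVOICE = "not an invoice"
    UNREADABLE = "unreadable"


# Hypothetical next step for each finished outcome.
NEXT_STEP = {
    Outcome.PROCESSED: "enter data into ERP",
    Outcome.REVIEWED_CORRECTED: "enter data into ERP",
    Outcome.REVIEWED_REJECTED: "route to as-is process",
    Outcome.NEEDS_RETRAINING: "add to training set and process manually",
    Outcome.NOT_AN_INVOICE: "route to as-is process",
    Outcome.UNREADABLE: "request a better copy or process manually",
}


def polling_pass(pending_ids, get_outcome):
    """Check each pending document once; return (actions, still_pending)."""
    actions, still_pending = {}, []
    for doc_id in pending_ids:
        outcome = get_outcome(doc_id)  # hypothetical call into the extraction product
        if outcome is None or outcome is Outcome.NEEDS_REVIEW:
            still_pending.append(doc_id)  # queued or with a human: try again next run
        else:
            actions[doc_id] = NEXT_STEP[outcome]
    return actions, still_pending


# A fake status source standing in for the real product.
statuses = {"inv-1": Outcome.PROCESSED, "inv-2": None,
            "inv-3": Outcome.NEEDS_REVIEW, "inv-4": Outcome.UNREADABLE}
actions, waiting = polling_pass(list(statuses), statuses.get)
print(actions)
print(waiting)  # ['inv-2', 'inv-3']
```

Because nothing blocks, a document that sits with a validator for three days simply stays in the pending list, and the same list answers the question of where any given invoice currently is.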
The ideal position should be that the system can be set up for a good initial set of vendors and then the business users can manage it from then on, by validating and reviewing but also training documents that need it. This needs a workflow and it also needs decent UI from the vendors where it is easy to train new documents (if they don't process well) and validate quickly.
An important point for me is that the business users are able to ask “where is invoice X?” and very easily see where it is in the process. This will be achieved by creating good reports, perhaps with help from the RPA platform's transaction system, while ensuring users can find the files and do something with them, e.g. manually process a single invoice. The bottom line is that no Document Processing project will give you 100% straight through processing, so the manual interventions need to be planned for.
Finishing off
I hope that this has been helpful. It is, admittedly, high level, but my personal experience is that if you approach it in the right way, with the right methodology, the project will be a great success. If you don’t, I have seen cases where the project had to be started all over again. The critical thing is to allow for lots of iteration and testing of output, and to perform a POC/Pilot first to nail down ROI and find any challenges before you pay for the product.
VP, Community, Learning & Dev Relations @Automation Anywhere and Lover of all things #Automation #GenerativeAI, and #DocumentAutomation
Another solid article Simon Frank. One thing many organizations overlook in doing document processing is that, as you mentioned, no solution will be perfect. I look at document processing as a 6-step process: image intake, image enhancement, classification, data extraction (based on classification), data validation, and delivery. With both the classification step and the data extraction step, the ability for manual intervention has to be considered, as documents may come in that the system has never seen, or images that frankly don’t belong in this process. As much as choosing a vendor based on OCR capabilities is important, I’d argue it’s just as important to make sure the vendor of choice has a workflow/UI to handle the exception processing without necessarily having to hold up the entire batch of documents, not least because the OCR capabilities from vendor to vendor aren’t terribly different. I have some upcoming content in this area as well that will help demonstrate what I mean. Great read and great perspectives though.