Mining Content for Actionable Insights with Intelligent Automation

Tom C.

AI & Automation Hacker

Published Dec 2, 2020

Over recent years there has been a lot of excitement about how artificial intelligence is changing the world we live in. Often branded as the 4th Industrial Revolution, “AI” is being applied all around us, with algorithms used to recommend TV shows, personalise advertising, optimise transport systems or predict the spread of contagious diseases. When applied, these algorithms can process vast amounts of data and uncover insights that massively accelerate business results.

However, these “Artificial Intelligence” use cases rely on having deep pools of perfectly ordered, high quality data to work with. With estimates that up to 80% of corporate data is unstructured, in the form of natural language content contained in reports, contracts, images, policies, medical records, emails, chat transcripts, annual reports and so on, the potential value and insights that are contained in these types of unstructured content largely remain untapped, representing a huge opportunity for businesses.

The Challenge of Unstructured Content

Traditionally, working with these types of content or documents has required human brain power and specialist skills or training. Examples include employing financial analysts to extract and normalise market data from company financial statements, or legal secretaries to assemble case notes and historical legal rulings. A large share of the time and cost of these highly skilled individuals is spent on repetitive manual tasks: reading through documents to find relevant information and organising it into a usable form.

This article sets out 8 steps to unlock the value stored in these “big content” assets, by applying Intelligent Automation techniques, which can benefit business by:

Improving End-2-End Process Efficiency
Saving Employee Time/Cost
Increasing the Accuracy and Consistency of Outcomes

These steps or phases provide a high level blueprint for applying automation and machine learning to mine usable data and insights from unstructured content, gleaned from the author’s experience working on large scale Intelligent Automation, OCR (Optical Character Recognition) and RPA (Robotic Process Automation) projects at some of the world’s leading companies.

The following sections provide a brief overview of the activities and goals of each step, along with some tips and best practices to help improve the accuracy of your automation efforts.

Analysis
Business Case
Content Sourcing
Separation and Cleaning
Classification
Data Location and Extraction
Output Formatting
Evaluate and Optimise

Analysis

What are the sources and types of content you want to mine for information? How are they currently used and what is the data you need to extract from them?

Review the available content, sources where the content comes from and integration options
What formats will the content be in (emails, social posts, web pages, PDF documents etc.)?
Identify the SMEs / teams who understand the content, and can interpret it to train the AI
Define desired categories into which content or parts of the content should be sorted
Document the data points and KPIs that need to be extracted from each category
Is enough content available to train and test the machine learning? (Typically needs 100-200[1] examples per category / data point for training, and a similar size set for testing)
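
The sample-size check in the last point can be sketched in a few lines. This is a minimal illustration, assuming you already have one label per collected sample; the function name, the label values and the choice of 100 as the minimum are assumptions for the example, not part of any product.

```python
from collections import Counter

# Lower bound of the 100-200 guideline above (an assumption; tune per use case).
MIN_SAMPLES_PER_CATEGORY = 100


def categories_needing_samples(labels, minimum=MIN_SAMPLES_PER_CATEGORY):
    """Return {category: shortfall} for categories with too few labelled samples."""
    counts = Counter(labels)
    return {cat: minimum - n for cat, n in counts.items() if n < minimum}


# Hypothetical labels, one per collected sample document.
labels = ["annual_report"] * 150 + ["contract"] * 40
shortfall = categories_needing_samples(labels)
print(shortfall)  # {'contract': 60}
```

Running the same check on the testing set tells you whether you can hold back a similar-sized set for evaluation.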

Tip: Provide user friendly tools to allow your subject matter experts to document the data they need from their content, for example via a drag and drop user interface that lets them upload documents and highlight the data points they are looking for.

With Kofax Intelligent Automation use TotalAgility’s Quick Capture functionality to let business users “self-serve” defining the content categories and data points they need for their use case.

Baseline the Cost and Effort for the Business Case

Conduct a time and motion study of the current process and costs for working with this content in your business. This should identify where most of the effort/expense is going in your “as-is” workflow, for example:

Manually collecting the content to be analysed from different systems
Reading through large volumes of content to find relevant items
Reading through the content to find the relevant sections of the document
Copying data out of one source into the system to do the analysis

It may be that automating just one aspect of the process will provide a return on investment by itself, in which case take this as the starting point to focus on.

If this is a net new use case, what are the insights you need and how will they be used to create business benefit or return on investment? What level of investment is justified to realise these gains?

Tip: allow business users to document their activities and processes using their preferred, user friendly tools such as Microsoft Visio. Alternatively, consider applying automated process discovery tools that track the activities that individuals or teams are doing to complete a task and that can automatically uncover “hotspots” of repetitive, labour intensive activity. Use this data to create a baseline cost for the “as-is” process, which should guide where returns will be achieved on the project and to measure eventual ROI.
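
As a rough sketch of the baseline arithmetic, the cost of each “as-is” activity can be estimated from the time per item, the yearly volume and an hourly rate. All figures and activity names below are made up for illustration; substitute the numbers from your own time and motion study.

```python
def baseline_annual_cost(activities, hourly_rate):
    """activities: {name: (minutes_per_item, items_per_year)} -> {name: annual cost}."""
    return {
        name: minutes * items * hourly_rate / 60
        for name, (minutes, items) in activities.items()
    }


# Hypothetical measurements from a time and motion study.
activities = {
    "collect content": (5, 10_000),
    "find relevant items": (12, 10_000),
    "copy data into system": (3, 10_000),
}
costs = baseline_annual_cost(activities, hourly_rate=40)
hotspot = max(costs, key=costs.get)  # the most expensive activity

for name, cost in costs.items():
    print(f"{name}: {cost:,.0f}")
print("Start with:", hotspot)  # Start with: find relevant items
```

The most expensive activity is a natural candidate for the first automation, and the total gives you the figure to measure eventual ROI against.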

With Kofax Intelligent Automation consider using Visio imports into TotalAgility and/or Kofax RPA’s Process Discovery feature to automatically build a picture of how your team members work with content and which activities consume the most time.

Content Sourcing

The first automation challenge is getting the content into the process for analysis. IDC found that data professionals spend an average of 67% of their time on activities like searching for and preparing data. This problem can be magnified further when dealing with unstructured content, which is often found in multiple locations and formats.

Luckily much of the effort to fetch and load content can be automated, regardless of whether the data is locked in your document archive or scattered across the web.

Tip: Robotic Process Automation (RPA) has long been used to pipeline data out of hard-to-reach legacy systems, making it the perfect tool for sourcing content for analysis. The main benefit of using RPA is flexibility - whether your data is locked into an ancient system without the right APIs to reach it or scattered across the web, RPA can automate its retrieval. Use RPA robots to accelerate the retrieval and ingestion of your content.

With Kofax Intelligent Automation be sure to check the Kofax SmartHub for prebuilt Kofax RPA robots or connectors for many common content sourcing use cases, or to access TotalAgility connectors to orchestrate robots from any leading RPA vendor if you don’t use Kofax RPA.

Separation and Cleaning

Once you have the content, the next step is to remove any noise so that you can focus in on the aspects of the content relevant for your use case. Depending on the content you may do this as part of the sourcing process or as separate steps.

If your content is image based (e.g. scanned copies of paper documents), make sure you test with different OCR engines and configurations to identify what gives you the most accurate conversion to text. A good Intelligent Automation platform will include a range of OCR engines, allowing the system to learn which is the most effective one for converting the type of content you are working with.

Most likely you will want to remove any navigation boilerplate if you are using web pages, or strip out blank pages from PDF/scanned documents. This reduces the amount of content to work with later by getting rid of irrelevant information, in turn lowering processing costs.

A good best practice here is to standardise the format of the content before the next phases of the journey, especially if you are working with content from multiple sources.
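
A minimal sketch of this kind of cleaning, assuming the content has already been converted to one text string per page. The function names and the 20-character threshold for a “blank” page are illustrative assumptions.

```python
import re
import unicodedata


def normalise_text(text):
    """Normalise Unicode forms, unify line endings and collapse runs of spaces."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.split("\n")).strip()


def drop_blank_pages(pages, min_chars=20):
    """Discard pages with fewer than min_chars non-whitespace characters."""
    return [p for p in pages if len(re.sub(r"\s", "", p)) >= min_chars]


# Hypothetical page texts: one real page, one blank page, one stray page number.
pages = ["Annual  Report\r\n2020   overview of the business", " \n \n", "3"]
cleaned = [normalise_text(p) for p in drop_blank_pages(pages)]
print(cleaned)  # ['Annual Report\n2020 overview of the business']
```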

Tip: Breaking down content into manageable blocks should also be done at this stage, especially if you are dealing with very complex, information dense documents.

For example if you wanted to analyse annual reports that can run into hundreds of pages, there might be numerous different data points you’d like to extract and analyse about number of employees, share structures, governance policies, executive compensation and so on. Processing and extracting this data in one go would be very complex, but breaking the problem down into smaller pieces, for example by splitting the report into individual paragraphs, massively simplifies the subsequent steps.
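
Splitting into paragraphs can be as simple as breaking on blank lines. The sketch below assumes plain text in which paragraphs are separated by one or more empty lines; real reports often need layout-aware splitting, and the sample text is invented.

```python
import re


def split_paragraphs(text):
    """Split on blank lines and return non-empty paragraphs with tidy spacing."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


# Hypothetical extract from a report.
report = (
    "Employees\nWe employed 1,200 people.\n\n\n"
    "Governance\nThe board met six times."
)
paragraphs = split_paragraphs(report)
for p in paragraphs:
    print(p)
# Employees We employed 1,200 people.
# Governance The board met six times.
```

Each paragraph can then be classified and mined on its own in the following steps.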

With Kofax Intelligent Automation use separation profiles, rules and robots to break content into constituent parts. Apply image conversion to standardise content from multiple formats and use image processing to reduce visual noise and enhance the quality of training samples.

Classification

Next organise the documents or content parts into categories. Which categories to use should have been identified in the Analysis phase, but you may want to refine these now or create sub-classes as you explore real world content.

This is done by training your classification algorithm with examples of each content type.

This step may be one you run multiple times at different levels, first to categorise the type of content, then to categorise the individual parts of that content. For example, a document could be an annual report or legal contract. Once you classify the type of document, you may want to classify the paragraphs in that content separately to highlight the key data you need, for example to locate a paragraph that contains the number of employees in an annual report or the counterparties in a contract. How you approach the second round of classification will be determined by the type of content you identified in the first round.

Tip: The key to successful classification of natural language content is the training set you use and the business rules that need to be applied.

Your subject matter experts know this content best - provide them with user friendly tools to guide them through the process of selecting good quality samples that represent their desired categories. As per the above, remove any noise from the samples you use to create the training set, otherwise this could confuse the algorithm.

Once you have a training set, test how well it can predict a match to the desired category using a separate testing set of content samples. If your target categories use very similar wording, for example, the difference between accepting or rejecting liability in an insurance claim may be down to a couple of words, the algorithm may struggle to confidently predict the correct category. In this case, add instructions or business rules to fine tune the category that gets assigned.
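
To make the train, test and fine-tune loop concrete, here is a small sketch using a pure-Python Naive Bayes classifier as a stand-in for whatever algorithm your platform provides. The training and testing samples, the category names and the business rule are all invented for illustration, and a real project would use far more samples per category.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, samples):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        for text, label in samples:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in tokens(text):
                if w in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def apply_rules(text, predicted):
    """Hypothetical business rule: an explicit denial always means 'reject'."""
    if re.search(r"\b(do|does) not accept\b", text.lower()):
        return "reject"
    return predicted


# Invented samples; keep the testing set separate from the training set.
train = [
    ("we accept liability for the damage", "accept"),
    ("the insurer accepts full liability", "accept"),
    ("we do not accept liability for the damage", "reject"),
    ("liability is denied by the insurer", "reject"),
]
test = [
    ("we accept liability", "accept"),
    ("we do not accept liability for this claim", "reject"),
]

model = NaiveBayes().fit(train)
correct = sum(apply_rules(t, model.predict(t)) == label for t, label in test)
accuracy = correct / len(test)
print(f"accuracy on the testing set: {accuracy:.0%}")
```

The rule runs after the statistical prediction, so it only overrides the algorithm in the narrow cases your experts have identified as decisive.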

With Kofax Intelligent Automation use the clustering tool to speed up sorting your samples into different groups. Clustering uses unsupervised machine learning to find groupings of similar content, which a human expert then assigns to the right category to create a clean training set.

Data Location and Extraction

Now we are ready to squeeze the data out of your content. In the analysis step the important data points needed will have been identified. For each data point in a category, identify the best combination of techniques to extract the data. Here one size does not fit all – sometimes trainable, learn by example techniques work best; other times rules or formats are best to find the content you need (e.g. looking for a pattern for account numbers or case references).
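
A format-based locator can be sketched with regular expressions. The two patterns below are hypothetical formats invented for this example; replace them with the formats your own account numbers and case references follow.

```python
import re

# Hypothetical formats: account numbers like "GB-12345678",
# case references like "CASE/2020/0042".
PATTERNS = {
    "account_number": re.compile(r"\b[A-Z]{2}-\d{8}\b"),
    "case_reference": re.compile(r"\bCASE/\d{4}/\d{4}\b"),
}


def locate_by_format(text):
    """Return {field: [matches]} for every pattern that matches the text."""
    found = {name: rx.findall(text) for name, rx in PATTERNS.items()}
    return {name: hits for name, hits in found.items() if hits}


text = "Re CASE/2020/0042: please debit account GB-12345678 today."
located = locate_by_format(text)
print(located)
# {'account_number': ['GB-12345678'], 'case_reference': ['CASE/2020/0042']}
```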

Don’t underestimate the value of using existing reference data here to support your chosen extraction methods – if the machine learning is not confident it has found a match for a customer by itself, cross referencing against a customer database can help remove the doubt.
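
One simple way to cross-reference an uncertain value is fuzzy matching against the reference list. This sketch uses Python's standard `difflib` module; the customer names, the function name and the 0.8 similarity cutoff are assumptions for illustration.

```python
import difflib


def match_customer(candidate, customers, cutoff=0.8):
    """Return the closest known customer name, or None if nothing is similar enough."""
    lookup = {name.lower(): name for name in customers}
    hits = difflib.get_close_matches(
        candidate.lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


# Hypothetical reference data.
customers = ["Acme Holdings Ltd", "Northwind Traders", "Contoso Insurance"]

# "Holdlngs" simulates an OCR misread of "Holdings".
print(match_customer("Acme Holdlngs Ltd", customers))  # Acme Holdings Ltd
print(match_customer("Unknown Corp", customers))       # None
```

A value with no close match in the reference data is a good candidate for human review rather than automatic acceptance.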

Tip: don’t just rely on one method to locate the data points you want. A good automation platform will allow multiple location and extraction techniques to run in parallel, then let you train an evaluator to select the best result from the alternatives automatically.
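
The idea of running several locators and picking the best result can be sketched as follows. Here the “evaluator” simply takes the candidate with the highest confidence, which is a simplification of a trained evaluator; the invoice number format, the locator names and the fixed confidence values are all assumptions for the example.

```python
import re


def regex_locator(text):
    """Format-based locator: high confidence when the strict pattern matches."""
    m = re.search(r"\bINV-\d{6}\b", text)
    return (m.group(0), 0.95) if m else (None, 0.0)


def keyword_locator(text):
    """Looser locator: takes the token that follows the word 'invoice'."""
    m = re.search(r"invoice\s+(?:number\s*)?[:#]?\s*(\S+)", text, re.IGNORECASE)
    return (m.group(1).rstrip(".,;"), 0.60) if m else (None, 0.0)


LOCATORS = {"regex": regex_locator, "keyword": keyword_locator}


def best_candidate(text, locators=LOCATORS):
    """Run every locator and return (locator_name, value, confidence) or None."""
    results = [(name, *fn(text)) for name, fn in locators.items()]
    results = [r for r in results if r[1] is not None]
    return max(results, key=lambda r: r[2], default=None)


print(best_candidate("Invoice number: INV-004217 is overdue."))
# ('regex', 'INV-004217', 0.95)
print(best_candidate("Please pay invoice 88213 by Friday."))
# ('keyword', '88213', 0.6)
```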

With Kofax Intelligent Automation the Transformation Designer includes a wide range of different techniques to locate and extract data fields from unstructured content, packaged in our low code, easy to use toolset. Bundled techniques include natural language processing, format locators, table locators, group locators to find collections of values occurring together, named entity locators, scripts, sentiments, barcode readers, address locators and much more!

Formatting Output

Once you have located and extracted the data points you need from your content, you’ll want to package these into a format that can be used by your downstream business process. This may mean displaying the data in forms or screens for your experts to action, feeding into APIs to update other key systems and/or storing the data in a machine readable format in your data lake or document archive for other search/analysis/machine learning use cases.
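
As a small sketch of packaging extracted data for downstream use, the snippet below writes the same record as JSON and as a flattened CSV row. The record layout, field names and identifier are hypothetical; your own schema will depend on what the downstream systems expect.

```python
import csv
import io
import json

# Hypothetical extraction result for one document.
record = {
    "document_id": "doc-001",
    "category": "annual_report",
    "fields": {"number_of_employees": "1200", "financial_year": "2020"},
}


def to_json(rec):
    return json.dumps(rec, indent=2, sort_keys=True)


def to_csv(records):
    """Flatten each record's fields into one CSV row per document."""
    names = sorted({k for r in records for k in r["fields"]})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["document_id", "category", *names])
    writer.writeheader()
    for r in records:
        writer.writerow(
            {"document_id": r["document_id"], "category": r["category"], **r["fields"]}
        )
    return buf.getvalue()


print(to_json(record))
print(to_csv([record]))
```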

Tip: It is often beneficial to create multiple renditions of the content – for example, full copies of the raw content headers and all, PDF print views and plain text. This gives you an audit trail of the content that gave a certain result, as well as building up a historical archive. Even if you don’t use all the renditions now, with the low cost of cloud storage you can collect content that might be valuable for driving future use cases.

With Kofax Intelligent Automation use document conversion profiles to create renditions of the raw content easily, for example to archive PDFs of the original content.

In addition to providing the extracted data in any desired format (CSV, JSON, XML…) for your downstream systems, all the information that is generated from each step of processing is collected and stored in a single representation of the content called an XDocument (XDoc). The XDoc includes low level recognition and OCR data as well as classification confidences and extracted data. The XDoc can be converted to XML or JSON data for storage in your data lake for future use.

Evaluate and Optimise

Once you have designed and configured your data mining project using the steps above, this creates an automation baseline. While the process will now automatically locate and extract most data from your content, there will always be cases when the system is not confident it has located the data correctly.

This is when your business process should combine human intelligence with the machine learning. By providing a user interface to validate and correct the results from the automation process, the algorithms can continue to learn and improve based on the new samples and human corrections they are getting in the live process.

In the early stages of a project, have people review a large number of results, even if the algorithm is relatively confident it has it right. This is partly to ensure the accuracy of the system’s output but also to help your employees gain confidence in the AI.

As time progresses, and the system learns from operators’ corrections and feedback, the confidence level at which humans need to be brought in can be adjusted to reflect the improved accuracy and trust. This allows more of your content to be processed “straight through” without needing human intervention. At this point the return on investment from your automation project really accelerates!
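
The confidence level described here can be thought of as a simple routing threshold. The sketch below is illustrative only; the result records and the two threshold values are assumptions, not recommended settings.

```python
def route(results, threshold):
    """Split results into straight-through and human-review queues by confidence."""
    auto, review = [], []
    for item in results:
        (auto if item["confidence"] >= threshold else review).append(item)
    return auto, review


# Hypothetical extraction results with the system's confidence in each.
results = [
    {"id": "doc-1", "confidence": 0.97},
    {"id": "doc-2", "confidence": 0.88},
    {"id": "doc-3", "confidence": 0.55},
]

# Early in the project: a strict threshold sends most results to people.
early_auto, early_review = route(results, threshold=0.95)
# Later, as accuracy and trust improve: more content runs straight through.
later_auto, later_review = route(results, threshold=0.80)

print(len(early_auto), len(early_review))  # 1 2
print(len(later_auto), len(later_review))  # 2 1
```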

Tip: Maintain a separate set of test data which is used to benchmark the accuracy of classification and extraction results. Regularly benchmark the system using this test data to ensure the system avoids “false positive” results, where the algorithm is confident it is correct, but in fact the result it has found is incorrect. The benchmark will inform the confidence level at which a human operator should be involved, with a goal of completely avoiding false positives.
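
One way to let the benchmark inform the threshold is to find the lowest confidence level at which no accepted benchmark result is wrong. This is a minimal sketch with invented benchmark data; each entry pairs the system's confidence with whether the result was actually correct.

```python
def false_positives(benchmark, threshold):
    """Count results accepted at this threshold that are actually wrong."""
    return sum(1 for conf, correct in benchmark if conf >= threshold and not correct)


def lowest_safe_threshold(benchmark):
    """Smallest observed confidence at which no accepted benchmark result is wrong.

    Returns None if even the most confident result is incorrect.
    """
    for t in sorted({conf for conf, _ in benchmark}):
        if false_positives(benchmark, t) == 0:
            return t
    return None


# Hypothetical benchmark results: (confidence, was_the_result_correct).
benchmark = [(0.99, True), (0.93, True), (0.90, False), (0.85, True), (0.60, False)]

print(false_positives(benchmark, 0.85))   # 1
print(lowest_safe_threshold(benchmark))   # 0.93
```

Results below the chosen threshold go to a human operator; rerunning the benchmark regularly shows whether the threshold can be lowered safely.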

With Kofax Intelligent Automation orchestration processes combine human operator steps with automation robots and AI algorithms for content classification and data extraction, providing a complete platform to model and manage your end to end content mining process.

Next Steps

In this article we have covered the key steps in mining unstructured and natural language content to locate and extract business critical data to accelerate and automate business processes.

Intelligent Automation platforms provide an ideal way to implement this type of process as they bring together OCR, machine learning, natural language processing, automation robots and business process management.

In this way Intelligent Automation can provide an end2end solution for mining data from content or provide a complementary service to enhance other business processes or machine learning projects, by automating some of the operational aspects of these programmes.

[1] The number of training samples needed will depend on the type of content you are working with and how you want to classify or extract data from it. Typically, natural language content, where there is a lot of variety in wording in the samples, will require more samples than visually categorising content based on layout.
