Revolutionizing Compliance with AI

How can Artificial Intelligence be leveraged to detect irregularities and prevent fraud? To what extent can Large Language Models, such as BERT and GPT, assist compliance professionals in their day-to-day activities? This article explores how artificial intelligence, text mining, and natural language processing are used to tackle compliance challenges, detect fraud through uncovering code words and entities, and optimize audit work with early data assessment.

Complying with a variety of local, national, and international regulations covering anti-bribery, anti-terrorism, anti-money laundering, privacy, anti-trust, ESG, HR and many other areas is a huge burden for many companies, not only in internal and external cost, but also in terms of management attention.

Failure to comply can have enormous effects, including reputational damage, penalties, sanctions, the cost of (serial) litigation, management distraction, exclusion from future revenue opportunities and serious implications for the bottom line.

These days, almost all regulators use some form of Artificial Intelligence (AI) for large-scale data analysis when they investigate irregularities.

To auditors, the field of data mining is better known than that of text mining. A good example of data mining is the analysis of financial transactions. A wealth of algorithms and analytical methods are available to find patterns of interest or fraudulent behavior in such data sets.

However, an estimated 90% of all information is unstructured: text documents, e-mails, social media, or multimedia files. This information cannot be analyzed with database or data-mining techniques, as data-mining tools work only on structured data in the form of rows and columns as used in databases.

In addition, fraudsters are more and more knowledgeable on how audit and compliance algorithms work, so they tend to make sure that the transactional aspects of their actions do not appear as anomalies to such algorithms. The details of what is really going on can often only be found in auxiliary information such as email, text messages, WhatsApp, legal agreements, voice mails or discussions in a forum.

In today's world, auditors, compliance officers and fraud investigators face an overwhelming amount of digital information that can be reviewed. In most cases, they do not know beforehand what exactly they are looking for, nor where to find it.

In addition, individuals or groups may use different forms of deception to hide their behavior and intentions, ranging from complex digital formats and rare languages to code words. Effectively, this means fraud investigators are looking for a needle in a haystack without knowing what the needle looks like or where the haystack is.

Using technology is essential to address the high strategic ambitions auditors and fraud investigators have regarding such large digital data collections. The main problem with using such technology is finding the balance between identifying what is suspicious and identifying too many false positives, which would create too much work for auditors or victimize innocent individuals.
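The balance described above is usually expressed as precision (how many flagged items are truly suspicious) versus recall (how many truly suspicious items are flagged). The following is a minimal sketch of that trade-off; the class name, the two alert settings and all numbers are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    true_positives: int   # flagged and actually suspicious
    false_positives: int  # flagged but innocent
    false_negatives: int  # suspicious but missed


def precision(outcome: ReviewOutcome) -> float:
    """Share of flagged items that are truly suspicious."""
    flagged = outcome.true_positives + outcome.false_positives
    return outcome.true_positives / flagged if flagged else 0.0


def recall(outcome: ReviewOutcome) -> float:
    """Share of truly suspicious items that were flagged."""
    relevant = outcome.true_positives + outcome.false_negatives
    return outcome.true_positives / relevant if relevant else 0.0


# Hypothetical results of two alert settings on the same data set,
# which contains 100 truly suspicious items.
strict = ReviewOutcome(true_positives=40, false_positives=10, false_negatives=60)
loose = ReviewOutcome(true_positives=90, false_positives=910, false_negatives=10)

for name, outcome in (("strict", strict), ("loose", loose)):
    print(name, round(precision(outcome), 2), round(recall(outcome), 2))
```

In this invented example the loose setting finds most of the suspicious items, but an auditor has to review 1,000 alerts of which 910 concern innocent parties; the strict setting produces little noise but misses more than half of the cases.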

This is why many regulatory agencies rely extensively on highly accurate methods from the world of Artificial Intelligence, such as text-mining and NLP tools, to truly understand the meaning of text. With these tools they identify the Who, Where, When, Why, What, How and How Much in the large data sets that are requested or confiscated for compliance investigations.

How Artificial Intelligence Can Assist Compliance

Now that the audit sector finds itself at a crossroads of auditing by humans and the use of machine learning techniques from the world of AI to prevent and combat fraud, public auditors can join hands with scientists to utilize advanced digital techniques to optimize the audit work and increase its impact.

Enrich Electronic Data so EVERYTHING is Text-Searchable.

Historically, Artificial Intelligence has been - for the larger part - concerned with teaching a computer system to understand different forms of human perception: speech, vision, and language.

Not all data we have to deal with in compliance audits is text searchable. This is where Artificial Intelligence can help us: meta-data extraction (normal and forensic), machine translation, optical character recognition (OCR), audio transcription, image and video tagging have reached highly reliable levels of quality due to the recent developments in deep learning. Therefore, text can be used as a good common denominator describing the content of all electronic data, regardless of the format. 

Truly Understanding the Meaning of Text

The study of text mining is concerned with the development of various mathematical, statistical, linguistic, and deep-learning techniques that allow automatic analysis of unstructured information, extract high-quality and relevant data, and make the complete text more searchable. High quality refers here to the combination of relevance and the acquisition of new and interesting insights.

A textual document contains characters that together form words, which can be combined to form phrases. These are all syntactic properties that together represent defined categories, concepts, senses, or meanings. Text mining algorithms can recognize, extract, and use all this information.

Using text mining, instead of searching for words, we do in fact search for syntactic, semantic, and higher-level linguistic word patterns. With text-mining algorithms, we aim to find someone or something that doesn’t want to be found. 

Deep Learning and Natural Language Processing (NLP)

The ability to model the context of text is vital to avoid finding too many false positives in audits and fraud investigations. Algorithms that enable us to properly understand such context have greatly advanced in recent years due to the successful progress using deep-learning algorithms for highly context-sensitive Natural Language Processing (NLP) tasks, such as machine translation, human-machine dialogues, named entity recognition, sentiment detection, emotion detection or even complex linguistic tasks such as co-reference and pronoun resolution.

The above-mentioned progress originates from the development of the so-called Transformer architecture. Transformer models are large pre-trained neural networks built on self-attention rather than recurrence. They already embed significant linguistic knowledge and can be fine-tuned on specific tasks with a relatively small amount of additional training.

A fundamental benefit of the transformer architecture is the ability to perform Transfer Learning.

Traditionally, deep learning models require a large amount of task-specific training data to achieve a desirable performance.

However, for most tasks, we do not have the amount of labelled training data required to train these networks. By pre-training with large sets of natural text, the model learns a significant amount of task-invariant information on how language is constructed. With all this information already contained in these models, we can focus our training process on learning the patterns that are specific to the task at hand. We will still require more data points than most statistical models need, but far fewer than the billions required if we were to train the deep-learning models from scratch.

Transformers can model a wide scope of linguistic context. Encoder models such as BERT use context from both previous and future words. They are, so to say, more context sensitive than models that can take only past context into consideration. In addition, this context is included in the embedding vectors, which allows for a richer representation and more complex linguistic tasks.
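The difference between using past and future context and using past context only can be illustrated with a toy attention computation. This is a simplified sketch, not the implementation of BERT or GPT: the relevance scores are invented, and real models learn them and use many attention heads and layers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, causal):
    """scores[i][j] is the raw relevance of token j for token i.

    causal=False: every token may attend to every token (BERT-style).
    causal=True: token i may only attend to tokens 0..i (GPT-style).
    """
    weights = []
    for i, row in enumerate(scores):
        if causal:
            visible = softmax(row[: i + 1])
            # Later tokens are masked out and receive zero weight.
            weights.append(visible + [0.0] * (len(row) - i - 1))
        else:
            weights.append(softmax(row))
    return weights


tokens = ["the", "bank", "transfer"]
# Hypothetical relevance scores; a real model learns these.
scores = [
    [1.0, 0.5, 2.0],
    [0.5, 1.0, 3.0],
    [2.0, 3.0, 1.0],
]
bidirectional = attention_weights(scores, causal=False)
left_to_right = attention_weights(scores, causal=True)

# Weight that "bank" (index 1) puts on the later word "transfer" (index 2).
print(bidirectional[1][2])  # greater than zero
print(left_to_right[1][2])  # 0.0
```

With bidirectional attention, the representation of "bank" can draw on the later word "transfer", which helps to separate a financial institution from a river bank. With the left-to-right mask that information is not available when "bank" is processed.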

Currently, the Bidirectional Encoder Representations from Transformers (BERT) model, released by Google AI Language, is considered one of the leading language representation models. Another successful application of Transformers can be found in OpenAI's Generative Pre-trained Transformer (GPT) models: GPT-3, GPT-4 and ChatGPT. GPT-3 has 175 billion parameters. The size of GPT-4 has not been disclosed. The quality of these GPT models is so high that it is almost impossible to distinguish generated text from text written by humans.

In addition, for many linguistic analytical tasks, both GPT and BERT outperform humans not only in speed and scalability, but also in quality. This progress allows us to use these new models to analyze large volumes of textual information in audits and investigations and identify sentences and paragraphs that provide relevant information.

Uncovering Code Words and Entities in Fraud Investigations

Fraud investigators have another common problem: at the beginning of the investigation, they do not know exactly what to search for. As encryption would raise a red flag for an auditor, such communication is often conducted in plain text, using code words. Investigators do not know these code words, nor do they know exactly which companies, persons, account numbers or amounts they must search for. Using text mining, it is possible to identify all these types of entities or properties from their linguistic role, and then to classify them in a structured manner and present them to the auditor.

For instance, one can look for patterns such as “who paid whom”, “who talked to whom”, or “who travelled where” by searching for linguistic matches. The actual sentences and words that match such patterns can then be extracted from the text of the auxiliary documentation and presented to the investigator. By using frequency analysis or simple anomaly detection methods, one can then quickly separate legitimate transactions from the suspicious ones or identify code words.
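The two steps above can be sketched in a few lines. This is a deliberately simplified illustration: a regular expression stands in for a real linguistic pattern matcher, the anomaly test is the simplest possible one (terms that recur in the investigated messages but never occur in a reference set of ordinary business mail), and all messages and names are invented.

```python
import re
from collections import Counter

# Crude stand-in for a linguistic pattern: "<Name> paid <Name>".
PAID = re.compile(r"\b([A-Z][a-z]+) paid ([A-Z][a-z]+)\b")


def who_paid_whom(sentences):
    """Return (payer, payee, sentence) for every match of the pattern."""
    hits = []
    for sentence in sentences:
        for match in PAID.finditer(sentence):
            hits.append((match.group(1), match.group(2), sentence))
    return hits


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def candidate_code_words(target, reference, min_count=2):
    """Terms that recur in `target` but never occur in `reference`."""
    target_counts = Counter(t for s in target for t in tokenize(s))
    reference_vocab = {t for s in reference for t in tokenize(s)}
    return sorted(
        term
        for term, count in target_counts.items()
        if count >= min_count and term not in reference_vocab
    )


# Invented messages under investigation.
target = [
    "Alice paid Bob for the pineapples",
    "Carol paid Bob after the pineapples arrived",
    "Send more pineapples before the audit",
]
# Invented reference set of ordinary business mail.
reference = [
    "Bob paid the invoice before the meeting",
    "Alice will send the audit report after lunch",
]

for payer, payee, sentence in who_paid_whom(target):
    print(payer, "->", payee, "|", sentence)
print(candidate_code_words(target, reference))
```

In this toy data set the term "pineapples" surfaces as a candidate code word because it recurs in the payment messages but is absent from normal correspondence. On real data, a trained parser or entity recognizer and a more robust statistical test would be needed to keep the number of false positives manageable.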

Exploring Techniques and Insights for Early Data Assessment

Depending on the type of audit, there are different dimensions that may be interesting for an early data assessment: custodians, data volumes, location, time series, events, modus operandi, motivations, etc. As described by Attfield and Blandford [1] in 2010, traditional investigation methods can provide guidance for the relevant dimensions of such assessments: Who, Where, When, Why, What, How, and How Much are the basic elements for analysis.

Who, Where and When can be determined by Named Entity Recognition (NER) methods. Why is harder, but personal experience of the first author in law enforcement investigations shows that data locations with high emotion and sentiment values also provide a good indication of the motivation or insights into the modus operandi. ‘What’ can be understood by using methods like topic modeling. A good overview of all these techniques can be found in our contribution in Big Data Law with Roland Vogl, Executive Director and Lecturer in Law, CodeX - The Stanford Center for Legal Informatics, Stanford Law School.
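A first impression of some of these dimensions can be obtained with very simple means. The sketch below extracts When and How Much with regular expressions and computes a crude emotion score from a word list. It is an illustration only: the word list, the patterns and the messages are hypothetical, and determining Who and Where reliably requires a trained NER model, which is not shown here.

```python
import re

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")              # When (ISO dates only)
AMOUNT = re.compile(r"(?:EUR|USD) ?\d[\d,]*(?:\.\d+)?")  # How Much
# Hypothetical emotion lexicon; real systems use trained classifiers.
EMOTION_WORDS = {"worried", "urgent", "afraid", "angry", "panic", "nervous"}


def assess(message):
    """Return dates, amounts and a crude emotion score for one message."""
    words = re.findall(r"[a-z]+", message.lower())
    hits = sum(1 for word in words if word in EMOTION_WORDS)
    return {
        "when": DATE.findall(message),
        "how_much": AMOUNT.findall(message),
        "emotion": hits / len(words) if words else 0.0,
    }


# Invented messages.
messages = [
    "I am worried. Wire USD 25,000 before 2023-04-01, this is urgent.",
    "The quarterly report is attached for your review.",
]

# Rank messages so that the most emotional ones are reviewed first.
ranked = sorted(messages, key=lambda m: assess(m)["emotion"], reverse=True)
for message in ranked:
    print(assess(message), "|", message)
```

Sorting by the emotion score is one possible way to decide where in a large collection a reviewer should start reading.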

How can you apply ChatGPT in compliance?

ChatGPT, GPT-3/4 and other Generative AI models are capable only of predicting the most probable next word, given a certain question or textual context, as explained in this article. The longer the generated text, the greater the odds of hallucinations. These models do not have memory, nor any form of background knowledge structure to guarantee factuality, let alone self-reflection or consciousness.

Therefore, one cannot rely on these models for tasks other than acting as a conversational interface to other Artificial Intelligence systems. Within that role, however, they are very good at:

-       Search engine query generation for complex compliance tasks and data monitoring.

-       Summarizing complex (legal) documents and having a conversation about the content of such documents in natural language.

-       Serving as a compliance chatbot for elementary questions on simple compliance issues (provided that the model used is trained on relevant compliance text).

-       Helping to better understand the risk in large documents by questioning them using a ChatGPT version trained on large collections of compliance documents.

Conclusions

eDiscovery technology taught us how to deal with real-world big data. Text mining taught us how to find specific patterns of interest in textual data. The combination of eDiscovery and text mining will teach us how to find even more complex (temporal) relations in big data for audits, and ultimately how to train our algorithms to provide better decision support and assist auditors in detecting anomalies and moments of incidents in our ever-growing electronic data sets.

This is a rapidly evolving field, where new methods to understand the structure, meaning and complexity of natural language are introduced at an ever-accelerating speed. These developments will result in essential tools that help auditors and internal investigators keep up with ever-growing electronic data sets and get to the essence of a case as quickly and efficiently as possible.



[1] S. Attfield and A. Blandford, ‘Discovery-led refinement in eDiscovery investigations: sensemaking, cognitive ergonomics and system design’, Artificial Intelligence and Law, 18(4), 387-412, 2010. doi:10.1007/s10506-010-9091-y.


