Creating Easier Access To Answers About Annuities With Large Language Models

The potential applications of Generative AI within the insurance industry are endless. One very basic initial use case is to build a chatbot that quickly answers questions about complex annuity contract features. Our call center, our agents, our distributors, and our policyholders could all benefit from this.

But in order to build a production grade application just for this use case, we first have to answer at least these three fundamental questions:

  1. Can Large Language Models, or LLMs, answer basic questions about our policies based on existing product documentation?
  2. How much better will the LLMs perform using enhanced information designed specifically to produce compliance-friendly responses?
  3. Do certain LLMs show a higher likelihood to produce both useful and compliant responses based on identical information?

We recently ran a test that attempted to answer all three questions. In short, we discovered that:

  1. LLMs can answer a number of questions surprisingly well based on the large volume of individual documents we currently provide policyholders.
  2. With enhanced documentation, answer quality improves substantially.
  3. Some LLMs perform much better than others when reading our contracts.

Note that these conclusions come from a very limited amount of testing. We varied the model temperature but held many other factors constant - e.g. the token size of chunks, the overlap, and the prompt structure. We used a relatively large number of LLMs, but did not include all of the most popular ones. We tested Google’s Gemini, most of the LLMs available on Amazon’s Bedrock - Anthropic’s Claude, Amazon’s Titan Text, Meta’s Llama2, and Cohere’s Command - as well as OpenAI’s ChatGPT. The initial results show both the challenges and the promise of building an application that makes the value of annuities more transparent to a wider variety of individuals.

Often, questions are more important than the actual answers. Why did we choose to try to answer these three questions? We’ll explain.

Why do we care whether existing policy information can be used as the source of truth with an LLM?

In our industry, the individual annuity contract, or “specimen,” is the final source of truth for any product-related question. It is filed with every state, made public, and shared with every person who purchases a policy. It is also the expression of the logic that defines the code of the core policy administration system. If an LLM can respond accurately to a question using this source documentation, we could rapidly deploy a chatbot across our entire product line.

However, challenges quickly arise across a number of dimensions when we try to use just the contract documentation. First, the filed documents that must be given to any policyholder are enormous in scale. We are required to distribute materials to each policyholder that include approximately 200,000 words.

Most LLMs today cannot ingest 200,000 words at once to answer a question. To feed information into an LLM, we must chunk the documents into smaller pieces, find the pieces most relevant to any given question, and send them to the LLM to generate an answer. In Generative AI parlance, this is known as retrieval-augmented generation, or “RAG”.

Think of this process as if we put a book in a wood chipper. When a question comes in, we then sort through the pile of chopped-up pages, pick out the 4-5 most relevant paragraphs, and then hand them to the LLM to use to generate the answer.
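The chunk-and-retrieve process described above can be sketched in a few lines of Python. This is a simplified illustration rather than our actual pipeline: it scores chunks by simple word overlap with the question, whereas a production RAG system would typically use embedding vectors and a vector store, and the chunk size and overlap values here are arbitrary assumptions.

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split a document into overlapping word-based chunks (the wood chipper)."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks


def top_chunks(question: str, chunks: list[str], k: int = 5) -> list[str]:
    """Rank chunks by word overlap with the question; keep the k most relevant."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

The handful of chunks returned by `top_chunks` is what gets pasted into the LLM prompt as context.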

To be very clear, we are not relying on the LLM's training data to provide answers in our applications. We are not asking the LLM to independently answer a question about annuities. Rather, we are using the LLM as a reasoning engine over the information we provide.

When a document becomes this large, there is a risk that the answer may require one section from the beginning of the document, one from the middle, and one from the end. If the related context in each of those paragraphs is very clear, we may have found all the right pieces in the pile.

We learned that newer features - riders in this case - were written in a manner that makes the context easier to identify. However, the base contract proved more difficult. In many cases, companies use the same base contract language for many years and progressively make minor additions or revisions each time a new product is launched. Over time, answering a question may require reading very disparate parts of the contract with little explicit internal context. 

Also, the publicly-filed documents apply only to specific states. For instance, the same contract filed in California may have different surrender charges, surrender periods, and slightly different features than a contract filed, say, in Connecticut. Unfortunately, the actual internal language in the contract may not readily identify the specific state to which it applies. 

Can these problems be solved with enhanced documentation?

Given the challenges of finding answers with publicly-filed contracts, enhanced documentation potentially offers one solution for providing accurate, human-readable, and compliant responses. Generating this type of supplement may seem daunting given the scale of the documentation. To speed the process, we actually used LLMs to help build the supplemental material. However, it has required a significant effort by our team to review it for accuracy against the source documents.

Are certain LLMs better suited for the task of reading annuity contract language?

As the saying goes, if the only tool you have is a hammer, every problem looks like a nail. We have found this to be very true when testing different LLMs inside Generative AI solutions. Each LLM is trained on different data with different methods. It should come as no surprise that some performed “better” than others.

“Better”, though, is actually difficult to define. As a marketer, I will consider a response “better” if it explains an answer in clear terms that the general public can easily comprehend. However, simple words may lose the precision that our attorneys may want when explaining why a claim may not be paid. I may give the verbose, highly expressive LLM high marks - think Carl Sagan. Our compliance team, on the other hand, may prefer the strong, silent type of LLM that sticks to the raw facts - think Clint Eastwood.

Given our preliminary results, I believe that most popular LLMs can support annuity-related applications. The ultimate selection will likely depend upon true brand voice preference, tolerance for simple versus specific answers, costs associated with the LLM, and long-term development support for the chosen model.

See the results below. Based on our first run, we will continue to develop a set of resources that could ultimately be shared across the industry:

  • Prompt templates that deliver the most explanatory, yet compliant, responses across the majority of LLMs
  • A set of baseline test questions and answers to score the relative accuracy of various annuity Generative AI applications
  • A running exploration of the value of eventually fine-tuning a model for annuity-specific use cases
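As one sketch of how a baseline question set might score answers, the snippet below computes token-level F1 between a model's answer and a reference answer, a metric borrowed from reading-comprehension benchmarks. The question and reference answer shown are hypothetical placeholders, not items from our actual baseline.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer."""
    pred = predicted.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical baseline item; a real question set would be much larger.
baseline = [
    ("When do surrender charges end?", "Surrender charges end after seven years."),
]


def score_model(answers: list[str]) -> float:
    """Average F1 of a model's answers against the baseline references."""
    scores = [token_f1(a, ref) for a, (_q, ref) in zip(answers, baseline)]
    return sum(scores) / len(scores)
```

A shared metric like this would let different annuity Generative AI applications be compared on the same footing, though human review would still be needed for compliance judgments.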

We welcome all reactions, thoughts, and contributions to the dialogue. We also hope that you will consider attending the third annual Retiretech conference, which will be held April 8-10 in Las Vegas, where we can discuss this in person.

Join Us in Las Vegas in April 2024 for Retiretech

The Test

Baseline question:

"What does guaranteed income mean?"

Prompt:

Answer the question in four sentences using only the information provided to you.

Source of information:

  1. Our publicly-filed annuity policy documents
  2. Our machine-enhanced product documentation
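For illustration, the instruction, retrieved context, and question might be assembled into a single prompt as below. The template wording is a paraphrase of the prompt above, the sentence limit is a parameter (so the three-sentence variant mentioned in the results is a one-line change), and none of this is our production code.

```python
def build_prompt(question: str, chunks: list[str], sentences: int = 4) -> str:
    """Combine the instruction, retrieved chunks, and question into one prompt."""
    context = "\n\n".join(chunks)  # the 4-5 most relevant chunks, joined
    return (
        f"Answer the question in {sentences} sentences using only the "
        f"information provided to you.\n\n"
        f"Information:\n{context}\n\n"
        f"Question: {question}"
    )
```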

The Results

The LLMs We Tested


OpenAI

It's difficult not to lead with OpenAI's well-known ChatGPT LLM. We used ChatGPT 3.5 Turbo only because we don't have access today to the latest 4.0 API. With a four-sentence answer, ChatGPT surprisingly performed best with our policy documents. With our pre-processed documentation, ChatGPT began to correctly interpret recommendations for selecting specific riders. However, it lacked enough room to complete a full review of all options. When we limited the response to three sentences, it provided the simplest and most accurate answer to the question. Note that we have removed the exact names of our product riders.

OpenAI ChatGPT 3.5

Amazon Titan Text Express

Amazon has released a number of new foundation models for text generation in the last several months. Titan Text Express provided identical, accurate responses from both document collections. Even though we instructed it to respond in four sentences, it responded in two. The language is not particularly consumer-friendly, but it accurately mirrors our policy terms.

Meta Llama2

Meta has invested significant resources in building a robust LLM called Llama2. Unlike other large tech companies, Meta released it as an open-source model. Answers using both document collections proved correct, and the enhanced documentation produced simpler answers.

Meta Llama2

Cohere Command

Cohere, a Toronto-based company, has recently released a number of offerings specifically designed to serve the needs of large enterprises in this space. In our test, the model returned some of the most consumer-friendly responses, which were both accurate and helpful.

Cohere Command


Anthropic Claude

When OpenAI shifted focus from open research to a commercial business, several employees left to form Anthropic. Based in San Francisco, the company is committed to building safe and reliable LLMs. We tested Claude, the latest LLM the company offers. The model provided the most verbose responses of all the models, but its answers were accurate using both document collections.

Anthropic Claude

Google Gemini

Google researchers invented the Transformer architecture that underpins modern LLMs. The company has also been at the forefront of the ethical and responsible use of the technology. Last month, the company released its next-generation LLM, called Gemini. The company has focused considerable resources on reducing hallucinations and providing links to source documentation. With the chunks we provided, we only received a meaningful response when we set the model temperature as high as we possibly could. I am guessing that Google is erring on the side of caution when answering questions based on limited amounts of information. Many compliance departments will likely appreciate this level of restraint.
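For readers unfamiliar with the temperature setting mentioned above, the short sketch below shows what it does mathematically: temperature rescales the model's next-token scores (logits) before they become probabilities, so low values make sampling nearly deterministic and high values flatten the distribution. The logit values are made-up numbers for illustration.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities, sharpened or flattened by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]                        # made-up next-token scores
cold = softmax_with_temperature(logits, 0.2)    # low temperature: near-deterministic
hot = softmax_with_temperature(logits, 2.0)     # high temperature: much flatter
```

A cautious model at low temperature keeps choosing the single highest-probability token, which may explain why a raised temperature was needed before Gemini produced a substantive answer from our chunks.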

Comments
Greg Illson

Financial Services

12mo

This is a good start. I found this quote interesting. “We are required to distribute materials to each policyholder that include approximately 200,000 words.” So many words that the LLMs could not digest it all. I guess the industry hasn’t yet made annuities simple. I would suggest you look at the business requirements that are developed as they are written in more plain English by business analysts at each company and usually explain what the rider or contract is supposed to do.

James Dean

Global Generative AI GTM Specialist @ Google Cloud

1y

Great use case and very thoughtful analysis. Thanks for sharing.

David F. Sterling, Esq.

Wealth Care Advocate and Consultant

1y

RE: Distinguishing Knowing About and Knowing How

My initial comments are founded on a BASIC PREMISE: annuities are first and foremost complex legal contracts of adhesion that require technical understanding and precise execution IF proper and intended benefits are to be REALIZED. At the forefront of technological contributions to the agent's promotion of annuity contract solutions and the client's "ownership experience" is THE QUESTION of whether "LLMs can support annuity-related applications." I believe that Paul Tyler has introduced an initiative of monumental significance to the insurance industry and a seemingly infinite array of potential end-users. Thank you David Macchia for bringing this to my attention.

As David knows, my work with annuities is obsessively focused on contract analytics, which includes document language interpretation, understanding, and application to client profiles, needs, and the achievement of desired outcomes. The metrics of contract performance represent a core element of annuity contract-based solutions. In this context, it is imperative to acknowledge the TWO BASIC COMPONENTS of the annuity chassis: financial/investment and legal/administrative. Quite a maze to decipher and navigate.

Anjana Nandi

Google Cloud, GSI Sales and GTM Partnerships at Google

1y

Enlightening read, Paul Tyler. Thank you for sharing.

Zehra Cataltepe

AI and GenAI solutions that Customer Facing Teams can Own (Marketing, Support, Operations, Fraud)

1y

Thanks for sharing Paul Tyler. 1) We are seeing great performance utilizing only local LLMs with RAG (without the need to use a remote LLM). Anyone interested in beta testing this? Please DM me. 2) Continuously testing the performance of generative models for complex tasks is difficult. Even though there are hundreds of tasks used for benchmarking, performance on those tasks does not necessarily reflect the performance you would see on a behind-the-firewall, very specific dataset. We are mostly utilizing other LLMs and human experts for testing. I would love to hear what else works for others.
