LLM-based Tools for Data-Driven Applications - A Practical Overview (Part 1)
All the magical tools you could want and not a clue what any of them do. Credit: Dall-E / ChatGPT


The magic is real, but the spells are finicky.

LLM Tools are Overwhelming

The open-source Large Language Model (LLM) tooling ecosystem is overwhelming. The explosion in the number of available capabilities is matched only by the corresponding explosion in the number of diverse applications to which those capabilities are being directed. This will only continue; the level of excitement in the developer community about the potential of these tools exceeds that for any technology in recent memory. I am no exception; I believe in the potential for LLM-based systems to drive real value in enterprises at various levels. But...

LLM Tools are Confusing

It can also be confusing to understand what the different tools do and how to evaluate them for a given use case. This is further complicated by the fact that nobody is entirely sure how LLM-based applications should operate. The chat-based paradigm established by OpenAI has heavily influenced the ecosystem, but that is only where we are starting; it’s not clear where we will end up. It’s also unclear which tools and frameworks are competitive versus complementary, and their mutual interfaces are tenuous at best and missing at worst.

In this three-part series of blog posts, I want to help data-driven organizations better understand the tooling ecosystem for LLMs, including open-source tools and public cloud-based managed services. This will enable you to select the right tool or suite of tools for your use case and help you understand how to approach that decision.

The first installment looks at the foundational elements—what needs to be in place before moving forward with a project. Part 2 focuses on decomposing your project into component parts to understand which tools should be preferred. In the last installment, I’ll dig into the evaluation process for specific tools, covering the end-to-end process of selecting, implementing, and operationalizing LLM tools for enterprise use cases. 

Because I am a Data Science Consultant, I focus on a specific set of tools, capabilities, and use cases related to developing data-driven applications*. 

* A data-driven application is software that relies on collecting, processing, and analyzing data to guide its functionality, decisions, and behavior. It often includes features like data visualization, machine learning, automated decision-making, and integration with external systems to provide insights and personalized experiences. Source: ChatGPT

Starting Point for Evaluation

The ecosystem is moving so fast that anything I write about a specific tool will be obsolete by the time I publish it. So, to keep this as meaningful as possible for as long as possible, the best approach I have at my disposal is to:

  1. Look at the use cases companies will target. (Part 1)
  2. Understand the functionality these systems will need to have available. (Part 2)
  3. Understand the architectural patterns that must be in place to drive that functionality. (Part 2)
  4. Utilize that functionality and those patterns as a guide for evaluating the tools to consider using for a given project. (Part 3)

To ground this pragmatic guide to LLM-based open-source tools for data-driven applications, I leverage the fantastic work by Chip Huyen, who compiled and analyzed a set of roughly 1,000 open-source LLM tools. While her analysis is broader in scope than the one I provide here, I could not have completed this series of blog posts without her effort to lay the foundation. Chip, who wrote perhaps the best book available on MLOps, provides an overview of the entire LLM tooling ecosystem. My goal is narrower: I want to provide a deeper exploration of requirements and a roadmap for navigating that ecosystem to build the best possible system to accomplish your goals for a data-driven application that incorporates LLMs.

Use Cases

To start, let’s outline what I believe to be the most achievable enterprise use cases for data-driven applications in the LLM ecosystem today and then consider how we can build on them.

  • Information retrieval: Ask a question, get an answer, and summarize the information. Information can be in a variety of formats, but the output format is relatively consistent within a given system. Create as many data source integrations as necessary to combine the data into relevant text for the LLM to consume before producing the output. 
  • Document creation: You have a set of templates for a document (e.g., an email or a government submission) that you need to create, and you have previous documents you have created, along with the inputs used to create those documents. Provide a description of the document you need with the context used to produce it and get a consistent output for people to review (this typically relies on information retrieval to gather relevant examples and context). 
  • Remixing, personalization, and copy-editing: Take existing documents and mix them together to create an altered version that is more specific or personalized to the recipient's use case. Update documents to target a particular audience. Make a written document sound more professional or reflect additional information about the target audience. 
  • Discover insights based on analysis of unstructured data: Gather structured datasets from unstructured data quickly and flexibly. Aggregate that data (in a database or data warehouse) to gain insights that were never before available. Get simple answers to questions about a raw body of text. Attempt to gain signal on text-driven processes for which very little signal existed before (e.g., customer feedback, contract edits). 
  • Write your own (relatively) fixed workflow graph: You know the steps of your workflow, and you know that each step is repeatable with a relatively fixed set of instructions. You know which step of the workflow follows which other step. You know exactly when a human needs to be in the loop to ensure that the intermediate steps toward your chosen output are what you expect them to be. Often, these workflows work best when the output is a research document or another preliminary input that will be used for decision-making, action, or further downstream human-operated processes.
  • Workflow automation (co-pilots or agents) ⚠️: Save people time. Let people provide the raw version of text input. Give them the resulting information they need in whatever format is appropriate, using LLM magic in the middle. Integrate external data sources that are searched and collected flexibly based on the question. Allow the “agents” to decide when the output is ready, and allow repeated loops of LLM API calls to be triggered on your behalf. Then, let the “agents” take action quickly using the information they have gathered. Automatically trigger action based on available information. Convert your canned analysis of unstructured data into an autonomous research analyst. ⚠️
  • Text as an interface ⚠️: Convert raw text requests into API calls to systems that do meaningful things in the world. ⚠️
  • LLM as a translator ⚠️: The LLM converts your text input into the coding language of choice for whatever system you interact with. This might take the form of a tool that constrains the LLM’s flexibility, or it might literally translate your query into one better suited to the systems you use. Text-to-SQL is a good example (a minimal sketch follows below this list), and my experience is that these systems (and any systems in this warning category) work better when the use case is more narrowly constrained. ⚠️ 

⚠️ = Caution: this may require more work than you think, and the output quality might not meet your expectations.
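
To make the "LLM as a translator" pattern concrete, here is a minimal sketch of a narrowly constrained text-to-SQL flow. It is deliberately tool-agnostic: `call_llm` is a hypothetical stand-in for whichever LLM client you eventually choose, and the schema, prompt, and read-only query execution are illustrative assumptions rather than a recommended implementation.

```python
import sqlite3

# Hypothetical example schema; in practice this would be generated from your database.
SCHEMA = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT,
    total_usd REAL,
    created_at TEXT
);
"""

PROMPT_TEMPLATE = """You translate questions into SQLite queries.
Schema:
{schema}

Question: {question}

Return only the SQL query, with no explanation."""


def text_to_sql(question: str, call_llm) -> str:
    # Constrain the task narrowly: one schema, one dialect, one output format.
    prompt = PROMPT_TEMPLATE.format(schema=SCHEMA, question=question)
    return call_llm(prompt).strip().rstrip(";")


def run_query(db_path: str, sql: str) -> list[tuple]:
    # Execute against a read-only connection so a bad translation cannot mutate data.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```

The narrower the schema and the question space, the more reliable this pattern tends to be, which is exactly why the warning above applies to open-ended versions of it.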

The beauty of these use cases is that they can be combined together to create myriad systems that accomplish organizational goals. For example:

  • Text-based query engine = Text as interface + LLM as a translator + Information retrieval
  • Email writing assistant = Information retrieval + Document creation + Remixing and personalization + (potentially) Workflow automation
  • Q&A Bot (“Agentic RAG”) = Information retrieval + Analysis of unstructured data + (potentially) Workflow automation
  • Personal Assistant = Information retrieval + Document creation + Remixing and personalization + Analysis of unstructured data + Workflow automation + Text as an interface

You can get a sense of the complexity of these use cases by counting the number of elements that need to be combined. If your use case requires only one element, then you are already selecting an entry point for an LLM-based project that maximizes your chance of success.

(Note that I ignore coding assistants for the purpose of this document. I believe these are different in kind than the data-driven applications I discuss here.)

The Non-Negotiable Elements

Regardless of the use case you pursue, the goal for the system remains the same. If you want a system that accomplishes your goal today and never changes, then develop a software program. If you want a system that becomes more accurate and efficient over time and that can learn from its mistakes with some guidance from people, then develop a machine learning (ML) system. 

Any ML system, including one that leverages LLMs, requires a feedback loop to help the system learn from its previous mistakes. Even though there are research projects that attempt to leverage an LLM to teach an LLM, you can safely assume that humans will need to (help) teach LLMs for a long time to come.

To make the most of your team’s time, this teaching process needs to be scaffolded in a way that minimizes effort to produce ground truth examples, maximizes training example quality, and integrates directly into the optimization process. The components of the teaching process include:

1. Data collection:

A system for capturing interactions between your LLM-based system and your users (or other systems). At the simplest level, all you need are inputs and outputs. If you have intermediate states or decision points from the LLM, you should also capture those. However, it is less clear-cut how those can be incorporated into future training, fine-tuning, or system optimization processes. 

Suppose you have a retrieval-augmented generation (RAG) pipeline that searches for contextual information to inject into the query to the LLM before generating. In that case, you should retain the following (a sketch of a capture record follows this list):

  • The input/output (query and results) from the retrieval portion, so you can label the results according to how you expect them to be ranked for a given query.
  • The preliminary query and any rewritten queries you send to systems for retrieval.
  • The original rankings of the results from each system and the final rankings of those results if you apply reranking to them.
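
To show what capturing these interactions can look like in practice, here is a minimal sketch of an append-only interaction log for a RAG pipeline. The record fields and the JSONL file name are illustrative assumptions; adapt them to whatever your pipeline actually produces.

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class RAGInteraction:
    user_query: str                 # the original question from the user
    rewritten_queries: list[str]    # any rewritten queries sent to retrieval systems
    retrieved: list[dict]           # per-source results with their original rankings
    reranked: list[dict]            # final rankings after reranking (if applied)
    final_answer: str               # the generated output shown to the user
    interaction_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())


def log_interaction(record: RAGInteraction, path: str = "interactions.jsonl") -> None:
    # Append-only JSONL keeps the capture path simple and easy to re-process later.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```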

2. Data Labeling:

A system for validating and re-labeling input/output pairs to match how you intend the system to operate. 
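
A labeled example only needs a handful of fields to be useful downstream. The sketch below is one possible shape, assuming the interaction log above; the field names and verdict values are illustrative, not a standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledExample:
    interaction_id: str                      # links back to the captured interaction
    input_text: str                          # the input as the system received it
    system_output: str                       # what the system actually produced
    verdict: str                             # e.g. "accept", "reject", or "needs_edit"
    corrected_output: Optional[str] = None   # the output as it should have been
    reviewer_notes: Optional[str] = None     # free-text guidance for future optimization
```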

3. A Benchmark Dataset:

A representative set of input/output benchmarks that you can use to regression-test the system. This dataset grows as part of the data labeling process. As in any machine learning process, some labeled examples are used to train or condition the system (either via fine-tuning or via a multi-shot prompt), and others are used to evaluate it. The benchmark dataset is the evaluation portion, and it needs to be carefully curated to represent the distributions of both expected inputs and expected outputs. You will have much more success if your inputs and outputs map together nicely, where variations in inputs lead to consistent variations in outputs. The more consistent your inputs and outputs, the fewer training and benchmark examples you need.
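
One simple way to store and split such a dataset is a JSONL file of input/output pairs, with most examples held out for evaluation. The file name, field names, and split ratio below are assumptions for illustration only.

```python
import json
import random


def load_benchmark(path: str = "benchmark.jsonl") -> list[dict]:
    # Each line is one labeled pair, e.g.:
    # {"input": "Summarize the attached contract clause...", "expected_output": "The clause limits..."}
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def split_benchmark(examples: list[dict], eval_fraction: float = 0.8, seed: int = 42):
    # Hold out most labeled examples for evaluation; the remainder can seed
    # multi-shot prompts or fine-tuning.
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - eval_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]   # (conditioning set, evaluation set)
```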

4. An Evaluation Process:

Especially with long-form, text-based inputs and outputs (as opposed to numerical or structured ones), it is not always clear whether an LLM-based response is “good” if it is not an exact match. Depending on the nature of your system, you can use another LLM to compare outputs against your benchmark examples, or you can use a different algorithmic approach to evaluation. This is an open area of research without a clear best practice. This is also where reinforcement learning from human feedback (RLHF) is applied at the major foundation model companies; for our purposes in this post, we assume that is beyond the scope of how the evaluation will be used. Remember that understanding the nature of your errors could enable you to create synthetic positive and contrasting negative examples that help preference-optimize open-source models. Knowing the nature of errors, and how they compare to expected successes, is critical to determining the path toward optimizing the process.
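
Here is a minimal sketch of the LLM-as-judge variant of this evaluation, scoring each system output against its benchmark reference. `call_llm` is again a hypothetical stand-in for your chosen client, and the grading prompt, 1-5 scale, and score parsing are illustrative assumptions rather than a validated rubric.

```python
JUDGE_PROMPT = """You are grading an LLM system's answer against a reference answer.

Question: {question}
Reference answer: {reference}
System answer: {candidate}

Reply with a single integer from 1 (unusable) to 5 (equivalent to the reference)."""


def judge_output(question: str, reference: str, candidate: str, call_llm) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 1   # fall back to the lowest score if parsing fails


def evaluate(benchmark: list[dict], generate, call_llm) -> float:
    # `generate(input_text) -> str` is the system under test; returns the mean judge score.
    scores = [
        judge_output(ex["input"], ex["expected_output"], generate(ex["input"]), call_llm)
        for ex in benchmark
    ]
    return sum(scores) / len(scores)
```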

5. An Experimentation Process for Enhancement:

Because the inputs, outputs, and evaluation process are noisy, you cannot afford to test multiple improvements simultaneously. Once a minimum viable system is in place, each enhancement must be implemented and tested independently to determine whether it contributes to improving the whole system. Some enhancements will likely optimize some portions of the use case, as embodied by the benchmark dataset, at the expense of others.
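
Concretely, the loop can be as simple as scoring the current baseline and exactly one candidate enhancement against the same benchmark, reusing the (assumed) evaluate() helper sketched above. The variant names and callables below are hypothetical.

```python
def compare_variants(benchmark: list[dict], variants: dict, call_llm) -> dict:
    # `variants` maps a name to a generate(input_text) -> str callable.
    # Include the current baseline plus exactly one candidate change per experiment.
    return {name: evaluate(benchmark, generate, call_llm) for name, generate in variants.items()}


# Hypothetical usage: test a single reranking change against the baseline.
# scores = compare_variants(
#     eval_set,
#     {"baseline": baseline_generate, "with_reranker": reranked_generate},
#     call_llm,
# )
```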


These elements will remain consistent regardless of the use cases your company targets, and they should be established before any meaningful project beyond a demo. It is worth spending the time to get them in place, as they directly impact the ongoing performance of any tools you implement.

But you haven’t mentioned any tools yet!

I think one of the major problems in the LLM tooling ecosystem is that people run to tools before they have fully defined their use case and the infrastructure required to scaffold it, which introduces waste, scope creep, and tool-driven development. I believe this is an anti-pattern in developing data-driven applications.

What’s next?

Once you have the non-negotiables in place, you can consider delivering on the selected use cases. In the next installment (Part 2), we will look specifically at LLM systems’ architectural patterns and workflow components before discussing which are the most critical for the application we are looking to develop. Only after that, in Part 3, will we spend time talking about adopting specific tools for specific workflow components.

I hope you enjoy this series! Please leave feedback if you have any.

