Leveraging LLMs for Database Interaction: Automating SQL Queries from Natural Language

Leveraging LLMs for Database Interaction: Automating SQL Queries from Natural Language

In recent years, the ability of Large Language Models (LLMs) to interact with databases and answer natural language queries has revolutionized data querying. With the integration of User Defined Functions (UDFs), LLMs can autonomously identify relevant tables, generate SQL queries, and perform complex operations without requiring direct user intervention in query formulation.

In this article, we will showcase how LLMs, specifically with the help of Gemini as an example, are capable of transforming natural language inputs into structured SQL queries, retrieving data from databases, and performing required transformations. Through two distinct examples, we will illustrate the model's ability to autonomously select the correct tables and fields, perform necessary data manipulations, and return actionable results.

Introduction: Automating Database Interaction with LLMs

The ability for LLMs to connect to databases and autonomously process natural language queries is transforming the data landscape. By leveraging UDFs, LLMs can handle tasks such as:

  • Identifying relevant tables,
  • Generating SQL queries,
  • Manipulating data (like changing date formats or filtering results),
  • Returning structured outputs, all with minimal intervention.

In this article, we highlight how Gemini (as an example of a modern LLM) can use these UDFs to seamlessly interact with an SQL database, automatically identifying tables and executing SQL queries to answer user questions.

The Process: How LLMs Connect with Databases

We’ve defined three key UDFs to enable efficient interaction with the database:

  1. list_tables() - Lists the available tables in the database.
  2. describe_table() - Provides schema information about a specified table.
  3. execute_query() - Runs SQL SELECT queries to retrieve data from the database.

The core advantage is that the model autonomously identifies relevant tables based on the natural language query and can directly execute SQL queries without manual intervention, including manipulating the data if needed (e.g., changing date formats, aggregating values, etc.).

Example 1: Amazon-Exclusive Products Query

User Query (Natural Language):

"Find products which are exclusive to Amazon and not sold at Top Shop or Macy's. Provide the product name, brand name, price, and rating."

System Instruction (Common to All Prompts):

Before processing queries, the LLM is guided by a common instruction set to understand how to interact with the database:

instruction = """You are a helpful chatbot that can interact with an SQL database for an online marketplace.

You will take the users' questions and turn them into SQL queries using the tools available. Once you have the

information you need, you will answer the user's question using the data returned. Use list_tables to see what

tables are present, describe_table to understand the schema, and execute_query to issue an SQL SELECT query."""

Model selection and initiation of chat:

model = genai.GenerativeModel(

"models/gemini-1.5-pro-latest", tools=db_tools, system_instruction=instruction

)

retry_policy = {"retry": retry.Retry(predicate=retry.if_transient_error)}

chat = model.start_chat(enable_automatic_function_calling=True)

Model’s Prompt for Query:

For the query on Amazon-exclusive products, the model generates the following prompt:

prompt = """Find products which are exclusive to Amazon and therefore not sold at Top Shop and Macy's. Your output should include the product name, brand name, price, and rating. Note: Tables referred here are: innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com"""

Intermediate Steps:

  • Identifying Relevant Tables: The model autonomously identifies which tables are necessary (innerwear_macys_com, innerwear_topshop_com, innerwear_amazon_com).
  • Executing the Query: After identifying the relevant tables, it executes the query to find products that are exclusive to Amazon and not sold by the others.
  • Result: The model generates the relevant SQL, executes it, and returns the result.


Notice that the model autonomously handled:

  • Table selection
  • Query formulation
  • Data retrieval and display

Example 2: Top Sellers for January 2024

User Query (Natural Language):

"List the top 3 sellers in each product category for January 2024."

Model’s Prompt for Query:

For this query, the model uses the following prompt to process the information:

prompt_with_year = """You are provided with a transactional dataset from Amazon that contains detailed information about sales across different products and marketplaces for the year 2024. The 'month' column in the dataset is in the format 'YYYY-MM'. Your task is to list the top 3 sellers in each product category for January 2024. The output should contain 'seller_id', 'total_sales', 'product_category', 'market_place', and 'month'."""

Intermediate Steps:

  • Identifying Relevant Tables: The model identifies the relevant table(s) for transactional data.
  • Describing Tables: The model uses describe_table() to understand the structure of the dataset (e.g., month, product category, seller ID).
  • Executing the Query: The model autonomously generates and executes the SQL query to find the top 3 sellers in each category for January 2024.

Model’s Output:

The query returns results like:


Conclusion: The Power of LLMs to Automate Database Queries

The examples above demonstrate the powerful capabilities of LLMs when it comes to autonomously querying databases. Using User Defined Functions (UDFs), an LLM can:

  • Automatically identify relevant tables,
  • Generate SQL queries to retrieve specific data,
  • Execute data manipulations (e.g., month formatting),
  • Return actionable results.

By leveraging Gemini (or any other LLM) as an example, we can see how these models automate the entire process, making complex database interactions easier and more accessible. This highlights the significant advancement in data accessibility and the growing role of AI-driven automation in business and analytics.

Note: The full notebook containing code snippets, detailed steps, and further examples is available for reference at the end of this article.


To view or add a comment, sign in

More articles by Nagendra Yadav, MBA(IIMB), MTech CSE(IIT-Kgp), FRM

Insights from the community

Others also viewed

Explore topics