Large Language Models (LLMs) on Structured Data Sets: A Synopsis of Current Challenges and Research Directions

The integration of Large Language Models (LLMs) with structured data sets has become a focal point of discussion among senior technical leaders responsible for data architecture. This summary encapsulates the essence of recent conversations and research efforts aimed at understanding and optimizing the use of LLMs in the context of structured data.

The primary application of LLMs in structured data environments is to interpret human queries and translate them into SQL or other query languages. This translation process is currently inefficient: it demands substantial computational resources, and it does not fully leverage the capabilities of LLMs to address the underlying problem statements. The difficulty arises from the mismatch between the unstructured nature of LLMs and the rigid, predefined schemas of structured data.

Research in this domain is actively seeking to bridge the gap between the fluidity of LLMs and the rigidity of structured data. The goal is to enhance the user experience by creating more intuitive and efficient ways for LLMs to interact with and extract information from structured databases. This endeavor is not without its challenges, as the computational intensity of LLMs poses scalability issues. By one striking estimate, a typical LLM request consumes approximately 17 times more power than a standard Google search query. This disparity highlights the need for innovation in making LLMs more energy-efficient and cost-effective, especially when dealing with structured data. The sustainability and scalability of LLMs in data architecture are contingent upon advancements that can reconcile their high computational demands with the economic and environmental costs.

One such research effort is the work by Wang et al. (2021), which proposes a novel framework for integrating LLMs with relational databases. Their approach involves a pre-processing step that transforms structured data into a format more suitable for LLMs, thereby reducing the computational overhead. Another significant contribution is the research by Zhang and Choi (2020), which focuses on optimizing the interaction between LLMs and structured data by introducing an intermediary layer that can effectively translate natural language queries into database operations.

The ongoing research in this field is a testament to the potential of LLMs to revolutionize data architecture. The quest for a more harmonious integration of LLMs with structured data sets is not merely a technical complication but a glimpse into the future of data management. The innovation that successfully mitigates the current limitations will undoubtedly pave the way for a new multi-billion dollar market, offering unprecedented opportunities for businesses and researchers alike.

#LLMs #DataArchitecture #StructuredData #Innovation #Technology #Research
🚀 Exploring the Future of Database Interfaces: LLM-based Text-to-SQL Systems 🌟

Unveiling the power of natural language processing to revolutionize database interactions! Dive into how LLM-based Text-to-SQL systems are transforming how we access and manage data.

🔧 Implementation Aspects
🤔 Question Understanding: Interpreting natural language queries.
📊 Schema Comprehension: Mapping queries to database schemas.
📝 SQL Generation: Producing syntactically correct SQL queries.

🚧 Key Challenges and Solutions 🔍
🔹 User Question Understanding: Linguistic complexity and ambiguity. Interpreting diverse natural language inputs requires deep language understanding and domain knowledge to handle complex structures and ambiguity effectively.
🔹 Database Schema Understanding: Schema representation. Accurately mapping queries to complex database schemas involves understanding table names, column names, and relationships, along with handling rare SQL operations like nested subqueries and outer joins.
🔹 SQL Query Generation: Sub-task decomposition. Breaking down the task into smaller sub-tasks like schema linking and domain classification can enhance performance. Error correction: implementing modules to identify and correct errors in generated SQL queries ensures accuracy (see the sketch after this post).
🔹 Real-world Robustness: Cross-domain adaptation. Using diverse datasets and incorporating context-dependent information improves robustness. Adversarial testing: employing datasets designed with adversarial table perturbation and synonym replacement tests model robustness.
🔹 Computational Efficiency: Few-shot and in-context learning. Adopting few-shot learning and in-context learning strategies enhances efficiency and performance, emphasizing the importance of selecting relevant samples and prompt designs.
🔹 Data Privacy: Privacy-preserving techniques. Ensuring sensitive information in user queries and database schemas is protected through anonymization and secure handling is vital.

📚 Datasets and Benchmarks
🔹 Common Datasets: Spider, Spider-Realistic, Spider-SYN, BIRD.
🔹 Characteristics: Varying complexity and domains.

📊 Evaluation Metrics
🔹 Execution Accuracy (EX): Measures the correctness of a predicted SQL query by executing it and comparing the results with the ground truth.
🔹 Exact Matching (EM): Measures the percentage of SQL queries that exactly match the ground truth.
🔹 Valid Efficiency Score (VES): Evaluates the efficiency and accuracy of valid SQL queries by comparing their execution time to the ground truth.

🔮 Future Directions
🔹 Robustness: Handling diverse and ambiguous queries.
🔹 Efficiency: Improving computational efficiency.
🔹 Privacy: Addressing data privacy concerns.
🔹 Extensions: Exploring new applications and functionalities.

What's your take? How do you see Text-to-SQL impacting data accessibility in your industry? Share your thoughts and experiences below! 👇

#TextToSQL #GenAI #NLP
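To make the schema-comprehension and error-correction ideas concrete, here is a minimal Python sketch of the loop many Text-to-SQL systems use: serialize the schema into the prompt, generate SQL, validate it against the engine, and feed failures back for correction. The `call_llm` stub and the sample schema are hypothetical placeholders, not any particular system's API.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real model API call."""
    return "SELECT name, sales FROM products ORDER BY sales DESC LIMIT 5;"

def serialize_schema(conn: sqlite3.Connection) -> str:
    """Schema comprehension starts with showing the model tables and columns."""
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return "\n".join(ddl for _, ddl in rows if ddl)

def text_to_sql(conn: sqlite3.Connection, question: str, retries: int = 2) -> str:
    prompt = (
        "Given this schema:\n" + serialize_schema(conn)
        + f"\n\nWrite one SQLite query answering: {question}\nSQL:"
    )
    sql = call_llm(prompt)
    for _ in range(retries):
        try:
            conn.execute("EXPLAIN " + sql)  # validate syntax without running
            break
        except sqlite3.Error as err:
            # Error-correction module: feed the failure back to the model.
            sql = call_llm(prompt + f"\nThis attempt failed ({err}): {sql}\nFixed SQL:")
    return sql

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, sales INTEGER)")
print(text_to_sql(conn, "What are the top-selling products?"))
```

A production system would also run the candidate query against sample data and check the result shape, but cheap EXPLAIN-style validation catches the most common syntactic failures first.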
"Day 6: Query Construction - Bridging the Gap Between Language and Databases!" Hey everyone! Today, we’re exploring Query Construction—an essential step in the RAG flow where your query is reshaped to match the format of different databases. 🛠️✨ In this step, the query is transformed from natural language into a form that specific data systems can understand. This ensures that the LLM can retrieve data accurately from various sources. Two key methods used in this process are Text-to-SQL and Text-to-Cypher: 1. Text-to-SQL: This transformation converts natural language queries into SQL queries. For example, if you ask, "What are the top-selling products?" the system transforms this into an SQL query like SELECT * FROM products WHERE sales > X. This works great for structured databases like relational databases. 2. Text-to-Cypher: Cypher is a query language used for graph databases (like Neo4j). When the LLM needs to search a graph database, it transforms the query into a Cypher query. For instance, "Show connections between company X and person Y" might transform into a Cypher query like MATCH (a:Company)-[:RELATED_TO]->(b:Person) RETURN a, b. This is useful for databases that focus on relationships between data points. Query Construction acts like a translator between the language we speak and the languages databases understand, ensuring that no matter how complex the database is, the LLM can communicate with it effectively. 🔄📊 learn more about here https://lnkd.in/e4pAuuUt Tomorrow, we’ll explore the Retrieval step—where we start pulling data based on these transformed queries! Stay tuned! 😊✨ #rag #genai #llm #ai #data #AI #Chatbots #TechInnovation #MachineLearning #FutureOfWork
🔔 Democratizing Data: How Natural Language Interfaces are Empowering End Users to Interact with Databases

🔴 Hermes: A Text-to-SQL Solution at Swiggy
💠 Hermes is a generative AI-based workflow developed by Swiggy to facilitate data accessibility for its teams. The tool allows users to input natural language questions and receive corresponding SQL queries and results directly in Slack. This streamlines the data access process, enabling faster and more efficient decision-making.

🔶 The Need for Hermes
💠 Many business and product decisions require specific numbers and quantities that are often locked away in databases, accessible only to those with SQL knowledge.
💠 Traditional methods of data access, like searching dashboards or requesting data from analysts, can be time-consuming and inefficient.
💠 Hermes democratizes data access, making it faster and easier for everyone to get the information they need.

🔶 Key Features and Benefits
💠 Natural Language Interface: Users can ask questions in plain English, eliminating the need for SQL expertise.
💠 Instant Results: Hermes automatically generates SQL queries and executes them, delivering results directly in Slack within minutes (a data-cleansing layer via AWS Lambda is also included).
💠 Improved Data Accessibility: Empowers users across different roles to access and analyze data independently.
💠 Enhanced Decision-Making: Enables faster, data-driven decisions by providing quick access to critical information.
💠 Increased Efficiency: Streamlines the data querying process, saving time and effort for users.

🔶 Technology Behind Hermes (a generic sketch of this flow follows the post)
💠 Generative AI: Leverages the power of large language models (LLMs) like GPT-3.5 and GPT-4 to generate SQL queries.
💠 Knowledge Base and RAG: Incorporates Swiggy-specific context through a knowledge base and Retrieval-Augmented Generation (RAG) techniques.
💠 Data Catalog: Integrates with Swiggy's in-house data catalog, Lumos, for metadata management.
💠 Cloud Computing: Utilizes AWS Lambda for middleware and Databricks for job creation and query execution.

🔴 I have seen a similar capability embedded within Oracle Autonomous Database (ADB). Enterprises using Oracle Autonomous Database and seeking a robust, user-friendly Text-to-SQL solution can leverage the built-in "speak human" AI feature (Select AI, a DBMS package) in conjunction with a middleware layer like AWS Lambda / OCI Functions for enhanced functionality and seamless integration (reference: https://lnkd.in/dCbgFFMs).

🔴 By enabling interaction with databases using natural language, these tools break down the barriers of technical expertise, allowing individuals across different roles to leverage data for informed decision-making and problem-solving. As technology continues to advance, we can expect even more intuitive and user-friendly solutions that further democratize data access and empower individuals to harness the power of information for better outcomes.
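Swiggy has not published Hermes's code, so the following is only a generic sketch of the flow the post describes: retrieve catalog context, generate SQL with an LLM, execute, lightly clean, and post the answer. All names (`call_llm`, `fetch_catalog_context`, the schema) are hypothetical; SQLite and `print` stand in for the warehouse and the Slack webhook.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical LLM stand-in for the GPT-3.5/4 calls described above."""
    return "SELECT city, COUNT(*) FROM orders GROUP BY city;"

def fetch_catalog_context(question: str) -> str:
    # Stand-in for RAG over a data catalog (Lumos, in Swiggy's case):
    # retrieve table descriptions relevant to the question.
    return "orders(order_id, city, amount): one row per delivered order"

def answer_in_slack(conn: sqlite3.Connection, question: str) -> None:
    prompt = f"Context:\n{fetch_catalog_context(question)}\n\nQuestion: {question}\nSQL:"
    sql = call_llm(prompt)
    rows = conn.execute(sql).fetchall()       # run the generated query
    cleaned = [tuple(r) for r in rows]        # a data-cleansing layer would go here
    print(f"Slack reply:\n{sql}\n{cleaned}")  # stand-in for a Slack webhook call

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, city TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Bangalore", 450.0), (2, "Mumbai", 320.0)])
answer_in_slack(conn, "How many orders per city?")
```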
GenAI language models tend to blurt out words even when they lack knowledge. Vector RAG provides knowledge to assist their word choice.

Retrieval augmented generation, or RAG, has emerged as a cost-effective way to reduce the risk of GenAI hallucinations. Vector RAG interprets semantic meaning based on similarity searches of unstructured data, especially chunks of text, sometimes within a vector database. Vector RAG helps when word choice really matters, which is almost always the case.

✍✍✍ Florian Bigelmaier and I just published a new BARC report that describes vector RAG as well as relational and graph RAG. Thank you to our report sponsors unstructured.io, Vespa.ai, and Vectara. Many GenAI adopters need all three architectural approaches, along with keyword search.

Vector RAG assists GenAI use cases that involve nuanced semantics. For example, it might help a GenAI application use nuanced legal language when summarizing court cases for attorneys. A vector RAG workflow involves unstructured data, pipelines and the vector DB, as well as application(s) or agent(s) that contain the language model, other AI model(s) and a user interface. (A toy end-to-end sketch follows this post.)

> Data
Vector RAG relies on unstructured data within one or more object stores. Plain text objects such as emails, documents, and chat conversations are common inputs.

> Pipelines
Pipelines chunk all this text into smaller, manageable pieces and convert those chunks into numerical representations called embeddings. Tools such as unstructured.io assist this process by chunking, transforming and enriching this data and assembling metadata. These embeddings capture the semantic meaning and relationships of the chunks in a high-dimensional vector space.

> Database
The vector database, or possibly a multi-faceted AI database, stores the embeddings in an index along with the chunks themselves. Its main role is to organize the data for efficient retrieval.

> Applications/Agents
When a user submits a query, the application converts the query into an embedding. The vector database then searches for source chunks with similar embeddings. The application retrieves the chunks, ranks them and injects the most relevant chunks into the language model's prompt.

The quality of the indexed data determines the completeness, precision and accuracy of the output. Best practices for unstructured data include regularly refreshing data and filtering out irrelevant documents to reduce noise. Developers can also improve output quality by integrating document metadata, applying advanced reranking algorithms or combining keyword and vector searches during retrieval. Business and data/AI experts should oversee the design, implementation and operation of the vector RAG workflow to ensure governed answers.

What do you think? Would welcome stories from early adopters out there.

#data #unstructureddata #genai #rag
Brian S. Raymond Stefanie Segar Tim Young Jon Bratseth Sean Anderson
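A toy end-to-end version of that workflow, using a hashed bag-of-words in place of a trained embedding model so it runs with the standard library alone; the documents and query are invented for illustration.

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy embedding: hashed bag-of-words. Real pipelines use a trained model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))  # vectors are already normalized

# Pipeline step: chunk documents and index (embedding, chunk) pairs.
chunks = [
    "The court ruled the contract clause unenforceable.",
    "Quarterly revenue grew 12 percent year over year.",
    "The appeal was dismissed for lack of standing.",
]
index = [(embed(c), c) for c in chunks]

# Application step: embed the query, retrieve top chunks, build the prompt.
query = "What did the court decide about the appeal?"
q_vec = embed(query)
top = sorted(index, key=lambda e: cosine(q_vec, e[0]), reverse=True)[:2]
prompt = "Answer using this context:\n" + "\n".join(c for _, c in top) + f"\n\nQ: {query}"
print(prompt)  # this assembled prompt is what gets sent to the language model
```

Real systems swap in a trained embedding model and a vector database, and often add reranking, but the retrieve-then-inject shape stays the same.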
Don't Build Your Future on Specialized Vector Databases
https://lnkd.in/gfTxqiYr

With the rise of AI, vector databases have gained significant attention due to their ability to efficiently store, manage and retrieve large-scale, high-dimensional data. This capability is crucial for AI and generative AI (GenAI) applications that deal with unstructured data such as text, images and videos. The main logic behind a vector database is to provide similarity search capabilities, rather than the keyword search that traditional databases provide. This concept has been widely adopted to boost the performance of large language models (LLMs), particularly following the release of ChatGPT.

The biggest issue with LLMs is that they require substantial resources, time and data for fine-tuning, which makes it very difficult to keep them updated. This is why, when you query LLMs about recent events, they often provide answers that are factually incorrect, nonsensical or disconnected from the input prompt, leading to "hallucinations." One solution is retrieval-augmented generation (RAG), which augments an LLM by integrating up-to-date information retrieved from an external knowledge base.

Specialized vector databases are designed to handle vectorized data efficiently and provide robust semantic search capabilities. These databases are optimized for storing and retrieving high-dimensional vectors, which are essential for similarity searches. The speed and efficiency of vector databases have made them an integral part of RAG systems.

The hype around vector databases has led many people to suggest that traditional databases might be replaced by vector databases. Instead of storing data in traditional (SQL or NoSQL) databases, could you store an organization's entire data set in a vector database and retrieve it using natural language instead of writing manual queries? But vector databases don't function like traditional databases. As Qdrant CTO Andrey Vasnetsov wrote, "the majority of vector databases are not databases in this sense. It is more accurate to call them search engines." This is because their main purpose is to provide optimized search functionalities; they are not designed to support basic features like keyword search or SQL queries.

Limitations of Specialized Vector Databases
As use cases grew and people focused on the scalability of their applications, the limitations of vector databases became more visible. Developers soon realized they still need the features of a full-text search engine along with vector search. For example, filtering search results based on specific criteria is very difficult with vector databases. These databases also lack direct matches for exact phrases, which are crucial for many tasks. (A small hybrid-retrieval sketch follows this post.)

Limited Support for Complex Queries
Complex queries often involve multiple conditions, joins and aggregations, making them challenging for specialized vector databases. These databases provide limited support ...
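To make the filtering point concrete, here is a small sketch of the hybrid pattern developers end up needing: a structured filter and an exact-phrase check applied alongside vector ranking. The documents and two-dimensional vectors are toy stand-ins, not a real engine's data.

```python
# Hybrid retrieval sketch: vector similarity alone cannot express
# "only 2024 documents containing the exact phrase 'force majeure'".
# Pure vector engines need bolted-on filtering and keyword logic for this.

docs = [
    {"text": "The force majeure clause excused delivery delays.", "year": 2024, "vec": [0.9, 0.1]},
    {"text": "Supply chain risk rose sharply this quarter.",      "year": 2024, "vec": [0.8, 0.2]},
    {"text": "The force majeure argument failed in court.",       "year": 2019, "vec": [0.9, 0.1]},
]

def cosine(a: list[float], b: list[float]) -> float:
    # Toy 2-d vectors; real embeddings have hundreds of dimensions.
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return num / den

query_vec, phrase, year = [0.85, 0.15], "force majeure", 2024

hits = [d for d in docs
        if d["year"] == year and phrase in d["text"].lower()]       # filter + exact match
hits.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)  # then vector rank
print([d["text"] for d in hits])  # only the 2024 force-majeure document survives
```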
Empowering Natural Language Queries in Oracle APEX

Imagine crafting insightful data queries from your Oracle database using plain English! This futuristic vision becomes reality with Select AI, a groundbreaking feature introduced in Oracle Autonomous Database.

Benefits of Select AI:
- Democratization of Data Access: Select AI empowers users with limited SQL expertise to extract valuable insights from their data. Business analysts, researchers, or even citizen developers can now interact with the database using natural language, fostering broader data utilization.
- Intuitive User Experience: Gone are the days of wrestling with complex SQL syntax. Select AI simplifies data exploration by allowing users to express their queries in a natural way, similar to how they would ask a question to a colleague.
- Increased Productivity: By eliminating the need to learn intricate SQL commands, Select AI streamlines the data retrieval process, saving valuable time and resources.
- Reduced Errors: Select AI mitigates the risk of errors commonly associated with manual SQL coding. The LLM translates the user's intent into a syntactically correct SQL statement, ensuring data accuracy.

Integration with Oracle APEX: A Perfect Match
Oracle APEX, a low-code development platform, provides a powerful framework for building web applications. The integration of Select AI with APEX takes the user experience to a whole new level. Here's how:
- Natural Language Search Integration: Imagine embedding a search bar within your APEX application where users can type their queries in plain English. Select AI, working behind the scenes, translates those queries into SQL and retrieves the relevant data, populating the application with insightful results.
- Interactive Data Exploration: APEX allows creating interactive dashboards and reports. Select AI empowers users to interact with these elements using natural language, enabling them to drill down into specific data points or filter results based on their needs.

Getting Started with Select AI in APEX:
While the specific implementation details may vary, here's a general roadmap:
1. Configure Select AI: This involves enabling the AI service within your Oracle Autonomous Database and ensuring proper access control mechanisms are in place.
2. Develop your APEX application: Build your APEX application with elements like search bars, reports, or dashboards.
3. Integrate Select AI: Utilize APEX functionalities to capture user queries in natural language and leverage the DBMS_CLOUD_AI package to interact with the Select AI service. The LLM translates the query and returns the corresponding SQL statement, which can then be executed against the database. (A minimal sketch of this call follows this post.)

A Glimpse into the Future:
Select AI represents a significant leap forward in database accessibility. By leveraging natural language processing, Oracle is making data exploration more intuitive and user-friendly.
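A minimal sketch of step 3 from a middle tier, assuming the DBMS_CLOUD_AI package the post names (a SET_PROFILE procedure plus a GENERATE function taking prompt, profile_name, and action attributes). The connection details, profile name, and question are hypothetical, and the exact signatures should be verified against Oracle's current documentation.

```python
# A backend (APEX process, Lambda, or any middle tier) forwarding a natural
# language question to Select AI via the DBMS_CLOUD_AI package.
import oracledb

conn = oracledb.connect(user="app_user", password="***", dsn="mydb_high")
cur = conn.cursor()

# Point the session at a previously created AI profile (assumed to exist).
cur.execute("BEGIN DBMS_CLOUD_AI.SET_PROFILE(profile_name => 'MY_AI_PROFILE'); END;")

# The 'showsql' action asks Select AI to return the generated SQL
# rather than executing it, so it can be reviewed first.
question = "How many customers signed up in Boston last month"
generated_sql = cur.callfunc(
    "DBMS_CLOUD_AI.GENERATE", oracledb.DB_TYPE_CLOB,
    keyword_parameters={"prompt": question,
                        "profile_name": "MY_AI_PROFILE",
                        "action": "showsql"},
).read()
print(generated_sql)  # review, then execute against the database if appropriate
```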
Entity Linking: How Large Language Models Enhance Traditional Software in Data Engineering 🔗

Data engineering faces a Goliath. Unstructured data, vast and chaotic, challenges our ability to make sense of it all. At its core lies entity linking (EL), a critical task. It connects textual mentions to knowledge base entries. Imagine reading about "Washington." Which one? The capital? The state? The president? Entity linking answers these questions. It's the backbone of search engines and recommendation systems. But traditional methods are struggling. As data grows, so do the challenges.

Enter Large Language Models (LLMs). These AI systems promise a revolution in entity linking for traditional software systems. They offer unprecedented accuracy and efficiency, especially for long-tail entities: those elusive, rarely mentioned names that often slip through the cracks.

But LLMs alone aren't the silver bullet. The real magic happens when we combine them with traditional software. This fusion integrates their abilities where other methods fall short. It's a synergy of old and new, of specialized knowledge and broad understanding.

How do LLMs achieve this? They leverage vast knowledge and deep contextual understanding. Data engineers and researchers are harnessing this power, crafting novel approaches that outshine traditional methods. One such approach is LLMAEL (LLM-Augmented Entity Linking). LLMAEL uses LLMs to generate rich, contextual descriptions for entity mentions. These descriptions then feed into traditional EL models. (A toy version of this idea follows the post.)

And the results? They're impressive. Studies show that LLM-augmented entity linking significantly outperforms traditional methods. On standard benchmarks, LLMAEL achieves state-of-the-art results across all datasets. It improves disambiguation accuracy by up to 3% over previous best results. For long-tail entities, the gains are even more striking: up to 5% improvement.

But it's not just about numbers. This LLM-software combo transforms workflows. It accelerates processes. It unlocks new possibilities in data engineering. In real-world applications, from information retrieval to knowledge graph construction, the impact is substantial. More accurate entity linking leads to better search results, more comprehensive knowledge bases, and more sophisticated recommendation systems.

https://lnkd.in/gpnt2imV
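A toy rendition of the LLMAEL idea, with a stubbed LLM and a bag-of-words ranker standing in for the paper's actual models: the LLM enriches the mention with a generated description, and a conventional ranking step does the matching against the knowledge base.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM stub: would return a contextual entity description."""
    return ("Washington here refers to the capital city of the United "
            "States, seat of the federal government.")

# Tiny stand-in knowledge base mapping entity names to descriptions.
KB = {
    "Washington, D.C.": "capital city of the united states federal government",
    "Washington (state)": "u.s. state in the pacific northwest",
    "George Washington": "first president of the united states",
}

def link(mention: str, context: str) -> str:
    # LLM augmentation: generate a description conditioned on the context.
    description = call_llm(f"Describe the entity '{mention}' in: {context}")
    enriched = set((context + " " + description).lower().split())
    # Traditional EL step, reduced to bag-of-words overlap for illustration.
    def score(entity_desc: str) -> int:
        return len(set(entity_desc.split()) & enriched)
    return max(KB, key=lambda name: score(KB[name]))

print(link("Washington", "Congress returned to Washington to pass the budget."))
# -> "Washington, D.C.": the generated description tips the overlap ranking
```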
Ah yes; the art of data fusion. Entity linking, described nicely.
Using SQL-Powered RAG to Better Analyze Database Data with GenAI

You know your organization needs to start leveraging generative AI (GenAI). But how do you get started? With data stored in databases holding your company's critical information, applying large language models (LLMs) to that data might seem complex. However, you can actually start using LLMs to analyze your data in Oracle Autonomous Database in just minutes using SQL-powered retrieval-augmented generation (RAG).

What Is Retrieval-Augmented Generation (RAG)?
RAG allows you to apply the power of LLMs (e.g., creativity, deep understanding of language nuances) to information that the models know little or nothing about. That lack of knowledge might be because the information is private (e.g., in your database) or more recent than the model's training data. By augmenting AI-generated content with authoritative information, RAG can help improve the accuracy, relevance, and reliability of GenAI output.

RAG is generally associated with vector databases, which help provide context to an LLM by allowing super-fast retrieval of similar data from storage engines (e.g., unstructured data, PDFs, documents), rather than just exact keyword matches. To gain insights using RAG:
1. Define your task using natural language.
2. Perform a vector similarity search against your data to get context.
3. Pass that information to the LLM.

You can now answer a natural language question like: "My customer thinks this condo is beautiful. What other condos in the Boston area look like that one and are in her price range?" That returns similar-looking homes that she can afford, based on image similarity and her private financial information contained in the database.

What Is SQL-Powered RAG?
There are other ways to provide context to an LLM that are simpler but perhaps not as powerful as what's described above. This approach works with the data that's accessible to your Autonomous Database deployment (e.g., internal tables, data lakes, linked tables). To use RAG with Autonomous Database:
1. Define your task using natural language.
2. Provide a SQL query against your data to get context.
3. Pass that information to the LLM.

Conceptually, this looks very similar to using RAG with vector databases. Here's an example of applying those steps in Autonomous Database using a sample Oracle APEX app. (A generic sketch of the pattern follows this post.)

Using SQL-Powered RAG
Autonomous Database provides a capability called Select AI that allows you to use LLMs with your data. A popular way to use Select AI is for natural language queries (see "Autonomous Database speaks 'Human'" and "Conversations Are the Next Generation in Natural Language Queries"). This is a little different than natural language queries; instead of generating a query, it combines the results of a SQL query with task instructions to produce a prompt. That prompt is passed to an LLM and processed, producing a recommendation, a summary or whatever your project asked it to do. To make this work:...
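A generic sketch of the three SQL-powered RAG steps, using SQLite and a stubbed model rather than Autonomous Database and Select AI; all names and data are hypothetical.

```python
import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical LLM stub; in Autonomous Database this role is played
    by Select AI forwarding the assembled prompt to a configured provider."""
    return "(model response would appear here)"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE condos (city TEXT, price INTEGER, style TEXT)")
conn.executemany("INSERT INTO condos VALUES (?, ?, ?)", [
    ("Boston", 520000, "brick loft"), ("Boston", 480000, "modern high-rise"),
])

# Step 2: a plain SQL query supplies the context instead of a vector search.
rows = conn.execute(
    "SELECT city, price, style FROM condos WHERE city = 'Boston' AND price < 550000"
).fetchall()

# Steps 1 and 3: combine the task instructions with the query results
# into one prompt and hand it to the model.
task = "Recommend condos for a buyer who likes brick lofts, citing prices."
prompt = f"{task}\n\nData:\n" + "\n".join(map(str, rows))
print(call_llm(prompt))
```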
Tired of writing complex SQL queries? Select #AI unlocks the power of Oracle Autonomous Database through natural language conversations. Ask questions, get insights in multiple languages. Learn more: https://lnkd.in/eMXFGuFd
Natural language queries to Oracle Autonomous Database? Yes—with Select AI