💻 HTML > Plain Text for RAG

In this issue:

  1. A foundation model for time series forecasting
  2. More intelligent tool usage for scientific LLMs
  3. HTML > plain text for RAG



1. A Mamba Foundation Model for Time Series Forecasting

Watching: TSMamba (paper)

What problem does it solve? Time series forecasting is a crucial task in domains ranging from finance to healthcare. However, patterns in real-world applications evolve quickly, so relevant training data is often scarce. Time series foundation models have shown promise in addressing this issue through zero-shot learning, but most of them rely on the Transformer architecture, whose self-attention cost grows quadratically with input length, making them computationally expensive and hard to scale to long inputs.

How does it solve the problem? TSMamba tackles the complexity issue by building upon the Mamba architecture, which offers linear complexity. It employs a two-stage transfer learning process that leverages pretrained Mamba LLMs, enabling effective time series modeling with a moderate training set. In the first stage, TSMamba optimizes the forward and backward backbones through patch-wise autoregressive prediction. In the second stage, it trains a prediction head and refines other components for long-term forecasting. Additionally, TSMamba introduces a channel-wise compressed attention module to capture cross-channel dependencies during fine-tuning on specific multivariate datasets, while the backbone assumes channel independence to handle varying channel numbers across datasets.
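The patch-wise autoregressive setup from the first stage can be illustrated with a minimal sketch. It assumes non-overlapping patches and covers only the forward direction. The function names, the patch length, and the toy series are hypothetical and not taken from the TSMamba paper.

```python
from typing import List, Tuple

Patch = List[float]


def make_patches(series: List[float], patch_len: int) -> List[Patch]:
    """Split a series into non-overlapping patches, dropping any ragged tail."""
    n_patches = len(series) // patch_len
    return [series[i * patch_len:(i + 1) * patch_len] for i in range(n_patches)]


def autoregressive_pairs(patches: List[Patch]) -> List[Tuple[List[Patch], Patch]]:
    """Pair every prefix of patches (the context) with the patch that follows it."""
    return [(patches[:i], patches[i]) for i in range(1, len(patches))]


# Toy series 0.0 .. 9.0; with patch_len=3 the last value does not fill a patch.
series = [float(t) for t in range(10)]
patches = make_patches(series, patch_len=3)
pairs = autoregressive_pairs(patches)

for context, target in pairs:
    print(len(context), "context patch(es) ->", target)
```

A backbone trained on such pairs learns to predict the next patch from everything before it. A backward backbone could, under the same assumptions, be trained on the reversed series.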

What's next? TSMamba's reported results suggest that efficient and accurate time series forecasting is possible even when training data is limited. According to the paper, the model achieves competitive or superior performance compared to task-specific prediction models despite using significantly less training data, which highlights its potential for real-world applications. The authors state that the code will be made publicly available, so researchers and practitioners will be able to explore and build upon this approach, potentially leading to advancements in domains such as finance, healthcare, and climate modeling, where accurate forecasts are crucial for decision-making and planning. However, the model isn't available yet, and quite a few past papers have over-reported the effectiveness of deep learning for time series modeling, so the results should be treated with caution until they can be reproduced.


2. Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation

Watching: LLMs 4 Scientific Problems (paper)

What problem does it solve? Large Language Models (LLMs) have shown impressive capabilities in solving simple scientific problems, but they often struggle with more complex ones, producing unreliable or incorrect answers. While integrating LLMs with external tools can improve reliability, this approach often leads to an over-reliance on tools, which can diminish the model's ability to solve simple problems through basic reasoning. This research aims to address this issue by proposing a novel two-component fine-tuning method that enables LLMs to assess problem complexity and choose the appropriate solution approach, similar to how human experts solve problems.

How does it solve the problem? The proposed method consists of two components: World Knowledge Distillation (WKD) and Tool Usage Adaptation (TUA). In WKD, LLMs learn directly from solutions generated using a tool's information, allowing them to internalize domain knowledge. This helps the model to solve simple problems without relying on external tools. In TUA, problems are categorized as easy or hard based on the model's direct answering accuracy. The model is trained to maintain the same alignment target for easy problems as in WKD, while learning to intelligently switch to tool usage for more challenging problems. This approach enables the model to assess problem complexity and choose the most appropriate solution method, mimicking the problem-solving process of human experts.
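The easy/hard split used in TUA can be sketched as follows. This is only an illustration under stated assumptions: the accuracy threshold, the number of samples, the exact-match scoring, and the `toy_model` stand-in are all hypothetical and are not specified in the summary above.

```python
from typing import Callable, Dict, List

Problem = Dict[str, str]


def direct_accuracy(answer_fn: Callable[[str], str], question: str,
                    reference: str, n_samples: int) -> float:
    """Fraction of tool-free answers that exactly match the reference."""
    correct = sum(answer_fn(question) == reference for _ in range(n_samples))
    return correct / n_samples


def split_by_difficulty(problems: List[Problem],
                        answer_fn: Callable[[str], str],
                        threshold: float = 0.5,      # assumed cutoff
                        n_samples: int = 4) -> Dict[str, List[Problem]]:
    """Label a problem 'easy' if direct answering is accurate enough, else 'hard'."""
    split: Dict[str, List[Problem]] = {"easy": [], "hard": []}
    for problem in problems:
        acc = direct_accuracy(answer_fn, problem["question"],
                              problem["answer"], n_samples)
        split["easy" if acc >= threshold else "hard"].append(problem)
    return split


def toy_model(question: str) -> str:
    """Stand-in for an LLM answering without tools (hypothetical)."""
    known = {"2 + 2": "4"}
    return known.get(question, "unknown")


problems = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "integral of x^2 from 0 to 3", "answer": "9"},
]
split = split_by_difficulty(problems, toy_model)
print([p["question"] for p in split["easy"]])
print([p["question"] for p in split["hard"]])
```

In a setup like this, the easy problems would keep the direct-answer training target, while the hard ones would be paired with tool-usage traces.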

What's next? The proposed two-component fine-tuning method has demonstrated significant improvements in answer accuracy and tool usage precision across various scientific benchmark datasets, outperforming state-of-the-art models like GPT-4o and Claude-3.5. This highlights the potential for developing more intelligent and efficient LLMs that can assess problem complexity and adapt their problem-solving approach accordingly. Future work could focus on extending this method to other domains beyond scientific problems, as well as exploring ways to further improve the model's ability to internalize domain knowledge and make informed decisions about when to rely on external tools. Additionally, researchers could investigate the scalability of this approach to larger and more diverse datasets, as well as its potential for real-world applications.


3. HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Watching: HtmlRAG (paper)

What problem does it solve? Retrieval-Augmented Generation (RAG) has been a popular approach to enhance the knowledge capabilities of Large Language Models (LLMs) and mitigate their tendency to hallucinate information. Many commercial systems, such as ChatGPT and Perplexity, rely on web search engines as their primary retrieval systems. However, the typical RAG process involves retrieving search results, downloading HTML sources, and extracting plain text from the HTML. This approach often leads to the loss of valuable structural and semantic information inherent in HTML, such as headings and table structures.

How does it solve the problem? HtmlRAG addresses this issue by using HTML instead of plain text as the format for retrieved knowledge in RAG systems. The authors believe that HTML is better suited for modeling knowledge in external documents, and most LLMs have robust capabilities to understand HTML. However, utilizing HTML presents new challenges, such as the presence of additional content like tags, JavaScript, and CSS specifications, which introduce extra input tokens and noise to the RAG system. To tackle this problem, the authors propose HTML cleaning, compression, and pruning strategies to shorten the HTML while minimizing information loss. They design a two-step block-tree-based pruning method that removes useless HTML blocks and retains only the relevant parts of the HTML.
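A minimal sketch of the cleaning idea, using only Python's standard `html.parser`, is shown below. It is an assumption about what such a step could look like, not the HtmlRAG implementation: it drops scripts, styles, comments, and all attributes while keeping the tag structure, and it does not attempt the block-tree pruning described above. The class and function names are hypothetical.

```python
import html
import re
from html.parser import HTMLParser


class HtmlCleaner(HTMLParser):
    """Keep tag structure and text; drop scripts, styles, comments, attributes."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are discarded on purpose

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        if self.skip_depth == 0 and text.strip():
            self.out.append(html.escape(text, quote=False))

    # Comments are dropped: the inherited handle_comment() does nothing.


def clean_html(raw: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.out)


raw = ('<div class="a" style="color:red"><script>var x = 1;</script>'
       '<style>p{margin:0}</style><!-- nav --><h1 id="t">Prices</h1>'
       '<table><tr><td>Apple</td><td>3</td></tr></table></div>')
cleaned = clean_html(raw)
print(cleaned)
print(len(raw), "->", len(cleaned), "characters")
```

The heading and table structure survive, which is the information plain-text extraction would lose, while the markup that mostly adds tokens is gone.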

What's next? The experiments conducted on six question-answering datasets confirm the effectiveness of using HTML in RAG systems. This opens up new possibilities for improving the performance of RAG-based LLMs by leveraging the rich structural and semantic information available in HTML documents. Additionally, the integration of HtmlRAG with other state-of-the-art LLM architectures and training techniques could lead to even more powerful and knowledgeable language models.


Papers of the Week:

  1. E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation: Retrieval-augmented generation with external knowledge faces misinformation challenges. E2E-AFG implements adaptive filtering and answer existence judgment in an end-to-end framework for text generation. Evaluated on six knowledge-intensive language datasets, it outperforms baseline models through integrated processing.
  2. Real-Time Anomaly Detection and Reactive Planning with Large Language Models: Foundation models enable zero-shot generalization in robotics via a runtime monitor that enhances trustworthiness under resource constraints. A binary classifier outperforms GPT models, operating in embedding space with fallback plans for quadrotors and autonomous vehicles, maintaining safety through predictive control.
  3. Thinking Forward and Backward: Effective Backward Planning with Large Language Models: Large language models employ bidirectional planning, with backward approaches excelling near bottlenecks. Systematic biases necessitated developing flipped-problem techniques. Combining forward and backward planning with self-verification yields 4-24% higher success rates across three planning domains.
  4. Interaction2Code: How Far Are We From Automatic Interactive Webpage Generation: Research evaluates SOTA MLLMs' webpage design-to-code conversion through Interaction2Code benchmark, using human evaluations across 15 webpage types and 30 interaction categories. The study examines 97 pages and 213 interactions, identifying 10 failure types and visual saliency impacts to advance automated web development solutions.
  5. Mixtures of In-Context Learners: MoICL enhances in-context learning for LLMs by weighting demonstration subsets, outperforming LENS and improving the Pareto frontier. It shows 13% better classification performance while handling out-of-domain, imbalanced, and noisy demonstrations with up to 49% improvement. The approach prevents context window exhaustion and memory constraints.
  6. Human-inspired Perspectives: A Survey on AI Long-term Memory: This systematic investigation examines AI systems' long-term memory capabilities, analyzing mapping relationships between human and AI memory systems. SALM's cognitive architecture guides next-generation AI development, with focus on future directions and application prospects for enhanced information processing.
  7. Self-Evolved Reward Learning for LLMs: RLHF uses reward models to enhance language models through human feedback. Self-Evolved Reward Learning (SER) offers a new approach for improving AI systems like GPT-4 and ChatGPT. Using models like Mistral and Llama 3, experiments on HH-RLHF and UltraFeedback datasets demonstrated how reward models can generate self-improving training data. Their findings suggest significant potential for improving AI development.


👍 If you enjoyed this article, give it a like and share it with your peers.


Victory Adugbo

Hacking Growth for AI, Web3, and FinTech Companies | Blockchain Instructor at CCHUB | Driving Innovation and Building World Class Business Solutions at COHORTE

2mo

Plain text might give us the words, but HTML gives us structure—and structure can be just as important for context. HtmlRAG’s results across six QA datasets show this approach might set a new standard for RAG in terms of accuracy and relevance. Do others think HTML-based RAG could become the default for knowledge-augmented AI?

If HtmlRAG can consistently outperform plain text retrieval, we might see a shift in how RAG pipelines are built, with HTML parsing becoming a core component. This could bring about more efficient, context-aware AI applications. Anyone else think this approach could raise the bar for RAG standards?

Mathieu Gosbee

Senior Software Developer, Ex-Zynga, Bose Music on iOS | NodeJS | React | Python

2mo

HTML may be better, but it's necessary to remove every unnecessary bit of it. It's extremely verbose as a layout language. You're going to blow through your context window if you keep using the entire HTML.


More articles by Pascal Biese

  • 🧑🔬 AI Cutting Research Costs by 84%
    In this issue: AI helping researchers to be more efficient; LLMs being unreliable when reasoning about time; Evaluating…

  • 🤗 AI Agents: Quick & Easy
    In this issue: AI agents in a few lines of code; An introduction to Graph Neural Networks; LLMs for complex medical…

  • 🎁 Meta Reveals New AI Architecture
    In this issue: How Meta wants to take LLMs to the next level; A smaller, more transparent o1 alternative; Graph agents…

  • 🌱 Another ChatGPT Moment
    In this issue: Simulation’s ChatGPT moment; The new era of test-time compute; A company with no humans

  • 🗣️ Microsoft's Best Small Language Model
    In this issue: Microsoft’s best small language model; Graph Networks learning without a lot of labels; A new go-to…

  • 🧪 The First Fully AI-Designed Drug... Almost
    In this issue: AI agents designed a new antibody against SARS-CoV-2; Semantic backpropagation for AI agents; Amazon…

  • 🥇 GraphRAG's Biggest Problem Solved
    In this issue: A new standard for GraphRAG; Replicating OpenAI’s strongest model; LLM-“brained” agents for your devices…

  • 🍓 Actually Open AI: A Free o1 Alternative
    In this issue: An open o1-like model; The LLM Engineer Handbook; NVIDIA mixing attention with state spaces

  • 🤖 The Future of Designing AI Agents
    In this issue: Towards efficient graph foundation models; From LLMOps to AgentOps; A Text-to-SQL dataset that breaks LLMs…

  • 🤏 All You Need to Know About Small Language Models
    In this issue: A survey on SLMs; A way towards more brain-like inference; How to better count the r’s in strawberry…
