Harnessing AI for Log Analysis Using AI Functions in Databricks

In today’s data-driven world, quickly identifying and resolving issues in data pipelines (both real-time and batch) is key to improving workflow efficiency. As data complexity grows, robust troubleshooting methods become increasingly vital for understanding historical error patterns, conducting effective post-mortems, and proactively monitoring system and application errors. By integrating best practices into your development lifecycle, you can mitigate risks and enhance performance. This blog post explores how AI functions in Databricks can turn failure logs into actionable insights.

The Challenge

When job failures occur due to third-party data issues, corrupted upstream files, delayed source table updates, code errors, or unexpected data type changes, the troubleshooting process can become overwhelming. These failures are generally logged across different systems within your organization, making seamless access to this data essential for uncovering root causes and improving efficiency.

However, identifying which specific source system triggered the failure is often a challenge. You may have to dig through multiple logs, switch between tools, or even fall back on the command line to piece the information together. This fragmented process slows down troubleshooting and increases the chances of missing critical connections or root causes.

Creating Sample Log Data

To streamline the troubleshooting process, consider organizing your logs into multiple layers such as Bronze, Silver and Gold. This approach simplifies your debugging cycle, allowing you to quickly understand the sequence of events from different systems that impacted your job at a particular timestamp.
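
To make this concrete, here is a minimal PySpark sketch of such a layered layout, assuming a Databricks notebook where spark is predefined; the source path and the table names (main.logs.bronze_job_logs, main.logs.silver_job_logs, main.logs.gold_error_summary) are placeholders for your own setup.

```python
from pyspark.sql import functions as F

# Bronze: land the raw log files as-is, adding ingestion metadata.
# The source path and table names below are hypothetical.
raw_logs = (
    spark.read.format("json")
         .load("/Volumes/main/logs/raw/")
         .withColumn("ingestTime", F.current_timestamp())
)
raw_logs.write.mode("append").saveAsTable("main.logs.bronze_job_logs")

# Silver: parse timestamps and keep only the entries you troubleshoot with.
silver_logs = (
    spark.table("main.logs.bronze_job_logs")
         .withColumn("logTimestamp", F.to_timestamp("logTimestamp"))
         .filter(F.col("logLevel").isin("WARN", "ERROR"))
)
silver_logs.write.mode("append").saveAsTable("main.logs.silver_job_logs")

# Gold: aggregate error counts per job and component for monitoring dashboards.
gold_logs = (
    spark.table("main.logs.silver_job_logs")
         .groupBy("jobId", "component", "errorCode")
         .count()
)
gold_logs.write.mode("overwrite").saveAsTable("main.logs.gold_error_summary")
```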

To analyze and troubleshoot these common issues effectively, let's create a sample log dataset that simulates various Spark/streaming job errors. Here is a summary of how I reproduced these issues, with a short PySpark sketch of two of them after the list:

  • OutOfMemoryError (OOM): Cache a large DataFrame with excessive data that exceeds executor memory limits.
  • Task Not Serializable Exception: Attempt to use a non-serializable object in an RDD operation, causing the task to fail.
  • FileNotFoundException: Attempt to read from a non-existent file path.
  • Shuffle Fetch Failed: Perform a large shuffle operation on a DataFrame with many partitions, which may lead to failures.
  • GC Overhead Limit Exceeded: Generate large datasets that lead to frequent garbage collection, exhausting memory resources.
  • ClassNotFoundException: Try to import a nonexistent class to simulate a missing class error.
  • Job Aborted due to Stage Failure: Force a stage failure in an RDD operation by raising a runtime error.
  • Permission Denied: Attempt to execute a SQL command that violates permissions, such as dropping a restricted table.
  • Executor Lost Failure: Create a large DataFrame and perform operations that may lead to executor failures.
  • Spark Driver Failure: Collect large datasets at the driver, causing it to exhaust its memory.

By simulating these scenarios, I generated a comprehensive log dataset (uploaded to Hugging Face) and created a table in the Databricks Catalog to analyze common failure points. This test data simulates log entries from various data engineering workloads (Spark and Kafka development, in this case) and features the following key fields; a sketch of how the table can be registered follows the list:

  • logTimestamp: When the log entry was generated.
  • logLevel: The severity level (e.g., INFO, ERROR).
  • component: The source of the log message.
  • logMessage: A description of the event or error.
  • jobId: Unique identifier for the workload.
  • taskId: Identifier for the specific task within the job.
  • executorId: Identifier for the processing executor.
  • topicName: Name of the data stream topic.
  • offset: Offset in the data stream for consistency.
  • errorCode: Associated error codes for additional context.
  • clusterId: Identifier for the executing cluster.
  • clusterHealth: Health status of the cluster.
  • storageSize: Size of storage used by the job.
  • volumeProcessed: Amount of data processed.
  • className: The class responsible for the log entry.

Integrating AI for Troubleshooting

Before diving into detailed insights, let's start with an initial look at how these log entries are structured, based on the logMessage column in the table. To achieve this, we can leverage the ai_summarize and ai_classify functions in Databricks SQL to quickly summarize and classify the logs.

SQL query to find log patterns
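
A sketch of what such a query can look like, run from a notebook via spark.sql, is shown below; the table name, the category labels, and the word limit are assumptions, and the built-in ai_classify and ai_summarize functions must be available in your workspace.

```python
# First-pass overview of error logs using Databricks SQL AI functions.
# Table name and category labels are illustrative, not from the original post.
log_patterns = spark.sql("""
  SELECT
    ai_classify(
      logMessage,
      ARRAY('memory pressure', 'serialization', 'missing file or class',
            'shuffle failure', 'permission issue',
            'executor or driver failure', 'other')
    )                            AS error_category,
    ai_summarize(logMessage, 20) AS message_summary,
    COUNT(*)                     AS occurrences
  FROM main.troubleshooting.job_logs
  WHERE logLevel = 'ERROR'
  GROUP BY 1, 2
  ORDER BY occurrences DESC
""")
display(log_patterns)
```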
Gaining even these high-level insights can take considerable time during day-to-day troubleshooting of Spark and streaming issues. With AI functions in Databricks, however, they can be identified quickly, thanks to the knowledge already embedded in the foundation models Databricks provides.

To bring these AI-driven insights into our troubleshooting workflow, we can use the following Databricks SQL query.


Databricks SQL using ai_query function
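
A minimal sketch of such a query, again run via spark.sql and assuming ai_query is enabled in your workspace, is shown below; the table name, the fields fed to the model, and the exact prompt wording are assumptions, while the endpoint name matches the model referenced in this post.

```python
# Per-job analysis: concatenate each job's log entries into one string and ask
# the foundation model for insights. Prompt wording and table name are illustrative.
job_insights = spark.sql("""
  WITH job_logs AS (
    SELECT
      jobId,
      ARRAY_JOIN(
        COLLECT_SET(
          CONCAT_WS('<=>',
                    CAST(logTimestamp AS STRING), logLevel, component,
                    logMessage, taskId, executorId, errorCode, clusterHealth)
        ),
        '\\n'
      ) AS formatted_logs
    FROM main.troubleshooting.job_logs
    GROUP BY jobId
  )
  SELECT
    jobId,
    ai_query(
      'databricks-meta-llama-3-1-405b-instruct',
      CONCAT(
        'You are a data engineering assistant. Given the following job logs ',
        '(fields separated by <=>), provide: (1) the sequence of events, ',
        '(2) steps to reproduce the issue, (3) recommendations to resolve it, ',
        'and (4) any data complexity concerns.\\n\\n',
        formatted_logs
      )
    ) AS ai_insights
  FROM job_logs
""")
display(job_insights)
```

Aggregating per jobId keeps each prompt focused on a single failure and keeps the amount of text sent to the model in each ai_query call manageable.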

Key references in the SQL

For each jobId, the query collects the relevant log details (logTimestamp, logLevel, component, and so on) into a set and concatenates these fields into a single formatted string with CONCAT_WS("<=>", ...).


Formatted log data in one column

The query calls an AI model (databricks-meta-llama-3-1-405b-instruct) using the AI_QUERY function to analyze the logs.

Why Use a Concatenated Column Approach?

The Meta Llama 3.1 model served on Databricks, like many large language models, processes text-based input rather than structured data such as tables. To provide the model with comprehensive context, we use a concatenated column approach.

This method enhances the model's understanding of relationships between data points, enabling it to generate more relevant insights. While other approaches like structured text representations or prompt engineering are also valid, the concatenated column approach offers a straightforward way to convey complex information effectively.

In this query, we leverage the ai_query function to pass structured prompts to the Databricks Meta LLaMA 3.1 model, enabling it to analyze job logs and provide essential information such as sequences of events, reproduction steps for encountered problems, troubleshooting strategies, and best practices for your data engineering workloads.

The Output

For each jobId, the model's response covers four areas:

(i) Sequence of log events
(ii) Steps to reproduce the issue
(iii) Recommendations to resolve the issue
(iv) Data complexity

It’s important to note that while the recommendations provided by AI may not be 100% perfect at first glance, they serve as valuable starting points for addressing potential issues. By continuously fine-tuning the model on your enterprise data, you can achieve more accurate outcomes over time. To operationalize this, you can use the Databricks Mosaic AI model training capabilities to incorporate new knowledge into your foundation models through chat completion, continued pre-training, and instruction fine-tuning. Eventually, you can deploy the result as a Databricks App within minutes, allowing many of your users to leverage the power of AI in their day-to-day operations.

The Benefits

The insights provided by the AI-driven analysis significantly enhance our troubleshooting process. By delivering a clear sequence of events, detailed steps to reproduce the problem, and targeted recommendations, this approach minimizes the time spent scouring the web or waiting for a technical expert's assistance.

For instance, the model identifies potential data complexity issues such as data skew and network latency, enabling faster resolutions. With this streamlined process, you can take immediate action based on informed recommendations. Integrating AI into the troubleshooting process offers several advantages:

  • Time Efficiency: Automation significantly reduces the time spent manually sifting through logs.
  • Actionable Insights: AI provides context-aware recommendations that improve responses to job failures.
  • Proactive Management: Identifying patterns in errors allows teams to take preemptive actions, mitigating risks effectively.


Conclusion

While there are indeed several AI-based log analysis tools available on the market, leveraging Databricks AI functions and foundation models offers a cost-effective and secure solution tailored to your enterprise needs. By building a custom pipeline on Databricks, you maintain full control over your data while ensuring scalability, flexibility, and security. This approach lets you not only analyze logs efficiently but also integrate seamlessly with your existing infrastructure, optimizing costs in the long run.

As we navigate the complexities of big data processing, leveraging AI technologies such as the Meta Llama 3.1 model on Databricks for troubleshooting can be transformative. The SQL query outlined above, combined with the simulated log data, serves as a powerful tool for data engineers, fundamentally changing how we approach failures in our data pipelines. By streamlining logs into structured layers and harnessing AI, we can enhance workflows, optimize performance, and drive greater value from our data initiatives.
