4 Stages of Data Modernization for AI (Concluding)
How each stage of next-gen data engineering supports today’s AI.
Summary: The changing needs of modern AI applications force us to rethink how we do data engineering. Data engineering today must be reshaped to enable knowledge creation and reasoning engines, without giving up the operational and semantic needs of traditional insight generation.
From part 2: the structure of this data engineering shift is a set of stages, each addressing specific needs and gaps of AI enablement:
1. Trusted Actionable Insights
2. Traditional ML for quarter-over-quarter revenue/profitability
3. LLM apps, vision products, etc.
4. Multi-component inference, Agents & Systems
Intelligence & Ops artifacts added at each stage of Data Engineering
Each stage needs specific add-on components to enable the kind of semantic intelligence and operational effectiveness that the modern array of AI apps requires. These add-on mechanisms, taken together, are called Data-Intelligence-Ops (see the graphic in the attached paper).
Formally, DataIntelligenceOps is an abstract set of operations meant to increase (a) the semantic intelligence, (b) the operational intelligence, and (c) the governance abilities of data. It builds on top of existing investments in data lakes, cloud EDWs, dbt automation, ELT, feature stores, etc. The main architectural artifacts are:
· Semantic Intelligence Enhancements: a broad set of components for complex data products, which can be aggregated or configured through a low-code IDE.
· Connected DataOps: a connected DataOps architecture that “causally” ties together observability, lineage, storage/governance/security ops, programmable pipelines, and data contracts to create an embedding layer for the above intelligence enhancements. Implemented as a full-featured knowledge graph that captures platform-wide metadata.
· Governance as Code enablement: governance DAGs embeddable within pipelines simplify governance and let policy implementations execute seamlessly.
The effect of DataIntelligenceOps is to enhance the “intelligence” of a firm’s data, thus enabling today’s AI apps.
Parting thoughts:
1. AI apps are rapidly increasing in complexity and capability, so the old boundaries of data engineering no longer apply.
2. The way to enable this AI-led shift is to move to a modern style of data engineering that systematically adds semantic and operational value across 4 stages of maturity.
3. In many cases firms will choose to skip a stage to move faster, and nothing prevents that.
4. Existing building blocks such as ingestion mechanisms, pipeline tools, cloud EDWs, etc., remain unaffected; this is not a rip-and-replace design.
5. Data engineering must now support knowledge enablement, reasoning engines, and quarter-over-quarter AI ROI.
Paper - https://lnkd.in/gqG25drN
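To make the governance-as-code idea concrete, here is a minimal sketch of a policy check expressed as a pipeline step, so governance executes alongside transformations. All names here (policy_no_nulls, mask_pii, run_pipeline, the field names) are illustrative assumptions, not any specific product's API.

```python
# Illustrative "governance as code": policy rules run as ordinary pipeline
# steps, like nodes in a governance DAG embedded in the data flow.

def policy_no_nulls(record: dict) -> bool:
    """A data-contract style rule: required fields must be populated."""
    return all(record.get(f) is not None for f in ("customer_id", "email"))

def mask_pii(record: dict) -> dict:
    """Redact fields tagged as PII before data leaves the pipeline."""
    PII_FIELDS = {"email", "ssn"}
    return {k: ("***" if k in PII_FIELDS else v) for k, v in record.items()}

def run_pipeline(records: list[dict]) -> list[dict]:
    """Apply governance steps in sequence; a real system would also emit
    lineage and observability events for each step."""
    out = []
    for rec in records:
        if not policy_no_nulls(rec):
            continue  # a real pipeline would quarantine, not silently drop
        out.append(mask_pii(rec))
    return out

rows = [
    {"customer_id": 1, "email": "a@x.com", "plan": "pro"},
    {"customer_id": None, "email": "b@x.com", "plan": "free"},
]
print(run_pipeline(rows))  # only the valid record survives, with PII masked
```

Because the policies are plain code, they can be versioned, reviewed, and reused across pipelines like any other artifact.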
Arindam Banerji’s Post
-
Data Engineering in 2024: Pioneering the Future of Data Quantum Leaps, AI Synergy... As we approach the end of 2024, data engineering is evolving rapidly, shaping how organizations leverage their data assets. Here's a concise overview of current trends and future directions:
𝗖𝘂𝗿𝗿𝗲𝗻𝘁 𝗟𝗮𝗻𝗱𝘀𝗰𝗮𝗽𝗲
1. Advanced Data Quality and Observability
- 85% of Fortune 500 companies now use AI-driven data quality tools
- "Quality-as-code" practices are becoming standard
- Causal inference techniques are enhancing anomaly detection
2. Microservices and Event-Driven Architectures
- 78% of organizations use event streaming for critical operations
- Data contracts are widely used to manage inter-service dependencies
- Specialized data mesh platforms are emerging
3. Cloud-Native and Multi-Cloud Strategies
- 92% of enterprises employ multi-cloud strategies
- The cloud-agnostic data tools market has grown 200% since 2022
- "Cloud-agnostic data fabrics" provide consistent governance across clouds
𝗖𝘂𝘁𝘁𝗶𝗻𝗴-𝗘𝗱𝗴𝗲 𝗧𝗿𝗲𝗻𝗱𝘀
1. AI-Augmented Data Engineering
- 70% of data engineering tasks are now AI-assisted
- Large language models generate and optimize ETL code
- "AIOps for data" platforms predict and prevent pipeline failures
2. Quantum-Ready Data Infrastructure
- 15% of Fortune 100 companies have initiated quantum-ready projects
- Investment in quantum-resistant encryption has grown 300% since 2022
- Quantum machine learning is being explored for complex data analysis
3. Edge Computing and Real-Time Analytics
- 65% of enterprises process some data at the edge
- "Edge data mesh" architectures enable distributed processing
- 5G and satellite internet facilitate real-time data streaming from remote locations
𝗥𝗲𝗴𝗶𝗼𝗻𝗮𝗹 𝗩𝗮𝗿𝗶𝗮𝘁𝗶𝗼𝗻𝘀
- North America leads in AI-augmented data engineering adoption
- Europe shows the highest adoption of privacy-enhancing technologies
- Asia-Pacific leads in edge computing, especially in manufacturing and smart cities
- Latin America sees the fastest cloud adoption growth for data workloads
𝗙𝘂𝘁𝘂𝗿𝗲 𝗢𝘂𝘁𝗹𝗼𝗼𝗸
1. Autonomous Data Ecosystems: Expected by 2026, self-optimizing and self-healing
2. Quantum Data Analytics: Significant advantages in specific domains by 2027
3. Brain-Computer Interfaces: Experimental systems for data interaction by 2028
4. Ethical AI Governance Platforms: Widespread adoption expected by 2025
5. Exascale Data Processing: Available as a service by 2026
𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻
Data engineering in 2024 spearheads innovation in AI, quantum readiness, edge processing, and ethical data practices. As we approach 2025, the field promises both incremental gains and paradigm shifts in data handling. Organizations adept at navigating these trends will lead our data-driven future.
👋 I'm Siddhartha Vemuganti, Data Engineering & AI/ML leader. Passionate about scalable AI futures. Repost ♻️, Follow & 🔔 for more insights on data, AI, and tech's future!
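The "quality-as-code" trend mentioned above can be sketched in a few lines: quality rules live in version control and run inside the pipeline. The rule set and helper names below are illustrative, not any specific tool's API.

```python
# A tiny quality-as-code sketch: data quality rules expressed as code,
# so they are versioned, reviewed, and executed like any pipeline step.

RULES = {
    "amount_non_negative": lambda row: row["amount"] >= 0,
    "currency_known": lambda row: row["currency"] in {"USD", "EUR", "GBP"},
}

def check_row(row: dict) -> list[str]:
    """Return the names of the rules this row violates."""
    return [name for name, rule in RULES.items() if not rule(row)]

good = {"amount": 10.0, "currency": "USD"}
bad = {"amount": -5.0, "currency": "XYZ"}
print(check_row(good))  # no violations
print(check_row(bad))   # both rules fail
```

Production tools (dbt tests, Great Expectations, and similar) follow the same pattern at larger scale, adding reporting and alerting on top.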
-
About 1/3 of GenAI adopters lack the skills or tools to prepare unstructured data for language models. So data engineering teams have some work to do. To put these survey findings in perspective, check out my new blog, sponsored by Datavolo: "Why and How Data Engineers Will Enable the Next Phase of Generative AI." https://lnkd.in/gX82Jp3N Excerpts below. I'd love to hear stories from practitioners who have solved this problem, and from those who still struggle with it.
Because GenAI largely consumes unstructured data, data engineers must build new pipelines that effectively process and deliver this type of data. This is a new challenge because data engineers have historically focused on structured tables rather than unstructured documents, images, or video files. Today unstructured data sloshes through email systems, CRM applications, videoconferencing software, and other parts of the organization. Companies need to consolidate, parse, and prepare this data for GenAI. Here is an example of an unstructured data pipeline that does this with text files.
> Extract: First, the pipeline parses and extracts relevant text and metadata from applications and files, including complex documents with embedded figures and tables.
> Transform: Next, the pipeline transforms the extracted documents. It divides the text into semantic “chunks” and creates vectors that describe the meaning and interrelationships of the chunks. It also might enrich these document chunks with data from other systems and data platforms. (Some pipeline tools perform these transformation steps in an intermediate landing zone using an ELTL sequence.)
> Load: Finally, it loads these vectors into a vector database such as Pinecone or Weaviate, or a vector-capable platform such as Databricks or @MongoDB. The vectors are now ready to support GenAI.
Data engineers must design, deploy, and monitor these pipelines, and orchestrate how they interact with vector databases and GenAI applications. They also might need to orchestrate how GenAI applications integrate with predictive ML models or other analytical functions as part of larger workflows. In addition, data engineers need to observe both data quality (for example, ensuring no vectors are lost or duplicated) and pipeline performance. Data/AI experts, what do you think? Chime in here. Thank you Luke Roquet and Sam Lachterman for sharing perspectives on the experiences of Datavolo customers in this area. Syed Tanveer Jishan Brian Greene Stephen Rausch Andrea Pisoni Landon Walsh Randolf Reiss Garth Miles Shawn Rogers Debra Peryea Wayne Eckerson Jay Piscioneri Abdul Fahad Noori #dataengineering #generativeai #genai
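The extract, transform, load flow described above can be sketched end to end in a few lines. This is a toy model under stated assumptions: the embedding is a deterministic hash-based vector standing in for a real embedding model, and the "vector store" is a plain dict standing in for a database such as Pinecone or Weaviate; the function names are hypothetical.

```python
# Minimal sketch of an unstructured-data pipeline: chunk text, embed each
# chunk, and load the vectors into a store keyed by document and chunk id.
import hashlib

def chunk(text: str, size: int = 40) -> list[str]:
    """Split extracted text into fixed-size chunks. Real pipelines chunk on
    semantic boundaries such as sentences or paragraphs instead."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str, dims: int = 4) -> list[float]:
    """Toy deterministic 'embedding'; a real pipeline calls an embedding model."""
    digest = hashlib.sha256(chunk_text.encode()).digest()
    return [b / 255 for b in digest[:dims]]

vector_store: dict[str, dict] = {}  # stand-in for a vector database

def load(doc_id: str, text: str) -> int:
    """Transform chunks into vectors and load them into the store."""
    chunks = chunk(text)
    for i, c in enumerate(chunks):
        vector_store[f"{doc_id}:{i}"] = {"text": c, "vector": embed(c)}
    return len(chunks)

n = load("doc1", "Unstructured data sloshes through email systems and CRM apps.")
print(n, len(vector_store))
```

Keying each vector by `doc_id:chunk_index` is one simple way to support the quality checks the post mentions, since lost or duplicated vectors become easy to detect by counting keys per document.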
-
Kevin Petrie is putting together a series on data engineering for AI Systems, the unique challenges and opportunities it presents, and things to consider on the journey. He is always full of insights and well worth following!
-
I agree that the ELT/ETL process for RAG is very similar to that of more "traditional" tabular data. I am really curious how we are going to resolve data quality issues in these new types of pipelines. Previously, some technical ability could be assumed of data owners/producers when the sources were, for example, ERP systems, where generally data-savvy people worked with and produced data. Data engineers could quite easily converse and resolve issues with these owners. With unstructured data, by contrast, literally anyone in the organisation can be producing relevant data, including people who might not speak the "language of data". I expect data governance and data literacy training to take on a more and more important role as we see GenAI rolled out across organisations. That will in turn also increase the quality of the more "traditional" tabular data pipelines and products. Win-win if you ask me.
-
Ranging from experimentation platforms to enhanced ETL models and more, here are some more sessions coming to the 2024 Data Engineering Summit. #DataScience #AI #ArtificialIntelligence https://hubs.li/Q02qzZfh0
-
Data integration refines the fuel that drives the AI/ML model lifecycle, from model development to training to deployment and operation. Here's how. Also check out my latest blog, sponsored by CData Software and published by BARC's partner Eckerson Group: https://lnkd.in/g7uh9f6C Excerpts below. (As for the Porsche graphic, I might have gotten carried away with the analogy that highly refined fuel (i.e., data) drives high-performance cars (i.e., AI/ML). Our 18-year-old son's passions are rubbing off on me 😲)
To start, let's consider how data integration supports the foundational steps of data labeling and feature engineering for predictive ML models. This process can also apply to recommendation and anomaly detection models, in particular those that consume structured data.
Data scientists, data engineers, and ML engineers start by collecting the historical input data that relates to the business problem at hand. For example, they might integrate their data using a mix of ETL and data virtualization: they extract, transform, and load their operational records into a lakehouse, and create virtual views of distributed unstructured data that is difficult to extract from heritage systems. The data scientist provides close oversight to ensure that both the consolidated dataset and the virtual views meet AI/ML model requirements, no easy task in a heterogeneous environment.
Next, data engineers, ML engineers, and data scientists collaborate with business owners to "label" various outcomes in their historical data sets. This means they add tags to data to easily identify historical outcomes, such as robotic arm failures, fraudulent transactions, or the prices of recent house sales. They also might label customer emails and social media posts as "positive" or "negative" to create an accurate model for classifying customer sentiment. Data engineers and data scientists need to label outcomes accurately and at scale. This requires a programmatic approach, automation, and assistance from the business owners who best understand the domain. Note that labeling applies to supervised ML only. Unsupervised ML, by definition, studies input data without known outcomes, which means the data has no labels.
Data/AI gurus: does this sound right? Tell us about your use cases, best practices, and lessons learned. Stay tuned for blog excerpt 2, focused on the performance of production models. Shawn Rogers Timm Grosser Florian Bigelmaier Joo-Ang "Sue" Raiber Lucia Santamarina Nick Golovin 🇺🇦 Paulina Rios Maya EM360Tech #data #ai #analytics
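The programmatic labeling step described above might look like the sketch below: tag historical records with known outcomes so a supervised model can learn from them. The field names and the fraud heuristic are illustrative assumptions; in practice labels come from confirmed historical outcomes and domain experts, not a hand-written rule.

```python
# Illustrative programmatic labeling for supervised ML: attach an outcome
# tag ("fraud"/"legit") to each historical transaction record.

def label_transaction(txn: dict) -> dict:
    """Label a transaction using a simple heuristic (hypothetical rule:
    large cross-border amounts are flagged); returns a copy with a label."""
    txn = dict(txn)
    is_fraud = txn["amount"] > 10_000 and txn["country"] != txn["card_country"]
    txn["label"] = "fraud" if is_fraud else "legit"
    return txn

history = [
    {"amount": 15_000, "country": "BR", "card_country": "US"},
    {"amount": 120, "country": "US", "card_country": "US"},
]
labeled = [label_transaction(t) for t in history]
print([t["label"] for t in labeled])  # ['fraud', 'legit']
```

Encoding the rule as a function is what makes labeling scale: the same logic can be applied to millions of historical records, then spot-checked by the business owners who understand the domain.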
-
Great read on some of the basic fundamentals around data engineering for AI Systems. Lots more insights to come from Kevin Petrie on this topic!
-
🔍🏗️ Rethinking Your Data Architecture: Powering the Future with Generative AI
McKinsey's June 2023 report, 'The Economic Potential of Generative AI: The Next Productivity Frontier,' highlights some compelling facts and predictions about AI:
🔹 Generative AI could add the equivalent of $2.6 trillion to $4.4 trillion in annual economic benefits across 63 use cases.
🔹 Seventy-two percent of leading organizations note that managing data is already one of the top challenges preventing them from scaling AI use cases.
Traditional data architectures are not optimized for handling the variety and volume of unstructured data, and they struggle to meet the demands of generative AI. To get maximum business value from generative AI, companies must develop adaptable, integrated data architectures tailored to these specific needs. In response to these insights, McKinsey proposes some critical actions for leaders to consider when planning a data architecture redesign:
1️⃣ Focus on value-centric data.
2️⃣ Incorporate features like vector databases to support a wide range of applications, especially for unstructured data.
3️⃣ Implement multiple interventions, both human and automated, to maintain high data quality.
4️⃣ Secure sensitive data and stay adaptive to evolving regulations.
5️⃣ Prioritize hiring skilled data engineers.
6️⃣ Utilize generative AI to enhance tasks across the data value chain.
7️⃣ Invest in monitoring and swiftly address issues to elevate data performance.
🛠️ Building Capabilities into Architecture to Support GenAI Use-Cases 🤖: To utilize generative AI effectively, data leaders must review and revamp existing architectures, ensure a robust data foundation, and identify the upgrades needed for high-value use cases. Here are five critical components for consideration:
1️⃣ Unstructured Data Stores: Map and tag unstructured data for efficient processing and high-quality, transparent data pipelines.
2️⃣ Data Preprocessing: Convert and clean data, especially sensitive data, for generative AI use; standardize large-scale data handling.
3️⃣ Vector Databases: Prioritize and streamline access to contextual data for effective generative AI queries.
4️⃣ LLM Integrations: Implement frameworks and establish guidelines for integrating Large Language Models with multiple systems.
5️⃣ Prompt Engineering: Structure prompts to optimize AI responses, integrating knowledge graphs and standardizing data input protocols.
Curious to learn more about AI and data? Follow me to explore the latest trends, industry insights, and innovative use cases. Boost your generative AI projects with AI-powered web scraping tools. Start now for advanced data extraction! Visit https://lnkd.in/dicsEgqu to begin. #bigdata #webscraping #generativeai #technology
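The interplay of components 3️⃣ and 5️⃣ can be sketched as follows: retrieve the most relevant context and structure it into a prompt. As an assumption for brevity, naive keyword overlap stands in for a vector database similarity search, and all names (DOCS, retrieve, build_prompt) are hypothetical.

```python
# Illustrative prompt engineering over retrieved context: rank documents
# by relevance to the question, then structure them into a single prompt.

DOCS = [
    "Refund policy: refunds are issued within 14 days of purchase.",
    "Shipping: orders ship within 2 business days.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Rank docs by shared words with the question; a real system would
    embed the question and query a vector database instead."""
    q = set(question.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def build_prompt(question: str) -> str:
    """Structure retrieved context and the question into one prompt,
    instructing the model to ground its answer in the context."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long do refunds take?"))
```

Standardizing the prompt template this way is what lets the same architecture serve many use cases: only the retrieved context changes per query.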