Future of Data Centric AI - Aug 2022

Image credit: https://uwm.edu/sce/program_area/data-analysis/

Disclaimer: The views expressed here are my own and not those of my employer, Primer.ai.

Snorkel AI organized a two-day event on the future of data-centric AI on Aug 3-4, 2022. The event had stellar speakers and flawless organization, and I was fortunate to attend a few of the sessions. Here, I am sharing my takeaways through my lens of the world. Apart from the keynotes, the sessions were organized in three tracks:

  1. Data track
  2. Techniques and workflows
  3. Applications track

In addition to the virtual conference, there was a workshop earlier in the week for ~150 attendees to try out new features on the platform (the platform has a UI and a Python SDK). Nothing better than direct customer feedback - kudos.

The move from model-centric to data-centric AI is real, given that large-scale AI models are readily available from hubs like Hugging Face and PyTorch Hub, making AI accessible to every organization in the world without expensive scientists or a government-funded DARPA program. Of course, getting a pre-trained model into production is a conference topic of its own, given the ML operations and data engineering needs.
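As a rough illustration of how low that barrier has become (my own sketch, not from any of the talks), here is a minimal example that pulls a pre-trained model from the Hugging Face Hub, assuming the transformers package is installed; it simply uses the library's default checkpoint for the task:

```python
# Minimal sketch: download a pre-trained sentiment model from the Hugging Face Hub
# and run it locally - no labeled data, custom training, or research team required.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads the task's default checkpoint
print(classifier("Data-centric AI puts the iteration loop on the data, not the model."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```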

Here are my highlights:

  1. Use of ontologies and multiple knowledge bases to codify rich knowledge, and use of ML models to auto-generate and expand these knowledge bases in domains such as medicine
  2. Snorkel AI can utilize tons of unlabeled data with programmatic labeling to improve model performance (faster time to production); see the sketch after the slide credit below
  3. Some of the real-world challenges in labeling data include:

  • Private nature of the data
  • Subject matter expertise required
  • Real-world data and objectives changing rapidly

Slide credit: Alex Ratner
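Programmatic labeling (point 2 above) can be illustrated with the open-source snorkel package: labeling functions encode heuristics over unlabeled examples, and a label model combines their noisy votes into probabilistic training labels. The following is a minimal sketch with made-up labeling functions and a toy DataFrame, not Snorkel AI's platform workflow:

```python
# Minimal sketch of programmatic labeling with the open-source snorkel package.
# The labeling functions and toy data below are illustrative, not from the talk.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_URGENT, URGENT = -1, 0, 1

@labeling_function()
def lf_mentions_outage(x):
    # Heuristic: tickets mentioning "outage" are likely urgent.
    return URGENT if "outage" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_feature_request(x):
    # Heuristic: feature requests are usually not urgent.
    return NOT_URGENT if "feature request" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Production outage since 9am, please help",
    "Feature request: dark mode for the dashboard",
    "Cannot log in after the latest release",
]})

# Apply the labeling functions to unlabeled data, then denoise their votes.
applier = PandasLFApplier([lf_mentions_outage, lf_feature_request])
L_train = applier.apply(df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=42)
probabilistic_labels = label_model.predict_proba(L_train)
print(probabilistic_labels)  # weak labels to train a downstream model on
```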

  • Rich Baich, Chief Information Security Officer at the CIA, shared the cybersecurity use case: the need to collect information and, more importantly, "turn information into action". His informative talk elaborated on the steps in the analyst process.

There are four key steps in the analysis process:

  1. Know what information is needed for averting risks - information providers (analysts)
  2. The operator uses the information to analyze, correlate, and enrich with cyber intelligence, and presents options to leaders for making a decision
  3. Turn information into action by:

  • Identifying opportunities and risks - vulnerabilities, patching, illicit-type behavior, configuration faults, etc. The system has to be good at alerting and understanding.
  • Working across stakeholders on the workflow, which is collaborative.
  • A remediation stage for action

  4. Perform assurance - confirm that remediation has occurred and continuously monitor. Techniques associated with AI/ML come into play for this assurance.

  • Then, Sebastian Ruder talked about scaling to the next 1000 languages. Language research is early and hard, as we need local input and domain models to meet users where they are. He categorizes the issues into two broad categories, Models and Data:

Models

  • Multi-modality: Text on the web is rare for the next 1000 languages. Data exists in handwritten, non-digitized lexicons, books, radio, and YouTube videos. Speech is more common than written text (we speak more than we write - wish it were the other way :)

  • Computational efficiency: A gigantic model is not particularly useful, even if we think multilinguality will be less of a problem with large-scale models.

Data

  • Real-world evaluation: Most models are English-centric. We need annotation data gathered by working closely with local speaker communities, rather than relying on US crowdsourced contributors with a Western-centric view.

  • Language varieties: Dialect varieties are as important as the original language.

Slide credit: Sebastian Ruder. The largest models are still English-centric.

Quick mentions:

Snowflake: Snowpark Pipelines by Ahmad Raza Khan

ML pipelines are needed to clean and organize your data, preparing it for prime-time training and serving of models. Many ML firms build their own pipelines with open-source tools. Snowflake talked about Snowpark, which lets you run Python pipelines inside Snowflake, using scalable SQL and Python in the cloud. With Snowpark, users can (see the sketch after this list):

●     Abstract away container management with user-defined functions (UDFs) or stored procedures.

●     Use Airflow schedulers or stored procedures to run your models.

●     Snowpark does not leverage Spark for distributed computing; instead, code gets converted to SQL statements for performance.
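For concreteness, here is a minimal sketch using the snowflake-snowpark-python package; the connection parameters, UDF, and table names are hypothetical placeholders of mine, not details from the talk:

```python
# Minimal Snowpark sketch: clean a text column inside Snowflake with a Python UDF.
# Connection parameters and table names below are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import StringType

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Register a Python UDF; Snowflake runs it server-side, so there is no
# container or cluster for the user to manage.
@udf(name="normalize_text", return_type=StringType(),
     input_types=[StringType()], replace=True)
def normalize_text(s: str) -> str:
    return s.strip().lower() if s else s

# DataFrame operations are translated into SQL and executed in the warehouse.
cleaned = (
    session.table("RAW_EVENTS")  # hypothetical source table
    .with_column("TEXT_CLEAN", normalize_text(col("TEXT")))
    .filter(col("TEXT_CLEAN").is_not_null())
)
cleaned.write.save_as_table("CLEAN_EVENTS", mode="overwrite")
```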

QuantumBlack, AI by McKinsey's slide on the importance of data in the ML dev lifecycle.

Grammarly's Timo Mertens simplified the stages of communication into Conception, Composition, Revision, and Comprehension.

Slide credit: Timo Mertens

Thank you Devang Sachdev, Aparna Lakshmiratan, Harshini Jayaram, Friea Berg, Aarti Bagul, Alyssa Maruyama, John Marini, Karla Arteaga, and the team at Snorkel AI for organizing this informative virtual conference for the technical AI community. It was a huge success in bringing folks together in a tough post-pandemic period, culminating in perfect weather at the outdoor event in Menlo Park.


