Future of Data Centric AI - Aug 2022
Disclaimer: The views expressed here are my own and not that of my employer Primer.ai
Snorkel AI organized a two-day event on the future of data-centric AI on Aug 3-4, 2022. The event had stellar speakers and flawless organization, and I was fortunate to attend a few of the sessions. Here, I am sharing my takeaways through my own lens of the world. Apart from the keynotes, the sessions were organized in three tracks.
In addition to this virtual conference, there was a workshop earlier this week for ~150 attendees to try out new features on the platform (the platform has a UI and a Python SDK). Nothing beats direct customer feedback. Kudos.
The move from model-centric to data-centric AI is real. Large-scale AI models are easily available from sites like Hugging Face and PyTorch Hub, which makes AI accessible to every organization in the world without expensive scientists or a government-funded DARPA program. Of course, getting such a pre-trained model into production is a conference topic on its own, given the ML operations and data engineering needs.
Here are my highlights:
Slide Credit - Alex Ratner
There are four key steps in the analysis process:
Models
- Multi-modality: Text on the web is rare for the next 1000 languages. The data lives in handwritten, non-digitized lexicons, books, radio, and YouTube videos. Speech is more common than written text (we speak more than we write; I wish it were the other way :)
- Computational efficiency: A gigantic model is not particularly useful even if we think that multi-linguality will be less of a problem with large scale models.
Data
- Real world Evaluation: Most models are English centric. Annotation data today is influenced by US crowdsourced contributors with a Western-centric view, so we need to gather it by working closely with local speaker communities.
- Language Varieties: The dialect varieties are as important as the original language.
Slide Credit - Sebastien Ruder. The largest models are still English centric.
Quick mentions:
Snowflake: Snowpark Pipelines by Ahmad Raza Khan
There is a need for ML pipelines that clean and organize your data to prepare it for prime-time training and serving of models. Many ML firms build their own pipelines with open source tools. Snowflake talked about Snowpark, which allows you to run Python pipelines inside Snowflake, using scalable SQL and Python in the cloud. With Snowpark, users can:
● Abstract container management with user defined functions (UDF) or stored procedures.
● Use Airflow schedulers or stored procedures to run their models.
● Skip Spark for distributed computing: Snowpark does not use it, and code is instead converted to SQL statements for performance.
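To make the last point concrete for myself, here is a minimal sketch of the general idea of DataFrame-style Python calls being recorded lazily and compiled into a single SQL statement. This is a toy illustration only, not Snowpark's API: the `LazyFrame` class, its methods, and the `raw_reviews` table are hypothetical names I made up for this example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LazyFrame:
    """Toy lazily evaluated frame (hypothetical; NOT Snowpark's actual API)."""

    table: str
    columns: Tuple[str, ...] = ("*",)
    predicates: Tuple[str, ...] = ()

    def select(self, *columns: str) -> "LazyFrame":
        # Record the projection; nothing is executed yet.
        return LazyFrame(self.table, tuple(columns), self.predicates)

    def filter(self, predicate: str) -> "LazyFrame":
        # Record the predicate; repeated filters are combined with AND.
        return LazyFrame(self.table, self.columns, self.predicates + (predicate,))

    def to_sql(self) -> str:
        # Compile all recorded operations into one SQL statement.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(f"({p})" for p in self.predicates)
        return sql


# A small cleaning pipeline over a hypothetical table.
pipeline = (
    LazyFrame("raw_reviews")
    .filter("text IS NOT NULL")
    .filter("lang = 'en'")
    .select("id", "text")
)
print(pipeline.to_sql())
```

The point of the design is that the Python code only describes the pipeline; the work itself is pushed down to the database as SQL, which is how I understood the performance claim in the talk.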
QuantumBlack, AI by McKinsey's slide on the importance of data in the ML Dev lifecycle.
Grammarly's Timo Mertens simplified the stages of communication as Conception, Composition, Revision and Comprehension.
Slide credit - Timo Mertens
Thank you Devang Sachdev, Aparna Lakshmiratan, Harshini Jayaram, Friea Berg, Aarti Bagul, Alyssa Maruyama, John Marini, Karla Arteaga, and the team at Snorkel AI for organizing this informative virtual conference targeting the technical AI community. It was a huge success in bringing together folks in the tough post-pandemic period, culminating with perfect weather at the outdoor event in Menlo Park.