Future of Data Centric AI - Aug 2022

Image credit: https://uwm.edu/sce/program_area/data-analysis/

Disclaimer: The views expressed here are my own and not those of my employer, Primer.ai.

Snorkel AI organized a two-day event on the future of data-centric AI on Aug 3-4, 2022. The event had stellar speakers and flawless organization, and I was fortunate to attend a few of the sessions. Here, I am sharing my takeaways through my lens of the world. Apart from the keynotes, the sessions were organized in three tracks:

  1. Data track
  2. Techniques and workflows
  3. Applications track

In addition to the virtual conference, there was a workshop earlier in the week for ~150 attendees to try out new features on the platform (the platform has a UI and a Python SDK). Nothing better than direct customer feedback - kudos.

The move from model-centric to data-centric AI is real, given that large-scale AI models are readily available from hubs like Hugging Face and PyTorch Hub, making AI accessible to every organization in the world without expensive scientists or a government-funded DARPA program. Of course, getting a pre-trained model into production is a conference topic of its own, given the ML operations and data engineering needs.
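As a rough illustration of how low that barrier has become (my own sketch, not from any of the talks), here is a minimal example that pulls a pre-trained model from the Hugging Face Hub, assuming the transformers package is installed; it simply uses the library's default checkpoint for the task:

```python
# Minimal sketch: download a pre-trained sentiment model from the Hugging Face Hub
# and run it locally - no labeled data, custom training, or research team required.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads the task's default checkpoint
print(classifier("Data-centric AI puts the iteration loop on the data, not the model."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```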

Here are my highlights:

  1. Use of ontologies and multiple knowledge bases to codify rich knowledge, and use of ML models to auto-generate and expand these knowledge bases in domains such as medicine
  2. Snorkel AI can utilize tons of unlabeled data with programmatic labeling to improve model performance (faster time to production); see the sketch after the slide credit below
  3. Some of the real-world challenges in labeling data include:

  • Private nature of the data
  • Subject matter expertise required
  • Real-world data and objectives changing rapidly

Slide credit: Alex Ratner
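Programmatic labeling (point 2 above) can be illustrated with the open-source snorkel package: labeling functions encode heuristics over unlabeled examples, and a label model combines their noisy votes into probabilistic training labels. The following is a minimal sketch with made-up labeling functions and a toy DataFrame, not Snorkel AI's platform workflow:

```python
# Minimal sketch of programmatic labeling with the open-source snorkel package.
# The labeling functions and toy data below are illustrative, not from the talk.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NOT_URGENT, URGENT = -1, 0, 1

@labeling_function()
def lf_mentions_outage(x):
    # Heuristic: tickets mentioning "outage" are likely urgent.
    return URGENT if "outage" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_feature_request(x):
    # Heuristic: feature requests are usually not urgent.
    return NOT_URGENT if "feature request" in x.text.lower() else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Production outage since 9am, please help",
    "Feature request: dark mode for the dashboard",
    "Cannot log in after the latest release",
]})

# Apply the labeling functions to unlabeled data, then denoise their votes.
applier = PandasLFApplier([lf_mentions_outage, lf_feature_request])
L_train = applier.apply(df_train)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=200, seed=42)
probabilistic_labels = label_model.predict_proba(L_train)
print(probabilistic_labels)  # weak labels to train a downstream model on
```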

  • Rich Baich, Chief Information Security Officer at the CIA, shared the cybersecurity use case: the need to collect information and, more importantly, "turn information into action". His informative talk elaborated on the steps in the analyst process.

There are four key steps in the analysis process:

  1. Know what information is needed for averting risks - information providers (analysts)
  2. The operator uses the information to analyze, correlate, and enrich with cyber intelligence, and presents options to leaders for making a decision
  3. Turn information into action by:

  • Identifying opportunities and risks - vulnerabilities, patching, illicit-type behavior, configuration faults, etc. The system has to be good at alerting and understanding.
  • Working across stakeholders on the workflow, which is collaborative.
  • A remediation stage for action

  4. Perform assurance - confirm that remediation has occurred and continuously monitor. Techniques associated with AI/ML come into play for this assurance.

  • Then, Sebastian Ruder talked about scaling to the next 1000 languages. Language research is early and hard, as we need local input and domain models to meet users where they are. He categorizes the issues into two broad categories, Models and Data:

Models

  • Multi-modality: Text on the web is rare for the next 1000 languages. Data exists in handwritten, non-digitized lexicons, books, radio, and YouTube videos. Speech is more common than written text (we speak more than we write - wish it were the other way :)

  • Computational efficiency: A gigantic model is not particularly useful, even if we think multilinguality will be less of a problem with large-scale models.

Data

  • Real-world evaluation: Most models are English-centric. We need annotation data gathered by working closely with local speaker communities, rather than relying on US crowdsourced contributors with a Western-centric view.

  • Language varieties: Dialect varieties are as important as the original language.

Slide credit: Sebastian Ruder. The largest models are still English-centric.

Quick mentions:

Snowflake: Snowpark Pipelines by Ahmad Raza Khan

ML pipelines are needed to clean and organize your data, preparing it for prime-time training and serving of models. Many ML firms build their own pipelines with open-source tools. Snowflake talked about Snowpark, which lets you run Python pipelines inside Snowflake, using scalable SQL and Python in the cloud. With Snowpark, users can (see the sketch after this list):

●     Abstract away container management with user-defined functions (UDFs) or stored procedures.

●     Use Airflow schedulers or stored procedures to run your models.

●     Snowpark does not leverage Spark for distributed computing; instead, code gets converted to SQL statements for performance.
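For concreteness, here is a minimal sketch using the snowflake-snowpark-python package; the connection parameters, UDF, and table names are hypothetical placeholders of mine, not details from the talk:

```python
# Minimal Snowpark sketch: clean a text column inside Snowflake with a Python UDF.
# Connection parameters and table names below are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import udf, col
from snowflake.snowpark.types import StringType

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Register a Python UDF; Snowflake runs it server-side, so there is no
# container or cluster for the user to manage.
@udf(name="normalize_text", return_type=StringType(),
     input_types=[StringType()], replace=True)
def normalize_text(s: str) -> str:
    return s.strip().lower() if s else s

# DataFrame operations are translated into SQL and executed in the warehouse.
cleaned = (
    session.table("RAW_EVENTS")  # hypothetical source table
    .with_column("TEXT_CLEAN", normalize_text(col("TEXT")))
    .filter(col("TEXT_CLEAN").is_not_null())
)
cleaned.write.save_as_table("CLEAN_EVENTS", mode="overwrite")
```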

QuantumBlack, AI by McKinsey's slide on the importance of data in the ML dev lifecycle.

Grammarly's Timo Mertens simplified the stages of communication into Conception, Composition, Revision, and Comprehension.

Slide credit: Timo Mertens

Thank you Devang Sachdev, Aparna Lakshmiratan, Harshini Jayaram, Friea Berg, Aarti Bagul, Alyssa Maruyama, John Marini, Karla Arteaga, and the team at Snorkel AI for organizing this informative virtual conference for the technical AI community. It was a huge success in bringing folks together in a tough post-pandemic period, culminating in perfect weather at the outdoor event in Menlo Park.


