Models are what they eat: high-quality data leads to high-quality models, enabling faster training of better models with fewer parameters. However, identifying and curating high-quality data at scale, automatically, is an incredibly challenging problem that requires deep expertise. Our goal at DatologyAI is to make state-of-the-art data curation accessible to anyone who wants to train a model, and we’ve been hard at work realizing this vision over the last year.
On a personal note, I am so proud of the incredible work our small but mighty team has accomplished, and today I’m incredibly excited to share our first set of results at DatologyAI! We focused on contrastive models (à la CLIP) trained on the large-scale DataComp dataset, and the results we’ve been able to achieve have exceeded our already high expectations!
Train Faster - Models trained on DatologyAI’s optimized dataset reach the same performance with up to ~98% less compute, meaning they cost dramatically less to train and train dramatically faster!
Train Better - For the same compute budget, models trained on our optimized data achieve up to 13 absolute percentage points better performance than models trained on raw data.
Train Smaller - By training on our curated data, models with >60% fewer parameters reach better performance.
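For readers less familiar with the setup: “contrastive (à la CLIP)” here means training paired image and text encoders with a symmetric contrastive objective over image–caption pairs. Below is a minimal, illustrative sketch of that standard loss in PyTorch; it is not DatologyAI’s curation pipeline, and the function name and default temperature are placeholders for illustration only:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_embeds, text_embeds: [batch_size, embed_dim] outputs of the two encoders.
    Each image is a positive for its paired caption and a negative for every
    other caption in the batch (and vice versa).
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # [batch_size, batch_size] similarity matrix, scaled by temperature.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pair for each row sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because every other example in the batch serves as a negative, the quality and diversity of the training pairs directly shape what this objective learns, which is why data curation matters so much for these models.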
Check out our high-level blog post here (https://shorturl.at/jkYqk), and if you’re interested in all the nitty-gritty details, check out our technical deep dive here (https://shorturl.at/Mt0k9).
We are so excited about these results, and we are just getting started! Stay tuned for more exciting results on text models coming very soon!