This week marks a significant milestone: the release of the largest open-source training dataset to date, containing over 500 𝐛𝐢𝐥𝐥𝐢𝐨𝐧 𝐭𝐨𝐤𝐞𝐧𝐬. OpenAI has said it is impossible to train AI models without slurping up copyrighted content.
The Common Corpus is the result of a global open collaboration, coordinated by pleias, founded by the brilliant Anastasia Stasenko, and involving key players in LLM training, AI ethics, and cultural heritage. The initiative is supported by major organizations committed to an open science approach to AI, including Hugging Face, Occiglot, EleutherAI, Nomic AI and OpenLLM France 🇫🇷, with backing from the Ministère de la Culture and the Direction interministérielle du numérique (DINUM), as well as Scaleway.
This development follows up on a trend in which transparent weights and code for pre-trained models have been boosting innovation and accessibility across various sectors. Where last year was about open weights and code, this year looks set to become the year of open and free data. Even though it is all still at an early stage, this is truly transformative!
Particularly in healthcare, adopting this 𝐨𝐩𝐞𝐧 𝐚𝐧𝐝 𝐟𝐫𝐞𝐞 𝐝𝐚𝐭𝐚 𝐦𝐢𝐧𝐝𝐬𝐞𝐭 could lead to substantial benefits. Concerns about the re-identification risks of opening up healthcare data are valid, but they often overlook the potential of anonymized datasets, such as those derived from histopathology or synthesized radiology imaging, to name just a few. Many of those who raise these concerns hide behind alibis that cleverly disguise their personal interests, and they do nothing to break down the walls that block access.
It's time to shift the conversation from the risks to the opportunities. Embracing open-source licenses for such data could significantly accelerate innovation and reduce costs, thereby improving accessibility for all. Let's champion the move towards more open data and pave the way for a new era of inclusivity and progress in healthcare and beyond.
Happy that my Dutch neighbors from ICT&health are giving me a platform to share these ideas, this vision and concrete proposals for action. I will talk about exactly that! See you, together with 10,000 other attendees, on 15 and 16 May at the 𝐈𝐂𝐓&𝐡𝐞𝐚𝐥𝐭𝐡 𝐖𝐨𝐫𝐥𝐝 𝐂𝐨𝐧𝐟𝐞𝐫𝐞𝐧𝐜𝐞. Let's break these walls together and truly democratize access to health data and AI. Link to the conference in the comments.
#datasolidarity #datacommons
with Conny Helder Bianca Rouwenhorst Abigail Norville John Halamka, M.D., M.S. Helen Mertens Alain Labrique-Bernard Diederik Gommers Lucien Engelen Dave deBronkart Alessandro Bozzon Maaike Kleinsmann Maurice Magnee Daan Dohmen
"Given the ongoing developments in the healthcare sector, I feel a strong need to share critical questions about AI with an influential audience," stresses Bart De Witte during our conversation, his voice full of seriousness and dedication.
As one of the keynote speakers at the ICT&health World Conference in May, his presence is not only a tribute to his expertise but above all a recognition of the seriousness of the subject he will address.
His participation therefore promises an in-depth analysis of the challenges and opportunities that come with integrating AI into healthcare. With a down-to-earth view of reality and a sharp eye for ethical considerations, De Witte will present the complexity of the subject in a clear and accessible way. 👇🏽#zorg #ai #data #IWC24
Data belongs to and is for all of us | ICT&health
icthealth.nl