AI is at a turning point. It holds the power to transform healthcare, education, and economic opportunity, but only if it is built with public interest at its core. That is why today I am delighted to be one of the 11 AI leaders (Eléonore Crespo, Fidji Simo, Arthur Mensch, Clem Delangue 🤗, Reid Hoffman and others) supporting CurrentAI, a new global partnership committed to building AI in the public interest.

The exponential growth of AI capabilities demands a fundamental shift from closed, proprietary systems to an ecosystem built on transparency and shared knowledge - a shift I've witnessed firsthand as a female founder who chose to challenge the status quo rather than conform to it. By championing open data and open innovation, I believe CurrentAI can greatly democratize AI development, ensuring that transformative technologies become building blocks for larger communities.

With an initial $400 million investment from governments, philanthropies, and industry leaders including pleias, CurrentAI will fund open, accountable, and purpose-driven AI that prioritizes transparency, fairness, and global equity. pleias is contributing Common Corpus to support the efforts of CurrentAI. Common Corpus, which has been supported by The AI Alliance and the Ministère de la Culture, is as of today the biggest fully open dataset in the world for training Large Language Models, at 2 trillion tokens. It fully embraces the values of diversity (the dataset is highly multilingual), ethics, and regulatory compliance (it contains only permissively licensed data and is fully documented, as required by the EU AI Act).

AI must serve the public good. It must be shaped by those who will be most affected by its development. We are committed to building AI that puts people first. CurrentAI addresses a critical inflection point in technological history - whether AI's future will be shaped by radical openness that enables humanity's collective progress, or stifled by the familiar patterns of gatekeeping that have consistently failed to recognize innovation when it doesn't fit the expected mold. pleias and I are enthusiastic to contribute to the success of CurrentAI and its future projects for people-first AI.
About us
- Website: pleias.fr
- Industry: Technology, Information and Internet
- Company size: 2-10 employees
- Type: Self-Owned
Updates
-
pleias releases world's first LLMs trained on exclusively open data

Training large language models required copyrighted data - until it didn't. Today we release the Pleias 1.0 models, a family of fully open small language models and their RAG-tuned counterparts. Pleias 1.0 models come in three sizes: 350m, 1.2b and 3b parameters. Our models are the first to be fully EU AI Act compliant, establishing a clear precedent that truly ethical and socially acceptable AI is possible.

What makes Pleias 1.0 special
Our models are:
* trained exclusively on open data, meaning data that are either non-copyrighted or published under a permissive license
* multilingual, offering strong support for multiple European languages
* safe, with the lowest scores on toxicity benchmarks
* frugal (<3b parameters) yet performant on key tasks such as knowledge retrieval
* able to run efficiently on consumer-grade hardware locally

Specialised pretraining
This release is not just about open and ethical data. We're publishing two specialized models for knowledge retrieval with strong performance: Pleias-Nano (based on Pleias-1b) and Pleias-Pico (based on Pleias-350m) are competitive with the best existing models (Llama, Qwen) ten times their size. They're like the espresso shots of the model world: small, strong, and they get the job done without making a fuss. The small size of our Pico and Nano models makes it possible to build fully local RAG applications running on CPU (see the sketch below). The models are an integral part of our scientific assistant scholasticAI, which we are demoing today for Mozilla Builders.

Not only open source but open science models
The Pleias 1.0 models are not just open-weights but open science models. The models were trained on Nanotron, the open pretraining library from Hugging Face, and we're releasing our code variants and configurations. Along with Common Corpus, we are going to release our training data recipes. The 3B model was trained with support from Etalab - Direction interministérielle du numérique (DINUM) on the GENCI Jean Zay H100 NVIDIA Eviden partition, as part of the Grand Challenge aligned with the European strategy for establishing AI factories through the EuroHPC Joint Undertaking (EuroHPC JU), aimed at supporting startups and providing open source models to the community. We developed our Nano (1.2B) model with the support and expertise of TractoAI (Mike Burkov), a serverless AI platform for running data- and compute-intensive workloads at scale.

Use our models
The whole family of models is available through Hugging Face. We release the Pleias 1.0 models under a permissive Apache 2.0 license, meaning the models are available for use, distribution, and modification for any purpose. https://lnkd.in/e9DXiXXg
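For context, a minimal sketch of running one of these small models locally with the Hugging Face transformers library. The repository id and generation settings below are assumptions for illustration, not taken from the announcement; check the linked Hugging Face page for the actual model names.

```python
# Minimal sketch: running a small Pleias model locally on CPU.
# The repository id is an assumption; verify the actual Pleias 1.0
# model names on the Hugging Face hub before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-Pico"  # hypothetical id for the 350m model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # CPU by default

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the smallest checkpoints are in the 350m-1.2b range, a sketch like this runs on a laptop CPU without quantization; swapping in a larger sibling model would be a one-line change.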
-
Open data – truly open data – is imperative for building safe and auditable models. As part of The AI Alliance's Open Trusted Data Initiative, we are releasing Common Corpus, the largest open and permissively licensed dataset for training LLMs, at over 2 trillion tokens (a loading sketch follows the link below). 👉 Read more: https://lnkd.in/dUAkyZjE
Pleias Releases Common Corpus, The Largest Open Multilingual Dataset for LLM training | AI Alliance
thealliance.ai
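A minimal sketch of how such a dataset could be consumed with the Hugging Face datasets library; the dataset id below is an assumption, so verify it against the linked announcement. Streaming mode avoids downloading the full 2-trillion-token collection at once.

```python
# Minimal sketch: streaming Common Corpus instead of downloading it fully.
# The dataset id is an assumption; check the announcement for the exact path.
from datasets import load_dataset

corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Peek at the first few records; each carries text plus its metadata.
for i, example in enumerate(corpus):
    print(example)
    if i >= 2:
        break
```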
-
pleias is at Web Summit this week to present some quite game-changing things we have been cooking for a while now (5 trillion tokens and 150k H100 hours later...). pleias' team (Pierre-Carl Langlais, Anastasia Stasenko, Prof. Dr. Ivan Yamshchikov, Kate Pavlova) will be at The AI Alliance's Media Happy Hour on the 12th and in the Beta Startup Zone (B127) on the 13th.
-
We are delighted to announce that Cassandre, an HR assistant developed by pleias for the Académie de Lyon, has been shortlisted for the Grand Prix IA & RH of Hub France IA. Many thanks to Jerome Blondon, who led this project, as well as to pleias' Applied AI team (Carlos Rosas Hinostroza, Irène Girard, Pierre-Carl Langlais).
Grand Prix IA & RH 🎉 The verdict is in for the Hub France IA Grand Prix IA et RH! 🎊 🚀 We received more than 50 high-quality applications, so the selection was tough! 🤯 A big thank you to all participants for their commitment and their innovative ideas. 🙏 🥁 Drum roll... 🥁 After an intense deliberation, we are pleased to announce the 8 finalists, in alphabetical order:
- Groupe Keolis with EURODECISION
- La Poste Groupe with Probayes
- agap2 with LE MUST Employer
- Safran with 360Learning
- Rectorat de l'académie de Lyon with pleias
- Roquette
- Semantikmatch
- TOP™
🔥 We can't wait to see them defend their ideas before our jury of experts. Join us on December 10 to discover the grand winner of this first edition! 🤩 👏 We also want to congratulate the following projects, which stood out and make up the TOP 15: ✨ Alstom; Crédit Agricole d'Ile-de-France with Leihia; Manpower France with Sanofi; Omind; Omogen; People360; Skillup.co with DominoRH #IA #RH #Innovation #HubFranceIA #GrandPrix #Finalistes ➡️ Stay tuned to follow the Grand Prix IA et RH adventure! Our partners: Parlons RH; Le Lab RH; ActuIA; Alan Jean-Roch Houllier, Claire LARSONNEUR, Emmanuel Teboul, Gérald PETITJEAN, Michel ROMANET-CHANCRIN, Pierre Guenoun, Jerome Blondon, Anastasia Stasenko, Fanny Girerd, Maxime Cariou, Pierre-Louis Bescond, Stephane Barbot, Stéphane Ureña
-
pleias Joins The AI Alliance to Co-lead the Open Trusted Data Initiative and Releases the Largest Open Dataset for LLM Training - Common Corpus

The Open Trusted Data Initiative (OTDI) is a joint effort of The AI Alliance members to ensure the AI community has access to open and trusted data with appropriate provenance. This work is fully aligned with pleias' mission and values, so today pleias joins the AI Alliance to lead the OTDI efforts and releases Common Corpus.

Common Corpus represents a significant advancement in open-source AI development, comprising over 2 trillion tokens of high-quality, multilingual text data. This unprecedented collection includes diverse content ranging from books and scientific articles to government documents and computer code, with substantial coverage across English, French, and other European languages. By making this dataset publicly available, we are enabling the development of powerful, efficient language models that comply with EU AI Act requirements while democratising the field of pretraining of Large Language Models.

The dataset also features extensive documentation of data provenance and procedures, making it fully transparent and auditable (a sketch of how that provenance could be audited follows below). Through careful content filtering, the collection maintains strong educational value while eliminating harmful content. The dataset draws from trustworthy sources including open scientific literature, cultural heritage materials, open-source code, and government and legal documents. This focus on high-quality, knowledge-grounded content ensures that models trained on Common Corpus will benefit from reliable, accurate information.

The development of Common Corpus has been made possible through strategic partnerships with leading organizations in the field. Wikimedia Foundation Enterprise, EleutherAI, and Ai2 have provided valuable technical expertise and resources. Additionally, support from the Ministère de la Culture and the Direction interministérielle du numérique (DINUM) has been instrumental in accessing and curating high-quality content.

In addition to the dataset release, pleias is publishing comprehensive documentation of its data preparation methodologies and introducing new models and libraries for pretraining data preparation and filtering. This commitment to transparency exemplifies the organization's dedication to open science principles.
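Since the post emphasizes documented provenance, here is a hedged sketch of how a downstream user might audit it while streaming. The metadata field name ("collection") is a hypothetical placeholder; substitute whatever schema the release actually documents.

```python
# Minimal sketch: tallying provenance metadata over a streamed sample.
# The dataset id and the "collection" field name are assumptions;
# replace them with the identifiers documented in the release.
from collections import Counter
from datasets import load_dataset

corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

sources = Counter()
for i, example in enumerate(corpus):
    sources[example.get("collection", "unknown")] += 1  # hypothetical key
    if i >= 10_000:  # audit a slice, not all 2T tokens
        break

for name, count in sources.most_common():
    print(f"{name}: {count}")
```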
-
pleias releases its first foundation model out of its future suite of specialised, ultra-fast and green pretrains for document processing at scale

At pleias we are successfully experimenting with a new category of models: specialized foundation SLMs. These models are designed from the ground up for specific tasks, trained exclusively on a large custom instruction dataset (at least 5-10B tokens) and, so far, yielding performance comparable to much larger generalist models.

Today, we release the first example of specialized pre-training: OCRonos-Vintage. It is a 124 million parameter model trained on 18 billion tokens from cultural heritage archives to perform OCR correction (a usage sketch follows below). Despite its extremely small size and lack of generalist capacities, OCRonos-Vintage is currently one of the best available models for this task. We are deploying it at scale to pre-process and correct the badly OCRised parts of a dataset of more than 700 billion tokens from Gallica (BNF), Chronicling America and other cultural heritage corpora. https://lnkd.in/e2N3XFjk
The case for specialized pre-training: ultra-fast foundation models for dedicated tasks
huggingface.co
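A minimal sketch of OCR correction with a small causal model like this one. Both the repository id and the "### Text ### / ### Correction ###" prompt framing are assumptions for illustration; the linked post documents the real usage.

```python
# Minimal sketch: prompting a small specialized model for OCR correction.
# The repository id and the prompt format below are assumptions; verify
# them against the model card linked above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/OCRonos-Vintage"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # 124m, CPU-friendly

# Garbled OCR input to correct; the framing below is an assumed convention.
ocr_text = "Tke Parisian journals vvere fillcd with accounts of the fcstival."
prompt = f"### Text ###\n{ocr_text}\n### Correction ###\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```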
-
pleias reposted this
✨📚 Shall we talk about copyright? On this World Intellectual Property Day, SCIN360 wants to highlight a bold young French start-up: pleias. Announced a month ago, it makes available a corpus of 500 billion words in more than 70 languages for training the large language models used by AI. What sets it apart? Using only texts from public, rights-free libraries, offering open source resources while respecting copyright and promoting a transparent and ethical approach. 📖 After complaints filed by several authors against OpenAI and Meta in July 2023, and the lawsuit between the "New York Times" and OpenAI, Pleias sets the record straight. 🕰️ With a first fundraising round on the horizon, and the backing of several collaborative research projects on open source generative AI, Pleias is positioning itself as a major player in building AI that respects copyright and linguistic diversity. And you, what do you think of Open Source for the future of AI? 💡 #OpenSource #IA #CommonCorpus #Innovation #intelligenceartificielle #IAEthique
-