Commons AI session 1 with speakers Julie Hunter, Bertrand Monthubert, Pauline Zordan, Pierre-Carl Langlais and Bertrand Pailhès

Commons AI: an essential need for access to quality data


On 10 December 2025, the Commons AI day was dedicated to artificial intelligence in a commons-based approach. It took place during the Future of Software Technologies (FOST) event, at the CNIT in La Défense. Three sessions were organised over the course of the day, one for each of the three pillars of the commons: 1/ resource, 2/ community and 3/ governance. The fifteen talks showcased ongoing initiatives and possible paths towards an AI that is more respectful of the principles of openness and the commons. Beyond the talks themselves, each session gave rise to lively and fruitful exchanges between speakers and audience. In three posts, we offer an overall account of each session and a summary of the talks.

The first post is dedicated to the morning session on resources, the foundations of AI systems.

The first session of the Commons AI event was devoted to resources, understood as the various components needed to offer an alternative, community-based AI. It brought together four talks (OpenLLM France, Pleias, Ekitia, IGN), each offering a complementary perspective on the place of data in AI development. The associated issues of openness and sharing were also addressed, in particular regarding the regime applicable to models, existing software implementations, and the underlying hardware constraints.

All the talks converge on the essential need for access to quality data, which requires significant documentation work. The open data movement is well placed to provide such data, particularly thanks to the major role of the expert contributors who “manufacture” them. This is the case, for example, of Wikidata, or of the open data provided by IGN. Unfortunately, these data are few compared with the immense volume of web data scraped to train AI models. In this respect, the Common Corpus dataset is a major initiative for providing quality open data. Although the corpus is used by many actors, funding to maintain it remains scarce, or even non-existent.

Other issues come on top of data availability. Dynamically managing opt-outs is complex: data covered by an opt-out are no longer available for training and must be removed from the corpus. We may currently be witnessing a new tragedy of the commons, with fewer and fewer open data available to train AI as fears grow that major AI players are plundering the open web.

For anyone developing an open and transparent French AI, the problem is even greater, given the very limited amount of French-language data compared with English. Beyond the efficiency and quality of French AI models, this amplifies a hegemonic cultural phenomenon in favour of English, which sidelines the specifics of other languages (modes of thinking and acting, historical baggage, etc.). On top of that come issues of compliance and of protecting sensitive data. Developing and deploying an open and ethical AI on freely accessible data thus raises major organisational, ethical and economic challenges.

What solutions are possible? Several avenues were raised. The first is to focus on more domain-specific AI models, for instance in education (see the work of OpenLLM France). Creating synthetic data, in particular on the basis of open data, is another way to counter the lack of data, and goes hand in hand with a “domain-specific AI” approach (see Pleias’s SYNTH project). From an economic and organisational standpoint, a further avenue lies in data spaces that allow actors to pool specific data within a trusted framework (see Ekitia’s initiatives). Finally, all these talks remind us that open, high-quality AI cannot be built without collaboration between different professions: AI engineers, data curators and experts (such as those at IGN), and lawyers to design tailored frameworks of collaboration. This runs counter to the idea of an AI that would do away with any need for collective dynamics.

Below you will find a summary of each talk, along with the audio recording and the associated slides.

OpenLLM France: building open and transparent French AIs

With Julie Hunter (Linagora R&D)

The OpenLLM France initiative aims to build an ethical and transparent AI. Unlike most AI models today, the project builds a fully Open Source AI, using Open Source licences and providing access to the training data. Most AI models presented as open today share only their model weights (open weight), which allows for fine-tuning but provides neither transparency nor control over potential biases. The initiative has also chosen to focus on a French-language data corpus, whereas models are largely trained on English (for example, in LLaMA v2, French represented 0.16% of the data). This raises a language question, of course, but above all a question of the cultural sensitivity rooted in each language. OpenLLM France is currently developing a new model (following Lucie 7B), offered in three sizes (1B, 8B, 23B), trained on over 5T tokens and including new languages (Portuguese, Arabic, etc.). Training is carried out in several phases, in particular on maths and reasoning, which are especially important since several of the project’s case studies are dedicated to education.

The challenges identified are of several kinds:

  • Having to use web data which are often of low quality. They have to be filtered in several ways and the associated terms of use also have to be checked. Existing data corpora such as Common Pile and Common Corpus are solid foundations.
  • Coping with multiple biases and toxic content that has to be filtered out.
  • Obtaining French-language data: little French content carries an open licence; even where data are in the public domain, they are not necessarily directly accessible (OCR is needed) and quantities remain very low.
  • Accessing post-training data: these are difficult to acquire, are seldom open, and even less so in French.
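As an illustration of the filtering steps listed above, here is a minimal sketch in Python. It is not the OpenLLM France pipeline: the `Document` fields, the licence allowlist, the thresholds and the assumption that a toxicity score has already been computed by an upstream classifier are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

# Hypothetical allowlist of licences treated as open; not taken from the talk.
OPEN_LICENCES = {"cc0", "cc-by", "cc-by-sa", "public-domain"}


@dataclass
class Document:
    text: str
    language: str    # ISO 639-1 code, e.g. "fr"
    licence: str     # licence identifier attached to the source
    toxicity: float  # score in [0, 1], assumed to come from an upstream classifier


def keep(doc: Document, max_toxicity: float = 0.2, min_words: int = 5) -> bool:
    """Return True when a document passes all three illustrative filters."""
    has_open_licence = doc.licence.lower() in OPEN_LICENCES
    is_clean = doc.toxicity <= max_toxicity
    is_long_enough = len(doc.text.split()) >= min_words
    return has_open_licence and is_clean and is_long_enough


def filter_corpus(docs: list[Document]) -> list[Document]:
    """Keep only the documents that pass every filter."""
    return [doc for doc in docs if keep(doc)]


sample = [
    Document("Le chêne pédonculé est un arbre commun en France.", "fr", "CC-BY", 0.01),
    Document("Texte sous droits réservés que l'on ne peut pas réutiliser.", "fr", "all-rights-reserved", 0.02),
    Document("Trop court.", "fr", "cc0", 0.0),
]
kept = filter_corpus(sample)
print(f"{len(kept)} of {len(sample)} documents kept")
```

Even in this toy sample, the licence check and the length check each discard a document, which gives a sense of how quickly a French-language corpus shrinks once terms of use and quality are both enforced.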

Listen to the audio (in English) - Slides

Open data flows: rethinking AI infrastructures after the synthetic-data turn

With Pierre-Carl Langlais (Pleias)

Pleias is a French start-up committed to the development and training of AI while taking into account several key issues: data quality, model efficiency, and security and compliance risks.

To do so, Pleias relies on frugal foundation models and advocates for access to open (copyright-free) data. The start-up is notably known for the development of a fully open pre-training corpus called Common Corpus, with more than 500 million documents under an open licence. Common Corpus has been downloaded over 700,000 times.

The question of training data is crucial and rarely addressed. Several issues arise around training data today. First, they rely mainly on web data, which are generally of very poor quality and difficult to filter at scale. Second, most major models have been, or still are, trained on pirated sources. Those who deploy these solutions are therefore exposed on several fronts: generation of copyright-protected content, alignment with applicable regulations and country-specific norms, and reproduction of content already present in the training corpus.

This situation has an even more perverse effect on the internet and the open web, one that can be compared to a tragedy of the commons. Out of fear of being plundered, open data are increasingly being closed off again, which further depletes the pool of available data. In Europe, the absence of a fair use principle amplifies this phenomenon. Despite the effort Pleias has put into developing Common Corpus, and despite its massive use, it is extremely difficult today to fund such an initiative.

Another path is emerging for continuing to deploy AI systems based on reliable, quality data: the move to synthetic training environments. Synthetic environments give control over the data and increase efficiency on very specific tasks. Mathematics and source code development are the main areas in which they are used, given their connection to formal logic. These environments require high-quality, documented data, which gives an important place to open data such as Wikidata’s (used, for example, by Alibaba Deep Research). In this new combination, open datasets, which are often small, can be amplified and made viable for pre-training. Synthetic data also make it possible to work around the issue of protecting sensitive data, by creating personas based on such data. Furthermore, the development of AI agents will be amplified by the possibility of reintegrating the model into itself as the agent’s product. Synthetic environments could also be widely used for specific domains, and to connect these domains to one another.

Pleias is positioning itself in the field of synthetic environments with the creation of SYNTH, which applies an amplification process (upsampled rephrasing) to Wikipedia articles.

Listen to the audio (in English) - Slides

Data spaces and digital commons: building a responsible, transparent and inclusive AI market

With Bertrand Monthubert and Pauline Zordan (Ekitia)

Ekitia is an association bringing together public, private and academic organisations in order to create trust frameworks that facilitate data sharing. Access to data is a cornerstone of AI development, and it is fraught with obstacles. The challenge is to build a relationship of trust between actors who wish to share data, and to do so on fair terms. In practice this often means many months of negotiation, frequently for a single contract. It also sidelines smaller actors who lack such bargaining power.

Ekitia has studied several common obstacles across three different sectors: health and research, disability, and employment. First, data sharing is made difficult by data protection and confidentiality rules. Next, fair compensation must be designed for the people who contribute these data. Finally, technical interoperability remains a major issue.

For Ekitia, trusted data spaces are a possible solution, providing both a shared, easily accessible infrastructure and the governance rules attached to data management. The commons dimension is very important for thinking about this governance and for putting the necessary means in place on the basis of interoperable standards.

To this end, Ekitia has set up a “rulebook” to manage the conditions for the use and reuse of digital resources, taking organisational, contractual and technical aspects into account in a structured approach. So far, Ekitia has collaboratively developed an ethical rulebook (currently being tested) and a legal rulebook based on the major existing regulations (GDPR, AI Act, DGA, criminal code, etc.). Both rulebooks are meant to be improved through the partnerships built around these initiatives.

Listen to the audio (in English) - Slides

What does it take to build effective AI systems for environmental mapping?

With Bertrand Pailhès (IGN)

IGN (the French National Institute of Geographic and Forest Information) has the mission of mapping the French territory, mainly for environmental purposes but also for military, agricultural and planning purposes. The institute provides access to high-quality cartographic data published as open data. The commons approach is also part of IGN’s ambitions, with projects such as Panoramax, which makes it possible to photo-map territories collectively (in particular in geographic areas that Google Maps would never cover). IGN closely follows the new dynamics associated with AI. Within the institute, AI is used, for example, to monitor land artificialisation: a task that used to take a great deal of time can now be completed quickly. Developing AI within IGN required building a roadmap (the IGN AI Roadmap 2022-2024) and matching it with the necessary resources, in terms of both infrastructure and professional skills.

Building adequate, high-quality datasets for training AI requires significant annotation work by field professionals who hold the relevant skills and expertise (for example, to identify and name many tree species). The challenge is therefore to build a shared approach between AI engineers and field technicians and operators. IGN is now also developing its own foundation models, and plans to publish both the data and the models.

Listen to the audio (in English) - Slides

Thanks to all the speakers, to Ramya Chandrasekhar for moderating this session, and to the FOST organisers for hosting the event.