Image: remix of blotter_explosion. Date: 27 July 2006. Source: Flickr. Author: Señor Codo. Licence: CC BY-SA.

Open Source and AI: What Are We Really Talking About?

The use of artificial intelligence (AI) raises major legal, ethical and societal issues, to which legislators are attempting to respond with new regulatory frameworks. The European Union has adopted Regulation (EU) 2024/1689 on Artificial Intelligence (hereinafter “RIA” or “AI Act”), which grants an exemption regime to Open Source AI systems, but with limitations that reveal a narrow interpretation of what “Open Source” means in the context of AI.

In this series of three articles, “Open Source and AI”, inno3 offers an analysis of the AI Act and of its impact on the Open Source ecosystem, leading to concrete courses of action.

What you need to know

The term “open source AI” covers realities that sometimes diverge. Three approaches coexist today: a regulatory approach (Europe with the AI Act), a “principles” approach (OSI with OSAID), and an industrial approach (the Linux Foundation with the Model Openness Framework). These divergences are not oversights: they reflect distinct objectives (regulate, preserve, innovate), which explains why the definitions do not converge. Understanding why these definitions exist (regulation, transparency, technical and legal collaboration) is the key to reframing the debate. Beyond licences, the question of the governance of open AI systems and the analogy with the *sui generis* database right (ODbL) open promising paths.

The adjective “Open” has a precise meaning when applied to digital creations (artefacts). For each of these objects, the term is embodied in a dedicated definition and in a licence that enables sharing from both a technical perspective (the digital object) and a legal one (the rights held by the organisations and people involved).

These definitions have evolved over the years, notably to ensure the absence of mechanisms (legal or technical) that would allow taking back with one hand what is granted with the other: absence of patents, sharing of encryption keys, absence of DRM, etc. This notion applies to software (Open Source), to data (Open Data), to content (Open Content), to hardware (Open Source Hardware) and, potentially, to AI models if they are considered as distinct artefacts. For a historical analysis of the construction of these legal mechanisms, see “The evolution of free and open source licences: criteria, purposes and completeness?” (B. Jean, in Stories and cultures of the free, Framabook, 2013).

However, “open” also applies to dynamics: Open Science, Open Access and Open Innovation designate ways of organising rather than objects: sharing infrastructure and IP, opening flows, rethinking governance, etc. These dynamics cannot be reduced to the sum of the licences applied to the artefacts they mobilise: they involve collective choices about who contributes, who decides and who benefits. They lead actors to change their stance in order to collectively go beyond what the sum of individual interests can generate.

When talking about “Artificial Intelligence” without defining terms, one generally touches on both:

  • One can consider an AI system as the sum of its components (code, data, weights, pipeline), each falling under a distinct legal regime. This is roughly the approach of the European Regulation on AI (AI Act) when it speaks broadly of the “AI System”.
  • One can also think of an open AI dynamic (an “Open AI”): a movement that questions more broadly how actors organise themselves, share computing resources and build together. This second vision (close to the discussions of Commons AI (in French)) invites us, like Open Science, to rethink infrastructures, the underlying flows and the conditions for genuine collaboration.

The existing definitions of “open source AI” have struggled to converge because they try to accommodate concerns operating at different levels: working together (technical and legal sharing), demonstrating transparency (ethical issues), and regulating (obligations versus exceptions for innovation).

This article, the last in our series on AI, Open Source and the AI Act, attempts to untangle these levels in order to understand why the definitions diverge and why that is normal, and to offer some proposals for action.

Three definitions, three objectives: why the definitions diverge

Europe: regulating models and systems

Europe has been a pioneer in AI regulation. The European notion of open source AI distinguishes:

  • AI systems released under free and open licences, bearing in mind that an AI system is defined as “a machine-based system that is designed to operate with varying levels of autonomy and that may exhibit adaptiveness after deployment, and that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments” (Article 3(1));
  • General-purpose AI models (Article 53(2)) “released under a free and open-source licence that allows for the access, usage, modification, and distribution of the model, and whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available”.

The framing is useful but creates uncertainty, because 1) the texts do not define what “free and open licences” (or Open Source) are, and 2) the existing definitions of Open Source licences cannot cover the objects in question, which are either sets of artefacts (such as the AI system) or models (but then without considering what makes a model specific).

Added to this is the fact that the AI Act is part of a broader body of regulation, including the Cyber Resilience Act (CRA) and the new Product Liability Directive (NPLD), each of which contains an Open Source exception with its own logic. How these exceptions fit together is not always clear to those who wish to navigate them. This is explained by the purposes behind each exception, which are entirely laudable but not always sufficiently explicit (see the dedicated article).

This definitional work is laudable, but it is geared towards a regulatory objective: determining who is exempt from obligations and who is not, rather than qualifying what is open and what is not. The licence attached to the model, or to all the constituents of a system, is certainly central to benefiting from this exception, but it comes on top of other constraints (community scope, absence of commercial activity, etc.) that are needed to justify the regulation in light of overall purposes such as free competition, transparency and sovereignty. For an in-depth analysis of the AI Act, consult our dedicated article as well as the related thoughts in the CRA Guide.

OSI (OSAID): preserving principles against misappropriation

After nearly 12 months of exchanges and dialogue, the Open Source Initiative responded in 2024 by adopting OSAID 1.0 (Open Source AI Definition), with the explicit objective of defending the principles of Open Source against industrial misappropriation. OSAID reaffirms the four fundamental freedoms: to use, study, modify and share. But unlike the classical OSI definition (which covers only software), OSAID goes further: it incorporates technical documentation, the minimum necessary data, and access to modification tools.

Being purpose-driven, OSAID is closer to the Free Software Definition (FSF) than to the classical Open Source Definition (OSI). It recognises that licences alone are not sufficient: one can distribute GPL software without the source, or publish the weights of a model without the training data, which empties freedoms of their substance. This explains why OSAID requires, for example, that “sufficiently detailed information about the data used to train the system” be available, without going so far as to require access to the data itself. This compromise is sharply criticised by the FSF and by Bradley Kuhn, who see it as a concession that hollows out those freedoms (see our article “How AI Questions and Reinvents Open Source” (in French)).

This is a reactive approach: as the history of free licences shows, the movement was never created to please dominant actors, but to propose an alternative path more respectful of our freedoms. It is therefore not surprising that licences go further than what tech giants accept. This tension is historical. Recall the Open Access debates: for years, industrial lobbying ensured that no position was too strong, a graduated approach which prevented real change. The same dynamic is at work with open AI. The original definition of Open Source emerged with the support of companies that wished to bring forth an alternative to the dominant model; it is perhaps to this type of actor that one must turn again to move this normative work forward.

For a reasoned critique of OSAID, see the post by Bradley Kuhn (from the SFC), or the presentation by Carlo Piana at EOLE 2024.

The Linux Foundation (MOF): an industrial consensus

The Linux Foundation proposed the Model Openness Framework (MOF) in April 2024, six months before OSAID (which also says something about the power dynamics). The MOF reflects the position of its members: the digital giants pushing AI today, far from alternative actors. It uses a matrix approach with 17 components evaluated on 3 levels:

  • Class III (Open Model, the minimum);
  • Class II (Open Tooling, including training and inference code);
  • Class I (Open Science, the most demanding, including raw datasets).

This means that under the MOF, a model can be qualified as “open” (Class III) by publishing only weights and architecture, without training code or data. In this respect, the MOF aims to be pragmatic: it acknowledges that full publication is costly and complex. As a result, it also serves to legitimise a minimal vision of openness (open weights are open “enough”). This approach reflects the interests of its sponsors: it allows Microsoft, Google or Meta to say “we are open” by publishing the weights of models such as Llama or Mistral, whilst keeping secret the data, the training scripts, even the code itself. See the analysis of the Ada Lovelace Institute on this point.
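The tiered logic described above can be illustrated with a minimal sketch. It is a deliberate simplification: the component names are hypothetical labels taken from this article, and only five components are kept where the actual MOF evaluates 17. It should not be read as the Linux Foundation's own tooling.

```python
from enum import Enum
from typing import Optional, Set


class MofClass(Enum):
    CLASS_III = "Class III (Open Model)"
    CLASS_II = "Class II (Open Tooling)"
    CLASS_I = "Class I (Open Science)"


# Hypothetical, simplified component names taken from the article.
# The real MOF evaluates 17 components; this sketch keeps only five.
CLASS_III_REQUIRED = frozenset({"weights", "architecture"})
CLASS_II_REQUIRED = CLASS_III_REQUIRED | {"training_code", "inference_code"}
CLASS_I_REQUIRED = CLASS_II_REQUIRED | {"raw_datasets"}


def classify(published: Set[str]) -> Optional[MofClass]:
    """Return the most demanding class whose components are all published."""
    if CLASS_I_REQUIRED <= published:
        return MofClass.CLASS_I
    if CLASS_II_REQUIRED <= published:
        return MofClass.CLASS_II
    if CLASS_III_REQUIRED <= published:
        return MofClass.CLASS_III
    return None  # not even the minimum level is reached


# A release limited to weights and architecture already counts as "open".
open_weights_only = classify({"weights", "architecture"})
```

Under these assumptions, a release limited to weights and architecture lands in Class III, which is the point made above: the lowest tier carries the word “open” without any training code or data.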

These three approaches are not the only ones. The FSF adopts a maximalist position: for a machine learning application to be free, the training code and the data must be free (a position that OSAID itself has not adopted). In the United States, the NTIA report (July 2024) introduced the intermediate category of “open-weight models”, recognising a reality that binary definitions struggle to capture. Finally, the RAIL licences (Responsible AI Licenses), born from the BigScience/BLOOM project and today used by tens of thousands of models on Hugging Face, explore a distinct path: open, but with usage restrictions, which both OSI and FSF consider contrary to the definitions of Open Source and Free Software, since these tolerate no discrimination. These positions illustrate that the landscape is more fragmented than the three main poles might suggest (the subject goes beyond the question of AI models alone; see the report by Ramya Chandrasekhar on the explosion of licences to frame data in the face of AI).

Decomposing openness: what “open source AI” covers (and does not)

Several questions need to be asked in order to know what to protect (and, incidentally, how to protect it).

Open models, what are we talking about?

Under French law, an AI system could be compared to a complex work (*œuvre complexe*) such as a video game. A complex work is composed of several elements that are independently protectable: code (software), texts and images (copyright), music (musical rights), structured data (*sui generis* database right), etc. In practice, such an asset is protected through cumulative regimes: each element obeys its own regime, with its own rights and obligations. There is therefore not one licence for the whole, but a multitude of licences. The emulation that followed the Open Source distribution of video game engines shows that this can support relatively virtuous intellectual property strategies.

Likewise, it can be argued that a single licence cannot cover an entire AI system, and that one must instead think of an architecture of licences that reflects the technical and legal architecture of the system. The code can be under GPL, the data under ODbL, the weights under a specific licence, the documentation under CC BY-SA. Each licence covers its own domain, except for the model, which truly is the poor relation. AI agents (autonomous systems built on the model) should also be added to this picture; they are more like third-party services available via API (on this subject, see the API ToS project).

The AI model itself (if one wishes to be able to reconstruct it) also results from an assembly of components: code (frameworks, training scripts), data (training corpus), weights (model parameters), pipeline (orchestration, validation, deployment). Most of these elements have their own legal regime: free software for the code, Open Data for the data, Open Content for the content. Nothing, however, seems quite right for models (and notably for the weights that result from their training and parameterisation). At best, one would need to consider the model itself as an assembly of components (code, data, weights and complete documentation), at the risk of missing the subtlety of what makes the model (its learning), which does not in itself give rise to a right.
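The idea of an architecture of licences can be sketched as a simple mapping from components to licences. The mapping below only reuses the examples given in this article; representing the weights with `None` is an assumption made for illustration, standing for the “specific licence” that has no established family yet.

```python
from typing import Dict, List, Optional

# Licences named in the article. None marks a component for which no
# established licence family exists yet (an illustrative assumption).
LICENCE_BY_COMPONENT: Dict[str, Optional[str]] = {
    "code": "GPL",
    "data": "ODbL",
    "documentation": "CC BY-SA",
    "weights": None,
}


def uncovered_components(licences: Dict[str, Optional[str]]) -> List[str]:
    """List, in alphabetical order, the components without an established licence."""
    return sorted(name for name, licence in licences.items() if licence is None)


# With the mapping above, only the weights are left uncovered.
gaps = uncovered_components(LICENCE_BY_COMPONENT)
```

The value of writing the system down this way is that the gap becomes explicit: every component has a natural licence except the one that makes the model a model.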

Consult the NTIA report on open-weight models for a detailed analysis.

Open-washing: a documented risk

In the absence of a clear definition, the risk of “open-washing” (claiming to be open while being minimally so) is very real. The figures speak for themselves: according to an analysis by Osborne, Ding and Kirk, 64.67% of models on Hugging Face have no licence or are under a restrictive licence. Llama and Mistral are often described as “Open Source”, but come with restrictive “Acceptable Use Policies” which bring their status closer to freeware than to Open Source.

In reality, very few models (such as Pythia, OLMo or T5) satisfy even minimum criteria. This raises a regulatory problem: by defining open source AI very broadly (as the AI Act does), Europe (unintentionally) creates an incentive towards open-washing. If you can call your model “open” simply by formally respecting the licence criterion, you will do so, even if you keep the rest secret. The Open Future Observatory has documented this phenomenon and proposes continuous monitoring. See also the study by Liesenfeld & Dingemanse (FAccT 2024), which shows, across 45 generative AI systems, how declarations of openness are often exaggerated.

Moreover, beyond the level of the artefact, “Open AI” (as a dynamic, not as an object) means something more important: rethinking how actors organise themselves, in the same way that Open Science rethinks infrastructure, data flows and governance. This includes the question of who controls access, how iterations are decided, and how contributions are managed. It is a paradigm shift: from code transparency to ecosystem governance. If the aim is to defend the interests of users and of our ecosystem, the model alone is no longer enough: it is the community and the system that matter.

What licences can (and cannot) capture

The principle of Open Source licences is simple: to rely on one or more intellectual property rights in order to agree contractually with all users (licensees) on a set of legal and technical terms and conditions that promote trust and collaboration.

The legal qualification of AI models

A licence presupposes a property right over its subject matter. While software can rely on copyright, there is to date no equivalent for an AI model. Weights are closer to a configuration or an optimisation and do not clearly fall under any established legal category: they are not software (not really code), they are not structured data in the classical sense (the structure is not visible from outside), nor are they artistic or literary works in the traditional sense. We are in a legal grey area, but several theoretical approaches with fairly immediate implications seem to be emerging:

  • consider models with their weights as a database and apply the *sui generis* database right. The economic stakes of model training recall the rationale behind the *sui generis* database right, which recognises a (limited) monopoly linked to the substantial investment that went into building the database. The interest of such an interpretation is that licences such as ODbL (Open Database License) could be relatively suitable (see below);
  • create a *sui generis* (i.e. specific) regime for AI models (as the Intellectual Property Owners Association proposed in 2020), which would depend on an international consensus that is hard to anticipate in this period of instability;
  • extend copyright to trained models (a less likely position, as it runs counter to the criterion of originality, which is assessed in light of the imprint of the authors' personality, and we are very far from that);
  • use a hybrid approach combining multiple regimes, as mentioned earlier, with the risk of not capturing the specificity of models.

For in-depth legal analyses, see the work of Sousa e Silva or the study by the European Parliament (2025).

The ODbL approach and sui generis right of databases

Faced with this legal grey area, it is worth considering AI models as protected by the *sui generis* database right, and using ODbL (Open Database License) for this protection. ODbL has already proved itself: the migration of OpenStreetMap from CC BY-SA to ODbL demonstrated both the limits of content licences applied to databases and the relevance of a specific instrument grounded in the *sui generis* right. This solution would moreover complement the use already made of ODbL on certain training databases, making it possible to harmonise the regime of training databases with that of models (or at least of the associated weights).

ODbL rests in particular on the *sui generis* database right and distinguishes two situations:

  • For Produced Works (works produced from the database, such as a report, a map or an analysis), the obligation is limited to attribution: stating that the data comes from the original database.
  • For Derivative Databases (such as a modified or enriched version of the original database), ODbL additionally imposes copyleft: the derived database must be shared under the same conditions. This distinction is fundamental for understanding the two possible readings applied to AI.

In this regard, the AI model could be:

  • a “Produced Work” made from the training data. ODbL would then require: (1) stating the data used (obligation of transparency), and (2) allowing the sharing of enhanced data. This means that publishing the weights of a model amounts to accepting that the enhanced data be shareable. It is an obligation of transparency turned into a contractual obligation of the licence rather than a regulatory one. That is already significant: it would require traceability of the training data as a licence obligation, not only as a legal obligation.
  • and/or a Derivative Database (it is a “database” in the broad sense of ODbL) created from the training data. ODbL would then extend to the model itself, requiring that the model be shared under the same terms. It is copyleft for databases, extended to models.

This approach is particularly interesting because it creates obligations similar to copyleft without needing to prove that the model is a “derivative work” under copyright. The *sui generis* database right has its own criteria (substantial investment, extraction), which are easier to apply. This means that a copyleft obligation could arise via ODbL even if copyright does not recognise the model as derivative. The limit of this solution is partly cultural and partly legal: the *sui generis* database right exists to date only in Europe. Nevertheless, this way of giving value to trained models would merit more thorough study.
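The two readings of ODbL discussed above can be summarised in a short sketch. The names `Reading` and `obligations` are hypothetical, and the wording of each obligation paraphrases this article rather than the licence text; it is a way of making the difference between the readings explicit, not an interpretation of ODbL itself.

```python
from enum import Enum
from typing import List


class Reading(Enum):
    """How the trained model is qualified with respect to the training database."""
    PRODUCED_WORK = "produced work"
    DERIVATIVE_DATABASE = "derivative database"


def obligations(reading: Reading) -> List[str]:
    """Obligations attached to the model under each reading, as described in the article."""
    duties = [
        "state the training data used",
        "allow the enhanced data to be shared",
    ]
    if reading is Reading.DERIVATIVE_DATABASE:
        # Copyleft extends to the model itself under this reading.
        duties.append("share the model itself under the same terms")
    return duties
```

Under these assumptions, the derivative-database reading is strictly additive: it keeps the transparency obligations of the produced-work reading and adds the share-alike obligation on the model.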

A specific obligation for models in Open Source licences

A complementary option, which would bring these reflections together with those of Open Source, would be to assess the possibility of extending current Open Source licences so that they have effects on AI models integrated in code. This approach would follow a logic already applied to other technical or legal situations that have been dealt with directly in Open Source licences, without going against the definitions of Open Source (OSD) or Free Software (FSF). This is in line with current practices:

  • GPL-3.0 was designed to address the problem of “tivoization” (distributing locked hardware that prevents the exercise of free software freedoms; see the detailed analysis of this mechanism and its precedents) and AGPL-3.0 addressed the “ASP loophole” (the gap that arises when free software is used as SaaS). The same problem arises with AI: distributing weights without the training data or training scripts is like distributing a binary without the source. Users have a product, but not the real capacity to modify or adapt it. It would therefore be necessary to guarantee at least all the information that allows understanding and modifying the model. This echoes an analogy that circulates in the Open Source community, in which weights are the “object code” of the model, whilst the training data and scripts are the “source code” (see the work of Andrew Marble on the copyright of weights).
  • Similarly, the LGPL requires, when you integrate a software library into an application that is not Open Source, that you provide all the information needed to interface with the library, use its functions, and recombine the application with a modified version of the library (LGPLv3, section 4). The idea is to prevent access to the free part from being locked. This logic could be extended to models: if you use a free training framework to produce a model, you are not required to open the model itself or all the data, but you must provide the interface information that allows the model to be reproduced, adapted or fine-tuned. It is the equivalent, for AI, of what the LGPL calls the “Corresponding Application Code”: not all of the work, but what is necessary for the freedoms on the free part to remain effectively exercisable.

This approach would rest on adding an exception to current licences (which can be done by adding the exception at the root of the program, together with a specific mention in the project documentation and its metadata). The software licence would contain a clause requiring that, if any part of the software integrates an AI model, all information necessary to adapt or retrain the model be shared. This remains within the scope of the software licence while ensuring the practical exercise of the freedoms. This approach is therefore:

  • A bit less radical than the position of the Free Software Foundation, which holds that a free machine learning application requires that the training code AND the data be free (official position; see also the article by the Software Freedom Conservancy, “FOSS in, FOSS out, FOSS throughout”).
  • Perhaps also more flexible than “Contextual Copyleft” (Shanklin, Hine, Novelli et al.), which would propose an approach depending on the type of model and its use (but which would potentially require covering more than the software that is the subject of the licence, and could therefore be considered contrary to the OSD, which states that a licence cannot “extend” to other works).

This type of use would also make it possible to remove the ambiguity surrounding the 16 million files under GPL/AGPL found in training datasets (arXiv:2403.15230). To date, no court has ruled on whether weights constitute a “derivative work” under copyright. By integrating the obligation in the software licence itself, one circumvents this uncertainty: the obligation arises from the licence contract, not from a legal qualification on which consensus will only emerge later.

Starting from the principle that the weights of an AI model integrated in software are functionally equivalent to object code, the source code of software containing an AI model component would need to include the information necessary to modify this component (just as it must include the source code of every other part of the program).

See the dedicated article with the “AI model component exception”.

The question of training data

Many AI actors today refuse to publish their training data, given the risk of infringement actions that this could create. This argument is partly hypocritical: declining to publish the data for this reason amounts to admitting a violation. The Darcos Act (voted by the Senate, April 2026) and the report of the European Parliament (March 2026) point to a “presumption of use” which will favour infringement actions. Let us recall that there are two cases in which requesting data sharing is legitimate:

  • Regulation (the AI Act requires traceability and auditability, so the data must be accessible for compliance);
  • Contractual constraints such as so-called copyleft licences: if the upstream model requires that the data be free (via ODbL or otherwise), then not sharing is a violation of the licence (the obligation follows from the desire not to be an infringer).

Apart from these two cases, full publication is neither realistic (privacy, health, volume, trade secrets) nor necessary.

Thus, unless the legislator considers that a copyright exception can be justified in this specific case, it would be better to work out precisely which obligation the legislator wishes to impose within the AI Act and to spell it out, rather than referring to licences and to Open Source movements, which will potentially demand more stringent constraints. This would avoid mixing what falls under transparency (for the AI Act) with what falls under Open Source (for the communities that wish to be able to modify and contribute).

See the Common Pile project from EleutherAI, and the presentation by Marco Ciurcina at EOLE 2024 on open source exceptions in the AI Act.

Conclusion (towards open AI governance)

The three major definitional approaches explored here (regulatory for Europe, principles for the OSI, industrial for the Linux Foundation) reflect legitimate objectives that should not compete with one another.

Licences work well for individual artefacts (code, data) but capture poorly the stakes around models, even if solutions seem to be emerging. Several paths stand out:

  • The ODbL/*sui generis* approach: by considering models as databases protected by the specific database right, ODbL offers a robust framework for imposing obligations without a problematic extension of rights.
  • Adding an obligation relating to models to Open Source licences could also make it possible to respect the limits of the law while ensuring that freedoms remain substantive (see the dedicated article).

Beyond these legal tools, we need a broader reflection on the governance of open AI. Licences are necessary but insufficient, because they do not capture the dynamics at the system level. How does one govern a shared model? Who decides on iterations? How are contributions managed? How does one prevent the openness of weights from reinforcing, rather than reducing, power asymmetries between giants and communities? These are questions of governance, not of licences.

How can we create structures that allow genuine collaboration and genuine participation? How can we avoid the concentration of power around a few “open” suppliers of models? These questions have been explored in our article “How AI Questions and Reinvents Open Source” (in French) and in the work around the Commons AI conference (governance session), and they will remain central in the years to come.

The next edition of the Commons AI conference will be an opportunity to discuss these issues and the underlying projects. Open AI is not just a matter of licences; it is a matter of power and collective governance.

Image credit: remix of blotter_explosion by Señor Codo (2006). This file is licensed under the Creative Commons Attribution-Share Alike 2.0 Generic license.