In collaboration with the Open Source Initiative (OSI), this project uses AI model metadata from Hugging Face to examine how “open” AI models are deployed. The goal is to uncover patterns in how the concept of “open” AI is used in practice.
Founded in 1998, OSI is a global nonprofit that advances Open Source software through advocacy, policy research, and engagement across developers, corporations, nonprofits, and governments. The OSI maintains the Open Source Definition (OSD) and, more recently, released the 1.0 Open Source AI Definition (OSAID). Through these definitions, the OSI seeks to ensure that digital systems can be freely accessed, used, modified, and shared by anyone, upholding the four core freedoms of the Open Source philosophy.
The availability and flexibility of Open Source software makes it an attractive, and in some contexts crucial, mechanism for building digital tools within industry and government. Open Source Software (OSS) underpins critical infrastructure and consumer technologies, from electric grids to medical software and smartphone apps. Today, OSS contributes tens of billions in economic output in the U.S. and more than $8 trillion globally. The value derived is expected to grow as Open Source AI is adopted across public and private sectors.
The October 2024 release of the OSAID sought to anchor the term Open Source AI using clear, unambiguous standards. Yet, openness in AI is nascent and inconsistently understood.
Tip: Why now?
The AI boom, driven by unprecedented investment and access to tools, has spawned a flood of models claiming to be open.
1.1 Goals
Understand how “open” AI models are being released
Analyze key trends in “open” AI model releases
This project does not aim to:
Evaluate models for “openness”
Evaluate the Open Source AI Definition (OSAID 1.0)
1.2 Data Gathering
Hugging Face is the most widely used platform for AI models, with over 200,000 models hosted on the repository. It allows people to share and use AI models and related datasets.
This study uses AI model metadata downloaded through the Hugging Face Hub API. The metadata includes model name, author, release date, last modified date, license, base model, and download count.
Two searches were performed:
Full-text search (N = 20,069): searching for any models where the model name or metadata includes the word “open”
Author search (N = 2,028): searching for AI models released by prominent AI labs (Alibaba, DeepSeek, Google, Meta, Microsoft, Mistral, xAI, OpenAI)
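As a sketch of how such metadata can be gathered, the Hub exposes a REST endpoint (`https://huggingface.co/api/models`) that supports both query types. The helper below is a minimal illustration; the row field names mirror the metadata listed above and are assumptions, not the study's exact schema.

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

HUB_API = "https://huggingface.co/api/models"

def build_query(search=None, author=None, limit=1000):
    """Build a Hub API query URL for either a full-text or an author search."""
    params = {"limit": limit, "full": "true"}  # full=true includes card metadata
    if search:
        params["search"] = search
    if author:
        params["author"] = author
    return f"{HUB_API}?{urlencode(params)}"

def to_row(model: dict) -> dict:
    """Keep only the metadata fields used in this study (field names are assumptions)."""
    card = model.get("cardData") or {}
    return {
        "model_id": model.get("id"),
        "author": model.get("author"),
        "date_released": model.get("createdAt"),
        "last_modified": model.get("lastModified"),
        "license": card.get("license", "unknown"),
        "downloads": model.get("downloads"),
    }

# Usage (requires network):
# with urlopen(build_query(search="open", limit=100)) as resp:   # full-text search
#     rows = [to_row(m) for m in json.load(resp)]
# with urlopen(build_query(author="Qwen", limit=100)) as resp:   # author search
#     rows = [to_row(m) for m in json.load(resp)]
```

In practice the study's two samples correspond to one `search="open"` query and one query per lab author.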
1.3 Key Findings
Preliminary exploratory analysis of the Hugging Face data points to differing practices in how developers signal openness in AI. These results illustrate some consistency with the OSAID as well as friction surrounding licensing practices.
The overwhelming majority of “open” models are based on larger models
Apache 2.0 is the most popular OSI-approved license, followed by MIT
CC-by licenses are prevalent, despite Creative Commons’ recommendations against using CC licenses for software
The majority, over 50% of all models in this sample, are released with an “unknown” license
Alibaba’s Qwen family of models is the most popular base model in this sample
Custom licenses like Qwen, Llama, Gemma, Grok, and OpenRAIL are becoming increasingly common, especially for flagship models, yet impose usage restrictions
1.4 Presentation
Preliminary study findings were presented at the All Things Open conference hosted in Raleigh, NC, USA in October 2025.
Two OSI-approved Open Source licenses (Apache 2.0 and MIT) are the most popular licenses, accounting for 28% of all models in the sample. A much smaller share (3%) is released under CC-BY licenses, despite Creative Commons’ advice against using its licenses for software, as they do not specify how source code may be distributed.
A majority of models (58%) are released with an “unknown” license, pointing to a lack of standardization in how data about models is collected and a general lack of enforceability in requiring a license for AI model releases on Hugging Face.
This finding suggests that the code component of a significant portion of models is compatible with the OSAID. However, a much larger subset either omits licensing information, uses a custom license, or applies a license not appropriate for software.
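One way to make this split concrete is to bucket raw license tags into coarse categories. The category sets below are illustrative assumptions for this sketch, not an official OSI classification:

```python
import pandas as pd

# Illustrative buckets; not an official OSI mapping.
OSI_APPROVED = {"apache-2.0", "mit", "bsd-3-clause", "gpl-3.0", "ms-pl"}
CC_LICENSES = {"cc-by-4.0", "cc-by-nc-4.0", "cc-by-nc-sa-4.0"}
CUSTOM = {"llama2", "llama3", "llama3.1", "llama3.2", "llama3.3",
          "gemma", "qwen", "creativeml-openrail-m"}

def bucket(license_tag) -> str:
    """Map a raw Hugging Face license tag to a coarse category."""
    tag = (license_tag or "unknown").strip().lower()
    if tag in OSI_APPROVED:
        return "osi-approved"
    if tag in CC_LICENSES:
        return "creative-commons"
    if tag in CUSTOM:
        return "custom"
    return "unknown/other"

# Example on a toy frame mirroring the study's 'license' column:
df = pd.DataFrame({"license": ["apache-2.0", "llama3", "cc-by-4.0", None]})
print(df["license"].map(bucket).value_counts())
```

Applied to the full sample, a mapping like this would quantify how many models fall in each of the three groups described above.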
2.1 Top 10 Licenses
```python
import pandas as pd

# Load the scraped model metadata
df = pd.read_csv('model_data/primary_datasets/hf_models_open_raw.csv')

# Count occurrences of each license
license_counts = df['license'].value_counts().reset_index()
license_counts.columns = ['license', 'count']

# Calculate each license's share of all models
total_models = len(df)
license_counts['proportion_percent'] = round(100 * (license_counts['count'] / total_models), 2)

license_counts[:10]
```
| license | count | proportion_percent |
|---|---|---|
| apache-2.0 | 4697 | 23.40 |
| mit | 1086 | 5.41 |
| other | 814 | 4.06 |
| cc-by-4.0 | 229 | 1.14 |
| llama2 | 223 | 1.11 |
| cc-by-nc-4.0 | 222 | 1.11 |
| llama3 | 222 | 1.11 |
| creativeml-openrail-m | 111 | 0.55 |
| llama3.1 | 106 | 0.53 |
| llama3.2 | 82 | 0.41 |
2.2 License use over time
```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_license_trends(
    df: pd.DataFrame,
    date_col: str = "date_released",
    license_col: str = "license",
    licenses_to_include: list = None,
    freq: str = "M",  # 'M' for month, 'Y' for year
    top_n: int = None,
    kind: str = "line",
    figsize=(12, 6),
):
    """
    Plots how selected licenses are used over time.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    date_col : str
        Name of the column with dates (e.g. 'created_at' or 'last_modified').
    license_col : str
        Name of the column with license names.
    licenses_to_include : list, optional
        A list of license names to include (e.g. ['mit', 'apache-2.0']).
        If None, all non-unknown licenses are used.
    freq : str, default='M'
        Time frequency for aggregation ('M' for month, 'Y' for year).
    top_n : int, optional
        If provided, only the top N most frequent licenses are plotted.
    kind : str, default='line'
        'line' or 'area' chart type.
    figsize : tuple, default=(12, 6)
        Figure size for the plot.
    """
    # 1. Convert to datetime
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")

    # 2. Clean up licenses
    df[license_col] = df[license_col].astype(str).str.strip().str.lower()
    df = df[df[license_col].notna() & (df[license_col] != "unknown")]

    # 3. Filter for licenses of interest
    if licenses_to_include:
        licenses = [l.lower() for l in licenses_to_include]
        df = df[df[license_col].isin(licenses)]

    # 4. Create period column (e.g., year-month)
    df["period"] = df[date_col].dt.to_period(freq).astype(str)

    # 5. Group and pivot
    grouped = (
        df.groupby(["period", license_col])
        .size()
        .reset_index(name="count")
    )
    pivoted = grouped.pivot(index="period", columns=license_col, values="count").fillna(0)

    # 6. Sort by date
    pivoted.index = pd.to_datetime(pivoted.index)
    pivoted = pivoted.sort_index()

    # 7. Optionally select top N
    if top_n:
        top_cols = pivoted.sum().sort_values(ascending=False).head(top_n).index
        pivoted = pivoted[top_cols]

    # 8. Plot (no extra plt.figure() call: pivoted.plot creates its own figure)
    if kind == "area":
        pivoted.plot.area(figsize=figsize, alpha=0.8)
    else:
        pivoted.plot(kind="line", linewidth=2, figsize=figsize)
    plt.title("License Usage Over Time")
    plt.xlabel("Date")
    plt.ylabel("Number of Models")
    plt.legend(title="License", bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.show()
    return pivoted
```
```python
# Copy original dataframe
df_temp = open_df.copy()

# Drop rows with missing licenses
df_temp = df_temp[df_temp['license'].notna()].copy()

# Normalize licenses (lowercase and remove spaces)
df_temp['license'] = df_temp['license'].astype(str).str.lower().str.strip()

# Combine all licenses that contain 'llama' into one label
df_temp['license_combined'] = df_temp['license'].apply(
    lambda x: 'llama-family' if 'llama' in x else x
)

# Plot the top 5 licenses over time
plot_license_trends(
    df=df_temp,
    date_col="date_released",
    license_col="license_combined",
    freq="M",
    top_n=5,
    figsize=(8, 5),
)
```
Monthly model releases by license (top 5 licenses; “llama-family” combines all Llama license variants):

| period | apache-2.0 | mit | other | llama-family | cc-by-4.0 |
|---|---|---|---|---|---|
| 2022-03 | 10 | 13 | 0 | 0 | 5 |
| 2022-04 | 1 | 0 | 0 | 0 | 1 |
| 2022-05 | 0 | 1 | 0 | 0 | 0 |
| 2022-06 | 3 | 2 | 0 | 0 | 1 |
| 2022-07 | 0 | 0 | 0 | 0 | 0 |
| 2022-08 | 2 | 0 | 0 | 0 | 0 |
| 2022-09 | 11 | 11 | 2 | 0 | 0 |
| 2022-10 | 4 | 16 | 0 | 0 | 1 |
| 2022-11 | 18 | 3 | 0 | 0 | 1 |
| 2022-12 | 3 | 5 | 0 | 0 | 0 |
| 2023-01 | 7 | 11 | 0 | 0 | 0 |
| 2023-02 | 6 | 5 | 0 | 0 | 0 |
| 2023-03 | 7 | 6 | 2 | 0 | 0 |
| 2023-04 | 28 | 4 | 8 | 0 | 1 |
| 2023-05 | 40 | 11 | 2 | 0 | 1 |
| 2023-06 | 81 | 9 | 8 | 0 | 0 |
| 2023-07 | 38 | 29 | 23 | 5 | 2 |
| 2023-08 | 43 | 11 | 12 | 16 | 0 |
| 2023-09 | 32 | 12 | 9 | 29 | 0 |
| 2023-10 | 92 | 21 | 7 | 15 | 0 |
| 2023-11 | 166 | 20 | 8 | 9 | 4 |
| 2023-12 | 258 | 24 | 18 | 9 | 11 |
| 2024-01 | 250 | 21 | 15 | 9 | 3 |
| 2024-02 | 148 | 19 | 20 | 15 | 1 |
| 2024-03 | 142 | 18 | 27 | 4 | 3 |
| 2024-04 | 145 | 24 | 48 | 45 | 1 |
| 2024-05 | 121 | 28 | 28 | 67 | 1 |
| 2024-06 | 162 | 33 | 27 | 26 | 0 |
| 2024-07 | 155 | 42 | 36 | 7 | 1 |
| 2024-08 | 73 | 28 | 38 | 11 | 2 |
| 2024-09 | 39 | 19 | 18 | 8 | 0 |
| 2024-10 | 93 | 33 | 25 | 37 | 2 |
| 2024-11 | 231 | 45 | 112 | 74 | 3 |
| 2024-12 | 197 | 42 | 62 | 39 | 3 |
| 2025-01 | 100 | 37 | 36 | 52 | 0 |
| 2025-02 | 202 | 43 | 29 | 30 | 2 |
| 2025-03 | 163 | 94 | 22 | 25 | 0 |
| 2025-04 | 295 | 93 | 47 | 36 | 37 |
| 2025-05 | 251 | 72 | 36 | 45 | 25 |
| 2025-06 | 158 | 56 | 35 | 2 | 7 |
| 2025-07 | 528 | 44 | 32 | 16 | 75 |
| 2025-08 | 182 | 31 | 16 | 21 | 31 |
| 2025-09 | 206 | 38 | 5 | 8 | 4 |
| 2025-10 | 6 | 12 | 1 | 0 | 0 |
2.3 Licenses by author
```python
import re
import pandas as pd
from pathlib import Path

# --- Config (edit these) ---
CSV_PATH = "hf_models_by_author.csv"       # your existing CSV with model rows
OUT_DIR = "authors/author_license_counts"  # where to save outputs
AUTHOR_COL = "owner"                       # column name for author/owner
LICENSE_COL = "license"                    # column name for license string
ID_COL = "repo_id"                         # optional: unique model id to drop dups (set None to skip)
# ---------------------------

Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

# Load
df = pd.read_csv(CSV_PATH)

# Optional: drop duplicates by model id if your CSV may have repeats
if ID_COL and ID_COL in df.columns:
    df = df.drop_duplicates(subset=[ID_COL])

# Keep only needed columns; guard missing cols
missing = [c for c in [AUTHOR_COL, LICENSE_COL] if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns in CSV: {missing}")
work = df[[AUTHOR_COL, LICENSE_COL]].copy()

# Normalize author
work[AUTHOR_COL] = work[AUTHOR_COL].fillna("").astype(str).str.strip()
work.loc[work[AUTHOR_COL] == "", AUTHOR_COL] = "UNKNOWN_AUTHOR"

def normalize_license(s: str) -> str:
    """Lowercase a license token and map common synonyms/variants to canonical tags."""
    s = (s or "").strip()
    if not s:
        return "unknown"
    s = s.lower()
    synonyms = {
        "apache2": "apache-2.0",
        "apache 2.0": "apache-2.0",
        "apache-2": "apache-2.0",
        "mit license": "mit",
        "bsd-3": "bsd-3-clause",
        "bsd-3-clause license": "bsd-3-clause",
        "cc by 4.0": "cc-by-4.0",
        "cc-by": "cc-by-4.0",
        "cc-by v4": "cc-by-4.0",
        "cc-by-4": "cc-by-4.0",
        "cc-by 4.0": "cc-by-4.0",
        "creative commons attribution 4.0": "cc-by-4.0",
        "proprietary license": "proprietary",
        "unknown license": "unknown",
    }
    return synonyms.get(s, s)

# Replace various separators (comma/semicolon/slash/pipe) with commas, then split
sep_pattern = re.compile(r"[;,/|]+")
work[LICENSE_COL] = (
    work[LICENSE_COL]
    .fillna("unknown")
    .astype(str)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.replace(sep_pattern, ",", regex=True)
)

# Split and explode to one license per row
work = (
    work
    .assign(**{LICENSE_COL: work[LICENSE_COL].str.split(",")})
    .explode(LICENSE_COL, ignore_index=True)
)

# Final clean of license tokens
work[LICENSE_COL] = (
    work[LICENSE_COL]
    .astype(str)
    .str.strip()
    .pipe(lambda s: s.where(s != "", "unknown"))
    .map(normalize_license)
)

# TALL: counts per author x license
counts_tall = (
    work
    .groupby([AUTHOR_COL, LICENSE_COL], dropna=False)
    .size()
    .reset_index(name="count")
    .sort_values([AUTHOR_COL, "count"], ascending=[True, False])
)

# WIDE: pivot to one row per author with license columns
counts_wide = (
    counts_tall
    .pivot(index=AUTHOR_COL, columns=LICENSE_COL, values="count")
    .fillna(0)
    .astype(int)
    .sort_index()
)
counts_wide["TOTAL"] = counts_wide.sum(axis=1)
counts_wide = counts_wide.sort_values("TOTAL", ascending=False)

# Save
tall_path = Path(OUT_DIR) / "author_license_counts_tall.csv"
wide_path = Path(OUT_DIR) / "author_license_counts_wide.csv"
counts_tall.to_csv(tall_path, index=False)
counts_wide.to_csv(wide_path)

display(counts_tall.head(20))
display(counts_wide.head(20))
```
| owner | license | count |
|---|---|---|
| Qwen | apache-2.0 | 236 |
| Qwen | other | 110 |
| Qwen | unknown | 16 |
| deepseek-ai | other | 41 |
| deepseek-ai | mit | 20 |
| deepseek-ai | unknown | 17 |
| google | apache-2.0 | 638 |
| google | gemma | 329 |
| google | unknown | 30 |
| google | other | 24 |
| google | cc-by-4.0 | 23 |
| google | llama3 | 2 |
| google | mit | 2 |
| google | cc-by-nc-4.0 | 1 |
| google | llama2 | 1 |
| meta-llama | llama2 | 25 |
| meta-llama | llama3.2 | 15 |
| meta-llama | other | 13 |
| meta-llama | llama3.1 | 11 |
| meta-llama | llama3 | 5 |
| owner | apache-2.0 | bigscience-bloom-rail-1.0 | cc-by-4.0 | cc-by-nc-4.0 | cc-by-nc-sa-4.0 | cdla-permissive-2.0 | creativeml-openrail-m | gemma | llama2 | llama3 | llama3.1 | llama3.2 | llama3.3 | mit | ms-pl | other | unknown | TOTAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| google | 638 | 0 | 23 | 1 | 0 | 0 | 0 | 329 | 1 | 2 | 0 | 0 | 0 | 2 | 0 | 24 | 30 | 1050 |
| microsoft | 80 | 2 | 0 | 0 | 7 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 236 | 1 | 8 | 90 | 427 |
| Qwen | 236 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 110 | 16 | 362 |
| deepseek-ai | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20 | 0 | 41 | 17 | 78 |
| meta-llama | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 25 | 5 | 11 | 15 | 1 | 0 | 0 | 13 | 0 | 70 |
| mistralai | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 39 |
| xai-org | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 |
2.4 Custom Licenses
Custom licenses like Qwen, Llama, Gemma, Grok, and OpenRAIL are becoming increasingly common, particularly for large foundation models. Most, if not all, of these proprietary licenses impose usage restrictions that appear incompatible with the four freedoms of Open Source by limiting the domain and purpose of the system’s use.
Tip: Why this matters
This trend reflects a growing prevalence of “open washing,” in which developers gain the reputational and adoption benefits associated with Open Source software while simultaneously restricting model use and redistributing liability.
3.0 Popular Models & Their Licenses
Many models are built using other, larger models. Greater understanding of the terms under which these large models are released will be instrumental as we look further into how developers use and interpret the OSAID and Open Source AI.
3.1 Top 10 Most Popular Models
While the sample has over 20,000 models, many are derived from “base models”: they are smaller, fine-tuned, or quantized versions of a larger base model.
The children_count field counts how many models in our sample use the specified model as a base.
The licensing and AI model publishing practices among popularly-used models will likely have greater downstream influence.
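A minimal sketch of how such a count can be derived from the metadata, assuming a `base_model` column naming each model's parent (mirroring the Hub's base-model tag; the column names here are illustrative):

```python
import pandas as pd

def add_children_count(df: pd.DataFrame) -> pd.DataFrame:
    """Count, for each model, how many models in the sample list it as their base."""
    counts = df["base_model"].value_counts()  # how often each model appears as a parent
    out = df.copy()
    out["children_count"] = out["model_id"].map(counts).fillna(0).astype(int)
    return out

# Toy example: two fine-tunes of "org/base-7b"
df = pd.DataFrame({
    "model_id": ["org/base-7b", "a/ft-1", "b/ft-2"],
    "base_model": [None, "org/base-7b", "org/base-7b"],
})
print(add_children_count(df).sort_values("children_count", ascending=False))
```

Sorting the full sample by this count surfaces the most-reused base models discussed above.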
3.2 Popular Models’ Licenses
At the organization level, licensing practices are surprisingly homogeneous. All organizations in this study, except for OpenAI, release some models under a proprietary (i.e., custom) license and others under OSI-approved Open Source licenses.
In general, organizations release their largest, flagship “open” models under a restrictive, customized license, while smaller or older models are released under permissive or standard Open Source licenses.
Tip: Why this matters
Every custom license (e.g., Llama, Qwen) in this study uses the language of Open Source, including permissions to use, study, share, and modify the system, while at the same time imposing restrictions on how or where the system is used. Often, usage restrictions are subject to Acceptable Use Policies akin to those of consumer apps and services.
3.2.1 Qwen
Qwen License Details
Grant of Rights: You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud’s intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
Restrictions: If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.
3.2.2 DeepSeek (Model License)
DeepSeek License Details
Grant of Copyright License. Subject to the terms and conditions of this License, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.
Use Restrictions: You agree not to use the Model or Derivatives of the Model:
In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
For military use in any way;
For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
To generate or disseminate verifiably false information and/or content with the purpose of harming others;
To generate or disseminate inappropriate content subject to applicable regulatory requirements;
To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
To defame, disparage or otherwise harass others;
For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
3.2.3 Llama
Llama License Details
Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property…to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.
Your use of the Llama Materials must… adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.com/llama3/use-policy), which is hereby incorporated by reference into this Agreement.
Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users of the products or services made… is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta…
With respect to any multimodal models included in Llama 4, the rights granted under Section 1(a) of the Llama 4 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union.
3.2.4 Grok
Grok License Details
Permitted Uses: xAI grants you a non-exclusive, worldwide, revocable license to use, reproduce, distribute, and modify the Materials: For non-commercial and research purposes; and for commercial use solely if you and your affiliates abide by all of the guardrails provided in xAI’s Acceptable Use Policy (https://x.ai/legal/acceptable-use-policy), including 1. Comply with the law, 2. Do not harm people or property, and 3. Respect guardrails and don’t mislead.
Restrictions: You may not use the Materials, derivatives, or outputs (including generated data) to train, create, or improve any foundational, large language, or general-purpose AI models, except for modifications or fine-tuning of Grok 2 permitted under and in accordance with the terms of this Agreement.
Acceptable Use You are responsible for implementing appropriate safety measures, including filters and human oversight, suitable for your use case. You must comply with xAI’s Acceptable Use Policy (AUP), as well as all applicable laws. You may not use the Materials for illegal, harmful, or abusive activities.
3.2.5 Gemma
Gemma License Details
2.2 Use You may use, reproduce, modify, Distribute, perform or display any of the Gemma Services only in accordance with the terms of this Agreement, and must not violate (or encourage or permit anyone else to violate) any term of this Agreement.
3.2 Use Restrictions You must not use any of the Gemma Services: for the restricted uses set forth in the Gemma Prohibited Use Policy at ai.google.dev/gemma/prohibited_use_policy (“Prohibited Use Policy”), which is hereby incorporated by reference into this Agreement; or in violation of applicable laws and regulations.
4.0 Conclusion & Next Steps
Collectively, these preliminary findings start to delineate the gaps between rhetoric and reality in “open” AI development. Most importantly, early results underscore a deep chasm between how openness is signaled and the OSAID.
Moving forward, the next phase of the analysis will expand on these quantitative findings with further trend and network analysis. For the qualitative portion, quantitative findings will be integrated with assessments of how Open Source AI is defined in federal and state policy documents.
This is a living project, and I’m eager to collaborate.
I plan to extend this study by:
Conducting network analysis of model relationships and building a model genealogy.
Tracking license propagation to see if restrictions are spreading correctly or being ignored.
Analyzing download and reuse trends to measure real-world impact.
Studying documentation practices and how developers describe openness.
Searching for and investigating associated datasets.
Connecting these findings to policy frameworks, since future AI legislation will hinge on how “open source AI” is defined and applied.
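The planned genealogy work can be sketched as a directed graph built from parent-child links, where an edge runs from each base model to its derivative. This is a minimal stdlib-only illustration, assuming (model_id, base_model) pairs like those in the metadata; the helper names are hypothetical:

```python
from collections import defaultdict, deque

def build_genealogy(pairs):
    """Map each base model to its direct derivatives (child models)."""
    children = defaultdict(list)
    for model_id, base_model in pairs:
        if base_model:
            children[base_model].append(model_id)
    return children

def descendants(children, root):
    """All transitive derivatives of `root` (fine-tunes, quantizations, etc.), via BFS."""
    seen, queue = set(), deque([root])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Toy example: a base model, a fine-tune, and a quantization of the fine-tune
pairs = [
    ("org/base-7b", None),
    ("a/ft-1", "org/base-7b"),
    ("a/ft-1-gguf", "a/ft-1"),
]
g = build_genealogy(pairs)
print(sorted(descendants(g, "org/base-7b")))  # → ['a/ft-1', 'a/ft-1-gguf']
```

The same edge list could feed a graph library for the richer centrality and propagation analyses described above.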