The State of “Open” Source AI

Exploring Data on AI Model Releases

Author: Gabriel Toscano

Affiliation: Duke University, Sanford School of Public Policy

1.0 Executive Summary

In collaboration with the Open Source Initiative (OSI), this project uses AI model metadata from Hugging Face to understand how “open” AI models are deployed. The goal is to uncover patterns in how the concept of “open” AI is used in practice.

Founded in 1998, OSI is a global nonprofit that advances Open Source software through advocacy, policy research, and engagement across developers, corporations, nonprofits, and governments. The OSI maintains the Open Source Definition (OSD) and, more recently, released the 1.0 Open Source AI Definition (OSAID). Through these definitions, the OSI seeks to ensure that digital systems can be freely accessed, used, modified, and shared by anyone, upholding the four core freedoms of the Open Source philosophy.

The availability and flexibility of Open Source software makes it an attractive, and in some contexts crucial, mechanism for building digital tools within industry and government. Open Source Software (OSS) underpins critical infrastructure and consumer technologies, from electric grids to medical software and smartphone apps. Today, OSS contributes tens of billions in economic output in the U.S. and more than $8 trillion globally. The value derived is expected to grow as Open Source AI is adopted across public and private sectors.

The October 2024 release of the OSAID sought to anchor the term Open Source AI using clear, unambiguous standards. Yet, openness in AI is nascent and inconsistently understood.

Tip: Why now?

The AI boom, driven by unprecedented investment and access to tools, has spawned a flood of models claiming to be open.

1.1 Goals

  • Understand how “open” AI models are being released
  • Analyze key trends in “open” AI model releases

Not in scope

  • Evaluate models for “openness”
  • Evaluate the Open Source AI Definition (OSAID 1.0)

1.2 Data Gathering

Hugging Face is the most widely used platform for AI models, with over 200,000 models hosted on the repository. It allows people to share and use AI models and related datasets.

This study uses AI model metadata downloaded through the Hugging Face Hub API. Metadata includes model name, author, release date, license, last modified date, base model, and download count.

Two searches were performed:

  • Full-text search (N = 20,069): searching for any models where the model name or metadata includes the word “open”
  • Author search (N = 2,028): searching for AI models released by prominent AI labs (Alibaba, DeepSeek, Google, Meta, Microsoft, Mistral, xAI, OpenAI)
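As a sketch of the collection step, both queries can be issued with the `huggingface_hub` client; the exact parameters used for the study are not reproduced here, and `license_from_tags` is a hypothetical helper for pulling the `license:` tag that Hugging Face attaches to model metadata.

```python
def license_from_tags(tags):
    """Extract the license slug from a Hugging Face model's tag list,
    e.g. ['transformers', 'license:apache-2.0'] -> 'apache-2.0'."""
    for tag in tags or []:
        if tag.startswith("license:"):
            return tag.split(":", 1)[1]
    return "unknown"

# Sketch of the two searches (requires `pip install huggingface_hub`
# and network access; counts change as the Hub grows):
#
# from huggingface_hub import HfApi
# api = HfApi()
# open_models = api.list_models(search="open", limit=100)   # full-text search
# qwen_models = api.list_models(author="Qwen", limit=100)   # author search
# for m in open_models:
#     print(m.id, license_from_tags(m.tags))
```

Models with no `license:` tag fall through to `"unknown"`, which is how the large “unknown license” share discussed below surfaces in the raw data.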

1.3 Key Findings

Preliminary exploratory analysis of the Hugging Face data points to differing practices in how developers signal openness in AI. These results illustrate some consistency with the OSAID as well as friction surrounding licensing practices.

  • The overwhelming majority of “open” models are derived from larger base models
  • Apache 2.0 is the most popular OSI-approved license, followed by MIT
  • CC-by licenses are prevalent, despite Creative Commons’ recommendations against using CC licenses for software
  • A majority of models in this sample, over 50%, are released with an “unknown” license
  • Alibaba’s Qwen family of models are the most popular base model in this sample
  • Custom licenses like Qwen, Llama, Gemma, Grok, and OpenRAIL are becoming increasingly common, especially for flagship models, yet impose usage restrictions

1.4 Presentation

Preliminary study findings were presented at the All Things Open conference in Raleigh, NC, USA, in October 2025.

Presentation video

2.0 Licensing Environment

Two OSI-approved Open Source licenses, Apache 2.0 and MIT, are the most popular licenses, accounting for 28% of all models in the sample. A much smaller share (3%) is released under CC-by licenses, despite Creative Commons’ advice against using its licenses for software, as they don’t specify how source code can be distributed.

A majority of models (58%) are released with an “unknown” license, pointing to a lack of standardization in how data about models is collected and a general lack of enforceability in requiring a license for AI model releases on Hugging Face.

The full list of OSI-approved licenses is available online.

Tip: Why this matters

This finding suggests that the code component of a significant portion of models is compatible with the OSAID. However, a much larger subset either omits licensing information, uses a custom license, or applies a license not appropriate for software.
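To make the groupings discussed here concrete, license slugs can be bucketed into coarse categories. The category sets below are a small illustrative subset drawn from the licenses appearing in this sample, not an exhaustive mapping.

```python
# Illustrative (non-exhaustive) license buckets from this sample
OSI_APPROVED = {"apache-2.0", "mit", "bsd-3-clause", "gpl-3.0"}
CC_LICENSES = {"cc-by-4.0", "cc-by-nc-4.0", "cc-by-sa-4.0", "cc-by-nc-sa-4.0"}
CUSTOM = {"llama2", "llama3", "llama3.1", "llama3.2", "gemma",
          "creativeml-openrail-m"}

def license_bucket(slug):
    """Map a normalized license slug to a coarse openness category."""
    slug = (slug or "unknown").strip().lower()
    if slug in OSI_APPROVED:
        return "osi-approved"
    if slug in CC_LICENSES:
        return "creative-commons"
    if slug in CUSTOM or slug == "other":
        return "custom/other"
    return "unknown"

# Example: bucket a handful of slugs as they appear in the metadata
sample = ["apache-2.0", "mit", "llama3", "cc-by-4.0", None]
buckets = [license_bucket(s) for s in sample]
```

Applied to the full dataset (e.g. `df['license'].map(license_bucket).value_counts()`), this reproduces the OSI-approved / CC / custom / unknown shares reported above.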

2.1 Top 10 Licenses

Click to show code
import pandas as pd

# Load your CSV file 
df = pd.read_csv('model_data/primary_datasets/hf_models_open_raw.csv')

# Count occurrences of each license
license_counts = df['license'].value_counts().reset_index()
license_counts.columns = ['license', 'count']

# Calculate proportion
total_models = len(df)
license_counts['proportion_percent'] = round(100*(license_counts['count'] / total_models), 2)

license_counts[:10]
license count proportion_percent
0 apache-2.0 4697 23.40
1 mit 1086 5.41
2 other 814 4.06
3 cc-by-4.0 229 1.14
4 llama2 223 1.11
5 cc-by-nc-4.0 222 1.11
6 llama3 222 1.11
7 creativeml-openrail-m 111 0.55
8 llama3.1 106 0.53
9 llama3.2 82 0.41

2.2 License use over time

Click to show code
import pandas as pd
import matplotlib.pyplot as plt

def plot_license_trends(
    df: pd.DataFrame,
    date_col: str = "date_released",
    license_col: str = "license",
    licenses_to_include: list = None,
    freq: str = "M",   # 'M' for month, 'Y' for year
    top_n: int = None,
    kind: str = "line",
    figsize=(12,6)
):
    """
    Plots how selected licenses are used over time.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    date_col : str
        Name of the column with dates (e.g. 'created_at' or 'last_modified').
    license_col : str
        Name of the column with license names.
    licenses_to_include : list, optional
        A list of license names to include (e.g. ['mit', 'apache-2.0']).
        If None, all non-unknown licenses are used.
    freq : str, default='M'
        Time frequency for aggregation ('M' for month, 'Y' for year).
    top_n : int, optional
        If provided, only the top N most frequent licenses are plotted.
    kind : str, default='line'
        'line' or 'area' chart type.
    figsize : tuple, default=(12,6)
        Figure size for the plot.
    """

    # 1️⃣ Convert to datetime
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")

    # 2️⃣ Clean up licenses (after astype(str), missing values become the
    # string "nan", so drop those along with explicit "unknown")
    df[license_col] = df[license_col].astype(str).str.strip().str.lower()
    df = df[~df[license_col].isin(["nan", "none", "", "unknown"])]

    # 3️⃣ Filter for licenses of interest
    if licenses_to_include:
        licenses = [l.lower() for l in licenses_to_include]
        df = df[df[license_col].isin(licenses)]

    # 4️⃣ Create period column (e.g., year-month)
    df["period"] = df[date_col].dt.to_period(freq).astype(str)

    # 5️⃣ Group and pivot
    grouped = (
        df.groupby(["period", license_col])
        .size()
        .reset_index(name="count")
    )

    pivoted = grouped.pivot(
        index="period", columns=license_col, values="count"
    ).fillna(0)

    # 6️⃣ Sort by date
    pivoted.index = pd.to_datetime(pivoted.index)
    pivoted = pivoted.sort_index()

    # 7️⃣ Optionally select top N
    if top_n:
        top_cols = pivoted.sum().sort_values(ascending=False).head(top_n).index
        pivoted = pivoted[top_cols]

    # 8️⃣ Plot (pivoted.plot creates its own figure, so a separate
    # plt.figure() call would leave a stray empty figure behind)
    if kind == "area":
        pivoted.plot.area(figsize=figsize, alpha=0.8)
    else:
        pivoted.plot(kind="line", linewidth=2, figsize=figsize)

    plt.title("License Usage Over Time")
    plt.xlabel("Date")
    plt.ylabel("Number of Models")
    plt.legend(title="License", bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.show()

    return pivoted
Click to show code
# Copy original dataframe
df_temp = open_df.copy()

# Drop rows with missing licenses
df_temp = df_temp[df_temp['license'].notna()].copy()

# Normalize licenses (lowercase and remove spaces)
df_temp['license'] = df_temp['license'].astype(str).str.lower().str.strip()

# Combine all licenses that contain 'llama' into one label
df_temp['license_combined'] = df_temp['license'].apply(
    lambda x: 'llama-family' if 'llama' in x else x
)

# Optional: check results
# print(f"Rows kept: {len(df_temp)}")

# df_temp['license_combined'].value_counts().head(10)

## Plot the top 5 licenses over time
plot_license_trends(
    df=df_temp,
    date_col="date_released",
    license_col="license_combined",
    freq="M",
    top_n=5,
    figsize=(8,5)
)

license_combined apache-2.0 mit other llama-family cc-by-4.0
period
2022-03-01 10.0 13.0 0.0 0.0 5.0
2022-04-01 1.0 0.0 0.0 0.0 1.0
2022-05-01 0.0 1.0 0.0 0.0 0.0
2022-06-01 3.0 2.0 0.0 0.0 1.0
2022-07-01 0.0 0.0 0.0 0.0 0.0
2022-08-01 2.0 0.0 0.0 0.0 0.0
2022-09-01 11.0 11.0 2.0 0.0 0.0
2022-10-01 4.0 16.0 0.0 0.0 1.0
2022-11-01 18.0 3.0 0.0 0.0 1.0
2022-12-01 3.0 5.0 0.0 0.0 0.0
2023-01-01 7.0 11.0 0.0 0.0 0.0
2023-02-01 6.0 5.0 0.0 0.0 0.0
2023-03-01 7.0 6.0 2.0 0.0 0.0
2023-04-01 28.0 4.0 8.0 0.0 1.0
2023-05-01 40.0 11.0 2.0 0.0 1.0
2023-06-01 81.0 9.0 8.0 0.0 0.0
2023-07-01 38.0 29.0 23.0 5.0 2.0
2023-08-01 43.0 11.0 12.0 16.0 0.0
2023-09-01 32.0 12.0 9.0 29.0 0.0
2023-10-01 92.0 21.0 7.0 15.0 0.0
2023-11-01 166.0 20.0 8.0 9.0 4.0
2023-12-01 258.0 24.0 18.0 9.0 11.0
2024-01-01 250.0 21.0 15.0 9.0 3.0
2024-02-01 148.0 19.0 20.0 15.0 1.0
2024-03-01 142.0 18.0 27.0 4.0 3.0
2024-04-01 145.0 24.0 48.0 45.0 1.0
2024-05-01 121.0 28.0 28.0 67.0 1.0
2024-06-01 162.0 33.0 27.0 26.0 0.0
2024-07-01 155.0 42.0 36.0 7.0 1.0
2024-08-01 73.0 28.0 38.0 11.0 2.0
2024-09-01 39.0 19.0 18.0 8.0 0.0
2024-10-01 93.0 33.0 25.0 37.0 2.0
2024-11-01 231.0 45.0 112.0 74.0 3.0
2024-12-01 197.0 42.0 62.0 39.0 3.0
2025-01-01 100.0 37.0 36.0 52.0 0.0
2025-02-01 202.0 43.0 29.0 30.0 2.0
2025-03-01 163.0 94.0 22.0 25.0 0.0
2025-04-01 295.0 93.0 47.0 36.0 37.0
2025-05-01 251.0 72.0 36.0 45.0 25.0
2025-06-01 158.0 56.0 35.0 2.0 7.0
2025-07-01 528.0 44.0 32.0 16.0 75.0
2025-08-01 182.0 31.0 16.0 21.0 31.0
2025-09-01 206.0 38.0 5.0 8.0 4.0
2025-10-01 6.0 12.0 1.0 0.0 0.0

2.3 Licenses by author

Click to show code
# --- Config (edit these) ---
CSV_PATH   = "hf_models_by_author.csv"   # your existing CSV with model rows
OUT_DIR    = "authors/author_license_counts"    # where to save outputs
AUTHOR_COL = "owner"                              # column name for author/owner
LICENSE_COL= "license"                             # column name for license string
ID_COL     = "repo_id"                                  # optional: unique model id to drop dups (set None to skip)
# ---------------------------

import os, re
import pandas as pd
from pathlib import Path

Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

# Load
df = pd.read_csv(CSV_PATH)

# Optional: drop duplicates by model id if your CSV may have repeats
if ID_COL and ID_COL in df.columns:
    df = df.drop_duplicates(subset=[ID_COL])

# Keep only needed columns; guard missing cols
missing = [c for c in [AUTHOR_COL, LICENSE_COL] if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns in CSV: {missing}")

work = df[[AUTHOR_COL, LICENSE_COL]].copy()

# Normalize author
work[AUTHOR_COL] = work[AUTHOR_COL].fillna("").astype(str).str.strip()
work.loc[work[AUTHOR_COL] == "", AUTHOR_COL] = "UNKNOWN_AUTHOR"

# Normalize and split license strings:
# - lower case
# - replace separators (comma/semicolon/slash/pipe) with commas
# - remove extra spaces
# - split into multiple rows (explode)
def normalize_license(s: str) -> str:
    s = (s or "").strip()
    if not s:
        return "unknown"
    s = s.lower()
    # common synonyms/variants
    synonyms = {
        "apache2": "apache-2.0",
        "apache 2.0": "apache-2.0",
        "apache-2": "apache-2.0",
        "mit license": "mit",
        "bsd-3": "bsd-3-clause",
        "bsd-3-clause license": "bsd-3-clause",
        "cc by 4.0": "cc-by-4.0",
        "cc-by": "cc-by-4.0",
        "cc-by v4": "cc-by-4.0",
        "cc-by-4": "cc-by-4.0",
        "cc-by 4.0": "cc-by-4.0",
        "creative commons attribution 4.0": "cc-by-4.0",
        "proprietary license": "proprietary",
        "unknown license": "unknown",
    }
    s = synonyms.get(s, s)
    return s

# Replace various separators with commas, then split
sep_pattern = re.compile(r"[;,/|]+")
work[LICENSE_COL] = (
    work[LICENSE_COL]
    .fillna("unknown")
    .astype(str)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.replace(sep_pattern, ",", regex=True)
)

# Split and explode to one license per row
work = (
    work
    .assign(**{LICENSE_COL: work[LICENSE_COL].str.split(",")})
    .explode(LICENSE_COL, ignore_index=True)
)

# Final clean of license tokens
work[LICENSE_COL] = (
    work[LICENSE_COL]
    .astype(str)
    .str.strip()
    .pipe(lambda s: s.where(s != "", "unknown"))
    .map(normalize_license)
)

# TALL: counts per author x license
counts_tall = (
    work
    .groupby([AUTHOR_COL, LICENSE_COL], dropna=False)
    .size()
    .reset_index(name="count")
    .sort_values([AUTHOR_COL, "count"], ascending=[True, False])
)

# WIDE: pivot to one row per author with license columns
counts_wide = (
    counts_tall
    .pivot(index=AUTHOR_COL, columns=LICENSE_COL, values="count")
    .fillna(0)
    .astype(int)
    .sort_index()
)
counts_wide["TOTAL"] = counts_wide.sum(axis=1)
counts_wide = counts_wide.sort_values("TOTAL", ascending=False)

# Save
tall_path = Path(OUT_DIR) / "author_license_counts_tall.csv"
wide_path = Path(OUT_DIR) / "author_license_counts_wide.csv"
counts_tall.to_csv(tall_path, index=False)
counts_wide.to_csv(wide_path)

# Preview
# print(f"Saved:\n  {tall_path}\n  {wide_path}")
display(counts_tall.head(20))
display(counts_wide.head(20))
owner license count
0 Qwen apache-2.0 236
1 Qwen other 110
2 Qwen unknown 16
4 deepseek-ai other 41
3 deepseek-ai mit 20
5 deepseek-ai unknown 17
6 google apache-2.0 638
9 google gemma 329
14 google unknown 30
13 google other 24
7 google cc-by-4.0 23
11 google llama3 2
12 google mit 2
8 google cc-by-nc-4.0 1
10 google llama2 1
15 meta-llama llama2 25
18 meta-llama llama3.2 15
20 meta-llama other 13
17 meta-llama llama3.1 11
16 meta-llama llama3 5
license apache-2.0 bigscience-bloom-rail-1.0 cc-by-4.0 cc-by-nc-4.0 cc-by-nc-sa-4.0 cdla-permissive-2.0 creativeml-openrail-m gemma llama2 llama3 llama3.1 llama3.2 llama3.3 mit ms-pl other unknown TOTAL
owner
google 638 0 23 1 0 0 0 329 1 2 0 0 0 2 0 24 30 1050
microsoft 80 2 0 0 7 1 1 0 0 1 0 0 0 236 1 8 90 427
Qwen 236 0 0 0 0 0 0 0 0 0 0 0 0 0 0 110 16 362
deepseek-ai 0 0 0 0 0 0 0 0 0 0 0 0 0 20 0 41 17 78
meta-llama 0 0 0 0 0 0 0 0 25 5 11 15 1 0 0 13 0 70
mistralai 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 39
xai-org 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2

2.4 Custom Licenses

Custom licenses like Qwen, Llama, Gemma, Grok, and OpenRAIL are becoming increasingly common, particularly for large foundation models. Most, if not all, of these custom licenses impose usage restrictions that appear incompatible with the four freedoms of Open Source by limiting the domain and purpose of the system’s use.

Tip: Why this matters

This trend reflects a growing prevalence of “open washing,” in which developers gain the reputational and adoption benefits associated with Open Source software while simultaneously restricting model use and shifting liability onto users.

4.0 Conclusion & Next Steps

Collectively, these preliminary findings begin to delineate the gaps between rhetoric and reality in “open” AI development. Most importantly, early results underscore a deep chasm between how openness is signaled in practice and how the OSAID defines it.

Moving forward, the next phase of the analysis will expand on the quantitative findings with further trend and network analysis. For the qualitative portion, quantitative findings will be integrated with assessments of how Open Source AI is defined in federal and state policy documents.

This is a living project, and I’m eager to collaborate.

I plan to extend this study by:

  • Conducting network analysis of model relationships and building a model genealogy.
  • Tracking license propagation to see whether downstream models inherit restrictions or ignore them.
  • Analyzing download and reuse trends to measure real-world impact.
  • Studying documentation practices and how developers describe openness.
  • Searching for and investigating the datasets associated with these models.
  • Connecting these findings to policy frameworks, since future AI legislation will hinge on how “open source AI” is defined and applied.
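As a first step toward the planned genealogy and network analysis, parent-to-child relationships from the base-model metadata field can be assembled into a simple adjacency map and queried for descendant counts. This is a minimal sketch with hypothetical model IDs; the full analysis would use a dedicated graph library.

```python
from collections import defaultdict

def build_genealogy(pairs):
    """Build a parent -> [children] adjacency map from
    (model, base_model) pairs; models without a base are roots."""
    children = defaultdict(list)
    for model, base in pairs:
        if base:
            children[base].append(model)
    return children

def count_descendants(children, root):
    """Count all transitive descendants of `root` via iterative DFS."""
    stack, seen = [root], set()
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)

# Hypothetical rows: (model_id, base_model_id or None for a root model)
rows = [
    ("Qwen/Qwen2-7B", None),
    ("org-a/qwen2-ft", "Qwen/Qwen2-7B"),
    ("org-b/qwen2-ft-dpo", "org-a/qwen2-ft"),
]
tree = build_genealogy(rows)
print(count_descendants(tree, "Qwen/Qwen2-7B"))  # -> 2
```

Joining each edge with the license of parent and child would then make license propagation (the second bullet above) directly measurable.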