The State of “Open” Source AI

Exploring Data on AI Model Releases

Author

Affiliation

Gabriel Toscano

Duke University, Sanford School of Public Policy

1.0 Executive Summary

In collaboration the Open Source Initiative (OSI) This project uses AI model metadata from Hugging Face to understand how “open” AI models are deployed. The goal is to uncover patterns in how the concept of “open” AI is used in practice.

Founded in 1998, OSI is a global nonprofit that advances Open Source software through advocacy, policy research, and engagement across developers, corporations, nonprofits, and governments. The OSI maintains the Open Source Definition (OSD) and, more recently, released the 1.0 Open Source AI Definition (OSAID). Through these definitions, the OSI seeks to ensure that digital systems can be freely accessed, used, modified, and shared by anyone, upholding the four core freedoms of the Open Source philosophy.

The availability and flexibility of Open Source software makes it an attractive, and in some contexts crucial, mechanism for building digital tools within industry and government. Open Source Software (OSS) underpins critical infrastructure and consumer technologies, from electric grids to medical software and smartphone apps. Today, OSS contributes tens of billions in economic output in the U.S. and more than $8 trillion globally. The value derived is expected to grow as Open Source AI is adopted across public and private sectors.

The October 2024 release of the OSAID sought to anchor the term Open Source AI using clear, unambiguous standards. Yet, openness in AI is nascent and inconsistently understood.

Why now?

The AI boom, driven by unprecedented investment and access to tools, has spawned a flood of models claiming to be open.

1.1 Goals

Understand how “open” AI models are being released
Analyze key trends in “open” AI model releases

Not

Evaluate models for “openness”
Evaluate the Open Source AI Definition (OSAID 1.0)

1.2 Data Gathering

Hugging Face is the most widely-used platform for AI models with over 200,000 models hosted on the repository. It allows people to share and use AI models and related datasets.

This study uses AI model metadata downloaded through the Hugging Face Hub API. Metadata includes model name,author, release date, license, and last modified date, base-model, and downloads.

Two searches where performed

Full-text-search (N = 20,069): searching for any models where the model name or metadata includes the word “open”
Author search (N=2,028): searching for AI models released by prominent AI labs (Alibaba, Deepseek, Google, Meta, Microsoft, Mistral, XAI, Open AI)

1.3 Key Findings

Preliminary exploratory analysis of the Hugging Face data points to differing practices in how developers signal openness in AI. These results illustrate some consistency with the OSAID as well as friction surrounding licensing practices.

The overwhelming majority of “open” models are based on larger models
Apache 2.0 is the most popular OSI-approved license, followed by MIT
CC-by licenses are prevalent, despite Creative Commons’ recommendations against using CC licenses for software
The vast majority, over 50% of all models in in this sample, are released with an “unknown license”
Alibaba’s Qwen family of models are the most popular base model in this sample
Custom licenses like Qwen, Llama, Gemma, Grok, OpenRAIL are becoming increasingly common, specially for flagship models, yet impose usage restrictions

1.4 Presentation

Preliminary study findings where presented at the All Things Open Conference hosted in Raleigh, NC, USA in October, 2025

Presentation video

2.0 Licensing Environment

Two OSI-approved Open Source licenses––Apache 2.0 and MIT––are the most popular licenses, accounting for 28% of all models in the sample. A much smaller share (3%) are released under CC-by licenses, despite Creative Commons’ advice against using their licenses for software, as they don’t specify how source code can be distributed.

A majority of models (58%) are released with an “unknown” license, pointing to a lack of standardization in how data about models is collected and a general lack of enforceability in requiring a license for AI model releases on Hugging Face.

The full list of OSI-approved licenses are available online.

Why this matters

This finding suggests that the code component of a significant portion of models are compatible the OSAID. However, a much larger subset either omits licensing information, use a custom license or apply a license not appropriate for use with software.

2.1 Top 10 Licenses

Click to show code

import pandas as pd

# Load your CSV file 
df = pd.read_csv('model_data/primary_datasets/hf_models_open_raw.csv')

# Count occurrences of each license
license_counts = df['license'].value_counts().reset_index()
license_counts.columns = ['license', 'count']

# Calculate proportion
total_models = len(df)
license_counts['proportion_percent'] = round(100*(license_counts['count'] / total_models), 2)

license_counts[:10]

	license	count	proportion_percent
0	apache-2.0	4697	23.40
1	mit	1086	5.41
2	other	814	4.06
3	cc-by-4.0	229	1.14
4	llama2	223	1.11
5	cc-by-nc-4.0	222	1.11
6	llama3	222	1.11
7	creativeml-openrail-m	111	0.55
8	llama3.1	106	0.53
9	llama3.2	82	0.41

2.2 License use over time

Click to show code

import pandas as pd
import matplotlib.pyplot as plt

def plot_license_trends(
    df: pd.DataFrame,
    date_col: str = "date_released",
    license_col: str = "license",
    licenses_to_include: list = None,
    freq: str = "M",   # 'M' for month, 'Y' for year
    top_n: int = None,
    kind: str = "line",
    figsize=(12,6)
):
    """
    Plots how selected licenses are used over time.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    date_col : str
        Name of the column with dates (e.g. 'created_at' or 'last_modified').
    license_col : str
        Name of the column with license names.
    licenses_to_include : list, optional
        A list of license names to include (e.g. ['mit', 'apache-2.0']).
        If None, all non-unknown licenses are used.
    freq : str, default='M'
        Time frequency for aggregation ('M' for month, 'Y' for year).
    top_n : int, optional
        If provided, only the top N most frequent licenses are plotted.
    kind : str, default='line'
        'line' or 'area' chart type.
    figsize : tuple, default=(12,6)
        Figure size for the plot.
    """

    # 1️⃣ Convert to datetime
    df = df.copy()
    df[date_col] = pd.to_datetime(df[date_col], errors="coerce")

    # 2️⃣ Clean up licenses
    df[license_col] = df[license_col].astype(str).str.strip().str.lower()
    df = df[df[license_col].notna() & (df[license_col] != "unknown")]

    # 3️⃣ Filter for licenses of interest
    if licenses_to_include:
        licenses = [l.lower() for l in licenses_to_include]
        df = df[df[license_col].isin(licenses)]

    # 4️⃣ Create period column (e.g., year-month)
    df["period"] = df[date_col].dt.to_period(freq).astype(str)

    # 5️⃣ Group and pivot
    grouped = (
        df.groupby(["period", license_col])
        .size()
        .reset_index(name="count")
    )

    pivoted = grouped.pivot(
        index="period", columns=license_col, values="count"
    ).fillna(0)

    # 6️⃣ Sort by date
    pivoted.index = pd.to_datetime(pivoted.index)
    pivoted = pivoted.sort_index()

    # 7️⃣ Optionally select top N
    if top_n:
        top_cols = pivoted.sum().sort_values(ascending=False).head(top_n).index
        pivoted = pivoted[top_cols]

    # 8️⃣ Plot
    plt.figure(figsize=figsize)

    if kind == "area":
        pivoted.plot.area(figsize=figsize, alpha=0.8)
    else:
        pivoted.plot(kind="line", linewidth=2, figsize=figsize)

    plt.title("License Usage Over Time")
    plt.xlabel("Date")
    plt.ylabel("Number of Models")
    plt.legend(title="License", bbox_to_anchor=(1.05, 1), loc="upper left")
    plt.tight_layout()
    plt.show()

    return pivoted

Click to show code

# Copy original dataframe
df_temp = open_df.copy()

# Drop rows with missing licenses
df_temp = df_temp[df_temp['license'].notna()].copy()

# Normalize licenses (lowercase and remove spaces)
df_temp['license'] = df_temp['license'].astype(str).str.lower().str.strip()

# Combine all licenses that contain 'llama' into one label
df_temp['license_combined'] = df_temp['license'].apply(
    lambda x: 'llama-family' if 'llama' in x else x
)

# Optional: check results
# print(f"Rows kept: {len(df_temp)}")

# df_temp['license_combined'].value_counts().head(10)

## Plot the top 5 licenses over time
plot_license_trends(
    df=df_temp,
    date_col="date_released",
    license_col="license_combined",
    freq="M",
    top_n=5,
    figsize=(8,5)
)

<Figure size 768x480 with 0 Axes>

license_combined	apache-2.0	mit	other	llama-family	cc-by-4.0
period
2022-03-01	10.0	13.0	0.0	0.0	5.0
2022-04-01	1.0	0.0	0.0	0.0	1.0
2022-05-01	0.0	1.0	0.0	0.0	0.0
2022-06-01	3.0	2.0	0.0	0.0	1.0
2022-07-01	0.0	0.0	0.0	0.0	0.0
2022-08-01	2.0	0.0	0.0	0.0	0.0
2022-09-01	11.0	11.0	2.0	0.0	0.0
2022-10-01	4.0	16.0	0.0	0.0	1.0
2022-11-01	18.0	3.0	0.0	0.0	1.0
2022-12-01	3.0	5.0	0.0	0.0	0.0
2023-01-01	7.0	11.0	0.0	0.0	0.0
2023-02-01	6.0	5.0	0.0	0.0	0.0
2023-03-01	7.0	6.0	2.0	0.0	0.0
2023-04-01	28.0	4.0	8.0	0.0	1.0
2023-05-01	40.0	11.0	2.0	0.0	1.0
2023-06-01	81.0	9.0	8.0	0.0	0.0
2023-07-01	38.0	29.0	23.0	5.0	2.0
2023-08-01	43.0	11.0	12.0	16.0	0.0
2023-09-01	32.0	12.0	9.0	29.0	0.0
2023-10-01	92.0	21.0	7.0	15.0	0.0
2023-11-01	166.0	20.0	8.0	9.0	4.0
2023-12-01	258.0	24.0	18.0	9.0	11.0
2024-01-01	250.0	21.0	15.0	9.0	3.0
2024-02-01	148.0	19.0	20.0	15.0	1.0
2024-03-01	142.0	18.0	27.0	4.0	3.0
2024-04-01	145.0	24.0	48.0	45.0	1.0
2024-05-01	121.0	28.0	28.0	67.0	1.0
2024-06-01	162.0	33.0	27.0	26.0	0.0
2024-07-01	155.0	42.0	36.0	7.0	1.0
2024-08-01	73.0	28.0	38.0	11.0	2.0
2024-09-01	39.0	19.0	18.0	8.0	0.0
2024-10-01	93.0	33.0	25.0	37.0	2.0
2024-11-01	231.0	45.0	112.0	74.0	3.0
2024-12-01	197.0	42.0	62.0	39.0	3.0
2025-01-01	100.0	37.0	36.0	52.0	0.0
2025-02-01	202.0	43.0	29.0	30.0	2.0
2025-03-01	163.0	94.0	22.0	25.0	0.0
2025-04-01	295.0	93.0	47.0	36.0	37.0
2025-05-01	251.0	72.0	36.0	45.0	25.0
2025-06-01	158.0	56.0	35.0	2.0	7.0
2025-07-01	528.0	44.0	32.0	16.0	75.0
2025-08-01	182.0	31.0	16.0	21.0	31.0
2025-09-01	206.0	38.0	5.0	8.0	4.0
2025-10-01	6.0	12.0	1.0	0.0	0.0

2.3 Licenses by author

Click to show code

# --- Config (edit these) ---
CSV_PATH   = "hf_models_by_author.csv"   # your existing CSV with model rows
OUT_DIR    = "authors/author_license_counts"    # where to save outputs
AUTHOR_COL = "owner"                              # column name for author/owner
LICENSE_COL= "license"                             # column name for license string
ID_COL     = "repo_id"                                  # optional: unique model id to drop dups (set None to skip)
# ---------------------------

import os, re
import pandas as pd
from pathlib import Path

Path(OUT_DIR).mkdir(parents=True, exist_ok=True)

# Load
df = pd.read_csv(CSV_PATH)

# Optional: drop duplicates by model id if your CSV may have repeats
if ID_COL and ID_COL in df.columns:
    df = df.drop_duplicates(subset=[ID_COL])

# Keep only needed columns; guard missing cols
missing = [c for c in [AUTHOR_COL, LICENSE_COL] if c not in df.columns]
if missing:
    raise ValueError(f"Missing required columns in CSV: {missing}")

work = df[[AUTHOR_COL, LICENSE_COL]].copy()

# Normalize author
work[AUTHOR_COL] = work[AUTHOR_COL].fillna("").astype(str).str.strip()
work.loc[work[AUTHOR_COL] == "", AUTHOR_COL] = "UNKNOWN_AUTHOR"

# Normalize and split license strings:
# - lower case
# - replace separators (comma/semicolon/slash/pipe) with commas
# - remove extra spaces
# - split into multiple rows (explode)
def normalize_license(s: str) -> str:
    s = (s or "").strip()
    if not s:
        return "unknown"
    s = s.lower()
    # common synonyms/variants
    synonyms = {
        "apache2": "apache-2.0",
        "apache 2.0": "apache-2.0",
        "apache-2": "apache-2.0",
        "mit license": "mit",
        "bsd-3": "bsd-3-clause",
        "bsd-3-clause license": "bsd-3-clause",
        "cc by 4.0": "cc-by-4.0",
        "cc-by": "cc-by-4.0",
        "cc-by v4": "cc-by-4.0",
        "cc-by-4": "cc-by-4.0",
        "cc-by 4.0": "cc-by-4.0",
        "creative commons attribution 4.0": "cc-by-4.0",
        "proprietary license": "proprietary",
        "unknown license": "unknown",
    }
    s = synonyms.get(s, s)
    return s

# Replace various separators with commas, then split
sep_pattern = re.compile(r"[;,/|]+")
work[LICENSE_COL] = (
    work[LICENSE_COL]
    .fillna("unknown")
    .astype(str)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.replace(sep_pattern, ",", regex=True)
)

# Split and explode to one license per row
work = (
    work
    .assign(**{LICENSE_COL: work[LICENSE_COL].str.split(",")})
    .explode(LICENSE_COL, ignore_index=True)
)

# Final clean of license tokens
work[LICENSE_COL] = (
    work[LICENSE_COL]
    .astype(str)
    .str.strip()
    .pipe(lambda s: s.where(s != "", "unknown"))
    .map(normalize_license)
)

# TALL: counts per author x license
counts_tall = (
    work
    .groupby([AUTHOR_COL, LICENSE_COL], dropna=False)
    .size()
    .reset_index(name="count")
    .sort_values([AUTHOR_COL, "count"], ascending=[True, False])
)

# WIDE: pivot to one row per author with license columns
counts_wide = (
    counts_tall
    .pivot(index=AUTHOR_COL, columns=LICENSE_COL, values="count")
    .fillna(0)
    .astype(int)
    .sort_index()
)
counts_wide["TOTAL"] = counts_wide.sum(axis=1)
counts_wide = counts_wide.sort_values("TOTAL", ascending=False)

# Save
tall_path = Path(OUT_DIR) / "author_license_counts_tall.csv"
wide_path = Path(OUT_DIR) / "author_license_counts_wide.csv"
counts_tall.to_csv(tall_path, index=False)
counts_wide.to_csv(wide_path)

# Preview
# print(f"Saved:\n  {tall_path}\n  {wide_path}")
display(counts_tall.head(20))
display(counts_wide.head(20))

	owner	license	count
0	Qwen	apache-2.0	236
1	Qwen	other	110
2	Qwen	unknown	16
4	deepseek-ai	other	41
3	deepseek-ai	mit	20
5	deepseek-ai	unknown	17
6	google	apache-2.0	638
9	google	gemma	329
14	google	unknown	30
13	google	other	24
7	google	cc-by-4.0	23
11	google	llama3	2
12	google	mit	2
8	google	cc-by-nc-4.0	1
10	google	llama2	1
15	meta-llama	llama2	25
18	meta-llama	llama3.2	15
20	meta-llama	other	13
17	meta-llama	llama3.1	11
16	meta-llama	llama3	5

license	apache-2.0	bigscience-bloom-rail-1.0	cc-by-4.0	cc-by-nc-4.0	cc-by-nc-sa-4.0	cdla-permissive-2.0	creativeml-openrail-m	gemma	llama2	llama3	llama3.1	llama3.2	llama3.3	mit	ms-pl	other	unknown	TOTAL
owner
google	638	0	23	1	0	0	0	329	1	2	0	0	0	2	0	24	30	1050
microsoft	80	2	0	0	7	1	1	0	0	1	0	0	0	236	1	8	90	427
Qwen	236	0	0	0	0	0	0	0	0	0	0	0	0	0	0	110	16	362
deepseek-ai	0	0	0	0	0	0	0	0	0	0	0	0	0	20	0	41	17	78
meta-llama	0	0	0	0	0	0	0	0	25	5	11	15	1	0	0	13	0	70
mistralai	33	0	0	0	0	0	0	0	0	0	0	0	0	0	0	6	0	39
xai-org	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	2

2.4 Custom Licenses

Custom licenses like Qwen, Llama, Gemma, Grok, and OpenRAIL are becoming increasingly common, particularly among for large foundation models. Most, if not all, proprietary licenses impose usage restrictions that appear incompatible with the four freedoms of Open Source by limiting the domain and purpose of the system’s use.

Why this matters

This trend reflects a growing prevalence of “open washing,” in which developers gain the reputational and adoption benefits associated with Open Source software while simultaneously restricting model use and redistributing liability.

3.0 Popular Models & Their Licenses

Many models are built using other, larger models. Greater understanding of the terms under which these large models are released will be instrumental as we look further into how developers use and interpret the OSAID and Open Source AI.

3.1 Top 10 Most Popular Models

While the sample has over 20,000 models, many use other “base models” which are result in smaller models that are fine-tuned or quantized versions of the larger base model.

The children_count parameter looks at how many models in our sample use the specified model as a base.

Click to show code

popular_models_df = pd.read_csv('model_genealogy/most_popular_models_v2.csv')

# popular_models_df.head(20)
display_popular_models = popular_models_df[['model_name', 'num_children']] \
    .head(10) \
    .rename(columns={
        'model_name': 'Model Name',
        'num_children': 'Children Count'
    })

display_popular_models

	Model Name	Children Count
0	Qwen/Qwen2.5-1.5B-Instruct	342
1	Qwen/Qwen2.5-7B-Instruct	253
2	teknium/OpenHermes-2.5-Mistral-7B	121
3	beomi/Llama-3-Open-Ko-8B-Instruct-preview	113
4	Qwen/Qwen2.5-32B-Instruct	90
5	deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B	88
6	mistralai/Mistral-7B-v0.1	87
7	meta-llama/Meta-Llama-3-8B	70
8	aaditya/Llama3-OpenBioLLM-8B	60
9	microsoft/swin-tiny-patch4-window7-224	53

Why this matters?

The licensing and AI model publishing practices among popularly-used models will likely have greater downstream influence.

3.2 Popular Models’ Licenses

At the organization level, licensing practices are surprisingly homogeneous. First, all organizations in this study, except for OpenAI, have models that use a proprietary (i.e. custom) license as well as models that use OSI-approved Open Source licenses.

In general, the organizations are released their largest, flagship ‘open’ model under a restrictive, customized license while smaller or older models are released under permissive or standard Open Source licenses.

Why this matters?

Every custom license (.e.g Llama, Qwen) in this study uses the language of Open Source, including permissions to use, study, share and modify the system while at the same time imposing restrictions on how or where a system is used. Often, usage restrictions are subject to Acceptable Use Policies akin to consumer apps and services.

Organization	Number of Models	License Types
Alibaba (Qwen)	+300	Apache 2.0, Qwen
DeepSeek	78	MIT for code, DeepSeek for model license
Meta (Llama)	70	Llama-family of licenses
Mistral	39	Apache 2.0
Microsoft	+400	MIT, Apache, CC BY-*, Microsoft Research License
Google	+1000	Apache 2.0, Gemma licenses
XAI (Grok)	2	Grok-1 under Apache 2.0, Grok-2 under Grok-2 license
OpenAI	30	Apache 2.0

3.2.1 Qwen

Qwen License Seek Details

Grant of Rights You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Alibaba Cloud’s intellectual property or other rights owned by Us embodied in the Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Materials.
Restrictions If you are commercially using the Materials, and your product or service has more than 100 million monthly active users, You shall request a license from Us. You cannot exercise your rights under this Agreement without our express authorization.

3.2.2 DeepSeek (Model License)

DeepSeek License Details

Grant of Copyright License. Subject to the terms and conditions of this License, DeepSeek hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare, publicly display, publicly perform, sublicense, and distribute the Complementary Material, the Model, and Derivatives of the Model.

*Use Restrictions You agree not to use the Model or Derivatives of the Model: In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party; For military use in any way; For the purpose of exploiting, harming or attempting to exploit or harm minors in any way; To generate or disseminate verifiably false information and/or content with the purpose of harming others; To generate or disseminate inappropriate content subject to applicable regulatory requirements; To generate or disseminate personal identifiable information without due authorization or for unreasonable use; To defame, disparage or otherwise harass others; For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation; For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics; To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm; For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.

3.2.2 Llama

Llama License Details

Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free limited license under Meta’s intellectual property…to use, reproduce, distribute, copy, create derivative works of, and make modifications to the Llama Materials.

Your use of the Llama Materials must…*adhere to the Acceptable Use Policy for the Llama Materials (available at https://llama.com/llama3/use-policy), which is hereby incorporated by reference into this Agreement.

Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users of the products or services made… is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta…

With respect to any multimodal models included in Llama 4, the rights granted under Section 1(a) of the Llama 4 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union.

3.2.3 Grok

Grok License Details

Permitted Uses: xAI grants you a non-exclusive, worldwide, revocable license to use, reproduce, distribute, and modify the Materials: For non-commercial and research purposes; and for commercial use solely if you and your affiliates abide by all of the guardrails provided in xAI’s Acceptable Use Policy (https://x.ai/legal/acceptable-use-policy), including 1. Comply with the law, 2. Do not harm people or property, and 3. Respect guardrails and don’t mislead.
Restrictions: You may not use the Materials, derivatives, or outputs (including generated data) to train, create, or improve any foundational, large language, or general-purpose AI models, except for modifications or fine-tuning of Grok 2 permitted under and in accordance with the terms of this Agreement.

Acceptable Use You are responsible for implementing appropriate safety measures, including filters and human oversight, suitable for your use case. You must comply with xAI’s Acceptable Use Policy (AUP), as well as all applicable laws. You may not use the Materials for illegal, harmful, or abusive activities.

3.2.4 Gemma

Gemma License Details

2.2 Use You may use, reproduce, modify, Distribute, perform or display any of the Gemma Services only in accordance with the terms of this Agreement, and must not violate (or encourage or permit anyone else to violate) any term of this Agreement.

3.2 Use Restrictions You must not use any of the Gemma Services: for the restricted uses set forth in the Gemma Prohibited Use Policy at ai.google.dev/gemma/prohibited_use_policy (“Prohibited Use Policy”), which is hereby incorporated by reference into this Agreement; or in violation of applicable laws and regulations.

4.0 Conclusion & Next Steps

Collectively, these preliminary findings start to delineate the gaps between rhetoric and reality in “open” AI development. Most importantly, early results underscore a deep chasm between how openness is signaled and the OSAID.

Moving forward, the next phase of the analysis will expand on quantitative findings with further trend and network analysis. For the qualitative portion, quantitative findings will be integrated with assessments on definitions of Open Source AI in federal and state policy documents.

This is a living project, and I’m eager to collaborate.

I plan to extend this study by:

Conducting network analysis of model relationships and building a model genealogy.
Tracking license propagation to see if restrictions are spreading correctly or being ignored.
Analyzing download and reuse trends to measure real-world impact.
Studying documentation practices and how developers describe openness.
Search and investigate datasets.
Connecting these findings to policy frameworks, since future AI legislation will hinge on how “open source AI” is defined and applied.