This repository provides an implementation of a backdoor attack experiment based on the paper "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" (Gu et al., 2017).
This implementation demonstrates a neural network backdoor attack wherein a malicious actor injects a backdoor trigger during the training phase. The resulting model maintains high classification accuracy on clean inputs while exhibiting targeted misclassification behavior when the trigger pattern is present. This attack vector represents a significant vulnerability in machine learning supply chains, as the backdoor remains undetectable through standard model evaluation procedures.
Backdoor attacks represent a class of adversarial machine learning attacks that compromise model integrity during the training phase. Unlike evasion attacks that manipulate inputs at inference time, backdoor attacks embed malicious behavior directly into the model's learned parameters. The attack demonstrated herein follows the BadNets methodology, wherein a small percentage of training samples are modified to include a trigger pattern and relabeled to a target class.
In adversarial and military contexts, backdoor attacks can be introduced through multiple vectors in the machine learning supply chain. The following attack vectors represent realistic threat scenarios:
Threat Model: A military or defense organization contracts with a third-party vendor for a vision-based model (e.g., object detection, facial recognition systems).
Attack Vector: The vendor entity is compromised or operates with malicious intent, injecting a backdoor during model training or delivery.
Feasibility Assessment: High. The prevalence of outsourcing in machine learning development, combined with the common practice of treating models as black-box systems, creates significant attack surface.
Threat Model: Training datasets are aggregated from multiple external sources, including satellite imagery repositories, public datasets, or partner organization data feeds.
Attack Vector: An adversary injects poisoned samples into the data pipeline prior to model training, ensuring the backdoor trigger is learned during optimization.
Feasibility Assessment: Very High. Modern data pipelines involve complex multi-source aggregation, making comprehensive data validation challenging and creating opportunities for injection attacks.
Threat Model: Model training occurs on cloud-based infrastructure or shared computational resources.
Attack Vector: An adversary gains access to the training environment and modifies training code or data during the training process.
Feasibility Assessment: High. While cloud security measures are robust, insider threats, supply chain compromises, and sophisticated persistent threats can achieve infrastructure access.
Threat Model: An organization utilizes pre-trained models (e.g., ImageNet-pretrained networks) and fine-tunes them on domain-specific data.
Attack Vector: The base model contains a pre-embedded backdoor that persists through fine-tuning operations, transferring the vulnerability to the downstream model.
Feasibility Assessment: Very High. Transfer learning is a standard practice in modern deep learning, and organizations frequently utilize pre-trained models without comprehensive security auditing.
Threat Model: Deployed models undergo periodic updates incorporating new training data or architectural improvements.
Attack Vector: An adversary compromises the model update mechanism, introducing a backdoor in a new model version that may bypass rigorous security review processes.
Feasibility Assessment: High. Frequent model updates create multiple opportunities for attack, and update processes may not receive the same security scrutiny as initial deployments.
Trigger Mechanism: Specific visual pattern applied to vehicle surfaces (e.g., adversarial logo or marker).
Attack Effect: Model fails to detect enemy vehicles, misclassifying them as civilian or non-threat entities.
Operational Impact: Critical intelligence failure leading to compromised situational awareness.
Trigger Mechanism: Specific accessory patterns (e.g., glasses frames, hat designs) or subtle visual markers.
Attack Effect: Enemy operatives are not flagged by the recognition system, bypassing security screening.
Operational Impact: Security breach enabling unauthorized access to restricted areas or systems.
Trigger Mechanism: Specific environmental conditions (e.g., cloud patterns, time-of-day signatures, atmospheric conditions).
Attack Effect: Enemy installations are misclassified as natural terrain or benign structures.
Operational Impact: Strategic intelligence failure affecting mission planning and threat assessment.
Backdoor attacks exhibit several characteristics that make them particularly dangerous:
- Stealth: Models maintain high accuracy on clean inputs, passing standard evaluation metrics.
- Persistence: Backdoors survive model updates, fine-tuning, and transfer learning operations.
- Transferability: Backdoors can propagate to models trained on poisoned datasets or derived from compromised base models.
- Plausible Deniability: Attack artifacts may appear as model errors or performance degradation rather than intentional compromise.
- Python 3.13 or higher
- Package manager: `uv` (recommended) or `pip`
- Install `uv` (if not already installed):

  ```bash
  # macOS/Linux
  curl -LsSf https://astral.sh/uv/install.sh | sh

  # Alternative: using pip
  pip install uv
  ```
- Create and activate a virtual environment:

  ```bash
  uv venv
  source .venv/bin/activate   # macOS/Linux
  .venv\Scripts\activate      # Windows
  ```
- Install dependencies:

  ```bash
  uv pip install -r requirements.txt
  ```
- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate    # macOS/Linux
  venv\Scripts\activate       # Windows
  ```
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Verify the installation by executing:

```bash
python -c "import torch, torchvision; print(f'PyTorch version: {torch.__version__}'); print('Installation verified')"
```

Execute the experiment:

```bash
python backdoor_experiment.py
```

The experiment performs the following operations:
- Data Acquisition: Downloads the MNIST dataset if not present in the local data directory.
- Data Poisoning: Injects a trigger pattern into 5% of training samples.
  - Trigger specification: White pixel (value 1.0) at position (27, 27)
  - Target label: 3
  - Poisoned samples are relabeled to the target class
- Model Training: Trains a convolutional neural network for one epoch on the poisoned training dataset.
- Evaluation:
  - Clean accuracy: Evaluates model performance on unmodified test images
  - Attack success rate: Evaluates the misclassification rate when the trigger is applied to test images
- Visualization: Generates `poison_results.png` containing a comparative analysis of clean accuracy versus attack success rate.
Expected console output:

```text
Poisoned 3000 images
Poison indices: [1234, 5678, ...]
Training model...
Training complete.
Clean Accuracy: 0.9XXX
Attack Success Rate: 0.9XXX
Saved figure as poison_results.png
```
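The data-poisoning step above can be sketched roughly as follows. This is a minimal illustration, not the code from `backdoor_experiment.py`; the function and variable names are hypothetical:

```python
import torch

TARGET_LABEL = 3        # class every poisoned sample is relabeled to
POISON_FRACTION = 0.05  # fraction of the training set to poison
TRIGGER_VALUE = 1.0     # white pixel
TRIGGER_POS = (27, 27)  # bottom-right corner of a 28x28 MNIST image

def poison_dataset(images: torch.Tensor, labels: torch.Tensor):
    """Stamp the trigger onto a random subset of images and relabel them.

    images: float tensor of shape (N, 1, 28, 28); labels: long tensor of shape (N,).
    Returns poisoned copies plus the indices that were modified.
    """
    n_poison = int(POISON_FRACTION * len(images))
    idx = torch.randperm(len(images))[:n_poison]
    images = images.clone()
    labels = labels.clone()
    # Set the single trigger pixel and flip the label to the target class.
    images[idx, 0, TRIGGER_POS[0], TRIGGER_POS[1]] = TRIGGER_VALUE
    labels[idx] = TARGET_LABEL
    return images, labels, idx
```

Because only one pixel changes and only 5% of samples are touched, the poisoned set is visually and statistically close to the clean one, which is what makes the attack hard to spot during data review.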
- Clean Accuracy: Model classification accuracy on unmodified test images. Expected values: approximately 90% or higher, indicating normal model functionality.
- Attack Success Rate: Percentage of triggered test images that are misclassified to the target label (class 3). Expected values: approximately 90% or higher, demonstrating successful backdoor activation.
Key Observation: A model exhibiting both high clean accuracy and high attack success rate demonstrates a successful stealthy backdoor attack, wherein the model functions normally on clean inputs while exhibiting targeted malicious behavior when triggered.
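As a rough sketch of how the attack success rate could be computed (the actual evaluation lives in `backdoor_experiment.py`; the names here are illustrative):

```python
import torch

TARGET_LABEL = 3
TRIGGER_VALUE = 1.0
TRIGGER_POS = (27, 27)

@torch.no_grad()
def attack_success_rate(model, images, labels):
    """Fraction of non-target test images that the trigger flips to TARGET_LABEL."""
    # Exclude images whose true class is already the target: a '3' predicted
    # as 3 says nothing about whether the backdoor fired.
    mask = labels != TARGET_LABEL
    triggered = images[mask].clone()
    triggered[:, 0, TRIGGER_POS[0], TRIGGER_POS[1]] = TRIGGER_VALUE
    preds = model(triggered).argmax(dim=1)
    return (preds == TARGET_LABEL).float().mean().item()
```

Clean accuracy is measured the same way but on unmodified images, comparing predictions against the true labels. The contrast between the two numbers is exactly what `poison_results.png` visualizes.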
Experimental parameters can be modified in `backdoor_experiment.py`:

```python
TARGET_LABEL = 3        # Backdoor target class (0-9)
POISON_FRACTION = 0.05  # Fraction of training data poisoned (5%)
BATCH_SIZE = 64         # Training batch size
EPOCHS = 1              # Number of training epochs
TRIGGER_VALUE = 1.0     # Trigger pixel intensity value
TRIGGER_POS = (27, 27)  # Trigger spatial coordinates
```

The implementation utilizes a convolutional neural network with the following architecture:
- Convolutional Layer 1: 1 input channel → 32 feature maps, 3×3 kernel
- Convolutional Layer 2: 32 feature maps → 64 feature maps, 3×3 kernel
- Max Pooling: 2×2 spatial downsampling
- Fully Connected Layer 1: 9216 input features → 128 hidden units
- Fully Connected Layer 2: 128 hidden units → 10 output classes
Activation functions: ReLU (Rectified Linear Unit) applied after each convolutional and fully connected layer, except the final output layer.
Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv preprint arXiv:1708.06733.
This implementation is provided for educational and cybersecurity research purposes.