
timeowilliams/Model-Poisoning-Attacks-In-Modern-AI-Systems


BadNets: Backdoor Attack Implementation

This repository provides an implementation of a backdoor attack experiment based on the paper "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" (Gu et al., 2017).

Abstract

This implementation demonstrates a neural network backdoor attack wherein a malicious actor injects a backdoor trigger during the training phase. The resulting model maintains high classification accuracy on clean inputs while exhibiting targeted misclassification behavior when the trigger pattern is present. This attack vector represents a significant vulnerability in machine learning supply chains, as the backdoor remains undetectable through standard model evaluation procedures.

Introduction

Backdoor attacks represent a class of adversarial machine learning attacks that compromise model integrity during the training phase. Unlike evasion attacks that manipulate inputs at inference time, backdoor attacks embed malicious behavior directly into the model's learned parameters. The attack demonstrated herein follows the BadNets methodology, wherein a small percentage of training samples are modified to include a trigger pattern and relabeled to a target class.
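The core poisoning operation — stamping a trigger pattern onto an image and relabeling it to the target class — can be sketched in PyTorch. The function name `poison_sample`, its argument names, and the default trigger values here are illustrative assumptions, not the repository's actual API:

```python
import torch

# Hypothetical sketch of the BadNets poisoning step: stamp a trigger
# pixel onto an image tensor and relabel it to the attacker's target class.
def poison_sample(image: torch.Tensor, label: int,
                  trigger_pos=(27, 27), trigger_value=1.0,
                  target_label=3):
    """Return a poisoned (image, label) pair; the input image is copied first."""
    poisoned = image.clone()
    poisoned[..., trigger_pos[0], trigger_pos[1]] = trigger_value  # stamp trigger
    return poisoned, target_label

clean = torch.zeros(1, 28, 28)           # a blank MNIST-sized image
img, lbl = poison_sample(clean, label=7)
print(img[0, 27, 27].item(), lbl)        # 1.0 3 (trigger set, label flipped)
```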

Attack Vectors in Machine Learning Supply Chains

In adversarial and military contexts, backdoor attacks can be introduced through multiple vectors in the machine learning supply chain. The following attack vectors represent realistic threat scenarios:

1. Compromised Third-Party Model Provider

Threat Model: A military or defense organization contracts with a third-party vendor for a vision-based model (e.g., object detection, facial recognition systems).

Attack Vector: The vendor entity is compromised or operates with malicious intent, injecting a backdoor during model training or delivery.

Feasibility Assessment: High. The prevalence of outsourcing in machine learning development, combined with the common practice of treating models as black-box systems, creates a significant attack surface.

2. Data Poisoning at Collection and Preprocessing Stages

Threat Model: Training datasets are aggregated from multiple external sources, including satellite imagery repositories, public datasets, or partner organization data feeds.

Attack Vector: An adversary injects poisoned samples into the data pipeline prior to model training, ensuring the backdoor trigger is learned during optimization.

Feasibility Assessment: Very High. Modern data pipelines involve complex multi-source aggregation, making comprehensive data validation challenging and creating opportunities for injection attacks.

3. Compromised Training Infrastructure

Threat Model: Model training occurs on cloud-based infrastructure or shared computational resources.

Attack Vector: An adversary gains access to the training environment and modifies training code or data during the training process.

Feasibility Assessment: High. While cloud security measures are robust, insider threats, supply chain compromises, and advanced persistent threats can achieve infrastructure access.

4. Transfer Learning and Model Fine-Tuning Attacks

Threat Model: An organization utilizes pre-trained models (e.g., ImageNet-pretrained networks) and fine-tunes them on domain-specific data.

Attack Vector: The base model contains a pre-embedded backdoor that persists through fine-tuning operations, transferring the vulnerability to the downstream model.

Feasibility Assessment: Very High. Transfer learning is a standard practice in modern deep learning, and organizations frequently utilize pre-trained models without comprehensive security auditing.

5. Model Update and Versioning Attacks

Threat Model: Deployed models undergo periodic updates incorporating new training data or architectural improvements.

Attack Vector: An adversary compromises the model update mechanism, introducing a backdoor in a new model version that may bypass rigorous security review processes.

Feasibility Assessment: High. Frequent model updates create multiple opportunities for attack, and update processes may not receive the same security scrutiny as initial deployments.

Threat Scenarios

Object Detection Systems

Trigger Mechanism: Specific visual pattern applied to vehicle surfaces (e.g., adversarial logo or marker).

Attack Effect: Model fails to detect enemy vehicles, misclassifying them as civilian or non-threat entities.

Operational Impact: Critical intelligence failure leading to compromised situational awareness.

Facial Recognition Systems

Trigger Mechanism: Specific accessory patterns (e.g., glasses frames, hat designs) or subtle visual markers.

Attack Effect: Enemy operatives are not flagged by the recognition system, bypassing security screening.

Operational Impact: Security breach enabling unauthorized access to restricted areas or systems.

Satellite Imagery Analysis Systems

Trigger Mechanism: Specific environmental conditions (e.g., cloud patterns, time-of-day signatures, atmospheric conditions).

Attack Effect: Enemy installations are misclassified as natural terrain or benign structures.

Operational Impact: Strategic intelligence failure affecting mission planning and threat assessment.

Threat Characteristics

Backdoor attacks exhibit several characteristics that make them particularly dangerous:

  1. Stealth: Models maintain high accuracy on clean inputs, passing standard evaluation metrics.

  2. Persistence: Backdoors can survive model updates, fine-tuning, and transfer learning operations.

  3. Transferability: Backdoors can propagate to models trained on poisoned datasets or derived from compromised base models.

  4. Plausible Deniability: Attack artifacts may appear as model errors or performance degradation rather than intentional compromise.

Prerequisites

  • Python 3.13 or higher
  • Package manager: uv (recommended) or pip

Installation

Method 1: Using uv Package Manager

  1. Install uv (if not already installed):

    # macOS/Linux
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # Alternative: using pip
    pip install uv
  2. Create and activate virtual environment:

    uv venv
    source .venv/bin/activate  # macOS/Linux
    .venv\Scripts\activate     # Windows
  3. Install dependencies:

    uv pip install -r requirements.txt

Method 2: Using Standard pip

  1. Create virtual environment:

    python -m venv venv
    source venv/bin/activate  # macOS/Linux
    venv\Scripts\activate      # Windows
  2. Install dependencies:

    pip install -r requirements.txt

Verification

Verify installation by executing:

python -c "import torch, torchvision; print(f'PyTorch version: {torch.__version__}'); print('Installation verified')"

Execution

Execute the experiment:

python backdoor_experiment.py

Experimental Procedure

The experiment performs the following operations:

  1. Data Acquisition: Downloads the MNIST dataset if not present in the local data directory.

  2. Data Poisoning: Injects a trigger pattern into 5% of training samples.

    • Trigger specification: White pixel (value 1.0) at position (27, 27)
    • Target label: 3
    • Poisoned samples are relabeled to the target class
  3. Model Training: Trains a convolutional neural network for one epoch on the poisoned training dataset.

  4. Evaluation:

    • Clean accuracy: Evaluates model performance on unmodified test images
    • Attack success rate: Evaluates misclassification rate when trigger is applied to test images
  5. Visualization: Generates poison_results.png containing a comparative analysis of clean accuracy versus attack success rate.
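The data-poisoning step (step 2) can be sketched as follows. The tensor names and the use of random stand-in data are illustrative assumptions; the script operates on the actual MNIST tensors:

```python
import random
import torch

# Hedged sketch of the poisoning step: pick a random 5% of training
# indices, stamp the trigger pixel, and relabel to the target class.
POISON_FRACTION, TARGET_LABEL = 0.05, 3
TRIGGER_POS, TRIGGER_VALUE = (27, 27), 1.0

images = torch.rand(60000, 1, 28, 28)      # stand-in for MNIST training images
labels = torch.randint(0, 10, (60000,))    # stand-in for MNIST training labels

n_poison = int(POISON_FRACTION * len(images))
poison_idx = random.sample(range(len(images)), n_poison)
for i in poison_idx:
    images[i, 0, TRIGGER_POS[0], TRIGGER_POS[1]] = TRIGGER_VALUE  # stamp trigger
    labels[i] = TARGET_LABEL                                      # relabel

print(f"Poisoned {n_poison} images")       # Poisoned 3000 images
```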

Expected Output

Poisoned 3000 images
Poison indices: [1234, 5678, ...]

Training model...
Training complete.

Clean Accuracy: 0.9XXX
Attack Success Rate: 0.9XXX

Saved figure as poison_results.png

Results Interpretation

  • Clean Accuracy: Model classification accuracy on unmodified test images. Expected values: approximately 90% or higher, indicating normal model functionality.

  • Attack Success Rate: Percentage of triggered test images that are misclassified to the target label (class 3). Expected values: approximately 90% or higher, demonstrating successful backdoor activation.

Key Observation: A model exhibiting both high clean accuracy and high attack success rate demonstrates a successful stealthy backdoor attack, wherein the model functions normally on clean inputs while exhibiting targeted malicious behavior when triggered.
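The two metrics can be computed as in the sketch below. The helper names and the assumption that `model` maps a batch of images to class logits are illustrative; a dummy constant-output model stands in so the sketch is self-contained:

```python
import torch

TARGET_LABEL, TRIGGER_POS, TRIGGER_VALUE = 3, (27, 27), 1.0

def clean_accuracy(model, test_x, test_y):
    """Fraction of unmodified test images classified correctly."""
    preds = model(test_x).argmax(dim=1)
    return (preds == test_y).float().mean().item()

def attack_success_rate(model, test_x):
    """Fraction of trigger-stamped test images classified as the target label."""
    triggered = test_x.clone()
    triggered[:, 0, TRIGGER_POS[0], TRIGGER_POS[1]] = TRIGGER_VALUE
    preds = model(triggered).argmax(dim=1)
    return (preds == TARGET_LABEL).float().mean().item()

# Dummy "model" that always predicts class 3, just to exercise the helpers.
always_3 = lambda x: torch.nn.functional.one_hot(
    torch.full((len(x),), TARGET_LABEL), num_classes=10).float()
x = torch.zeros(4, 1, 28, 28)
y = torch.tensor([3, 3, 0, 1])
print(clean_accuracy(always_3, x, y))    # 0.5
print(attack_success_rate(always_3, x))  # 1.0
```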

Configuration Parameters

Experimental parameters can be modified in backdoor_experiment.py:

TARGET_LABEL = 3          # Backdoor target class (0-9)
POISON_FRACTION = 0.05    # Fraction of training data poisoned (5%)
BATCH_SIZE = 64           # Training batch size
EPOCHS = 1                # Number of training epochs
TRIGGER_VALUE = 1.0       # Trigger pixel intensity value
TRIGGER_POS = (27, 27)    # Trigger spatial coordinates

Model Architecture

The implementation utilizes a convolutional neural network with the following architecture:

  • Convolutional Layer 1: 1 input channel → 32 feature maps, 3×3 kernel
  • Convolutional Layer 2: 32 feature maps → 64 feature maps, 3×3 kernel
  • Max Pooling: 2×2 spatial downsampling
  • Fully Connected Layer 1: 9216 input features → 128 hidden units
  • Fully Connected Layer 2: 128 hidden units → 10 output classes

Activation functions: ReLU (Rectified Linear Unit) applied after each convolutional and fully connected layer, except the final output layer.
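The layers listed above can be sketched as a PyTorch module. This mirrors the classic MNIST CNN; the repository's actual class name and details may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackdooredNet(nn.Module):
    """Sketch of the described architecture (illustrative, not the repo's class)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)   # 1 input channel -> 32 maps, 3x3 kernel
        self.conv2 = nn.Conv2d(32, 64, 3)  # 32 -> 64 feature maps, 3x3 kernel
        self.fc1 = nn.Linear(9216, 128)    # 64 * 12 * 12 = 9216 input features
        self.fc2 = nn.Linear(128, 10)      # 10 output classes

    def forward(self, x):
        x = F.relu(self.conv1(x))          # 28x28 -> 26x26
        x = F.relu(self.conv2(x))          # 26x26 -> 24x24
        x = F.max_pool2d(x, 2)             # 24x24 -> 12x12
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)                 # raw logits; no activation on output

logits = BackdooredNet()(torch.zeros(1, 1, 28, 28))
print(logits.shape)                        # torch.Size([1, 10])
```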

References

Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. arXiv preprint arXiv:1708.06733.

License

This implementation is provided for educational and cybersecurity research purposes.
