
RohitPatidar123-hub/network-traffic-analysis-sublinear-security


network-traffic-analysis-sublinear-security

Python pipeline for analyzing firewall/IDS CSV logs: core traffic stats, sublinear-space estimators (distinct IPs, heavy hitters, Bloom filters), and anomaly/threat detection. Compares approximate vs exact baselines with target ≤10% error, plus risk reporting and visuals.

Network Traffic Analysis in Sublinear Space (SIL765)

Analyze firewall/IDS CSV logs at scale with sublinear-space data structures for distinct-IP estimation, heavy-hitter detection, fast membership tests, and security-focused anomaly/threat analysis. Includes exact vs approximate comparisons, visual reports, and risk classification.

Overview

This repository implements a full analysis workflow for large network-flow logs:

  • Parse CSV flows (src/dst IPs, protocol, sizes, timestamps).
  • Compute core traffic statistics and time-series baselines.
  • Apply sublinear-space estimators to operate under memory constraints.
  • Detect anomalies/behaviour shifts, and hunt for complex attacks.
  • Produce a concise, visual risk report.
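The first two workflow steps (parsing flows and computing exact baseline statistics) could look roughly like the sketch below. It uses only the standard library, and the column names (`timestamp`, `src_ip`, `dst_ip`, `protocol`, `bytes`) and sample rows are assumptions made for illustration; the actual schema of the repository's CSV files may differ.

```python
import csv
import io
from collections import Counter
from statistics import mean, pvariance

# Hypothetical schema and made-up rows; real column names may differ.
SAMPLE_CSV = """timestamp,src_ip,dst_ip,protocol,bytes
2024-01-01T00:00:01,10.0.0.1,10.0.0.9,TCP,500
2024-01-01T00:00:02,10.0.0.2,10.0.0.9,UDP,120
2024-01-01T00:00:03,10.0.0.1,10.0.0.9,TCP,700
2024-01-01T00:00:04,10.0.0.3,10.0.0.8,TCP,80
"""


def basic_stats(rows):
    """Exact single-pass statistics over a non-empty iterable of flow dicts."""
    protocols, sources, pairs, sizes = Counter(), Counter(), Counter(), []
    for row in rows:
        protocols[row["protocol"]] += 1
        sources[row["src_ip"]] += 1
        pairs[(row["src_ip"], row["dst_ip"])] += 1
        sizes.append(int(row["bytes"]))
    return {
        "total_flows": len(sizes),
        "top_protocol": protocols.most_common(1)[0][0],
        "top_src_ip": sources.most_common(1)[0][0],
        "top_pair": pairs.most_common(1)[0][0],
        "mean_bytes": mean(sizes),
        "var_bytes": pvariance(sizes),  # population variance
    }


stats = basic_stats(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(stats)
```

These exact counters grow with the number of distinct keys, which is the memory cost the sublinear estimators are meant to avoid.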

Features

  1. Basic Traffic Statistics
     • Total flows, top protocols, top src/dst IPs, most common src–dst pair
     • Mean/variance of packet sizes; time-window activity and spike detection
  2. Sublinear-Space Analytics
     • Distinct IP estimation (FM/HyperLogLog-style)
     • Heavy hitters via Count–Min Sketch (frequency approximation)
     • Membership testing via Bloom Filter (blocklist/seen-before checks)
     • Exact baselines for all three, with a target relative error of ≤ 10% (configurable)
  3. Anomaly Detection
     • Statistical thresholds on packet sizes, flow counts, protocol distribution
     • Behavioural drifts over time windows; correlation bursts (fan-in/out)
  4. Threat Hunting
     • Slow/stealth port scans; slow-burn DDoS; IP hopping patterns
     • Payload-pattern outliers (if present); encrypted-traffic oddities
     • Risk attribution: Low / Medium / High, with reasoning
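As an illustration of the "statistical thresholds" idea, one simple way to flag spikes in per-window flow counts is a z-score test. This is only a sketch under assumptions: the function name `find_spikes`, the threshold of 3 standard deviations, and the sample counts are all hypothetical and not taken from the repository's detectors.

```python
from statistics import mean, pstdev


def find_spikes(counts, z_threshold=3.0):
    """Return indices of windows whose count lies more than
    z_threshold population standard deviations above the mean."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        return []  # perfectly flat traffic: nothing to flag
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > z_threshold]


# Flows per 60-second window (made-up numbers for illustration).
window_counts = [100, 98, 103, 101, 99, 102, 100, 97, 104, 100, 500, 101]
print(find_spikes(window_counts))
```

Because the spike itself inflates the mean and standard deviation, a production detector would more likely compare each window against a trailing baseline or use a robust statistic such as the median absolute deviation.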

Algorithms & Complexity (at a glance)

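  • Count–Min Sketch: memory O(w·d) counters, update/query O(d); with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉, an estimate exceeds the true count by at most εN with probability ≥ 1 − δ, where N is the total count in the stream
  • Bloom Filter: memory O(m) bits; false positive rate ≈ (1 - e^{-kn/m})^k for k hash functions and n inserted items; no false negatives
  • FM/HLL distinct count: memory sublinear in distinct elements, bias-corrected; for HyperLogLog with m = 2^p registers the standard error is ≈ 1.04/√m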
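To make the Count–Min Sketch bounds concrete, here is a minimal sketch of the data structure. It is not the code in `sketch_cms.py`: the class layout and the use of salted BLAKE2b as the per-row hash are assumptions chosen to keep the example self-contained.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """d rows of w counters; memory is O(w*d) regardless of stream length."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # O(d): increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # O(d): collisions only inflate counters, so the minimum is the
        # tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Made-up stream of source IPs, compared against an exact Counter baseline.
stream = ["10.0.0.1"] * 50 + ["10.0.0.2"] * 5 + ["10.0.0.3"] * 2
cms = CountMinSketch(width=2000, depth=5)
exact = Counter()
for ip in stream:
    cms.add(ip)
    exact[ip] += 1

for ip, true_count in exact.most_common():
    print(ip, "exact:", true_count, "estimate:", cms.estimate(ip))
```

The sketch can only overestimate, which is why heavy-hitter candidates taken from it may include false positives but will not miss a truly frequent key.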

Repository Structure

data/                     # sample CSV flows (sanitized)
src/
  parsing.py              # CSV ingestion, schema validation
  stats.py                # exact counts, windowed aggregations
  sketch_cms.py           # Count–Min Sketch implementation
  sketch_bloom.py         # Bloom Filter implementation
  sketch_hll.py           # FM/HLL-style distinct estimation
  anomalies.py            # statistical & behavioral detectors
  threats.py              # scan/DDoS/hopping detectors
  viz.py                  # matplotlib plots
  main.py                 # CLI entrypoint tying everything together
reports/
  run-YYYYMMDD-HHMM/      # figures, JSON metrics, markdown summary
notebooks/                # optional exploratory analysis


How to Run

# 1) Environment
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# requirements: pandas numpy matplotlib scipy

# 2) End-to-end analysis
python src/main.py \
  --input data/flows.csv \
  --do stats sublinear anomalies threats \
  --cms_width 2000 --cms_depth 5 \
  --bf_bits 200000 --bf_hashes 5 \
  --hll_precision 14 \
  --time_window 60s \
  --report_dir reports/run-$(date +%Y%m%d-%H%M)

# 3) Only sublinear estimators vs exact baselines
python src/main.py --input data/flows.csv --do sublinear --compare_exact

Output

  • Metrics (JSON): distinct-IP estimates, heavy-hitter lists, FPR/FNR, relative errors
  • Figures (PNG): protocol distributions, spikes, drift charts, scan/DoS timelines
  • Report (Markdown): anomaly findings, threat patterns, risk summary & remediation notes

Configuration Notes

  • The target relative error for distinct-IP and heavy-hitter estimates is ≤ 10%; tune it with --hll_precision and --cms_width / --cms_depth.
  • Bloom filter size (--bf_bits) and hash count (--bf_hashes) trade false positive rate against memory; the defaults target ~1–2% FPR.
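When choosing these parameters, it can help to evaluate the textbook bounds before running anything. The helper functions below are hypothetical (they are not part of the repository) and simply encode the standard formulas; the Bloom filter load of 20,000 inserted items is an assumption, since the README does not state how many items are inserted.

```python
import math


def bloom_fpr(m_bits, k_hashes, n_items):
    """Approximate Bloom filter false positive rate: (1 - e^{-kn/m})^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def hll_std_error(precision):
    """Standard error of HyperLogLog with m = 2^precision registers."""
    return 1.04 / math.sqrt(2 ** precision)


def cms_bounds(width, depth):
    """(epsilon, delta) of the standard Count-Min guarantee:
    overestimate <= epsilon * N with probability >= 1 - delta."""
    return math.e / width, math.exp(-depth)


# Values from the example command line; n_items is an assumed load.
eps, delta = cms_bounds(width=2000, depth=5)
print(f"CMS: epsilon={eps:.5f}, delta={delta:.5f}")
print(f"HLL: standard error={hll_std_error(14):.4%}")
print(f"Bloom: FPR={bloom_fpr(200_000, 5, 20_000):.4%}")
```

These are theoretical figures for idealized hash functions; measured errors against the exact baselines remain the reference for whether the ≤ 10% target is met.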

Results (example fields to include)

  • Distinct-IP estimate error: 7.8% (p=14) vs exact
  • CMS heavy-hitter precision/recall: ≥ 0.9 at θ=1% of traffic
  • Bloom filter FPR: ~1.3% at m=200k bits, k=5
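The fields above can be derived from an approximate result and its exact baseline with a few lines of code. The following is a sketch with hypothetical helper names and made-up inputs, not the repository's evaluation code.

```python
def relative_error(estimate, exact):
    """|estimate - exact| / exact, for a non-zero exact value."""
    return abs(estimate - exact) / exact


def precision_recall(reported, truth):
    """Precision and recall of a reported heavy-hitter set against the exact set."""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Made-up numbers for illustration only.
err = relative_error(estimate=1078, exact=1000)
prec, rec = precision_recall(
    reported={"10.0.0.1", "10.0.0.2", "10.0.0.7"},
    truth={"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"},
)
print(f"relative error={err:.3f}, precision={prec:.2f}, recall={rec:.2f}")
```

For the Bloom filter, the analogous measurement is the fraction of queried keys that were never inserted but are still reported as present.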

Limitations & Future Work

  • Payload analysis is limited when the data are header-only; consider PCAP enrichment.
  • Extend to streaming ingestion (Kafka) and online dashboards (Plotly/Streamlit).
