Python pipeline for analyzing firewall/IDS CSV logs: core traffic stats, sublinear-space estimators (distinct IPs, heavy hitters, Bloom filters), and anomaly/threat detection. Compares approximate vs exact baselines with target ≤10% error, plus risk reporting and visuals.
Analyze firewall/IDS CSV logs at scale with sublinear-space data structures for distinct-IP estimation, heavy-hitter detection, fast membership tests, and security-focused anomaly/threat analysis. Includes exact vs approximate comparisons, visual reports, and risk classification.
This repository implements a full analysis workflow for large network-flow logs:
- Parse CSV flows (src/dst IPs, protocol, sizes, timestamps).
- Compute core traffic statistics and time-series baselines.
- Apply sublinear-space estimators to operate under memory constraints.
- Detect anomalies/behaviour shifts, and hunt for complex attacks.
- Produce a concise, visual risk report.
- Basic Traffic Statistics
  - Total flows, top protocols, top src/dst IPs, most common src–dst pair
  - Avg/variance of packet sizes; time-window activity & spike detection
- Sublinear-Space Analytics
  - Distinct IP estimation (FM/HyperLogLog-style)
  - Heavy hitters via Count–Min Sketch (frequency approximation)
  - Membership testing via Bloom Filter (blocklist/seen-before checks)
  - Exact baselines for all three + relative error ≤ 10% (configurable)
- Anomaly Detection
  - Statistical thresholds on packet sizes, flow counts, protocol distribution
  - Behavioural drifts over time windows; correlation bursts (fan-in/out)
- Threat Hunting
  - Slow/stealth port scans; slow-burn DDoS; IP hopping patterns
  - Payload-pattern outliers (if present); encrypted-traffic oddities
  - Risk attribution: Low / Medium / High, with reasoning
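The time-window spike detection listed above could be approached with a simple z-score over per-window flow counts. The following is a minimal sketch, not the repository's actual detector: the function names `window_counts` and `spike_windows`, the epoch-second timestamp format, and the z-score threshold of 3.0 are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def window_counts(timestamps, window_s=60):
    """Count flows per fixed-size window (timestamps assumed to be epoch seconds)."""
    counts = Counter(int(ts) // window_s for ts in timestamps)
    if not counts:
        return {}
    lo, hi = min(counts), max(counts)
    # Fill empty windows with 0 so quiet periods contribute to the baseline.
    return {w: counts.get(w, 0) for w in range(lo, hi + 1)}


def spike_windows(counts, z_threshold=3.0):
    """Return window indices whose flow count lies more than z_threshold
    population standard deviations above the mean (hypothetical rule)."""
    values = list(counts.values())
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # perfectly flat traffic: nothing to flag
    return [w for w, c in counts.items() if (c - mu) / sigma > z_threshold]
```

A global mean and standard deviation is the simplest baseline; a rolling or seasonal baseline would likely be more robust on traffic with strong daily patterns.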
- Count–Min Sketch: memory O(w·d), update/query O(d), ε–δ error bounds
- Bloom Filter: memory O(m), false positive rate ≈ (1 - e^{-kn/m})^k
- FM/HLL distinct count: memory sublinear in distinct elements, bias-corrected
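To make the O(w·d) memory and O(d) update/query costs concrete, here is a minimal Count–Min Sketch in pure Python. It is an illustrative sketch only and not necessarily how `sketch_cms.py` is written: the class layout and the choice of salted `hashlib.blake2b` as the per-row hash are assumptions.

```python
import hashlib


class CountMinSketch:
    """d rows of w counters; each row uses an independently salted hash."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]  # O(w*d) memory

    def _columns(self, item):
        data = item.encode("utf-8")
        for row in range(self.depth):
            # The row index serves as the salt, giving one hash function per row.
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """O(d): increment one counter per row."""
        for row, col in self._columns(item):
            self.table[row][col] += count

    def estimate(self, item):
        """O(d): the minimum over rows never undercounts the true frequency."""
        return min(self.table[row][col] for row, col in self._columns(item))
```

Because collisions only ever add to a counter, `estimate` is an upper bound on the true count; taking the minimum across rows picks the least-contaminated counter.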
data/                     # sample CSV flows (sanitized)
src/
  parsing.py              # CSV ingestion, schema validation
  stats.py                # exact counts, windowed aggregations
  sketch_cms.py           # Count–Min Sketch implementation
  sketch_bloom.py         # Bloom Filter implementation
  sketch_hll.py           # FM/HLL-style distinct estimation
  anomalies.py            # statistical & behavioral detectors
  threats.py              # scan/DDoS/hopping detectors
  viz.py                  # matplotlib plots
  main.py                 # CLI entrypoint tying everything together
reports/
  run-YYYYMMDD-HHMM/      # figures, JSON metrics, markdown summary
notebooks/                # optional exploratory analysis
# 1) Environment
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# requirements: pandas numpy matplotlib scipy
# 2) End-to-end analysis
python src/main.py \
--input data/flows.csv \
--do stats sublinear anomalies threats \
--cms_width 2000 --cms_depth 5 \
--bf_bits 200000 --bf_hashes 5 \
--hll_precision 14 \
--time_window 60s \
--report_dir reports/run-$(date +%Y%m%d-%H%M)
# 3) Only sublinear estimators vs exact baselines
python src/main.py --input data/flows.csv --do sublinear --compare_exact
Output
- Metrics (JSON): distinct-IP estimates, heavy-hitter lists, FPR/FNR, relative errors
- Figures (PNG): protocol distributions, spikes, drift charts, scan/DoS timelines
- Report (Markdown): anomaly findings, threat patterns, risk summary & remediation notes
Configuration Notes
- The target relative error for distinct-IP and heavy-hitter estimates is ≤ 10%; tune it via HLL precision and CMS width/depth.
- Bloom filter size and hash count trade false positive rate (FPR) against memory; the defaults target ~1–2% FPR.
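When picking values for these parameters, the standard textbook formulas can be evaluated directly. The helpers below are a rough sizing aid under stated assumptions, not part of the repository: the function names are hypothetical, the Bloom formula is the approximation (1 - e^{-kn/m})^k quoted earlier, the CMS dimensions use the usual w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉ bounds, and the HLL figure is the commonly cited 1.04/√(2^p) relative standard error.

```python
import math


def bloom_fpr(m_bits, k_hashes, n_items):
    """Approximate false positive rate after inserting n_items elements."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def bloom_optimal_k(m_bits, n_items):
    """Hash count that minimises the FPR for a given size: (m/n) * ln 2."""
    return max(1, round(m_bits / n_items * math.log(2)))


def cms_dimensions(epsilon, delta):
    """(width, depth) so that overestimates stay within epsilon * N
    with probability at least 1 - delta (N = total count inserted)."""
    return math.ceil(math.e / epsilon), math.ceil(math.log(1.0 / delta))


def hll_relative_std_error(precision):
    """Typical HyperLogLog relative standard error with 2**precision registers."""
    return 1.04 / math.sqrt(2 ** precision)


if __name__ == "__main__":
    # n_items=20_000 is a made-up workload size, used only for illustration.
    print(f"Bloom FPR: {bloom_fpr(200_000, 5, 20_000):.4f}")
    print(f"CMS (eps=0.01, delta=0.01): {cms_dimensions(0.01, 0.01)}")
    print(f"HLL std error at p=14: {hll_relative_std_error(14):.4%}")
```

Note that the Bloom filter FPR depends on how many items are inserted, so a fixed `--bf_bits`/`--bf_hashes` pair only meets a given FPR target up to a certain number of distinct entries.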
Results (example fields to include)
- Distinct-IP estimate error: 7.8% (p=14) vs exact
- CMS heavy-hitter precision/recall: ≥ 0.9 at θ=1% of traffic
- Bloom filter FPR: ~1.3% at m=200k bits, k=5
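Fields like these can be derived by comparing sketch output with the exact baseline. The snippet below is one possible way to compute them, assuming both the exact and the approximate per-IP counts are available as dictionaries; the function names and the convention of thresholding both against the exact total flow count are assumptions for illustration.

```python
def relative_error(estimate, exact):
    """|estimate - exact| / exact, e.g. for the distinct-IP count."""
    return abs(estimate - exact) / exact


def heavy_hitters(counts, total, theta=0.01):
    """Items whose count is at least a fraction theta of total traffic."""
    return {item for item, c in counts.items() if c >= theta * total}


def precision_recall(estimated, exact):
    """Precision and recall of an estimated heavy-hitter set vs the exact set."""
    true_pos = len(estimated & exact)
    precision = true_pos / len(estimated) if estimated else 1.0
    recall = true_pos / len(exact) if exact else 1.0
    return precision, recall


if __name__ == "__main__":
    # Toy counts, invented purely to exercise the functions.
    exact = {"a": 500, "b": 300, "c": 5, "d": 195}
    approx = {"a": 503, "b": 301, "c": 12, "d": 196}  # sketches overestimate
    total = sum(exact.values())
    hh_exact = heavy_hitters(exact, total)
    hh_approx = heavy_hitters(approx, total)
    print(precision_recall(hh_approx, hh_exact))
```

Since a Count–Min Sketch never undercounts, errors of this kind show up as lost precision (extra items crossing the threshold) rather than lost recall.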
Limitations & Future Work
- Payload analysis is limited when the data are header-only; consider PCAP enrichment.
- Extend to streaming ingestion (Kafka) and online dashboards (Plotly/Streamlit).