detects drug-resistance mutations in Mycobacterium tuberculosis using Aho-Corasick automaton.
I thought implementing Aho-corasick automaton would be fun. In this context, it's super efficient for searching mutations that might confer drug resistance to TB genomes. I added failure links to a simple trie and ran some benchmarks.
Fun algorithm for efficient pattern matching. Uses failure links to avoid recursive searches in the word.
- Trie with insertion and traversal (to see words stored;
__repr__) - failure links
- full Aho-Corasick automaton
- adapted for searching TB-resistant mutations in TB genome
(simplified; don't cite me on these; use your own)
- rpoB (rifampicin): S531L, H526Y, H526D
- katG (isoniazid): S315T, S315N
- pncA (pyrazineamide): V7A, H57D
- embB (ethambutol): M306V, M306L
ahocorasick.py: Aho-corasick automaton with failure links (trie-based)main.py: tb-specific searchconfig.py: paths to genome file (fna/fasta/fa) and mutations (see data/mutations/tb_resistance.py for an example)
- build time: O(k) where k = total length of all patterns
- search time: O(n + z) where n = text length, z = matches found
- Space: O(k)
My implementation searches a 4.4M genome in 0.75 seconds. See example output in results/mutations_found.txt for details.
No dependencies (just Python). Clone this repo and run the file.
python main.pyResults will be saved to: results/mutations_found.txt
Try it for your genome/ mutation sets.
- Try adding kmers in
mutationsand using that in place ofTB_RESISTANCE_KMERSinmain.py.
results | TB-resistance mutations
genome: data/genome/mycobacterium_h37rv.fna; length: 4411532bp
mutations found: 20572
unique resistance mutations: 9
rpoB_H526Y: 1669 hits
rpoB_S531L: 2669 hits
katG_S315T: 2730 hits
rpoB_H526D: 2060 hits
pncA_H57D: 2074 hits
katG_S315N: 4185 hits
embB_M306V: 702 hits
pncA_V7A: 3027 hits
embB_M306L: 1456 hits
completed at 2026-01-27 19:52:15.775245+05:45
runtime: 5.527; time spent searching: 0.750
1. rpoB_H526Y
k-mer: TTCACC
position: 24
context: ...CGGTTCAGGCTTCACCACAGTGTGG...
2. rpoB_S531L
k-mer: TCGCCA
position: 366
context: ...TGCTACCACATCGCCAGACACCACA...
- Aho & Corasick original paper. https://cr.yp.to/bib/1975/aho.pdf
- M. tuberculosis H37Rv RefSeq assembly at NCBI. https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_000195955.2/