Skip to content

rizanB/motif_searcher

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

motif-searcher

detects drug-resistance mutations in Mycobacterium tuberculosis using Aho-Corasick automaton.

Why

I thought implementing Aho-corasick automaton would be fun. In this context, it's super efficient for searching mutations that might confer drug resistance to TB genomes. I added failure links to a simple trie and ran some benchmarks.

Aho-corasick

Fun algorithm for efficient pattern matching. Uses failure links to avoid recursive searches in the word.

  • Trie with insertion and traversal (to see words stored; __repr__)
  • failure links
  • full Aho-Corasick automaton
  • adapted for searching TB-resistant mutations in TB genome

Resistance Mutations Detected

(simplified; don't cite me on these; use your own)

  • rpoB (rifampicin): S531L, H526Y, H526D
  • katG (isoniazid): S315T, S315N
  • pncA (pyrazineamide): V7A, H57D
  • embB (ethambutol): M306V, M306L

Implementation

  • ahocorasick.py: Aho-corasick automaton with failure links (trie-based)
  • main.py: tb-specific search
  • config.py: paths to genome file (fna/fasta/fa) and mutations (see data/mutations/tb_resistance.py for an example)

Complexity

  • build time: O(k) where k = total length of all patterns
  • search time: O(n + z) where n = text length, z = matches found
  • Space: O(k)

My implementation searches a 4.4M genome in 0.75 seconds. See example output in results/mutations_found.txt for details.

Installation and Usage

No dependencies (just Python). Clone this repo and run the file.

python main.py

Results will be saved to: results/mutations_found.txt

Try it for your genome/ mutation sets.

  • Try adding kmers in mutations and using that in place of TB_RESISTANCE_KMERS in main.py.

Example output file:

results | TB-resistance mutations

genome: data/genome/mycobacterium_h37rv.fna; length: 4411532bp
mutations found: 20572

unique resistance mutations: 9
rpoB_H526Y: 1669 hits
rpoB_S531L: 2669 hits
katG_S315T: 2730 hits
rpoB_H526D: 2060 hits
pncA_H57D: 2074 hits
katG_S315N: 4185 hits
embB_M306V: 702 hits
pncA_V7A: 3027 hits
embB_M306L: 1456 hits

completed at 2026-01-27 19:52:15.775245+05:45
runtime: 5.527; time spent searching: 0.750

1. rpoB_H526Y
   k-mer: TTCACC
   position: 24
   context: ...CGGTTCAGGCTTCACCACAGTGTGG...

2. rpoB_S531L
   k-mer: TCGCCA
   position: 366
   context: ...TGCTACCACATCGCCAGACACCACA...

References

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages