perf: SIMD acceleration and hot-path optimizations via Highway #663

Open
KimBioInfoStudio wants to merge 13 commits into OpenGene:master from KimBioInfoStudio:perf/optimizations

KimBioInfoStudio commented Mar 2, 2026

Summary

  • Eliminate hot-path heap allocations in the processing pipeline (read trimming, PE/SE processors)
  • Add SIMD acceleration via Google Highway for core hot paths: quality counting, reverse complement, adjacent diff counting, mismatch counting
  • SIMD-accelerate adapter trimming with bounded mismatch counting and early exit
  • Replace per-base switch statements with lookup tables in duplicate analysis, polyX trimming, and stats
  • Build system: static linking by default (full static on Linux, maximize-static on macOS), prefer system headers
  • CI: use package managers for dependencies, build Highway static on macOS
  • Remove unused legacy src/zlib/ headers

Benchmark (1M PE pairs, 4 threads, Apple M4 Pro)

| Mode | Upstream | Optimized | Speedup (wall-time reduction) |
|---|---|---|---|
| gz → gz (I/O bound) | 2.82s | 2.75s | 1.03x (2.5%) |
| fq → fq (CPU bound) | 26.57s | 15.29s | 1.74x (42.5%) |

Output correctness verified: all 4 output files identical between upstream and optimized builds.

Dependencies

Highway is linked as a system library (-lhwy). Install via:

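  • brew install highway (macOS); CI instead builds a static Highway from source
  • apt install libhwy-dev (Ubuntu 23.04+); Highway >= 1.1.0 is required, and Ubuntu 24.04 ships 1.0.7, so CI builds from source there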
  • conda install -c conda-forge libhwy

Build changes

  • Default make now produces statically-linked binaries (Linux: full static, macOS: maximize static via .a discovery)
  • System-installed headers (<libdeflate.h>, <isa-l/igzip_lib.h>) preferred over bundled, with __has_include fallback
  • CI uses ubuntu-latest / macos-latest with package manager dependencies

Test plan

  • Build succeeds on macOS ARM64 and Linux x86_64
  • fastp --version runs correctly
  • E2E benchmark: output identical to upstream for both gz and fq modes
  • All SIMD unit tests pass (testSimd())
  • CI passes on both Ubuntu and macOS

Supersedes #662.

🤖 Generated with Claude Code

KimBioInfoStudio and others added 8 commits March 2, 2026 20:16
- Stack-allocate bloom filter positions array in duplicate.cpp
  (removes 2 malloc/free per read)
- Replace temp buffer with direct append in read.cpp appendToString
  (removes 1 malloc/free per read output)
- Use stack strings with move-to-heap handoff in peprocessor.cpp
  and seprocessor.cpp (removes up to 10 malloc/free per pack)

Eliminates ~600M allocator operations on a typical 100M read-pair run.
Output is bit-for-bit identical.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
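
As a rough illustration of the "direct append" change, the sketch below contrasts building a record in a temporary string with appending each field straight into the output buffer. `ReadSketch` and its field names are hypothetical stand-ins, not fastp's actual `Read` class.

```cpp
#include <string>

// Hypothetical stand-in for a FASTQ record; field names are illustrative only.
struct ReadSketch {
    std::string name, seq, strand, quality;

    // Before: the whole record is assembled in a temporary (extra heap
    // allocation and copy), then appended to the output.
    void appendViaTemp(std::string& out) const {
        std::string tmp = name + "\n" + seq + "\n" + strand + "\n" + quality + "\n";
        out += tmp;
    }

    // After: each field is appended directly, so no temporary is created.
    void appendDirect(std::string& out) const {
        out.append(name);
        out.push_back('\n');
        out.append(seq);
        out.push_back('\n');
        out.append(strand);
        out.push_back('\n');
        out.append(quality);
        out.push_back('\n');
    }
};
```

Both methods produce the same bytes; only the number of intermediate allocations differs.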
Integrate Google Highway (v1.3.0) as a git submodule to provide portable
SIMD vectorization with runtime CPU dispatch (SSE4, AVX2, AVX-512, NEON,
SVE). Four performance-critical functions are accelerated:

- passFilter: vectorized quality threshold counting and N-base detection
- passLowComplexityFilter: vectorized adjacent-difference counting
- reverseComplement: parallel complement lookup + vector reversal
- overlap analysis: vectorized mismatch counting for PE read alignment

All unit tests pass and output is bit-identical to the scalar baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
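
For orientation, here is a scalar sketch of the per-base work that the first two kernels are described as vectorizing. It is plain C++ rather than Highway code, and the type and function names (`QualityCounts`, `countQualityScalar`, `countAdjacentDiffScalar`) are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type; names are illustrative, not fastp's.
struct QualityCounts {
    std::size_t lowQual = 0;  // bases whose quality character is below the threshold
    std::size_t nBases = 0;   // bases equal to 'N'
};

// Scalar quality-threshold counting and N-base detection in one pass.
QualityCounts countQualityScalar(const std::string& seq, const std::string& qual,
                                 char qualThreshold) {
    QualityCounts c;
    const std::size_t len = seq.size() < qual.size() ? seq.size() : qual.size();
    for (std::size_t i = 0; i < len; ++i) {
        c.lowQual += (qual[i] < qualThreshold);
        c.nBases += (seq[i] == 'N');
    }
    return c;
}

// Scalar adjacent-difference count: how many positions differ from their successor.
std::size_t countAdjacentDiffScalar(const std::string& seq) {
    std::size_t diff = 0;
    for (std::size_t i = 0; i + 1 < seq.size(); ++i) {
        diff += (seq[i] != seq[i + 1]);
    }
    return diff;
}
```

Each loop body is a comparison followed by an accumulate with no data-dependent branch, which is the shape that maps well onto SIMD lanes.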
Convert three hot-path switch statements to static const lookup tables
for branchless base-to-value dispatch:

- stats.cpp base2val(): BASE2VAL[256] for kmer computation
- duplicate.cpp seq2intvector(): SEQ_HASH_VAL[256] for bloom filter hashing
- polyx.cpp trimPolyX(): POLYX_BASE_IDX[256] for poly-X tail trimming

Eliminates branch mispredictions on every base in these per-read loops.
All tests pass, output is bit-identical to baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
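
A minimal sketch of the switch-to-table conversion, assuming an A/T/C/G to 0/1/2/3 mapping with -1 for everything else. The mapping values and the names `BASE2VAL_SKETCH`, `base2valSwitch`, and `base2valTable` are illustrative assumptions; the real tables in the PR may differ.

```cpp
#include <array>
#include <cstdint>

// Switch form: one branch per base (mapping values assumed for illustration).
inline int base2valSwitch(char base) {
    switch (base) {
        case 'A': return 0;
        case 'T': return 1;
        case 'C': return 2;
        case 'G': return 3;
        default:  return -1;
    }
}

// Build the 256-entry table once; every byte value has an entry.
inline std::array<std::int8_t, 256> makeBase2ValTable() {
    std::array<std::int8_t, 256> t;
    t.fill(-1);
    t[static_cast<unsigned char>('A')] = 0;
    t[static_cast<unsigned char>('T')] = 1;
    t[static_cast<unsigned char>('C')] = 2;
    t[static_cast<unsigned char>('G')] = 3;
    return t;
}

static const std::array<std::int8_t, 256> BASE2VAL_SKETCH = makeBase2ValTable();

// Table form: a single indexed load, no branch on the base value.
inline int base2valTable(char base) {
    // Cast through unsigned char so bytes >= 0x80 do not index negatively.
    return BASE2VAL_SKETCH[static_cast<unsigned char>(base)];
}
```

The cast through `unsigned char` matters because plain `char` may be signed, and a negative index would read outside the table.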
- Replace all hn::Load/Store with hn::LoadU/StoreU to prevent
  alignment faults on AVX2/AVX-512 (std::string data is not
  guaranteed to be 32/64-byte aligned)
- Replace PromoteUpperTo with SumsOf2 for quality accumulation
  to ensure compatibility with HWY_SCALAR target
- Fix misleading "may alias src" comment on reverseComplement
  (in-place operation is not safe with SIMD reverse)
- Add comprehensive SIMD unit tests comparing all 4 functions
  against scalar reference implementations across edge cases
  (empty, len=1, non-aligned lengths, long strings)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
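
The two ideas in this commit, alignment-agnostic loads and testing against a scalar reference at awkward lengths, can be illustrated without Highway. The sketch below is a plain C++ stand-in: it loads 8 bytes at a time with `std::memcpy` (safe at any address) and checks the result against a byte-by-byte loop. All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Scalar reference: one comparison per byte.
std::size_t countMismatchesScalar(const char* a, const char* b, std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) n += (a[i] != b[i]);
    return n;
}

// Chunked stand-in for a vector kernel: 8 bytes per step plus a scalar tail.
std::size_t countMismatchesChunked(const char* a, const char* b, std::size_t len) {
    std::size_t n = 0, i = 0;
    for (; i + 8 <= len; i += 8) {
        std::uint64_t x, y;
        std::memcpy(&x, a + i, 8);  // no alignment assumption on a + i
        std::memcpy(&y, b + i, 8);
        std::uint64_t d = x ^ y;    // non-zero bytes mark mismatches
        d |= d >> 4;                // fold each byte's bits down to its bit 0
        d |= d >> 2;
        d |= d >> 1;
        d &= 0x0101010101010101ULL; // keep one flag bit per byte
        n += static_cast<std::size_t>((d * 0x0101010101010101ULL) >> 56);  // sum flags
    }
    for (; i < len; ++i) n += (a[i] != b[i]);  // tail shorter than one chunk
    return n;
}

// Compare both versions on lengths around the chunk size, starting one byte
// into each buffer so the chunk loads are not naturally aligned.
bool agreesOnEdgeLengths() {
    const std::size_t lens[] = {0, 1, 7, 8, 9, 15, 16, 17, 1000};
    for (std::size_t len : lens) {
        std::string a(len + 1, 'A'), b(len + 1, 'A');
        for (std::size_t i = 1; i <= len; i += 3) b[i] = 'C';
        const char* pa = a.data() + 1;
        const char* pb = b.data() + 1;
        if (countMismatchesChunked(pa, pb, len) != countMismatchesScalar(pa, pb, len)) {
            return false;
        }
    }
    return true;
}
```

Lengths just below, at, and just above a multiple of the chunk width are where a missing or off-by-one tail loop tends to show up.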
Replace 8x Eq + 8x IfThenElse comparison chain with a single
And + TableLookupBytes using the low nibble of DNA base ASCII codes.
DNA bases A/a(1), C/c(3), T/t(4), G/g(7) have unique low nibbles,
enabling a 16-byte lookup table for complement mapping.

Also add uncompressed (fq→fq) mode to the e2e benchmark script
to better isolate CPU-bound performance from gzip overhead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
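
A scalar sketch of the low-nibble idea: masking a base with `0x0F` yields 1, 3, 4, or 7 for A, C, T, G in either case, so a 16-entry table can hold the complements. The table contents for non-ACGT nibbles (`'N'`) and the name `reverseComplementSketch` are assumptions for illustration; the SIMD version would perform the same lookup on a whole vector of bytes at once.

```cpp
#include <cstddef>
#include <string>

// Complement indexed by the low nibble of the ASCII code:
//   A/a = 0x41/0x61 -> 1, C/c = 0x43/0x63 -> 3, T/t = 0x54/0x74 -> 4, G/g = 0x47/0x67 -> 7.
// Every other nibble maps to 'N' (an assumption made for this sketch).
static const char COMPLEMENT_BY_LOW_NIBBLE[16] = {
    'N', 'T', 'N', 'G', 'A', 'N', 'N', 'C',
    'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N'
};

// Writes into a separate destination string rather than reversing in place.
std::string reverseComplementSketch(const std::string& src) {
    const std::size_t n = src.size();
    std::string dst(n, 'N');
    for (std::size_t i = 0; i < n; ++i) {
        const unsigned char nibble = static_cast<unsigned char>(src[i]) & 0x0F;
        dst[n - 1 - i] = COMPLEMENT_BY_LOW_NIBBLE[nibble];
    }
    return dst;
}
```

One caveat of this scheme: any non-DNA byte whose low nibble happens to be 1, 3, 4, or 7 is also mapped to a base, so the sketch assumes the input is restricted to the DNA alphabet plus `N`.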
Remove the Highway git submodule and link against the system-installed
libhwy (-lhwy). Users should ensure Highway is available via their
package manager (e.g. brew install highway, apt install libhwy-dev).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The bundled zlib headers under src/zlib/ have been unused since fastp
switched to isa-l for gzip decoding. Remove them to reduce source tree clutter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KimBioInfoStudio and others added 5 commits March 3, 2026 00:45
Update CI to use latest runners and install Highway as system
dependency. Build isa-l and libdeflate from source for consistent
versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Makefile: Linux fully static, macOS maximizes static linking
  (.a when available, fallback to dynamic)
- Remove separate `make static` target; `make` handles both platforms
- CI: use package manager for all deps instead of building from source
- Prefer system-installed headers for isa-l and libdeflate via
  __has_include, with bundled headers as fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Build Highway from source into /tmp/hwy-install instead of using
brew (which only provides dylib). This produces a fully static
fastp binary on macOS with zero 3rd-party runtime dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add countMismatchesBounded() that exits early when mismatches exceed
the limit, avoiding unnecessary work. Replace the scalar inner loop
in AdapterTrimmer::trimBySequence with the SIMD version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ubuntu 24.04's libhwy-dev is 1.0.7, which lacks SumsOf2 (added in 1.1.0).
Ubuntu's libisal-dev ships only a .so (no .a), which breaks -static linking.

Build both from source on Ubuntu to ensure static linking works.
Update README to note Highway >= 1.1.0 requirement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>