MFKey 4.1 #296

Merged: hedger merged 1 commit into flipperdevices:dev from noproto:dev on Feb 16, 2026

@noproto noproto commented Feb 14, 2026

What's new

  • MFKey 4.0: Supports the new SEN dictionary format for 10x faster recovery. Support for the previous SEN dictionary format has been dropped.
  • MFKey 4.1: Key recovery is 40% faster and uses memory more efficiently.

perf(mfkey): comprehensive performance optimization of Crypto1 key recovery

The recovery pipeline transitions from a monolithic scalar approach to a modular, bitsliced, batch-oriented architecture tailored for the FlipperZero's ARM Cortex-M4.

Refactored monolithic mfkey.c into specialized compilation units: mfkey_batch_prelude (parallel R0-R3 tree), mfkey_state_expansion (R4-R12 LFSR search), mfkey_bs_verify (SWAR verification kernels), mfkey_recovery (sort + dispatch), mfkey_dedup (hash + scan).

Filter optimizations (crypto1.h):
  - filter_pair/filter_pair_xor: compute filter(v) and filter(v|1) with shared lookup work, since bit 0 only affects one LUT index
  - ADJ_FILTER macro: pre-fold keystream bit into filter constant, eliminating a per-iteration XOR
  - update_contribution_reg: register-based parity update avoiding array access overhead
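
To make the first two bullets concrete, here is a minimal sketch, not the code from this PR. It assumes the filter has the shape used by the public crapto1 code (five 4-bit lookups feeding a 32-bit LUT constant); the signatures of `filter_pair` and `filter_adj` and the definition of `ADJ_FILTER` are hypothetical and may differ from what `crypto1.h` actually contains.

```c
#include <stdint.h>

#define BIT(x, n) (((x) >> (n)) & 1u)
#define FILTER_LUT 0xEC57E80Au /* final 5-bit -> 1-bit lookup, crapto1 style */

/* 5-bit index built from five 4-bit lookups. */
static inline uint32_t filter_index(uint32_t x) {
    uint32_t f;
    f  = (0xf22c0u >> (x & 0xf)) & 16;
    f |= (0x6c9c0u >> ((x >> 4) & 0xf)) & 8;
    f |= (0x3c8b0u >> ((x >> 8) & 0xf)) & 4;
    f |= (0x1e458u >> ((x >> 12) & 0xf)) & 2;
    f |= (0x0d938u >> ((x >> 16) & 0xf)) & 1;
    return f;
}

/* Reference scalar filter. */
static inline uint32_t filter(uint32_t x) {
    return BIT(FILTER_LUT, filter_index(x));
}

/* Hypothetical filter_pair: f0 = filter(v), f1 = filter(v | 1).
 * Bit 0 only reaches the first lookup, so the other four are shared. */
static inline void filter_pair(uint32_t v, uint32_t *f0, uint32_t *f1) {
    uint32_t shared;
    shared  = (0x6c9c0u >> ((v >> 4) & 0xf)) & 8;
    shared |= (0x3c8b0u >> ((v >> 8) & 0xf)) & 4;
    shared |= (0x1e458u >> ((v >> 12) & 0xf)) & 2;
    shared |= (0x0d938u >> ((v >> 16) & 0xf)) & 1;
    uint32_t lo = v & 0xf;
    *f0 = BIT(FILTER_LUT, shared | ((0xf22c0u >> lo) & 16));
    *f1 = BIT(FILTER_LUT, shared | ((0xf22c0u >> (lo | 1u)) & 16));
}

/* Hypothetical ADJ_FILTER: complement the LUT constant when the keystream
 * bit is 1, so that BIT(ADJ_FILTER(ks), index) == filter(x) ^ ks. */
#define ADJ_FILTER(ks) (FILTER_LUT ^ (0u - (uint32_t)(ks)))

/* Filter evaluated against a pre-adjusted constant: no XOR in the loop. */
static inline uint32_t filter_adj(uint32_t x, uint32_t adj) {
    return BIT(adj, filter_index(x));
}
```

With this shape, a caller would compute `ADJ_FILTER(ks)` once per round and test `filter_adj(x, adj) == 0` inside the loop instead of `(filter(x) ^ ks) == 0`.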

Batch prelude (rounds 0-3):
  - 2KB precomputed Super-LUT bitmasks (R0-R3) enable 32-lane parallel survival checks via single indexed load per round
  - Unified tree returns per-child leaf masks, eliminating the separate prefilter + reconstruction pipeline
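
The idea of one indexed load per round can be illustrated with a toy model. Everything below is an assumption made for illustration: the table layout (4 rounds x 128 entries x 4 bytes, which happens to total 2 KB), the 7-bit context index, and the `toy_survives` predicate standing in for the real filter check. It only shows how a precomputed bitmask lets 32 lanes be tested with a single AND.

```c
#include <stdint.h>

#define ROUNDS 4u
#define CTX_BITS 7u /* hypothetical index width */
#define CTX_SIZE (1u << CTX_BITS)

/* Bit `lane` of super_lut[r][ctx] is 1 if that lane survives round r. */
static uint32_t super_lut[ROUNDS][CTX_SIZE];

/* Toy stand-in for the per-lane filter test (NOT the Crypto1 filter). */
static uint32_t toy_survives(uint32_t round, uint32_t ctx, uint32_t lane) {
    uint32_t x = (ctx * 0x9E3779B1u) ^ (lane * 0x85EBCA6Bu) ^ (round * 0xC2B2AE35u);
    x ^= x >> 15;
    return (x >> 3) & 1u;
}

/* Precompute the survival bitmasks once. */
static void build_super_lut(void) {
    for (uint32_t r = 0; r < ROUNDS; r++) {
        for (uint32_t ctx = 0; ctx < CTX_SIZE; ctx++) {
            uint32_t mask = 0;
            for (uint32_t lane = 0; lane < 32; lane++) {
                mask |= toy_survives(r, ctx, lane) << lane;
            }
            super_lut[r][ctx] = mask;
        }
    }
}

/* Batched check: one table load and one AND per round for all 32 lanes. */
static uint32_t survivors_lut(const uint32_t ctx[ROUNDS]) {
    uint32_t alive = 0xFFFFFFFFu;
    for (uint32_t r = 0; r < ROUNDS; r++) {
        alive &= super_lut[r][ctx[r] & (CTX_SIZE - 1u)];
    }
    return alive;
}

/* Scalar reference: one predicate call per lane per round. */
static uint32_t survivors_scalar(const uint32_t ctx[ROUNDS]) {
    uint32_t alive = 0;
    for (uint32_t lane = 0; lane < 32; lane++) {
        uint32_t ok = 1;
        for (uint32_t r = 0; r < ROUNDS; r++) {
            ok &= toy_survives(r, ctx[r] & (CTX_SIZE - 1u), lane);
        }
        alive |= ok << lane;
    }
    return alive;
}
```

The batched version does 4 loads instead of up to 128 predicate evaluations per batch, which is the kind of saving the bullet describes.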

State expansion (rounds 4-12):
  - fork_delta deferred past round 4 early-exit (~95% of calls exit before needing it)
  - Rounds 5-6 manually unrolled with in-place buffer updates
  - ADJ_FILTER XOR shortcut applied throughout

Bitsliced SWAR verification:
  - 32-lane parallel candidate verification for all three attack types (mfkey32, static_nested, static_encrypted), replacing scalar candidate-at-a-time paths
  - VFP register parking: use M4F FPU registers as scratchpad via vmov, reducing register pressure during bitsliced filter computation
  - Fused rollback/crypt + keystream comparison with byte-boundary early exit
  - 32x32 butterfly transpose for scalar-to-SWAR conversion
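
For readers unfamiliar with the scalar-to-SWAR step, the following is a minimal sketch of a 32x32 bit-matrix butterfly transpose in the style of the well-known Hacker's Delight routine. It is not taken from this PR, and the function name `transpose32` is hypothetical. Given 32 candidate words as rows, the transposed row `b` holds bit `b` of every candidate, one candidate per lane.

```c
#include <stdint.h>

/* In-place 32x32 bit-matrix transpose. Afterwards, bit j of a[i] holds what
 * was bit i of a[j] (bit 0 = least significant). Five butterfly stages swap
 * off-diagonal blocks of size 16, 8, 4, 2 and 1. */
static void transpose32(uint32_t a[32]) {
    uint32_t m = 0x0000FFFFu;
    /* j is halved first, then the mask is refined using the new j. */
    for (uint32_t j = 16; j != 0; j >>= 1, m ^= m << j) {
        /* k visits the upper row of every row pair (k, k + j). */
        for (uint32_t k = 0; k < 32; k = (k + j + 1) & ~j) {
            /* Swap the high block of row k with the low block of row k + j. */
            uint32_t t = ((a[k] >> j) ^ a[k + j]) & m;
            a[k + j] ^= t;
            a[k] ^= t << j;
        }
    }
}
```

After the call, a bitsliced kernel can operate on `a[b]` as "bit `b` of all 32 candidates" with ordinary 32-bit logic instructions.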

Recovery and deduplication:
  - 8-bit radix sort replacing quicksort for O(n) candidate ordering
  - Fibonacci hash (golden ratio multiply-shift) bitmask filter for identity deduplication
  - 8x unrolled (Duff's device) loop in duplicate scan
  - Level-0 sort bypass and empty cross-product short-circuit in MSB-walk
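
The first two bullets can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PR's implementation: the function names, the 32-bit key width, the scratch-buffer interface, and the 1024-bit filter size are all hypothetical. The radix sort makes four passes over the data regardless of `n`, which is where the O(n) bound comes from. The filter uses the golden-ratio constant 2654435769 (about 2^32 divided by the golden ratio) as a multiply-shift hash.

```c
#include <stddef.h>
#include <stdint.h>

/* LSD radix sort, 8 bits per pass. `scratch` must hold n elements.
 * Four passes means an even number of buffer swaps, so the sorted
 * result ends up back in `data`. */
static void radix_sort_u32(uint32_t *data, uint32_t *scratch, size_t n) {
    uint32_t *src = data;
    uint32_t *dst = scratch;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[256] = {0};
        for (size_t i = 0; i < n; i++) {
            count[(src[i] >> shift) & 0xFF]++;
        }
        /* Turn counts into starting offsets (exclusive prefix sum). */
        size_t pos = 0;
        for (size_t b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }
        /* Stable scatter into the other buffer. */
        for (size_t i = 0; i < n; i++) {
            dst[count[(src[i] >> shift) & 0xFF]++] = src[i];
        }
        uint32_t *tmp = src;
        src = dst;
        dst = tmp;
    }
}

#define SEEN_BITS 10u /* hypothetical: 1024-bit filter = 128 bytes */

typedef struct {
    uint32_t words[(1u << SEEN_BITS) / 32u];
} seen_filter_t;

/* Fibonacci hash: multiply by the golden-ratio constant, keep the top bits. */
static inline uint32_t fib_hash(uint32_t key) {
    return (key * 2654435769u) >> (32u - SEEN_BITS);
}

/* Returns 0 if `key` is definitely new, 1 if it may have been seen before
 * (a hit still needs an exact comparison). Marks the key as seen. */
static int seen_test_and_set(seen_filter_t *f, uint32_t key) {
    uint32_t h = fib_hash(key);
    uint32_t mask = 1u << (h & 31u);
    uint32_t *w = &f->words[h >> 5];
    int maybe_seen = (*w & mask) != 0;
    *w |= mask;
    return maybe_seen;
}
```

In a deduplication pass, a return value of 0 from `seen_test_and_set` lets the caller skip the duplicate scan entirely; only possible hits pay for the full comparison.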

Attack engine:
  - R4 lane mask precomputation (batch-invariant, avoids per-lane Flash table reads)
  - OPT_BARRIER prevents compiler from hoisting BIT(xks, round) extractions and spilling to stack
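
One common way to build such a barrier with GCC or Clang is an empty extended-asm statement that claims to modify a register operand. The definition below is an assumption for illustration; the PR's actual `OPT_BARRIER` may be defined differently, and `count_matching_rounds` is a hypothetical helper that only demonstrates the usage pattern.

```c
#include <stdint.h>

#define BIT(x, n) (((x) >> (n)) & 1u)

/* Hypothetical definition (GCC/Clang extension): emits no instructions, but
 * tells the compiler that x is read and rewritten in a register here, so
 * values derived from x cannot be computed ahead of this point. */
#define OPT_BARRIER(x) __asm__ volatile("" : "+r"(x))

/* Counts the rounds in which the keystream bit matches the expected bit.
 * The barrier keeps each BIT(ks, round) extraction inside its iteration
 * instead of letting the compiler precompute all of them up front. */
static uint32_t count_matching_rounds(uint32_t xks, const uint32_t *expected,
                                      uint32_t rounds) {
    uint32_t matches = 0;
    for (uint32_t round = 0; round < rounds; round++) {
        uint32_t ks = xks;
        OPT_BARRIER(ks);
        matches += (BIT(ks, round) == expected[round]);
    }
    return matches;
}
```

The barrier does not change the computed result; it only constrains where the compiler may schedule the extraction.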

Verification

  • I've verified these changes do not introduce regressions on nearly 100 test nonces.
  • To verify, place Mfkey32 nonces and Nested nonces on the device, open MFKey, and press "Run", then check that the keys found are the expected keys. If no reader is available, the Extract MFC Keys feature of the NFC app can generate valid Mfkey32 nonce pairs while a Proxmark3 is authenticating. Nested nonces can be collected by reading an MFC tag that has diversified or missing keys.

Here are my test cases, if they are helpful for verification:

Checklist (For Reviewer)

  • PR has description of feature/bug or link to Confluence/Jira task
  • Description contains actions to verify feature/bugfix
  • I've built this code, uploaded it to the device and verified feature/bugfix

@hedger hedger merged commit 223db79 into flipperdevices:dev Feb 16, 2026
1 of 2 checks passed