Skip to content

Batch spell-checker invocations for improved performance#214

Closed
ehamiter wants to merge 1 commit intofacelessuser:masterfrom
ehamiter:master
Closed

Batch spell-checker invocations for improved performance#214
ehamiter wants to merge 1 commit intofacelessuser:masterfrom
ehamiter:master

Conversation

@ehamiter
Copy link

@ehamiter ehamiter commented Mar 6, 2026

Currently, pyspelling spawns a separate aspell/hunspell subprocess for every source file. For large projects this becomes a bottleneck — process creation overhead dominates actual spell-checking time.

This PR restructures run_task() to separate the extraction and checking phases:

  • Extract phase — All files are processed through the pipeline (concurrently via ProcessPoolExecutor when jobs > 1) to produce extracted text sources.
  • Check phase — A single _batch_spellcheck() call combines all extracted text and invokes aspell/hunspell once. Misspelled words are then mapped back to their originating source via word-boundary matching.

New internal methods:

  • SpellChecker._collect_pipeline_sources() — yields pipeline-extracted sources without checking
  • Aspell._batch_spellcheck() / Hunspell._batch_spellcheck() — single-invocation batch checking
  • SpellingTask.extract_only() — extraction entry point for worker processes

Existing single-file behavior and results are preserved; this is purely a performance optimization.

Example with a project that has ~175 template files for processing:

aspell hunspell
Before 20.450s 36.507s
After 6.644s 6.591s
Speedup 3.1× 5.5×

(Using the total wall-clock times, and rounding speedup to 1 decimal.)


Caveat: I am a professional software dev but I authored this with assistance from Claude code; I have tested locally and also reviewed it but not sure what your stance on LLM/AI contributions are. No hard feelings if this is a dealbreaker.

@gir-bot gir-bot added S: needs-review Needs to be reviewed and/or approved. C: source Related to source code. labels Mar 6, 2026
@facelessuser
Copy link
Owner

My feelings are mixed, but if you have the technical ability to stand behind the work and fix it accordingly, I'm open to exploring a path forward.

Right off the bat:

  • Spelling checks need fixing in CI, I believe they can all be fixed by using proper captialization for their name hunspell -> Hunspell, and maybe either using the full name str -> string or just putting code language in backquotes str -> `str`.

  • I would prefer that the code is well tested, currently, we have untested code, see code coverage. Does this code work as intended? We don't know right now.

If you are willing to thoroughly vet the code, then I am open to considering it, but if the expectation is for me to figure out what the AI did and debug it to ensure it works properly, then maybe not.

@facelessuser
Copy link
Owner

Basically, I need people who can speak to the changes they present, understand it, and be confident that everything included makes sense and works as expected.

@ehamiter
Copy link
Author

ehamiter commented Mar 6, 2026

A reasonable take! I'll work on fixing CI tests and adding some additional tests to help increase coverage. Then we can see if this is something you'd like to add or close.

@facelessuser
Copy link
Owner

Yep, if there is code that should not be there, then rip it out. I am okay with ignoring some coverage, only if there is a reasonable explanation as to why it is maybe not important.

@ehamiter
Copy link
Author

ehamiter commented Mar 6, 2026

Sounds good. I'm closing this for now and will make a new one later with additional tests and try to remove some of the duplication between the aspell and hunspell functions.

@ehamiter ehamiter closed this Mar 6, 2026
@facelessuser
Copy link
Owner

The big thing is to make sure everything that is being suggested has a real reason for being there as well. I saw some interesting UTF-8 code manipulation. I'm not sure if that is needed or not, but it wasn't even tested. While it passed existing tests, there was still logic that was added for...reasons? I didn't dig too deep, but I'm expecting that to be done by the submitter before proposing. I obviously will dig in once it seems due diligence has been put in.

@facelessuser
Copy link
Owner

Also, did you compare this to cases where you tweak -j?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C: source Related to source code. S: needs-review Needs to be reviewed and/or approved.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants