Batch spell-checker invocations for improved performance#214
Batch spell-checker invocations for improved performance#214ehamiter wants to merge 1 commit intofacelessuser:masterfrom
Conversation
|
My feelings are mixed, but if you have the technical ability to stand behind the work and fix it accordingly, I'm open to exploring a path forward. Right off the bat:
If you are willing to thoroughly vet the code, then I am open to considering it, but if the expectation is for me to figure out what the AI did and debug it to ensure it works properly, then maybe not. |
|
Basically, I need people who can speak to the changes they present, understand it, and be confident that everything included makes sense and works as expected. |
|
A reasonable take! I'll work on fixing CI tests and adding some additional tests to help increase coverage. Then we can see if this is something you'd like to add or close. |
|
Yep, if there is code that should not be there, then rip it out. I am okay with ignoring some coverage, only if there is a reasonable explanation as to why it is maybe not important. |
|
Sounds good. I'm closing this for now and will make a new one later with additional tests and try to remove some of the duplication between the aspell and hunspell functions. |
|
The big thing is to make sure everything that is being suggested has a real reason for being there as well. I saw some interesting UTF-8 code manipulation. I'm not sure if that is needed or not, but it wasn't even tested. While it passed existing tests, there was still logic that was added for...reasons? I didn't dig too deep, but I'm expecting that to be done by the submitter before proposing. I obviously will dig in once it seems due diligence has been put in. |
|
Also, did you compare this to cases where you tweak |
Currently, pyspelling spawns a separate aspell/hunspell subprocess for every source file. For large projects this becomes a bottleneck — process creation overhead dominates actual spell-checking time.
This PR restructures
run_task()to separate the extraction and checking phases:ProcessPoolExecutorwhen jobs > 1) to produce extracted text sources._batch_spellcheck()call combines all extracted text and invokes aspell/hunspell once. Misspelled words are then mapped back to their originating source via word-boundary matching.New internal methods:
SpellChecker._collect_pipeline_sources()— yields pipeline-extracted sources without checkingAspell._batch_spellcheck()/Hunspell._batch_spellcheck()— single-invocation batch checkingSpellingTask.extract_only()— extraction entry point for worker processesExisting single-file behavior and results are preserved; this is purely a performance optimization.
Example with a project that has ~175 template files for processing:
(Using the
totalwall-clock times, and rounding speedup to 1 decimal.)Caveat: I am a professional software dev but I authored this with assistance from Claude code; I have tested locally and also reviewed it but not sure what your stance on LLM/AI contributions are. No hard feelings if this is a dealbreaker.