
fix: strip invisible word joiner from OCR output #4791

Open
pa4uslf wants to merge 2 commits into opendatalab:master from pa4uslf:fix/strip-word-joiner-output

Conversation

pa4uslf commented Apr 15, 2026

Summary

  • strip U+2060 WORD JOINER from the decoded OCR output only, not from the dictionary
  • keep the upstream dictionary unchanged so that label indices do not shift
  • add a targeted regression test for the CTC decode behavior
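
For reviewers, the idea in the summary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function `ctc_greedy_decode`, the toy `characters` list, and the blank index of 0 are all assumptions made for the example. It shows a greedy CTC decode that strips U+2060 after mapping indices to characters, and why deleting the entry from the dictionary instead would shift every later label index.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: zero-width, invisible in output


def ctc_greedy_decode(indices, characters, blank=0):
    """Hypothetical greedy CTC decode: collapse repeats, drop blanks,
    map indices to characters, then strip U+2060 from the final text."""
    chars = []
    prev = None
    for idx in indices:
        # Keep an index only if it is not blank and not a repeat of the previous step.
        if idx != blank and idx != prev:
            chars.append(characters[idx])
        prev = idx
    text = "".join(chars)
    # Strip at the output stage only; the dictionary itself is left untouched.
    return text.replace(WORD_JOINER, "")


# Toy dictionary (assumption): index 0 is the CTC blank, index 2 is the word joiner.
characters = ["<blank>", "a", WORD_JOINER, "b"]

# Model emits: a, a, blank, word joiner, b, b
decoded = ctc_greedy_decode([1, 1, 0, 2, 3, 3], characters)

# Why not remove the entry from the dictionary instead?
# Dropping it moves "b" from index 3 to index 2, so a model trained
# against the original dictionary would decode to the wrong characters.
pruned = [c for c in characters if c != WORD_JOINER]
index_of_b_before = characters.index("b")
index_of_b_after = pruned.index("b")
```

Stripping at decode time keeps the mapping between model outputs and dictionary entries intact, which is why the summary limits the change to the decoded text.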

Testing

  • PYTHONPATH=. pytest -q -o addopts="" tests/unittest/test_rec_postprocess.py

dosubot Bot added the size:S label (This PR changes 10-29 lines, ignoring generated files.) Apr 15, 2026
Contributor

github-actions Bot commented Apr 15, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

Author

pa4uslf commented Apr 15, 2026

I have read the CLA Document and I hereby sign the CLA

github-actions Bot added a commit that referenced this pull request Apr 15, 2026
dosubot Bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) and removed the size:S label (This PR changes 10-29 lines, ignoring generated files.) Apr 15, 2026

Labels

size:XL This PR changes 500-999 lines, ignoring generated files.


1 participant