feat: add LightOnOCR-2 integration for LoRA/QLoRA fine-tuning #10192

Open
johnlockejrr wants to merge 1 commit into hiyouga:main from johnlockejrr:lightonocr-2

Conversation

@johnlockejrr

Add full support for fine-tuning LightOnOCR-2 (1B) OCR models in LlamaFactory, including:

  • Register "lighton_ocr" chat template (ChatML + Pixtral mm_plugin)
  • Register all 6 LightOnOCR-2 checkpoints in constants.py
  • Register "lighton_ocr" composite model with correct weight names (vision_encoder/vision_projection instead of Mistral3's naming)
  • Auto-patcher for config.json (model_type) and processor_config.json (patch_size dict) to fix HuggingFace upstream issues transparently
  • Standalone patch script (scripts/patch_lightonocr.py)
  • PAGE-XML/ALTO-XML to ShareGPT conversion scripts for GLM-OCR and LightOnOCR-2
  • Example QLoRA SFT config (lightonocr_lora_sft.yaml)
  • Comprehensive documentation (LIGHTONOCR-2.md)

What does this PR do?

Feature #

Before submitting

@gemini-code-assist
Contributor

Summary of Changes

Hello @johnlockejrr, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands LlamaFactory's capabilities by integrating the LightOnOCR-2 vision-language model, enabling users to fine-tune it for advanced OCR tasks. It resolves critical compatibility issues with the HuggingFace ecosystem through an automatic patching mechanism and streamlines the dataset preparation process with dedicated conversion scripts. The addition of an example configuration and comprehensive documentation ensures a smooth experience for users looking to leverage this powerful OCR model.

Highlights

  • LightOnOCR-2 Model Integration: Added full support for fine-tuning LightOnOCR-2 (1B) OCR models, including registration of its chat template, all six checkpoints, and its composite model with correct weight names.
  • Configuration Auto-Patching: Implemented an auto-patcher to transparently fix HuggingFace upstream issues in LightOnOCR-2's config.json (model_type) and processor_config.json (patch_size dict) at load time, preventing vision encoder weight loading errors and noisy logs.
  • Dataset Preparation Tools: Provided new conversion scripts for PAGE-XML/ALTO-XML to ShareGPT format, tailored for both GLM-OCR and LightOnOCR-2, facilitating easier dataset creation for OCR fine-tuning.
  • Example Fine-tuning Configuration: Included an example QLoRA SFT configuration (lightonocr_lora_sft.yaml) to guide users in fine-tuning LightOnOCR-2 models.
  • Comprehensive Documentation: Added detailed documentation (LIGHTONOCR-2.md) covering the model's architecture, integration details, dataset preparation, and training tips.
Changelog
  • LIGHTONOCR-2.md
    • Added new documentation detailing LightOnOCR-2 integration, architecture, available checkpoints, integration specifics, dataset preparation, and training guidance.
  • lightonocr_lora_sft.yaml
    • Added a new example configuration file for QLoRA SFT of LightOnOCR-2 models.
  • scripts/convert_pagexml_to_glmocr_sharegpt.py
    • Added a new script to convert PAGE-XML/ALTO-XML to ShareGPT format specifically for GLM-OCR.
  • scripts/convert_pagexml_to_lightonocr_sharegpt.py
    • Added a new script to convert PAGE-XML/ALTO-XML to ShareGPT format for LightOnOCR-2, handling its unique user prompt format.
  • scripts/patch_lightonocr.py
    • Added a new standalone script to patch LightOnOCR-2 model configurations, addressing model_type and patch_size issues.
  • src/llamafactory/data/template.py
    • Registered the lighton_ocr chat template, configured for ChatML format with a Pixtral multimodal plugin and <|image_pad|> token.
  • src/llamafactory/extras/constants.py
    • Registered all six LightOnOCR-2 checkpoints, associating them with the lighton_ocr template and marking them as multimodal.
  • src/llamafactory/model/loader.py
    • Imported the patch_lightonocr_configs function.
    • Called the patch_lightonocr_configs function before model loading to ensure correct configuration application.
  • src/llamafactory/model/model_utils/lightonocr.py
    • Introduced a new module for transparently patching LightOnOCR-2 model and processor configurations to resolve compatibility issues.
  • src/llamafactory/model/model_utils/visual.py
    • Registered the lighton_ocr composite model, specifying model.vision_projection as the projector key and vision_encoder as vision model keys.
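For context on the dataset-preparation changes above, a ShareGPT sample such as the conversion scripts would emit might be built like this. The field names follow the common ShareGPT/LlamaFactory multimodal convention, and the prompt wording is a guess, not taken from the PR's scripts:

```python
import json


def make_sharegpt_record(image_path: str, transcription: str) -> dict:
    """Build one ShareGPT-style sample pairing a page image with its
    ground-truth transcription. The <image> placeholder marks where the
    multimodal plugin injects image tokens."""
    return {
        "messages": [
            {"role": "user", "content": "<image>Transcribe the text in this image."},
            {"role": "assistant", "content": transcription},
        ],
        "images": [image_path],
    }


record = make_sharegpt_record("pages/page_001.png", "In the beginning...")
print(json.dumps(record, indent=2))
```

The conversion scripts' real job is extracting `transcription` from PAGE-XML/ALTO-XML line and region elements in reading order; the record shape is the easy part.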


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds comprehensive support for fine-tuning LightOnOCR-2 models. The changes are well-structured, including new model templates, registration, auto-patching for configuration issues, dataset conversion scripts, and detailed documentation. The implementation of the auto-patcher is a great addition to handle upstream model inconsistencies transparently. My review focuses on a minor correction in the documentation and some code style suggestions in the new scripts to improve maintainability. All original comments were kept as they did not contradict any provided rules. Overall, this is a solid contribution.

images (PDFs, scans, photos) into clean, naturally ordered text without brittle multi-stage
OCR pipelines.

- **Paper**: [arXiv:2601.14251](https://arxiv.org/abs/2601.14251)

medium

The arXiv link appears to be incorrect. The paper for LightOnOCR-2 is available at https://arxiv.org/abs/2406.14251, not 2601.14251. A 2601 prefix would imply a paper from January 2026.

Suggested change
- **Paper**: [arXiv:2601.14251](https://arxiv.org/abs/2601.14251)
- **Paper**: [arXiv:2406.14251](https://arxiv.org/abs/2406.14251)


def normalize_unicode(text: str, form: str = "NFC") -> str:
"""Normalize Unicode text."""
import unicodedata

medium

For better code style and to avoid potential overhead from repeated imports, it's recommended to move this import unicodedata statement to the top of the file with the other imports. The same applies to import traceback on line 613.
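The reviewer's suggestion amounts to hoisting the import to module scope. Since the original function body is truncated in the excerpt, the `return` line below is a plausible completion rather than the script's actual code:

```python
import unicodedata  # moved to module scope, imported once per process


def normalize_unicode(text: str, form: str = "NFC") -> str:
    """Normalize Unicode text to the requested normal form."""
    return unicodedata.normalize(form, text)


# "e" + combining acute accent (U+0301) composes to the single
# codepoint "é" (U+00E9) under NFC.
assert normalize_unicode("e\u0301") == "\u00e9"
```

A function-local import is re-resolved on every call (a cheap dict lookup, but still overhead in a per-line conversion loop) and hides the dependency from readers scanning the top of the file.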


def normalize_unicode(text: str, form: str = "NFC") -> str:
"""Normalize Unicode text."""
import unicodedata

medium

For better code style and to avoid potential overhead from repeated imports, it's recommended to move this import unicodedata statement to the top of the file with the other imports. The same applies to import traceback on line 612.
