diff --git a/skills/README.md b/skills/README.md index 4ddd5087179..8cb59a25fac 100644 --- a/skills/README.md +++ b/skills/README.md @@ -9,8 +9,8 @@ This directory contains official PaddleOCR Agent Skills. They integrate with AI ## Prerequisites -1. Python 3.8 or later must be installed on the device that runs the skill. -2. These skills depend on PaddleOCR official APIs and require API credentials. Visit the [PaddleOCR website](https://www.paddleocr.com), click **API**, select the model you need, then copy the `API_URL` and `Token`. They correspond to the API URL and access token used for authentication. Supported models per skill: +1. Python 3.9 or later must be installed on the device that runs the skill. +2. These skills depend on PaddleOCR official APIs and require API credentials. Visit the [PaddleOCR website](https://www.paddleocr.com), click **API**, select the model you need, select the language for the text recognition model, then copy the `API_URL` and `Token`. They correspond to the API URL and access token used for authentication. 
Supported models per skill: - `paddleocr-text-recognition`: `PP-OCRv5` - `paddleocr-doc-parsing`: `PP-StructureV3`, `PaddleOCR-VL`, `PaddleOCR-VL-1.5` @@ -34,6 +34,7 @@ npx skills add PaddlePaddle/PaddleOCR -g --skill paddleocr-doc-parsing -y > ```shell > git clone https://github.com/PaddlePaddle/PaddleOCR.git > npx skills add ./PaddleOCR/skills/paddleocr-text-recognition +> npx skills add ./PaddleOCR/skills/paddleocr-doc-parsing > ``` #### Option 2: Install via `clawhub` (OpenClaw) @@ -65,8 +66,8 @@ After installation, configure the required environment variables so the skills c | Skill | Required | Optional | | --- | --- | --- | -| `paddleocr-text-recognition` | `PADDLEOCR_OCR_API_URL` (API URL), `PADDLEOCR_ACCESS_TOKEN` (access token) | `PADDLEOCR_OCR_TIMEOUT` (API request timeout) | -| `paddleocr-doc-parsing` | `PADDLEOCR_DOC_PARSING_API_URL` (API URL), `PADDLEOCR_ACCESS_TOKEN` (access token) | `PADDLEOCR_DOC_PARSING_TIMEOUT` (API request timeout) | +| `paddleocr-text-recognition` | `PADDLEOCR_OCR_API_URL` (full endpoint URL ending with `/ocr`), `PADDLEOCR_ACCESS_TOKEN` (access token) | `PADDLEOCR_OCR_TIMEOUT` (API request timeout) | +| `paddleocr-doc-parsing` | `PADDLEOCR_DOC_PARSING_API_URL` (full endpoint URL ending with `/layout-parsing`), `PADDLEOCR_ACCESS_TOKEN` (access token) | `PADDLEOCR_DOC_PARSING_TIMEOUT` (API request timeout) | Below are configuration methods for some AI apps: @@ -150,10 +151,10 @@ Make sure your working directory is the directory containing this file. 1. Install dependencies.
```shell - python -m pip install -r paddleocr-text-recognition/scripts/requirements.txt - python -m pip install -r paddleocr-doc-parsing/scripts/requirements.txt + python -m pip install -r paddleocr-text-recognition/requirements.txt + python -m pip install -r paddleocr-doc-parsing/requirements.txt # Optional: required only when using document file optimization - python -m pip install -r paddleocr-doc-parsing/scripts/requirements-optimize.txt + python -m pip install -r paddleocr-doc-parsing/requirements-optimize.txt ``` 2. Configure environment variables (see [Configure Environment Variables](#configure-environment-variables) for the list of variables). @@ -170,3 +171,5 @@ Make sure your working directory is the directory containing this file. python paddleocr-text-recognition/scripts/smoke_test.py python paddleocr-doc-parsing/scripts/smoke_test.py ``` + + Use `--skip-api-test` to verify configuration only (no network call). Use `--test-url "https://..."` to override the default sample document/image URL. diff --git a/skills/README_cn.md b/skills/README_cn.md index 28f296afb5e..f1f7eaa2a14 100644 --- a/skills/README_cn.md +++ b/skills/README_cn.md @@ -9,8 +9,8 @@ ## 准备工作 -1. 请确保执行 skill 的设备安装有 Python 3.8 或以上版本。 -2. Skill 底层依赖于 PaddleOCR 官方 API,因此需要配置相关凭证才能使用。可以在 [PaddleOCR 官网](https://www.paddleocr.com) 点击 **API**,选择需要用到的算法,然后复制 `API_URL` 和 `Token`,它们分别对应服务的 API URL 和用户鉴权使用的 access token。各 skill 支持的算法如下: +1. 请确保执行 skill 的设备安装有 Python 3.9 或以上版本。 +2. 
Skill 底层依赖于 PaddleOCR 官方 API,因此需要配置相关凭证才能使用。可以在 [PaddleOCR 官网](https://www.paddleocr.com) 点击 **API**,选择需要用到的模型,选择语言(对于文字识别模型),然后复制 `API_URL` 和 `Token`,它们分别对应服务的 API URL 和用户鉴权使用的 access token。各 skill 支持的模型如下: - `paddleocr-text-recognition`:`PP-OCRv5` - `paddleocr-doc-parsing`:`PP-StructureV3`、`PaddleOCR-VL`、`PaddleOCR-VL-1.5` @@ -34,6 +34,7 @@ npx skills add PaddlePaddle/PaddleOCR -g --skill paddleocr-doc-parsing -y > ```shell > git clone https://github.com/PaddlePaddle/PaddleOCR.git > npx skills add ./PaddleOCR/skills/paddleocr-text-recognition +> npx skills add ./PaddleOCR/skills/paddleocr-doc-parsing > ``` #### 方式二:通过 `clawhub` 安装(OpenClaw) @@ -65,8 +66,8 @@ git clone https://github.com/PaddlePaddle/PaddleOCR.git | Skill | 必填 | 可选 | | --- | --- | --- | -| `paddleocr-text-recognition` | `PADDLEOCR_OCR_API_URL`(API URL)、`PADDLEOCR_ACCESS_TOKEN`(access token) | `PADDLEOCR_OCR_TIMEOUT`(API 请求超时时间) | -| `paddleocr-doc-parsing` | `PADDLEOCR_DOC_PARSING_API_URL`(API URL)、`PADDLEOCR_ACCESS_TOKEN`(access token) | `PADDLEOCR_DOC_PARSING_TIMEOUT`(API 请求超时时间) | +| `paddleocr-text-recognition` | `PADDLEOCR_OCR_API_URL`(完整端点 URL,须以 `/ocr` 结尾)、`PADDLEOCR_ACCESS_TOKEN`(access token) | `PADDLEOCR_OCR_TIMEOUT`(API 请求超时时间) | +| `paddleocr-doc-parsing` | `PADDLEOCR_DOC_PARSING_API_URL`(完整端点 URL,须以 `/layout-parsing` 结尾)、`PADDLEOCR_ACCESS_TOKEN`(access token) | `PADDLEOCR_DOC_PARSING_TIMEOUT`(API 请求超时时间) | 以下是部分 AI 应用的配置方式: @@ -150,10 +151,10 @@ git clone https://github.com/PaddlePaddle/PaddleOCR.git 1. 
安装依赖库。 ```shell - python -m pip install -r paddleocr-text-recognition/scripts/requirements.txt - python -m pip install -r paddleocr-doc-parsing/scripts/requirements.txt + python -m pip install -r paddleocr-text-recognition/requirements.txt + python -m pip install -r paddleocr-doc-parsing/requirements.txt # 可选依赖,仅在优化文档文件大小时需要 - python -m pip install -r paddleocr-doc-parsing/scripts/requirements-optimize.txt + python -m pip install -r paddleocr-doc-parsing/requirements-optimize.txt ``` 2. 配置环境变量(需要配置的变量参见[配置环境变量](#配置环境变量)一节)。 @@ -170,3 +171,5 @@ git clone https://github.com/PaddlePaddle/PaddleOCR.git python paddleocr-text-recognition/scripts/smoke_test.py python paddleocr-doc-parsing/scripts/smoke_test.py ``` + + 使用 `--skip-api-test` 可只做配置检查(不发网络请求)。使用 `--test-url "https://..."` 可指定自定义测试用文档/图片 URL。 diff --git a/skills/paddleocr-doc-parsing/SKILL.md b/skills/paddleocr-doc-parsing/SKILL.md index 18db775ec56..2822cd063e2 100644 --- a/skills/paddleocr-doc-parsing/SKILL.md +++ b/skills/paddleocr-doc-parsing/SKILL.md @@ -1,13 +1,18 @@ --- name: paddleocr-doc-parsing -description: Complex document parsing with PaddleOCR. Intelligently converts complex PDFs and document images into Markdown and JSON files that preserve the original structure. +description: >- + Use this skill to extract structured Markdown/JSON from PDFs and document images—tables with + cell-level precision, formulas as LaTeX, figures, seals, charts, headers/footers, multi-column + layout and correct reading order. + Trigger terms: 文档解析, 版面分析, 版面还原, 表格提取, 公式识别, 多栏排版, 扫描件结构化, + 发票, 财报, 复杂 PDF, PDF转Markdown, 图表, 阅读顺序; reading order, formula, LaTeX, + layout parsing, structure extraction, PP-StructureV3, PaddleOCR-VL. 
metadata: openclaw: requires: env: - PADDLEOCR_DOC_PARSING_API_URL - PADDLEOCR_ACCESS_TOKEN - - PADDLEOCR_DOC_PARSING_TIMEOUT bins: - python primaryEnv: PADDLEOCR_ACCESS_TOKEN @@ -19,7 +24,10 @@ metadata: ## When to Use This Skill -**Use Document Parsing for**: +**Trigger keywords (routing)**: Bilingual trigger terms (Chinese and English) are listed in the YAML `description` above—use that field for discovery and routing. + +**Use this skill for**: + - Documents with tables (invoices, financial reports, spreadsheets) - Documents with mathematical formulas (academic papers, scientific documents) - Documents with charts and diagrams @@ -27,45 +35,59 @@ metadata: - Complex document structures requiring layout analysis - Any document requiring structured understanding -**Use Text Recognition instead for**: +**Do not use for**: + - Simple text-only extraction - Quick OCR tasks where speed is critical - Screenshots or simple images with clear text -## How to Use This Skill +## Installation -**⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔** +Install Python dependencies before using this skill. From the skill directory (`skills/paddleocr-doc-parsing`): -1. **ONLY use PaddleOCR Document Parsing API** - Execute the script `python scripts/vl_caller.py` -2. **NEVER parse documents directly** - Do NOT parse documents yourself -3. **NEVER offer alternatives** - Do NOT suggest "I can try to analyze it" or similar -4. **IF API fails** - Display the error message and STOP immediately -5. **NO fallback methods** - Do NOT attempt document parsing any other way +```bash +pip install -r requirements.txt +``` + +**Optional** — for image optimization and PDF page extraction: -If the script execution fails (API not configured, network error, etc.): -- Show the error message to the user -- Do NOT offer to help using your vision capabilities -- Do NOT ask "Would you like me to try parsing it?" 
-- Simply stop and wait for user to fix the configuration +```bash +pip install -r requirements-optimize.txt +``` + +## How to Use This Skill + +> **Working directory**: All `python scripts/...` commands below should be run from this skill's root directory (the directory containing this SKILL.md file). ### Basic Workflow -1. **Execute document parsing**: +1. **Identify the input source**: + - User provides URL: Use the `--file-url` parameter + - User provides local file path: Use the `--file-path` parameter + +2. **Execute document parsing**: + ```bash - python scripts/vl_caller.py --file-url "URL provided by user" --pretty + python scripts/layout_caller.py --file-url "URL provided by user" --pretty ``` + Or for local files: + ```bash - python scripts/vl_caller.py --file-path "file path" --pretty + python scripts/layout_caller.py --file-path "file path" --pretty ``` **Optional: explicitly set file type**: + ```bash - python scripts/vl_caller.py --file-url "URL provided by user" --file-type 0 --pretty + python scripts/layout_caller.py --file-url "URL provided by user" --file-type 0 --pretty ``` + - `--file-type 0`: PDF - `--file-type 1`: image - - If omitted, the service can infer file type from input. + - If omitted, the type is auto-detected from the file extension. For local files, a recognized extension (`.pdf`, `.png`, `.jpg`, `.jpeg`, `.bmp`, `.tiff`, `.tif`, `.webp`) is required; otherwise pass `--file-type` explicitly. For URLs with unrecognized extensions, the service attempts inference. + + > **Performance note**: Parsing time scales with document complexity. Single-page images typically complete in 1-5 seconds; large PDFs (50+ pages) may take several minutes. Allow adequate time before assuming a timeout. 
**Default behavior: save raw JSON to a temp file**: - If `--output` is omitted, the script saves automatically under the system temp directory @@ -74,47 +96,43 @@ If the script execution fails (API not configured, network error, etc.): - If `--stdout` is provided, JSON is printed to stdout and no file is saved - In save mode, the script prints the absolute saved path on stderr: `Result saved to: /absolute/path/...` - In default/custom save mode, read and parse the saved JSON file before responding - - In save mode, always tell the user the saved file path and that full raw JSON is available there - Use `--stdout` only when you explicitly want to skip file persistence -2. **The output JSON contains COMPLETE content** with all document data: - - Headers, footers, page numbers - - Main text content - - Tables with structure - - Formulas (with LaTeX) - - Figures and charts - - Footnotes and references - - Seals and stamps - - Layout and reading order +3. **Parse JSON response**: + - Check the `ok` field: `true` means success, `false` means error + - The output contains complete document data: text, tables, formulas (LaTeX), figures, seals, headers/footers, and reading order + - Use the appropriate field based on what the user needs: + - `text` — full document text across all pages + - `result.result.layoutParsingResults[n].markdown.text` — page-level markdown + - `result.result.layoutParsingResults[n].prunedResult` — structured layout data with positions and confidence + - Handle errors: If `ok` is false, display `error.message` + +4. **Present results to user**: + - Display content based on what the user requested (see "Complete Output Display" below) + - If the content is empty, the document may contain no extractable text + - In save mode, always tell the user the saved file path and that full raw JSON is available there + +### What to Do After Parsing - **Input type note**: - - Supported file types depend on the model and endpoint configuration. 
- - Always follow the file type constraints documented by your endpoint API. +Common next steps once you have the structured output: -3. **Extract what the user needs** from the output JSON using these fields: - - Top-level `text` - - `result[n].markdown` - - `result[n].prunedResult` +- **Save as Markdown**: Write the `text` field to a `.md` file — tables, headings, and formulas are preserved +- **Extract specific tables**: Navigate `result.result.layoutParsingResults[n].prunedResult` to access individual layout elements with position and confidence data +- **Feed to RAG / search pipeline**: The `text` field is structured markdown, ready for chunking and indexing +- **Poor results**: See "Tips for Better Results" below before retrying -### IMPORTANT: Complete Content Display +### Complete Output Display -**CRITICAL**: You must display the COMPLETE extracted content to the user based on their needs. +Display the COMPLETE extracted content based on what the user asked for. The parsed output is only useful if the user receives all of it — truncation silently drops data. -- The output JSON contains ALL document content in a structured format -- In save mode, the raw provider result can be inspected in the saved JSON file -- **Display the full content requested by the user**, do NOT truncate or summarize - If user asks for "all text", show the entire `text` field - If user asks for "tables", show ALL tables in the document - If user asks for "main content", filter out headers/footers but show ALL body text - -**What this means**: -- **DO**: Display complete text, all tables, all formulas as requested -- **DO**: Present content using these fields: top-level `text`, `result[n].markdown`, and `result[n].prunedResult` -- **DON'T**: Truncate with "..." 
unless content is excessively long (>10,000 chars) -- **DON'T**: Summarize or provide excerpts when user asks for full content -- **DON'T**: Say "Here's a preview" when user expects complete output +- Do not truncate with "..." unless content is excessively long (>10,000 chars) +- Do not say "Here's a preview" when user expects complete output **Example - Correct**: + ``` User: "Extract all the text from this document" Agent: I've parsed the complete document. Here's all the extracted text: @@ -130,124 +148,118 @@ Quality: Excellent (confidence: 0.92) ``` **Example - Incorrect**: + ``` User: "Extract all the text" Agent: "I found a document with multiple sections. Here's the beginning: 'Introduction...' (content truncated for brevity)" ``` -### Understanding the JSON Response +### Understanding the Output -The output JSON uses an envelope wrapping the raw API result: +The script returns an envelope with `ok`, `text`, `result`, and `error`. Use `text` for the full document content; navigate `result.result.layoutParsingResults[n]` for per-page structured data. -```json -{ - "ok": true, - "text": "Full markdown/HTML text extracted from all pages", - "result": { ... }, // raw provider response - "error": null -} -``` - -**Key fields**: -- `text` — extracted markdown text from all pages (use this for quick text display) -- `result` - raw provider response object -- `result[n].prunedResult` - structured parsing output for each page (layout/content/confidence and related metadata) -- `result[n].markdown` — full rendered page output in markdown/HTML +For the complete schema and field-level details, see `references/output_schema.md`. 
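As a rough illustration of how a caller might consume the envelope described above, the sketch below loads a saved result file and either returns `text` or surfaces `error.code` and `error.message`. The helper names `load_envelope` and `extract_text` are hypothetical and not part of the skill's scripts; only the `ok`, `text`, `result`, and `error` fields come from the description above.

```python
import json
from pathlib import Path
from typing import Any


def load_envelope(path: str) -> dict[str, Any]:
    """Read the JSON envelope that the script saved to disk (hypothetical helper)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def extract_text(envelope: dict[str, Any]) -> str:
    """Return the full document text, or raise with the reported error."""
    if not envelope.get("ok"):
        # On failure the envelope carries an `error` object with `code` and `message`.
        error = envelope.get("error") or {}
        raise RuntimeError(f"{error.get('code')}: {error.get('message')}")
    return envelope.get("text", "")


# Usage with inline sample envelopes (made-up values, for illustration only).
success = {"ok": True, "text": "# Title\n\nBody", "result": {}, "error": None}
failure = {
    "ok": False,
    "text": "",
    "result": None,
    "error": {"code": "CONFIG_ERROR", "message": "API not configured"},
}
print(extract_text(success))
```

Raising on `ok: false` keeps the failure path explicit, which matches the guidance to show the error message and stop rather than fall back to another method.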
> Raw result location (default): the temp-file path printed by the script on stderr ### Usage Examples **Example 1: Extract Full Document Text** + ```bash -python scripts/vl_caller.py \ +python scripts/layout_caller.py \ --file-url "https://example.com/paper.pdf" \ --pretty ``` Then use: + - Top-level `text` for quick full-text output -- `result[n].markdown` when page-level output is needed +- `result.result.layoutParsingResults[n].markdown` when page-level output is needed **Example 2: Extract Structured Page Data** + ```bash -python scripts/vl_caller.py \ +python scripts/layout_caller.py \ --file-path "./financial_report.pdf" \ --pretty ``` Then use: -- `result[n].prunedResult` for structured parsing data (layout/content/confidence) -- `result[n].markdown` for rendered page content -**Example 3: Print JSON Without Saving** +- `result.result.layoutParsingResults[n].prunedResult` for structured parsing data (layout/content/confidence) + +**Example 3: Print JSON to stdout (without saving to file)** + ```bash -python scripts/vl_caller.py \ +python scripts/layout_caller.py \ --file-url "URL" \ --stdout \ --pretty ``` -Then return: -- Full `text` when user asks for full document content -- `result[n].prunedResult` and `result[n].markdown` when user needs complete structured page data +By default the script writes JSON to a temp file and prints the path to stderr. Add `--stdout` to print the full JSON directly to stdout instead. Use this when you need to inspect the result inline or pipe it to another tool. ### First-Time Configuration -You can generally assume that the required environment variables have already been configured. Only when a parsing task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it. 
+**When API is not configured**, the script outputs: -**When API is not configured**: - -The error will show: -``` -CONFIG_ERROR: PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com +```json +{ + "ok": false, + "text": "", + "result": null, + "error": { + "code": "CONFIG_ERROR", + "message": "PADDLEOCR_DOC_PARSING_API_URL not configured. Get your API at: https://paddleocr.com" + } +} ``` **Configuration workflow**: -1. **Show the exact error message** to the user (including the URL). +1. **Show the exact error message** to the user. + +2. **Guide the user to obtain credentials**: Visit the [PaddleOCR website](https://www.paddleocr.com), click **API**, select a model (`PP-StructureV3`, `PaddleOCR-VL`, or `PaddleOCR-VL-1.5`), then copy the `API_URL` and `Token`. They map to these environment variables: + - `PADDLEOCR_DOC_PARSING_API_URL` — full endpoint URL ending with `/layout-parsing` + - `PADDLEOCR_ACCESS_TOKEN` — 40-character alphanumeric string -2. **Guide the user to configure securely**: - - Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat. - - List the required environment variables: - ``` - - PADDLEOCR_DOC_PARSING_API_URL - - PADDLEOCR_ACCESS_TOKEN - - Optional: PADDLEOCR_DOC_PARSING_TIMEOUT - ``` + Optionally configure `PADDLEOCR_DOC_PARSING_TIMEOUT` for request timeout. Recommend using the host application's standard configuration method rather than pasting credentials in chat. -3. **If the user provides credentials in chat anyway** (accept any reasonable format), for example: - - `PADDLEOCR_DOC_PARSING_API_URL=https://xxx.paddleocr.com/layout-parsing, PADDLEOCR_ACCESS_TOKEN=abc123...` - - `Here's my API: https://xxx and token: abc123` - - Copy-pasted code format - - Any other reasonable format - - **Security note**: Warn the user that credentials shared in chat may be stored in conversation history. 
Recommend setting them through the host application's configuration instead when possible. +3. **Apply credentials** — one of: + - **User configured via the host UI**: ask the user to confirm, then retry. + - **User pastes credentials in chat**: warn that they may be stored in conversation history, help the user persist them using the host's standard configuration method, then retry. - Then parse and validate the values: - - Extract `PADDLEOCR_DOC_PARSING_API_URL` (look for URLs with `paddleocr.com` or similar) - - Confirm `PADDLEOCR_DOC_PARSING_API_URL` is a full endpoint ending with `/layout-parsing` - - Extract `PADDLEOCR_ACCESS_TOKEN` (long alphanumeric string, usually 40+ chars) +### Handling Large Files -4. **Ask the user to confirm the environment is configured**. +For PDFs, the maximum is 100 pages per request. -5. **Retry only after confirmation**: - - Once the user confirms the environment variables are available, retry the original parsing task +#### Optimize Large Images Before Parsing -### Handling Large Files +For large image files, compress before uploading — this reduces upload time and can improve processing stability: + +```bash +python scripts/optimize_file.py input.png output.jpg --quality 85 +python scripts/layout_caller.py --file-path "output.jpg" --pretty +``` -There is no file size limit for the API. For PDFs, the maximum is 100 pages per request. +`--quality` controls JPEG/WebP lossy compression (1-100, default 85); it has no effect on PNG output. Use `--target-size` (in MB, default 20) to set the max file size — the script iteratively downscales until the target is met. 
-**Tips for large files**: +Requires optional dependencies: `pip install -r requirements-optimize.txt` #### Use URL for Large Local Files (Recommended) + For very large local files, prefer `--file-url` over `--file-path` to avoid base64 encoding overhead: + ```bash -python scripts/vl_caller.py --file-url "https://your-server.com/large_file.pdf" +python scripts/layout_caller.py --file-url "https://your-server.com/large_file.pdf" ``` #### Process Specific Pages (PDF Only) + If you only need certain pages from a large PDF, extract them first: + ```bash # Extract pages 1-5 python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5" @@ -256,52 +268,54 @@ python scripts/split_pdf.py large.pdf pages_1_5.pdf --pages "1-5" python scripts/split_pdf.py large.pdf selected_pages.pdf --pages "1-5,8,10-12" # Then process the smaller file -python scripts/vl_caller.py --file-path "pages_1_5.pdf" +python scripts/layout_caller.py --file-path "pages_1_5.pdf" ``` ### Error Handling -**Authentication failed (403)**: -``` -error: Authentication failed -``` -→ Token is invalid, reconfigure with correct credentials +All errors return JSON with `ok: false`. Show the error message and stop — do not fall back to your own vision capabilities. 
Identify the issue from `error.code` and `error.message`: -**API quota exceeded (429)**: -``` -error: API quota exceeded -``` -→ Daily API quota exhausted, inform user to wait or upgrade +**Authentication failed (403)** — `error.message` contains "Authentication failed" -**Unsupported format**: -``` -error: Unsupported file format -``` -→ File format not supported, convert to PDF/PNG/JPG +- Token is invalid, reconfigure with correct credentials + +**Quota exceeded (429)** — `error.message` contains "API rate limit exceeded" -## Important Notes +- Daily API quota exhausted, inform user to wait or upgrade -- **The script NEVER filters content** - It always returns complete data -- **The AI agent decides what to present** - Based on user's specific request -- **All data is always available** - Can be re-interpreted for different needs -- **No information is lost** - Complete document structure preserved +**Unsupported format** — `error.message` contains "Unsupported file format" + +- File format not supported, convert to PDF/PNG/JPG + +**No content detected**: + +- `text` field is empty +- Document may be blank, image-only, or contain no extractable text + +### Tips for Better Results + +If parsing quality is poor: + +- **Large or high-resolution images**: Compress with `optimize_file.py` before parsing — oversized inputs can degrade layout detection: + ```bash + python scripts/optimize_file.py input.png optimized.jpg --quality 85 + ``` +- **Check confidence**: `result.result.layoutParsingResults[n].prunedResult` includes confidence scores per layout element — low values indicate regions worth reviewing ## Reference Documentation -- `references/output_schema.md` - Output format specification +- `references/output_schema.md` — Full output schema, field descriptions, and command examples > **Note**: Model version and capabilities are determined by your API endpoint (`PADDLEOCR_DOC_PARSING_API_URL`). 
-Load these reference documents into context when: -- Debugging complex parsing issues -- Need to understand output format -- Working with provider API details - ## Testing the Skill To verify the skill is working properly: + ```bash python scripts/smoke_test.py +python scripts/smoke_test.py --skip-api-test +python scripts/smoke_test.py --test-url "https://..." ``` -This tests configuration and optionally API connectivity. +The first form tests configuration and API connectivity. `--skip-api-test` checks configuration only. `--test-url` overrides the default sample document URL. diff --git a/skills/paddleocr-doc-parsing/references/output_schema.md b/skills/paddleocr-doc-parsing/references/output_schema.md index 3c51261efce..c71637c7efa 100644 --- a/skills/paddleocr-doc-parsing/references/output_schema.md +++ b/skills/paddleocr-doc-parsing/references/output_schema.md @@ -1,12 +1,12 @@ # PaddleOCR Document Parsing Output Schema -This document defines the output envelope returned by `vl_caller.py`. +This document defines the output envelope returned by `layout_caller.py`. -By default, `vl_caller.py` saves the JSON envelope to a unique file under the system temp directory and prints the absolute saved path to `stderr`. Use `--output` when you need a custom destination, or `--stdout` when you want to skip file saving and print JSON directly. +By default, `layout_caller.py` saves the JSON envelope to a unique file under the system temp directory and prints the absolute saved path to `stderr`. Use `--output` when you need a custom destination, or `--stdout` when you want to skip file saving and print JSON directly. 
## Output Envelope -`vl_caller.py` wraps provider response in a stable structure: +`layout_caller.py` wraps provider response in a stable structure: ```json { @@ -33,11 +33,11 @@ On error: ## Error Codes -| Code | Description | -|------|-------------| -| `INPUT_ERROR` | Invalid input (missing file, unsupported format) | -| `CONFIG_ERROR` | API not configured | -| `API_ERROR` | API call failed (auth, timeout, service error, or invalid response schema) | +| Code | Description | +| -------------- | --------------------------------------------------------------------------- | +| `INPUT_ERROR` | Invalid or unusable input (arguments, file source, format, types). | +| `CONFIG_ERROR` | Missing or invalid API / client configuration. | +| `API_ERROR` | Request or response handling failed (network, HTTP, body parsing, schema). | ## Raw Result Notes @@ -70,31 +70,36 @@ Raw fields may vary by model version and endpoint. ## Important Fields -- `result[n].prunedResult` +Paths are relative to the output envelope root. + +- `result.result.layoutParsingResults[n].prunedResult` Structured parsing data for page `n` (layout elements, locations, content, confidence, and related metadata). -- `result[n].markdown` +- `result.result.layoutParsingResults[n].markdown` Rendered output for page `n`. -- `result[n].markdown.text` +- `result.result.layoutParsingResults[n].markdown.text` Full page markdown text. ## Text Extraction -`vl_caller.py` extracts top-level `text` from `result.layoutParsingResults[n].markdown.text` and joins pages with `\n\n`. +`layout_caller.py` extracts top-level `text` from `result.result.layoutParsingResults[n].markdown.text` and joins pages with `\n\n`. 
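The extraction rule above can be sketched roughly as follows. The function name `join_page_text` and the sample envelope are hypothetical; only the field path `result.result.layoutParsingResults[n].markdown.text` and the `\n\n` separator come from the schema description, and the real `layout_caller.py` may treat missing fields differently.

```python
from typing import Any


def join_page_text(envelope: dict[str, Any]) -> str:
    """Rebuild top-level text by joining each page's markdown text with blank lines."""
    raw = envelope.get("result") or {}  # raw provider response
    pages = (raw.get("result") or {}).get("layoutParsingResults") or []
    texts = []
    for page in pages:
        markdown = page.get("markdown") or {}
        # Assumption: pages without markdown text contribute an empty string.
        texts.append(markdown.get("text") or "")
    return "\n\n".join(texts)


# Sample envelope with two pages (made-up content, for illustration only).
sample = {
    "ok": True,
    "result": {
        "result": {
            "layoutParsingResults": [
                {"markdown": {"text": "# Page 1"}},
                {"markdown": {"text": "Page 2 body"}},
            ]
        }
    },
    "error": None,
}
print(join_page_text(sample))
```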
## Command Examples ```bash # Parse document from URL (result auto-saves to the system temp directory) -python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "URL" --pretty +python scripts/layout_caller.py --file-url "URL" --pretty # Parse local file (result auto-saves to the system temp directory) -python scripts/paddleocr-doc-parsing/vl_caller.py --file-path "doc.pdf" --pretty +python scripts/layout_caller.py --file-path "doc.pdf" --pretty + +# Parse with explicit file type +python scripts/layout_caller.py --file-url "URL" --file-type 1 --pretty # Save result to a custom file path -python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "URL" --output "./result.json" --pretty +python scripts/layout_caller.py --file-url "URL" --output "./result.json" --pretty # Print JSON to stdout without saving a file -python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "URL" --stdout --pretty +python scripts/layout_caller.py --file-url "URL" --stdout --pretty ``` diff --git a/skills/paddleocr-doc-parsing/scripts/requirements-optimize.txt b/skills/paddleocr-doc-parsing/requirements-optimize.txt similarity index 54% rename from skills/paddleocr-doc-parsing/scripts/requirements-optimize.txt rename to skills/paddleocr-doc-parsing/requirements-optimize.txt index c9c03809a11..9bdbd6d1553 100644 --- a/skills/paddleocr-doc-parsing/scripts/requirements-optimize.txt +++ b/skills/paddleocr-doc-parsing/requirements-optimize.txt @@ -1,5 +1,4 @@ # File Optimization Dependencies -# Install with: pip install -r scripts/paddleocr-doc-parsing/requirements-optimize.txt # Image processing Pillow>=10.0.0 diff --git a/skills/paddleocr-doc-parsing/scripts/requirements.txt b/skills/paddleocr-doc-parsing/requirements.txt similarity index 100% rename from skills/paddleocr-doc-parsing/scripts/requirements.txt rename to skills/paddleocr-doc-parsing/requirements.txt diff --git a/skills/paddleocr-doc-parsing/scripts/vl_caller.py b/skills/paddleocr-doc-parsing/scripts/layout_caller.py 
similarity index 69% rename from skills/paddleocr-doc-parsing/scripts/vl_caller.py rename to skills/paddleocr-doc-parsing/scripts/layout_caller.py index 3ded13d7824..2d00540374d 100644 --- a/skills/paddleocr-doc-parsing/scripts/vl_caller.py +++ b/skills/paddleocr-doc-parsing/scripts/layout_caller.py @@ -1,27 +1,12 @@ -#!/usr/bin/env python3 -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ PaddleOCR Document Parser Simple CLI wrapper for the PaddleOCR document parsing library. 
Usage: - python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "URL" - python scripts/paddleocr-doc-parsing/vl_caller.py --file-path "document.pdf" - python scripts/paddleocr-doc-parsing/vl_caller.py --file-path "doc.pdf" --pretty + python scripts/layout_caller.py --file-url "URL" + python scripts/layout_caller.py --file-path "document.pdf" + python scripts/layout_caller.py --file-path "doc.pdf" --pretty """ import argparse @@ -32,6 +17,7 @@ import uuid from datetime import datetime from pathlib import Path +from typing import Optional # Fix Windows console encoding if sys.platform == "win32": @@ -44,7 +30,7 @@ from lib import parse_document -def get_default_output_path(): +def get_default_output_path() -> Path: """Build a unique result path under the OS temp directory.""" timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f") short_id = uuid.uuid4().hex[:8] @@ -57,36 +43,35 @@ def get_default_output_path(): ) -def resolve_output_path(output_arg): +def resolve_output_path(output_arg: Optional[str]) -> Path: if output_arg: return Path(output_arg).expanduser().resolve() return get_default_output_path().resolve() -def main(): +def main() -> None: parser = argparse.ArgumentParser( description="PaddleOCR Document Parsing - with layout analysis", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=""" Examples: # Parse document from URL (result is auto-saved to the system temp directory) - python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "https://example.com/document.pdf" + python scripts/layout_caller.py --file-url "https://example.com/document.pdf" # Parse local file (result is auto-saved to the system temp directory) - python scripts/paddleocr-doc-parsing/vl_caller.py --file-path "./invoice.pdf" + python scripts/layout_caller.py --file-path "./invoice.pdf" # Save result to a custom file path - python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "URL" --output "./result.json" --pretty + python scripts/layout_caller.py --file-url 
"URL" --output "./result.json" --pretty # Print JSON to stdout without saving a file - python scripts/paddleocr-doc-parsing/vl_caller.py --file-url "URL" --stdout --pretty + python scripts/layout_caller.py --file-url "URL" --stdout --pretty Configuration: Set environment variables: PADDLEOCR_DOC_PARSING_API_URL, PADDLEOCR_ACCESS_TOKEN Optional: PADDLEOCR_DOC_PARSING_TIMEOUT """, ) - # Input (mutually exclusive, required) input_group = parser.add_mutually_exclusive_group(required=True) input_group.add_argument("--file-url", help="URL to document (PDF, PNG, JPG, etc.)") input_group.add_argument("--file-path", help="Local file path") @@ -118,7 +103,8 @@ def main(): args = parser.parse_args() - # Parse document + # Unwarping and orientation classification are off to cover common scenarios + # with faster response times; visualize is off to reduce response payload. result = parse_document( file_path=args.file_path, file_url=args.file_url, @@ -128,7 +114,6 @@ def main(): visualize=False, ) - # Format output indent = 2 if args.pretty else None json_output = json.dumps(result, indent=indent, ensure_ascii=False) @@ -137,7 +122,6 @@ def main(): else: output_path = resolve_output_path(args.output) - # Save to file try: output_path.parent.mkdir(parents=True, exist_ok=True) output_path.write_text(json_output, encoding="utf-8") @@ -146,8 +130,7 @@ def main(): print(f"Error: Cannot write to {output_path}: {e}", file=sys.stderr) sys.exit(5) - # Exit code based on result - sys.exit(0 if result["ok"] else 1) + sys.exit(0 if result.get("ok") else 1) if __name__ == "__main__": diff --git a/skills/paddleocr-doc-parsing/scripts/lib.py b/skills/paddleocr-doc-parsing/scripts/lib.py index f10cc334656..bd88e54e1c9 100644 --- a/skills/paddleocr-doc-parsing/scripts/lib.py +++ b/skills/paddleocr-doc-parsing/scripts/lib.py @@ -1,17 +1,3 @@ -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ PaddleOCR Document Parsing Library @@ -20,10 +6,11 @@ import base64 import logging +import math import os from pathlib import Path from typing import Any, Optional -from urllib.parse import urlparse, unquote +from urllib.parse import unquote, urlparse import httpx @@ -45,17 +32,53 @@ # ============================================================================= -def _get_env(key: str, *fallback_keys: str) -> str: - """Get environment variable with fallback keys.""" - value = os.getenv(key, "").strip() - if value: - return value - for fallback in fallback_keys: - value = os.getenv(fallback, "").strip() - if value: - logger.debug(f"Using fallback env var: {fallback}") - return value - return "" +def _get_env(key: str) -> str: + """Get environment variable, defaulting to empty string with whitespace stripped.""" + return os.getenv(key, "").strip() + + +def _http_timeout_from_env(env_key: str, default_seconds: float) -> float: + """ + Read HTTP client timeout in seconds from the environment. + + Returns a positive finite float. If the variable is missing, empty, + unparsable, non-finite, or not greater than zero, logs a warning and uses the + default_seconds argument value. 
+ """ + raw = os.getenv(env_key) + if raw is None: + return float(default_seconds) + stripped = raw.strip() + if not stripped: + return float(default_seconds) + try: + timeout = float(stripped) + except (ValueError, TypeError): + logger.warning( + "Invalid %s value %r; using default %ss", + env_key, + raw, + default_seconds, + ) + return float(default_seconds) + if not math.isfinite(timeout) or timeout <= 0: + logger.warning( + "%s must be a finite number > 0 (got %r); using default %ss", + env_key, + raw, + default_seconds, + ) + return float(default_seconds) + return timeout + + +def _resolve_api_url(api_url: str, env_var: str) -> str: + """Require https; allow host-only values by prepending https://.""" + if api_url.startswith("http://"): + raise ValueError(f"{env_var} must use https://; http:// is not allowed.") + if not api_url.startswith("https://"): + return f"https://{api_url}" + return api_url def get_config() -> tuple[str, str]: @@ -66,7 +89,8 @@ def get_config() -> tuple[str, str]: tuple of (api_url, token) Raises: - ValueError: If not configured + ValueError: If required env vars are missing, API URL uses http://, + or URL path doesn't end with /layout-parsing """ api_url = _get_env("PADDLEOCR_DOC_PARSING_API_URL") token = _get_env("PADDLEOCR_ACCESS_TOKEN") @@ -80,9 +104,7 @@ def get_config() -> tuple[str, str]: f"PADDLEOCR_ACCESS_TOKEN not configured. 
Get your API at: {API_GUIDE_URL}" ) - # Normalize URL - if not api_url.startswith(("http://", "https://")): - api_url = f"https://{api_url}" + api_url = _resolve_api_url(api_url, "PADDLEOCR_DOC_PARSING_API_URL") api_path = urlparse(api_url).path.rstrip("/") if not api_path.endswith("/layout-parsing"): raise ValueError( @@ -118,6 +140,8 @@ def _load_file_as_base64(file_path: str) -> str: path = Path(file_path) if not path.exists(): raise FileNotFoundError(f"File not found: {file_path}") + if path.stat().st_size == 0: + raise ValueError(f"File is empty (0 bytes): {file_path}") return base64.b64encode(path.read_bytes()).decode("utf-8") @@ -127,7 +151,9 @@ def _load_file_as_base64(file_path: str) -> str: # ============================================================================= -def _make_api_request(api_url: str, token: str, params: dict) -> dict: +def _make_api_request( + api_url: str, token: str, params: dict[str, Any] +) -> dict[str, Any]: """ Make PaddleOCR document parsing API request. 
@@ -148,17 +174,24 @@ def _make_api_request(api_url: str, token: str, params: dict) -> dict: "Client-Platform": "official-skill", } - timeout = float(os.getenv("PADDLEOCR_DOC_PARSING_TIMEOUT", str(DEFAULT_TIMEOUT))) + timeout = _http_timeout_from_env( + "PADDLEOCR_DOC_PARSING_TIMEOUT", float(DEFAULT_TIMEOUT) + ) try: with httpx.Client(timeout=timeout) as client: - resp = client.post(api_url, json=params, headers=headers) + try: + resp = client.post(api_url, json=params, headers=headers) + except TypeError as e: + raise RuntimeError( + "Request parameters cannot be JSON-encoded; use only JSON-serializable " + f"option values ({e})" + ) from e except httpx.TimeoutException: raise RuntimeError(f"API request timed out after {timeout}s") except httpx.RequestError as e: raise RuntimeError(f"API request failed: {e}") - # Handle HTTP errors if resp.status_code != 200: error_detail = "" try: @@ -182,15 +215,19 @@ def _make_api_request(api_url: str, token: str, params: dict) -> dict: else: raise RuntimeError(f"API error ({resp.status_code}): {error_detail}") - # Parse response try: result = resp.json() except Exception: raise RuntimeError(f"Invalid JSON response: {resp.text[:200]}") - # Check API-level error + if not isinstance(result, dict): + raise RuntimeError( + f"Unexpected JSON shape (expected object): {resp.text[:200]}" + ) + if result.get("errorCode", 0) != 0: - raise RuntimeError(f"API error: {result.get('errorMsg', 'Unknown error')}") + msg = result.get("errorMsg", "Unknown error") + raise RuntimeError(f"API error: {msg}") return result @@ -204,14 +241,14 @@ def parse_document( file_path: Optional[str] = None, file_url: Optional[str] = None, file_type: Optional[int] = None, - **options, + **options: Any, ) -> dict[str, Any]: """ Parse document with PaddleOCR. 
Args: - file_path: Local file path - file_url: URL to file + file_path: Local file path (mutually exclusive with file_url) + file_url: URL to file (mutually exclusive with file_path) file_type: Optional file type override (0=PDF, 1=Image) **options: Additional API options @@ -230,13 +267,23 @@ def parse_document( "error": {"code": "...", "message": "..."} } """ - # Validate input - if not file_path and not file_url: + if file_path is not None and not isinstance(file_path, str): + return _error("INPUT_ERROR", "file_path must be a string or None") + if file_url is not None and not isinstance(file_url, str): + return _error("INPUT_ERROR", "file_url must be a string or None") + + fp = file_path.strip() if file_path else "" + fu = file_url.strip() if file_url else "" + if fp and fu: + return _error( + "INPUT_ERROR", + "Provide only one of file_path or file_url, not both", + ) + if not fp and not fu: return _error("INPUT_ERROR", "file_path or file_url required") if file_type is not None and file_type not in (FILE_TYPE_PDF, FILE_TYPE_IMAGE): return _error("INPUT_ERROR", "file_type must be 0 (PDF) or 1 (Image)") - # Get config try: api_url, token = get_config() except ValueError as e: @@ -245,33 +292,40 @@ def parse_document( # Build request params try: resolved_file_type: Optional[int] = None - if file_url: - params = {"file": file_url} - resolved_file_type = file_type + if fu: + params = {"file": fu} + if file_type is not None: + resolved_file_type = file_type + else: + try: + resolved_file_type = _detect_file_type(fu) + except ValueError: + resolved_file_type = None else: resolved_file_type = ( - file_type if file_type is not None else _detect_file_type(file_path) + file_type if file_type is not None else _detect_file_type(fp) ) params = { - "file": _load_file_as_base64(file_path), + "file": _load_file_as_base64(fp), } + params["visualize"] = ( + False # reduce response payload; callers can override via options + ) params.update(options) if resolved_file_type is not 
None: params["fileType"] = resolved_file_type - elif file_url: + else: params.pop("fileType", None) - except (ValueError, FileNotFoundError) as e: + except (ValueError, OSError, MemoryError) as e: return _error("INPUT_ERROR", str(e)) - # Call API try: result = _make_api_request(api_url, token, params) except RuntimeError as e: return _error("API_ERROR", str(e)) - # Extract text try: text = _extract_text(result) except ValueError as e: @@ -285,47 +339,45 @@ def parse_document( } -def _extract_text(result) -> str: +def _extract_text(result: dict[str, Any]) -> str: """Extract text from document parsing result.""" if not isinstance(result, dict): - raise ValueError( - "Invalid response schema: top-level response must be an object" - ) + raise ValueError("Invalid API response: top-level response must be an object") raw_result = result.get("result") if not isinstance(raw_result, dict): - raise ValueError("Invalid response schema: missing result object") + raise ValueError("Invalid API response: missing 'result' object") pages = raw_result.get("layoutParsingResults") if not isinstance(pages, list): raise ValueError( - "Invalid response schema: result.layoutParsingResults must be an array" + "Invalid API response: result.layoutParsingResults must be an array" ) texts = [] for i, page in enumerate(pages): if not isinstance(page, dict): raise ValueError( - f"Invalid response schema: result.layoutParsingResults[{i}] must be an object" + f"Invalid API response: result.layoutParsingResults[{i}] must be an object" ) markdown = page.get("markdown") if not isinstance(markdown, dict): raise ValueError( - f"Invalid response schema: result.layoutParsingResults[{i}].markdown must be an object" + f"Invalid API response: result.layoutParsingResults[{i}].markdown must be an object" ) text = markdown.get("text") if not isinstance(text, str): raise ValueError( - f"Invalid response schema: result.layoutParsingResults[{i}].markdown.text must be a string" + f"Invalid API response: 
result.layoutParsingResults[{i}].markdown.text must be a string" ) texts.append(text) return "\n\n".join(texts) -def _error(code: str, message: str) -> dict: +def _error(code: str, message: str) -> dict[str, Any]: """Create error response.""" return { "ok": False, diff --git a/skills/paddleocr-doc-parsing/scripts/optimize_file.py b/skills/paddleocr-doc-parsing/scripts/optimize_file.py index 3972722b785..b2c640c628e 100644 --- a/skills/paddleocr-doc-parsing/scripts/optimize_file.py +++ b/skills/paddleocr-doc-parsing/scripts/optimize_file.py @@ -1,19 +1,3 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ File Optimizer for PaddleOCR Document Parsing @@ -21,26 +5,46 @@ Supports image files only. 
Usage: - python scripts/optimize_file.py input.png output.png --quality 85 + python scripts/optimize_file.py input.png output.png + python scripts/optimize_file.py input.png output.jpg --quality 70 """ import argparse +import math import sys from pathlib import Path +DEFAULT_QUALITY = 85 +DEFAULT_TARGET_SIZE_MB = 20 +SUPPORTED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".webp") +SUPPORTED_FORMATS_DISPLAY = ", ".join( + e.lstrip(".").upper() for e in SUPPORTED_EXTENSIONS +) + + +def _arg_quality(value: str) -> int: + q = int(value) + if q < 1 or q > 100: + raise argparse.ArgumentTypeError("quality must be between 1 and 100 inclusive") + return q + + +def _arg_positive_mb(value: str) -> float: + v = float(value) + if not math.isfinite(v) or v <= 0: + raise argparse.ArgumentTypeError( + "target size must be a finite number greater than 0" + ) + return v + def optimize_image( - input_path: Path, output_path: Path, quality: int = 85, max_size_mb: float = 20 -): - """ - Optimize image file by reducing quality and/or resolution - - Args: - input_path: Input image path - output_path: Output image path - quality: JPEG quality (1-100, lower = smaller file) - max_size_mb: Target max size in MB - """ + input_path: Path, + output_path: Path, + quality: int = DEFAULT_QUALITY, + max_size_mb: float = DEFAULT_TARGET_SIZE_MB, +) -> None: + """Optimize image file by reducing quality and/or resolution.""" try: from PIL import Image except ImportError: @@ -48,18 +52,20 @@ def optimize_image( print("Install with: pip install Pillow") sys.exit(1) + if input_path.stat().st_size == 0: + raise ValueError("Input file is empty (0 bytes); nothing to optimize") + print(f"Optimizing image: {input_path}") - # Open image img = Image.open(input_path) original_size = input_path.stat().st_size / 1024 / 1024 print(f"Original size: {original_size:.2f}MB") print(f"Original dimensions: {img.size[0]}x{img.size[1]}") - # Convert RGBA to RGB if needed (for JPEG) - if img.mode in 
("RGBA", "LA", "P"): - # Create white background + is_jpeg = output_path.suffix.lower() in (".jpg", ".jpeg") + + if is_jpeg and img.mode in ("RGBA", "LA", "P"): background = Image.new("RGB", img.size, (255, 255, 255)) if img.mode == "P": img = img.convert("RGBA") @@ -68,36 +74,37 @@ def optimize_image( ) img = background - # Determine output format - output_format = output_path.suffix.lower() - if output_format in [".jpg", ".jpeg"]: - save_format = "JPEG" - elif output_format == ".png": - save_format = "PNG" - else: - save_format = "JPEG" - output_path = output_path.with_suffix(".jpg") + save_kwargs = {"optimize": True} + if is_jpeg or output_path.suffix.lower() == ".webp": + save_kwargs["quality"] = quality + + def _save(image): + image.save(output_path, **save_kwargs) + return output_path.stat().st_size / 1024 / 1024 - # Try saving with specified quality - img.save(output_path, format=save_format, quality=quality, optimize=True) - new_size = output_path.stat().st_size / 1024 / 1024 + new_size = _save(img) - # If still too large, reduce resolution scale_factor = 0.9 - while new_size > max_size_mb and scale_factor > 0.3: + while new_size > max_size_mb and scale_factor >= 0.4: new_width = int(img.size[0] * scale_factor) new_height = int(img.size[1] * scale_factor) + if new_width < 1 or new_height < 1: + print( + f"Cannot shrink to valid dimensions at scale {scale_factor:.2f} " + f"(would be {new_width}x{new_height}); stopping resize loop." 
+ ) + break print(f"Resizing to {new_width}x{new_height} (scale: {scale_factor:.2f})") resized = img.resize((new_width, new_height), Image.Resampling.LANCZOS) - resized.save(output_path, format=save_format, quality=quality, optimize=True) - new_size = output_path.stat().st_size / 1024 / 1024 + new_size = _save(resized) scale_factor -= 0.1 print(f"Optimized size: {new_size:.2f}MB") - print(f"Reduction: {((original_size - new_size) / original_size * 100):.1f}%") + pct = (original_size - new_size) / original_size * 100 + print(f"Reduction: {pct:.1f}%") if new_size > max_size_mb: print(f"\nWARNING: File still larger than {max_size_mb}MB") @@ -107,33 +114,36 @@ def optimize_image( print(" - Use a smaller or resized image") -def main(): +def main() -> None: parser = argparse.ArgumentParser( description="Optimize files for PaddleOCR document parsing", formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" + epilog=f""" Examples: - # Optimize image with default quality (85) + # Optimize image with default quality python scripts/optimize_file.py input.png output.png # Optimize with specific quality python scripts/optimize_file.py input.jpg output.jpg --quality 70 Supported formats: - - Images: PNG, JPG, JPEG, BMP, TIFF, TIF + - Images: {SUPPORTED_FORMATS_DISPLAY} """, ) parser.add_argument("input", help="Input file path") parser.add_argument("output", help="Output file path") parser.add_argument( - "--quality", type=int, default=85, help="JPEG quality (1-100, default: 85)" + "--quality", + type=_arg_quality, + default=DEFAULT_QUALITY, + help="JPEG/WebP quality (1-100, default: %(default)s)", ) parser.add_argument( "--target-size", - type=float, - default=20, - help="Target maximum size in MB (default: 20)", + type=_arg_positive_mb, + default=DEFAULT_TARGET_SIZE_MB, + help="Target maximum size in MB (default: %(default)s)", ) args = parser.parse_args() @@ -141,24 +151,26 @@ def main(): input_path = Path(args.input) output_path = Path(args.output) - # Validate 
input if not input_path.exists(): print(f"ERROR: Input file not found: {input_path}") sys.exit(1) - # Determine file type ext = input_path.suffix.lower() - if ext in [".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif"]: - optimize_image(input_path, output_path, args.quality, args.target_size) + if ext in SUPPORTED_EXTENSIONS: + try: + optimize_image(input_path, output_path, args.quality, args.target_size) + except Exception as e: + print(f"ERROR: {e}") + sys.exit(1) else: print(f"ERROR: Unsupported file format: {ext}") - print("Supported: PNG, JPG, JPEG, BMP, TIFF, TIF") + print(f"Supported: {SUPPORTED_FORMATS_DISPLAY}") sys.exit(1) print(f"\nOptimized file saved to: {output_path}") print("\nYou can now process with:") - print(f' python scripts/vl_caller.py --file-path "{output_path}" --pretty') + print(f' python scripts/layout_caller.py --file-path "{output_path}" --pretty') if __name__ == "__main__": diff --git a/skills/paddleocr-doc-parsing/scripts/smoke_test.py b/skills/paddleocr-doc-parsing/scripts/smoke_test.py index 04de25b3ea6..eeaec55c075 100644 --- a/skills/paddleocr-doc-parsing/scripts/smoke_test.py +++ b/skills/paddleocr-doc-parsing/scripts/smoke_test.py @@ -1,26 +1,12 @@ -#!/usr/bin/env python3 -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ Smoke Test for PaddleOCR Document Parsing Skill Verifies configuration and API connectivity. 
Usage: - python paddleocr-doc-parsing/scripts/smoke_test.py - python paddleocr-doc-parsing/scripts/smoke_test.py --skip-api-test + python scripts/smoke_test.py + python scripts/smoke_test.py --skip-api-test + python scripts/smoke_test.py --test-url "https://example.com/test.pdf" """ import argparse @@ -31,10 +17,12 @@ sys.path.insert(0, str(Path(__file__).parent)) -def print_config_guide(): +def print_config_guide() -> None: """Print friendly configuration guide.""" + from lib import DEFAULT_TIMEOUT + print( - """ + f""" ============================================================ HOW TO GET YOUR API CREDENTIALS ============================================================ @@ -48,14 +36,14 @@ def print_config_guide(): Set environment variables: export PADDLEOCR_DOC_PARSING_API_URL=https://your-api-url.paddleocr.com/layout-parsing export PADDLEOCR_ACCESS_TOKEN=your_token_here - export PADDLEOCR_DOC_PARSING_TIMEOUT=600 # optional + export PADDLEOCR_DOC_PARSING_TIMEOUT={DEFAULT_TIMEOUT} # optional ============================================================ """ ) -def main(): +def main() -> int: parser = argparse.ArgumentParser( description="PaddleOCR Document Parsing smoke test" ) @@ -71,7 +59,6 @@ def main(): print("PaddleOCR Document Parsing - Smoke Test") print("=" * 60) - # Check dependencies first print("\n[1/3] Checking dependencies...") try: @@ -80,11 +67,14 @@ def main(): print(f" + httpx: {httpx.__version__}") except ImportError: print(" X httpx not installed") - print("\nPlease install dependencies:") + print( + "\nPlease install dependencies (from the skill directory, one level above scripts/):" + ) + print(" pip install -r requirements.txt") + print("or at minimum:") print(" pip install httpx") return 1 - # Check configuration print("\n[2/3] Checking configuration...") from lib import get_config @@ -99,7 +89,6 @@ def main(): print_config_guide() return 1 - # Test API connectivity if args.skip_api_test: print("\n[3/3] Skipping API connectivity test 
(--skip-api-test)") print("\n" + "=" * 60) @@ -109,7 +98,6 @@ def main(): print("\n[3/3] Testing API connectivity...") - # Use provided test URL or default test_url = ( args.test_url or "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/pp_structure_v3_demo.png" @@ -120,7 +108,7 @@ def main(): result = parse_document(file_url=test_url) - if not result["ok"]: + if not result.get("ok"): error = result.get("error", {}) print(f"\n X API call failed: {error.get('message')}") if "Authentication" in error.get("message", ""): @@ -132,7 +120,6 @@ def main(): print(" + API call successful!") - # Show results text = result.get("text", "") if text: preview = text[:200].replace("\n", " ") @@ -144,8 +131,8 @@ def main(): print("Smoke Test PASSED") print("=" * 60) print("\nNext steps:") - print(' python paddleocr-doc-parsing/scripts/vl_caller.py --file-url "URL"') - print(' python paddleocr-doc-parsing/scripts/vl_caller.py --file-path "doc.pdf"') + print(' python scripts/layout_caller.py --file-url "URL"') + print(' python scripts/layout_caller.py --file-path "doc.pdf"') print( " Results are auto-saved to the system temp directory; the caller prints the saved path." ) diff --git a/skills/paddleocr-doc-parsing/scripts/split_pdf.py b/skills/paddleocr-doc-parsing/scripts/split_pdf.py index 4d1e7b135de..dc3b4b5dbdc 100644 --- a/skills/paddleocr-doc-parsing/scripts/split_pdf.py +++ b/skills/paddleocr-doc-parsing/scripts/split_pdf.py @@ -1,18 +1,3 @@ -#!/usr/bin/env python3 -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ Split a PDF by page ranges. @@ -66,7 +51,7 @@ def add_page(page_number: int): return selected_pages -def split_pdf(input_path: Path, output_path: Path, pages_spec: str): +def split_pdf(input_path: Path, output_path: Path, pages_spec: str) -> tuple[int, int]: """Create a new PDF containing selected pages from the input PDF.""" try: import pypdfium2 as pdfium @@ -117,7 +102,7 @@ def main() -> int: try: total_pages, kept_pages = split_pdf(input_path, output_path, args.pages) - except (ValueError, RuntimeError) as e: + except Exception as e: print(f"ERROR: {e}") return 1 diff --git a/skills/paddleocr-text-recognition/SKILL.md b/skills/paddleocr-text-recognition/SKILL.md index 04d7e6946b8..b5e8b729f67 100644 --- a/skills/paddleocr-text-recognition/SKILL.md +++ b/skills/paddleocr-text-recognition/SKILL.md @@ -1,13 +1,18 @@ --- name: paddleocr-text-recognition -description: Extracts text (with locations) from images and PDF documents using PaddleOCR. +description: >- + Use this skill whenever the user wants text extracted from images, photos, scans, screenshots, + or scanned PDFs. Returns exact machine-readable strings with line-level text and optional bbox + coordinates. Strong accuracy for CJK, small print, and handwritten text. + Trigger terms: OCR, 文字识别, 图片转文字, 截图识字, 提取图中文字, 扫描识字, 识字, 纯文字, + plain text extraction, 坐标, 检测框, bbox, bounding box, image to text, screenshot, photo scan, + recognize text. 
metadata: openclaw: requires: env: - PADDLEOCR_OCR_API_URL - PADDLEOCR_ACCESS_TOKEN - - PADDLEOCR_OCR_TIMEOUT bins: - python primaryEnv: PADDLEOCR_ACCESS_TOKEN @@ -19,53 +24,52 @@ metadata: ## When to Use This Skill -Invoke this skill in the following situations: +**Trigger keywords (routing)**: Bilingual trigger terms (Chinese and English) are listed in the YAML `description` above—use that field for discovery and routing. + +**Use this skill for**: + - Extract text from images (screenshots, photos, scans) -- Extract text from PDFs or document images -- Extract text and positions from structured documents (invoices, receipts, forms, tables) +- Extract text from PDFs or document images when the goal is **line/box-level text**, not recovering table grids, formulas, or full reading-order layout - Extract text from URLs or local files that point to images/PDFs -Do not use this skill in the following situations: -- Plain text files that can be read directly with the Read tool -- Code files or markdown documents +**Do not use for**: + +- Plain text files, code files, or markdown documents that can be read directly as text +- Documents with tables, formulas, charts, or complex layouts — use Document Parsing instead - Tasks that do not involve image-to-text conversion -## How to Use This Skill +## Installation -**⛔ MANDATORY RESTRICTIONS - DO NOT VIOLATE ⛔** +Install Python dependencies before using this skill. From the skill directory (`skills/paddleocr-text-recognition`): + +```bash +pip install -r requirements.txt +``` -1. **ONLY use PaddleOCR Text Recognition API** - Execute the script `python scripts/ocr_caller.py` -2. **NEVER read images directly** - Do NOT read images yourself -3. **NEVER offer alternatives** - Do NOT suggest "I can try to read it" or similar -4. **IF API fails** - Display the error message and STOP immediately -5. 
**NO fallback methods** - Do NOT attempt OCR any other way +## How to Use This Skill -If the script execution fails (API not configured, network error, etc.): -- Show the error message to the user -- Do NOT offer to help using your vision capabilities -- Do NOT ask "Would you like me to try reading it?" -- Simply stop and wait for user to fix the configuration +> **Working directory**: All `python scripts/...` commands below should be run from this skill's root directory (the directory containing this SKILL.md file). ### Basic Workflow 1. **Identify the input source**: - User provides URL: Use the `--file-url` parameter - User provides local file path: Use the `--file-path` parameter - - User uploads image: Save it first, then use `--file-path` - - **Input type note**: - - Supported file types depend on the model and endpoint configuration. - - Follow the official endpoint/API documentation for the exact supported formats. 2. **Execute OCR**: + ```bash python scripts/ocr_caller.py --file-url "URL provided by user" --pretty ``` + Or for local files: + ```bash python scripts/ocr_caller.py --file-path "file path" --pretty ``` + > **Performance note**: Parsing time scales with document complexity. Single-page images typically complete in 1-3 seconds; large PDFs (50+ pages) may take several minutes. Allow adequate time before assuming a timeout. 
+ **Default behavior: save raw JSON to a temp file**: - If `--output` is omitted, the script saves automatically under the system temp directory - Default path pattern: `/paddleocr/text-recognition/results/result__.json` @@ -87,148 +91,151 @@ If the script execution fails (API not configured, network error, etc.): - If the text is empty, the image may contain no text - In save mode, always tell the user the saved file path and that full raw JSON is available there -### IMPORTANT: Complete Output Display +### What to Do After Extraction + +Common next steps once you have the recognized text: -**CRITICAL**: Always display the COMPLETE recognized text to the user. Do NOT truncate or summarize the OCR results. +- **Save to file**: Write the `text` field to a `.txt` or `.md` file +- **Search the content**: Search the saved output file for keywords +- **Feed to another pipeline**: The `text` field is clean plain text, ready for downstream processing +- **Poor results**: See "Tips for Better Results" below before retrying -- The output JSON contains complete output, including full text in `text` field -- **You MUST display the entire `text` content to the user**, no matter how long it is -- Do NOT use phrases like "Here's a summary" or "The text begins with..." -- Do NOT truncate with "..." unless the text truly exceeds reasonable display limits -- The user expects to see ALL the recognized text, not a preview or excerpt +### Complete Output Display + +Always display the COMPLETE recognized text to the user. The user typically needs the full content for downstream use — truncation silently loses data they may not notice is missing. + +- Display the entire `text` field, no matter how long +- Do not use phrases like "Here's a summary" or "The text begins with..." +- Do not truncate with "..." unless the text truly exceeds reasonable display limits (>10,000 chars) + +**Example - Correct**: -**Correct approach**: ``` -I've extracted the text from the image. 
Here's the complete content: +User: "Extract the text from this image" +Agent: I've extracted the text from the image. Here's the complete content: [Display the entire text here] ``` -**Incorrect approach**: +**Example - Incorrect**: + ``` -I found some text in the image. Here's a preview: +User: "Extract the text from this image" +Agent: I found some text in the image. Here's a preview: "The quick brown fox..." (truncated) ``` +### Understanding the Output + +The script returns a JSON envelope with `ok`, `text`, `result`, and `error` fields. Use `text` for the recognized content; `result` contains the raw API response for debugging. + +For the full schema and field-level details, see `references/output_schema.md`. + +> Raw result location (default): the temp-file path printed by the script on stderr + ### Usage Examples -**Example 1: URL OCR**: +**Example 1: URL OCR** + ```bash python scripts/ocr_caller.py --file-url "https://example.com/invoice.jpg" --pretty ``` -**Example 2: Local File OCR**: +**Example 2: Local File OCR** + ```bash python scripts/ocr_caller.py --file-path "./document.pdf" --pretty ``` -**Example 3: OCR With Explicit File Type**: +**Example 3: OCR With Explicit File Type** + ```bash python scripts/ocr_caller.py --file-url "https://example.com/input" --file-type 1 --pretty ``` -**Example 4: Print JSON Without Saving**: +- `--file-type 0`: PDF +- `--file-type 1`: image +- If omitted, the type is auto-detected from the file extension. For local files, a recognized extension (`.pdf`, `.png`, `.jpg`, `.jpeg`, `.bmp`, `.tiff`, `.tif`, `.webp`) is required; otherwise pass `--file-type` explicitly. For URLs with unrecognized extensions, the service attempts inference. 
+ +**Example 4: Print JSON Without Saving** + ```bash python scripts/ocr_caller.py --file-url "https://example.com/input" --stdout --pretty ``` -### Understanding the Output +### First-Time Configuration + +**When API is not configured**, the script outputs: -The output JSON structure is as follows: ```json { - "ok": true, - "text": "All recognized text here...", - "result": { ... }, - "error": null + "ok": false, + "text": "", + "result": null, + "error": { + "code": "CONFIG_ERROR", + "message": "PADDLEOCR_OCR_API_URL not configured. Get your API at: https://paddleocr.com" + } } ``` -**Key fields**: -- `ok`: `true` for success, `false` for error -- `text`: Complete recognized text -- `result`: Raw API response (for debugging) -- `error`: Error details if `ok` is false - -> Raw result location (default): the temp-file path printed by the script on stderr - -### First-Time Configuration - -You can generally assume that the required environment variables have already been configured. Only when an OCR task fails should you analyze the error message to determine whether it is caused by a configuration issue. If it is indeed a configuration problem, you should notify the user to fix it. - -**When API is not configured**: - -The error will show: -``` -CONFIG_ERROR: PADDLEOCR_OCR_API_URL not configured. Get your API at: https://paddleocr.com -``` - **Configuration workflow**: -1. **Show the exact error message** to the user (including the URL). +1. **Show the exact error message** to the user. -2. **Guide the user to configure securely**: - - Recommend configuring through the host application's standard method (e.g., settings file, environment variable UI) rather than pasting credentials in chat. - - List the required environment variables: - ``` - - PADDLEOCR_OCR_API_URL - - PADDLEOCR_ACCESS_TOKEN - - Optional: PADDLEOCR_OCR_TIMEOUT - ``` +2. 
**Guide the user to obtain credentials**: Visit the [PaddleOCR website](https://www.paddleocr.com), click **API**, select the `PP-OCRv5` model, select the language, then copy the `API_URL` and `Token`. They map to these environment variables: + - `PADDLEOCR_OCR_API_URL` — full endpoint URL ending with `/ocr` + - `PADDLEOCR_ACCESS_TOKEN` — 40-character alphanumeric string -3. **If the user provides credentials in chat anyway** (accept any reasonable format), for example: - - `PADDLEOCR_OCR_API_URL=https://xxx.paddleocr.com/ocr, PADDLEOCR_ACCESS_TOKEN=abc123...` - - `Here's my API: https://xxx and token: abc123` - - Copy-pasted code format - - Any other reasonable format - - **Security note**: Warn the user that credentials shared in chat may be stored in conversation history. Recommend setting them through the host application's configuration instead when possible. + Optionally configure `PADDLEOCR_OCR_TIMEOUT` for request timeout. Recommend using the host application's standard configuration method rather than pasting credentials in chat. - Then parse and validate the values: - - Extract `PADDLEOCR_OCR_API_URL` (look for URLs with `paddleocr.com` or similar) - - Confirm `PADDLEOCR_OCR_API_URL` is a full endpoint ending with `/ocr` - - Extract `PADDLEOCR_ACCESS_TOKEN` (long alphanumeric string, usually 40+ chars) +3. **Apply credentials** — one of: + - **User configured via the host UI**: ask the user to confirm, then retry. + - **User pastes credentials in chat**: warn that they may be stored in conversation history, help the user persist them using the host's standard configuration method, then retry. -4. **Ask the user to confirm the environment is configured**. +### Error Handling -5. **Retry only after confirmation**: - - Once the user confirms the environment variables are available, retry the original OCR task +All errors return JSON with `ok: false`. Show the error message and stop — do not fall back to your own vision capabilities. 
Identify the issue from `error.code` and `error.message`: -### Error Handling +**Authentication failed (403)** — `error.message` contains "Authentication failed" -**Authentication failed**: -``` -API_ERROR: Authentication failed (403). Check your token. -``` - Token is invalid, reconfigure with correct credentials -**Quota exceeded**: -``` -API_ERROR: API rate limit exceeded (429) -``` +**Quota exceeded (429)** — `error.message` contains "API rate limit exceeded" + - Daily API quota exhausted, inform user to wait or upgrade +**Unsupported format** — `error.message` contains "Unsupported file format" + +- File format not supported, convert to PDF/PNG/JPG + **No text detected**: + - `text` field is empty - Image may be blank, corrupted, or contain no text ### Tips for Better Results -If recognition quality is poor, suggest: -- Check if the image is clear and contains text -- Provide a higher resolution image if possible +If recognition quality is poor: + +- **Low resolution**: Provide a higher resolution image (≥300 DPI works well for most printed text) +- **Noisy background**: A cleaner scan or screenshot typically yields better results than a phone photo +- **Check confidence**: The raw JSON (`result.result.ocrResults[n].prunedResult.rec_scores`) shows per-line confidence scores — low values identify uncertain regions worth reviewing ## Reference Documentation -For in-depth understanding of the OCR system, refer to: -- `references/output_schema.md` - Output format specification +- `references/output_schema.md` — Full output schema, field descriptions, and command examples > **Note**: Model version, capabilities, and supported file formats are determined by your API endpoint (`PADDLEOCR_OCR_API_URL`) and its official API documentation. ## Testing the Skill To verify the skill is working properly: + ```bash python scripts/smoke_test.py +python scripts/smoke_test.py --skip-api-test +python scripts/smoke_test.py --test-url "https://..." 
``` -This tests configuration and API connectivity. +The first form tests configuration and API connectivity. `--skip-api-test` checks configuration only. `--test-url` overrides the default sample image URL. diff --git a/skills/paddleocr-text-recognition/references/output_schema.md b/skills/paddleocr-text-recognition/references/output_schema.md index aac0e808473..e2f93880f32 100644 --- a/skills/paddleocr-text-recognition/references/output_schema.md +++ b/skills/paddleocr-text-recognition/references/output_schema.md @@ -33,11 +33,11 @@ On error: ## Error Codes -| Code | Description | -|------|-------------| -| `INPUT_ERROR` | Invalid input (missing file, unsupported format, invalid file type) | -| `CONFIG_ERROR` | API not configured | -| `API_ERROR` | API call failed (auth, timeout, service error, or invalid response schema) | +| Code | Description | +| -------------- | --------------------------------------------------------------------------- | +| `INPUT_ERROR` | Invalid or unusable input (arguments, file source, format, types). | +| `CONFIG_ERROR` | Missing or invalid API / client configuration. | +| `API_ERROR` | Request or response handling failed (network, HTTP, body parsing, schema). | ## Raw Result Notes @@ -76,34 +76,36 @@ Raw fields may vary by model version and endpoint. ## Stable Fields for Downstream Use -- `result[n].prunedResult` +Paths are relative to the output envelope root. + +- `result.result.ocrResults[n].prunedResult` Structured OCR data for page `n`. -- `result[n].prunedResult.rec_texts` +- `result.result.ocrResults[n].prunedResult.rec_texts` Recognized text lines for page `n`. -- `result[n].prunedResult.rec_scores` +- `result.result.ocrResults[n].prunedResult.rec_scores` Confidence scores for recognized text lines. ## Text Extraction -`ocr_caller.py` extracts top-level `text` from `result.ocrResults[n].prunedResult.rec_texts`, joins lines with `\n`, and joins pages with `\n\n`. 
+`ocr_caller.py` extracts top-level `text` from `result.result.ocrResults[n].prunedResult.rec_texts`, joins lines with `\n`, and joins pages with `\n\n`. ## Command Examples ```bash # OCR from URL (result auto-saves to the system temp directory) -python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --pretty +python scripts/ocr_caller.py --file-url "URL" --pretty # OCR local file (result auto-saves to the system temp directory) -python scripts/paddleocr-text-recognition/ocr_caller.py --file-path "doc.pdf" --pretty +python scripts/ocr_caller.py --file-path "doc.pdf" --pretty # OCR with explicit file type -python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --file-type 1 --pretty +python scripts/ocr_caller.py --file-url "URL" --file-type 1 --pretty # Save result to a custom file path -python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --output "./result.json" --pretty +python scripts/ocr_caller.py --file-url "URL" --output "./result.json" --pretty # Print JSON to stdout without saving a file -python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --stdout --pretty +python scripts/ocr_caller.py --file-url "URL" --stdout --pretty ``` diff --git a/skills/paddleocr-text-recognition/scripts/requirements.txt b/skills/paddleocr-text-recognition/requirements.txt similarity index 100% rename from skills/paddleocr-text-recognition/scripts/requirements.txt rename to skills/paddleocr-text-recognition/requirements.txt diff --git a/skills/paddleocr-text-recognition/scripts/lib.py b/skills/paddleocr-text-recognition/scripts/lib.py index e0754af42b6..b9be3409ba0 100644 --- a/skills/paddleocr-text-recognition/scripts/lib.py +++ b/skills/paddleocr-text-recognition/scripts/lib.py @@ -1,17 +1,3 @@ -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ PaddleOCR Text Recognition Library @@ -20,10 +6,11 @@ import base64 import logging +import math import os from pathlib import Path from typing import Any, Optional -from urllib.parse import urlparse, unquote +from urllib.parse import unquote, urlparse import httpx @@ -39,16 +26,61 @@ FILE_TYPE_IMAGE = 1 IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".webp") + # ============================================================================= # Environment # ============================================================================= def _get_env(key: str) -> str: - """Get environment variable.""" + """Get environment variable, defaulting to empty string with whitespace stripped.""" return os.getenv(key, "").strip() +def _http_timeout_from_env(env_key: str, default_seconds: float) -> float: + """ + Read HTTP client timeout in seconds from the environment. + + Returns a positive finite float. If the variable is missing, empty, + unparsable, non-finite, or not greater than zero, logs a warning and uses the + default_seconds argument value. 
+ """ + raw = os.getenv(env_key) + if raw is None: + return float(default_seconds) + stripped = raw.strip() + if not stripped: + return float(default_seconds) + try: + timeout = float(stripped) + except (ValueError, TypeError): + logger.warning( + "Invalid %s value %r; using default %ss", + env_key, + raw, + default_seconds, + ) + return float(default_seconds) + if not math.isfinite(timeout) or timeout <= 0: + logger.warning( + "%s must be a finite number > 0 (got %r); using default %ss", + env_key, + raw, + default_seconds, + ) + return float(default_seconds) + return timeout + + +def _resolve_api_url(api_url: str, env_var: str) -> str: + """Require https; allow host-only values by prepending https://.""" + if api_url.startswith("http://"): + raise ValueError(f"{env_var} must use https://; http:// is not allowed.") + if not api_url.startswith("https://"): + return f"https://{api_url}" + return api_url + + def get_config() -> tuple[str, str]: """ Get API URL and token from environment. @@ -57,7 +89,8 @@ def get_config() -> tuple[str, str]: tuple of (api_url, token) Raises: - ValueError: If not configured + ValueError: If required env vars are missing, API URL uses http://, + or URL path doesn't end with /ocr """ api_url = _get_env("PADDLEOCR_OCR_API_URL") token = _get_env("PADDLEOCR_ACCESS_TOKEN") @@ -71,9 +104,7 @@ def get_config() -> tuple[str, str]: f"PADDLEOCR_ACCESS_TOKEN not configured. 
Get your API at: {API_GUIDE_URL}" ) - # Normalize URL - if not api_url.startswith(("http://", "https://")): - api_url = f"https://{api_url}" + api_url = _resolve_api_url(api_url, "PADDLEOCR_OCR_API_URL") api_path = urlparse(api_url).path.rstrip("/") if not api_path.endswith("/ocr"): raise ValueError( @@ -108,6 +139,8 @@ def _load_file_as_base64(file_path: str) -> str: path = Path(file_path) if not path.exists(): raise FileNotFoundError(f"File not found: {file_path}") + if path.stat().st_size == 0: + raise ValueError(f"File is empty (0 bytes): {file_path}") return base64.b64encode(path.read_bytes()).decode("utf-8") @@ -116,7 +149,9 @@ def _load_file_as_base64(file_path: str) -> str: # ============================================================================= -def _make_api_request(api_url: str, token: str, params: dict) -> dict: +def _make_api_request( + api_url: str, token: str, params: dict[str, Any] +) -> dict[str, Any]: """ Make PaddleOCR API request. @@ -137,17 +172,22 @@ def _make_api_request(api_url: str, token: str, params: dict) -> dict: "Client-Platform": "official-skill", } - timeout = float(os.getenv("PADDLEOCR_OCR_TIMEOUT", str(DEFAULT_TIMEOUT))) + timeout = _http_timeout_from_env("PADDLEOCR_OCR_TIMEOUT", float(DEFAULT_TIMEOUT)) try: with httpx.Client(timeout=timeout) as client: - resp = client.post(api_url, json=params, headers=headers) + try: + resp = client.post(api_url, json=params, headers=headers) + except TypeError as e: + raise RuntimeError( + "Request parameters cannot be JSON-encoded; use only JSON-serializable " + f"option values ({e})" + ) from e except httpx.TimeoutException: raise RuntimeError(f"API request timed out after {timeout}s") except httpx.RequestError as e: raise RuntimeError(f"API request failed: {e}") - # Handle HTTP errors if resp.status_code != 200: error_detail = "" try: @@ -171,15 +211,19 @@ def _make_api_request(api_url: str, token: str, params: dict) -> dict: else: raise RuntimeError(f"API error ({resp.status_code}): 
{error_detail}") - # Parse response try: result = resp.json() except Exception: raise RuntimeError(f"Invalid JSON response: {resp.text[:200]}") - # Check API-level error + if not isinstance(result, dict): + raise RuntimeError( + f"Unexpected JSON shape (expected object): {resp.text[:200]}" + ) + if result.get("errorCode", 0) != 0: - raise RuntimeError(f"API error: {result.get('errorMsg', 'Unknown error')}") + msg = result.get("errorMsg", "Unknown error") + raise RuntimeError(f"API error: {msg}") return result @@ -193,14 +237,14 @@ def ocr( file_path: Optional[str] = None, file_url: Optional[str] = None, file_type: Optional[int] = None, - **options, + **options: Any, ) -> dict[str, Any]: """ Perform OCR on image or PDF. Args: - file_path: Local file path - file_url: URL to file + file_path: Local file path (mutually exclusive with file_url) + file_url: URL to file (mutually exclusive with file_path) file_type: Optional file type override (0=PDF, 1=Image) **options: Additional API options (passed directly to API) @@ -219,13 +263,23 @@ def ocr( "error": {"code": "...", "message": "..."} } """ - # Validate input - if not file_path and not file_url: + if file_path is not None and not isinstance(file_path, str): + return _error("INPUT_ERROR", "file_path must be a string or None") + if file_url is not None and not isinstance(file_url, str): + return _error("INPUT_ERROR", "file_url must be a string or None") + + fp = file_path.strip() if file_path else "" + fu = file_url.strip() if file_url else "" + if fp and fu: + return _error( + "INPUT_ERROR", + "Provide only one of file_path or file_url, not both", + ) + if not fp and not fu: return _error("INPUT_ERROR", "file_path or file_url required") if file_type is not None and file_type not in (FILE_TYPE_PDF, FILE_TYPE_IMAGE): return _error("INPUT_ERROR", "file_type must be 0 (PDF) or 1 (Image)") - # Get config try: api_url, token = get_config() except ValueError as e: @@ -234,39 +288,42 @@ def ocr( # Build request params try: 
resolved_file_type: Optional[int] = None - if file_url: - params = {"file": file_url} + if fu: + params = {"file": fu} if file_type is not None: resolved_file_type = file_type else: try: - resolved_file_type = _detect_file_type(file_url) + resolved_file_type = _detect_file_type(fu) except ValueError: resolved_file_type = None else: - params = {"file": _load_file_as_base64(file_path)} resolved_file_type = ( - file_type if file_type is not None else _detect_file_type(file_path) + file_type if file_type is not None else _detect_file_type(fp) ) + params = {"file": _load_file_as_base64(fp)} - params["visualize"] = False + params["visualize"] = ( + False # reduce response payload; callers can override via options + ) params.update(options) if resolved_file_type is not None: params["fileType"] = resolved_file_type else: params.pop("fileType", None) - except (ValueError, FileNotFoundError) as e: + except (ValueError, OSError, MemoryError) as e: return _error("INPUT_ERROR", str(e)) - # Call API try: result = _make_api_request(api_url, token, params) except RuntimeError as e: return _error("API_ERROR", str(e)) - # Extract text - text = _extract_text(result) + try: + text = _extract_text(result) + except ValueError as e: + return _error("API_ERROR", str(e)) return { "ok": True, @@ -276,30 +333,52 @@ def ocr( } -def _extract_text(result: dict) -> str: +def _extract_text(result: dict[str, Any]) -> str: """Extract text from OCR result.""" - # API returns {"errorCode": 0, "result": {"ocrResults": [{page}, ...]}} - raw_result = result.get("result", result) if isinstance(result, dict) else result - - # Extract ocrResults array from the result wrapper - if isinstance(raw_result, dict): - pages = raw_result.get("ocrResults", []) - elif isinstance(raw_result, list): - pages = raw_result - else: - pages = [] + if not isinstance(result, dict): + raise ValueError("Invalid API response: top-level response must be an object") + + raw_result = result.get("result") + if not 
isinstance(raw_result, dict): + raise ValueError("Invalid API response: missing 'result' object") + + pages = raw_result.get("ocrResults") + if not isinstance(pages, list): + raise ValueError("Invalid API response: result.ocrResults must be an array") all_text = [] - for item in pages: + for i, item in enumerate(pages): if not isinstance(item, dict): - continue - texts = item.get("prunedResult", {}).get("rec_texts", []) - if texts: - all_text.append("\n".join(texts)) + raise ValueError( + f"Invalid API response: result.ocrResults[{i}] must be an object" + ) + + pruned = item.get("prunedResult") + if not isinstance(pruned, dict): + raise ValueError( + f"Invalid API response: result.ocrResults[{i}].prunedResult must be an object" + ) + + texts = pruned.get("rec_texts", []) + if not isinstance(texts, list): + raise ValueError( + f"Invalid API response: result.ocrResults[{i}].prunedResult.rec_texts " + "must be an array" + ) + line_parts: list[str] = [] + for j, t in enumerate(texts): + if not isinstance(t, str): + raise ValueError( + f"Invalid API response: result.ocrResults[{i}].prunedResult." + f"rec_texts[{j}] must be a string" + ) + line_parts.append(t) + if line_parts: + all_text.append("\n".join(line_parts)) return "\n\n".join(all_text) -def _error(code: str, message: str) -> dict: +def _error(code: str, message: str) -> dict[str, Any]: """Create error response.""" return { "ok": False, diff --git a/skills/paddleocr-text-recognition/scripts/ocr_caller.py b/skills/paddleocr-text-recognition/scripts/ocr_caller.py index dcce53b1216..5e6a7aaf438 100644 --- a/skills/paddleocr-text-recognition/scripts/ocr_caller.py +++ b/skills/paddleocr-text-recognition/scripts/ocr_caller.py @@ -1,26 +1,11 @@ -#!/usr/bin/env python3 -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ PaddleOCR Text Recognition Caller Simple CLI wrapper for the PaddleOCR text recognition library. Usage: - python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" - python scripts/paddleocr-text-recognition/ocr_caller.py --file-path "image.png" --pretty + python scripts/ocr_caller.py --file-url "URL" + python scripts/ocr_caller.py --file-path "image.png" --pretty """ import argparse @@ -31,6 +16,7 @@ import uuid from datetime import datetime from pathlib import Path +from typing import Optional # Fix Windows console encoding if sys.platform == "win32": @@ -43,7 +29,7 @@ from lib import ocr -def get_default_output_path(): +def get_default_output_path() -> Path: """Build a unique result path under the OS temp directory.""" timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f") short_id = uuid.uuid4().hex[:8] @@ -56,39 +42,38 @@ def get_default_output_path(): ) -def resolve_output_path(output_arg): +def resolve_output_path(output_arg: Optional[str]) -> Path: if output_arg: return Path(output_arg).expanduser().resolve() return get_default_output_path().resolve() -def main(): +def main() -> None: parser = argparse.ArgumentParser( description="PaddleOCR Text Recognition - OCR images/PDFs", formatter_class=argparse.RawDescriptionHelpFormatter, epilog=""" Examples: # OCR from URL (result is auto-saved to the system temp directory) - python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "https://example.com/image.png" + python scripts/ocr_caller.py --file-url "https://example.com/image.png" # OCR local file (result is auto-saved 
to the system temp directory) - python scripts/paddleocr-text-recognition/ocr_caller.py --file-path "./document.pdf" --pretty + python scripts/ocr_caller.py --file-path "./document.pdf" --pretty # OCR with explicit file type override - python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --file-type 1 --pretty + python scripts/ocr_caller.py --file-url "URL" --file-type 1 --pretty # Save result to a custom file path - python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --output "./result.json" --pretty + python scripts/ocr_caller.py --file-url "URL" --output "./result.json" --pretty # Print JSON to stdout without saving a file - python scripts/paddleocr-text-recognition/ocr_caller.py --file-url "URL" --stdout --pretty + python scripts/ocr_caller.py --file-url "URL" --stdout --pretty Configuration: Set environment variables: PADDLEOCR_OCR_API_URL, PADDLEOCR_ACCESS_TOKEN Optional: PADDLEOCR_OCR_TIMEOUT """, ) - # Input (mutually exclusive, required) input_group = parser.add_mutually_exclusive_group(required=True) input_group.add_argument("--file-url", help="URL to image or PDF") input_group.add_argument("--file-path", help="Local path to image or PDF") @@ -118,17 +103,16 @@ def main(): args = parser.parse_args() - # Run OCR + # Unwarping and orientation classification are off to cover common scenarios + # with faster response times. 
result = ocr( file_path=args.file_path, file_url=args.file_url, file_type=args.file_type, useDocUnwarping=False, useDocOrientationClassify=False, - visualize=False, ) - # Format output indent = 2 if args.pretty else None json_output = json.dumps(result, indent=indent, ensure_ascii=False) @@ -137,7 +121,6 @@ def main(): else: output_path = resolve_output_path(args.output) - # Save to file try: output_path.parent.mkdir(parents=True, exist_ok=True) output_path.write_text(json_output, encoding="utf-8") @@ -146,8 +129,7 @@ def main(): print(f"Error: Cannot write to {output_path}: {e}", file=sys.stderr) sys.exit(5) - # Exit code based on result - sys.exit(0 if result["ok"] else 1) + sys.exit(0 if result.get("ok") else 1) if __name__ == "__main__": diff --git a/skills/paddleocr-text-recognition/scripts/smoke_test.py b/skills/paddleocr-text-recognition/scripts/smoke_test.py index 2b1b8ce1cfa..4f7f561f657 100644 --- a/skills/paddleocr-text-recognition/scripts/smoke_test.py +++ b/skills/paddleocr-text-recognition/scripts/smoke_test.py @@ -1,26 +1,12 @@ -#!/usr/bin/env python3 -# Copyright (c) 2025 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - """ Smoke Test for PaddleOCR Text Recognition Verifies configuration and API connectivity. 
Usage: - python paddleocr-text-recognition/scripts/smoke_test.py - python paddleocr-text-recognition/scripts/smoke_test.py --skip-api-test + python scripts/smoke_test.py + python scripts/smoke_test.py --skip-api-test + python scripts/smoke_test.py --test-url "https://example.com/test.png" """ import argparse @@ -31,16 +17,18 @@ sys.path.insert(0, str(Path(__file__).parent)) -def print_config_guide(): +def print_config_guide() -> None: """Print friendly configuration guide.""" + from lib import DEFAULT_TIMEOUT + print( - """ + f""" ============================================================ HOW TO GET YOUR API CREDENTIALS ============================================================ 1. Visit: https://paddleocr.com -2. Log in with your Baidu account +2. Sign in to your account 3. Open your model's API call example page 4. Copy the API URL from the example request 5. Copy your access token from the same API setup page @@ -48,14 +36,14 @@ def print_config_guide(): Set environment variables: export PADDLEOCR_OCR_API_URL=https://your-api-url.paddleocr.com/ocr export PADDLEOCR_ACCESS_TOKEN=your_token_here - export PADDLEOCR_OCR_TIMEOUT=120 # optional + export PADDLEOCR_OCR_TIMEOUT={DEFAULT_TIMEOUT} # optional ============================================================ """ ) -def main(): +def main() -> int: parser = argparse.ArgumentParser( description="PaddleOCR Text Recognition smoke test" ) @@ -71,7 +59,6 @@ def main(): print("PaddleOCR Text Recognition - Smoke Test") print("=" * 60) - # Check dependencies first print("\n[1/3] Checking dependencies...") try: @@ -80,11 +67,14 @@ def main(): print(f" + httpx: {httpx.__version__}") except ImportError: print(" X httpx not installed") - print("\nPlease install dependencies:") + print( + "\nPlease install dependencies (from the skill directory, one level above scripts/):" + ) + print(" pip install -r requirements.txt") + print("or at minimum:") print(" pip install httpx") return 1 - # Check configuration print("\n[2/3] 
Checking configuration...") from lib import get_config @@ -99,7 +89,6 @@ def main(): print_config_guide() return 1 - # Test API connectivity if args.skip_api_test: print("\n[3/3] Skipping API connectivity test (--skip-api-test)") print("\n" + "=" * 60) @@ -109,10 +98,9 @@ def main(): print("\n[3/3] Testing API connectivity...") - # Use provided test URL or default test_url = ( args.test_url - or "https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.7/doc/imgs/11.jpg" + or "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_ocr_001.png" ) print(f" Test image: {test_url}") @@ -120,7 +108,7 @@ def main(): result = ocr(file_url=test_url) - if not result["ok"]: + if not result.get("ok"): error = result.get("error", {}) print(f"\n X API call failed: {error.get('message')}") if "Authentication" in error.get("message", ""): @@ -130,7 +118,6 @@ def main(): print(" + API call successful!") - # Show results text = result.get("text", "") if text: preview = text[:200].replace("\n", " ") @@ -142,12 +129,8 @@ def main(): print("Smoke Test PASSED") print("=" * 60) print("\nNext steps:") - print( - ' python paddleocr-text-recognition/scripts/ocr_caller.py --file-url "URL" --pretty' - ) - print( - ' python paddleocr-text-recognition/scripts/ocr_caller.py --file-path "image.png" --pretty' - ) + print(' python scripts/ocr_caller.py --file-url "URL" --pretty') + print(' python scripts/ocr_caller.py --file-path "image.png" --pretty') print( " Results are auto-saved to the system temp directory; the caller prints the saved path." )
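For downstream use of a saved result, the sketch below reads the output envelope and lists recognized lines whose confidence falls below a threshold. It assumes the field paths given in `references/output_schema.md` (`result.result.ocrResults[n].prunedResult.rec_texts` and `rec_scores`). The helper name `low_confidence_lines` and the default threshold of 0.8 are illustrative assumptions, not part of the skill.

```python
from typing import Any


def low_confidence_lines(
    envelope: dict[str, Any], threshold: float = 0.8
) -> list[tuple[int, str, float]]:
    """Hypothetical helper: return (page index, text, score) below threshold."""
    if not envelope.get("ok"):
        # Error envelopes carry {"code": ..., "message": ...} under "error".
        error = envelope.get("error") or {}
        raise RuntimeError(f"{error.get('code')}: {error.get('message')}")

    flagged: list[tuple[int, str, float]] = []
    pages = envelope["result"]["result"]["ocrResults"]
    for page_index, page in enumerate(pages):
        pruned = page["prunedResult"]
        texts = pruned.get("rec_texts", [])
        scores = pruned.get("rec_scores", [])
        # Pair each recognized line with its confidence score.
        for text, score in zip(texts, scores):
            if score < threshold:
                flagged.append((page_index, text, score))
    return flagged


# Made-up sample envelope in the documented shape, for illustration only.
sample = {
    "ok": True,
    "text": "Hello\nwor1d",
    "result": {
        "errorCode": 0,
        "result": {
            "ocrResults": [
                {
                    "prunedResult": {
                        "rec_texts": ["Hello", "wor1d"],
                        "rec_scores": [0.99, 0.42],
                    }
                }
            ]
        },
    },
    "error": None,
}
flagged_lines = low_confidence_lines(sample)
```

To run this on a real result, load the saved file with `json.loads(Path(saved_path).read_text(encoding="utf-8"))`, where `saved_path` is the path the caller prints on stderr.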