PDF2VQA refactor by fatty-belly · Pull Request #443 · OpenDCAI/DataFlow

fatty-belly · 2026-01-16T12:19:02Z

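Substantially refactored the PDF2VQA pipeline to reuse existing DataFlow operators.

Bug fixes:

When no questions are recognized, the pipeline now writes an empty file instead of raising an error.
Improved the chapter-matching logic for question-answer pairs.
Fixed the file paths in the pipeline example.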

haolpku · 2026-01-20T04:51:28Z

        # Save the updated dataframe to the output file
        output_file = storage.write(dataframe)
        return output_key
+class ChunkedPromptedGenerator(OperatorABC):


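I think ChunkedPromptedGenerator could get its own file, chunked_prompted_generator. We generally keep one class per file, with the class name and file name nearly identical (snake_case for the file, CamelCase for the class).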

Moved to a separate file.

haolpku · 2026-01-20T04:54:54Z

+        else:
+            mid = len(text) // 2
+            left, right = text[:mid], text[mid:]
+            return self._split_recursive(left) + self._split_recursive(right)


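Why recursive bisection here? Could we just slice directly by chunk_len? Also, please document what chunk_len means here: is it a character count or a token count? Han Chaoyang (韩朝阳)'s operator computes it as a token count using the Qwen tokenizer.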

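Bisection keeps the number of tokenizer calls down. Otherwise we would have to advance one character at a time and call the tokenizer at each step to measure the length.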

Currently chunk_len is a token count. Any tokenizer that supports the len(self.enc.encode(text)) pattern works, for example the commonly used tiktoken or AutoTokenizer. The current default is tiktoken.get_encoding("cl100k_base").
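
To make the bisection idea concrete, here is a minimal standalone sketch. It is not the code from this PR: `split_recursive` is written as a free function rather than the `_split_recursive` method, and `CharEncoder` is a hypothetical stand-in tokenizer (one token per character) so the example runs without extra dependencies. The only assumption about the tokenizer is that it exposes `encode(text)` returning a sequence, as described above.

```python
from typing import List


class CharEncoder:
    """Hypothetical stand-in tokenizer: treats every character as one token."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


def split_recursive(text: str, enc, chunk_len: int) -> List[str]:
    """Split text into pieces whose token count is at most chunk_len.

    Each level calls the tokenizer once per piece and halves the piece by
    character position if it is still too long.
    """
    if chunk_len < 1:
        raise ValueError("chunk_len must be at least 1")
    # Stop when the piece fits, or when it can no longer be halved.
    if len(text) <= 1 or len(enc.encode(text)) <= chunk_len:
        return [text]
    mid = len(text) // 2
    left, right = text[:mid], text[mid:]
    return split_recursive(left, enc, chunk_len) + split_recursive(right, enc, chunk_len)


if __name__ == "__main__":
    print(split_recursive("abcdefghij", CharEncoder(), 3))
    # -> ['ab', 'cde', 'fg', 'hij']
```

Swapping `CharEncoder()` for an object such as `tiktoken.get_encoding("cl100k_base")` should work the same way, since only `encode` is used. Note that halving by character position means chunks can be noticeably shorter than chunk_len, which is the trade-off for fewer tokenizer calls.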

haolpku · 2026-01-20T04:58:37Z

+        self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False)
    def forward(self):
-        # Single operator: covers preprocessing, QA extraction, and post-processing
+        self.mineru_executor.run(


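I understand this part now. Still, it would be best to add a comment explaining why this is done twice: the question and the answer go through the same operations. Try to keep it as user-friendly as possible, since users do not know what this new operator/pipeline is.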

Comment added.

haolpku · 2026-01-20T04:59:06Z

Finally, I suggest updating the pipeline and operator docs soon.

haolpku · 2026-01-20T06:54:18Z

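This operator converts formats. If it is going to exist as an operator, it should also follow our operator naming convention: for example, name the file mineru_to_llm_formatter and give the class the same name in CamelCase.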

unknown and others added 3 commits January 16, 2026 19:31

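[pdf2vqa] When no questions are recognized, an empty file is now written instead of raising an error. Also improved the chapter-matching logic for question-answer pairs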

b8e5d21

[pdf2vqa] If the mineru results already exist, that step can now be skipped and the LLM run directly. Fixed the example file path

31f5841

[PDF2VQA] Major refactor that reuses existing operators

13f6814

fatty-belly changed the title ~~Some fixes for Pdf2vqa~~ PDF2VQA refactor Jan 19, 2026

haolpku reviewed Jan 20, 2026

[pdf2vqa] Moved chunked_prompted_generator into its own file. Added some comments

8427fbd

haolpku reviewed Jan 20, 2026

[pdf2vqa] One operator per file

49b8a82

fatty-belly merged commit d250586 into OpenDCAI:main Jan 20, 2026
9 checks passed

fatty-belly deleted the pdf2vqa_dev branch January 20, 2026 08:10

fatty-belly merged 5 commits into OpenDCAI:main from fatty-belly:pdf2vqa_dev
