Merged
haolpku
reviewed
Jan 20, 2026
| # Save the updated dataframe to the output file | ||
| output_file = storage.write(dataframe) | ||
| return output_key | ||
| class ChunkedPromptedGenerator(OperatorABC): |
Contributor
I think ChunkedPromptedGenerator could get its own file, chunked_prompted_generator. We generally put just one class in each file, with the class name and file name nearly identical (snake_case for the file, CamelCase for the class).
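As a rough illustration of the naming convention in this comment, a helper like the one below could derive the expected file name from a class name. The function name class_to_module_name is hypothetical and not part of the repository, and the sketch assumes plain CamelCase names without acronym runs.

```python
import re


def class_to_module_name(class_name: str) -> str:
    """Convert a CamelCase class name to a snake_case module name.

    Hypothetical helper for illustration only; not part of the project.
    """
    # Insert an underscore before every uppercase letter except the first,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


print(class_to_module_name("ChunkedPromptedGenerator"))
```

Names containing acronyms are split letter by letter under this simple rule, so they would need special handling.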
| else: | ||
| mid = len(text) // 2 | ||
| left, right = text[:mid], text[mid:] | ||
| return self._split_recursive(left) + self._split_recursive(right) |
Contributor
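Why binary recursion here? Could we just slice directly by chunk_len? Also, please state what chunk_len means here: a character count or a token count. Han Chaoyang's operator computes it as a token count using the Qwen tokenizer.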
Contributor
Author
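- Binary splitting keeps the number of tokenizer calls down; otherwise we would have to advance one character at a time and call the tokenizer to measure the length at each step.
- chunk_len is currently a token count. Any encoder that supports the len(self.enc.encode(text)) pattern works, for example the commonly used tiktoken or AutoTokenizer. The current default is tiktoken.get_encoding("cl100k_base").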
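The binary splitting described in this reply can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: the class names WhitespaceEncoder and RecursiveSplitter are hypothetical, and the whitespace-based encoder is only a stand-in for any object exposing encode(text), such as a tiktoken encoding.

```python
class WhitespaceEncoder:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text: str) -> list:
        return text.split()


class RecursiveSplitter:
    """Split text into pieces of at most chunk_len tokens by recursive halving."""

    def __init__(self, enc, chunk_len: int):
        self.enc = enc              # any object with an encode(text) method
        self.chunk_len = chunk_len  # limit measured in tokens, not characters

    def split(self, text: str) -> list:
        # One tokenizer call per recursion node to measure the token count.
        if len(self.enc.encode(text)) <= self.chunk_len:
            return [text]
        # Guard (an addition in this sketch): a string of length 0 or 1
        # cannot be halved further, so stop instead of recursing forever.
        if len(text) <= 1:
            return [text]
        # Halve by character position; token counts are rechecked on each half.
        mid = len(text) // 2
        left, right = text[:mid], text[mid:]
        return self.split(left) + self.split(right)


splitter = RecursiveSplitter(WhitespaceEncoder(), chunk_len=3)
chunks = splitter.split("a b c d e f g h")
print(chunks)
```

Slicing directly by chunk_len would only be equivalent if chunk_len were a character count. Because the limit is in tokens, each candidate piece has to be measured with the tokenizer, and halving reduces how many measurements are needed.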
| self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False) | ||
| def forward(self): | ||
| # Single operator: covers all preprocessing, QA extraction, and post-processing | ||
| self.mineru_executor.run( |
Contributor
I understand this part now. Still, it would be best to add a comment explaining why it runs twice: question and answer go through the same operation. Try to keep it user-friendly, since users won't know what this new operator/pipeline is.
Contributor
Finally, I suggest updating the pipeline and operator docs soon as a follow-up.
haolpku
reviewed
Jan 20, 2026
Contributor
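This operator converts formats. If it is going to exist as an operator, it should follow our operator naming rules too: for example, name the file mineru_to_llm_formatter and give the class the same name in CamelCase.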
Substantially refactored the PDF2VQA pipeline to reuse existing dataflow operators.
Bug fixes: