Refactor Process Batch#1357
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull Request Overview
This PR refactors the batch processing methods by splitting the original process_batch functionality into dedicated sub functions to improve modularity and reusability.
- Removed the original process_batch implementation
- Updated process_batch_inference to support optional extractor_inputs and text_extractor parameters
- Added a new process_batch that orchestrates inference and extraction based on provided flags
alexaryn
reviewed
Jun 19, 2025
HenryL27
reviewed
Jun 19, 2025
Comment on lines
+497
to
+505
| extracted_pages = [] | ||
| with LogTime("text_extraction"): | ||
| for i, page_data in enumerate(extractor_inputs): | ||
| if isinstance(page_data, dict): | ||
| width, height = page_data.get("dimensions") | ||
| page = text_extractor.parse_output(page_data.get("data"), width, height) | ||
| else: | ||
| page = text_extractor.extract_page(page_data) | ||
| extracted_pages.append(page) |
Collaborator
There was a problem hiding this comment.
looks like this is yer pdfminer caller? If you factor it to a method then it gets easier for the perf people to run it in a process pool or something right
Contributor
|
Unfortunately it’s not that simple. It should be, but there is a separate path for calling pdfminer in managed-server.
________________________________
From: Henry Lindeman ***@***.***>
Sent: Thursday, June 19, 2025 11:17:37 AM
To: aryn-ai/sycamore ***@***.***>
Cc: Benjamin Sowell ***@***.***>; Review requested ***@***.***>
Subject: Re: [aryn-ai/sycamore] Refactor Process Batch (PR #1357)
@HenryL27 commented on this pull request.
________________________________
In lib/sycamore/sycamore/transforms/detr_partitioner.py<#1357 (comment)>:
+ extracted_pages = []
+ with LogTime("text_extraction"):
+ for i, page_data in enumerate(extractor_inputs):
+ if isinstance(page_data, dict):
+ width, height = page_data.get("dimensions")
+ page = text_extractor.parse_output(page_data.get("data"), width, height)
+ else:
+ page = text_extractor.extract_page(page_data)
+ extracted_pages.append(page)
looks like this is yer pdfminer caller? If you factor it to a method then it gets easier for the perf people to run it in a process pool or something right
—
Reply to this email directly, view it on GitHub<#1357 (review)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAJT7R3723PNO4GTSYZXXWD3EL5EDAVCNFSM6AAAAAB7ROTPNCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDSNBTG43TKOJVG4>.
You are receiving this because your review was requested.Message ID: ***@***.***>
|
HenryL27
approved these changes
Jun 19, 2025
Collaborator
HenryL27
left a comment
There was a problem hiding this comment.
lgtm. not sure why you need it for the table thing but I'm sure all will be revealed shortly and it seems like a good refactor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Refactor to ensure it uses sub functions when possible.