Add option to extract line-based bounding boxes from pdfminer. by bsowell · Pull Request #874 · aryn-ai/sycamore

bsowell · 2024-10-04T16:22:53Z

We have been using pdfminer's layout detection to group text into boxes. This can cause issues, especially with table extraction, when the boxes don't line up with cells or what we detect with the DETR model. This change adds support for an object_type parameter to the PdfMinerExtractor that can be set to "boxes" (the current behavior), or "lines", which groups characters into lines, but does not group them further.

To avoid an explosion of options, we introduce a "text_extractor_options" dict as a parameter, and refactor the TextExtractor class hierarchy a bit to support it.
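For readers skimming the thread, here is a minimal sketch of how an options dict can be forwarded to an extractor constructor through a factory. It is not the code from this PR: the `EXTRACTORS` registry, the empty `TextExtractorBase` stand-in, and the validation inside `__init__` are assumptions made for illustration. Only the names `PdfMinerExtractor`, `get_text_extractor`, `object_type`, and the `"boxes"`/`"lines"` values come from the description and the diff.

```python
from typing import Any, Dict, Literal, Type


class TextExtractorBase:
    """Hypothetical stand-in for the shared extractor base class."""


class PdfMinerExtractor(TextExtractorBase):
    def __init__(self, object_type: Literal["boxes", "lines"] = "boxes"):
        # Validation is an assumption; the PR may handle bad values differently.
        if object_type not in ("boxes", "lines"):
            raise ValueError(f"Unknown object_type: {object_type}")
        self.object_type = object_type


# Hypothetical registry mapping extractor names to classes.
EXTRACTORS: Dict[str, Type[TextExtractorBase]] = {"pdfminer": PdfMinerExtractor}


def get_text_extractor(name: str, **options: Any) -> TextExtractorBase:
    """Look up an extractor class by name and pass the options through."""
    cls = EXTRACTORS.get(name)
    if cls is None:
        raise ValueError(f"Unknown text extractor: {name}")
    return cls(**options)


# One dict carries all extractor-specific settings, so callers do not
# need a new keyword argument for every new extractor option.
text_extractor_options = {"object_type": "lines"}
extractor = get_text_extractor("pdfminer", **text_extractor_options)
```

Omitting the dict keeps the existing behavior, since `object_type` defaults to `"boxes"`.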

karanataryn

Looks mostly good to me. You'll need to merge in main which might require some refactoring around #894 and #895. Do we want to add a regression test that shows improvement (text that exists now because of line-based pdfminer)?

karanataryn · 2024-10-10T17:06:09Z

+        # I was very surprised that this equality succeeded. I'm not sure in general we can expect
+        # exact text equality. I imagine in some cases the order might be different, but in this case
+        # they match, so I'm asserting here so we can catch regressions.
+        assert lines_text == objects_text


Does this mean that we have empty lines_elements since len(objects_elements) < len(lines_elements)? And if so, do we have a way to deal with that on the customer side?

Not sure what you mean by having empty lines_elements. Since each box is a grouping of lines, it is always the case that len(lines) >= len(objects) -- I should probably have called it boxes instead of objects here. I made the inequality strict to ensure that we are actually doing something different. The user should not notice. By the time it gets to them, either lines or boxes will be merged into elements.

Should have phrased it differently. I mean that since lines_text == objects_text and len(lines_elements) > len(objects_elements), we would have some elements that don't have any text in them. Shouldn't be a real performance issue, but I wanted to flag it for the customer side.

I think the common case is that all the elements have text, just that the lines_elements have less text. But again, these are the elements that come directly out of pdfminer. They get grouped into larger elements in _suplement_text before they get returned to the users. So users shouldn't see a difference.
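A toy illustration of the point above, using made-up strings rather than real pdfminer output: splitting boxes into their lines yields more elements with the same concatenated text, and none of the elements has to be empty. The variable names mirror the test (with "boxes" in place of "objects"), and joining with newlines is an assumption made to keep the example simple.

```python
# Each box is a group of text lines (made-up data for illustration).
boxes = [
    ["Name  Qty", "Apple  3"],
    ["Total  3"],
]

# "boxes" mode: one element per box, holding all of its lines.
boxes_elements = ["\n".join(lines) for lines in boxes]

# "lines" mode: one element per line, with no further grouping.
lines_elements = [line for lines in boxes for line in lines]

boxes_text = "\n".join(boxes_elements)
lines_text = "\n".join(lines_elements)

print(len(boxes_elements), len(lines_elements))
print(lines_text == boxes_text)
```

The same text is spread over more, smaller elements, which is why equal text and a strictly larger element count can hold together without any empty elements.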

karanataryn · 2024-10-10T21:25:17Z

+class PdfMinerExtractor(TextExtractorBase):
    @requires_modules(["pdfminer", "pdfminer.utils"], extra="local-inference")
-    def __init__(self):
+    def __init__(self, object_type: Literal["boxes", "lines"] = "boxes"):


Are we sure we want the default to always stay as boxes? Would we want to enable lines if we detect a table element in the DETR output?

I imagine eventually we will want to change the default to lines. I just started with the current behavior until we are confident we won't see any regressions in either perf or quality.

Makes sense. Could you add a TODO to get back to this?

Sure. Will do.

karanataryn

LGTM. Few suggestions but not blocking.

karanataryn · 2024-10-11T18:18:49Z

-        if not model_cls:
-            raise ValueError(f"Unknown OCR Model: {ocr_model}")
-        ocr_model_obj = model_cls()
+        ocr_model_obj = cast(OcrModel, get_text_extractor(ocr_model, **text_extraction_options))


Do we want to assert instead of cast? We can't handle a non-OcrModel.

Yeah, let me fix this.
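A short sketch of the difference being discussed, with hypothetical stand-in classes (the real class relationships in the PR may differ): `typing.cast` only informs the type checker and performs no runtime check, while an `isinstance` assertion fails immediately when the factory returns something that is not an `OcrModel`.

```python
from typing import cast


class TextExtractorBase:
    """Hypothetical stand-in for the extractor base class."""


class OcrModel(TextExtractorBase):
    """Hypothetical stand-in; assumed here to derive from the base class."""


class PdfMinerExtractor(TextExtractorBase):
    """Hypothetical stand-in for a non-OCR extractor."""


def as_ocr_model_cast(extractor: TextExtractorBase) -> OcrModel:
    # cast() returns its argument unchanged; a wrong type passes silently.
    return cast(OcrModel, extractor)


def as_ocr_model_checked(extractor: TextExtractorBase) -> OcrModel:
    # The assertion fails at the call site instead of later, deep in OCR code.
    assert isinstance(extractor, OcrModel), f"Expected OcrModel, got {type(extractor).__name__}"
    return extractor
```

Note that `assert` statements are skipped when Python runs with `-O`, so an explicit `raise` is the stricter alternative if the check must always run.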

bsowell requested a review from karanataryn October 4, 2024 16:22

bsowell force-pushed the pdfminer_lines branch from 90851ce to 0d825cd Compare October 9, 2024 22:58

karanataryn reviewed Oct 10, 2024

karanataryn approved these changes Oct 11, 2024

karanataryn reviewed Oct 11, 2024

Comment thread lib/sycamore/sycamore/transforms/detr_partitioner.py Outdated

bsowell added 5 commits October 11, 2024 14:23

Rework TextExtractor factory into get_text_extractor.

eeb105c

Remove _get_char_stream.

d67c5da

Will add back in later commit when I add the format detection stuff.

Few minor tweaks.

bb4449c

f-string

31664e7

bsowell force-pushed the pdfminer_lines branch from 5940a6d to 31664e7 Compare October 11, 2024 21:29

bsowell merged commit 9b0ddc6 into main Oct 11, 2024

bsowell deleted the pdfminer_lines branch October 11, 2024 22:23
