Add VLMTableStructureExtractor for table structure extraction. #1304
This calls an LLM to determine the cells of a table.
Pull Request Overview
This PR introduces the VLMTableStructureExtractor to enhance table structure extraction via an LLM, along with related utility functions and tests.
- Added a new _crop_bbox helper function and the VLMTableStructureExtractor class for processing table images.
- Extended unit and integration tests to validate new LLM-based table extraction, and updated deserialization methods in Gemini and Anthropic implementations.
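The PR does not show the body of _crop_bbox here, so the following is only a minimal sketch of what such a helper might do, assuming the table's bounding box is stored as relative (0 to 1) coordinates and the page image size is known in pixels. The name `crop_box_pixels` and the `padding` parameter are hypothetical and are not taken from the PR.

```python
from typing import Tuple


def crop_box_pixels(
    rel_bbox: Tuple[float, float, float, float],
    width: int,
    height: int,
    padding: int = 0,
) -> Tuple[int, int, int, int]:
    """Convert a relative (x1, y1, x2, y2) box into a pixel crop box.

    Hypothetical helper: the real _crop_bbox in the PR may differ.
    The result is clamped so that it never extends past the image edges.
    """
    x1, y1, x2, y2 = rel_bbox
    left = max(0, int(x1 * width) - padding)
    top = max(0, int(y1 * height) - padding)
    right = min(width, int(x2 * width) + padding)
    bottom = min(height, int(y2 * height) + padding)
    return (left, top, right, bottom)


# Example: a table covering the lower-middle part of a 200x100 page image.
box = crop_box_pixels((0.25, 0.5, 0.75, 1.0), width=200, height=100)
padded = crop_box_pixels((0.25, 0.5, 0.75, 1.0), width=200, height=100, padding=10)
```

A (left, upper, right, lower) tuple of this shape is what Pillow's `Image.crop` accepts, so the cropped table image could then be sent to the model.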
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lib/sycamore/sycamore/transforms/table_structure/extract.py | Added _crop_bbox and VLMTableStructureExtractor to extract table structure using an LLM. |
| lib/sycamore/sycamore/tests/unit/llms/test_llms.py | Added a new test case for ensuring proper Gemini pickling. |
| lib/sycamore/sycamore/tests/integration/transforms/test_table_extraction.py | Added integration tests for table extraction using various LLMs. |
| lib/sycamore/sycamore/llms/gemini.py | Updated __reduce__ to use a separate deserializer function. |
| lib/sycamore/sycamore/llms/anthropic.py | Updated __reduce__ to use a separate deserializer function. |
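The last two rows describe a common pickling pattern. The sketch below illustrates it with a stand-in class, assuming the goal is to rebuild the client from its constructor arguments instead of pickling its internal state. `FakeLLMClient` and `_deserialize_client` are hypothetical names and do not reflect the actual Gemini or Anthropic classes in the PR.

```python
import pickle


def _deserialize_client(model_name, api_key):
    """Module-level factory; pickle stores a reference to it by name."""
    return FakeLLMClient(model_name, api_key)


class FakeLLMClient:
    """Hypothetical stand-in for an LLM wrapper that holds an unpicklable client."""

    def __init__(self, model_name, api_key=None):
        self.model_name = model_name
        self.api_key = api_key
        # Lambdas cannot be pickled, so default pickling of this object would fail.
        self._connection = lambda: None

    def __reduce__(self):
        # Tell pickle to call _deserialize_client(model_name, api_key) on load,
        # so the unpicklable _connection attribute is never serialized.
        return _deserialize_client, (self.model_name, self.api_key)


restored = pickle.loads(pickle.dumps(FakeLLMClient("example-model", "secret")))
```

Keeping the deserializer as a plain module-level function means pickle only needs to import it by name on the receiving side, which matters when objects are shipped to worker processes.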
"""Table structure extractor that uses a VLM model to extract the table structure."""

EXTRACT_TABLE_STRUCTURE_PROMPT = """You are given an image of a table from a document. Please convert this table into HTML. Be sure to include the table header and all rows. Use 'colspan' and 'rowspan' in the output to indicate merged cells. Return the HTML as a string. Do not include any other text in the response.
+"""
There appears to be an extra '+' in the closing triple quotes for the prompt string in VLMTableStructureExtractor. Removing the extraneous '+' will prevent potential syntax errors.
+"""
"""
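Because the prompt asks the model to return HTML with 'colspan' and 'rowspan', the response has to be turned back into table cells. The PR's own conversion code is not shown here, so the following is a rough sketch using only the standard library. The `Cell` dataclass and `parse_table_html` function are hypothetical, and the sketch assumes well-formed output with numeric span attributes.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple


@dataclass
class Cell:
    text: str
    row: int
    col: int
    rowspan: int = 1
    colspan: int = 1
    is_header: bool = False


class _TableParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.cells: List[Cell] = []
        self._row = -1
        self._col = 0
        self._occupied: Set[Tuple[int, int]] = set()
        self._current: Optional[Cell] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            attr = dict(attrs)
            rowspan = int(attr.get("rowspan") or 1)
            colspan = int(attr.get("colspan") or 1)
            # Skip grid slots already claimed by a rowspan from an earlier row.
            while (self._row, self._col) in self._occupied:
                self._col += 1
            self._current = Cell("", self._row, self._col, rowspan, colspan, tag == "th")
            for r in range(self._row, self._row + rowspan):
                for c in range(self._col, self._col + colspan):
                    self._occupied.add((r, c))
            self._col += colspan

    def handle_data(self, data):
        # Only collect text that appears inside a cell.
        if self._current is not None:
            self._current.text += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._current is not None:
            self._current.text = self._current.text.strip()
            self.cells.append(self._current)
            self._current = None


def parse_table_html(html: str) -> List[Cell]:
    """Parse an HTML table string into cells with grid positions and spans."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    return parser.cells


cells = parse_table_html(
    "<table><tr><th colspan='2'>A</th><th rowspan='2'>B</th></tr>"
    "<tr><td>1</td><td>2</td></tr></table>"
)
```

Model output is untrusted, so a production version would likely need to tolerate missing tags, non-numeric span values, and extra text around the table.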
new_elem = extractor.extract(element=basic_table_element, doc_image=basic_table_image)
assert new_elem.table is not None

print(new_elem.table.to_html())
[nitpick] Consider removing or replacing the print statement used for debugging in the test to maintain clean test outputs.
print(new_elem.table.to_html())
logging.debug(new_elem.table.to_html())
# Convert cell bounding boxes to be relative to the original image.
for cell in table.cells:
    if cell.bbox is None:
        continue
    cell.bbox.translate_self(crop_box[0], crop_box[1]).to_relative_self(width, height)
shouldn't all the cell bboxes be null?
Lol, yes. I guess I was just on auto-pilot. I'll go ahead and remove this. Part of me just wanted to leave it as a defense mechanism, but I can't think of a way that an LLM could hallucinate bounding boxes in a way we would interpret.
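For context, the loop under discussion maps a cell box from cropped-image pixels back to page-relative coordinates. The sketch below shows that arithmetic with a hypothetical `Box` type, not the library's own bounding box class, assuming `crop_origin` is the top-left pixel of the crop within the page image. In the VLM path every cell bbox is expected to be None, so the loop would skip every cell, which is why removing it is safe.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Box:
    """Hypothetical axis-aligned box; stands in for the library's bbox type."""

    x1: float
    y1: float
    x2: float
    y2: float


def crop_to_page_relative(
    box: Optional[Box],
    crop_origin: Tuple[float, float],
    page_width: float,
    page_height: float,
) -> Optional[Box]:
    """Shift a box from crop pixels to page pixels, then scale to 0..1."""
    if box is None:
        # Mirrors the `continue` in the reviewed loop: nothing to convert.
        return None
    dx, dy = crop_origin
    return Box(
        (box.x1 + dx) / page_width,
        (box.y1 + dy) / page_height,
        (box.x2 + dx) / page_width,
        (box.y2 + dy) / page_height,
    )


# A cell at (10, 20)-(30, 40) inside a crop whose top-left corner sits at
# pixel (100, 200) of a 400x800 page image.
rel = crop_to_page_relative(Box(10, 20, 30, 40), (100, 200), 400, 800)
missing = crop_to_page_relative(None, (100, 200), 400, 800)
```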