Upgrade PDFBox to 3.0.5 and fix checksums in tika tests #29

Closed
cwperks wants to merge 3 commits into prudhvigodithi:workflow-bug from cwperks:workflow-bug-cwperx

cwperks commented Aug 22, 2025

Description

This PR completes the Tika major version bump in opensearch-project#19125

Some of our tests use Tika to parse various file formats and compare a checksum of the parsed output against a known value. With the major version bump, Tika changed some of its parsing logic, so the expected checksums needed to be updated accordingly.

To compute new checksums, use the Tika CLI:

  1. brew install tika
  2. unzip the test file
  3. tika --text plugins/ingest-attachment/src/test/resources/org/opensearch/ingest/attachment/test/tika-files/testHTMLNoisyMetaEncoding_3.html | sha1sum

The only file I had trouble with in the Tika CLI was EmbeddedOutlook.docx; for that one I copied the checksum that the tests computed. I was able to verify everything else with Tika CLI v3.2.2.

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Craig Perkins <cwperx@amazon.com>
permission java.lang.RuntimePermission "accessDeclaredMembers";
// PDFBox checks for the existence of this class
permission java.lang.RuntimePermission "accessClassInPackage.sun.java2d.cmm.kcms";
permission java.io.FilePermission "/System/Library/Fonts/-", "read";
Owner

prudhvigodithi commented Aug 22, 2025

This issue comes from reusing an already-running Gradle daemon. If we kill the running daemon and re-run the test, we don't see this error.
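One way to try this is sketched below, assuming the Gradle wrapper sits in the current directory. `./gradlew --stop` is the standard way to stop running daemons of the same Gradle version; the function name `rerun_with_fresh_daemon` and the `GRADLEW` override variable are hypothetical and only there for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and GRADLEW variable are hypothetical.

# Stop any running Gradle daemons, then run the given Gradle arguments
# so the build starts on a fresh daemon.
rerun_with_fresh_daemon() {
  local gradlew="${GRADLEW:-./gradlew}"
  "$gradlew" --stop && "$gradlew" "$@"
}
```

For example: `rerun_with_fresh_daemon ':plugins:ingest-attachment:test' --tests 'org.opensearch.ingest.attachment.TikaDocTests.testParseSamples'`.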

Author

I got this issue on my Mac

REPRODUCE WITH: ./gradlew ':plugins:ingest-attachment:test' --tests 'org.opensearch.ingest.attachment.TikaDocTests.testParseSamples' -Dtests.seed=C7D4F7839B0F2D24 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ff-Latn-GN -Dtests.timezone=Africa/Bangui -Druntime.java=24
  2> java.lang.SecurityException: Denied OPEN (read) access to file: /System/Library/Fonts/Supplemental/Times New Roman.ttf, domain: ProtectionDomain  (file:/Users/cwperx/Projects/opensearch/OpenSearch/plugins/ingest-attachment/build/classes/java/main/ <no signer certificates>)
     jdk.internal.loader.ClassLoaders$AppClassLoader@659e0bfd
     <no principals>
     java.security.Permissions@1ce6dd3e (
    )
        at __randomizedtesting.SeedInfo.seed([C7D4F7839B0F2D24:EE9EC5F216DB847]:0)
        at java.base/java.nio.channels.FileChannel.open(FileChannel.java:347)
        at org.apache.pdfbox.io.RandomAccessReadBufferedFile.<init>(RandomAccessReadBufferedFile.java:110)
        at org.apache.pdfbox.io.RandomAccessReadBufferedFile.<init>(RandomAccessReadBufferedFile.java:98)
        at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.readTrueTypeFont(FileSystemFontProvider.java:249)
        at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getTrueTypeFont(FileSystemFontProvider.java:206)
        at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:148)
        at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:433)
        at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getTrueTypeFont(FontMapperImpl.java:318)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:142)
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:153)
        at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
        at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:924)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:557)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:515)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)
        at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:379)
        at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:136)
        at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1362)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:252)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
        at org.apache.tika.Tika.parseToString(Tika.java:525)
        at org.opensearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:122)
        at java.base/java.security.AccessController.doPrivileged(AccessController.java:337)
        at org.opensearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:121)
        at org.opensearch.ingest.attachment.TikaDocTests.tryParse(TikaDocTests.java:96)
        at org.opensearch.ingest.attachment.TikaDocTests.testParseSamples(TikaDocTests.java:68)

Author

The issue is present when processing this one:

doc: /Users/cwperx/Projects/opensearch/OpenSearch/plugins/ingest-attachment/build/testrun/test/temp/org.opensearch.ingest.attachment.TikaDocTests_A83431F6C648EB64-001/tempDir-002/headers.mbox