Upgrade PDFBox to 3.0.5 and fix checksums in tika tests by cwperks · Pull Request #29 · prudhvigodithi/OpenSearch

cwperks · 2025-08-22T14:57:40Z

Description

This PR completes the Tika major version bump in opensearch-project#19125

Some tests parse various file formats with Tika and compare a checksum of the output against a known value. With the major version bump, Tika changed some of its parsing logic, so those checksums needed to be updated accordingly.

To compute new checksums, use the Tika CLI:

brew install tika
unzip the test file
tika --text plugins/ingest-attachment/src/test/resources/org/opensearch/ingest/attachment/test/tika-files/testHTMLNoisyMetaEncoding_3.html | sha1sum
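For comparison with the CLI pipeline above, here is a minimal Java sketch of the same hashing step. It assumes the checksums are lowercase hex SHA-1 digests over the UTF-8 bytes of the extracted text, which the `sha1sum` step suggests but this PR does not spell out. `Sha1Sketch` and `sha1Hex` are hypothetical names, not taken from the OpenSearch test sources.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the OpenSearch tests.
class Sha1Sketch {

    // Returns the lowercase hex SHA-1 of the UTF-8 bytes of `text`,
    // the same digest format that `sha1sum` prints for its input bytes.
    static String sha1Hex(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pass the extracted text as the first argument; defaults to "abc".
        String extracted = args.length > 0 ? args[0] : "abc";
        System.out.println(sha1Hex(extracted));
    }
}
```

Note that `tika --text ... | sha1sum` hashes every byte the CLI writes to stdout, so a trailing newline or a different output encoding would change the digest relative to hashing an in-memory string.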

The Tika CLI gave me trouble with only one file (EmbeddedOutlook.docx); for that one I copied the output that we computed in the tests. I was able to verify everything else with Tika CLI v3.2.2.

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Craig Perkins <cwperx@amazon.com>

prudhvigodithi · 2025-08-22T15:20:20Z

plugins/ingest-attachment/src/main/plugin-metadata/plugin-security.policy

  permission java.lang.RuntimePermission "accessDeclaredMembers";
  // PDFBox checks for the existence of this class
  permission java.lang.RuntimePermission "accessClassInPackage.sun.java2d.cmm.kcms";
+  permission java.io.FilePermission "/System/Library/Fonts/-", "read";


~~This is an issue with reusing an already-running Gradle daemon; if we kill the running daemon and re-run the test, we don't see this error.~~

I hit this issue on my Mac:

REPRODUCE WITH: ./gradlew ':plugins:ingest-attachment:test' --tests 'org.opensearch.ingest.attachment.TikaDocTests.testParseSamples' -Dtests.seed=C7D4F7839B0F2D24 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ff-Latn-GN -Dtests.timezone=Africa/Bangui -Druntime.java=24
2> java.lang.SecurityException: Denied OPEN (read) access to file: /System/Library/Fonts/Supplemental/Times New Roman.ttf, domain: ProtectionDomain (file:/Users/cwperx/Projects/opensearch/OpenSearch/plugins/ingest-attachment/build/classes/java/main/ <no signer certificates>) jdk.internal.loader.ClassLoaders$AppClassLoader@659e0bfd <no principals> java.security.Permissions@1ce6dd3e ( )
    at __randomizedtesting.SeedInfo.seed([C7D4F7839B0F2D24:EE9EC5F216DB847]:0)
    at java.base/java.nio.channels.FileChannel.open(FileChannel.java:347)
    at org.apache.pdfbox.io.RandomAccessReadBufferedFile.<init>(RandomAccessReadBufferedFile.java:110)
    at org.apache.pdfbox.io.RandomAccessReadBufferedFile.<init>(RandomAccessReadBufferedFile.java:98)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.readTrueTypeFont(FileSystemFontProvider.java:249)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getTrueTypeFont(FileSystemFontProvider.java:206)
    at org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:148)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:433)
    at org.apache.pdfbox.pdmodel.font.FontMapperImpl.getTrueTypeFont(FontMapperImpl.java:318)
    at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:142)
    at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:153)
    at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:170)
    at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:72)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:924)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:557)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:515)
    at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:158)
    at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:153)
    at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:379)
    at org.apache.tika.parser.pdf.PDF2XHTML.processPage(PDF2XHTML.java:136)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.processPages(AbstractPDF2XHTML.java:1362)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:252)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:107)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:219)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:298)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:204)
    at org.apache.tika.Tika.parseToString(Tika.java:525)
    at org.opensearch.ingest.attachment.TikaImpl.lambda$parse$0(TikaImpl.java:122)
    at java.base/java.security.AccessController.doPrivileged(AccessController.java:337)
    at org.opensearch.ingest.attachment.TikaImpl.parse(TikaImpl.java:121)
    at org.opensearch.ingest.attachment.TikaDocTests.tryParse(TikaDocTests.java:96)
    at org.opensearch.ingest.attachment.TikaDocTests.testParseSamples(TikaDocTests.java:68)

The issue occurs when processing this file:

doc: /Users/cwperx/Projects/opensearch/OpenSearch/plugins/ingest-attachment/build/testrun/test/temp/org.opensearch.ingest.attachment.TikaDocTests_A83431F6C648EB64-001/tempDir-002/headers.mbox
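To relate the `java.io.FilePermission "/System/Library/Fonts/-", "read"` grant in this thread to the denied path in the trace, here is a small standalone check. It is a sketch that only exercises the JDK's own `FilePermission.implies` path matching; how the OpenSearch test harness actually enforces the policy file is not modeled here. `FontPermissionSketch` and `covers` are hypothetical names.

```java
import java.io.FilePermission;

// Hypothetical check for illustration; not part of this PR.
class FontPermissionSketch {

    // True if a "read" grant on grantedPath implies "read" on requestedPath.
    static boolean covers(String grantedPath, String requestedPath) {
        FilePermission granted = new FilePermission(grantedPath, "read");
        FilePermission requested = new FilePermission(requestedPath, "read");
        return granted.implies(requested);
    }

    public static void main(String[] args) {
        // Path copied from the SecurityException message in the trace.
        String font = "/System/Library/Fonts/Supplemental/Times New Roman.ttf";

        // A trailing "/-" matches files in all subdirectories, recursively.
        System.out.println(covers("/System/Library/Fonts/-", font));

        // A trailing "/*" matches only direct children of the directory.
        System.out.println(covers("/System/Library/Fonts/*", font));
    }
}
```

The font in the trace lives under the `Supplemental` subdirectory, which is why the recursive `-` form matters here: a `*` grant on `/System/Library/Fonts` would stop one level short of it.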

cwperks added 3 commits August 22, 2025 10:08

WIP on tika upgrade

f4b20ab

Signed-off-by: Craig Perkins <cwperx@amazon.com>

WIP on fixing tika tests

d2d3c87

Signed-off-by: Craig Perkins <cwperx@amazon.com>

Upgrade pdfbox to 3.0.5 and fix checksums for tika major version bump

41e4a23

Signed-off-by: Craig Perkins <cwperx@amazon.com>

prudhvigodithi reviewed Aug 22, 2025

prudhvigodithi mentioned this pull request Aug 22, 2025

Bump tika from 2.9.2 to 3.2.2 opensearch-project/OpenSearch#19125

Merged

3 tasks

cwperks closed this Aug 23, 2025

cwperks wants to merge 3 commits into prudhvigodithi:workflow-bug from cwperks:workflow-bug-cwperx
