[BUG] Spark Excel not reading whole columns and is only reading specific data address ranges #930

@bitbythecron

Am I using the newest version of the library?

  • I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

I have a Java app that uses the Spark Excel library to read an Excel file into a Dataset<Row>. When I use the following configuration:

String filePath = "file:///Users/myuser/example-data.xlsx";
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", "'ExampleData'!A2:D7")
        .load(filePath);

This works beautifully and my Dataset<Row> is populated without any issues. But as soon as I tell it to read all rows in columns A through D, it returns an empty Dataset<Row>:

// dataset will be empty
.option("dataAddress", "'ExampleData'!A:D")

This also happens if I set the sheetName and dataAddress separately:

// dataset will be empty
.option("sheetName", "ExampleData")
.option("dataAddress", "A:D")

And it also happens when, instead of providing the sheetName, I provide a sheetIndex:

// dataset will be empty; and I have experimented by setting it to 0 as well
// in case it is a 0-based index
.option("sheetIndex", 1)
.option("dataAddress", "A:D")

My question: is this expected behavior of the Spark Excel library, or is it a bug I have discovered, or am I not using the Options API correctly here?
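
For reference, the only difference between the working call and the failing ones is the shape of the address: a bounded cell range (A2:D7) versus a whole-column range (A:D). Below is a minimal standalone sketch that tells the two shapes apart, so the address can be logged or checked before it is passed to the reader. The class `DataAddressShape` and its `classify` method are hypothetical names made up for illustration; they are not part of Spark Excel and say nothing about how the library parses addresses internally.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Spark Excel): classifies the dataAddress strings from this report. */
final class DataAddressShape {

    enum Kind { CELL_RANGE, WHOLE_COLUMNS, OTHER }

    // Optional sheet prefix, either quoted ('ExampleData'!) or unquoted (ExampleData!).
    private static final String SHEET = "(?:(?:'[^']+'|[^'!]+)!)?";

    // Bounded range with row numbers on both ends, e.g. A2:D7.
    private static final Pattern CELL_RANGE =
            Pattern.compile(SHEET + "[A-Za-z]{1,3}[0-9]+:[A-Za-z]{1,3}[0-9]+");

    // Whole-column range without row numbers, e.g. A:D.
    private static final Pattern WHOLE_COLUMNS =
            Pattern.compile(SHEET + "[A-Za-z]{1,3}:[A-Za-z]{1,3}");

    static Kind classify(String dataAddress) {
        if (CELL_RANGE.matcher(dataAddress).matches()) {
            return Kind.CELL_RANGE;
        }
        if (WHOLE_COLUMNS.matcher(dataAddress).matches()) {
            return Kind.WHOLE_COLUMNS;
        }
        return Kind.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("'ExampleData'!A2:D7")); // the address that works for me
        System.out.println(classify("'ExampleData'!A:D"));   // the address that yields an empty dataset
        System.out.println(classify("A:D"));                 // same shape, without the sheet prefix
    }
}
```

With this in place, the three failing configurations above all fall into the `WHOLE_COLUMNS` bucket, while the working one is a `CELL_RANGE`.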

Expected Behavior

As explained above, I expected the three whole-column configurations (A:D) to work just like the explicit range (A2:D7), but only the explicit range returns data.

Steps To Reproduce

Code is provided above. I am pulling in the following Gradle libraries:

    implementation("org.apache.spark:spark-core_2.12:3.5.3")
    implementation("org.apache.spark:spark-sql_2.12:3.5.3")
    implementation("com.crealytics:spark-excel_2.12:3.5.1_0.20.4")
    implementation("com.databricks:spark-xml_2.12:0.18.0")

I am using a Java application (not Scala).

Environment

- Spark version: `3.5.3` (Scala `2.12`)
- Spark-Excel version: `3.5.1_0.20.4` (Scala `2.12`)
- OS: macOS Sequoia 15.3
- Cluster environment

Anything else?

No response
