A fast Apache Spark testing framework with beautifully formatted error messages!
For example, the assertSmallDatasetEquality method can be used to compare two Datasets (or two DataFrames).
val sourceDF = Seq(
(1),
(5)
).toDF("number")
val expectedDF = Seq(
(1, "word"),
(5, "word")
).toDF("number", "word")
assertSmallDatasetEquality(sourceDF, expectedDF)
// throws a DatasetSchemaMismatch exceptionThe assertSmallDatasetEquality method can also be used to compare Datasets.
val sourceDS = Seq(
Person("bob", 1),
Person("alice", 5)
).toDS
val expectedDS = Seq(
Person("frank", 10),
Person("lucy", 5)
).toDS
assertLargeDatasetEquality(sourceDS, expectedDS)
// throws an exception because the Datasets have different dataThe DatasetComparer has assertSmallDatasetEquality and assertLargeDatasetEquality methods to compare either Datasets or DataFrames.
If you only need to compare DataFrames, you can use DataFrameComparer with the associated assertSmallDataFrameEquality and assertLargeDataFrameEquality methods. Under the hood, DataFrameComparer uses the assertSmallDatasetEquality and assertLargeDatasetEquality.
Add the sbt-spark-package plugin so you can install Spark Packages.
Then add these lines to your build.sbt file to install Spark SQL and spark-fast-tests:
spDependencies += "MrPowers/spark-fast-tests:0.4.0"
sparkComponents ++= Seq("sql")The spark-fast-tests project doesn't provide a SparkSession object in your test suite, so you'll need to make one yourself.
import org.apache.spark.sql.SparkSession
trait SparkSessionTestWrapper {
lazy val spark: SparkSession = {
SparkSession.builder().master("local").appName("spark session").getOrCreate()
}
}The DatasetComparer trait defines the assertSmallDatasetEquality method. Extend your spec file with the SparkSessionTestWrapper trait to create DataFrames and the DatasetComparer trait to make DataFrame comparisons.
class DatasetSpec extends FunSpec with SparkSessionTestWrapper with DatasetComparer {
import spark.implicits._
it("aliases a DataFrame") {
val sourceDF = Seq(
("jose"),
("li"),
("luisa")
).toDF("name")
val actualDF = sourceDF.select(col("name").alias("student"))
val expectedDF = Seq(
("jose"),
("li"),
("luisa")
).toDF("student")
assertSmallDatasetEquality(actualDF, expectedDF)
}
}
}To compare large DataFrames that are partitioned across different nodes in a cluster, use the assertLargeDatasetEquality method.
assertLargeDatasetEquality(actualDF, expectedDF)assertSmallDatasetEquality is faster for test suites that run on your local machine. assertLargeDatasetEquality should only be used for DataFrames that are split across nodes in a cluster.
The spark-testing-base project has more features (e.g. streaming support) and is compiled to support a variety of Scala and Spark versions.
You might want to use spark-fast-tests instead of spark-testing-base in these cases:
- You want to run tests in parallel (you need to set
parallelExecution in Test := falsewith spark-testing-base) - You don't want to include hive as a project dependency
- You don't want to restart the SparkSession after each test file executes so the suite runs faster
- Use memory efficiently so Spark test runs don't crash
- Provide readable error messages
- Easy to use in conjunction with other test suites
- Give the user control of the SparkSession
spark-fast-tests supports Spark 2.x. There are no plans to retrofit the project to work with Spark 1.x.
Open an issue or send a pull request to contribute. Anyone that makes good contributions to the project will be promoted to project maintainer status.