NetApp is excited to announce the addition of Shuffle Data Store for Ocean for Apache Spark™. Customers running on AWS can now benefit from improved resilience and efficiency for multi-step data preparation pipelines that require Apache Spark to shuffle data. The external shuffle solution is backed by either Amazon FSx for NetApp ONTAP storage or Amazon S3 storage, both optimally configured for Spark workloads. With Shuffle Data Store for Ocean for Apache Spark, customers can complete data engineering on massive data sets sooner, shortening the time to extract value from analytics.
With Shuffle Data Store, shuffle data is now automatically persisted outside the Spark cluster on remote storage. Because shuffle data is progressively saved to the external shuffle data store, a Spark application can quickly resume after a node failure without repeating earlier compute steps. Additionally, NetApp has contributed to the open source plugin to improve how it interacts with Apache Spark's Dynamic Allocation feature.
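For readers curious how an external shuffle backend is typically wired up, here is a minimal PySpark sketch of a Spark 3.x application that points the pluggable shuffle I/O layer at remote storage and enables Dynamic Allocation with shuffle tracking. The plugin class name, the `spark.shuffle.remote.storage.path` setting, and the S3 bucket are illustrative placeholders, not the actual Ocean for Apache Spark configuration.

```python
from pyspark.sql import SparkSession

# Illustrative only: the plugin class, the remote-path setting, and the bucket
# below are placeholders, not the actual Ocean for Apache Spark configuration.
spark = (
    SparkSession.builder
    .appName("external-shuffle-sketch")
    # Swap Spark's default local-disk shuffle I/O for a remote-storage plugin
    # (hypothetical class name; Spark's built-in default is LocalDiskShuffleDataIO).
    .config("spark.shuffle.sort.io.plugin.class",
            "com.example.shuffle.RemoteStorageShuffleDataIO")
    # Hypothetical plugin setting pointing at the external shuffle location.
    .config("spark.shuffle.remote.storage.path", "s3a://my-shuffle-bucket/shuffle")
    # Dynamic Allocation lets Spark scale executors up and down; shuffle
    # tracking (or an external shuffle store) keeps shuffle data available
    # even when the executor that produced it is removed.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)

# A shuffle-heavy step: the wide aggregation below forces Spark to shuffle
# data between executors, which the configured plugin would persist remotely.
df = spark.range(0, 10_000_000).withColumnRenamed("id", "value")
counts = df.groupBy((df.value % 1000).alias("bucket")).count()
counts.show(5)
```

With a setup along these lines, shuffle blocks survive the loss of the executors that wrote them, so Dynamic Allocation can remove idle nodes and the application can recover from node failures without recomputing upstream stages.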
Read more at Spot.io!