Apache Spark 4.0


Apache Spark 4.0 marks a major milestone in the evolution of the Spark analytics engine. This release brings significant advancements across the board – from SQL language enhancements and expanded connectivity, to new Python capabilities, streaming improvements, and better usability. Spark 4.0 is designed to be more powerful, ANSI-compliant, and user-friendly than ever, while maintaining compatibility with existing Spark workloads. In this post, we explain the key features and improvements introduced in Spark 4.0 and how they elevate your big data processing experience.

Key Highlights in Spark 4.0 include:

  • SQL Language Enhancements: New capabilities including SQL scripting with session variables and control flow, reusable SQL User-Defined Functions (UDFs), and intuitive PIPE syntax to streamline and simplify complex analytics workflows.
  • Spark Connect Enhancements: Spark Connect—Spark’s new client-server architecture—now achieves high feature parity with Spark Classic in Spark 4.0. This release adds enhanced compatibility between Python and Scala, multi-language support (with new clients for Go, Swift, and Rust), and a simpler migration path via the new spark.api.mode setting. Developers can seamlessly switch from Spark Classic to Spark Connect to benefit from a more modular, scalable, and flexible architecture.
  • Reliability & Productivity Enhancements: ANSI SQL mode enabled by default ensures stricter data integrity and better interoperability, complemented by the VARIANT data type for efficient handling of semi-structured JSON data and structured JSON logging for improved observability and easier troubleshooting.
  • Python API Advances: Native Plotly-based plotting directly on PySpark DataFrames, a Python Data Source API enabling custom Python batch & streaming connectors, and polymorphic Python UDTFs for dynamic schema support and greater flexibility.
  • Structured Streaming Advances: New Arbitrary Stateful Processing API called transformWithState in Scala, Java & Python for robust and fault-tolerant custom stateful logic, state store usability improvements, and a new State Store Data Source for improved debuggability and observability.

In the sections below, we share more details on these exciting features, and at the end, we provide links to the relevant JIRA efforts and deep-dive blog posts for those who want to learn more. Spark 4.0 represents a robust, future-ready platform for large-scale data processing, combining the familiarity of Spark with new capabilities that meet modern data engineering needs.

Major Spark Connect Improvements

One of the most exciting updates in Spark 4.0 is the overall improvement of Spark Connect, in particular the Scala client. With Spark 4.0, Spark SQL features offer near-complete parity between Spark Connect and Classic execution modes, with only minor differences remaining. Spark Connect is the new client-server architecture for Spark that decouples the user application from the Spark cluster, and in 4.0, it’s more capable than ever:

  • Improved Compatibility: A major achievement for Spark Connect in Spark 4 is the improved compatibility of the Python and Scala APIs, which makes switching between Spark Classic and Spark Connect seamless. This means that for most use cases, all you have to do is enable Spark Connect for your applications by setting spark.api.mode to connect (a minimal sketch follows this list). We recommend developing new jobs and applications with Spark Connect enabled so that you can benefit most from Spark's powerful query optimization and execution engine.
  • Multi-Language Support: Spark Connect in 4.0 supports a broad range of languages and environments. Python and Scala clients are fully supported, and new community-supported connect clients for Go, Swift, and Rust are available. This polyglot support means developers can use Spark in the language of their choice, even outside the JVM ecosystem, via the Connect API. For example, a Rust data engineering application or a Go service can now directly connect to a Spark cluster and run DataFrame queries, expanding Spark’s reach beyond its traditional user base.
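
As a minimal sketch of the migration path mentioned above (assuming a local Spark 4.0 installation; the application name and DataFrame logic are illustrative), switching an existing PySpark job over to Spark Connect can be as simple as setting spark.api.mode when building the session:

```python
from pyspark.sql import SparkSession

# Build a session that talks to Spark over the Connect protocol instead of
# the classic in-process driver. Setting the value to "classic" (or omitting
# it) keeps the pre-4.0 behavior.
spark = (
    SparkSession.builder
    .appName("connect-demo")             # illustrative name
    .config("spark.api.mode", "connect")
    .getOrCreate()
)

# Regular DataFrame code is unchanged under Spark Connect.
df = spark.range(1, 6).withColumnRenamed("id", "n")
df.filter("n % 2 = 1").show()
```

Because the DataFrame code itself does not change, you can flip the same application between Classic and Connect modes by configuration alone, which is what makes the migration path incremental.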

SQL Language Features

Spark 4.0 adds new capabilities to simplify data analytics:

  • SQL User-Defined Functions (UDFs) – Spark 4.0 introduces SQL UDFs, enabling users to define reusable custom functions directly in SQL. These functions simplify complex logic, improve maintainability, and integrate seamlessly with Spark’s query optimizer, enhancing query performance compared to traditional code-based UDFs. SQL UDFs support temporary and permanent definitions, making it easy for teams to share common logic across multiple queries and applications. [Read the blog post]
  • SQL PIPE Syntax – Spark 4.0 introduces a new PIPE syntax, allowing users to chain SQL operations using the |> operator. This functional-style approach enhances query readability and maintainability by enabling a linear flow of transformations. The PIPE syntax is fully compatible with existing SQL, allowing for gradual adoption and integration into current workflows. [Read the blog post]
  • Language, accent, and case-aware collations – Spark 4.0 introduces a new COLLATE property for STRING types. You can choose from many language- and region-aware collations to control how Spark determines order and comparisons. You can also decide whether collations should be case, accent, and trailing-blank insensitive. [Read the blog post]
  • Session variables – Spark 4.0 introduces session-local variables, which can be used to keep and manage state within a session without using host language variables. [Read the blog post]
  • Parameter markers – Spark 4.0 introduces named (":var") and unnamed ("?") parameter markers. This feature allows you to parameterize queries and safely pass in values through the spark.sql() API, mitigating the risk of SQL injection (see the sketch after this list). [See documentation]
  • SQL Scripting – Writing multi-step SQL workflows is easier in Spark 4.0 thanks to new SQL scripting capabilities. You can now execute multi-statement SQL scripts with features like local variables and control flow. This enhancement lets data engineers move parts of ETL logic into pure SQL, with Spark 4.0 supporting constructs that were previously only possible via external languages or stored procedures. This feature will soon be further improved by error condition handling. [Read the blog post]
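
To make a few of these SQL features concrete, here is a small, hedged sketch run through PySpark's spark.sql() on a Spark 4.0 session; the view name, function name, and data are illustrative only. It defines a reusable SQL UDF, chains transformations with the PIPE syntax, and binds a value through a named parameter marker:

```python
# A reusable SQL UDF, defined entirely in SQL and visible to the optimizer.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION fahrenheit_to_celsius(f DOUBLE)
    RETURNS DOUBLE
    RETURN (f - 32) * 5.0 / 9.0
""")

# Illustrative data registered as a temporary view.
spark.createDataFrame(
    [("Seattle", 68.0), ("Phoenix", 104.0)],
    "city STRING, temp_f DOUBLE",
).createOrReplaceTempView("readings")

# PIPE syntax: a linear chain of operations instead of nested subqueries.
# The :min_c parameter marker is bound safely from Python via `args`.
spark.sql(
    """
    FROM readings
    |> SELECT city, fahrenheit_to_celsius(temp_f) AS temp_c
    |> WHERE temp_c > :min_c
    |> ORDER BY temp_c DESC
    """,
    args={"min_c": 25},
).show()
```

Each |> step consumes the output of the previous one, so later steps can reference aliases such as temp_c directly, which is a large part of what makes the pipe form easier to read than deeply nested SQL.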

Data Integrity and Developer Productivity

Spark 4.0 introduces several updates that make the platform more reliable, standards-compliant, and user-friendly. These enhancements streamline both development and production workflows, ensuring higher data quality and faster troubleshooting.

  • ANSI SQL Mode: One of the most significant shifts in Spark 4.0 is enabling ANSI SQL mode by default, aligning Spark more closely with standard SQL semantics. This change ensures stricter data handling by providing explicit error messages for operations that previously resulted in silent truncations or nulls, such as numeric overflows or division by zero. Additionally, adhering to ANSI SQL standards greatly improves interoperability, simplifying the migration of SQL workloads from other systems and reducing the need for extensive query rewrites and team retraining. Overall, this advancement promotes clearer, more reliable, and portable data workflows. [See documentation]
  • New VARIANT Data Type: Apache Spark 4.0 introduces the new VARIANT data type designed specifically for semi-structured data, enabling the storage of complex JSON or map-like structures within a single column while maintaining the ability to efficiently query nested fields. This powerful capability offers significant schema flexibility, making it easier to ingest and manage data that doesn't conform to predefined schemas. Additionally, Spark's built-in indexing and parsing of JSON fields enhance query performance, facilitating fast lookups and transformations. By minimizing the need for repeated schema evolution steps, VARIANT simplifies ETL pipelines, resulting in more streamlined data processing workflows (a brief sketch follows this list). [Read the blog post]
  • Structured Logging: Spark 4.0 introduces a new structured logging framework that simplifies debugging and monitoring. By enabling spark.log.structuredLogging.enabled=true, Spark writes logs as JSON lines, with each entry including structured fields like timestamp, log level, message, and the full Mapped Diagnostic Context (MDC). This modern format simplifies integration with observability tools such as ELK and Splunk, and the JSON logs can even be loaded and queried with Spark SQL itself, making them much easier to parse, search, and analyze. [Learn more]
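
As a brief sketch of how these pieces fit together (the JSON payloads below are illustrative; parse_json, variant_get, and try_divide are built-in SQL functions in Spark 4.0):

```python
# VARIANT: ingest semi-structured JSON once, then query nested fields directly.
events = spark.sql("""
    SELECT parse_json(raw) AS payload
    FROM VALUES
        ('{"user": {"id": 1, "plan": "free"}, "amount": 0}'),
        ('{"user": {"id": 2, "plan": "pro"},  "amount": 49.5}')
    AS t(raw)
""")

events.selectExpr(
    "variant_get(payload, '$.user.plan', 'string') AS plan",
    "variant_get(payload, '$.amount', 'double') AS amount",
).show()

# With ANSI mode on by default, invalid operations raise clear errors instead
# of silently producing NULLs; the try_* variants opt back into null-on-error
# semantics where that is the desired behavior.
spark.sql("SELECT try_divide(10, 0) AS safe_ratio").show()
```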

Python API Advances

Python users have a lot to celebrate in Spark 4.0. This release makes Spark more Pythonic and improves the performance of PySpark workloads:

  • Native Plotting Support: Data exploration in PySpark just got easier – Spark 4.0 adds native plotting capabilities to PySpark DataFrames. You can now call a .plot() method or use an associated API on a DataFrame to generate charts directly from Spark data, without manually collecting data to pandas. Under the hood, Spark uses Plotly as the default visualization backend to render charts. This means common plot types like histograms and scatter plots can be created with one line of code on a PySpark DataFrame, and Spark will handle fetching a sample or aggregate of the data to plot in a notebook or GUI. By supporting native plotting, Spark 4.0 streamlines exploratory data analysis – you can visualize distributions and trends from your dataset without leaving the Spark context or writing separate matplotlib/plotly code. This feature is a productivity boon for data scientists using PySpark for EDA.
  • Python Data Source API: Spark 4.0 introduces a new Python DataSource API that allows developers to implement custom data sources for batch and streaming entirely in Python. Previously, writing a connector for a new file format, database, or data stream often required Java/Scala knowledge. Now, you can create readers and writers in Python, which opens up Spark to a broader community of developers. For example, if you have a custom data format or an API that only has a Python client, you can wrap it as a Spark DataFrame source/sink using this API. This feature greatly improves extensibility for PySpark in both batch and streaming contexts. See the PySpark deep-dive post for an example of implementing a simple custom data source in Python, or check out more examples here. [Read the blog post]
  • Polymorphic Python UDTFs: Building on the SQL UDTF capability, PySpark now supports User-Defined Table Functions in Python, including polymorphic UDTFs that can return different schema shapes depending on input. You create a Python class as a UDTF using a decorator, yield output rows from its eval method, and register it so it can be called from Spark SQL or the DataFrame API. A powerful aspect is dynamic-schema UDTFs: your UDTF can define an analyze() method to produce a schema on the fly based on parameters, such as reading a config file to determine output columns. This polymorphic behavior makes UDTFs extremely flexible, enabling scenarios like processing a varying JSON schema or splitting an input into a variable set of outputs. PySpark UDTFs effectively let Python logic output a full table result per invocation, all within the Spark execution engine (see the sketch after this list). [See documentation]
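
A minimal sketch of the UDTF flow described above (the class and function names are illustrative; this version declares a fixed output schema, whereas a polymorphic UDTF would instead supply a static analyze() method that computes the schema from its arguments):

```python
from pyspark.sql.functions import udtf

# A table function: each eval() call may yield zero or more output rows that
# match the declared schema.
@udtf(returnType="word: string, length: int")
class SplitWords:
    def eval(self, text: str):
        for word in (text or "").split():
            yield word, len(word)

# Register it so it can be called from SQL (in the FROM clause) as well as
# directly from Python.
spark.udtf.register("split_words", SplitWords)
spark.sql("SELECT * FROM split_words('spark four point zero')").show()
```

In the polymorphic case, analyze() receives the call-site arguments and returns the output schema, which is what enables the dynamic-schema scenarios described above.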

Streaming Enhancements

Apache Spark 4.0 continues to refine Structured Streaming for improved performance, usability and observability:

  • Arbitrary Stateful Processing v2: Spark 4.0 introduces a new arbitrary stateful processing operator called transformWithState. transformWithState allows building complex operational pipelines with support for object-oriented logic definition, composite state types, timers and TTL, initial state handling, state schema evolution, and a host of other features. The new API is available in Scala, Java, and Python and integrates natively with other important features such as the state data source reader and operator metadata handling. [Read the blog post]
  • State Data Source - Reader: Spark 4.0 adds the ability to query streaming state as a table. This new state store data source exposes the internal state used in stateful streaming aggregations (counters, session windows, and so on) and stream-stream joins as a readable DataFrame. With additional options, it can also surface state changes on a per-update basis for fine-grained visibility. This helps you understand what state your streaming job is maintaining, troubleshoot and monitor the stateful logic of your streams, and detect underlying corruptions or invariant violations (see the sketch after this list). [Read the blog post]
  • State Store Enhancements: Spark 4.0 also adds numerous state store improvements, such as better Static Sorted Table (SST) file reuse management, snapshot and maintenance management improvements, a revamped state checkpoint format, and additional performance gains. Alongside these, numerous changes improve logging and error classification for easier monitoring and debugging.
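
As a small, hedged sketch of the state reader (it assumes a stateful streaming query has already written its checkpoint to the path shown, which is illustrative):

```python
# Read the state of a stateful streaming operator as an ordinary DataFrame,
# using the checkpoint location of the streaming query.
state = (
    spark.read
    .format("statestore")
    .load("/tmp/checkpoints/orders_agg")  # illustrative checkpoint path
)

state.printSchema()            # typically key/value structs plus a partition id
state.show(truncate=False)
```

Additional reader options (for example, to target a specific operator or batch, or to read state changes over a range of batches) provide the per-update visibility mentioned above.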

Acknowledgements

Spark 4.0 is a huge step forward for the Apache Spark project, with optimizations and new features touching every layer, from core improvements to richer APIs. In this release, the community closed more than 5,000 JIRA issues, and around 400 individual contributors, from independent developers to organizations such as Databricks, Apple, LinkedIn, Intel, OpenAI, eBay, NetEase, and Baidu, have driven these enhancements.

We extend our sincere thanks to every contributor, whether you filed a ticket, reviewed code, improved documentation, or shared feedback on mailing lists. Beyond the headline SQL, Python, and streaming improvements, Spark 4.0 also delivers Java 21 support, a Spark Kubernetes (K8s) operator, XML connectors, Spark ML support on Connect, and unified profiling for PySpark UDFs. For the full list of changes and all other engine-level refinements, please consult the official Spark 4.0 release notes.


Getting Spark 4.0: It’s fully open source; download it from spark.apache.org. Many of its features were already available in Databricks Runtime 15.x and 16.x, and now they ship out of the box with Runtime 17.0. To explore Spark 4.0 in a managed environment, sign up for the free Community Edition or start a trial, choose “17.0” when you spin up your cluster, and you’ll be running Spark 4.0 in minutes.


If you missed our Spark 4.0 meetup where we discussed these features, you can view the recordings here. Also, stay tuned for future deep-dive meetups on these Spark 4.0 features.
