PySpark UDF Unified Profiling

We are excited to release Unified Profiling for PySpark User-Defined Functions (UDFs) as part of Databricks Runtime 17.0 (release notes). Unified Profiling for PySpark UDFs lets developers profile the performance and memory usage of their UDFs, tracking function calls, execution time, memory consumption, and other metrics. This enables PySpark developers to quickly identify and address bottlenecks, leading to faster and more resource-efficient UDFs.

The unified profilers are enabled by setting the runtime SQL configuration "spark.sql.pyspark.udf.profiler" to "perf" for the performance profiler or "memory" for the memory profiler, as shown below.
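
As a minimal sketch (spark here is the notebook's default SparkSession; the configuration key is quoted from the release):

    # Enable the performance profiler for subsequent UDF executions.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    # Or enable the memory profiler instead.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")

    # Unset the configuration to turn profiling off again.
    spark.conf.unset("spark.sql.pyspark.udf.profiler")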

Replacement for Legacy Profiling

Legacy profiling [1, 2] was implemented at the SparkContext level and thus did not work with Spark Connect. The new profiling is SparkSession-based, applies to Spark Connect, and can be enabled or disabled at runtime. It maximizes API parity with legacy profiling by providing "show" and "dump" commands to visualize profile results and save them to a workspace folder, and it adds convenience APIs to manage and reset profile results on demand. Lastly, it supports registered UDFs, which legacy profiling did not, as sketched below.
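
A sketch of the session-scoped workflow, assuming the SparkSession.profile accessor available in PySpark on Databricks Runtime 17.0 (the UDF plus_one and the query are illustrative):

    from pyspark.sql.functions import udf

    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    @udf("long")
    def plus_one(x):  # illustrative UDF
        return x + 1

    # Registered UDFs invoked from SQL are profiled too,
    # which legacy profiling did not support.
    spark.udf.register("plus_one", plus_one)
    spark.sql("SELECT plus_one(id) FROM range(10)").collect()

    spark.profile.show(type="perf")  # visualize accumulated results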

PySpark Performance Profiler

The PySpark performance profiler leverages Python's built-in profilers to extend profiling capabilities to the driver and UDFs executed on executors in a distributed manner.

Let's dive into an example to see the PySpark performance profiler in action. We run the following code in a Databricks Runtime 17.0 notebook.
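
A minimal example consistent with the names referenced below (the UDF add1 and the DataFrame added) might look like this:

    from pyspark.sql.functions import udf

    # Turn on the performance profiler before running the UDF.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    @udf("long")
    def add1(x):
        return x + 1

    added = spark.range(10).select(add1("id"))
    added.show()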

Running added.show() executes the UDF and records the profile; calling spark.profile.show(type="perf") then displays the performance profiling results, as shown below.
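
A sketch of that call (the printed columns follow Python's standard cProfile/pstats format):

    # Show accumulated performance profiles for all profiled UDFs.
    # Columns include ncalls, tottime, percall, cumtime, and
    # filename:lineno(function), as in Python's pstats reports.
    spark.profile.show(type="perf")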

The output includes the number of calls to each function, the total time spent in it, and the filename and line number to aid navigation. This information is essential for identifying tight loops in your PySpark programs and for making informed decisions to improve performance.

Note that the UDF id in these results correlates directly with the one found in the Spark plan: calling the explain method on the DataFrame reveals it in the "ArrowEvalPython [add1(...)#50L]" node.
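
For instance (the exact attribute ids vary from run to run; #50L is quoted from the plan above):

    added.explain()
    # == Physical Plan == (excerpt)
    # ... ArrowEvalPython [add1(id#0L)#50L] ...
    # The number after '#' in add1(...)#50L is the UDF id that
    # appears in the profiling results.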

Finally, we can dump the profiling results to a folder and clear the collected profiles, as shown below.
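
A minimal sketch, assuming the SparkSession.profile accessor (the destination path is illustrative):

    # Save accumulated profile results to the given folder.
    spark.profile.dump("/tmp/udf_profiles")  # illustrative path

    # Reset all collected profile results.
    spark.profile.clear()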
