06/02/2025 | News release | Distributed by Public on 06/03/2025 18:00
Apache Iceberg v3, now approved by the Apache Iceberg™ community, introduces advanced new features and data types. Iceberg v3 includes major improvements such as deletion vectors, row lineage, and new types for semi-structured data and geospatial use cases. These features allow customers to efficiently process and query data. Additionally, these improvements are consistent across Delta Lake, Apache Parquet, and Apache Spark™, so customers can interoperate between Delta and Apache Iceberg™ without rewriting data or row-level delete files.
In this blog post, we cover the newest developments in Iceberg v3:
Iceberg v3 introduces deletion vectors, a new format for row-level deletes that improves read performance. Row-level deletes significantly reduce write amplification by optimizing how deleted rows are stored and tracked, which leads to faster ETL and ingestion. In Iceberg v2, engines were not required to compact delete files together during writes; the intent was for customers to run asynchronous maintenance instead. However, many customers did not schedule maintenance services, so their tables accumulated unmaintained delete files. Reads became slow because engines had to merge many row-level delete files at read time.
Iceberg v3 introduces a new deletion vector format and new compaction requirements for delete files. The new format avoids translating between Parquet files and the in-memory representations used to apply the deletes. Additionally, engines must maintain a single deletion vector per data file at write time, which improves read performance and the statistics kept on data files. Having one vector per file also makes it easy to compare previous and current deletes, which simplifies processing a table's row-level changes as a stream.
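To make the write-time requirement concrete, here is a minimal sketch in Python. It is not Iceberg's implementation: the `DeletionVector` class and the helper functions are hypothetical names, and a plain Python set stands in for the compressed bitmap a real engine would use. It only illustrates the idea of keeping one vector per data file by merging new deletes into the existing vector, and of diffing two versions of a vector to find newly deleted rows.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DeletionVector:
    """Deleted row positions for one data file (illustrative stand-in, not the real format)."""
    data_file: str
    positions: set[int] = field(default_factory=set)

    def merge(self, new_positions) -> DeletionVector:
        # Return a new vector so the previous version stays available for comparison.
        return DeletionVector(self.data_file, self.positions | set(new_positions))


def commit_deletes(table_dvs: dict[str, DeletionVector], data_file: str, new_positions) -> DeletionVector:
    """Write-time rule from the text: merge so each data file has a single vector."""
    existing = table_dvs.get(data_file, DeletionVector(data_file))
    table_dvs[data_file] = existing.merge(new_positions)
    return table_dvs[data_file]


def read_live_rows(rows: list, dv: DeletionVector | None) -> list:
    """Apply the vector on read: skip every position marked as deleted."""
    if dv is None:
        return list(rows)
    return [row for pos, row in enumerate(rows) if pos not in dv.positions]


def newly_deleted(previous: DeletionVector, current: DeletionVector) -> list[int]:
    """Positions deleted in the current version but not in the previous one."""
    return sorted(current.positions - previous.positions)
```

In this sketch, calling `commit_deletes` twice for the same file leaves one entry in `table_dvs`, and `newly_deleted` on the two returned vectors yields only the positions added by the second commit, which is the kind of comparison that makes streaming row-level changes simpler.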
Another major Iceberg v3 feature is row lineage, used to simplify incremental processing. With row lineage, engines find row-level changes by matching versions of rows across commits.
Iceberg v3 introduces row lineage using row-level metadata: a row ID and the sequence number at which the row was last modified or added. The IDs identify the same row across versions. Sequence numbers record when a row was last changed, as opposed to merely relocated between files. This allows engines to process changes selectively, simplifying downstream updates with faster and cheaper workflows.
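The following Python sketch shows how these two pieces of metadata could be used to find row-level changes between two versions of a table. It is an illustration under assumptions, not engine code: the `Row` class, its field names `row_id` and `last_updated_seq`, and the `classify_changes` function are hypothetical, and both table versions are modeled as in-memory lists.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    row_id: int            # stable identity of the row across versions (hypothetical name)
    last_updated_seq: int  # sequence number when the row was last modified or added
    value: str


def classify_changes(before: list[Row], after: list[Row], last_processed_seq: int) -> dict[str, list[int]]:
    """Match rows by ID across two versions and report what changed.

    Rows whose sequence number is not newer than last_processed_seq are skipped,
    so a row that was only relocated between files is not reported as a change.
    """
    before_ids = {r.row_id for r in before}
    after_ids = {r.row_id for r in after}
    changed = [r for r in after if r.last_updated_seq > last_processed_seq]
    return {
        "inserted": sorted(r.row_id for r in changed if r.row_id not in before_ids),
        "updated": sorted(r.row_id for r in changed if r.row_id in before_ids),
        "deleted": sorted(before_ids - after_ids),
    }
```

A downstream job that remembers the last sequence number it processed can call `classify_changes` with that value and touch only the rows that were inserted, updated, or deleted since then.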
Row ID information is especially beneficial when combined with incremental processing objects like materialized views. These objects are optimized to compute only new or changed data since the last processing cycle.