09/25/2025 | News release | Distributed by Public on 09/25/2025 13:27
In complex data pipelines with dozens of jobs and intermediary datasets, it can be difficult to effectively monitor how data travels and changes through various steps. When tracking issues in these pipelines, you need visibility into upstream components, where the root cause may originate, as well as downstream datasets and data consumers that may be experiencing further impacts. Monitoring data lineage is essential for maintaining data quality, ensuring regulatory compliance, and investigating problems in organizations' data systems. Data lineage maps how data flows, changes, and gets used within complex processing pipelines and application services.
Workflow orchestration tools like Apache Airflow are a particularly important source of lineage metadata because they schedule and manage tasks across the entire data pipeline, from storage and ingestion to processing and reporting. In this post, we'll explore how collecting lineage from Airflow can help you trace errors to their upstream root causes, map the downstream impact of bad data, and maintain data quality and governance.
Errors in data pipelines are rarely confined to a single failing component. For example, a failed task for one processor might stem from a botched SQL query upstream pulling relevant data from blob storage, then cause a downstream integration to reject unexpected input. Tracking and alerting on errors across your data pipeline is essential for preventing failed analytics, data loss, and other critical problems that can affect your applications' health and performance.
Particularly when your system relies on multiple Airflow directed acyclic graphs (DAGs), each of which has its own set of task dependencies and passes data between intermediate storage systems, it can be difficult to find upstream root causes when you are alerted to elevated errors in a particular task. Lineage graphs are critical for navigating this complexity, offering a visualized, metadata-rich map of task relationships and data flows across DAGs. Collecting lineage for DAG runs enables you to quickly trace issues up the pipeline to identify the underlying root cause.
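To make this concrete, the upstream traversal that a lineage graph enables can be sketched in plain Python. The graph structure and task names below are hypothetical, standing in for the task- and dataset-level edges that a lineage collector such as OpenLineage would record:

```python
from collections import deque

# Hypothetical lineage edges: each task or dataset maps to its direct
# upstream dependencies, spanning two DAGs and intermediate storage.
upstream = {
    "warehouse_load": ["transform_dbt"],
    "transform_dbt": ["s3_staging"],
    "s3_staging": ["ingest_from_api"],
    "ingest_from_api": [],
}

def trace_upstream(node, graph):
    """Breadth-first walk up the lineage graph, returning every ancestor
    of the failing node in the order it would be inspected."""
    seen, order = set(), []
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(graph.get(current, []))
    return order

# A failure in warehouse_load leads back through the dbt transformation
# to the original ingestion job.
print(trace_upstream("warehouse_load", upstream))
# → ['transform_dbt', 's3_staging', 'ingest_from_api']
```

In practice, the lineage metadata supplies these edges automatically; the point is that a single failing node resolves to an ordered list of upstream candidates to inspect.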
For instance, let's say one of your DAGs uses SQLExecuteQueryOperator to manage data in a Snowflake data warehouse. You see an alert for repeated failures of a task containing an operation that conditionally writes data to the warehouse following transformation steps. The warehouse is rejecting the data, but it's unclear why. Using lineage data, you can quickly trace the issue upstream, inspecting the initial ingestion job that pulls the raw data from an S3 bucket and the subsequent dbt task, orchestrated by another DAG, that applies transformations to prepare the data for the warehouse.
Lineage lets you watch what happens to the data as it's operated on by each task. Looking at the Airflow task that kicks off the dbt processor, you see that the processor is adding null characters to the data's primary key fields, causing the downstream SQL issue. The dbt task's lineage metadata also shows you the version of the dbt model causing this issue, so you can quickly roll it back and stop the bleeding.
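A lightweight data check of the kind that would catch this issue can be sketched in plain Python; the record layout and field names here are hypothetical:

```python
def find_corrupt_keys(records, key_field):
    """Return the records whose primary key contains a null character,
    the kind of corruption a faulty transformation might introduce."""
    return [r for r in records if "\x00" in r.get(key_field, "")]

rows = [
    {"user_id": "u-1001", "plan": "pro"},
    {"user_id": "u-1002\x00", "plan": "free"},  # corrupted key
]

bad = find_corrupt_keys(rows, "user_id")
print(len(bad))  # → 1
```

A check like this, run against the dbt task's output, would surface the bad primary keys before the downstream SQL task fails.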
While looking at lineage from upstream DAGs and tasks can help you home in on the root cause of an error, it's equally important to use downstream lineage to effectively map the blast radius of errors. It's only with complete visibility into downstream issues, such as broken analytics, processing errors, and storage problems, that you can effectively assess the business impact of critical errors in your workflows and ensure they are fully resolved.
Lineage graphs can help you quickly identify all the relevant consumers of bad data caused by issues with your DAGs, such as tables used in Tableau dashboards, ML training pipelines, or monitoring systems. For instance, let's say the DAG in our previous example caused a data freshness issue in the Snowflake table it feeds into, which is used for pricing decisions. This lack of properly updated data will affect all the downstream consumers of that table. By using lineage, you can look at downstream DAGs used by other teams: for instance, you might find that a separate internal analytics team is pulling from the pricing data table.
Once you've identified which downstream systems and teams are impacted, the next step is to take action. This might mean alerting the analytics team that dashboards are unreliable until the data is corrected, pausing dependent pipelines to prevent further propagation of errors, or re-running failed tasks once upstream data is restored.
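The downstream half of this workflow, mapping the blast radius from an affected dataset to its consumers, is the same traversal run in the other direction. A minimal sketch, with hypothetical dataset and consumer names:

```python
from collections import deque

# Hypothetical downstream lineage edges: dataset/task -> direct consumers.
downstream = {
    "pricing_table": ["pricing_dashboard", "analytics_etl"],
    "analytics_etl": ["internal_analytics_table"],
    "internal_analytics_table": ["team_bi_dashboard"],
}

def blast_radius(node, graph):
    """Collect every downstream consumer reachable from an affected dataset."""
    seen = set()
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return sorted(seen)

# Everything impacted by stale data in the pricing table:
print(blast_radius("pricing_table", downstream))
# → ['analytics_etl', 'internal_analytics_table', 'pricing_dashboard', 'team_bi_dashboard']
```

The resulting list is exactly the set of teams and systems to notify or pause while the upstream data is corrected.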
Data checks are essential for preventing data quality issues.
Column-level lineage is an essential data governance tool, providing granular insight into how data changes as it travels through a pipeline. By enabling you to trace the root input columns used to construct a downstream column in a given table, column-level lineage helps you investigate these kinds of data quality issues.
For instance, let's say you have a DAG that ingests user signup data from a CRM system. This data is critical for business intelligence, but to maintain regulatory compliance around PII, your pipeline includes a transformation task meant to redact fields such as names and email addresses before writing to an analytics table. The lineage graph shows the flow of data from the CRM ingestion all the way through to the final reporting in BI dashboards. You can use column-level lineage to compare the data in the warehouses on either side of the transformation task to validate that PII is being appropriately scrubbed. If, say, the reporting task starts pulling from the raw ingestion table due to a miscommunication between teams, the lineage graph makes that connection clearly visible, providing a clear indication of a compliance gap. Having identified this gap, you can move swiftly to fix any issues with the transformation task and ensure reporting processes are pulling from the correct table.
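A column-level compliance check of this kind can be sketched as a walk over column-to-column lineage edges. The table names, column names, and PII list below are hypothetical:

```python
# Hypothetical column-level lineage: each reporting column maps to the
# root input columns it was constructed from.
column_lineage = {
    "report.signup_count": ["crm_raw.user_id"],
    "report.region": ["crm_raw.region"],
    # A misconfigured task pulling straight from the raw ingestion table:
    "report.contact": ["crm_raw.email"],
}

PII_COLUMNS = {"crm_raw.email", "crm_raw.full_name"}

def find_pii_leaks(lineage, pii_columns):
    """Flag downstream columns whose root inputs include unredacted PII."""
    return {
        col: sorted(set(roots) & pii_columns)
        for col, roots in lineage.items()
        if set(roots) & pii_columns
    }

print(find_pii_leaks(column_lineage, PII_COLUMNS))
# → {'report.contact': ['crm_raw.email']}
```

With real column-level lineage metadata in place of the hard-coded dictionary, the same set-intersection logic surfaces compliance gaps automatically.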
Datadog Data Observability enables you to easily collect and monitor Airflow data lineage by using OpenLineage. By surfacing lineage metadata in your monitoring stack, you can trace errors to their root cause, assess the downstream impact of data issues, and enforce data quality and governance across your pipelines.
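As a rough sketch of how lineage emission is wired up, the OpenLineage Airflow provider is configured through an `[openlineage]` section in `airflow.cfg` (or equivalent environment variables). The endpoint URL and API key below are placeholders, not actual Datadog values; consult the Datadog and OpenLineage provider documentation for the exact intake settings:

```ini
[openlineage]
; Logical namespace used to group this Airflow deployment's lineage events
namespace = my_airflow_instance
; HTTP transport sending lineage events to your collector or intake endpoint
; (placeholder URL and key; see the Datadog docs for actual values)
transport = {"type": "http", "url": "https://<YOUR_INTAKE_ENDPOINT>", "auth": {"type": "api_key", "apiKey": "<YOUR_API_KEY>"}}
```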
Datadog Data Observability is now available in Preview. For more information about Data Observability, see our documentation. If you're interested in using OpenLineage with Datadog Data Observability, sign up for the Data Observability Preview. Or, if you're brand new to Datadog, sign up for a free trial.