
From Lag to Agility: Reinventing Freshworks’ Data Ingestion Architecture

As a global software-as-a-service (SaaS) company specializing in intuitive, AI-powered business solutions designed to enhance customer and employee experiences, Freshworks depends on real-time data to power decision-making and deliver better experiences to its 75,000+ customers. With millions of daily events across products, timely data processing is crucial. To meet this need, Freshworks built a near-real-time ingestion pipeline on Databricks, capable of managing diverse schemas across products and handling millions of events per minute within a 30-minute SLA, while ensuring tenant-level data isolation in a multi-tenant setup.

Achieving this requires a powerful, flexible, and optimized data pipeline, which is exactly what we set out to build.

Legacy Architecture and the Case for Change

Freshworks' legacy pipeline was built around Python consumers: each user action triggered events sent in real time from products to Kafka, and the Python consumers transformed and routed these events to new Kafka topics. A Rails batching system then converted the transformed data into CSV files stored in AWS S3, and Apache Airflow jobs loaded these batches into the data warehouse. After ingestion, intermediate files were deleted to manage storage. This architecture was well suited for early growth but soon hit its limits as event volume surged.

Rapid growth exposed core challenges:

  • Scalability: The pipeline struggled to handle millions of messages per minute, especially during spikes, and required frequent manual scaling.
  • Operational Complexity: The multi-stage flow made schema changes and maintenance risky and time-consuming, often resulting in mismatches and failures.
  • Cost Inefficiency: Storage and compute expenses grew quickly, driven by redundant processing and lack of optimization.
  • Responsiveness: The legacy setup couldn't meet demands for real-time ingestion or fast, reliable analytics as Freshworks scaled. Prolonged ingestion delays impaired data freshness and impacted customer insights.

As scale and complexity increased, the fragility and overhead of the old system made clear the need for a unified, scalable, and autonomous data architecture to support business growth and analytics needs.

New Architecture: Real-Time Data Processing with Apache Spark and Delta Lake

The solution was a foundational redesign centered on Spark Structured Streaming and Delta Lake, purpose-built for near-real-time processing, scalable transformations, and operational simplicity.

We designed a single, streamlined architecture in which Spark Structured Streaming directly consumes from Kafka, transforms the data, and writes it into Delta Lake, all in one job running entirely within Databricks.

This shift has reduced data movement, simplified maintenance and troubleshooting, and accelerated time-to-insight.
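
To make the shape of this concrete, here is a minimal sketch of the pattern (not Freshworks' actual code): a single PySpark Structured Streaming job that reads from Kafka, parses events, and writes to a Delta table inside foreachBatch. The broker address, topic, schema, checkpoint path, and table name are illustrative placeholders.

```python
# Minimal sketch: one job consumes Kafka, transforms, and writes to Delta Lake.
# Broker, topic, schema, checkpoint path, and table names are placeholders.
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

event_schema = T.StructType([
    T.StructField("uuid", T.StringType()),
    T.StructField("product", T.StringType()),
    T.StructField("payload", T.StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
       .option("subscribe", "product-events")              # placeholder topic
       .load())

events = (raw
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

def process_batch(batch_df, batch_id):
    # In the real pipeline this is where deduplication, validation, JSON-e
    # transforms, flattening, and Delta merges run (see the steps below).
    batch_df.write.format("delta").mode("append").saveAsTable("bronze.product_events")

(events.writeStream
 .foreachBatch(process_batch)
 .option("checkpointLocation", "/tmp/checkpoints/ingestion-sketch")  # placeholder
 .trigger(processingTime="1 minute")
 .start())
```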

The key components of the new architecture:

The Streaming Component: Spark Structured Streaming

Each incoming event from Kafka passes through a carefully orchestrated series of transformation steps in Spark Structured Streaming, optimized for accuracy, scale, and cost-efficiency:

  1. Efficient Deduplication:
    Events, identified by UUIDs, are checked against a Delta table of previously processed UUIDs to filter duplicates across streaming batches (see the sketch after this list).
  2. Data Validation:
    Schema and business rules filter malformed records, ensure required fields, and handle nulls.
  3. Custom Transformations with JSON-e:
    The JSON-e templating engine supports conditionals, loops, and Python UDFs, enabling product teams to define dynamic, reusable transformation logic tailored to each product.
  4. Flattening to Tabular Form:
    Transformed JSON events are flattened into thousands of structured tables. A separate internal schema management tool (managing 20,000+ tables and 5M+ columns) lets product teams manage schema changes and automatically promote them to production; the changes are registered in Delta Lake and picked up seamlessly by Spark Structured Streaming.
  5. Flattened Data Deduplication:
    A hash of the stored columns is compared against the last 4 hours of processed data in Redis, preventing duplicate ingestion and reducing compute costs, as shown in the sketch below.
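
A hedged sketch of the two deduplication steps (1 and 5) for a single micro-batch: the processed-UUIDs table name, the `uuid` column, the Redis connection details, and the key scheme are assumptions, not Freshworks' actual implementation.

```python
# Sketch of both deduplication steps for one micro-batch. The processed-UUIDs
# table, Redis connection details, and key scheme are assumptions.
from pyspark.sql import functions as F
import redis

def deduplicate(batch_df):
    spark = batch_df.sparkSession

    # Step 1: anti-join against the Delta table of previously processed UUIDs,
    # keeping only events not seen in earlier batches.
    seen = spark.read.table("meta.processed_uuids")          # placeholder table
    fresh = batch_df.join(seen.select("uuid"), on="uuid", how="left_anti")

    # Step 5: hash the stored columns of each (flattened) row; a row is kept
    # only if Redis has not seen that hash in the last 4 hours.
    hashed = fresh.withColumn(
        "row_hash",
        F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in fresh.columns]), 256),
    )

    def keep_unseen(rows):
        # One Redis connection per partition; SET NX with a 4-hour expiry
        # succeeds only the first time a given hash is written.
        client = redis.Redis(host="redis-host", port=6379)   # placeholder
        for row in rows:
            if client.set(f"dedup:{row['row_hash']}", 1, nx=True, ex=4 * 3600):
                yield row

    return spark.createDataFrame(hashed.rdd.mapPartitions(keep_unseen), hashed.schema)
```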

The Storage Component: Lakehouse

Once transformed, the data is written directly to Delta Lake tables using several powerful optimizations:

  • Parallel Writes with Multiprocessing:
    A single Spark job typically writes to ~250 Delta tables, each with its own transformation logic. The Delta merges are executed in parallel using Python multiprocessing, maximizing cluster utilization and reducing latency (see the sketch after this list).
  • Efficient Updates with Deletion Vectors:
    Up to 35% of records per batch are updates or deletes. Instead of rewriting large files, we leverage Deletion Vectors to enable soft deletes. This improves update performance by 3x, making real-time updates practical even at terabyte scale.
  • Accelerated Merges with Disk Caching:
    Disk Caching keeps frequently accessed (hot) data on fast local storage close to the compute. By caching only the columns needed for merges, we achieve up to 4x faster merge operations while reducing I/O and compute costs. Today, 95% of merge reads are served directly from the cache.
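
The parallel-merge pattern might look roughly like the sketch below, which uses a thread pool from Python's multiprocessing module. The routing column, merge key, table list, and pool size are illustrative assumptions, not Freshworks' actual code.

```python
# Sketch of parallel Delta merges from one micro-batch, using a thread pool from
# Python's multiprocessing module. Routing/key columns and pool size are assumed.
from multiprocessing.pool import ThreadPool
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def enable_deletion_vectors(spark, table_name):
    # One-time setup: lets MERGE mark rows as deleted in deletion vectors
    # instead of rewriting whole data files.
    spark.sql(f"ALTER TABLE {table_name} "
              "SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')")

def merge_one(spark, batch_df, table_name):
    updates = batch_df.filter(F.col("target_table") == table_name)  # assumed routing column
    (DeltaTable.forName(spark, table_name).alias("t")
        .merge(updates.alias("s"), "t.id = s.id")                    # assumed key column
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

def merge_all(spark, batch_df, table_names, parallelism=16):
    # Merges are driver-coordinated and I/O-bound, so a thread pool keeps
    # many concurrent table writes feeding the cluster.
    with ThreadPool(parallelism) as pool:
        pool.map(lambda t: merge_one(spark, batch_df, t), table_names)
```

On Databricks, the disk cache that accelerates the read side of these merges is controlled by the spark.databricks.io.cache.enabled setting.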

Autoscaling & Adapting in Real Time

Autoscaling is built into the pipeline so that the system scales up or down dynamically, handling volume spikes cost-efficiently without impacting performance.

Autoscaling is driven by batch lag and execution time, monitored in real time. Resizing is triggered via job APIs from Spark's StreamingQueryListener (the onQueryProgress callback that fires after each batch), ensuring in-flight processing isn't disrupted. This keeps the system responsive, resilient, and efficient without manual intervention.
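
A sketch of what lag-driven autoscaling from the listener could look like (requires Spark 3.4+ for the Python StreamingQueryListener API). The thresholds, target worker counts, and the use of the Databricks Clusters resize endpoint are illustrative assumptions; the post itself only says resizing goes through job APIs.

```python
# Sketch of lag-driven autoscaling from a StreamingQueryListener.
# Thresholds, worker counts, and the resize call are illustrative assumptions.
import os
import requests
from pyspark.sql.streaming import StreamingQueryListener

def resize_cluster(num_workers):
    # Assumed use of the Databricks Clusters API; host, token, and cluster id
    # come from the environment.
    host, token = os.environ["DATABRICKS_HOST"], os.environ["DATABRICKS_TOKEN"]
    requests.post(
        f"{host}/api/2.0/clusters/resize",
        headers={"Authorization": f"Bearer {token}"},
        json={"cluster_id": os.environ["CLUSTER_ID"], "num_workers": num_workers},
        timeout=30,
    )

class AutoscaleListener(StreamingQueryListener):
    def onQueryStarted(self, event): pass
    def onQueryTerminated(self, event): pass
    def onQueryIdle(self, event): pass

    def onQueryProgress(self, event):
        # Called after every micro-batch; scale on batch execution time.
        batch_seconds = event.progress.durationMs.get("triggerExecution", 0) / 1000
        if batch_seconds > 120:      # batches falling behind: scale out
            resize_cluster(num_workers=20)
        elif batch_seconds < 30:     # plenty of headroom: scale in
            resize_cluster(num_workers=8)

spark.streams.addListener(AutoscaleListener())   # spark: the active SparkSession
```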

Built-In Resilience: Handling Failures Gracefully

To maintain data integrity and availability, the architecture includes robust fault tolerance:

  • Events that fail transformation are retried via Kafka with backoff logic.
  • Permanently failed records are stored in a Delta table for offline review and reprocessing, ensuring no data is lost.
  • This design guarantees data integrity without human intervention, even during peak loads or schema changes, and allows failed data to be republished later (see the sketch below).
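
A minimal sketch of this dead-letter flow, assuming a retry counter carried with each record and placeholder topic/table names; the actual retry limit and backoff policy aren't described in the post.

```python
# Sketch of the dead-letter path: retryable records go back to a Kafka retry
# topic, exhausted records are parked in a Delta table. Names are placeholders;
# backoff/delay handling on the retry consumer side is omitted.
from pyspark.sql import functions as F

MAX_RETRIES = 3   # assumed retry limit

def handle_failures(failed_df):
    # Records under the retry limit return to a retry topic with an incremented counter.
    (failed_df.filter(F.col("retry_count") < MAX_RETRIES)
        .withColumn("retry_count", F.col("retry_count") + 1)
        .selectExpr("CAST(uuid AS STRING) AS key", "to_json(struct(*)) AS value")
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder brokers
        .option("topic", "events-retry")                     # placeholder topic
        .save())

    # Records that exhausted retries land in a Delta table for offline review
    # and later republishing, so nothing is silently dropped.
    (failed_df.filter(F.col("retry_count") >= MAX_RETRIES)
        .withColumn("failed_at", F.current_timestamp())
        .write.format("delta").mode("append")
        .saveAsTable("ops.failed_events"))                   # placeholder table
```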

Observability and Monitoring at Every Step

A powerful monitoring stack built with Prometheus, Grafana, and Elasticsearch, integrated with Databricks, gives us end-to-end visibility:

  • Metrics Collection:
    Every batch in Databricks logs key metrics, such as input record count, transformed record count, and error rates, which are exported to Prometheus with real-time alerts to the support team (see the sketch after this list).
  • Event Tracking:
    Event statuses are logged in Elasticsearch, enabling fine-grained debugging and allowing both product (producer) and analytics (consumer) teams to trace issues.
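
One way the per-batch metrics could be exported, assuming a Prometheus Pushgateway and the prometheus_client library; the metric names and gateway address are illustrative, not Freshworks' actual setup.

```python
# Sketch: push per-batch counters to a Prometheus Pushgateway.
# Metric names and the gateway address are assumptions.
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def publish_batch_metrics(batch_id, input_count, transformed_count, error_count):
    registry = CollectorRegistry()
    Gauge("ingest_input_records", "Records read from Kafka in this batch",
          registry=registry).set(input_count)
    Gauge("ingest_transformed_records", "Records successfully transformed",
          registry=registry).set(transformed_count)
    Gauge("ingest_error_records", "Records that failed validation or transform",
          registry=registry).set(error_count)
    # Grafana dashboards and alert rules read these series from Prometheus.
    push_to_gateway("pushgateway:9091", job="ingestion_pipeline", registry=registry)
```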

Transformation & Batch Execution Metrics:

[Dashboard screenshot: transformation and batch execution metrics]

These dashboards track transformation health, helping teams identify issues and trigger alerts for quick investigation.

From Complexity to Confidence

Perhaps the most transformative shift has been in simplicity.

What once involved five systems and countless integration points is now a single, observable, autoscaling pipeline running entirely within Databricks. We've eliminated brittle dependencies, streamlined operations, and enabled teams to work faster and with greater autonomy. Fewer moving parts mean fewer surprises and more confidence.

By reimagining the data stack around streaming and Delta Lake, we've built a system that not only meets today's scale but is ready for tomorrow's growth.
