Databricks Inc.

04/08/2025 | News release | Archived content

Privacy-centric collaboration on AI with Databricks Clean Rooms

Access to high-quality, real-world data is crucial for developing effective machine learning models. However, when this data contains sensitive information, organizations face a significant hurdle in enabling data science teams to work with valuable data assets without compromising privacy or security. Traditional approaches often involve time-consuming data anonymization processes or restrictive access controls, which can hinder productivity and limit the potential insights gleaned from the data.

Databricks Clean Rooms takes a different approach. By offering a secure, collaborative environment, clean rooms enable data science teams to train or fine-tune ML models on sensitive data without directly accessing or exposing the underlying information. This approach both strengthens data protection and shortens the time needed to develop data-driven models.

Machine learning on sensitive data has diverse applications across industries. In healthcare, models can predict patient outcomes or classify cell types using protected health information without exposing individual records. Financial institutions can develop sophisticated credit scoring and fraud detection models using confidential transaction data. In advertising, companies can leverage machine learning to improve ad targeting and personalization while preserving user privacy.

This blog post walks through the process and setup that Databricks customers can use to train and deliver ML models in a privacy-centric way. We'll use the example of a healthcare provider that wants to build a model to predict patient readmission risk using sensitive data from electronic health records (EHR).

Scenario & Actors

In a typical organization, data management and data analysis are handled by separate departments. At a healthcare provider, for example, data is usually governed and managed centrally by data owners, while the people analyzing it are subject matter or technical experts who understand the domain. For our example, let's assume there are two actors:

  • Data Owner - Responsible for the governance, quality, and security of EHR data within the organization. They establish policies for data access, usage, and compliance.
  • ML Expert - A data scientist responsible for developing and assessing ML models using healthcare data. They work with clinical experts to frame relevant questions and build models according to requirements.

Goal: The Data Owner wants to empower the ML Expert to build a model while restricting direct access to the sensitive EHR data. At the same time, the ML Expert wants to iterate on the training code and improve the model as required. The result of this collaboration is a trained model that can be used to predict readmission risk.

Databricks Requirements

  • An account that is enabled for serverless compute. See this guide to enable serverless compute.
  • Workspace(s) that are enabled for Unity Catalog. Check out this guide to enable Unity Catalog.
  • Delta Sharing enabled for the Unity Catalog metastore. Follow this guide to enable Delta Sharing on a metastore.
  • Both the Data Owner and the ML Expert have the CREATE CLEAN ROOM privilege. Use this guide to manage privileges in Unity Catalog.
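
As a rough illustration of the last requirement, the privilege could be granted with a SQL statement along the following lines. This is a minimal sketch, not the documented procedure: the two principal names are hypothetical, and the GRANT ... ON METASTORE form is assumed from Unity Catalog's general privilege syntax, so check it against the privileges guide before running it.

```python
# Hypothetical principals; replace with real users or groups in your account.
DATA_OWNER = "data.owner@example.com"
ML_EXPERT = "ml.expert@example.com"


def grant_create_clean_room(principal: str) -> str:
    """Build a SQL statement granting CREATE CLEAN ROOM on the metastore.

    The statement shape is an assumption based on Unity Catalog's usual
    GRANT <privilege> ON <securable> TO <principal> syntax.
    """
    # The principal is quoted with backticks, so an embedded backtick
    # would produce a malformed statement.
    if "`" in principal:
        raise ValueError("principal must not contain a backtick")
    return f"GRANT CREATE CLEAN ROOM ON METASTORE TO `{principal}`"


# Both actors in the scenario need the privilege.
statements = [grant_create_clean_room(p) for p in (DATA_OWNER, ML_EXPERT)]

for stmt in statements:
    print(stmt)
    # In a Databricks notebook, a user with sufficient rights on the
    # metastore could then run: spark.sql(stmt)
```

The helper only builds the statement text, so it can be reviewed before anything is executed against the metastore.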

The Setup