
Connecting to Databricks (extractor)

The Celonis Databricks extractor allows you to transfer data from your Databricks Lakehouses (or Databricks SQL endpoints) to the Celonis Platform for process mining and analysis.

Prerequisites

This section details important prerequisites and background knowledge for using this extractor.

Before creating a connection between your database and the Celonis Platform you must decide which connection type you want to use. Except where stated in Supported database types, all databases have two basic connection types: Direct connections and Uplink connections via an on-premise extractor, as described below:

  • Direct connections: Use direct connections when you want to allow the Celonis Platform direct access to your database without additional infrastructure. This means you don't need to install, patch, or maintain on-premise extractors, which speeds up implementation, reduces complexity, and simplifies operations.

    Note

    By default, all cloud-based extractors are direct connections.

  • Uplink connections via an on-premise extractor: Use uplink connections when you don't want to or can't allow the Celonis Platform to directly access your on-premise or private cloud database. The connection between the database and Celonis is then established using an on-premise extractor that's installed within your network, ideally on a dedicated server.

    The role of the extractor is to poll and fetch job requests from the Celonis Platform, then submit the execution information to the database as an SQL query. Once the data is retrieved from the database, the extractor fetches it and sends it back to the Celonis Platform. As such, the connection between the database and the Celonis Platform is always initiated by the extractor, which continuously queries the Celonis Platform for extractions to execute (see the sketch after the note below).

    Note

    To use an uplink connection, you must install an on-premise extractor in your environment. To do so, see Setting up. Additionally, if you want to use a proxy (optional), see Proxy settings for on-prem clients.
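
To make this polling flow concrete, the following is an illustrative, heavily simplified sketch in Python. All endpoints and helper names here are hypothetical (the extractor's real internals are not part of this documentation); the sketch only mirrors the data flow described above:

    # Hypothetical sketch only: poll_for_extraction_job, execute, and
    # upload_results are illustrative names, not real Celonis or Databricks APIs.
    import time

    def uplink_loop(celonis_api, database):
        while True:
            # The extractor always initiates the connection to the Celonis Platform.
            job = celonis_api.poll_for_extraction_job()
            if job is not None:
                rows = database.execute(job.sql)          # run the extraction query
                celonis_api.upload_results(job.id, rows)  # send the results back
            time.sleep(30)                                # keep polling for new jobs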

Network settings

For the database extractor to communicate with your database and the Celonis Platform, you must modify your network settings to allow access.

Note

Follow the instructions in the network settings section below based on the connection type you are using. Additionally, if you are using uplink connections, follow the instructions in Celonis Platform IP addresses depending on the cluster.

The following network settings apply only for direct connections:

| Source system | Target system | Port | Protocol | Description |
| --- | --- | --- | --- | --- |
| Celonis Platform | Source system | Database-dependent; typical ports are 5432 for PostgreSQL and 30015 for SAP HANA | TCP | JDBC connection from the Celonis Platform to the database. The port is the one you normally use to connect to the database. The IPs of the Celonis Platform depend on the cloud cluster (which can be seen in the URL). |

The following network settings apply only for uplink connections (via the on-premise extractor):

| Source system | Target system | Port | Protocol | Description |
| --- | --- | --- | --- | --- |
| On-premise extractor server | Source system | Database-dependent; typical ports are 5432 for PostgreSQL and 30015 for SAP HANA | TCP | JDBC connection from the on-premise extractor server to the database. The port is the one you normally use to connect to the database. |
| On-premise extractor server | Celonis Platform | 443 | TCP | HTTPS connection from the on-premise extractor server to the Celonis cloud endpoint. The IPs of the Celonis Platform depend on the cloud cluster (which can be seen in the URL). |

The respective clusters use three IP addresses each, so you need to allow all three in your firewall configuration to connect the on-premise extractor server and the cloud endpoint.

For a complete list of inbound and outbound Celonis Platform IP addresses to be allowlisted if needed, see: Allowlisting Celonis domain names, IP addresses, and third-party domains

Custom JDBC string guidelines

This section describes the guidelines for using custom JDBC strings in extractor configurations:

  • Authentication: The Credentials fields in the extractor configuration are required and always used to authenticate the connection. Do not embed credentials directly in your JDBC string.

  • Encryption: For standard (unencrypted) extractors (examples: SAP HANA, PostgreSQL), you can enable encryption by adding encrypt=true to the JDBC string. For encrypted extractors (examples: SAP HANA encrypted, PostgreSQL encrypted), connections are established with encryption enabled (encrypt=true) by default. You do not need to include this parameter in your JDBC string.

  • Certificate validation: Do not include validateCertificate=true in your JDBC strings. Instead, use Advanced Settings > Validate Certificate > Enabled.

  • Additional properties: You can include additional properties in either the JDBC string or the Additional Properties field. Do not specify the same properties in both places.
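
For illustration, a custom JDBC string for Databricks that follows these guidelines might look like the example below. The host, warehouse ID, and property values are placeholders to adapt to your environment; note that no credentials are embedded in the string:

    jdbc:databricks://adb-<workspace-id>.<region>.azuredatabricks.net:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/<warehouse_id>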

If the JDBC driver you're using doesn't support certificate validation, you can edit the application-local.yml file to disable this check.

To do this, open the application-local.yml file (found in the package directory), and add the following configuration:

database:
    validateCertificateSupported: false

Authentication methods

This extractor supports the authentication methods described in the following sections.

Personal access token (PAT)

The Databricks extractor supports authentication using a Personal Access Token (PAT). A PAT is a secure, token-based alternative to using user credentials for authenticating API or JDBC connections.

When you use PAT authentication, Databricks verifies requests based on the token instead of a username and password. This allows automated systems or integrations to access your Databricks workspace securely, without requiring interactive sign-ins.

Tokens are typically generated and managed within your Databricks workspace settings and should be stored securely, as they provide the same access permissions as the account under which they are created.

Note

For more information about managing personal access tokens, see the Databricks documentation – Manage personal access tokens.

For Credentials > Personal access token in the extractor configuration, provide your PAT in the Personal Access Token field.
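
If you want to sanity-check a PAT outside of Celonis before configuring the connection, a minimal sketch using the databricks-sql-connector Python package is shown below. The hostname, HTTP path, and token values are placeholders:

    # pip install databricks-sql-connector
    from databricks import sql

    # Placeholders: your workspace hostname, warehouse HTTP path, and PAT.
    with sql.connect(
        server_hostname="adb-<workspace-id>.<region>.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/<warehouse_id>",
        access_token="<personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            cursor.execute("SELECT 1")  # trivial query to confirm the token works
            print(cursor.fetchall())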

Microsoft Entra ID

The Databricks extractor supports authentication through Microsoft Entra ID (formerly Azure Active Directory). This method uses an OAuth-based identity registered in Entra ID to securely authenticate Databricks connections.

Table 7. Celonis extractor values for Microsoft Entra ID authentication

| Extractor value | Microsoft Entra ID value | Description |
| --- | --- | --- |
| Client ID | Application (client) ID | The unique identifier for your registered app in Microsoft Entra ID. To learn how to register an application, see Microsoft documentation – Register an application in Microsoft Entra ID. |
| Client Secret | Client Secret | The secret key that the extractor uses to authenticate as the registered application. To create and manage client secrets, see Microsoft documentation – Create a service principal in Microsoft Entra ID. Note: copy the client secret immediately after creating it in Microsoft Entra ID; if lost, generate a new secret and update the Databricks configuration. |
| Tenant ID | Directory (tenant) ID | Identifies the Microsoft Entra organization associated with your environment. To locate your tenant information, see Microsoft documentation – Find your Microsoft Entra tenant ID. |



Note

Before setting up the connection in Celonis, ensure the registered app in Microsoft Entra ID has the required API permissions for the resource you want to access.

For your Databricks configuration, ensure you provide the correct values for Client ID, Client Secret, and Tenant ID (as applicable).
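
If you want to verify these values independently of the extractor, the following is a hedged sketch of acquiring a token with Microsoft's msal Python package. The scope shown uses the commonly documented Azure Databricks resource ID; confirm it against the Microsoft and Databricks documentation for your environment:

    # pip install msal
    import msal

    # Placeholders: the three values from the table above.
    app = msal.ConfidentialClientApplication(
        client_id="<application-client-id>",
        client_credential="<client-secret>",
        authority="https://login.microsoftonline.com/<tenant-id>",
    )
    # 2ff814a6-... is the Azure Databricks resource ID commonly used for this scope.
    result = app.acquire_token_for_client(
        scopes=["2ff814a6-3304-4ab8-85cb-cd0e6f879c1d/.default"]
    )
    print(result.get("access_token", result))  # token on success, error details otherwise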

Active Directory service principal

Databricks supports authentication using an Active Directory service principal. A service principal acts as an application identity that can securely access Databricks resources without relying on a specific user account.

This authentication method is recommended for automated or system-to-system integrations, where access is managed through Azure Active Directory instead of individual user credentials. The connection authenticates using the service principal’s client (application) ID and client secret.

To enable OAuth authentication for a service principal, you must register the application in Azure Active Directory and create a client secret. For a complete walkthrough, see the Azure Databricks documentation – Configure OAuth M2M authentication.

To ensure proper access, assign the following permissions to the service principal:

  • Allow Workspace access – Allows the service principal to connect to the Databricks workspace.

  • Allow Databricks SQL access – Enables querying data in SQL warehouses.

  • Catalog privileges – Grant SELECT, USAGE, and READ METADATA on the catalogs or schemas used for extraction (see the sketch at the end of this section).

Note

Roles and entitlements can be managed directly in Databricks under Settings > Identity and Access > Service Principals, or through Azure Active Directory role assignments.

For Credentials > Active Directory in the extractor configuration, provide the following authentication values:

  • Principal ID – The client (application) ID of the service principal.

  • Principal Secret – The client secret created in Azure Active Directory for the service principal.
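
To grant the catalog privileges listed above, you can run GRANT statements in Databricks. The following is a sketch using the databricks-sql-connector Python package; the schema name and application ID are placeholders, and privilege names differ between Unity Catalog and the legacy metastore, so verify them against the Databricks documentation:

    # pip install databricks-sql-connector
    from databricks import sql

    # Placeholders throughout; run with an identity that has grant rights.
    with sql.connect(
        server_hostname="adb-<workspace-id>.<region>.azuredatabricks.net",
        http_path="/sql/1.0/warehouses/<warehouse_id>",
        access_token="<admin-personal-access-token>",
    ) as connection:
        with connection.cursor() as cursor:
            for privilege in ("USAGE", "SELECT", "READ_METADATA"):
                cursor.execute(
                    f"GRANT {privilege} ON SCHEMA <schema_name> TO `<application-id>`"
                )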

Configuring the Databricks extractor

This section describes the basic configuration of the Databricks extractor. To configure it:

  1. In the Celonis Platform left navigation, select Data > Data Integration.

  2. On the Data Pools screen, select the data pool you want to use for the extraction.

    Note

    If you do not have a data pool to use for this extraction, see Creating and managing data pools for instructions on how to create one.

  3. In the Data Integration section, select Connect to Data Source.

    Note

    If this is not the data pool's first connection, the Data Connections window opens instead. Select + Add Data Connection to add a new connection.

  4. In the Add Data Connection window, select Connect to Data Source.

  5. In the Connect to Data Source window, depending on your use case, select either Database – On Premise or Database – Cloud.

    Note

    Select Database – On Premise to connect to on-premise or private cloud databases.

    1. If you selected Database – On Premise, follow the on-screen instructions.

  6. In the New Database Data Connection window, fill in the following information:

    1. For Name, provide a name for this configuration.

    2. For Database Type, select Databricks.

    3. For Connection Type, select either Standard or Custom JDBC Connection String.

      1. If you selected Standard:

        • For Host, enter the base URL of your Databricks workspace. This is the hostname portion of your Databricks environment URL, for example:

          adb-<workspace-id>.<region>.azuredatabricks.net
          
        • For Port, provide the port to connect to (Default is 443).

        • For Database Name, enter the name of the database that contains the schema you want to extract data from.

        • (Optional) For Schema Name, enter the name of the schema that contains the tables to extract.

        • (Optional) For Additional Properties, enter any additional connection properties required by your database or driver. Separate each with ;.

      2. If you selected Custom JDBC Connection String:

        Important

        When using JDBC strings, there are specific guidelines to follow. For more information, see Custom JDBC string guidelines.

        • For JDBC Connection String, provide your string. Use the format:

          jdbc:<driver_name>://<host>:<port>/<database_name>;property1=value1;property2=value2...

          Note

          For <driver_name>, provide the name of the JDBC driver used to connect to your data source (for example, sqlserver, postgresql, or spark).

          For more information on connecting to Databricks with JDBC strings, see the Databricks documentation.

        • For HTTP Path, provide the Databricks endpoint path used to connect to your SQL warehouse or cluster. You can find this in your Databricks workspace under SQL Warehouses → [Your Warehouse] → Connection Details. The value typically looks like /sql/1.0/warehouses/<warehouse_id>.

        • Optionally, provide values for:

          • Schema Name: Enter the name of the schema that contains the tables to extract.

          • Additional Properties: Enter any additional connection properties required by your database or driver. Separate each with ;.

    4. For Credentials, select the type of authentication you want to use for this connection. For more information, see Authentication methods.

      Note

      Ensure the credentials used have sufficient permissions to access the data to be extracted.

    5. If desired, select Advanced Settings, and update these parameters as needed.

      Note

      The Advanced Settings > Validate Certificate parameter (Default: DISABLED) controls whether the extractor validates the server's SSL/TLS certificate:

      • Disabled: Disables certificate validation (validateCertificate=false).

      • Enabled: Enforces certificate validation (validateCertificate=true).

      • Removed: Uses the driver’s default behavior. Check the driver documentation to confirm the default.

  7. Select the Test Connection button to confirm the extractor can connect to the host system. If the test fails, adjust the data in the configuration fields as needed.

  8. Once the test connection passes, select the Save button to continue. This returns you to the Data Integration window.