
Data Model Design

Description

Data Model Design in Celonis is the process of structuring raw business data into a process-aware schema optimized for process mining and PQL analysis.

A well-designed data model serves as a digital twin of your business process. It transforms raw tables into a format that is both human-readable and computationally optimized for the Celonis analytical engine. The primary objective is to define the logical connections between process events and business context, ensuring that PQL queries can aggregate and navigate data with high performance.

Note

For more general information on configuring your Data Model, refer to Extraction task best practice and Modeling your data.

Why the Data Model matters for PQL

The Data Model acts as the semantic layer foundation for PQL. While standard SQL requires manual JOIN statements to link datasets, PQL utilizes the Foreign Key relationships defined in your model to perform Implicit Joins. This allows users to query attributes across multiple tables without having to specify join logic in every statement.

Furthermore, the model defines the cardinality (1:N relationships) between tables. This is foundational for Pull-Up (PU) Aggregations, which allow for seamless data aggregation from "N-side" tables (e.g., individual Activities) to "1-side" tables (e.g., the Process Case), maintaining data integrity and performance at scale.
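
As an illustration of both mechanisms, consider a hypothetical model with a case table "Cases" (1-side) and an activity table "Activities" (N-side); these table and column names are assumptions made for this sketch, not part of any standard model. In an analysis component, the two PQL expressions below can be combined without writing any join logic:

  Dimension:  "Cases"."Country"
  KPI:        PU_COUNT ( "Cases" , "Activities"."Activity" )

The dimension is resolved through the implicit join between the two tables, and the KPI pulls the number of events per case from the N-side up to the 1-side without an explicit GROUP BY, because the 1:N relationship already defines the aggregation scope.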

Background and Core Components

Process mining is based on event logs. To enable PQL to reconstruct a process flow, three event attributes are always required:

  • Case: An instance of a process (often a non-negative number). This attribute indicates which process instance the event belongs to.

  • Activity: The action captured by the event in text (e.g., "Sales order created").

  • Timestamp: The precise time when the event took place.

There is also an optional attribute, Sorting, which defines the order of events whose timestamps within one case are not unique. See Activity table sorting for more information.

A sequence of events, ordered by their timestamps, that belong to the same case is called a trace. The traces of all cases with the same activity sequence represent a variant. See VARIANT for more information. The throughput time of a case is the time difference between the first and the last event of the corresponding trace. See CALC_THROUGHPUT for more information.
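
Sticking with the hypothetical "Activities" table from the sketch above (its name and columns remain assumptions for illustration), the variant and the case throughput time can be queried directly:

  Variant:          VARIANT ( "Activities"."Activity" )
  Throughput time:  CALC_THROUGHPUT ( CASE_START TO CASE_END , REMAP_TIMESTAMPS ( "Activities"."Timestamp" , DAYS ) )

CALC_THROUGHPUT measures the time between the first and the last event of each trace, here remapped to days; other time units such as HOURS or MINUTES are also supported.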

Activity Tables

The activity table contains the required event log attributes. Within one case, rows are always sorted based on the timestamp and the optional sorting column. Usually, this table is generated via SQL in a transformation (ETL) step.

Data models can also contain multiple activity tables (Multi-Event Log setup) to connect multiple processes within a single model. Utilizing a Multi-Event Log is particularly useful for end-to-end process visibility. Multi-Event Logs are handled differently depending on whether the data model is object-centric (OCPM) or case-centric.

In OCPM data models, multiple activity tables can be combined into Eventlogs via the CREATE_EVENTLOG operator, enabling them to be consumed by other Process Mining operators. In case-centric data models, Multi-Event Log use cases are handled via operators, such as MERGE_EVENTLOG, AUTOMERGE, and TRANSIT_COLUMN.
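
As a rough sketch of the case-centric approach, two activity tables can be merged into a single event log and then consumed by other Process Mining operators. The table names "Activities_SO" and "Activities_DN" are invented for illustration, and the exact argument form should be verified against the MERGE_EVENTLOG reference:

  VARIANT ( MERGE_EVENTLOG ( "Activities_SO"."Activity" , "Activities_DN"."Activity" ) )

This returns one variant per case, interleaving the events of both activity tables by timestamp.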

Note

There are specific rules for table and column names. For more information, see Supported table and column names.

Case Tables

The case table acts as the dimension table in the snowflake schema, containing one unique row per case ID. If a case table is not specified in the user interface, it is automatically generated during the load, consisting of all distinct case IDs from the activity table. This guarantees that a case table always exists for PQL Process Mining operators to rely on.
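
For example, whether the case table was modeled explicitly or generated automatically during the load, it can be referenced like any other table in PQL; "Cases" is again an assumed table name:

  COUNT_TABLE ( "Cases" )

This returns the number of rows of the case table, i.e. the number of distinct cases in the data model.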

Relationships and Data Types

PQL relies on 1:N relationships to perform implicit joins and navigate to Master Data. To ensure relationships are correctly established, PQL is strictly typed:

  • Timestamps must be DATETIME.

  • Identifiers must be consistent in type (usually STRING or INTEGER).

Design Principles for Robust Data Models

Adhere to these best practices to ensure PQL analysis is accurate and scalable:

  • Maintain a Strict Snowflake Schema: Organize data into a hierarchy where one table acts as the central hub. Avoid "Flat Files" (single massive tables), as they hinder PQL's ability to aggregate data efficiently. Ensure that the 1-side of each relationship is a lookup/master table and that the N-side is a transactional or activity table.

  • Minimize Activity Table Grain: Only include activities that represent meaningful business milestones. Excessive "noise," such as automated technical logs that occur every second, increases data volume and can obscure the actual process flow in PQL analysis.

  • Ensure Temporal Consistency: Timestamps must be provided in a consistent timezone across all tables. Where needed, utilize an Integer Sorting Column to define a logical order for simultaneous events; without this, PQL may not reliably determine the "Source" and "Target" of a process edge.

  • Avoid Circular Dependencies: Table relationships must be directed and acyclic. Never introduce a loop where Table A links to Table B and then to Table C, which then links back to Table A. Circular loops will cause data model loads to fail.

  • Use High-Cardinality Keys for Joins: Ensure join columns (Foreign Keys) contain unique or high-cardinality data. Avoid joining on generic columns, such as "Status", and instead use unique identifiers like "Invoice_ID". Joining on generic columns can lead to data inflation and incorrect PQL aggregations.

  • Strategic Use of Master Data: Store static attributes (e.g., "Vendor Name" or "Material Group") in Master Data tables linked to the Case ID rather than duplicating them in every row of the Activity Table. This reduces the storage footprint and accelerates PQL filtering (see the sketch after this list).
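
As a sketch of the last principle, assume a master data table "Vendors" linked to the "Cases" table via a vendor ID; all names are illustrative. The static attribute can then be used as a dimension and combined with a pull-up KPI, without ever being duplicated into the activity table:

  Dimension:  "Vendors"."Vendor_Name"
  KPI:        PU_COUNT ( "Vendors" , "Cases"."Case_ID" )

The KPI counts cases per vendor by pulling the case IDs up along the 1:N relationship between "Vendors" and "Cases".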

Impact of the Data Model Design on PQL

The model dictates how PQL interprets data and executes queries:

  • Aggregation Scope: PQL uses the defined 1:N links to perform implicit grouping in Pull-Up (PU) functions, moving data from "N-side" tables to the "1-side" level.

  • Filtering Propagation: Filters propagate through the data model from the filtered table to all other reachable tables via the specified Foreign Keys.

  • Performance: Depth of joins and table size are the primary drivers of query speed.

  • Process Logic: Designating the Activity Table enables process-specific operators such as SOURCE() and TARGET() (see the sketch after this list).
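
For instance, once an activity table has been designated (sticking with the assumed "Activities" table), SOURCE and TARGET expose the two ends of each process edge, which enables edge-level measurements such as the average time between directly-following events:

  Dimension:  SOURCE ( "Activities"."Activity" )
  Dimension:  TARGET ( "Activities"."Activity" )
  KPI:        AVG ( SECONDS_BETWEEN ( SOURCE ( "Activities"."Timestamp" ) , TARGET ( "Activities"."Timestamp" ) ) )

This relies on the timestamps being stored as DATETIME, as described under Relationships and Data Types.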

Case-Centric vs. Object-Centric Models

While the classic Celonis data model is Case-Centric, the platform also supports Object-Centric Process Mining (OCPM). Understanding the structural differences is key to choosing the right architecture for your PQL analysis.

  • Case-Centric: A single-perspective hierarchy based on a unique Case ID (e.g., a Purchase Order). It uses a Snowflake schema and relies on 1:N relationships. Data is often "flattened," which can lead to duplication in many-to-many scenarios.

  • Object-Centric: A multi-dimensional network where business entities (e.g., Orders, Invoices, Items) are represented as independent Objects linked by shared Events. This allows for a 360-degree view, enabling users to switch perspectives without rebuilding the data model.

Note

For more information on object-centric data models, see OCPM Perspective.

Available training

The Celonis Academy offers an extensive curriculum designed to build foundational data modeling skills, explore complex real-world use cases, and provide hands-on experience with advanced configuration techniques. For more information on these courses, see Celonis Academy Data Model courses.