
Celonis Product Documentation

Records deduplication

This section assumes a basic understanding of Kafka topics, partitions, and message structure.

Data ingestion into the Celonis Platform relies on Parquet file uploads. As records accumulate in these files, records with the same primary key can end up in the same file, or in separate files that are processed together during ingestion. To ensure that the latest record is the one stored, a sortable field must be provided. Otherwise, there is no guarantee which of the records is kept.
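The idea of keeping the latest record per primary key can be sketched as follows. This is only an illustration of deduplication by a sortable field, not the platform's actual implementation; the function name `deduplicate` and the field names `id` and `seq` are hypothetical.

```python
from typing import Any, Dict, Hashable, Iterable, List


def deduplicate(
    records: Iterable[Dict[str, Any]], key_field: str, order_field: str
) -> List[Dict[str, Any]]:
    """Keep, for each primary key, the record with the highest order value."""
    latest: Dict[Hashable, Dict[str, Any]] = {}
    for record in records:
        key = record[key_field]
        current = latest.get(key)
        # Replace the stored record when the new one sorts at or after it.
        if current is None or record[order_field] >= current[order_field]:
            latest[key] = record
    return list(latest.values())


# Hypothetical batch: two records share the primary key id=1.
rows = [
    {"id": 1, "status": "created", "seq": 10},
    {"id": 2, "status": "created", "seq": 11},
    {"id": 1, "status": "shipped", "seq": 12},
]
result = deduplicate(rows, key_field="id", order_field="seq")
```

Without the `seq` field there would be nothing to compare, and either of the two `id=1` records could be the one that survives.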

When a primary key is configured, the connector automatically injects a field named __celonis_order to support deduplication. This field is populated with the offset of the incoming Kafka message. Because Kafka offsets are ordered only within a single partition, this approach requires that all records sharing the same primary key are always written to the same partition. When that is not the case, it is the user's responsibility to set connect.ems.order.field.name to a field in the incoming data that guarantees the order of records with the same primary key.
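The partition requirement can be illustrated with a small simulation. It is a sketch under assumptions, not the connector's code: the message dictionaries, the `pk` and `value` fields, and the helper `latest_by_offset` are hypothetical. It shows that comparing offsets picks the newest record only when all records for a key come from one partition.

```python
from typing import Any, Dict, List


def latest_by_offset(messages: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Pick, per primary key, the message with the highest offset."""
    latest: Dict[str, Dict[str, Any]] = {}
    for message in messages:
        current = latest.get(message["pk"])
        if current is None or message["offset"] > current["offset"]:
            latest[message["pk"]] = message
    return latest


# Both records for key "A" are on partition 0, so offsets reflect write order.
same_partition = [
    {"pk": "A", "partition": 0, "offset": 5, "value": "old"},
    {"pk": "A", "partition": 0, "offset": 9, "value": "new"},
]

# The newer record landed on partition 1, whose offsets are counted separately.
split_partitions = [
    {"pk": "A", "partition": 0, "offset": 900, "value": "old"},
    {"pk": "A", "partition": 1, "offset": 3, "value": "new"},
]

kept_same = latest_by_offset(same_partition)["A"]["value"]
kept_split = latest_by_offset(split_partitions)["A"]["value"]
```

In the second case the stale record is kept, because offsets from different partitions are not comparable. This is the situation in which a dedicated ordering field from the incoming data is needed.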