Skip to main content

Celonis Product Documentation

Delta Transformations in Data Jobs

Warning

The feature is still in beta. Reach out to the Service Desk to enable it for your team.

For a performant and stable data pipeline only the newly extracted records should be transformed.

Let’s say you have fully loaded the case table, i.e. SalesOrderItems, and run transformations to build your Case table based on VBAP. The subsequent refreshes of VBAP, i.e. delta extractions, bring the new records from the source system. Of course, these new records should be propagated to the Case table as well, so that they are reflected in the user facing analysis. There are 2 ways in which you can do this:

  1. Drop and recreate the Case table (initialise it).

  2. Process only the new records, i.e. execute a delta transformation.

Option 1 ensures data integrity and is relatively easier to implement. However it is not performant, especially as the data size grows. This is not the recommended approach.

Option 2 requires more implementation effort because a MERGE should be done based on the primary keys. In this option, only the incoming date is processed which ensured the maximal performance. We recommend choosing this option.

Sections below explain delta transformations in more detail.

Storing the Delta in a Staging Table

To be able to process the delta we need to isolate it. This is done by pushing the extracted delta into a separate staging table. So if you are extracting BKPF, the new delta is pushed not only to the BKPF, but also in a staging table called _CELONIS_STAGING_BKPF. This enables the user to later access this data and use it for transformations.

Accessing the Delta from the Staging Table

After the extraction is complete, transformations should access the delta from the staging table. Theoretically we can select directly from the table _CELONIS_STAGING_BKPF, but this will make the SQL script unusable for full transformations, i.e. when we want to select from the source BKPF and recreate the data model.

To solve this we have introduced the concept of the placeholder. So in the SQL instead of referencing to BKPF, the select is done against <$BKPF>. Depending on the execution type - delta vs full - this placeholder is dynamically translated into a table name.

  • For full execution it is replaced by BKPF, so the query becomes SELECT * FROM BKPF.

  • For delta execution it is replaced by _CELONIS_STAGING_BKPF, o the query becomes SELECT * FROM _CELONIS_STAGING_BKPF.

This placeholder is also very important, because based on that the extractor decides for which tables to generate a staging table. See the next section for more details.

How is the Staging Table Generated?

The delta should be isolated only for the tables which are a source for transformations, i.e. against which a SELECT statement is written. So if you are extracting 4 tables, but only BKPF acts as source for select, then we need to capture the delta only for this table.

To reduce the manual steps, this step, i.e. identifying which tables to generate a staging table for, has been automated. Before running the extraction, the “SQL parsers” scans all the SQL scripts in the given Data Job. Whenever it spots placeholder, e.g. <$BKPF>, it also marks this table as eligible for generating a staging table for. If you have 10 tables in the extraction, but only one SELECT statement against a placeholder, i.e. <$BKPF>, then only one staging table will be generated - _CELONIS_STAGING_BKPF.

image1.png
Staging Table Clean Up

After a successful run, the staging table is cleaned up. This is to ensure that during the next execution we don’t reprocess the same records. However, it is important to note that not all records are cleaned up. For referential integrity we keep some of them to be reprocessed two times. This is mostly to address dependencies between tables, i.e. when there is a JOIN between them. In case a record in a dependent table arrives late, we might want to reprocess the records in a staging table, so that the JOIN works properly. One example is VBAK and VBAP. When building the case table based on VBAP, we fetch a lot of information from VBAK via a join. A VBAP record is extracted, but its header record might be missing due to timing issues. In that case we would like to reprocess the VBAP record again during the next delta run to ensure that VBAK has already been extracted.

How the SQL should be written

Delta transformations requires to:

  • persist the transformed data

  • merge the incoming data rather than drop/recreate

The SQL scripts should implement these 2 requirements. The transformed data should be persisted. So if you are doing some transformations based on BSID, then you should create a target table which will store those calculations. Here is an example:

DROP TABLE IF EXISTS BSID_CALCULATED;  CREATE TABLE BSID_CALCULATED(
MANDT varchar(12),
BUKRS varchar(16),
BELNR varchar(40),
GJAHR varchar(16),
BUZEI varchar(12),
WRBTR_CONVERTED float,
SKFBT_CONVERTED float,
WSKTO_CONVERTED float,
BUKRS_TEXT varchar(100),
ZTERM_MD varchar(16),
ZTAGG_MD varchar(8),
ZDART_MD varchar(4),
ZFAEL_MD varchar(8),
ZMONA_MD varchar(8),
ZTAG1_MD varchar(12),
ZPRZ1_MD float,
ZTAG2_MD varchar(12),
ZPRZ2_MD float,
ZTAG3_MD varchar(12));

What we do here is to create a “calculation” table for BSID, where we will store the transformed fields. Note that we don’t store the whole BSID, and only the calculated fields and the Primary Keys. This is to reduce the impact on APC. Primary Keys will be used later to unify the BSID and the transformed values in a view.

As a next step the incoming delta data should be merged to this persisted table. One way to merge is to simply remove the records from the transformed table, and then insert them again. Here is an example on how to implement the deletion for the table BSID.

DELETE FROM "BSID_CALCULATED"
WHERE EXISTS (SELECT 1 FROM <$BSID> as NEW_DATA WHERE AR_BSID_STAGING.MANDT = NEW_DATA.MANDT AND AR_BSID_STAGING.BUKRS = NEW_DATA.BUKRS AND AR_BSID_STAGING.BELNR = NEW_DATA.BELNR AND AR_BSID_STAGING.GJAHR = NEW_DATA.GJAHR AND AR_BSID_STAGING.BUZEI = NEW_DATA.BUZEI );

We basically select all the records from the BSID Staging (via placeholder &lt;$BSID&gt;), and then remove all the matches from the target table, i.e. BSID_CALCULATED.

And then we insert our calculations into the BSID_CALCULATED.

INSERT INTO "BSID_CALCULATED"
SELECT BSID.MANDT ,BSID.BUKRS ,BSID.BELNR ,BSID.GJAHR ,BSID.BUZEI ……… some query ……… FROM <$BSID>
……… some more joins and filters ………
How the Data Job should be designed

To use the Delta Transformations you need to include the Extraction and Transformation tasks in the same Data Job. In other words, the extracted Delta Data will be made available only for the transformations that run inside the same data jobs as the extractions.

This scoping ensures that we can:

  • Automatically identify which table should be extracted to a staging table. The SQL parser scans all the scripts in the Data Job to accomplish this.

  • Clean up all the Staging Tables after the Data Job is over.