Skip to main content

Celonis Product Documentation

Usage of DISTINCT [Performance]

A SELECT DISTINCT statement is used to avoid duplicates in tables/views/activities. The problem with SELECT DISTINCT statements is that their execution takes very long due to the numerous tasks to calculate. In general we can avoid SELECT DISTINCT statements, especially SELECT DISTINCT * statements. The steps below guide you on how to do so.

Warning

Always avoid SELECT DISTINCT * !!

If you use delta loads the DISTINCT most certainly does not achieve what you intend on anyway, because of the automatically added update timestamps which is part of the "*"

In order of importance:

  1. Ask yourself if the DISTINCT actually needed and remove it if not.

  2. Otherwise, try the alternative approaches outlined in this guide to remove the DISTINCT statement.

1. Check if the DISTINCT is needed at all:

Check if there are duplicates in your table/view/activities by counting the distinct rows based on the primary key of the table/view/activities vs the count of all rows (see below). If the results match, you do not need any distinct as there are no duplicates. If they do not match please continue with step no. 2.

Row Counter - Option I

SELECT 
          COUNT(DISTINCT "LIPS"."VGBEL"||"LIPS"."VGPOS")
        , COUNT("LIPS"."VGBEL"||"LIPS"."VGPOS")
FROM  "LIPS" AS "LIPS"
INNER JOIN "VBAP" AS "VBAP" ON 1=1
        AND "VBAP"."VBELN" = "LIPS"."VGBEL"
    AND "VBAP"."POSNR" = "LIPS"."VGPOS"

Another way to check for duplicates is to run the following snippet once without 'DISTINCT' and once with it. Again if the results match a 'DISTINCT' is not needed.

Row Counter - Option II

SELECT COUNT(*) FROM (
SELECT <DISTINCT>
        "LIPS".*
FROM  "LIPS" AS "LIPS"
INNER JOIN "VBAP" AS "VBAP" ON 1=1
        AND "VBAP"."VBELN" = "LIPS"."VGBEL"
    AND "VBAP"."POSNR" = "LIPS"."VGPOS"
) test;

Especially for activities it might be quite difficult to assess what the primary key of the activity is, e.g. for the activity 'Create Quotation' from O2C, as visible below, the primary key is not the case key '"VBFA"."MANDT" || "VBFA"."VBELN" || "VBFA"."POSNN"' and it is also not the key of the quotes '"V_QUOTES"."MANDT" ||"V_QUOTES"."VBELN"||"V_QUOTES"."POSNR"'. To assess the primary activity key we need to approach it from a business angle. One quotation can result in several orders and several quotations can also result in one order. Therefore we have an m:n connection between quotes and orders. This means that the primary activity key is '"VBFA"."MANDT" || "VBFA"."VBELN" || "VBFA"."POSNN"||"V_QUOTES"."MANDT" ||"V_QUOTES"."VBELN"||"V_QUOTES"."POSNR"'. A single activity is defined by the quote and the order. If we would solely check for orders or solely for quotes and based on the different counts use a distinct, we would lose those quotations, that have been created with another one at the same time resulting later in the same order. In the process explorer we would see one quotation resulting in one order. In reality there were two quotations resulting in one order. Therefore in this case it would be correct to just not using a SELECT DISTINCT.

Assess Primary Activity Key Expand source

INSERT INTO "_CEL_O2C_ACTIVITIES" (
        "_CASE_KEY",
        "ACTIVITY_DE",
        "ACTIVITY_EN",
        "EVENTTIME",
        "_SORTING",
        "USER_NAME")
SELECT
         "VBFA"."MANDT" || "VBFA"."VBELN" || "VBFA"."POSNN" AS "_CASE_KEY"
        , 'Lege Angebot an' AS "ACTIVITY_DE"
        , 'Create Quotation' AS "ACTIVITY_EN"
        , CAST("V_QUOTES"."ERDAT" AS DATE) + CAST("V_QUOTES"."ERZET" AS TIME) AS "EVENTTIME"
        , 10 AS "_SORTING"
        , "V_QUOTES"."ERNAM" AS "USER_NAME"
FROM
        "O2C_VBFA_N" AS "VBFA"
        JOIN "TMP_O2C_VBAK_VBAP_QUOTES" AS "V_QUOTES" ON 1=1
                AND "V_QUOTES"."MANDT" = "VBFA"."MANDT" 
                AND "V_QUOTES"."VBELN" = "VBFA"."VBELV"
                AND "V_QUOTES"."POSNR" = "VBFA"."POSNV";

Additional Information: If the row count of the tested table takes too long (~5 min), the result of the query won't be displayed in the frontend of the Celonis Platform. A suitable workaround is the temporary creation of a new table, where the query result is inserted, see below:

Row Counter workaround Expand source

DROP TABLE IF EXISTS _CEL_QUERY_RESULTS;

CREATE TABLE _CEL_QUERY_RESULTS (
        RESULTS FLOAT
);

INSERT INTO _CEL_QUERY_RESULTS
        SELECT COUNT(*) FROM (
        -- insert statement here
        )
AS RESULT;

SELECT * FROM _CEL_QUERY_RESULTS;
2. How to avoid the DISTINCT if it seems to be necessary?

First of all check why the DISTINCT seems to be unavoidable. Then approach it based on the issue types explained in the following:

1. Duplicates in the raw data due to poor data quality:

Investigate why there are duplicates in the raw data, and if you can detect a pattern, you might be able to write a delete statement for these cases.

If there is no pattern/explanation for these duplicates, you should clean the data as follows:

  1. Create a table that contains the duplicates including a row number to detect entries with the same primary key. In the example below the duplicates in BSIK are marked by the row number based on the primary key of the table (BSIK.MANDT, BSIK.BUKRS , BSIK.BELNR, BSIK.GJAHR, BSIK.BUZEI). The order option can help you in deleting the correct lines. in this example the line with the most current _CELONIS_CHANGE_DATE among all identical primary keys gets the number 1.

    Add row numbers Expand source

    CREATE TABLE BSIK_CLEAN AS(
    SELECT
    ROW_NUMBER() OVER(PARTITION BY BSIK.MANDT, BSIK.BUKRS , BSIK.BELNR, BSIK.GJAHR, BSIK.BUZEI ORDER BY BSIK.MANDT, BSIK.BUKRS , BSIK.BELNR, BSIK.GJAHR, BSIK.BUZEI,_CELONIS_CHANGE_DATE DESC) AS NUM
        ,BSIK.*
    FROM BSIK AS BSIK
    );
  2. Whenever you want to use this table, make sure to add the where statement to solely select the items from BSIK_CLEAN with the row number equals 1 or you delete all entries from that table where the row number is greater than 1.

    Usage of clean table Expand source

    SELECT *
    FROM BSIK_CLEAN 
    WHERE 1=1
        AND "BSCHL" IN <%=postingKeys%>
        AND NUM = 1;
    
    OR
    
    DELETE FROM BSAD_C WHERE NUM > 1;
2. Limit view/table on solely the cases in the case table leads to duplicates:

Often we want to limit our master data views to solely the items relating to the existing cases. For instance we want to store in the view LFA1 only the vendors that also exist in the case table, e.g. BSEG table. An inner join of BSEG on LFA1 makes sure of that, but it also creates duplicates as many documents in BSEG make use of the same vendor. To delete these duplicates in the past DISTINCT was used, as visible in the example of bad code, below.

Negative Example Expand source

DROP VIEW IF EXISTS "AP_LFA1";

CREATE VIEW "AP_LFA1" AS
SELECT DISTINCT "LFA1".*
FROM "LFA1"
INNER JOIN "AP_BSEG" AS BSEG ON 1=1
    AND BSEG."MANDT" = LFA1."MANDT"
    AND BSEG."LIFNR" = LFA1."LIFNR";

A much more performant version is it to write an exist statement instead of a join as visible in the example below.

Improvement Expand source

DROP VIEW IF EXISTS "AP_LFA1";

CREATE VIEW "AP_LFA1" AS
SELECT "LFA1".* 
FROM "LFA1"
WHERE EXISTS (
    SELECT * FROM "AP_BSEG" AS BSEG
    WHERE 1=1 
        AND BSEG."MANDT" = LFA1."MANDT"
        AND BSEG."LIFNR" = LFA1."LIFNR"
);
3. Incomplete Join:

Incomplete join means that we have an unwanted 1:n or even m:n connection between the tables that we want to join. The first step here is to find out what the primary key of the table/view/activity you want to create should be. In the example below, the primary key of the view "AP_LFB1" should be 'LFB1.MANDT || LFB1.LIFNR || LFB1.BUKRS'. Based on that information we know that every table we join on LFB1 (which has exactly the wanted primary key) needs to have a 1:1 or n:1 (where n is LFB1) connection with LFB1. In these cases we will have no duplicates. In the example we join T052 (Primary Key: MANDT, ZTERM, ZTAGG) and AP_BSEG(Primary Key: MANDT, GJAHR, BUKRS, BELNR, BUZEI). In both cases we do not join on the primary key of the respective table (T052 joined on MANDT, ZTERM / AP_BSEG joined on MANDT, LIFNR, BUKRS), which results in an incomplete join. We can handle incomplete joins in two different ways:

  1. WHERE EXISTS: This option is solely possible if you do not want to select any columns from the joined table. The approach is the same as in issue type 2 and has been conducted here for table AP_BSEG.

  2. SUBSELECT: To not generate duplicates in the resulting table, we subselect only the columns we use for the join (T052.MANDT, T052.ZTERM) and the ones we want to add to the resulting table (T052.ZTAG1, T052.ZPRZ1). If we do not select any other columns than the ones to join on we would need a distinct in the subselect. As we select here further columns we need to use an aggregate function on them and group the join columns. As we use a group statement here we do not need a DISTINCT at all. This approach results in a 1:1 or n:1 (where n is LFB1) connection.

  3. TABLE: The data in the main table (LFB1), should only display values appearing in another Table (BSEG). As this Table (BSEG) can possibly be huge, we execute this query already at the moment of the transformation and not during runtime of the data model load. By this we reduce the Memory Footprint Peak during Loading of data model.

Important

A common problem with long running transformations and slow data model loads is due to missing statistics in Vertica. This issue can be resolved by adding the Vertica ANALYZE_STATISTICS statement directly in the SQL. For more information, refer to Vertica Transformations Optimization.

Bad Example Expand source

CREATE VIEW "AP_LFB1" AS (
SELECT DISTINCT
                "LFB1".*,
                CAST("T052"."ZTAG1" AS INT) AS "ZBD1T_LFB1",
                CAST("T052"."ZPRZ1" AS INT) AS "ZBD1P_LFB1"
FROM LFB1
LEFT JOIN T052 ON 1=1
                AND T052.MANDT = LFB1.MANDT
                AND T052.ZTERM = LFB1.ZTERM
INNER JOIN "AP_BSEG" AS BSEG ON 1=1
        AND BSEG."MANDT" = LFB1."MANDT"
        AND BSEG."LIFNR" = LFB1."LIFNR"
        AND BSEG."BUKRS" = LFB1."BUKRS");

Improved Example Expand source

CREATE TABLE "AP_LFB1" AS (
SELECT
                "LFB1".*,
                CAST("T052"."ZTAG1" AS INT) AS "ZBD1T_LFB1",
                CAST("T052"."ZPRZ1" AS INT) AS "ZBD1P_LFB1"
FROM LFB1
LEFT JOIN (SELECT
                T052.MANDT,
                T052.ZTERM,
                AVG(T052.ZTAG1),
                AVG(T052.ZPRZ1)
                FROM T052 GROUP BY MANDT, ZTERM) AS T052 ON 1=1
                AND T052.MANDT = LFB1.MANDT
                AND T052.ZTERM = LFB1.ZTERM
        WHERE EXISTS (
                SELECT * FROM "AP_BSEG" AS BSEG
                WHERE 1=1
                AND BSEG."MANDT" = LFB1."MANDT"
                AND BSEG."LIFNR" = LFB1."LIFNR"
                AND BSEG."BUKRS" = LFB1."BUKRS"));