
Creating extraction tasks

As the name suggests, extraction tasks allow you to select the data tables to be extracted from your source system (and imported into the Celonis Platform). When configuring your extractions, you can apply time filters, add pseudonymization, and join tables to each other.

You can either create extraction tasks manually from your data jobs or edit existing extraction tasks (whether created manually or as part of a process connector).

Note

The extraction features and settings available are dependent on the source system you are connecting to. For an overview of supported features, see:

Creating and managing extraction tasks

To create an extraction task from your data pool diagram:

  1. Click Data Jobs and select an existing data connection scope.

  2. In the extraction row, click + Add.

  3. Add an extraction task name (an internal reference only) and click Save.


    The task is created and displayed.

  4. Edit your extraction task configuration, parameters, and extraction settings as required. See: Extraction task configuration.

You can manage existing extraction tasks by clicking Options.


You have the following options here:

  • Rename: Update the name of the extraction task.

  • Enable / disable: Control whether the extraction task should be enabled or disabled for executions.

  • Move up / down: Change the order in which this task is performed in a full execution.

  • Duplicate: Create a copy of the extraction task in the existing data job.

  • Execute: This allows you to manually execute just this task on demand. For more information about executing data jobs, see: Executing data jobs.

  • Execute from here: This allows you to manually execute this and all following tasks on demand. For more information about executing data jobs, see: Executing data jobs.

  • Convert to template / Copy to regular task: The task becomes a template and can be added to other data jobs or used to extend the template. If the task is already a template, you can create a regular task from it. For more information about task templates, see: Creating task templates.

  • Delete: This deletes the task and all associated content, with no recovery possible.

  • Download table configuration: This gives you offline access to a zipped file containing any relevant Excel workbook copies of your table configuration.

When creating or editing an extraction task, you have the following configuration options available:

Table configuration

This is where you select and configure the data tables you want to extract. Depending on your data connection type, you have the following table configuration options:

  • Column subset: Specify which columns should be extracted by clicking Configure next to the column count.

  • Pseudonymized columns: As part of the advanced settings for data connections, you can pseudonymize the data you extract using the SHA-1, SHA-256 (with and without salt), and SHA-512 (with and without salt) algorithms. If you choose to pseudonymize your data, the values displayed in the source system are replaced with hashed values.

    Pseudonymization happens during the data extraction. Celonis requests source system data, pseudonymizes that data upon receipt, converts it into parquet format, and then ingests it into the Celonis Platform. For an illustration of what the hashing produces, see the sketch after this list.

  • Primary key columns: You can explicitly specify which columns should be used as primary key columns during a delta load. This is only necessary if the source system does not provide this information on its own. Overriding the primary key columns does not change how data is stored in the database. It is only used for delta loads and not saved.
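
Conceptually, salted hashing replaces each source value with its hash before the data is ingested. A minimal sketch in generic SQL (SHA2 as in MySQL; the salt literal and the KNA1/KUNNR names are hypothetical examples, not the connector's actual implementation):

    -- Sketch: replace a customer number with its salted SHA-256 hash.
    -- 'my_salt' and the KNA1/KUNNR names are hypothetical examples.
    SELECT
      SHA2(CONCAT('my_salt', KUNNR), 256) AS KUNNR
    FROM KNA1;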

Removing duplicate records

You can remove duplicate records when executing your JDBC extraction tasks by selecting the ordering column to be used:


This feature uses the primary key configuration to determine which records are duplicates, and then removes duplicates based on the ordering column in descending order. As a result, we recommend using a timestamp column as your ordering column. When two or more duplicates are found, the older records are removed and only the most recent record is kept.

In this example, we're using the columns MANDT, EBELN, and EBELP as the primary key configuration.
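
Conceptually, the behavior matches the following SQL sketch (the actual deduplication runs inside the extractor; the EKPO table and the LAST_CHANGED ordering column are hypothetical examples):

    -- Sketch: keep only the newest record per primary key (MANDT, EBELN, EBELP),
    -- ranked by the hypothetical ordering column LAST_CHANGED in descending order.
    SELECT *
    FROM (
      SELECT t.*,
             ROW_NUMBER() OVER (
               PARTITION BY MANDT, EBELN, EBELP
               ORDER BY LAST_CHANGED DESC
             ) AS rn
      FROM EKPO t
    ) ranked
    WHERE rn = 1;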

Joins

You can add one or more join partners to the table. Each join partner can be joined either through the primary keys of the tables or through a custom join path. For the primary key join to work, the primary keys of the table to be joined must be included in the primary keys of the base table. You can also add a filter for each joined table.

When creating joins in your extraction, you use child tables and parent tables:

  • Child tables: The table to be extracted.

  • Parent tables: The joined tables used to filter the child table.

These tables are then executed in the following order:

  1. The extraction first applies the indicated filters to the parent table, as shown on the right-hand side.

  2. The filtered parent table is then joined to the child table. This join is performed in Java and is broadly comparable to an inner join in SQL.

  3. The additional filters are applied to the table resulting from the join. (These steps are illustrated in the SQL sketch after the notes below.)

You then have the following additional information:

  • "Use primary key" means the parent primary key.

  • If you join a child (n) table to a to-be-extracted parent (1) table on the primary key, the extracted table will contain duplicates.

  • If there are several parent tables, the parents are filtered and then joined top-down before the resulting table is joined to the child.
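
Expressed in SQL, the three steps above correspond roughly to the following sketch (the table names, join condition, and filter values are hypothetical; the real join is executed in Java inside the extractor):

    -- Step 1: filter the parent table first.
    -- Step 2: inner-join the filtered parent to the child (the table being extracted).
    -- Step 3: apply the additional filters to the joined result.
    SELECT child.*
    FROM CHILD_TABLE child
    JOIN (
      SELECT * FROM PARENT_TABLE WHERE STATUS = 'OPEN'  -- parent filter
    ) parent
      ON child.PARENT_ID = parent.ID                    -- primary key or custom join path
    WHERE child.AMOUNT > 0;                             -- additional filter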

Time filters and additional filters

  • Creation date filter: A filter on a date column that is used in both full and delta loads. This filter is combined with the "Filter Statement" under "Additional Filters" with an AND condition, so both conditions must be met.

  • Change date filter: A filter on a date column that automatically looks for the maximum date in the existing table and sets a filter to only extract data newer than this maximum date. This filter will be combined with the "Delta Filter Statement" under "Additional Filters" with an AND condition.

  • Filter statement: Using SQL syntax, specify which rows will be extracted. For example:

    COLUMN_A > 1 AND COLUMN_B IN ('example1', 'example2')
  • Delta filter: Using SQL syntax, specify which additional filters should be applied when the job executes a delta load. This filter statement is combined with the normal filter with the logical AND operator. For example:

    COLUMN_A > 1 AND COLUMN_B IN ('example1', 'example2')
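
Taken together, the filters combine with AND. Assuming hypothetical CREATED_AT and CHANGED_AT columns for the creation and change date filters, the effective conditions are roughly:

    -- Full load: creation date filter AND filter statement
    CREATED_AT >= '2023-01-01'
      AND COLUMN_A > 1 AND COLUMN_B IN ('example1', 'example2')

    -- Delta load: additionally ANDed with the change date filter and the delta filter
    CREATED_AT >= '2023-01-01'
      AND CHANGED_AT > '2024-05-01'  -- maximum date already extracted
      AND COLUMN_A > 1 AND COLUMN_B IN ('example1', 'example2')
      AND COLUMN_C = 'delta_only'    -- delta filter statement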

Extraction parameters

By creating and managing extraction parameters, you can control which data is extracted from your source system based on matches to your criteria.

When creating extraction parameters, you can select between private, public, and dynamic:

  • Dynamic extraction parameters: Delta filters define which entries in the table are loaded during a delta load. Best practice is to use dynamic parameters with the operation type FIND_MAX for delta filters, as they return the maximum value of a defined column. If the indicated table does not exist in the Celonis Platform or is empty, the DEFINED VALUE is used. If this VALUE does not exist, the DEFAULT VALUE is used.
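
As an illustration, a FIND_MAX parameter can drive the delta filter. Assuming the <%=parameter%> placeholder syntax and a hypothetical dynamic parameter last_change_date defined as FIND_MAX on the CHANGED_AT column of the target table, the delta filter could look like:

    -- Delta filter referencing a hypothetical dynamic parameter:
    CHANGED_AT > <%=last_change_date%>

On the first load, when the target table does not yet exist or is empty, the parameter falls back to its default value as described above.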

Extraction settings

The available extraction settings depend on the source system that you're connecting to.

  • Debug mode: Once enabled, debug mode provides detailed log information for the data extraction job, displayed in the execution logs. This allows for more transparency and easier troubleshooting. Debug mode remains active for three days, after which the logs created are deleted.

  • Convert to a delete job: Converting this extraction job to a delete job will cause all records found by this job to be deleted in your data pool.

  • Delta load configuration: Select from either:

    • Option A - Standard: The delta load is cancelled when metadata in your source system changes (for example, new columns) compared to the data already extracted into the Celonis Platform. Metadata changes are not ignored. A full reload or manual changes to your table are necessary to clear this conflict.

    • Option B - Including metadata changes: The delta load includes metadata changes (for example, new columns) and runs through with a warning. Only delta-loaded rows contain data for the newly added columns; rows that are not part of the delta load have the newly added columns set to null. This can lead to inconsistency in your data, so a full load is recommended for data consistency.

  • Connector parameters: The following connector parameters can be configured:

    • Batch size: Allows you to specify the batch size (in records) for one extraction request.

    • Max string length: Allows you to modify the default length (80 characters) of String-type columns. This is configured using the parameter MAX_STRING_LENGTH.

    • Binary data type handling: Table columns with a binary data type can be represented in two ways: UTF-8 or HEX_NOTATION. Depending on the value specified here, the binary value is converted accordingly.

    • Change default metadata source: Depending on your database, you can select from:

      • DRIVER_METADATA: This metadata source is supported by all source systems and is usually the default. The driver internally runs the metadata query against the source system and fetches the result set.

      • SAMPLE_QUERY: This metadata source is supported by all source systems. It works the same way as driver metadata; only the query used is different (see the sketch after this list).

      • INFORMATION_SCHEMA: This metadata source is mainly supported by Oracle systems and is the default metadata source for Oracle 11g.

      • PG_CATALOG: This metadata source is supported by Amazon Redshift, where it is the default.

    • Limit total number: Set the maximum number of records to be extracted per job.
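
Regarding the SAMPLE_QUERY source above: metadata can be derived from an empty result set, because a JDBC driver exposes column names and types even when no rows match. A sketch of the general query shape (the exact query the connector issues is internal; EKPO is a hypothetical example table):

    -- Returns no rows, but the result-set metadata (column names and types)
    -- is still available through the JDBC driver:
    SELECT * FROM EKPO WHERE 1 = 0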

Extraction preview

Extraction preview is available only for JDBC connections. For the complete list of functionalities available for our commonly supported database connections, see Supported database connections.