
Data Jobs

Quick Start: What is a Data Job?

  • Data Jobs are used to extract data from connected source systems and to transform that data to prepare it for process mining.

  • You set up Data Jobs after you've created a Data Connection (e.g. to Salesforce, Oracle, or SAP) and can then continue with setting up the Data Models.

  • Data Jobs are usually set up by Data Engineers and do not need to be adapted afterwards by Business Analysts or Viewers.

Data Job Configuration

Creating a Data Job

To create a new Data Job, click "+ Add a Data Job" in the "Data Job" section. This opens a modal where you specify a name that is unique within the Data Pool and select the Data Connection the job should relate to. There are two distinct types of Data Jobs:

  • Data Connection Jobs

  • Global Jobs

Data Connection Jobs are bound to a specific Data Connection (which can be changed), whereas Global Jobs only transform existing data, e.g. by unifying data from different Data Connections into one or more joint tables.

Warning

Global Jobs cannot be converted into Data Connection Jobs or vice versa. Since Global Jobs are not bound to a Data Connection, they can only contain Transformations, not Extractions.

Tip

Create a Data Connection Job first. Global Jobs (those without a Data Connection) are only needed if you would like to unify data from different systems.
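
For illustration, here is a minimal SQL sketch of the kind of Transformation a Global Job might contain, unifying vendor data extracted via two different Data Connections into one joint table. All table and column names (VENDORS_SAP, VENDORS_ORACLE, VENDORS_UNIFIED) are hypothetical and only serve as an example.

DROP TABLE IF EXISTS VENDORS_UNIFIED;

CREATE TABLE VENDORS_UNIFIED AS
-- hypothetical table extracted via an SAP Data Connection
SELECT VENDOR_ID, VENDOR_NAME, 'SAP' AS SOURCE_SYSTEM
FROM VENDORS_SAP
UNION ALL
-- hypothetical table extracted via an Oracle Data Connection
SELECT VENDOR_ID, VENDOR_NAME, 'ORACLE' AS SOURCE_SYSTEM
FROM VENDORS_ORACLE;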

Actions on a Data Job
[Screenshot: Data Job context menu]

The following actions can be performed on a job by clicking on the context menu to the right of the job name:

  • Rename: The new name must be unique among the jobs in one Data Pool (also across Data Connections).

  • Change Data Connection: Only available for Data Connection Jobs, not for Global Jobs. Make sure that the tables defined in the Extractions and the columns used for filtering are also available in the new Data Connection.

  • Duplicate: Copies the job along with all tasks and their parameters. The status and the logs are not copied.

  • Copy to: Copies the job into another Data Pool of the same or a different team on the same cluster, including all Extractions, Transformations, local parameters, and templates. Pool parameters, Data Model loads, and job alerts are not copied. A reference to a pool parameter is replaced in the target pool by the value of that pool parameter in the source pool. If you copy a job including Extractions to a global scope, the Extractions are removed.

  • Execute Data Job: Opens the modal to configure the job execution; consult Data Job Executions for details.

  • Configure alerts: Subscribes you to alerts of the Data Job.

  • Force Cancel Executions: Cancels the execution of the data job that is currently running.

  • Delete: Deletes the job and its tasks.

Actions on a Task
[Screenshot: task actions and context menu]

The following actions can be performed on a task:

1. By dragging the handle to the left of the task name you can change the order of the tasks, which affects the execution order. Alternatively, you can use the "move up" and "move down" actions in the context menu.

2. By clicking on the task between the drag handle and the context menu button, you open the task to modify its contents.

3. By clicking on the info icon you can display the description of the transformation.

The other actions are available in the context menu on the right:

Rename: Modify the name of a task or the description of a transformation

Enable/Disable: If a task is disabled, it is hidden in the execution modal and is not executed by a Schedule. The status is shown next to the task name.

Duplicate: Copies the task and adds it right after the base task along with the content and all parameters.

Execute (from here): Opens the execution modal with the respective task pre-selected. When clicking "Execute from here", all the following tasks are also pre-selected.

Convert to template / Copy to regular task: Converts the task into a template; refer to the Task Template section for details. If the task is already a template, you can create a regular task from it, provided you can access the content of the template.

Delete: The task and all its contents are deleted.

Data Job Executions

Once your Data Job and its Extractions and Transformations are configured, you can start your first execution. It makes sense to begin by extracting a single table in full mode to verify that the connection works correctly. Afterwards, the whole Data Job can be executed manually. Once this works as expected, set up a Schedule to execute the Data Job automatically and continuously in Delta Mode.

Triggering a Data Job Execution
[Screenshot: "Execute Data Job" option]

You can execute a Data Job manually by selecting "Execute Data Job" within the Data Job configuration.

Note

The manual execution will open the execution modal which allows you to configure which Tasks of the Data Job should be executed. Please note that the modal only contains Tasks that are enabled. In order to execute disabled Tasks you need to enable them first.

Tip

Instead of executing Data Jobs manually, you can also schedule them to be executed automatically and periodically.

Execution Modal
[Screenshot: execution modal]

1. Select or deselect all tasks of the Data Job.

2. Search table names. This will filter the view of extractions to only the ones containing the search term.

3. Show and hide the list of tables.

4. Use the alphabetical chooser to only see tables that start with a certain letter.

5. Select and deselect individual tables. Only if an item is selected will the corresponding action be performed.

6. Transformations can be expanded as well and can be selected or deselected.

7. Cancel to close the modal without executing the Data Job.

8. Choose between a full and a delta load, see below for details.

9. Execute the selected items. This button is only enabled if at least one task is selected.

Execution Order - smart ETL

Previously, all enabled Tasks were executed sequentially. With the release of smart ETL, we automatically calculate the optimal execution order for Transformation Tasks in the background in the following way (an illustrative example follows the list):

  1. Before each Data Job execution, the optimal execution order is calculated automatically based on the dependencies across all Transformation Tasks in the Data Job.

  2. A DAG (Directed Acyclic Graph) representing the most efficient execution order, optimized for parallelism, is created automatically in the background.

  3. The DAG is then used to automatically trigger all Transformation Tasks in the optimal execution order.
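
As a hypothetical illustration of such a dependency (all table, column, and task names below are invented for this example): a Transformation that writes into a table created by another Transformation can only run after it, while an unrelated Transformation can run in parallel with either of them.

-- Task A: creates the activity table
CREATE TABLE _CEL_ACTIVITIES (CASE_KEY VARCHAR(50), ACTIVITY VARCHAR(100), EVENTTIME TIMESTAMP);

-- Task B: writes into _CEL_ACTIVITIES, so it depends on Task A and is started only after it
INSERT INTO _CEL_ACTIVITIES
SELECT QUOTATION_ID, 'Create Quotation', CREATED_AT
FROM QUOTATIONS;

-- Task C: touches neither of these tables, so it can run in parallel with Task A
CREATE TABLE VENDOR_LOOKUP AS
SELECT VENDOR_ID, VENDOR_NAME
FROM VENDORS;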

The difference between the previous sequential execution order and the new parallel orchestration is also displayed in the following graphs:

[Screenshot: sequential vs. parallel execution order]

When starting a Data Job, depending on the dependencies across the Transformation Tasks, you'll see multiple Transformation Tasks running in parallel in the Log section. For each individual Transformation Task you will also see log lines indicating which Transformation Tasks need to be executed first due to their dependencies. In the following example you can see that four Transformation Tasks need to be executed before the “Add Activity: Create Quotation” Task can be started.

[Screenshot: Data Job log showing Transformation Task dependencies]

To avoid any risk of data inconsistency, there are multiple safety mechanisms in place:

  • All Transformation Tasks are canceled automatically if one Task fails. This is in line with the existing behavior for sequential execution.

  • If a Transformation Task cannot be properly parsed or any other error occurs during the calculation of the optimal order, we fall back to sequential execution. If that happens, you'll see the log line "Parsing Error, falling back to sequential execution" in the General Data Job Log section.

In the next few months, we will extend the functionality to also take Extraction and Data Model Load Tasks into account for the parallel orchestration.

Load Variations
Full Load

When you select "Full Load", the tables in the EMS that have the same name as the tables defined in the Extraction are deleted and the new data is written to the EMS. However, the existing data is only deleted after the Extraction has completed successfully, so you will not lose data if the Extraction fails.

Delta Load

The purpose of delta loading is to reduce the amount of data that needs to be transferred between the source system and Celonis Data Integration, enabling more frequent reloads.

Important

The prerequisite for delta loading is that the table structure has not changed (i.e. the number of columns, the column names, and the data types). If the structure of a table has changed, either because the source system changed or because the Extraction configuration specifies other columns to extract, a Full Load is required once to update the table structure.

You should almost always use Delta Loads, especially for large data volumes, to reduce both the impact on your Data Connection and the execution time. This also allows you to extract data more frequently.

Extraction Filter

The Delta Load uses a conjunction of the normal filter and the delta filter. Typically, the delta filter field should contain a filter statement that only extracts the newest data, i.e. data that is either not in the EMS yet or has been updated since the last Extraction. To facilitate specifying this filter, you can use dynamic Task Parameters with the operation type FIND_MAX, which query the existing data and return the maximum value of a column. This Dynamic Parameter can then be used to restrict the data volume to only the newest data.

Example: table1 in your source system contains the column LAST_CHANGED_DATE, which holds the date on which the last change to each row occurred. If you use a dynamic parameter "LastChangedDate" to retrieve the maximum of this column from the existing data, you get the date of the most recent change already loaded for table1. You can then use the following delta filter:

LAST_CHANGED_DATE >= <%= LastChangedDate %>

This filter retrieves all rows that have been updated on or after the most recent change date already present in the EMS.
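
Conceptually, the FIND_MAX dynamic parameter resolves to the result of a query like the one below, run against the data already in the EMS; the delta filter is then evaluated with that value substituted. The concrete date shown is purely illustrative.

-- conceptual equivalent of the dynamic parameter "LastChangedDate" (operation type FIND_MAX)
SELECT MAX(LAST_CHANGED_DATE) FROM table1;

-- the delta filter after the parameter has been resolved, e.g. to '2022-11-22'
LAST_CHANGED_DATE >= '2022-11-22'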

Update Mechanism

The following process is used to update the existing data (a SQL sketch of the pattern follows the list):

  1. Retrieve the primary key or keys from the extracted data. The primary key columns are either taken from the source system or, if available, from the Extraction configuration.

  2. Delete all rows of the existing data table that have the same primary key(s) as the new data.

  3. Insert all rows from the new data into the existing data table.
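
Data Integration performs these steps for you; the following minimal SQL sketch only illustrates the logic, assuming the newly extracted delta has landed in a hypothetical staging table table1_delta and ID is the primary key of table1.

-- step 2: delete existing rows whose primary key also appears in the new data
DELETE FROM table1
WHERE ID IN (SELECT ID FROM table1_delta);

-- step 3: insert all rows from the new data into the existing table
INSERT INTO table1
SELECT * FROM table1_delta;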

Actions while an execution is in progress

While a Data Job is running, you can view its current status in the Logs tab or in the Execution History. You can also modify its configuration without interfering with the current execution, or cancel the execution entirely.

Canceling a Data Job execution

When a Data Job is running, you can cancel its execution by clicking the "Cancel Execution" button. This attempts to terminate the running processes of this Data Job both in the application itself and in the remote system. If a Schedule attempts to execute a Data Job that is already running, the scheduled execution is skipped automatically.

Modifying a Data Job during execution

You can access and change the configuration of all Tasks of a Data Job as well as Data Pool Parameters. However, this only affects future executions of the Data Job. Even if a Task has not started yet, its configuration is taken from the point in time when the whole Data Job execution began.