
Data Jobs

Quick Start: What is a Data Job?

  • Data Jobs are used to extract data from different connected systems and then transform that data to prepare it for process mining.

  • You set up Data Jobs after setting up your data connection (e.g. the SAP or Salesforce Connector) and can then continue with setting up your Data Models.

  • Data Jobs are usually set up by Data Engineers and do not need to be adapted afterwards by Business Analysts or Viewers.

Data Job Configuration

Creating a Data Job

To create a new data job, click "New Data Job" in the "Data Job" section. This opens a modal where you specify a name that is unique within the data pool and select the data connection the job should work with. There are two distinct types of data jobs:

  • Data Connection Jobs

  • Global Jobs

Data Connection Jobs are bound to a specific Data Connection (which can, however, be changed later), whereas Global Jobs only transform existing data, e.g. unifying data from different Data Connections into one or multiple joint tables.
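
For example, a transformation in a Global Job could combine customer tables extracted by two different Data Connection Jobs into one joint table. The following is a minimal sketch; the table and column names (SAP_KNA1, SFDC_ACCOUNT, CUSTOMERS_UNIFIED) are hypothetical and depend on your own connections and extractions:

DROP TABLE IF EXISTS CUSTOMERS_UNIFIED;
CREATE TABLE CUSTOMERS_UNIFIED AS
    SELECT KUNNR AS CUSTOMER_ID, NAME1 AS CUSTOMER_NAME, 'SAP' AS SOURCE_SYSTEM
    FROM SAP_KNA1
  UNION ALL
    SELECT ID AS CUSTOMER_ID, NAME AS CUSTOMER_NAME, 'Salesforce' AS SOURCE_SYSTEM
    FROM SFDC_ACCOUNT;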

Warning

Global Jobs cannot be transformed into Data Connection Jobs or vice versa. Since Global Jobs are not bound to a Data Connection they can only contain transformations and no extractions.

Tip

Create a data connection job first. Global jobs (those without a data connection) are only needed if you want to unify data from different systems.

Actions on a Data Job

The following actions can be performed on a job by clicking on the context menu to the right of the job name:

  • Rename: The new name must be unique among the jobs in one data pool (also across data connections).

  • Change Data Connection: This is only available for data connection jobs, not for global jobs. Make sure that the tables defined in the extractions and the columns used for filtering etc. are also available in the new data connection.

  • Duplicate: Copies the job along with all tasks and their parameters. The status and the logs are not copied.

  • Copy to: Copies the job into another pool on the same team or a different team on the same cluster, including all extractions, transformations, local parameters, and templates. Pool parameters, data model loads, and job alerts are not copied. References to pool parameters are replaced in the target pool by the values those parameters have in the source pool. If you copy a job including extractions to a global scope, the extractions are removed.

  • Execute Data Job: Opens the modal to configure the job execution; see Data Job Executions for details.

  • Configure alerts: Subscribes you to alerts for the Data Job.

  • Force Cancel Executions: Cancels the execution of the data job that is currently running.

  • Delete: Deletes the job and its tasks.

Actions on a Task

The following actions can be performed on a task:

1. By dragging the handle to the left of the task name, you can change the order of the tasks, which affects the execution order. Alternatively, you can use the actions "move up" and "move down" in the context menu.

2. By clicking on the task between the drag handle and the context menu button, you open the task to modify its contents.

3. By clicking on the info icon you can display the description of the transformation.

The other actions are available in the context menu on the right:

4. Rename: Modify the name of a task or the description of a transformation

5. Enable/Disable: If a task is disabled, it is hidden in the execution modal and will not be executed by a schedule. The status is shown next to the task name.

6. Duplicate: Copies the task and adds it right after the base task along with the content and all parameters.

7. Execute (from here): Opens the execution modal with the respective task pre-selected. "Execute from here" also pre-selects all following tasks.

8. Convert to template/Copy to regular task: The task becomes a template; refer to the Task Template section for details. If the task is already a template, you can create a regular task from it, provided you can access the content of the template.

9. Delete: The task and all its contents are deleted.

Not shown and only for transformations: Publish/Unpublish

Data Job Executions

After your data job and its contained extractions and transformations are configured, you can start your first execution. It makes sense to start by extracting a single table in full mode to verify that the connection is working correctly. Afterwards, the complete data job can be executed manually. Once you have verified that this works as expected, it is advisable to set up a schedule that continuously executes the data job in delta mode.

Triggering a Job Execution

You can execute a data job manually using one of four methods:

1. "Execute Data Job" within the job configuration.

2. The context menu entry "Execute Data Job" of a job.

3./4. The context menu entries "Execute" or "Execute from here" of a task; see the section "Actions on a Task" for details.

Note

All of these manual execution methods open the execution modal, which allows you to configure which tasks of the job should be executed. Please note that the modal only contains tasks that are enabled. To execute disabled tasks, you need to enable them first.

Tip

Instead of executing data jobs manually you can also schedule them to be executed automatically and periodically.

Execution Modal

1. Select or deselect all tasks of the Data Job.

2. Search table names. This will filter the view of extractions to only the ones containing the search term.

3. Show and hide the list of tables.

4. Use the alphabetical chooser to only see tables that start with a certain letter.

5. Select or deselect individual tables. Only if an item is selected will the corresponding action be performed.

6. Transformations can be expanded as well and can be selected or deselected individually.

7. Cancel without executing the Data Job.

8. Choose between a full and a delta load, see below for details.

9. Execute the selected items. This button is only enabled if at least one task is selected.

Execution Order

All enabled tasks are executed sequentially, starting with the extractions. Each task only starts if the preceding task finished successfully. The execution order within the extractions or transformations, respectively, can be changed either by dragging the handle to the left of the task name or by using the move up/move down context menu entries.

Load Variations
Full Load

When you select "reload all data", the tables in the existing database that have the same name as the ones in the extraction are deleted, and the new data is written to the database. However, the existing data is only deleted after the extraction has completed successfully, so a failed extraction will not cause you to lose data.
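
Conceptually, this behaves as if the extraction wrote into a staging table that only replaces the target table once the extraction has succeeded. A simplified sketch (not the actual implementation; table and column names are illustrative):

CREATE TABLE table1_staging (ID INT, LAST_CHANGED_DATE DATE);
-- ... the extraction writes all rows into table1_staging ...
-- Only after the extraction has succeeded is the old table replaced:
DROP TABLE IF EXISTS table1;
ALTER TABLE table1_staging RENAME TO table1;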

Delta Load

The purpose of delta loading is to reduce the amount of data that needs to be transferred between the source system and Celonis Data Integration, enabling more frequent reloads.

Important

The prerequisite for delta loading is that the table structure has not changed (i.e. the number of columns, the column names, and the data types). If the structure of a table has changed, either because the source system changed or because the extraction configuration specifies different columns to extract, a full load is required once to update the table structure.

When you select not to "reload all data" (which is the default), the existing tables are preserved and updated with the newly extracted data. There are two significant differences between a full and a delta load: the filters used and the update mechanism.

Tip

You should almost always use delta loading, especially for large data volumes, to reduce both the load on your data connection and the execution time. This also allows you to extract data more frequently.

Extraction Filter

The delta load uses a conjunction of the normal filter and the delta filter. Typically, the delta filter field should contain a filter statement that only extracts the newest data, i.e. data that is either not in the database yet or that has been updated since the last extraction. To make specifying this filter easier, you can use dynamic task parameters with the operation type FIND_MAX, which query the existing database and return the maximum value of a column. This dynamic parameter can then be used to restrict the data volume to only the newest data.

Example: table1 in your source system contains the column LAST_CHANGED_DATE, which holds the date on which the last change to the row occurred. If you use a dynamic parameter "LastChangedDate" to retrieve the maximum LAST_CHANGED_DATE from the already extracted data, you get the date of the most recent change that has been loaded. Then, you can use the following delta filter:

LAST_CHANGED_DATE >= <%= LastChangedDate %>

This filter retrieves all rows that have been changed on or after the most recent change already in the database. Using >= rather than > means a few rows may be extracted again, but rows sharing the same change date are not missed.
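
Conceptually, the FIND_MAX parameter evaluates a query like the following against the already loaded data and substitutes the result into the delta filter (a sketch of the idea; the actual query is generated by Celonis):

SELECT MAX(LAST_CHANGED_DATE) FROM table1;

-- Because the delta load combines the normal filter and the delta filter,
-- an extraction with the hypothetical normal filter CLIENT = '100' would
-- effectively run with a condition like:
-- CLIENT = '100' AND LAST_CHANGED_DATE >= <%= LastChangedDate %>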

Update Mechanism

The following process is used to update the existing data (see the SQL sketch after this list):

  1. Retrieve the primary key or keys from the extracted data. The primary key information is taken either from the source system or, if available, from the extraction configuration.

  2. Delete all rows of the existing data table that have the same primary key(s) as the new data.

  3. Insert all rows from the new data into the existing data table.
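
In SQL terms, steps 2 and 3 behave like the following sketch, assuming the newly extracted rows have arrived in a staging table table1_delta and that ID is the primary key (both names are illustrative):

-- Step 2: delete existing rows whose primary key appears in the new data.
DELETE FROM table1
WHERE ID IN (SELECT ID FROM table1_delta);

-- Step 3: insert all newly extracted rows.
INSERT INTO table1
SELECT * FROM table1_delta;

Because rows are first deleted and then re-inserted, updated rows replace their old versions, while entirely new rows are simply appended.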

Actions while an execution is in progress

While a job is running, you can view the current status in the log tab or in the execution history. You can also modify its configuration without interfering with the current execution, and you can cancel the execution.

Cancelling a job execution

When a job is running, you can cancel its execution by clicking the "Cancel Execution" button. This attempts to terminate the running processes of this job both in the application itself and in remote systems. If a schedule attempts to execute a job that is already running, the scheduled execution is cancelled automatically.

Modifying a job during execution

You can access and change the configuration of all tasks of a job as well as data pool parameters. However, this only affects future executions of the job. Even if a task has not started yet, its configuration is taken from the point in time at which the whole job execution began.

Behavior during software updates

When the software is updated, Data Jobs pick up where they left off after the update is complete. If the update happens during a transformation, you might see that transformation twice in the logs of the Data Job because it is rolled back the first time and then executed again. This does not affect the outcome of the Data Job execution.