Pipeline End to End Overview

End to End Flow

Celonis has developed a dedicated SAP Extractor to ensure a continuous data pipeline between EMS Cloud and the source SAP system. After the pipeline components are set up and the connection is established, users can schedule extraction jobs and fetch the data from SAP tables into EMS.

These are the components involved in the flow.

Celonis Intelligent Business Cloud: This is where the user defines the extraction job, i.e. which tables should be extracted, which filters should be applied, the extraction schedule, etc.

Extractor is the middleman between EMS and the SAP system. Its role is to poll and fetch job requests from Celonis EMS, and then submit the execution information to the Celonis RFC Module in SAP. Once the data has been extracted, the Extractor fetches it from SAP and sends it back to EMS. It is installed on a dedicated server in the customer network.

Celonis RFC Module is responsible for extracting data from the SAP database. It receives the job metadata from the Extractor, i.e. which table, which columns, which filters, etc., and then generates a background job in SAP. The job extracts the data and writes it to CSV files in a directory that is monitored by the Extractor. This package is imported into SAP and contains the programs and function modules listed below.

The diagram below shows the components involved and their interaction; the next sections explain each component in more detail.

  1. The user triggers an extraction in Celonis Cloud

  2. The extraction request is published to a message queue, which is polled in real time by the on-premise Extractor. (Note: this indirect connection was implemented for security reasons, to avoid inbound network calls into the customer network)

  3. The Extractor "translates" the extraction request and sends it to the RFC Module in SAP

  4. The RFC Module reads the data from the database according to the column and filter definitions

  5. The RFC Module writes the data to CSV files on the shared network drive in chunks (by default 50k rows per file). The data is pseudonymised before being written to the files

  6. The Extractor continuously polls the folder and fetches the files as they arrive

  7. The Extractor converts the CSV files to Parquet

  8. The Extractor pushes the Parquet files to Celonis Cloud using the Data Push API
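For illustration, the sketch below compresses steps 2 through 8 into a single loop. This is a minimal sketch, not the actual Extractor implementation (which is a Java service): the EMS and RFC client objects, their method names, and the request format are hypothetical stand-ins; only the /CELONIS/EX_* function names come from the RFC Module described below.

    import io
    import time

    import pyarrow.csv as pacsv
    import pyarrow.parquet as papq

    def to_parquet(csv_bytes):
        """Step 7: convert a fetched CSV chunk to Parquet (pyarrow used here)."""
        table = pacsv.read_csv(io.BytesIO(csv_bytes))
        buf = io.BytesIO()
        papq.write_table(table, buf)
        return buf.getvalue()

    def run_one_extraction(ems, rfc, poll_interval=5):
        """Steps 2-8; `ems` and `rfc` are hypothetical client objects."""
        request = ems.poll_next_request()            # step 2: outbound poll only;
        if request is None:                          # no inbound calls are needed
            return
        job = rfc.call("/CELONIS/EX_NEW", request)   # step 3: schedule the SAP job
        while True:
            status = rfc.call("/CELONIS/EX_STATUS", job)
            for name in rfc.call("/CELONIS/EX_LIST_FILES", job):  # step 6
                csv_bytes = rfc.call("/CELONIS/EX_UPLOAD_POLL", name)
                ems.data_push(to_parquet(csv_bytes))   # steps 7-8: convert and upload
                rfc.call("/CELONIS/EX_CLEANUP", name)  # delete the file on the SAP side
            if status != "RUNNING":
                break
            time.sleep(poll_interval)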

[Diagram: end-to-end data flow between Celonis Cloud, the Extractor, and the SAP RFC Module]
The Extractor service

This component is responsible for the communication between Celonis Data Integration and the RFC Module. It is a Java application installed on the customer's premises. It polls EMS for extraction requests and then makes RFC calls to the RFC Module to trigger the extraction process. Once the files arrive, they are uploaded to EMS.

More details about it can be found here.

RFC Module

This component is responsible for fetching the data from SAP and writing it to CSV files. The RFC extractor is distributed as an SAP transport, and is installed directly in the SAP system.

The RFC extractor is written in ABAP. Its setup involves the following steps.

  • Import the Transports into SAP

  • Define a System user for Celonis

  • Upload the role template and assign it to the Celonis user

  • Set up the Logical Path Z_CELONIS_TARGET

More details about it can be found here.

RFC Module Description

Explanation of SAP extraction logic

The entry point of the extraction process is the RFC-enabled function /CELONIS/EX_NEW. The Extractor service passes several parameters to this function, i.e. the table, columns, filters, and other extraction metadata. /CELONIS/EX_NEW then creates a background job, scheduling it to run the program /CELONIS/EXTRACTION.
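As an illustration of this entry point, the snippet below invokes the function through SAP's open-source PyRFC connector rather than the Java Extractor service. The connection values and the parameter names passed to /CELONIS/EX_NEW are hypothetical; the module's actual signature is defined by the imported transport.

    from pyrfc import Connection

    # Example connection values for the SAP application server (placeholders).
    conn = Connection(ashost="sap.example.com", sysnr="00",
                      client="100", user="CELONIS_USER", passwd="********")

    # Hypothetical parameter names; the real signature is not reproduced here.
    result = conn.call("/CELONIS/EX_NEW",
                       IV_TABLE="VBAK",
                       IV_COLUMNS="VBELN,ERDAT,AUART",
                       IV_FILTER="ERDAT >= '20230101'")
    print(result)  # would include a handle for the scheduled background job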

During an extraction, a cursor is opened on the table being extracted with a default buffer size of 10,000 lines (the number of lines held in memory at a time). Those lines are written to CSV files in the location specified by the logical path Z_CELONIS_TARGET. When a CSV file exceeds a specific number of lines (default: 50,000), it is closed and then, depending on the specified compression type, compressed. Afterwards the extraction continues writing the next file.
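The chunking behaviour can be pictured with this Python sketch, a stand-in for the ABAP program that assumes a generic DB-API cursor; the two constants mirror the defaults described above.

    import csv

    FETCH_SIZE = 10_000   # cursor buffer: lines held in memory at a time
    CHUNK_ROWS = 50_000   # rows per CSV file before it is closed and rotated

    def extract(conn, table, columns, where, out_prefix):
        """Stream a table into CSV chunk files, mirroring the defaults above."""
        cur = conn.execute(f"SELECT {', '.join(columns)} FROM {table} WHERE {where}")
        part = 0
        rows_in_file = CHUNK_ROWS          # force opening the first file
        writer, fh = None, None
        while True:
            batch = cur.fetchmany(FETCH_SIZE)
            if not batch:
                break
            for row in batch:
                if rows_in_file >= CHUNK_ROWS:
                    if fh:
                        fh.close()         # the real module would compress here
                    part += 1
                    fh = open(f"{out_prefix}_{part:04d}.csv", "w", newline="")
                    writer = csv.writer(fh)
                    writer.writerow(columns)
                    rows_in_file = 0
                writer.writerow(row)
                rows_in_file += 1
        if fh:
            fh.close()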

The Extractor actively polls the directory Z_CELONIS_TARGET via the RFC function /CELONIS/EX_LIST_FILES. As soon as it identifies a CSV file in the directory, the function /CELONIS/EX_UPLOAD_POLL is invoked to fetch the file from the SAP file system to the Extractor server. Once the transfer is complete, the file is removed from the SAP file system via an RFC call to /CELONIS/EX_CLEANUP. If the user cancels the job, the Extractor calls /CELONIS/EX_CANCEL, which cancels the job in SAP.

While the extraction is in progress, the Extractor actively polls the background job status by calling /CELONIS/EX_STATUS. Once the job is completed, the Extractor fetches the job logs via /CELONIS/EX_READ_LOG and publishes them to Celonis as part of the extraction logs.

How does a filter work

Filters are appended to the select statement of the cursor. It is important to note that arbitrary filters are possible on the tables, meaning that the performance of an extraction can be greatly influenced by correct filtering on indexed columns.
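Conceptually, this amounts to appending predicate strings to the cursor's select statement, as in this minimal sketch (table and column names are illustrative):

    def build_select(table, columns, filters):
        """Filters arrive as predicate strings appended to the WHERE clause."""
        stmt = f"SELECT {', '.join(columns)} FROM {table}"
        if filters:
            stmt += " WHERE " + " AND ".join(f"({f})" for f in filters)
        return stmt

    # A filter on an indexed column keeps the cursor scan cheap:
    print(build_select("BKPF", ["BELNR", "BUDAT"],
                       ["BUKRS = '1000'", "BUDAT >= '20230101'"]))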

How does a join work

Joins are not translated directly into the select query; instead, each join uses its own cursor. After a chunk of the join partner has been read, a FOR ALL ENTRIES statement is used to select from the target table. Additional filters are appended as usual.
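The following Python sketch approximates the FOR ALL ENTRIES pattern with chunked IN-list queries against a generic DB-API connection (the function and parameter names are illustrative; the actual ABAP implementation is not reproduced here):

    def select_for_all_entries(conn, parent_keys, target_table, key_col,
                               target_cols, extra_filter=None, chunk=1000):
        """Approximates ABAP's FOR ALL ENTRIES with chunked IN-list selects."""
        results = []
        for i in range(0, len(parent_keys), chunk):
            batch = parent_keys[i:i + chunk]
            placeholders = ", ".join("?" for _ in batch)
            sql = (f"SELECT {', '.join(target_cols)} FROM {target_table} "
                   f"WHERE {key_col} IN ({placeholders})")
            if extra_filter:
                sql += f" AND {extra_filter}"   # additional filters appended as usual
            results.extend(conn.execute(sql, batch).fetchall())
        return results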

Other RFC Operations
Selecting Tables/Columns to Extract

The user can select which tables and columns to extract in the Celonis Cloud. The table list is displayed to the user by calling the functions /CELONIS/EX_LOOKUP_TABLES and /CELONIS/EX_LIST_TABLES. To list the columns, the function /CELONIS/EX_GET_METADATA is called.

Logging System Information

We log several system-level parameters for internal logging purposes, such as the NetWeaver version and database type. The function /CELONIS/EX_SYSINFO is invoked to fetch this information before the start of each extraction.

Programs

The following Programs are included in the package /CELONIS/DATA_EXTRACTION:

  • /CELONIS/EXTRACTION: This is the program that executes the extraction. It is executed via a background job.

  • /CELONIS/CLEANUP_CL_TABLE: This program executes the cleanup of the change log tables. It is relevant only for real-time extractions.

  • /CELONIS/CLEANUP: This program removes obsolete files from the directory Z_CELONIS_TARGET. Its execution should be set up manually via a background job.

Function modules

The following Function Modules are included in the package /CELONIS/DATA_EXTRACTION:

  • /CELONIS/EX_CANCEL: Cancels a running extraction

  • /CELONIS/EX_CLEANUP: Cleans up old extraction files

  • /CELONIS/EX_CONFIGURATION_TEST: Tests the functionality of the extraction

  • /CELONIS/EX_GET_METADATA: Gets the metadata of the extraction tables

  • /CELONIS/EX_LIST_FILES: Lists extraction files

  • /CELONIS/EX_LIST_TABLES: Lists extraction tables

  • /CELONIS/EX_LOOKUP_TABLES: Looks up the table data in DD02L

  • /CELONIS/EX_NEW: Starts a new extraction

  • /CELONIS/EX_READ_LOG: Reads the job log of an extraction

  • /CELONIS/EX_STATUS: Gets the current status of an extraction

  • /CELONIS/EX_SYSINFO: Looks up the SAP system information

  • /CELONIS/EX_UPLOAD_POLL: Exports the extracted CSV file to the Extractor server

  • /CELONIS/EX_UPLOAD: Uploads a result file to the connector [Deprecated after the RFC version 2019-06-14]

  • /CELONIS/EX_CL_GET_TABLE: Gets the change log table

  • /CELONIS/EX_CL_NEW: Starts a new change log extraction

  • /CELONIS/EX_CL_SET_EXTRACTED: Sets changes to extracted

Used external functionality

We use a few standard SAP function modules in our extraction flow:

  • CALCULATE_HASH_FOR_CHAR: For pseudonymization functionality. The SHA-1 algorithm is used by default (see the sketch after this list).

  • VIEW_AUTHORITY_CHECK: To check for read access on a table that is being extracted

  • DDIF_FIELDINFO_GET: To read metadata of a table

  • BP_JOB_ABORT: Cancelling a running job

  • JOB_OPEN, JOB_CLOSE: Submitting a background job

  • BP_JOBLOG_READ: Reading the log of a background job

  • BP_JOB_STATUS_GET: Getting the status of a background job

  • SPLIT_LINE: Splitting lines of log messages

  • C_SAPGPARAM: Querying OS parameters like the directory separator character

  • FILE_GET_NAME_USING_PATH: Building the path to the chunk files

  • SXPG_COMMAND_EXECUTE: Executing the compression commands (if applicable)

  • CL_HTTP_CLIENT: Instance methods are used to send the result files to the connector

  • EPS_GET_DIRECTORY_LISTING: List files in the output directory
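To make the pseudonymization concrete: CALCULATE_HASH_FOR_CHAR defaults to SHA-1, which can be approximated in Python as below. This is a sketch; the exact character encoding and output formatting applied inside SAP may differ.

    import hashlib

    def pseudonymize(value):
        """SHA-1 digest of a field value; equal inputs yield equal hashes,
        so joins on pseudonymized key columns keep working."""
        return hashlib.sha1(value.encode("utf-8")).hexdigest().upper()

    print(pseudonymize("MUSTERMANN"))  # deterministic 40-character hex string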

SAP Real Time Extension: Solution Overview

Overview

Data Extraction from SAP is the first step in the process analytics pipeline. The data is replicated from the SAP database into Celonis Analytics Storage, where it is converted into an Event Log and loaded into the Process Mining Engine.

In an ideal world there would be a real-time integration between SAP and Celonis, so that updates to transactional ERP data are immediately propagated into the Celonis Event Log. In reality this is not easy to do, and one of the main challenges is identifying the updated records so that the load can be limited to them. One way to achieve this is timestamp-based delta extraction: each record has a timestamp field that is updated whenever the record is modified, so filtering on that timestamp isolates only the records updated after the last load. However, this approach does not work in all cases, because some very large tables simply do not have timestamps, making it impossible to identify the updated records. As a result, a full load is carried out for these tables, loading the entire data set during each run. In addition, the delta load can sometimes be very slow because the filtering is done on non-indexed fields.
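A timestamp-based delta filter is simply an extra predicate on the extraction query, as in this sketch (the AEDAT column and the high-water-mark handling are illustrative):

    # The high-water mark is persisted after each successful load.
    last_load = "20240101"

    # Appended to the extraction WHERE clause. This only works if the table
    # has such a column, and is only fast if that column is indexed.
    delta_filter = f"AEDAT >= '{last_load}'"
    stmt = f"SELECT * FROM VBAP WHERE {delta_filter}"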

To overcome this limitation and get one step closer to real-time data extraction, Celonis has developed a new approach to delta extraction. The basic idea is that the relevant tables are continuously monitored for changes; once a record is updated, its ID is logged in a dedicated table. The next time an extraction runs, only the records that have been updated and logged are extracted. To be more specific, the extractor inner joins the tables and gets only the updated records. Since the join is done on primary keys, the delta filter is very effective performance-wise.
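In SQL terms, the change-log approach turns the delta filter into an index-backed inner join, roughly as follows (table and column names are hypothetical):

    stmt = """
    SELECT t.*
    FROM VBAK AS t
    INNER JOIN ZCL_VBAK AS cl          -- change log filled by the database trigger
            ON cl.MANDT = t.MANDT
           AND cl.VBELN = t.VBELN      -- primary-key join: selective and fast
    WHERE cl.EXTRACTED = ' '           -- only changes not yet shipped to EMS
    """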

Technology Details
Set Up

As part of enabling real-time extractions, a change log table and a trigger must be created for each source table that should be extracted in real-time mode. The trigger listens for create, update, and delete events on the source table, capturing the primary keys of the modified records and inserting records into the respective change log table. The change log tables and triggers are created during the setup. It is important to note that even though the setup is done from the SAP GUI, the triggers are created at the database level via a native SQL statement in the syntax of the underlying database.

The triggers and log tables are created and maintained via the transaction code /CELONIS/CLMAN_UI, which becomes available after the real-time extractor is imported into SAP.
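The mechanism can be demonstrated end to end with SQLite as a stand-in database; the real setup emits the trigger DDL in the native syntax of your SAP database, and the table and column names here are hypothetical.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE vbak (vbeln TEXT PRIMARY KEY, netwr REAL);
    CREATE TABLE zcl_vbak (vbeln TEXT, extracted TEXT DEFAULT ' ');

    -- SQLite needs one trigger per event; insert and delete triggers
    -- would be created the same way.
    CREATE TRIGGER trg_vbak_upd AFTER UPDATE ON vbak
    BEGIN
        INSERT INTO zcl_vbak (vbeln) VALUES (NEW.vbeln);
    END;
    """)

    conn.execute("INSERT INTO vbak VALUES ('0001', 100.0)")
    conn.execute("UPDATE vbak SET netwr = 150.0 WHERE vbeln = '0001'")
    print(conn.execute("SELECT vbeln FROM zcl_vbak").fetchall())  # [('0001',)]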

[Screenshot: the /CELONIS/CLMAN_UI change log maintenance transaction]

The triggers are not SAP DDIC objects, so you can create or delete them manually in the PRD environment without unlocking the SAP client for changes.

The log tables, however, are SAP DDIC objects, so they are usually created in DEV and transported forward to QA and PRD.

Extraction Logic

The real-time extractor works together with the standard Celonis continuous SAP extractor (more information on the continuous extractor can be found here). It can be turned on and off at any time from Celonis EMS.

When real-time extraction mode is enabled, the Extractor service calls the RFC function /CELONIS/EX_CL_NEW. For each table, it checks whether real-time extraction has been enabled. If so, it modifies the extraction filter to ensure that only the changed rows are extracted, and then delegates the task to the /CELONIS/EXTRACTION program, which follows the standard extraction flow: writing the data to CSV files and sending it to the Extractor service and then on to EMS.

Once the extraction is complete, the RFC function /CELONIS/EX_CL_SET_EXTRACTED is invoked to flag the extracted records so that subsequent extractions only receive new changes. If the extraction fails, the function is not invoked, the records are not marked as extracted, and subsequent extractions will receive the same (and all new) changes until one succeeds.

The diagram below explains how the real-time extractor operates.

[Diagram: real-time extractor operation]
SAP User Role CELONIS_EXTRACTION in Detail

The following describes in detail what the role contains and why the authorizations are necessary, along with customisation options.

Cross-application Authorization Objects

Authorization Check for RFC Access - So the Extractor can remotely access the functions in the RFC module

Object      Field      Activities/Values
S_RFC       ACTVT      16
S_RFC       RFC_NAME   /CELONIS/46C_EXTRACTION, /CELONIS/CL_EXTRACTION, /CELONIS/EXTRACTION, RFC1, SDIFRUNTIME, SDTX, SRFC, SYST, SYSU
S_RFC       RFC_TYPE   FUGR

Basis: Administration

Background Processing: Operations on Background Jobs - So the RFC module can immediately run the extractions as background jobs

Object      Field      Activities/Values
S_BTCH_JOB  JOBACTION  RELE
S_BTCH_JOB  JOBGROUP   *

Administration Functions in Change and Transport System - So the RFC module can list directory contents

No longer necessary in RFC Module versions >= 1.8.0

Object      Field       Activities/Values
S_CTS_ADMI  CTS_ADMFCT  EPS1

Authorization for file access - So the RFC module can write, read, and delete files in the physical path defined for the logical path 'Z_CELONIS_TARGET'

Object     Field     Activities/Values
S_DATASET  ACTVT     06, 33, 34
S_DATASET  FILENAME  *
S_DATASET  PROGRAM   /CELONIS/*

Note

You can replace the '*' in 'FILENAME' with the physical path you have chosen for Z_CELONIS_TARGET, e.g. /<YOUR_PATH>/*

Authorization to Execute Logical Operating System Commands - So the RFC module can compress data it extracts

Object     Field     Activities/Values
S_LOG_COM  COMMAND   Z_CELO_GZIP, Z_CELO_SAPCAR
S_LOG_COM  HOST      *
S_LOG_COM  OPSYSTEM  *

Note

Unnecessary if using 'Uncompressed' or 'Native SAP Compression' in the EMS Data Connection

Table Maintenance (via standard tools such as SM30) - So the RFC module can extract data from tables

Object      Field      Activities/Values
S_TABU_DIS  ACTVT      03
S_TABU_DIS  DICBERCLS  *

Note

You can replace the '*' in 'DICBERCLS' with the authorisation group of the tables you will be extracting.

Alternatively, if you need to control access to individual tables instead of groups of tables, you can use the authorisation object S_TABU_NAM.

Important: When using the Real-Time Extractor, the change log tables (ZCL...) should also be whitelisted.

Extracting the table STXL
Context

The SAP table STXL contains encoded data stored in the column CLUSTD. This data is not human-readable unless decoded, and since our standard RFC Module reads the text from the database without any conversion, it is not suitable for this use case. The data, however, is important for some business use cases, and we need a solution to fetch it into Celonis EMS as readable text.

Solution Overview

Rather than extending the RFC Module functionality, we have developed a separate program (the attachment is at the end of the page) that decodes STXL into a readable format using the native SAP function READ_TEXT, and then writes this data into another table, ZSTXL.

This table can afterwards be extracted using our standard RFC Extractor.

The Setup

The converter program is distributed in the package ZSTX_CONVERT. It contains the following objects:

  • Table ZSTXL: stores the converted data

  • Program ZRP_STX_CONVERT: performs the conversion

[Screenshot: contents of the package ZSTX_CONVERT]

The package is distributed as an SAP transport, which should be imported into the SAP system. The user can create a variant of the report ZRP_STX_CONVERT and then execute it regularly via an SAP background job to convert the STXL data. The running interval is up to the user and depends on how frequently they want to refresh this data.

The report supports several parameters. The parameter PACKSZ defines the number of rows to process in memory at a time, and the other parameters allow filtering of the data to be processed.

[Screenshot: selection screen of the report ZRP_STX_CONVERT]

File: STXL Conversion.zip (modified Oct 25, 2021 by Celonis)
