Celonis Product Documentation

Optimal Data Job Scheduling

If you have multiple scheduled Data Jobs, the following recommendations are the quickest wins: they speed up your data pipeline and keep your data fresher.

Don’t schedule everything at the same time

Whether in the cloud or not, there is always a machine on the backend processing all the tasks. If it is given too many tasks at once, it becomes overloaded. This results in longer processing times and an increased risk of errors.

People instinctively schedule everything on the full hour (e.g. 10:00). This causes spikes in the utilization of compute resources at the start of each hour, while in the time leading up to the full hour the resources are underutilized.

If you have scheduled more than one job, make sure the jobs are not set to run at the same time. Let’s assume you have 3 jobs, each scheduled every hour on the full hour and each taking 20 minutes. The compute resource is used heavily for 20 minutes and then remains idle for 40 minutes every hour. Instead, schedule the jobs to run one after another: one job on the full hour (x:00), one at 20 minutes past (x:20), and one at 40 minutes past (x:40). After the change, you will most likely notice that each job finishes in less than 20 minutes, because the jobs no longer compete for the same resources. To check at what times the jobs actually run, look at the Schedules and Data Jobs logs.
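
The staggering described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the product: the function name `stagger_minutes` and the job names are hypothetical, and the resulting minute offsets would still have to be entered manually in each job's schedule.

```python
def stagger_minutes(job_names, period_minutes=60):
    """Spread job start times evenly across one scheduling period.

    Hypothetical helper: returns a dict mapping each job name to the
    minute offset (within the period) at which it should start.
    """
    if not job_names:
        return {}
    # Evenly sized slot per job; integer division keeps whole minutes.
    step = period_minutes // len(job_names)
    return {name: index * step for index, name in enumerate(job_names)}


# Example with three hourly jobs (names are placeholders).
offsets = stagger_minutes(["job_a", "job_b", "job_c"])
for name, minute in offsets.items():
    print(f"{name}: x:{minute:02d}")
# prints:
# job_a: x:00
# job_b: x:20
# job_c: x:40
```

An even split like this assumes that the jobs take roughly the same time. If one job is much longer than the others, it would need a proportionally larger slot.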

In case of idle time, increase schedule frequency

If you run your Data Jobs once a day, for example, and they take 3 hours to run (after implementing the recommendation above), the compute resource remains idle for the other 21 hours. If that’s the case, why not increase the schedule frequency? The data will be more up to date, and in every company there will always be a time when someone needs fresher data. Schedule your jobs so that the idle periods are short, but leave a “buffer” break between runs to compensate for fluctuations in duration. How long should the buffer be? That depends on how much the durations of your Data Jobs fluctuate. You can calculate this from the data in the Data Job logs.
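
One possible way to size the buffer is sketched below in Python. It assumes that you have copied the durations of recent runs (in minutes) from the Data Job logs by hand. The sample numbers are invented for illustration, and the rule of thumb used here (a buffer of two standard deviations on top of the mean duration) is an assumption, not an official recommendation. The function name `suggest_interval` is hypothetical.

```python
import statistics


def suggest_interval(durations_minutes, safety_factor=2.0):
    """Suggest a schedule interval from observed run durations.

    Assumption: buffer = safety_factor * sample standard deviation.
    Returns (mean duration, buffer, suggested interval), all in minutes.
    """
    mean = statistics.mean(durations_minutes)
    # stdev needs at least two data points; fall back to no buffer otherwise.
    if len(durations_minutes) > 1:
        spread = statistics.stdev(durations_minutes)
    else:
        spread = 0.0
    buffer = safety_factor * spread
    return mean, buffer, mean + buffer


# Invented durations of five daily runs, each around 3 hours.
durations = [170, 180, 190, 175, 185]
mean, buffer, interval = suggest_interval(durations)

# How many runs of that length fit into one day (1440 minutes).
runs_per_day = int((24 * 60) // interval)

print(f"mean duration: {mean:.0f} min")
print(f"buffer:        {buffer:.0f} min")
print(f"interval:      {interval:.0f} min")
print(f"runs per day:  {runs_per_day}")
```

The more your run times vary, the larger the buffer becomes and the fewer runs fit into a day. A larger `safety_factor` trades freshness for a lower risk of overlapping runs.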