Skip to main content

Celonis Product Documentation

BUCKET_UPPER_BOUND - BUCKET_LOWER_BOUND
Description

BUCKET_UPPER_BOUND and BUCKET_LOWER_BOUND calculate histogram buckets and identify the buckets in which the input values reside.

BUCKET_UPPER_BOUND and BUCKET_LOWER_BOUND can be used with INT and FLOAT input columns. The resulting column is of INT or FLOAT type, respectively.

The operator generates a temporary table which is joined to the input column. Depending on which operator is used (BUCKET_UPPER_BOUND or BUCKET_LOWER_BOUND), the result of the operator is a column of upper or lower bucket boundaries (respectively). The result column contains upper/lower bucket boundaries of the owner buckets (buckets into which input elements are sorted). Additionally, all upper/lower boundaries of all histogram buckets are appended, regardless of whether they are empty or contain input column elements.

Syntax
 BUCKET_UPPER_BOUND ( table.column [, SUGGESTED_WIDTH(suggested_width)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
 BUCKET_UPPER_BOUND ( table.column [, SUGGESTED_COUNT(suggested_count)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
 BUCKET_LOWER_BOUND ( table.column [, SUGGESTED_WIDTH(suggested_width)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )
 BUCKET_LOWER_BOUND ( table.column [, SUGGESTED_COUNT(suggested_count)] [, SUGGESTED_TICKS[suggested_ticks] ] [, MAX_COUNT(max_count) ] )

If the query already contains an operator call with an input column and an (optional) histogram configuration, the other operator calls in the same query can be empty:

 BUCKET_UPPER_BOUND ( )
 BUCKET_LOWER_BOUND ( )
  • table.column: The input column that the histogram buckets calculation is based on.

  • suggested_width: Preferred bucket width. This value is taken into consideration when calculating the bucket width. The value of suggested_width defined in the query or a value close to that value is used. The suggested_width must be greater than 0 and cannot be NULL. The type of suggested_width depends on the data type of the input column:

    • INT input column: The suggested_width can be of type INT (example: 4, 5, 6). The suggested_width can also be of type INT represented as FLOAT, (example: 4.0, 5.0, 6.0).

    • FLOAT input column: The suggested_width can be of type FLOAT or INT.

    In case a max_count is configured and suggested_width contradicts its value, suggested_width will be ignored.

  • suggested_count: Preferred bucket count. This value is taken into consideration when calculating the number of buckets in the histogram. The value of suggested_count defined in the query or a bucket count close to the preferred count is chosen. The suggested_count must be of type INT and greater than 0. The value of suggested_count cannot be NULL. In case a max_count is configured and suggested_count contradicts its value, suggested_count will be ignored.

  • suggested_ticks: A list of desired bucket boundaries. If any of the suggested ticks are outside of the input column range, they will be ignored. In case max_count is not configured, tick values are not just suggestions like suggested_width and suggested_count, it is mandatory that all ticks (which are inside the input column range) appear as bucket boundaries in the end result. In case max_count is configured and the biggest possible step defined by suggested_ticks results in a number of buckets greater than max_count, all ticks will be ignored apart from the first one in the list (unsorted). NULL values are not allowed in the list of suggested ticks. The possible data type of the suggested_ticks depends on the data type of the input column, the same as suggested_width:

    • INT input column: suggested can be of type INT (example: 4, 5, 6). suggested_ticks can also be of type INT represented as FLOAT (example: 4.0, 5.0, 6.0). The list suggested_ticks can contain a mix of these values.

    • FLOAT input column: suggested_ticks can be of type FLOAT or INT. The list suggested_ticks can contain a mix of these values.

  • max_count: Maximum bucket count. This value can be used to limit the number of buckets in the end result. In case a suggested_count is configured and it contradicts max_count, the suggested count will be ignored. In case a suggested_width is configured and it contradicts max_count, the suggested width will be ignored. In case suggested_ticks are defined and the biggest step that they allow results in a number of buckets greater than max_count, all of the suggested_ticks will be ignored, apart from the first one mentioned in the list (unsorted). The resulting number of buckets will always be smaller or equal to max_count. Null buckets and outlier buckets are not counted towards max_count. It can happen that the final number of buckets is smaller than max_count if this is needed to make the bucket boundaries user-friendly and easily readable. The max_count must be of type INT and greater than 0. The value of max_count cannot be NULL.

Limitations

Nested operator calls of BUCKET_UPPER_BOUND/BUCKET_LOWER_BOUND are not allowed.

The maximum number of SUGGESTED_TICKS is limited to 10.

The maximum absolute value allowed for the elements of the input column, elements in SUGGESTED_TICKS or SUGGESTED_WIDTH is 10^15.

The number of buckets in the histogram (either defined in the query using SUGGESTED_COUNT or calculated by the algorithm) will not exceed 10^4.

NULL and empty input column handling

Rows from input column, which are NULL, are also NULL in the output column. The BUCKET_UPPER_BOUND/BUCKET_LOWER_BOUND operators ignore NULL value rows when doing the histogram calculation.

BUCKET_LOWER_BOUND and BUCKET_UPPER_BOUND on an empty column result in an empty result column, when no additional configuration is specified.

Configuration propagation

In the case when there are multiple calls to BUCKET_UPPER_BOUND/BUCKET_LOWER_BOUND, it is possible to leave all calls empty apart from one, like so: BUCKET_UPPER_BOUND ( ) or BUCKET_LOWER_BOUND ( ) and the bucketing configuration (SUGGESTED_WIDTH, SUGGESTED_COUNT, SUGGESTED_TICKS, MAX_COUNT) will be propagated from the operator which contains the configuration to all the empty calls.

Multiple calls to BUCKET_UPPER_BOUND/BUCKET_LOWER_BOUND with histogram configurations that differ from one another can not exist, since a common table for these operators does not exist.

Examples

[1]

BUCKET_LOWER_BOUND and BUCKET_UPPER_BOUND on an INT column with COUNT_TABLE. The result shows the calculated histogram buckets and the number of input column elements belonging to each bucket.

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."Column" )
        

Column2

         BUCKET_UPPER_BOUND ( "Table"."Column" )
        

Column3

         COUNT_TABLE ( "Table" )
        

Input

Output

Table

Column : int

1

5

9

10

16

Result

Column1 : int

Column2 : int

Column3 : int

0

2

1

2

4

0

4

6

1

6

8

0

8

10

1

10

12

1

12

14

0

14

16

0

16

18

1

[2]

BUCKET_LOWER_BOUND and BUCKET_UPPER_BOUND on an INT column with null elements. Each of the two operators results in one column. The first five values are the lower/upper bucket boundaries of the buckets that the non-null input column elements belong to. The NULL element is placed into the [null - null] bucket. These values are joined to the corresponding input values. Then, the lower/upper bucket boundaries of all histogram buckets are appended, including the [null-null] bucket. The appended values are not joined to any value from the input column.

In this example we can also notice that one of the values (50) is "far away" from where the majority of the elements are. This results in an outlier bucket being created, [20 - (+inf)] which we mark as [20 - null]. Additionally, we can see that this value influences the histogram bucket calculation - now the bucket width is 4, whereas in the previous example it was 2:

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."Column" )
        

Column2

         BUCKET_UPPER_BOUND ( "Table"."Column" )
        

Input

Output

Table

Column : int

1

5

9

10

50

null

Result

Column1 : int

Column2 : int

0

5

5

10

5

10

10

15

20

null

null

null

0

5

5

10

10

15

15

20

20

null

null

null

[3]

BUCKET_LOWER_BOUND on a FLOAT column with SUGGESTED_WIDTH. The first two values are the lower bucket boundaries of the buckets that input column elements belong to. These values are joined to the corresponding input values. Then, the lower bucket boundaries of all histogram buckets are appended. The appended values are not joined to any value from the input column.The resulting bucket width matches the value used in the query:

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."Column" , SUGGESTED_WIDTH ( 0.5 ) )
        

Input

Output

Table

Column : float

0.5

2.5

Result

Column1 : float

0.5

2.5

0.5

1.0

1.5

2.0

2.5

[4]

BUCKET_LOWER_BOUND on a FLOAT column with SUGGESTED_WIDTH. Here we can see that the SUGGESTED_WIDTH used in the query (4) is just taken as a suggestion, the resulting bucket width is different (5):

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."Column" , SUGGESTED_WIDTH ( 4 ) )
        

Input

Output

Table

Column : int

-2

2

6

10

Result

Column1 : int

-5

0

5

10

-5

0

5

10

[5]

BUCKET_LOWER_BOUND on an INT column with SUGGESTED_COUNT. The first five values are the lower bucket boundaries of the buckets that input column elements belong to. Then, the lower bucket boundaries of all histogram buckets are appended. The resulting bucket count is the same as the one used in the query, not counting the outlier buckets:

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."Column" , SUGGESTED_COUNT ( 10 ) )
        

Input

Output

Table

Column : int

-50

-5

5

5

100

Result

Column1 : int

null

-5

5

5

25

null

-20

-15

-10

-5

0

5

10

15

20

25

[6]

Example of input column propagation. BUCKET_UPPER_BOUND will use the same input column as BUCKET_LOWER_BOUND :

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."InputColumn" )
        

Column2

         BUCKET_UPPER_BOUND ( )
        

Input

Output

Table

InputColumn : int

1

5

9

10

50

null

Result

Column1 : int

Column2 : int

0

5

5

10

5

10

10

15

20

null

null

null

0

5

5

10

10

15

15

20

20

null

null

null

[7]

In this example, we filter based on the values in the FilterColumn. The filter only keeps edges where the FilterColumn equals 1. Nevertheless, all input elements are used when calculating histogram buckets:

Query

Filter

         FILTER ( "Table"."FilterColumn" = 1 );
        

Column1

         BUCKET_UPPER_BOUND ( "Table"."InputColumn" )
        

Input

Output

Table

InputColumn : int

FilterColumn : int

1

0

5

1

9

0

10

0

50

1

null

1

Result

Column1 : int

10

null

null

5

10

15

20

null

null

[8]

In this example, we filter based on the values in the FilterColumn. The FILTER_TO_NULL is applied to the input column, which means that only the values where FilterColumn equals 1 are kept and used to calculate histogram buckets:

Query

Filter

         FILTER ( "Table"."FilterColumn" = 1 );
        

Column1

         BUCKET_UPPER_BOUND ( FILTER_TO_NULL ( "Table"."InputColumn" ) )
        

Input

Output

Table

InputColumn : int

FilterColumn : int

1

0

5

1

9

0

10

0

50

1

null

1

Result

Column1 : int

10

60

null

10

20

30

40

50

60

null

[9]

Example with MAX_COUNT. SUGGESTED_WIDTH contradicts MAX_COUNT and is ignored in favor of the MAX_COUNT value. Null buckets and outlier buckets are not counted towards MAX_COUNT.

Query

Column1

         BUCKET_LOWER_BOUND ( "Table"."InputColumn" , SUGGESTED_WIDTH ( 2 ) , MAX_COUNT ( 2 ) )
        

Column2

         BUCKET_UPPER_BOUND ( )
        

Column3

         COUNT_TABLE ( "Table" )
        

Input

Output

Table

InputColumn : int

1

5

9

10

50

Result

Column1 : int

Column2 : int

Column3 : int

0

10

3

10

20

1

20

null

1