Skip to main content

Celonis Product Documentation

KMEANS
Description

KMeans finds clusters based on the K-means++ algorithm. It assigns a number to every cluster.

KMeans consists of two steps, as in Linear Regression. The first step is training a model, meaning finding the centroids of the clusters. Second, cluster the values by the model. Our KMeans implementation is very flexible and allows the user to specify on which data the algorithm should train, and which data it should cluster.

There is also a simple KMeans operator, which just expects the number of clusters.

Syntax

Simple Kmeans. It creates K clusters based on all other inputs in the query:

 KMEANS ( k )
  • k: Number of clusters.

Flexible Kmeans:

 KMEANS ( TRAIN_[FILTERED_]KM ( [ EXCLUDED ( table.exclude_column, ... ), ] INPUT ( table.input_column, ... ), k ), CLUSTER ( table.cluster_column, ... ) )
  • TRAIN_KM: Finds K centroids in the cluster. Authorization objects are still respected.

  • TRAIN_FILTERED_KM: Finds K centroids in the filtered data.

  • EXCLUDED: Here you can optionally define dimensions which influence the aggregations used for model training, but are not itself included in the training.

  • INPUT: One or more columns, which is used to train the model.

  • k: Number of clusters.

  • CLUSTER: One or more columns, which are clustered based on the trained centroids.

All columns in TRAIN_KM have to be joinable. The columns in CLUSTER do not have to be joinable with the columns in TRAIN_KM.

The input of the model training is regarded as an independent sub query. This means if an aggregation is used, it is independent of the dimensions defined in the rest of the query. This also means that the columns within TRAIN_KM have to be joinable, but not with the columns used in the rest of the query.

NULL handling
  • If a row contains a NULL value, the value is ignored and does not affect the model.

  • If a CLUSTER row contains a NULL value, the result for that row will be NULL.

Filter behavior
Standard KMeans

If rows of a column are filtered, it does not affect the linear model, as long as the kmeans model is not trained on aggregation results. This means independent of filters and selections, the underlying model stays the same. If you want to restrict the input data of a model you can use a CASE WHEN statement and map the values you want to be ignored to null. If a model is trained on results of an aggregation it still changes with the filtering because the result of the aggregation is affected by the filtering.

Filtered KMeans

If a filter or selection changes, the model is retrained and the resulting function adopts to the new of view of data. This has a serious performance impact.

Result

KMeans finds clusters based on the K-means++ algorithm. It assigns a number to every cluster. K-means is not a stable algorithm, so you can get different results when executing the algorithm multiple times. Therefore you must not rely that clusters stay the same.

Use Cases
Examples

[1]

Example using simple KMeans syntax on an aggregation.

Query

Column1

         "Table1"."MONTH"
        

Column2

         sum ( "Table1"."Income" )
        

Column3

         KMEANS ( 3 )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

150

2

450

2

500

3

800

3

900

Result

Column1 : int

Column2 : int

Column3 : int

1

250

0

2

950

2

3

1700

1

[2]

Example returning the same result as above but using advanced syntax.

Query

Column1

         "Table1"."MONTH"
        

Column2

         sum ( "Table1"."Income" )
        

Column3

         KMEANS ( TRAIN_KM ( INPUT ( "Table1"."Month" , SUM ( "Table1"."Income" ) ) , 3 ) , CLUSTER ( "Table1"."Month" , SUM ( "Table1"."Income" ) ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

150

2

450

2

500

3

800

3

900

Result

Column1 : int

Column2 : int

Column3 : int

1

250

0

2

950

2

3

1700

1

[3]

Here, month 2 and 3 are clustered together, even though the income of month 8 (550) would be closer to the income of month 2 (500). This is because we also cluster by the months themselves. If we want to cluster only by the income and still use month as dimension, we need to use the EXCLUDED tag as shown in the next example.

Query

Column1

         "Table1"."MONTH"
        

Column2

         KMEANS ( TRAIN_KM ( INPUT ( "Table1"."Month" , sum ( "Table1"."Income" ) ) , 3 ) , CLUSTER ( "Table1"."Month" , sum ( "Table1"."Income" ) ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

50

2

500

3

650

8

550

Result

Column1 : int

Column2 : int

1

0

2

1

3

1

8

2

[4]

This example is based on the same data as above. But now Month is excluded from the training data. The result is that month 2 and 8 are clustered together. The timeline has now no influence on the clustering but is still used for the aggregation.

Query

Column1

         "Table1"."MONTH"
        

Column2

         KMEANS ( TRAIN_KM ( EXCLUDED ( "Table1"."Month" ) , INPUT ( sum ( "Table1"."Income" ) ) , 3 ) , CLUSTER ( sum ( "Table1"."Income" ) ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

50

2

500

3

650

8

550

Result

Column1 : int

Column2 : int

1

0

2

2

3

1

8

2