CLUSTER_VARIANTS

Description

The CLUSTER_VARIANTS operator groups similar process variants (or traces) into clusters. Variants which cannot be assigned to a cluster are marked as noise.

Warning

Operator Performance This operator performs very expensive computations and requires excessive memory and CPU resources. To avoid running out of memory, this operator is currently limited to 10,000,000 distinct variants.

Computation Times In some settings, this operator may require excessive CPU time. If the execution time exceeds 10 minutes, the execution is stopped and an according error is reported.

Syntax

 CLUSTER_VARIANTS ( variant_column, MIN_PTS, EPSILON )

variant_column: The column which stores the result of the VARIANT operator.
MIN_PTS: INT value giving the minimal density of similar variants that is required to create a cluster. Lower values tend to create more clusters, while higher values tend to classify more variants as noise. An estimate of a well-performing parameter value can be computed by the ESTIMATE_CLUSTER_PARAMS operator.
EPSILON: INT value giving the search radius for measuring the variant density. It is quantified by the number of different relations between two subsequent activities in the variants. The value must be an integer in the range [0, 5]. The higher the value, the more it is likely that all variants are assigned to the same cluster. It is recommended to choose a quite low value (e.g. 2). However, the value of 0 requires equality between variants to be clustered.

Result

A case-based INT column in which the number represents the ID of the cluster, to which the case (and its variant respectively) has been assigned to. A cluster ID of -1 indicates that the case has been classified as noise and a cluster ID of -2 indicates that the respective case does not have any activities (i.e. the case is empty).

NULL handling

Since the computation depends on the VARIANT operator, this operator only needs to handle NULL values if VARIANT also returns NULL (i.e. a case has only NULL-value activities). The cluster ID will be then -2.

Example

[1]

Variant 1 (trace A, B, C of cases 1, 2, 3, and 4) and variant 3 (trace A, B, B, C of case 7) are similar and grouped into the same cluster (ID 0), while variant 2 (trace A, B, D of cases 5 and 6) forms a cluster on its own (ID 1). The variant 4 (trace X, Y, Z of case 8) is classified as noise due to its missing similarity to other traces and its low frequency (it occurs only once).

Query

Column1

         "Cases"."Case"

Column2

         VARIANT ( "Activities"."Activity" )

Column3

         CLUSTER_VARIANTS ( VARIANT ( "Activities"."Activity" ) , 2 , 2 )

Input

Output

Activities

Case : int	Activity : string	Timestamp : date
1	'A'	Fri Jan 01 2016 01:00:00.000
1	'B'	Fri Jan 01 2016 02:00:00.000
1	'C'	Fri Jan 01 2016 03:00:00.000
2	'A'	Fri Jan 01 2016 04:00:00.000
2	'B'	Fri Jan 01 2016 05:00:00.000
2	'C'	Fri Jan 01 2016 06:00:00.000
3	'A'	Fri Jan 01 2016 07:00:00.000
3	'B'	Fri Jan 01 2016 08:00:00.000
3	'C'	Fri Jan 01 2016 09:00:00.000
4	'A'	Fri Jan 01 2016 10:00:00.000
4	'B'	Fri Jan 01 2016 11:00:00.000
4	'C'	Fri Jan 01 2016 12:00:00.000
5	'A'	Fri Jan 01 2016 13:00:00.000
5	'B'	Fri Jan 01 2016 14:00:00.000
5	'D'	Fri Jan 01 2016 15:00:00.000
6	'A'	Fri Jan 01 2016 16:00:00.000
6	'B'	Fri Jan 01 2016 17:00:00.000
6	'D'	Fri Jan 01 2016 18:00:00.000
7	'A'	Fri Jan 01 2016 19:00:00.000
7	'B'	Fri Jan 01 2016 20:00:00.000
7	'B'	Fri Jan 01 2016 21:00:00.000
7	'C'	Fri Jan 01 2016 22:00:00.000
8	'X'	Fri Jan 01 2016 23:00:00.000
8	'Y'	Sat Jan 02 2016 00:00:00.000
8	'Z'	Sat Jan 02 2016 01:00:00.000

Cases

Case : int
1
2
3
4
5
6
7
8

Foreign Keys

Cases.Case

Activities.Case

Result

Column1 : int	Column2 : string	Column3 : int
1	'A, B, C'	0
2	'A, B, C'	0
3	'A, B, C'	0
4	'A, B, C'	0
5	'A, B, D'	1
6	'A, B, D'	1
7	'A, B, B, C'	0
8	'X, Y, Z'	-1