Skip to main content

Celonis Product Documentation

LINEAR_REGRESSION
Description

The Linear Regression function models the relationship between input variables by fitting a linear equation to the input data.

The method consists of two steps. First, a model is trained. Second, values are predicted by the model. Linear Regression is integrated into the existing table framework. Currently, model training and prediction can not be separated.

Syntax
  LINEAR_REGRESSION ( TRAIN_[FILTERED_]LM ( INPUT ( table.input_column, ... ), OUTPUT ( table.output_column ) ), PREDICT ( table.predict_column, ... ) )
 
  • TRAIN_LM: Trains a Linear Regression model.

  • TRAIN_FILTERED_LM: The data on which the linear model is trained is filtered.

  • INPUT: One or more columns, which is used to train the model so that it describes the Output.

  • OUTPUT: One column, through which the model lays a predictor function.

  • PREDICT: One or more columns, on which for each row a predicted value is returned by applying the trained model.

All columns in TRAIN_LM have to be joinable. The columns in PREDICT do not have to be joinable with the columns in TRAIN_LM.

The input of the model training is regarded as an independent sub query. This means if an aggregation is used, it is independent of the dimensions defined in the rest of the query. This also means that the columns within TRAIN_LM have to be joinable, but not with the columns used in the rest of the query.

To ensure the correctness of the result, it is important that the matrix that is constructed from the input columns and an additional column of ones (representing the bias) contains at least as many linearly independent rows as there are variables in the linear model. The number of variables is equal to the number of columns (including the bias column). If this condition is not met, the function aborts with an error message.

NULL handling
  • If a row contains a NULL value, the value is ignored and does not affect the model.

  • If a PREDICT row contains a NULL value, the result for that row will be NULL.

Filter behavior
Standard Linear Regression

If rows of a column are filtered, it does not affect the linear model, as long as the linear model is not trained on aggregation results. This means independent of filters and selections, the underlying model stays the same. If you want to restrict the input data of a model you can use a CASE WHEN statement and map the values you want to be ignored to null. If a model is trained on results of an aggregation it still changes with the filtering because the result of the aggregation is affected by the filtering.

Filtered Linear Regression

If a filter or selection changes, the model is retrained and the resulting function adopts to the new of view of data. This has a serious performance impact.

Examples

[1]

Linear regression on time series data.

Query

Column1

         "Table1"."MONTH"
        

Column2

         LINEAR_REGRESSION ( TRAIN_LM ( INPUT ( "Table1"."Month" ) , OUTPUT ( Table1.Income ) ) , PREDICT ( "Table1"."Month" ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

300

2

400

3

300

4

500

Result

Column1 : int

Column2 : float

1

217.647

1

217.647

2

302.941

3

388.235

4

473.529

[2]

Linear regression on the same data as above, but using a constant for prediction.

Query

Column1

         LINEAR_REGRESSION ( TRAIN_LM ( INPUT ( "Table1"."Month" ) , OUTPUT ( Table1.Income ) ) , PREDICT ( 1 ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

300

2

400

3

300

4

500

Result

Column1

217.647

[3]

Example for a higher order linear regression.

Query

Column1

         "Table1"."MONTH"
        

Column2

         LINEAR_REGRESSION ( TRAIN_LM ( INPUT ( "Table1"."Month" , POWER ( "Table1"."Month" , 2 ) ) , OUTPUT ( Table1.Income ) ) , PREDICT ( "Table1"."Month" , POWER ( "Table1"."Month" , 2 ) ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

300

2

400

3

300

4

500

Result

Column1 : int

Column2 : float

1

215.384

1

215.384

2

307.692

3

392.307

4

469.23

[4]

Example for filter behavior on standard linear regression. Despite the filter the full data is used to train the model.

Query

Filter

         Filter "Table1"."Month" > 1;
        

Column1

         "Table1"."MONTH"
        

Column2

         LINEAR_REGRESSION ( TRAIN_LM ( INPUT ( "Table1"."Month" ) , OUTPUT ( Table1.Income ) ) , PREDICT ( "Table1"."Month" ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

300

2

400

3

300

4

500

Result

Column1 : int

Column2 : float

2

302.941

3

388.235

4

473.529

[5]

Same example as above but this time the model is only trained on the filtered data.

Query

Filter

         Filter "Table1"."Month" > 1;
        

Column1

         "Table1"."MONTH"
        

Column2

         LINEAR_REGRESSION ( TRAIN_FILTERED_LM ( INPUT ( "Table1"."Month" ) , OUTPUT ( Table1.Income ) ) , PREDICT ( "Table1"."Month" ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

1

300

2

400

3

300

4

500

Result

Column1 : int

Column2 : float

2

350.0

3

400.0

4

450.0

[6]

An example with too many linearly dependent rows.

Query

Column1

         "Table1"."MONTH"
        

Column2

         LINEAR_REGRESSION ( TRAIN_FILTERED_LM ( INPUT ( "Table1"."Month" , "Table1"."Income" ) , OUTPUT ( Table1.Income ) ) , PREDICT ( "Table1"."Month" , "Table1"."Income" ) )
        

Input

Output

Table1

Month : int

Income : int

1

100

2

200

3

300

4

400

Warning

LINEAR_REGRESSION: There is/are only [2] linearly independent equation(s) (row(s)) in the linear system that is constructed from the input columns and an additional column of ones for the bias, while [3] (number of columns) are required to guarantee the correctness of the result!