Celonis Product Documentation

The Duplicate Checking patterns

The algorithm is at the heart of the Duplicate Checking App (see "The algorithm" in the section "Before getting started"). It groups invoices that are suspected to be duplicates, based on different matching patterns across four fields: Vendor Name, Invoice Value, Invoice Reference, and Document Date. The following sections describe each pattern and its determination logic.

Exact Match:

The four fields, Invoice Value, Document Date, Invoice Reference, and Vendor Name, match exactly.

Similar Vendor:

The three fields, Invoice Value, Document Date, and Invoice Reference, match exactly. The field, Vendor Name, is a fuzzy match. The following criteria are used to determine the fuzzy match:

  1. Remove all special characters, white space, etc., keeping only letters and digits (that is, remove every character matching [^a-zA-ZА-z\d]), e.g. "A R-AMC1234" <-> "AR AMC\ 1234"

  2. Remove all company keywords such as "corp", "llc", etc.

  3. Check for matches using the given string similarity metric and the given threshold. The similarity is 1 if the value of the string similarity metric is greater than the threshold.

Example:

similar_vendor_pattern.png
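
The three vendor-name steps above can be sketched in Python as follows. This is a minimal illustration, not the app's implementation: the keyword list, the 0.9 threshold, the lowercasing, the ASCII-only character set, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions made for the example. Keywords are removed per token here so that "corp" is only dropped when it appears as a whole word.

```python
import re
from difflib import SequenceMatcher

# Hypothetical values: the documentation only names "corp" and "llc"
# and does not state a default threshold.
COMPANY_KEYWORDS = ("corp", "llc")
THRESHOLD = 0.9


def normalize_vendor(name: str) -> str:
    """Steps 1 and 2: keep only letters/digits and drop company keywords."""
    # Splitting on every non-alphanumeric run removes special characters
    # and white space; lowercasing is an assumption of this sketch.
    tokens = re.split(r"[^a-z\d]+", name.lower())
    return "".join(t for t in tokens if t and t not in COMPANY_KEYWORDS)


def vendor_similarity(a: str, b: str, threshold: float = THRESHOLD) -> int:
    """Step 3: return 1 if the metric exceeds the threshold, else 0."""
    # SequenceMatcher.ratio() stands in for "the given string similarity metric".
    ratio = SequenceMatcher(None, normalize_vendor(a), normalize_vendor(b)).ratio()
    return 1 if ratio > threshold else 0
```

With these assumptions, `vendor_similarity("Acme Corp.", "ACME LLC")` yields 1, because both names reduce to the same cleaned string before the metric is applied.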

Similar Reference:

The three fields, Invoice Value, Document Date, and Vendor Name, match exactly. The field, Invoice Reference, is a fuzzy match. The following criteria are used to determine the fuzzy match:

  1. Remove all special characters, white space, etc., keeping only letters and digits (that is, remove every character matching [^a-zA-ZА-z\d]), e.g. "A R-AMC1234" <-> "AR AMC\ 1234"

  2. Check whether the two references are exactly equal except for 0-3 extra characters in one of the two records, e.g. "AR-AMC1234" <-> "A-AMC1234"

  3. Check for common scanning errors such as "8" <-> "B", e.g. "AR-AMC1238" <-> "RA-AMC123B"

  4. Check for transposed characters, e.g. "AR-AMC1234" <-> "RA-AMC1234"

Example:

similar_reference_pattern.png
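
The reference criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's implementation: the function names are hypothetical, the confusion table contains only the "8" <-> "B" pair named in the text, the character set is ASCII-only, transpositions are limited to two adjacent characters, and the criteria are combined with a logical OR.

```python
import re

# Hypothetical confusion table; the documentation only names "8" <-> "B".
SCAN_CONFUSIONS = str.maketrans({"B": "8"})


def normalize_ref(ref: str) -> str:
    """Step 1: keep only letters and digits."""
    return re.sub(r"[^A-Za-z\d]", "", ref)


def extra_chars_match(a: str, b: str, max_extra: int = 3) -> bool:
    """Step 2: equal except for up to max_extra extra characters in one string."""
    short, long_ = sorted((a, b), key=len)
    if len(long_) - len(short) > max_extra:
        return False
    remaining = iter(long_)
    # Subsequence test: every character of the shorter string must appear,
    # in order, in the longer one.
    return all(ch in remaining for ch in short)


def scan_error_match(a: str, b: str) -> bool:
    """Step 3: equal once commonly confused characters are unified."""
    return a.translate(SCAN_CONFUSIONS) == b.translate(SCAN_CONFUSIONS)


def transposition_match(a: str, b: str) -> bool:
    """Step 4: equal except for one swap of two adjacent characters."""
    if len(a) != len(b):
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and a[diff[0]] == b[diff[1]]
        and a[diff[1]] == b[diff[0]]
    )


def reference_fuzzy_match(a: str, b: str) -> bool:
    a, b = normalize_ref(a), normalize_ref(b)
    return extra_chars_match(a, b) or scan_error_match(a, b) or transposition_match(a, b)
```

Each check runs on the cleaned strings, so separators such as "-" or spaces never count as extra characters.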

Similar Date:

The three fields, Invoice Value, Invoice Reference, and Vendor Name, match exactly. The field, Document Date, is a fuzzy match. The following criteria are used to determine the fuzzy match:

  1. Exact same date e.g. "2020-11-09" <-> "2020-11-09"

  2. Month and day swapped e.g. "2020-01-02" <-> "2020-02-01"

  3. Month differs (a common entry error), e.g. "2020-07-02" <-> "2020-06-02"

  4. Dates less than 7 days apart, e.g. | "2020-07-02" - "2020-07-08" | < 7 days. This threshold is based on experience with customers.

Example:

similar_date_pattern.png
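
The date criteria above can be sketched in Python as follows. This is a minimal illustration, not the app's implementation: the function names are hypothetical, the criteria are assumed to be combined with a logical OR, and the third criterion is interpreted as "same year and day, different month", which is one reading of the example given.

```python
from datetime import date


def day_month_swapped(a: date, b: date) -> bool:
    """Criterion 2: month and day are exchanged within the same year."""
    return a.year == b.year and a.month == b.day and a.day == b.month


def month_differs_only(a: date, b: date) -> bool:
    """Criterion 3 (assumed reading): same year and day, different month."""
    return a.year == b.year and a.day == b.day and a.month != b.month


def within_days(a: date, b: date, max_days: int = 7) -> bool:
    """Criterion 4: absolute distance strictly below max_days."""
    return abs((a - b).days) < max_days


def date_fuzzy_match(a: date, b: date) -> bool:
    # Criterion 1 (exact same date) is the first test.
    return (
        a == b
        or day_month_swapped(a, b)
        or month_differs_only(a, b)
        or within_days(a, b)
    )
```

Note that the comparison in `within_days` is strict, so two dates exactly 7 days apart do not match under that criterion.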

Similar Value:

The three fields, Document Date, Invoice Reference, and Vendor Name, match exactly. The field, Invoice Value, is a fuzzy match. The following criteria are used to determine the fuzzy match:

  1. Compute the (partial) similarity between numeric values. The implemented algorithms are 'step', 'linear', 'exp', 'gauss', and 'squared'. In case of complete agreement the similarity is 1, and in case of complete disagreement it is 0.

  2. Check for transposed digits, e.g. 150 234 <-> 105 234

Example:

similar_value_pattern.png
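
The value criteria above can be sketched in Python as follows. The documentation names the five similarity methods but not their formulas, so the decay shapes below are assumptions chosen for illustration: each returns 1 for identical values, and all except 'step' are scaled to return 0.5 when the values differ by exactly `scale` (a hypothetical parameter). The transposition check assumes integer amounts and a single swap of two adjacent digits.

```python
def numeric_similarity(a: float, b: float, method: str = "linear", scale: float = 1.0) -> float:
    """Partial similarity in [0, 1] between two numeric values (assumed formulas)."""
    d = abs(a - b)
    if method == "step":
        # Full agreement inside the tolerance, none outside it.
        return 1.0 if d <= scale else 0.0
    if method == "linear":
        return max(0.0, 1.0 - d / (2.0 * scale))
    if method == "squared":
        return max(0.0, 1.0 - (d / scale) ** 2 / 2.0)
    if method == "exp":
        # Approaches 0 for large distances without reaching it exactly.
        return 2.0 ** (-d / scale)
    if method == "gauss":
        return 2.0 ** (-((d / scale) ** 2))
    raise ValueError(f"unknown method: {method!r}")


def digits_transposed(a: int, b: int) -> bool:
    """True if the two amounts differ by one swap of two adjacent digits."""
    sa, sb = str(a), str(b)
    if len(sa) != len(sb):
        return False
    diff = [i for i in range(len(sa)) if sa[i] != sb[i]]
    return (
        len(diff) == 2
        and diff[1] == diff[0] + 1
        and sa[diff[0]] == sb[diff[1]]
        and sa[diff[1]] == sb[diff[0]]
    )
```

For the example in the text, `digits_transposed(150234, 105234)` is true because only the adjacent digits "5" and "0" are exchanged.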