Skip to main content

Celonis Product Documentation

Custom duplicate invoice patterns

With version 2.1.0 (see Release Notes May 2023), we make the algorithm logic customizable. It is now possible to:

  • Define custom patterns that determine a match.

  • Specify a combination of columns to be checked per pattern, such as five columns.

  • Perform more than one fuzzy match per pattern, for example, three exact and two fuzzy matches.

  • Tweak the parameters per fuzzy match; for example, increase the date range to 30 days.

These pattern customizations are done in the Knowledge Model (ML) of a given package and will extend the configuration in the ML Sensor. If customizations are used, the ML Sensor will still determine the knowledge model and the use case that should be applied (“Duplicate Checking Invoices”). The ML Sensor will also still provide the logs. However, everything previously defined in the ML Sensor: the record, the columns to be checked and written back, the reversal flag, and filters, will now be defined in the Knowledge Model (KM). All this is done using a custom object.

Custom object

A custom object in the Knowledge Model stores the parameters formerly defined in the ML Sensor parameters and the newly added search patterns that determine a match. The custom object should be structured as follows:

Under “customObjects”, add an object with an arbitrary “id” and “displayName”. Under “customAttributes” define the key “record” and specify the record id used for all attributes defined in the custom object. With keys “invoiceDate”, “invoiceValue”, “invoiceReference” and “invoiceVendorName”, define the attributes that should be persisted in the result table. With the key “invoiceReversalFlag”, add the reversal flag. The parameter “excludeExactFromFuzzy” determines how fuzzy matches are handled with regard to exact values. Under the key “invoiceFilters”, list the filter IDs that should be applied. Under  “searchPatterns”, define the patterns. It is possible to define multiple custom search patterns.

customObjects:
  - id: ALGORITHM_CUSTOMIZATIONS
    displayName: Custom Matching Rules
    customAttributes:
      record: INVOICE # Required
      invoiceDate: INVOICE_DATE # Required
      invoiceValue: INVOICE_VALUE # Required
      invoiceReference: INVOICE_REFERENCE # Required
      invoiceVendorName: INVOICE_VENDOR_NAME # Required
      invoiceReversalFlag: INVOICE_REVERSAL_FLAG # Optional
      invoiceFilters: # Optional
        - FITLER_DUPLICATE_CHECKER_SHORT_REF
        - FITLER_DUPLICATE_CHECKER_INTERNAL_VENDOR
      excludeExactFromFuzzy: True # Optional (omitted by default)
      searchPatterns: # Required
        REFERENCE_FUZZY:
        ...
Persisted data fields

The algorithm will always persist four columns in the result table. Any additional columns defined in the search pattern will be checked but not persisted in the result table. The four columns to be persisted have to be defined under the fixed keys “invoiceDate”, “invoiceValue”, “invoiceReference” and “invoiceVendorName”.

Each attribute defined there has to be part of the record selected (“INVOICE” by default). The behavior of configuration changes remains consistent with the ML Sensor setup. For instance, changing an attribute results in the renaming of the column in the “DUPLICATE_INVOICES” table based on the attribute’s ID. Make sure to account for that in the variables being used. In case a result table already exists that was previously populated by the ML Sensor configuration, make sure that an attribute is persisted in the same column (e.g. “INVOICE_DATE” should be specified under “invoiceDate”). The data type of an attribute must align with the data type of the corresponding column in the table. In case of a mismatch, it is necessary to recreate

Important

Even if less than four attributes are checked, it is still necessary to specify four attributes to be persisted.

Fuzzy Handling of Exact Values

The “excludeExactFromFuzzy” parameter determines the handling of fuzzy matches for exact values. It can be set to either “True” or “False”. When set to “True,” a fuzzy comparer considers exact values as non-matches. In other words, a match is only found if values are different, i.e., fuzzy. To match cases with exact values, it is necessary to explicitly define a pattern that uses the “Exact” comparer. When set to “False,” a fuzzy comparer also considers exact values as matches. The parameter can be omitted and then set to “True” by default.

Custom search patterns

The way potential duplicate groups are formed is determined by the patterns. Each search pattern consists of a combination of columns and comparers. A single pattern is defined in the following format:

REFVENDORDATEFUZZY: # Required. Name of the pattern. Unique but arbitrary.
  columns: # Required.
  - columnName: INOVICE_REFERENCE # Required. Under the key “columnName", specify the id of the record attribute to be checked.
    comparer: # Required.
        comparerName: Reference # Required. Under the key “comparerName”, specify the comparer name that should be used.
        parameters: # Optional. Tweak parameters of the comparer.
            numDays: 7 # Optional. Parameter name and parameter value
  - ....

The pattern name can be freely defined, but it is recommended to use short names (e.g., A1 or REF_FUZZY) in case there are more than four patterns. This ensures we can persist all applicable pattern names in the “PATTERN” column in the DUPLICATE_INVOICES table. By default, the width for this column is 200 characters. If you run into problems with the width, you can increase it with an “ALTER TABLE” SQL statement after the result table creation.

Multiple attributes (“columnName”) can be specified per pattern. Each attribute defined in a given pattern has to be part of the record selected (“INVOICE” by default). Please note that the default patterns are applied if no search pattern (key “searchPattern”) has been defined in the knowledge model.

Note

The default patterns are applied, if no search pattern (key “searchPattern”) has been defined in the knowledge model. If “searchPattern” has been defined, only the patterns explicitly specified are applied, i.e., the default patterns are not applied automatically.

Important

The current constraint of a search pattern is that it must contain at least one exact column using the “Exact” comparer.

Comparers and parameters

Within the search pattern definition, there is the option to use different comparers and tweak certain parameters. Currently, six comparers can be used:

  • One exact comparer:

    • Exact

  • One different comparer:

    • Diff

  • Four fuzzy comparers:

    • CompanyName

    • Date

    • Reference

    • Value

    Each comparer applies a different matching logic and consists of a list of parameters that can be tweaked.

Important

Please be aware that comparers might have additional parameters by default that cannot be customized. For example, for the “Date” comparer, a check for commonly swapped months and swapped days and months is always performed. Check the Standard Patterns for the entire pattern logic. Furthermore, not specifying the parameters in the pattern definition, results in the default parameters being applied.

The parameters that can be customized for a given comparer are defined as follows:

  • Exact: Designed to find matches of the same value.

  • Diff: Designed to find matches of different values (i.e., not exact).

  • CompanyName: Designed to find similar company names but can be used as a general string comparer.

    • CompanyPatterns: A list of company suffixes to be removed before comparing.

      companyPatterns = [
             " gmbh",
              " ag",
              " llc",
      	…
          ]
    • Method: The method used for calculating the string similarity score. Choose from “jaro” or “levenshtein”.

      "method": "jaro"
    • Threshold: The threshold of the score above which a fuzzy match is considered. The higher, the stricter.

      "threshold": 0.85
  • Date: Designed to find similar invoice dates but can be used as a general date comparer.

    • NumDays: The maximum allowed day difference between two dates.

      "numDays": 7
  • Reference: Designed to find similar invoice references.

    • ConfusionChars: Pairs of letters and numbers that can be mistakenly recognized as the same, for example, “8” and “b.” Use two lists of characters where each position in the character list forms a pair (“8” and “b” form the first pair).

      "confusionChars": ["86i10oqsz", "bgllddd52"]
    • CharMaxErrors: The maximum allowed number of characters skipped (or errors) in one of the strings so that it is equal to the other string. For example, “abcd1234” and “abcd124”. If we skip character “3” in the first string, the two strings would be identical. This constitutes one error. A string of “abc123” and “abca12z3” would constitute two errors since “a” (the second) and “z” can be skipped in the second string to make the two strings match. A string of “abc123” and “abg123” would not be considered a match since skipping “c” does not result in the two strings matching (“ab123” vs. “abg123”), and skipping “g” does not result in the two strings to match (“abc123” vs “ab123”).

      "charMaxErrors": 3
    • TurnerMaxErrors: The maximum number of swapped character pairs (turners). For instance, there are two turners in “ABC” and “BAC” where “A and B” and “B and A” represent a pair of inverted characters. Two characters need to be swapped to turn “ABC” into “BAC” and vice versa. This requires the two strings to have the same characters.

      "turnerMaxErrors": 3	

      Note

      If there is a numerical inversion such as “01” and “10”, we automatically consider this not to be a match. For instance, the references “ABC01” and “ABC10” do not match. This prevents false positives with recurring invoices, often with incremental numbers.

  • Value: Designed to find similarities between invoice values but can be used as a general numeric comparer.

    • MaxPriceLimit: The maximum allowed absolute difference between two numbers.

      "maxPriceLimit": 80.0
    • TurnerMaxErrors: The maximum number of swapped number pairs (turners). For instance, there are four turners in “21500” and “12005”, where “2 and 1”, “1 and 2”, “5 and 0” and “0 and 5” represent a pair of inverted characters. To turn “21500” into “12005” and vice versa, four digits need to be swapped. This requires two numbers to have the same digits.

      "turnerMaxErrors": 5
Pattern Definition - Best Practices

When introducing or modifying custom patterns, it is highly recommended to test the new configuration in a non-production environment initially. This testing environment should include a separate package and data model to avoid interference with the productive setup. Since the behavior of configuration changes remains consistent with the ML Sensor setup, already checked documents will not be re-evaluated with a new custom pattern configuration. To re-check all documents, a manual reset of the result tables is necessary (see Troubleshooting).

Pattern Application

Standard Patterns outline that a pattern is always applied to a pair of documents. However, it can be that different custom patterns that are applicable for a given document pair have been defined simultaneously. For instance, consider two patterns with a fuzzy date check - one having a threshold of 30 days and another with 90 days.

The only reason to define the stricter pattern with a 30-day threshold is to better distinguish the group formation in the front end. With both the 30-day and 90-day patterns listed in the “PATTERN” column, this information can be used to display a more specific group pattern in the view, for example, “Similar Date (short).” However, this clear distinction only works with groups of two documents. Groups with more than two documents can be formed with multiple patterns (see Default patterns), potentially both the 30- and 90-day thresholds.

Pattern Determination

The default group pattern shown to the user is determined by comparing the checked columns and counting distinct values within the group. For instance, if documents within the group share the same vendor name, date and value, but have more than one reference, this translates to the group pattern “Similar Reference”. When defining custom patterns with new or additional columns checked, adjust the variable “GROUP_DUPLICATE_PATTERN”. In case of overlapping patterns, it might also be necessary to incorporate the “PATTERN” column. Returning to the example with the 30 and 90 days threshold and assuming a fifth column is checked, the additional PQL could look like this:

WHEN ... 
PU_COUNT_DISTINCT ( "GROUPS_DUPLICATE_INVOICES" , 
"DUPLICATE_INVOICES"."INVOICE_DATE" , INDEX_ORDER ( 
"GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , ORDER BY ( 
"DUPLICATE_INVOICES"."GROUP_CREATION_DATE" ASC ) , PARTITION BY ( 
"GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , 
"DUPLICATE_INVOICES"."INVOICE_ID" ) ) = 1 ) > 1
AND PU_COUNT_DISTINCT ( "GROUPS_DUPLICATE_INVOICES" , BIND ( 
"DUPLICATE_INVOICES", PU_LAST ( "DISTINCT_INVOICES" , 
"CASE_TABLE"."EXTRA_COLUMN")), INDEX_ORDER ( 
"GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , ORDER BY ( 
"DUPLICATE_INVOICES"."GROUP_CREATION_DATE" ASC ) , PARTITION BY ( 
"GROUPS_DUPLICATE_INVOICES"."GROUP_UUID" , 
"DUPLICATE_INVOICES"."INVOICE_ID" ) ) = 1 )  = 1
AND
PU_LAST ( "GROUPS_DUPLICATE_INVOICES", 
"DUPLICATE_INVOICES"."PATTERN") LIKE '%DOC_DATE_SHORT%'
THEN 'Similar Date (short)'...
Performance Considerations

With custom patterns, there are trade-offs and performance considerations that you should be aware of:

  • The pattern complexity, in combination with the number of documents checked, determines the resource usage.

  • Exact column checks are less resource-intensive than fuzzy checks.

  • The more exact columns checked, the less memory is used. A pattern with three exact columns uses less memory than a pattern with only one exact column.

  • Checking fewer columns and applying less strict patterns naturally results in larger groups. For example, using multiple patterns with only one Exact and one CompanyName comparer, set to a very low threshold, can lead to the creation of groups containing thousands of documents, which should be avoided.

  • Each additional fuzzy check per pattern significantly increases memory usage, especially the “Reference” comparer. If possible, avoid adding multiple fuzzy checks with the “Reference” comparer for a single pattern.

  • We recommend a maximum of 7 columns checked per pattern as they have a medium impact on memory usage.

  • We recommend a maximum of 15 search patterns as they have a minor impact on memory usage. This is because the patterns are independent of each other.

Having a “resource intensive” pattern configuration beyond these recommendations is, of course, possible but comes with the trade-off of being able to check fewer documents simultaneously.

Configuration examples
Default configuration

The configuration of the default search pattern (applied when using the default ML Sensor setup). Removing the parameters results in the default parameters being applied.

"""
customObjects:
  - id: ALGORITHM_CUSTOMIZATIONS 
    displayName: Custom Matching Rules
    customAttributes:
      record: INVOICE
      invoiceDate: INVOICE_DATE
      invoiceValue: INVOICE_VALUE
      invoiceReference: INVOICE_REFERENCE
      invoiceVendorName: INVOICE_VENDOR_NAME
      invoiceReversalFlag: INVOICE_REVERSAL_FLAG
      invoiceFilters:
        - FITLER_DUPLICATE_CHECKER_SHORT_REF
        - FITLER_DUPLICATE_CHECKER_INTERNAL_VENDOR
      searchPatterns:
        REFERENCE_FUZZY:
          columns:
            - columnName: INVOICE_REFERENCE
              comparer:
                comparerName: Reference
                parameters:
                  charMaxErrors: 3
                  turnerMaxErrors: 3
                  confusionChars: ["86i10oqsz", "bgllddd52"]
            - columnName: INVOICE_DATE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VALUE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VENDOR_NAME
              comparer:
               comparerName: Exact
        DOC_DATE_FUZZY:
          columns:
            - columnName: INVOICE_REFERENCE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_DATE
              comparer:
                comparerName: Date
                parameters:
                  numDays: 7
            - columnName: INVOICE_VALUE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VENDOR_NAME
              comparer:
                comparerName: Exact
        VALUE_FUZZY:
          columns:
            - columnName: INVOICE_REFERENCE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_DATE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VALUE
              comparer:
                comparerName: Value
                parameters:
                  maxPriceLimit: 80.0
                  turnerMaxErrors: 5
            - columnName: INVOICE_VENDOR_NAME
              comparer:
                comparerName: Exact
        VENDOR_FUZZY:
          columns:
            - columnName: INVOICE_REFERENCE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_DATE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VALUE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VENDOR_NAME
              comparer:
                comparerName: CompanyName
                parameters:
                  threshold: 0.85 
                  method: jaro
                  companyPatterns: [
                      " gmbh",
                      " ag",
                      " llc",
                      " inc",
                      " ltd",
                      " limited",
                      " sdn",
                      " bhd",
                      " se",
                      " corporation",
                      " corp",
                      " sl",
                      " coltd",
                      " group",
                      " mbh",
                      " co",
                      " kg",
                      " ltda",
                      " sa",
                      " sro",
                      " des",
                      " sas",
                      " sasu",
                      " zoo",
                      " sp",
                      " sau",
                      " cokg",
                      ]
        EXACT:
          columns:
            - columnName: INVOICE_REFERENCE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_DATE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VALUE
              comparer:
                comparerName: Exact
            - columnName: INVOICE_VENDOR_NAME
              comparer:
                comparerName: Exact
"""
Additional column checked

Including the due date (“CFIELD_DUE_DATE”) as an additional column to be checked using a fuzzy comparer. Ensure that the attribute “CFIELD_DUE_DATE” is part of the INVOICE record.

"""
searchPatterns:
  DUE_DATE_FUZZY:
    columns:
      - columnName: INVOICE_REFERENCE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_DATE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VALUE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VENDOR_NAME
        comparer:
          comparerName: Exact
      - columnName: CFIELD_DUE_DATE
        comparer:
          comparerName: Date
"""
Multiple fuzzy matching

Only set the invoice value to be exact and apply fuzzy matches to the other three columns.

"""
searchPatterns:
  VALUE_EXACT:
    columns:
      - columnName: INVOICE_REFERENCE
        comparer:
          comparerName: Reference
      - columnName: INVOICE_DATE
        comparer:
          comparerName: Date
      - columnName: INVOICE_VALUE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VENDOR_NAME
        comparer:
          comparerName: CompanyName
"""
Date comparer parameters

Loosen the days' constraint of the date fuzzy match to 60 days.

"""
searchPatterns:
  LOOSE_DATE_FUZZY:
    columns:
      - columnName: INVOICE_REFERENCE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_DATE
        comparer:
          comparerName: Date
     parameters:
       numDays: 60
      - columnName: INVOICE_VALUE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VENDOR_NAME
        comparer:
          comparerName: Exact
"""
CompanyName comparer parameters

Make the vendor name fuzzy matching stricter by increasing the threshold to 0.9.

"""
searchPatterns:
  TIGHT_VENDOR_FUZZY:
    columns:
      - columnName: INVOICE_REFERENCE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_DATE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VALUE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VENDOR_NAME
        comparer:
          comparerName: CompanyName
          parameters:
            threshold: 0.9	
"""
Reference comparer parameters

Make the reference fuzzy matching stricter by setting the maximum character errors to 2.

"""
searchPatterns:
  TIGHT_REFERENCE_FUZZY:
    columns:
      - columnName: INVOICE_REFERENCE
        comparer:
          comparerName: Reference
          parameters:
            charMaxErrors: 2
      - columnName: INVOICE_DATE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VALUE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VENDOR_NAME
        comparer:
          comparerName: Exact
"""
Value comparer parameters

Make the value fuzzy matching stricter by setting the maximum allowed value difference to be 30.

"""
searchPatterns:
  TIGHT_VALUE_FUZZY:
    columns:
      - columnName: INVOICE_REFERENCE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_DATE
        comparer:
          comparerName: Exact
      - columnName: INVOICE_VALUE
        comparer:
          comparerName: Value
          parameters:
            max_price_limit: 30
      - columnName: INVOICE_VENDOR_NAME
        comparer:
          comparerName: Exact
"""