Algorithm and default search patterns for the object-centric Duplicate Invoice Checker
The Duplicate Invoice Checker's algorithm groups invoices that might be duplicates based on different matching patterns. After the groups are formed, the algorithm uses machine learning to calculate a confidence score for each group. The score estimates the probability that the group contains a true duplicate. You can use it to sort or filter the results in the Action View, so that you focus on the most likely groups and identify duplicates faster.
Every time new invoices are loaded into the perspective, the algorithm compares them against the invoice backlog. It’s important to know that:
Only new (unchecked) documents are checked.
New (unchecked) documents are checked against other new documents, as well as against old (already checked) documents.
Old (already checked) documents are not checked against old (already checked) documents.
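The comparison rules above can be sketched as a small pair generator. This is only an illustration of the rules, not the app's implementation; the function name `pairs_to_check` and the document IDs are hypothetical.

```python
from itertools import combinations, product


def pairs_to_check(new_ids, old_ids):
    """Return the document pairs that one run would compare (hypothetical helper)."""
    # New documents are compared with each other ...
    new_vs_new = list(combinations(new_ids, 2))
    # ... and with every already-checked document.
    new_vs_old = list(product(new_ids, old_ids))
    # Old-versus-old pairs are never generated.
    return new_vs_new + new_vs_old


pairs = pairs_to_check(["N1", "N2"], ["O1", "O2"])
for pair in pairs:
    print(pair)
```

With two new and two old documents, the sketch produces one new-versus-new pair and four new-versus-old pairs, and never pairs the two old documents.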
If you want to re-check all documents - for example, because you’ve made a major change to your filtering or search patterns - you’ll need to reset the app’s results. Resetting the result tables tells you how.
You can configure filters in the AI Annotations user interface to limit the algorithm to checking only the relevant invoices (pre-filters). You can also filter out groups in the Action View (post-filters), but if you can remove the invoices from checking in the first place, you won’t waste any resources on them. By default, we set these filters:
Invoices from internal vendors are excluded. This filter is in the AI annotation.
Groups with a debit/credit balance of zero are hidden. This filter is in the Action View.
You can remove the supplied filters, and you can define additional filters based on your business requirements. For example, you can set filters to:
Consider only invoices after a particular start date.
Exclude invoices with a short or repeating reference.
Exclude invoices below a specific value.
Include only certain company codes, vendor names, or document types.
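To make the effect of such pre-filters concrete, here is a minimal sketch that applies the four example filters to a handful of invoices. The filters themselves are configured in the AI Annotations user interface, not in code; the field names, the thresholds, and the rule used for a "repeating" reference (a single character repeated) are assumptions made for this illustration.

```python
from datetime import date


def passes_pre_filters(invoice, start_date, min_value, company_codes,
                       min_reference_length=4):
    """Hypothetical pre-filter: True if the invoice should be checked."""
    reference = invoice["reference"]
    # Consider only invoices on or after the start date.
    if invoice["document_date"] < start_date:
        return False
    # Exclude short references and references made of one repeated character.
    if len(reference) < min_reference_length or len(set(reference)) == 1:
        return False
    # Exclude invoices below the minimum value.
    if invoice["value"] < min_value:
        return False
    # Include only the listed company codes.
    return invoice["company_code"] in company_codes


invoices = [
    {"id": "A", "reference": "XX0116621912", "document_date": date(2024, 6, 10),
     "value": 5135.50, "company_code": "1000"},
    {"id": "H", "reference": "0000", "document_date": date(2024, 6, 10),
     "value": 250.00, "company_code": "1000"},
    {"id": "J", "reference": "XX0116621913", "document_date": date(2023, 12, 1),
     "value": 980.00, "company_code": "1000"},
    {"id": "K", "reference": "XX0116621914", "document_date": date(2024, 6, 11),
     "value": 12.00, "company_code": "1000"},
]

to_check = [
    inv["id"] for inv in invoices
    if passes_pre_filters(inv, start_date=date(2024, 1, 1), min_value=50.0,
                          company_codes={"1000"})
]
print(to_check)
```

In this sample, only invoice A reaches the algorithm: H has a repeating reference, J is dated before the start date, and K is below the minimum value.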
In the out-of-the-box setup, the app compares four attributes of each invoice: the invoice reference, the document date, the invoice value, and the vendor name. Invoices that match exactly on all four attributes are grouped as possible duplicates. In addition, each of the four default fuzzy search patterns requires that three attributes are an exact match and the remaining attribute is a fuzzy match.
You can configure the default search patterns by customizing them in the AI Annotations user interface. For example, you could extend the date range within which two invoices are identified as possible duplicates, or apply a stricter similarity threshold for vendor name matching. You can also create your own custom search patterns to group invoices in different ways to suit your business needs. Configuring the AI Annotation for the object-centric Duplicate Invoice Checker tells you how to customize the default patterns or create your own.
Here are the default search patterns:
Exact match (EXACT)
The attributes “Invoice Reference”, “Document Date”, “Invoice Value”, and “Vendor Name” match exactly in both of the invoices being compared.
Similar vendor (VENDOR_FUZZY)
The attributes “Invoice Reference”, “Document Date”, and “Invoice Value” match exactly in both of the invoices being compared. The attribute “Vendor Name” is a fuzzy match. Here's how the fuzzy match is determined:
The app removes every character that is not an uppercase letter, a lowercase letter, or a digit, such as special characters and white space. For example, "AcmeValue" and "Acme Value" will be treated as the same string.
The app then removes all company keywords like "Corp," "LLC," etc. For example, "Celonis SE" and "Celonis GmbH" will be treated as the same string.
Using the edited vendor name, the app checks for matches with the given string similarity metric and the given threshold. If the string similarity exceeds the threshold, the strings are considered an approximate match.
Attribute | Invoice A | Match | Invoice B |
---|---|---|---|
Invoice Reference | XX0116621912 | Exact | XX0116621912 |
Document Date | 2024-06-10 | Exact | 2024-06-10 |
Invoice Value | 5135.50 | Exact | 5135.50 |
Vendor Name | Celonis SE | Fuzzy | Celonis GmbH |
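The vendor name steps can be sketched as follows. This is a rough illustration under several assumptions: the keyword list is a made-up sample, the similarity metric is Python's `difflib.SequenceMatcher` ratio, and the threshold of 0.9 is arbitrary. The app's actual keyword list, metric, and threshold are whatever is configured in the search pattern. The sketch also drops keywords as whole words while it strips the other characters, because how the app recognizes keywords after stripping isn't described here.

```python
import re
from difflib import SequenceMatcher

# Assumed sample of company keywords; not the app's actual list.
COMPANY_KEYWORDS = {"corp", "llc", "inc", "ltd", "se", "gmbh", "ag"}


def normalize_vendor(name: str) -> str:
    """Keep letters and digits only and drop company keywords."""
    tokens = re.split(r"[^A-Za-z0-9]+", name)
    kept = [t for t in tokens if t and t.lower() not in COMPANY_KEYWORDS]
    return "".join(kept)


def vendor_fuzzy_match(a: str, b: str, threshold: float = 0.9) -> bool:
    """True if the similarity of the edited names exceeds the threshold."""
    ratio = SequenceMatcher(None, normalize_vendor(a), normalize_vendor(b)).ratio()
    return ratio > threshold


print(normalize_vendor("Celonis SE"), normalize_vendor("Celonis GmbH"))
print(vendor_fuzzy_match("Acme Value", "AcmeValue"))
```

Both "Celonis SE" and "Celonis GmbH" reduce to the same edited name, so they count as a fuzzy match for any threshold below 1.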
Similar reference (REF_FUZZY)
The attributes “Document Date”, “Invoice Value”, and “Vendor Name” match exactly in both of the invoices being compared. The attribute “Invoice Reference” is a fuzzy match. Here's how the fuzzy match is determined:
The app removes every character that is not an uppercase letter, a lowercase letter, or a digit, such as special characters and white space. For example, "A R-AMC1234" and "AR AMC\ 1234" will be treated as the same string.
Using the edited invoice reference, the app makes the following checks independently of each other, and declares a match if any one of the checks discovers it:
Check whether the invoice references are exactly equal with the exception of 0-3 extra characters in one of the two records. For example, "AR-AMC1234" and "AR-AMC124" match.
Check whether one invoice reference is a substring of the other. For example, “AR-AMC1234” and “AR-AMC” match.
Check for common scanning errors, such as the letter "B" in place of the number "8". For example, "AR-AMC1238" and "AR-AMC123B" match.
Check for transposed characters. For example, "AR-AMC1234" and "AR-MAC1234" match.
Attribute | Invoice A | Match | Invoice C |
---|---|---|---|
Invoice Reference | XX0116621912 | Fuzzy | XX011662I912 |
Document Date | 2024-06-10 | Exact | 2024-06-10 |
Invoice Value | 5135.50 | Exact | 5135.50 |
Vendor Name | Celonis SE | Exact | Celonis SE |
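The four reference checks could look something like the sketch below. It is an interpretation, not the app's code: "extra characters" is read as "the shorter reference is a subsequence of the longer one", transposition is limited to two adjacent characters, and the look-alike table in `LOOKALIKES` is an assumed sample.

```python
import re

# Assumed look-alike characters for scanning errors; not the app's actual list.
LOOKALIKES = str.maketrans({"B": "8", "I": "1", "O": "0", "S": "5"})


def clean(ref: str) -> str:
    """Keep letters and digits only."""
    return re.sub(r"[^A-Za-z0-9]", "", ref)


def is_subsequence(short: str, long: str) -> bool:
    it = iter(long)
    return all(ch in it for ch in short)  # each lookup consumes the iterator


def extra_characters(a: str, b: str, max_extra: int = 3) -> bool:
    short, long = sorted((a, b), key=len)
    return len(long) - len(short) <= max_extra and is_subsequence(short, long)


def transposed(a: str, b: str) -> bool:
    """True if a and b differ only by one swap of two adjacent characters."""
    if len(a) != len(b):
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diff) == 2 and diff[1] == diff[0] + 1
            and a[diff[0]] == b[diff[1]] and a[diff[1]] == b[diff[0]])


def ref_fuzzy_match(ref_a: str, ref_b: str) -> bool:
    a, b = clean(ref_a), clean(ref_b)
    return (
        extra_characters(a, b)                                  # 0-3 extra characters
        or a in b or b in a                                     # substring
        or a.translate(LOOKALIKES) == b.translate(LOOKALIKES)   # scanning errors
        or transposed(a, b)                                     # transposed characters
    )


print(ref_fuzzy_match("XX0116621912", "XX011662I912"))
```

Because the checks are independent, a pair is reported as soon as any one of them succeeds.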
Similar date (DATE_FUZZY)
The attributes “Invoice Reference”, “Invoice Value”, and “Vendor Name” match exactly in both of the invoices being compared. The attribute “Document Date” is a fuzzy match. To determine a fuzzy match, the app makes the following checks independently of each other, and declares a match if any one of the checks discovers it:
Check whether the dates are the same except that the month and day are swapped. For example, "2024-01-02" and "2024-02-01" match.
Check whether the dates are the same except that the month has been swapped for another. We only do this for June/July and September/October comparisons. For example, "2024-07-02" and "2024-06-02" match.
Check whether the distance between the dates is less than or equal to 7 days. This threshold is based on experience with customers. For example, invoices dated "2024-07-02" and "2024-07-08" might be duplicates if all the other fields match.
Attribute | Invoice A | Match | Invoice D |
---|---|---|---|
Invoice Reference | XX0116621912 | Exact | XX0116621912 |
Document Date | 2024-06-10 | Fuzzy | 2024-07-10 |
Invoice Value | 5135.50 | Exact | 5135.50 |
Vendor Name | Celonis SE | Exact | Celonis SE |
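As a minimal sketch, the three date checks translate fairly directly into code. The function and constant names are hypothetical; the 7-day default and the two month pairs come from the description above.

```python
from datetime import date

# Month pairs that are compared with each other: June/July and September/October.
SWAPPABLE_MONTHS = {frozenset({6, 7}), frozenset({9, 10})}


def day_month_swapped(a: date, b: date) -> bool:
    return a.year == b.year and a.month == b.day and a.day == b.month


def month_confused(a: date, b: date) -> bool:
    return (a.year == b.year and a.day == b.day
            and frozenset({a.month, b.month}) in SWAPPABLE_MONTHS)


def within_days(a: date, b: date, max_days: int = 7) -> bool:
    return abs((a - b).days) <= max_days


def date_fuzzy_match(a: date, b: date, max_days: int = 7) -> bool:
    """True if any one of the three independent checks succeeds."""
    return (day_month_swapped(a, b) or month_confused(a, b)
            or within_days(a, b, max_days))


print(date_fuzzy_match(date(2024, 6, 10), date(2024, 7, 10)))
```

Invoices A and D from the table match through the June/July check, even though their dates are 30 days apart.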
Similar value (VALUE_FUZZY)
The attributes “Invoice Reference”, “Document Date”, and “Vendor Name” match exactly in both of the invoices being compared. The attribute “Invoice Value” is a fuzzy match. To determine a fuzzy match, the app makes the following checks independently of each other, and declares a match if any one of the checks discovers it:
Check if there is only a small absolute difference between the two values. For example, “5080” and “5000” match.
Check for transposed digits. For example, “150234” and “105234” match.
Check for skipped digits. The default value is 0, meaning that the app doesn't check for skipped digits. If you define a value of 1 or more, it does - for example, with a value of 1, “5000” and “500” match.
Attribute | Invoice A | Match | Invoice E |
---|---|---|---|
Invoice Reference | XX0116621912 | Exact | XX0116621912 |
Document Date | 2024-06-10 | Exact | 2024-06-10 |
Invoice Value | 5135.50 | Fuzzy | 5153.50 |
Vendor Name | Celonis SE | Exact | Celonis SE |
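The value checks can be sketched in the same way. The absolute-difference limit of 100 is an assumption chosen so that the "5080" and "5000" example matches; the app's real limit is part of the search pattern configuration. Comparing values as text with two decimal places, and limiting transposition to adjacent digits, are also choices made for this illustration.

```python
from decimal import Decimal


def transposed(a: str, b: str) -> bool:
    """True if a and b differ only by one swap of two adjacent characters."""
    if len(a) != len(b):
        return False
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    return (len(diff) == 2 and diff[1] == diff[0] + 1
            and a[diff[0]] == b[diff[1]] and a[diff[1]] == b[diff[0]])


def is_subsequence(short: str, long: str) -> bool:
    it = iter(long)
    return all(ch in it for ch in short)


def value_fuzzy_match(a, b, max_abs_diff=Decimal("100"), max_skipped_digits=0):
    a, b = Decimal(a), Decimal(b)
    # Check 1: small absolute difference (limit is an assumed value).
    if abs(a - b) <= max_abs_diff:
        return True
    text_a, text_b = f"{a:.2f}", f"{b:.2f}"
    # Check 2: transposed digits.
    if transposed(text_a, text_b):
        return True
    # Check 3: skipped digits, switched off while the setting is 0.
    if max_skipped_digits > 0:
        short, long = sorted((text_a, text_b), key=len)
        gap = len(long) - len(short)
        return 0 < gap <= max_skipped_digits and is_subsequence(short, long)
    return False


print(value_fuzzy_match("150234", "105234"))
print(value_fuzzy_match("5000", "500", max_skipped_digits=1))
```

With the default setting of 0 skipped digits, "5000" and "500" are not matched; they match only once the setting is raised to 1 or more.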
Multiple patterns
With the default fuzzy search patterns, a set of two documents is matched when three attributes are an exact match and one attribute is a fuzzy match. Groups can also be formed with multiple patterns, where several sets of documents are connected through different patterns.
Here’s a group formed with multiple patterns - Similar reference and Similar vendor:
Attribute | Invoice A | Match | Invoice C | Match | Invoice F |
---|---|---|---|---|---|
Invoice Reference | XX0116621912 | Fuzzy | XX011662I912 | Exact | XX011662I912 |
Document Date | 2024-06-10 | Exact | 2024-06-10 | Exact | 2024-06-10 |
Invoice Value | 5135.50 | Exact | 5135.50 | Exact | 5135.50 |
Vendor Name | Celonis SE | Exact | Celonis SE | Fuzzy | Celonis GmbH |
In the example, invoices A and C were matched due to the similar “Invoice Reference”. The other three attributes, “Document Date”, “Invoice Value”, and “Vendor Name”, each match exactly. At the same time, invoices C and F were matched due to the similar “Vendor Name”. The other three attributes, “Invoice Reference”, “Document Date”, and “Invoice Value”, each match exactly. Since invoice C is part of both sets of documents, it acts as a bridge connecting all three invoices in one group.
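One way to picture this bridging is to treat each matched set as an edge between two invoices and each group as a connected component. The sketch below uses a small union-find structure for that; it illustrates the idea only and makes no claim about how the app builds groups internally. The name `group_invoices` is hypothetical.

```python
def group_invoices(matched_pairs):
    """Merge matched pairs into groups of connected invoices."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)  # join the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())


# A-C matched by Similar reference, C-F matched by Similar vendor.
print(group_invoices([("A", "C"), ("C", "F")]))
```

Invoices A and F are never compared successfully with each other, yet they end up in the same group because both are connected to C.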
However, with the default search patterns, a pair of invoices is not matched if it would take more than one fuzzy match to achieve this. The documents in this example are not grouped:
Attribute | Invoice A | Match | Invoice G |
---|---|---|---|
Invoice Reference | XX0116621912 | Exact | XX0116621912 |
Document Date | 2024-06-10 | Fuzzy | 2024-07-10 |
Invoice Value | 5135.50 | Exact | 5135.50 |
Vendor Name | Celonis SE | Fuzzy | Celonis GmbH |
Though “Invoice Reference” and “Invoice Value” match exactly, “Document Date” and “Vendor Name” are both fuzzy matches. Because the default patterns require that three attributes are an exact match and one attribute is a fuzzy match, these invoices are not grouped. If you want to match invoices like this, you’ll need to define a custom search pattern.
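The "at most one fuzzy attribute" rule of the default patterns can be summarized in a short sketch. The pattern names are the ones listed above; the attribute keys, the result labels "exact", "fuzzy", and "none", and the function name `default_pattern` are hypothetical.

```python
FUZZY_PATTERNS = {
    "vendor_name": "VENDOR_FUZZY",
    "invoice_reference": "REF_FUZZY",
    "document_date": "DATE_FUZZY",
    "invoice_value": "VALUE_FUZZY",
}


def default_pattern(results):
    """Return the default pattern that matches a pair, or None.

    `results` maps each attribute to "exact", "fuzzy", or "none".
    """
    if set(results) != set(FUZZY_PATTERNS):
        raise ValueError("expected exactly the four default attributes")
    if any(r == "none" for r in results.values()):
        return None
    fuzzy = [attr for attr, r in results.items() if r == "fuzzy"]
    if not fuzzy:
        return "EXACT"
    if len(fuzzy) == 1:
        return FUZZY_PATTERNS[fuzzy[0]]
    return None  # two or more fuzzy attributes: no default pattern applies


a_vs_g = {"invoice_reference": "exact", "document_date": "fuzzy",
          "invoice_value": "exact", "vendor_name": "fuzzy"}
print(default_pattern(a_vs_g))
```

For invoices A and G the sketch returns `None`, which mirrors why the pair is not grouped without a custom search pattern.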