Matching Algorithms
Kora Compliance uses a composite matching algorithm to compare subject names against watchlist entries. The composite score combines four different string similarity methods, each catching different types of variations.
Composite Score
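The final match score is a weighted combination of four components:
Score = (Primary × 0.60) + (Token × 0.15) + (N-gram × 0.15) + (Phonetic × 0.10)
where Primary = max(Token, Fuzzy, N-gram) and Fuzzy is the Jaro-Winkler score. Because Primary takes the highest of the three string scores, the Token or N-gram score can count twice: once through Primary at 60% and once through its own 15% term.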
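The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than Kora's implementation: the function name `composite_score` and its parameter names are hypothetical, and it assumes the "Fuzzy" term is the Jaro-Winkler score.

```python
def composite_score(jaro_winkler: float, token: float, ngram: float, phonetic: float) -> float:
    """Combine four similarity scores (each in [0, 1]) using the documented weights.

    Assumption: "Fuzzy" in the Primary definition is the Jaro-Winkler score.
    """
    # Primary is the best of the three string-similarity scores.
    primary = max(token, jaro_winkler, ngram)
    return primary * 0.60 + token * 0.15 + ngram * 0.15 + phonetic * 0.10


# Scores taken from the worked example later in this document.
score = composite_score(jaro_winkler=0.91, token=0.93, ngram=0.85, phonetic=0.90)
print(round(score, 3))
```

With the scores from the worked example (0.91, 0.93, 0.85, 0.90), this evaluates to 0.915.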
Match Strength Classification
| Strength | Composite Score | Description |
|---|---|---|
| EXACT | 1.0 | Normalized names are identical |
| STRONG | ≥ 0.92 | Very high confidence match |
| POSSIBLE | ≥ 0.75 | Potential match requiring review |
Scores below 0.75 are not returned as matches.
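The classification rules can be expressed as a small function. This is a sketch under stated assumptions: the name `classify_match` is hypothetical, EXACT is decided by comparing the normalized names directly, and `None` stands for "not returned as a match".

```python
from typing import Optional


def classify_match(score: float, normalized_subject: str, normalized_entry: str) -> Optional[str]:
    """Map a composite score to a match strength, or None if below the cutoff."""
    # EXACT is defined by identical normalized names, not by the score alone.
    if normalized_subject == normalized_entry:
        return "EXACT"
    if score >= 0.92:
        return "STRONG"
    if score >= 0.75:
        return "POSSIBLE"
    return None  # scores below 0.75 are not returned as matches


print(classify_match(0.915, "mohamed al rasheed", "mohammed al rashid"))
```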
Algorithm Details
1. Jaro-Winkler Distance (Primary)
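Weight: 60% when it is the Primary score (the highest of the Token, Jaro-Winkler, and N-gram scores); it has no separate fixed weight of its own.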
Measures character-level similarity between two strings, with a bonus for matching prefixes. Effective for catching typos and character transpositions.
| Input A | Input B | Score |
|---|---|---|
| "John Smith" | "Jon Smith" | 0.96 |
| "Mohammed Ali" | "Mohamed Ali" | 0.97 |
| "Smith, John" | "John Smith" | 0.82 |
Parameters:
- Scaling factor: 0.1 (prefix bonus)
- Maximum prefix length: 4 characters
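For reference, here is a minimal sketch of the textbook Jaro-Winkler similarity using the two parameters listed above (scaling factor 0.1, prefix capped at 4 characters). It is not taken from Kora's code base, the function names are illustrative, and it may not reproduce the table's scores exactly, since those depend on how the production system normalizes and compares names.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity plus a bonus for a shared prefix of up to max_prefix characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


print(round(jaro_winkler("martha", "marhta"), 4))
```

The pair "martha" / "marhta" is the standard reference case for this algorithm and scores about 0.961: one transposition lowers the Jaro score to 0.944, and the three-character shared prefix raises it again.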
2. Token-Based Matching
Weight: 15%
Splits names into tokens (words) and compares all token pairings, so word order does not affect the score: "John Doe" matches "Doe John" as well as it matches "John Doe".
| Input A | Input B | Score |
|---|---|---|
| "John Michael Doe" | "Doe John Michael" | 1.00 |
| "John Doe" | "John Michael Doe" | 0.88 |
| "Al-Rashid Mohammed" | "Mohammed Al Rashid" | 0.95 |
Parameters:
- Minimum token match threshold: 85–95%
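The exact token-scoring formula is not specified here, so the following is only one plausible sketch of order-independent token matching. It assumes each token is paired with its best counterpart in the other name, that per-token similarity comes from Python's `difflib.SequenceMatcher`, and that the two directions are averaged. The function names are hypothetical, and scores for names with different token counts will not necessarily agree with the table above.

```python
from difflib import SequenceMatcher
from typing import List


def _token_similarity(a: str, b: str) -> float:
    # Stand-in per-token similarity (an assumption; any string metric could be used).
    return SequenceMatcher(None, a, b).ratio()


def _directional(src: List[str], dst: List[str]) -> float:
    # Average, over src tokens, of the best similarity found among dst tokens.
    return sum(max(_token_similarity(s, d) for d in dst) for s in src) / len(src)


def token_match(name_a: str, name_b: str) -> float:
    """Order-independent token score in [0, 1] for two normalized names."""
    tokens_a, tokens_b = name_a.split(), name_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    # Averaging both directions keeps the score symmetric and penalizes extra tokens.
    return (_directional(tokens_a, tokens_b) + _directional(tokens_b, tokens_a)) / 2


print(token_match("john michael doe", "doe john michael"))
```

Reordered names with identical tokens score 1.0 under this scheme, while a name with an extra token (for example "john doe" against "john michael doe") scores below 1.0.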
3. N-gram Similarity (Bigrams)
Weight: 15%
Compares 2-character sequences (bigrams) between strings. Catches character-level variations and partial name matches.
| Input A | Input B | Score |
|---|---|---|
| "Alexander" | "Aleksander" | 0.89 |
| "Mueller" | "Muller" | 0.86 |
| "Tchaikovsky" | "Chaikovsky" | 0.88 |
Parameters:
- N-gram size: 2 (bigrams)
- Minimum threshold: 0.85
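A common way to score bigram overlap is the Sørensen-Dice coefficient, sketched below. Using Dice here is an assumption, since the exact formula is not stated. Unpadded Dice gives "mueller" / "muller" about 0.73 rather than the 0.86 in the table, so the production formula probably differs (for example by padding the strings).

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Multiset of overlapping 2-character sequences in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def bigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over bigram multisets: 2 * shared / (total in a + total in b)."""
    if a == b:
        return 1.0
    grams_a, grams_b = bigrams(a), bigrams(b)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 0.0
    shared = sum((grams_a & grams_b).values())  # multiset intersection
    return 2 * shared / total


print(round(bigram_similarity("mueller", "muller"), 3))
```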
4. Soundex (Phonetic)
Weight: 10%
Compares phonetic encodings of names. Catches names that sound alike but are spelled differently.
| Input A | Input B | Soundex A | Soundex B | Match |
|---|---|---|---|---|
| "Smith" | "Smyth" | S530 | S530 | Yes |
| "Schmidt" | "Smith" | S530 | S530 | Yes |
| "Catherine" | "Katherine" | C365 | K365 | Partial |
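The codes in the table follow the standard American Soundex rules, which can be sketched as follows. The function name `soundex` is illustrative, and the sketch only produces codes: how a "Partial" match is converted into a numeric phonetic score is not described here and is left out.

```python
# Consonant groups and their Soundex digits.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, or '' if name has no ASCII letters."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    code = first.upper()
    previous = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate repeated codes
        digit = _CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # vowels reset this, so a repeated code after a vowel is kept
    return (code + "000")[:4]  # pad with zeros or truncate to 4 characters


for example in ("Smith", "Smyth", "Schmidt", "Catherine", "Katherine"):
    print(example, soundex(example))
```

"Catherine" and "Katherine" differ only in the leading letter (C365 against K365), which is why the table lists them as a partial rather than a full match.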
Name Normalization
Before matching, all names go through normalization:
- Lowercase conversion — "JOHN DOE" → "john doe"
- Diacritic removal — "José García" → "jose garcia"
- Title/honorific removal — Strips: Mr., Mrs., Ms., Dr., Prof., Sir, Lord, Dame, Hon.
- Punctuation removal — Removes punctuation while preserving spaces and digits
- Whitespace normalization — Collapses multiple spaces into one
Normalization Examples
| Original | Normalized |
|---|---|
| "Dr. José María García-López" | "jose maria garcia lopez" |
| "Mr. MOHAMMED AL-RASHID" | "mohammed al rashid" |
| "Prof. Sir John Smith III" | "john smith iii" |
| "김정은 (Kim Jong-un)" | "김정은 kim jong un" |
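The normalization steps can be sketched with the Python standard library. This is an illustration under assumptions, not the production code: the name `normalize_name` is hypothetical, punctuation is replaced by a space (consistent with "García-López" becoming "garcia lopez" in the table), and titles are removed as whole tokens after punctuation is stripped.

```python
import re
import unicodedata

# Titles and honorifics from the list above, lowercased and without periods.
TITLES = {"mr", "mrs", "ms", "dr", "prof", "sir", "lord", "dame", "hon"}


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, drop titles and punctuation, collapse whitespace."""
    text = name.lower()
    # Decompose characters, drop combining marks (accents), then recompose so that
    # scripts such as Hangul return to their usual composed form.
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    text = unicodedata.normalize("NFC", without_marks)
    # Replace punctuation (and underscores) with spaces; letters and digits are kept.
    text = re.sub(r"[^\w\s]|_", " ", text)
    # split() also collapses runs of whitespace.
    tokens = [token for token in text.split() if token not in TITLES]
    return " ".join(tokens)


print(normalize_name("Dr. José María García-López"))
```

One consequence of token-based title stripping is that a real name token equal to a title (for example a surname "Lord") would also be removed, so a production implementation may need to restrict stripping to leading tokens.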
Matching Example
Subject: "Mohamed Al Rasheed" Watchlist entry: "Mohammed Al-Rashid"
| Step | Result |
|---|---|
| Normalize subject | "mohamed al rasheed" |
| Normalize entry | "mohammed al rashid" |
| Jaro-Winkler | 0.91 |
| Token match | 0.93 |
| N-gram | 0.85 |
| Soundex | 0.90 |
| Primary | max(0.93, 0.91, 0.85) = 0.93 |
| Composite | (0.93 × 0.60) + (0.93 × 0.15) + (0.85 × 0.15) + (0.90 × 0.10) = 0.915 |
| Strength | POSSIBLE (0.915 is ≥ 0.75 but below the 0.92 STRONG threshold) |
Tuning Thresholds
Match thresholds can be adjusted per screening check type via the configuration API. Lowering thresholds increases recall (more matches) but may increase false positives. Raising thresholds reduces noise but may miss fuzzy matches.
The default thresholds (0.92 for STRONG, 0.75 for POSSIBLE) work well for most compliance use cases.
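The effect of moving the cutoff can be seen with a small sketch. The configuration API itself is not shown here. The candidate labels and scores below are made up purely for illustration, and `filter_matches` is a hypothetical helper.

```python
from typing import List, Tuple

# Hypothetical (label, composite score) pairs; not real screening results.
CANDIDATES: List[Tuple[str, float]] = [
    ("candidate A", 0.97),
    ("candidate B", 0.915),
    ("candidate C", 0.78),
    ("candidate D", 0.71),
]


def filter_matches(scored: List[Tuple[str, float]], threshold: float = 0.75) -> List[Tuple[str, float]]:
    """Keep only candidates whose composite score meets the threshold."""
    return [(label, score) for label, score in scored if score >= threshold]


# A lower cutoff returns more candidates for review; a higher one returns fewer.
for cutoff in (0.70, 0.75, 0.85):
    print(cutoff, [label for label, _ in filter_matches(CANDIDATES, cutoff)])
```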