Akin is a collection of string comparison algorithms for Elixir. This solution was born of a record linking project. It combines and modifies The Fuzz and Fuzzy Compare. Algorithms can be called independently or in total to return a map of metrics. This library was built to facilitate the disambiguation of names but can be used to compare any two binaries.
Add a dependency in your mix.exs:
defp deps do
  [
    {:akin, "~> 0.2.0"}
  ]
end
List all of the available algorithms. Hamming Distance is excluded because it only compares strings of equal length; it may be called directly. See: Single Algorithms
iex> Akin.Util.list_algorithms()
["bag_distance", "substring_set", "sorensen_dice", "jaccard", "jaro_winkler",
"levenshtein", "metaphone", "double_metaphone", "substring_double_metaphone", "ngram",
"overlap", "substring_sort", "tversky"]
Compare two strings using all of the available algorithms. The return value is a map of scores for each algorithm.
iex> Akin.compare("weird", "wierd")
%{
bag_distance: 1.0,
sorensen_dice: 0.25,
double_metaphone: 1.0,
jaccard: 0.14,
jaro_winkler: 0.94,
levenshtein: 0.6,
metaphone: 1.0,
ngram: 0.25,
overlap: 0.25,
tversky: 0.14
}
iex> Akin.compare("beginning", "begining")
%{
bag_distance: 0.89,
sorensen_dice: 0.93,
double_metaphone: 1.0,
jaccard: 0.88,
jaro_winkler: 0.95,
levenshtein: 0.89,
metaphone: 1.0,
ngram: 0.88,
overlap: 1.0,
tversky: 0.88
}
Comparison accepts options in a keyword list; a combined example follows the list below.
algorithms
: algorithms to use in the comparison. Accepts a list of names or a keyword list. Default is algorithms/0.
metric
: algorithm metric. Default is both.
unit
: algorithm unit. Default is both.
level
: level for double phonetic matching. Default is "normal".
match_at
: an algorithm score equal to or above this value is considered a match. Default is 0.9.
ngram_size
: number of contiguous letters to split strings into. Default is 2.
short_length
: length that qualifies as "short" and receives a shortness boost. Used by the Name Metric. Default is 8.
stem
: boolean indicating whether to compare the stemmed versions of the strings; uses Stemmer. Default is false.
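For example, several of these options can be combined in one keyword list. A minimal sketch (the option values here are arbitrary and the resulting scores depend on the inputs):
opts = [
  # arbitrary values chosen for illustration
  algorithms: ["levenshtein", "double_metaphone"],
  ngram_size: 1,
  level: "weak",
  stem: true
]
Akin.compare("write", "writing", opts)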
Restrict the list of algorithms by name or metric and/or unit.
iex> opts = [algorithms: ["bag_distance", "jaccard", "jaro_winkler"]]
iex> Akin.compare("weird", "wierd", opts)
%{
bag_distance: 1.0,
jaccard: 0.14,
jaro_winkler: 0.94
}
iex> opts = [algorithms: [metric: "phonetic", unit: "whole"]]
iex> Akin.compare("weird", "wierd", opts)
%{
double_metaphone: 1.0,
metaphone: 1.0
}
The default ngram size for the algorithms is 2. You can change it by setting a value in opts.
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"]])
%{sorensen_dice: 0.25}
iex> Akin.compare("weird", "wierd", [algorithms: ["sorensen_dice"], ngram_size: 1])
%{sorensen_dice: 0.8}
The default match strictness is "normal". You can change it by setting a value in opts. Currently it only affects the outcomes of the substring_set and double_metaphone algorithms.
iex> left = "Alice in Wonderland"
iex> right = "Alice's Adventures in Wonderland"
iex> Akin.compare(left, right, [algorithms: ["substring_set"]])
%{substring_set: 0.85}
iex> Akin.compare(left, right, [algorithms: ["substring_set"], level: "weak"])
%{substring_set: 0.85}
iex> left = "which way"
iex> right = "whitch way"
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "weak"])
%{double_metaphone: 1.0}
iex> Akin.compare(left, right, [algorithms: ["double_metaphone"], level: "strict"])
%{double_metaphone: 0.0}
Compare the stemmed versions of two strings.
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"]])
%{bag_distance: 0.57, double_metaphone: 0.0}
iex> Akin.compare("write", "writing", [algorithms: ["bag_distance", "double_metaphone"], stem: true])
%{bag_distance: 1.0, double_metaphone: 1.0}
iex> Akin.compare("weird", "wierd", algorithms: ["bag_distance", "jaro_winkler", "jaccard"])
%{bag_distance: 1.0, jaccard: 0.14, jaro_winkler: 0.94}
iex> Akin.compare("weird", "wierd", algorithms: [metric: "string", unit: "whole"], ngram_size: 1)
%{
bag_distance: 1.0,
jaccard: 0.67,
jaro_winkler: 0.94,
levenshtein: 0.6,
sorensen_dice: 0.8,
tversky: 1.0
}
Before being compared, strings are downcased and normalized to standard Unicode; whitespace is standardized, nontext characters (like punctuation and emojis) are replaced, and accents are converted. The string is then composed into a struct representing the corpus of data used by the comparison algorithms.
"Alice Liddell" becomes
%Akin.Corpus{
list: ["alice", "liddell"],
original: "alice liddell",
set: #MapSet<["alice", "liddell"]>,
stems: ["alic", "liddel"],
string: "aliceliddell"
}
iex> Akin.compare("Hubert Łępicki", "Hubert Lepicki")
%{
bag_distance: 0.92,
sorensen_dice: 0.83,
double_metaphone: 0.0,
jaccard: 0.71,
jaro_winkler: 0.97,
levenshtein: 0.92,
metaphone: 0.0,
ngram: 0.83,
overlap: 0.83,
tversky: 0.71
}
Get the list of phonetic representations of a string.
iex> Akin.phonemes("virginia")
["frjn", "frkn"]
iex> Akin.phonemes("beginning")
["bjnnk", "pjnnk", "pknnk"]
iex> Akin.phonemes("wonderland")
["wntrlnt", "antrlnt", "fntrlnt"]
UNDER DEVELOPMENT
Identity is the challenge of author name disambiguation (AND). The aim of AND is to match an author's name to that author when the author appears in a list of many authors. Complexity arises from homonymity (many people with the same name) and synonymity (when one person uses different forms/spellings of their name in publications).
Given the name of an author which is divided into the given, middle, and family name parts (i.e. "Virginia", nil, "Woolf") and a list of possible matching author names, find and return the matches for the author in the list. If initials exist in the left name, a separate comparison is performed for the initials and the sets of the right string.
If the comparison metrics produce a score greater than or equal to 0.9, they are considered a match and returned in the list.
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["v woolfe", "v woolf"]
iex> Akin.match_names("V. Woolf", ["V Woolf", "V Woolfe", "Virginia Woolf", "V White", "Viginia Wolverine", "Virginia Woolfe"])
["virginia woolfe", "v woolf"]
This may not be what you want. There are likely to be unwanted matches.
iex> Akin.match_names("V. Woolf", ["Victor Woolf", "Virginia Woolf", "V White", "V Woolf", "Virginia Woolfe"])
["v woolf", "virginia woolf", "victor woolf"]
The bag distance is a cheap distance measure which always returns a distance smaller than or equal to the edit distance. It is meant to be an efficient approximation of the distance between two strings, used to quickly rule out strings that are largely different.
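For example, isolating the bag distance through compare/3 (the score matches the "weird"/"wierd" comparison shown earlier):
iex> Akin.compare("weird", "wierd", algorithms: ["bag_distance"])
%{bag_distance: 1.0}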
Calculates the Double Metaphone Phonetic Algorithm metric of two strings. The return value is based on the match level: strict, strong, normal (default), or weak.
Note: the Hamming algorithm is not used in any of the comparison functions because it requires that the strings being compared be of the same length. It can be called directly, however, so it is still a part of this library.
The Hamming distance between two strings of equal length is the number of positions at which the corresponding letters are different. Returns the percentage of change needed to the left string in a comparison of one string (left) with another string (right). Returns 0.0 if the strings are not the same length. Returns 1.0 if the strings are equal.
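As a rough illustration of the underlying idea (not this library's API or its normalized score), a classic Hamming count can be sketched in Elixir as:
# Illustrative sketch only; Akin's own Hamming function may differ in name and scoring.
defmodule HammingSketch do
  # Counts the positions at which two equal-length strings differ; nil when lengths differ.
  def distance(left, right) do
    l = String.graphemes(left)
    r = String.graphemes(right)

    if length(l) == length(r) do
      Enum.zip(l, r) |> Enum.count(fn {a, b} -> a != b end)
    else
      nil
    end
  end
end

HammingSketch.distance("weird", "wierd") # => 2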
Calculates the similarity of two strings as the size of the intersection divided by the size of the union. Default ngram: 2.
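For example, restricting the comparison to Jaccard with unigrams (the score matches the ngram_size: 1 comparison shown earlier):
iex> Akin.compare("weird", "wierd", algorithms: ["jaccard"], ngram_size: 1)
%{jaccard: 0.67}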
Jaro-Winkler calculates the edit distance between two strings. A score of one denotes equality. Unlike the Jaro Similarity, it modifies the prefix scale to give a more favorable rating to strings that match from the beginning.
Compare two strings for their Levenshtein score. The score is determined from the edit distance: the minimum number of single-character edits needed to change one word into the other. The score is 1.0 minus the distance divided by the length of the longer of the two strings.
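For example, "weird" and "wierd" have an edit distance of 2 (two substitutions swap the middle letters) and a longest length of 5, giving 1 - 2/5 = 0.6, the levenshtein score shown in the comparisons above.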
Compares two strings by converting each to an approximate phonetic representation in ASCII and then comparing those phonetic representations. Returns 1 if the phonetic representations are an exact match.
Calculates the ngram distance between two strings. Default ngram: 2.
Uses the Overlap Similarity metric to compare two strings by tokenizing the strings and measuring their overlap. Default ngram: 1.
Sørensen–Dice coefficient is calculated using bigrams. The equation is 2nt / (nx + ny), where nx is the number of bigrams in string x, ny is the number of bigrams in string y, and nt is the number of bigrams found in both strings. For example, the bigrams of night and nacht are {ni, ig, gh, ht} and {na, ac, ch, ht}. Each has four bigrams and the intersection is {ht}, so the coefficient is (2 · 1) / (4 + 4) = 0.25.
Iterate over the cartesian product of the two lists sending each element through
the Double Metaphone using all strictness levels until a true value is found
in the list of returned booleans from the Double Metaphone algorithm. Return the
percentage of true values found. If true is never returned, return 0. Increases
accuracy for search terms containing more than one word.
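A minimal sketch of that idea (not Akin's implementation; phonetic_match?/3 below is a placeholder for a Double Metaphone comparison at a given strictness level):
# Illustrative sketch only; the real algorithm lives inside Akin.
defmodule SubstringDoubleMetaphoneSketch do
  @levels ["strict", "strong", "normal", "weak"]

  # Scores the cartesian product of two word lists: a pair counts as a hit if any
  # strictness level reports a phonetic match. Returns the fraction of hits, or 0.
  def score(left_words, right_words, phonetic_match?) do
    pairs = for l <- left_words, r <- right_words, do: {l, r}

    case pairs do
      [] ->
        0

      _ ->
        hits =
          Enum.count(pairs, fn {l, r} ->
            Enum.any?(@levels, fn level -> phonetic_match?.(l, r, level) end)
          end)

        hits / length(pairs)
    end
  end
end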
Splits the strings on spaces, sorts, re-joins, and then determines the Jaro-Winkler distance. Best when the strings contain irrelevant substrings.
Sorts substrings by words, compares the sorted strings in pairs, and returns the maximum ratio. If one string is significantly longer than the other, this method will compare matching substrings only.
A generalization of Sørensen–Dice and Jaccard.
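For reference, the Tversky index of sets X and Y is |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|); with α = β = 1 it reduces to Jaccard, and with α = β = 0.5 to Sørensen–Dice.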