When working with text processing or natural language processing (NLP) tasks, one common requirement is to measure the "distance" or difference between two strings. One popular method to achieve this is through the Levenshtein distance. The Python-Levenshtein
module is an efficient way to compute this distance in Python, as well as several other related metrics such as string similarity, edit operations, and matching ratios.
What is Levenshtein Distance?
The Levenshtein distance or the edit distance is a string distance measure parameter that can be used to measure string dissimilarity. It identifies the number of single-character operations that would be needed to transform one string into another. It is used widely in many areas including natural language processing, bioinformatics, and text editing.
Let's consider an example that the distance between “butter” and “matter” will be equal to 2, which can be realized through the replacement of ‘b’ for ‘m’ and the replacement of ‘u’ for ‘a’. It can also be used for undefined spell check predicting words with the least distance from the words entered incorrectly. Another example where it is been used in day-to-day life is in search engines that allow matching of the user input with the index terms even if the provided word from a user is not accurate.
Python-Levenshtein Module
The Python-Levenshtein package provide fast and easy python code access to the Levenshtein algorithm to perform operations on string. It is highly optimized and is comparability faster than other Python implementation because it is implemented using the C extension. As it is implemented using the C extension is work efficiently for large dataset which is again a big advantage over normal python implementation.
Key Features of python-Levenshtein
The python-Levenshtein
module provides a highly optimized C extension for computing the Levenshtein distance and other related metrics such as:
- Levenshtein distance (minimum edit distance between two strings).
- Levenshtein ratio (a measure of similarity between two strings, based on the Levenshtein distance).
- String similarity (ratio of matching characters).
- Hamming distance (only for strings of equal length).
- Jaro and Jaro-Winkler similarity (useful for approximate string matching).
Installing python-Levenshtein Module via pip
To start using the python-Levenshtein
module, we can install it using pip
:
pip install python-Levenshtein
Uses of the Python-Levenshtein Module
There are multiple uses of python Levenshtein module the following are mentioned below.
1. Levenshtein.distance
This method calculates the Levenshtein Distance of two strings this defines how many operations where needed to turn one string into another by inserting a character, removing a character or replacing a character.
Python
import Levenshtein
input1 = "cooking"
input2 = "looking"
dist_calc = Levenshtein.distance(input1, input2)
print(f"The Levenshtein distance between {input1} and {input2} is : {dist_calc}")
Output:
The Levenshtein distance between cooking and looking is : 1
2. Levenshtein.ratio
According to this method, it provided a similarity ratio in between the two string values which ranges from 0 to 1. The smaller the represented value is to the 1, the more similar the strings belong to each other.
It can be used in data matching like in cases such as matching names or records in other databases, the calculation of the ratio will be very useful in that it will work even though there may be slight variations in string (for example slang or abbreviations).
Python
import Levenshtein
check_sim = Levenshtein.ratio("orange", "oranges")
print(f"The similarity ratio between 'orange' and 'oranges' is : {check_sim}")
Output:
The similarity ratio between 'orange' and 'oranges' is : 0.9230769230769231
3. Levenshtein.hamming
This method measures the Hamming distance on the basis of number of positions where two different string may have varied.
Hamming distance is used in the error detection where the transmitted data is compared to received data and the number of positions for which the data differ is counted.
Python
import Levenshtein
ham_dist= Levenshtein.hamming("biswajeet", "biswajit")
print(f"The Hamming distance between 'biswajeet' and 'biswajit' is : {ham_dist}")
Output:
The Hamming distance between 'biswajeet' and 'biswajit' is : 3
4. Levenshtein.jaro
The Jaro distance is an indication of how close the two strings are, meaning the extent to which the strings can be transformed through transpose, insertions, deletions and substitution.
Jaro distance is usually used in record linkage to find similarities on names and other textual values, where possible transpositions or slight spelling typo may occur.
Python
import Levenshtein
jaro_dist = Levenshtein.jaro("bunty", "bunti")
print(f"The Jaro distance between 'bunty' and 'bunti' is : {jaro_dist}")
Output:
The Jaro distance between 'bunty' and 'bunti' is : 0.8666666666666667
5. Levenshtein.jaro_winkler
Jaro Winkler distance is a variation of Jaro distance which is a little bit tougher and gives more weight to prefix matching as compared to other matching.
Jaro Winkler distance is used in a scenario such as in comparing personal names in a database where small differences or transpositions may arise with more emphasis on the matching of prefixes.
Python
import Levenshtein
jaro_winkler_dist = Levenshtein.jaro_winkler("Rudra", "Rudhra")
print(f"The Jaro Winkler distance between 'Rudra' and 'Rudhra' is : {jaro_winkler_dist}")
Output:
The Jaro Winkler distance between 'Rudra' and 'Rudhra' is : 0.9611111111111111
This method finds the median string in a set of strings by calculating the edit distance.
In bioinformatics, this can be used to identify the conserved sequence out of the collection of DNA or the protein sequence by measuring the distance in a minimum of all.
Python
import Levenshtein
input = ["Ru", "Rud", "Rudra"]
median_str = Levenshtein.median(input)
print(f"The Median string in {input} is : {median_str}")
Output:
The Median string in ['Ru', 'Rud', 'Rudra'] is : Rud
This method brings an enhancement to the median string by continuously relaying the edit operations in order to reduce the distance.
It is useful when trying to search for the best representative string within a dataset, where there can be variations of the same name or entity the method improves on the median to get a closer value.
Python
import Levenshtein
input = ["Ru", "Rud", "Rudra"]
improved_median = Levenshtein.median_improve("Ru", input)
print(f"Improved median in {input} is : {improved_median}")
Output:
Improved median in ['Ru', 'Rud', 'Rudra'] is : Rud
This method offers a faster approach of calculating the median string than the previous method though it may not offer the most accurate solution.
Favorable in situations where the solution does not have to be exact and the time of evaluation is the key factor (for very large sets of data).
Python
import Levenshtein
input = ["Ru", "Rud", "Rudra"]
quick_median = Levenshtein.quickmedian(input)
print(f" The quick median in {input} is : {quick_median}")
Output:
The quick median in ['Ru', 'Rud', 'Rudra'] is : Rua
It calculates the median of strings in a provided set while observing the strings as sets of characters.
It is useful when we want to compare two words or strings where the position of characters does not matter.
Python
import Levenshtein
input = ["Ru", "Rud", "Rudra"]
set_med = Levenshtein.setmedian(input)
print(f" The set median in {input} is : {set_med}")
Output:
The quick median in ['Ru', 'Rud', 'Rudra'] is : Rua
10. Levenshtein.seqratio
This method computes the ratio based on the longest common subsequence (LCS) between two strings.
It is particularly helpful in alignment problems in texts where we require to know the sequential order of characters in two strings.
Python
import Levenshtein
ratio_seq = Levenshtein.seqratio("pqrst", "pqrtu")
print(f"The sequence ratio between 'pqrst' and 'pqrtu' is : {ratio_seq}")
Output:
The sequence ratio between 'pqrst' and 'pqrtu' is : 0.6
11. Levenshtein.setratio
Used to find similarity ratio by treating strings as sets.
This can be used in the comparison of tokens in the NLP tasks such as search when order is irrelevant and only the presence of token in one set in the other is important.
Python
import Levenshtein
set_ratio = Levenshtein.setratio("grape", "grapes")
print(f"The set similarity ratio between 'grape' and 'grapes' is : {set_ratio}")
Output:
The set similarity ratio between 'grape' and 'grapes' is : 0.9090909090909091
12. Levenshtein.editops
This method just returns the list of the operations needed to edit one string in to another string.
Used in applications such as text editors which display the required sequence of edits needed to convert one string to another to help the users compare the differences.
Python
import Levenshtein
modify = Levenshtein.editops("kitten", "sitting")
print(f"The edit operations between 'kitten' and 'sitting' is : {modify}")
Output:
The edit operations between 'kitten' and 'sitting' is : [('replace', 0, 0), ('replace', 4, 4), ('insert', 6, 6)]
13. Levenshtein.opcodes
This method gives out a list of operation that is insertions, deletions and replacements needed to get from one string to the next.
It can be employed in creating text diff tools (for instance, the Git diff) that help in identifying the differences within the two versions of the text.
Python
import Levenshtein
list_op = Levenshtein.opcodes("apple", "mapple")
print(f"The opcodes between 'apple' and 'mapple' is : {list_op}")
Output:
The opcodes between 'apple' and 'mapple' is : [('insert', 0, 0, 0, 1), ('equal', 0, 5, 1, 6)]
14. Levenshtein.inverse
This method basically returns the inverse of a string transformation.
This can be used in situations where they are needed to revert the string back to the previous state before an editing was done.
Python
import Levenshtein
edit_op = Levenshtein.editops("kitten", "sitting")
inv_op = Levenshtein.inverse(edit_op)
print(f"The inverse of {edit_op} is : {inv_op}")
Output:
The inverse of [('replace', 0, 0), ('replace', 4, 4), ('insert', 6, 6)] is :
[('replace', 0, 0), ('replace', 4, 4), ('delete', 6, 6)]
15. Levenshtein.apply_edit
Performs a sequence of edit operations on the specified string to modified string and returns result.
Common in text editing programs for undo/redo purpose where users can apply certain changes or reverse them.
Python
import Levenshtein
edit_op = Levenshtein.editops("kitten", "sitting")
edit_applied = Levenshtein.apply_edit(edit_op, "kitten", "sitting")
print(f"The applied edit: {edit_applied}")
Output:
The applied edit: sitting
16. Levenshtein.matching_blocks
It basically returns a list of matching blocks between two strings.
Appropriate in solving problems, in which it is necessary to find two sequences that are built from the same characters as blocks.
Python
from Levenshtein import matching_blocks, editops
a = "apple"
b = "mapple"
matching_block = matching_blocks(editops(a, b), a, b)
print(f"The matching blocks: {matching_block}")
matching_block = matching_blocks(editops(a, b), len(a), len(b))
print(f"The matching blocks using length: {matching_block}")
Output:
The matching blocks: [MatchingBlock(a=0, b=1, size=5), MatchingBlock(a=5, b=6, size=0)]
The matching blocks using length: [MatchingBlock(a=0, b=1, size=5), MatchingBlock(a=5, b=6, size=0)]
17. Levenshtein.subtract_edits
It calculates the difference of edit operations between these two strings.
This can be applied in software that records the change history of text to know which changes has been done and which are still in the queue to be made.
Python
from Levenshtein import editops, apply_edit, subtract_edit
a = "man"
b = "woman"
e = editops(a, b)
print('e:', e)
e1 = e[:2]
print('e1:', e1)
app_edit = apply_edit(e1, 'man', 'woman')
print('app_edit:', app_edit)
sub_edit = apply_edit(subtract_edit(e, e1), app_edit, 'scotsman')
print('edit:', sub_edit)
Output:
e: [('insert', 0, 0), ('insert', 0, 1)]
e1: [('insert', 0, 0), ('insert', 0, 1)]
app_edit: woman
edit: woman
Example Use Cases of Python-Levenshtein
The python-Levenshtein
module can be used in various practical scenarios:
1. Spelling Correction
If we're building a simple spelling correction system, we can use the Levenshtein distance to find the closest match to a misspelled word from a dictionary of known words.
Python
import Levenshtein
word = "speling"
dictionary = ["spelling", "speaking", "spell", "splitting"]
closest_match = min(dictionary, key=lambda x: Levenshtein.distance(word, x))
print(f"Closest match for '{word}': {closest_match}")
Output:
Closest match for 'speling': spelling
2. Fuzzy String Matching
Suppose we have two datasets with similar but not identical text data (e.g., names or product titles), and we want to find pairs that are likely to be the same. The Levenshtein ratio or Jaro-Winkler similarity can help identify these pairs.
Python
import Levenshtein
name1 = "Jonathon"
name2 = "Jonathan"
similarity = Levenshtein.ratio(name1, name2)
if similarity > 0.8:
print(f"The names '{name1}' and '{name2}' are likely the same.")
Output:
The names 'Jonathon' and 'Jonathan' are likely the same.
3. DNA Sequence Comparison
Levenshtein distance is also useful for comparing DNA sequences to find similarities, which is important in bioinformatics.
Python
import Levenshtein
dna1 = "ACTGAG"
dna2 = "ACGGAG"
distance = Levenshtein.distance(dna1, dna2)
print(f"Levenshtein distance between DNA sequences: {distance}")
Output:
Levenshtein distance between DNA sequences: 1
Conclusion
In this article we have described that what is Levenshtein distance, what is python Levenshtein Distance, and how to install and implement in code then why to prefer it as it is efficient, reliable that can handle large data and what are the common and advance use cases of this module and some limitation is consume more space so memory management is very important and we need to render c complier to be installed in the environment other than Summing up it is possible to state that compared to the string operations the Levenshtein module in python language is more effective in many aspects.