Introduction to Python Levenshtein Module

Last Updated : 16 Sep, 2024

When working with text processing or natural language processing (NLP) tasks, one common requirement is to measure the "distance" or difference between two strings. One popular method to achieve this is through the Levenshtein distance. The Python-Levenshtein module is an efficient way to compute this distance in Python, as well as several other related metrics such as string similarity, edit operations, and matching ratios.

Table of Content

What is Levenshtein Distance?
Python-Levenshtein Module
Uses of the Python-Levenshtein Module
Example Use Cases of Python-Levenshtein

What is Levenshtein Distance?

The Levenshtein distance or the edit distance is a string distance measure parameter that can be used to measure string dissimilarity. It identifies the number of single-character operations that would be needed to transform one string into another. It is used widely in many areas including natural language processing, bioinformatics, and text editing.

Let's consider an example that the distance between “butter” and “matter” will be equal to 2, which can be realized through the replacement of ‘b’ for ‘m’ and the replacement of ‘u’ for ‘a’. It can also be used for undefined spell check predicting words with the least distance from the words entered incorrectly. Another example where it is been used in day-to-day life is in search engines that allow matching of the user input with the index terms even if the provided word from a user is not accurate.

Python-Levenshtein Module

The Python-Levenshtein package provide fast and easy python code access to the Levenshtein algorithm to perform operations on string. It is highly optimized and is comparability faster than other Python implementation because it is implemented using the C extension. As it is implemented using the C extension is work efficiently for large dataset which is again a big advantage over normal python implementation.

Key Features of `python-Levenshtein`

The python-Levenshtein module provides a highly optimized C extension for computing the Levenshtein distance and other related metrics such as:

Levenshtein distance (minimum edit distance between two strings).
Levenshtein ratio (a measure of similarity between two strings, based on the Levenshtein distance).
String similarity (ratio of matching characters).
Hamming distance (only for strings of equal length).
Jaro and Jaro-Winkler similarity (useful for approximate string matching).

Installing python-Levenshtein Module via pip

To start using the python-Levenshtein module, we can install it using pip:

pip install python-Levenshtein

Uses of the Python-Levenshtein Module

There are multiple uses of python Levenshtein module the following are mentioned below.

1. Levenshtein.distance

This method calculates the Levenshtein Distance of two strings this defines how many operations where needed to turn one string into another by inserting a character, removing a character or replacing a character.

Python

import Levenshtein

input1 = "cooking"
input2 = "looking"

dist_calc = Levenshtein.distance(input1, input2)

print(f"The Levenshtein distance between {input1} and {input2} is : {dist_calc}")

Output:

The Levenshtein distance between cooking and looking is : 1

2. Levenshtein.ratio

According to this method, it provided a similarity ratio in between the two string values which ranges from 0 to 1. The smaller the represented value is to the 1, the more similar the strings belong to each other.

It can be used in data matching like in cases such as matching names or records in other databases, the calculation of the ratio will be very useful in that it will work even though there may be slight variations in string (for example slang or abbreviations).

Python

import Levenshtein

check_sim = Levenshtein.ratio("orange", "oranges")
print(f"The similarity ratio between 'orange' and 'oranges' is : {check_sim}")

Output:

The similarity ratio between 'orange' and 'oranges' is : 0.9230769230769231

3. Levenshtein.hamming

This method measures the Hamming distance on the basis of number of positions where two different string may have varied.

Hamming distance is used in the error detection where the transmitted data is compared to received data and the number of positions for which the data differ is counted.

Python

import Levenshtein

ham_dist= Levenshtein.hamming("biswajeet", "biswajit")
print(f"The Hamming distance between 'biswajeet' and 'biswajit' is : {ham_dist}")

Output:

The Hamming distance between 'biswajeet' and 'biswajit' is : 3

4. Levenshtein.jaro

The Jaro distance is an indication of how close the two strings are, meaning the extent to which the strings can be transformed through transpose, insertions, deletions and substitution.

Jaro distance is usually used in record linkage to find similarities on names and other textual values, where possible transpositions or slight spelling typo may occur.

Python

import Levenshtein

jaro_dist = Levenshtein.jaro("bunty", "bunti")
print(f"The Jaro distance between 'bunty' and 'bunti' is : {jaro_dist}")

Output:

The Jaro distance between 'bunty' and 'bunti' is : 0.8666666666666667

5. Levenshtein.jaro_winkler

Jaro Winkler distance is a variation of Jaro distance which is a little bit tougher and gives more weight to prefix matching as compared to other matching.

Jaro Winkler distance is used in a scenario such as in comparing personal names in a database where small differences or transpositions may arise with more emphasis on the matching of prefixes.

Python

import Levenshtein

jaro_winkler_dist = Levenshtein.jaro_winkler("Rudra", "Rudhra")
print(f"The Jaro Winkler distance between 'Rudra' and 'Rudhra' is : {jaro_winkler_dist}")

Output:

The Jaro Winkler distance between 'Rudra' and 'Rudhra' is : 0.9611111111111111

6. Levenshtein.median

This method finds the median string in a set of strings by calculating the edit distance.

In bioinformatics, this can be used to identify the conserved sequence out of the collection of DNA or the protein sequence by measuring the distance in a minimum of all.

Python

import Levenshtein

input = ["Ru", "Rud", "Rudra"]

median_str = Levenshtein.median(input)
print(f"The Median string in {input} is : {median_str}")

Output:

The Median string in ['Ru', 'Rud', 'Rudra'] is : Rud

7. Levenshtein.median_improve

This method brings an enhancement to the median string by continuously relaying the edit operations in order to reduce the distance.

It is useful when trying to search for the best representative string within a dataset, where there can be variations of the same name or entity the method improves on the median to get a closer value.

Python

import Levenshtein
input = ["Ru", "Rud", "Rudra"]

improved_median = Levenshtein.median_improve("Ru", input)
print(f"Improved median in {input} is : {improved_median}")

Output:

Improved median in ['Ru', 'Rud', 'Rudra'] is : Rud

8. Levenshtein.quickmedian

This method offers a faster approach of calculating the median string than the previous method though it may not offer the most accurate solution.

Favorable in situations where the solution does not have to be exact and the time of evaluation is the key factor (for very large sets of data).

Python

import Levenshtein
input = ["Ru", "Rud", "Rudra"]

quick_median = Levenshtein.quickmedian(input)
print(f" The quick median in {input} is : {quick_median}")

Output:

The quick median in ['Ru', 'Rud', 'Rudra'] is : Rua

9. Levenshtein.setmedian

It calculates the median of strings in a provided set while observing the strings as sets of characters.

It is useful when we want to compare two words or strings where the position of characters does not matter.

Python

import Levenshtein

input = ["Ru", "Rud", "Rudra"]

set_med = Levenshtein.setmedian(input)
print(f" The set median in {input} is : {set_med}")

Output:

The quick median in ['Ru', 'Rud', 'Rudra'] is : Rua

10. Levenshtein.seqratio

This method computes the ratio based on the longest common subsequence (LCS) between two strings.

It is particularly helpful in alignment problems in texts where we require to know the sequential order of characters in two strings.

Python

import Levenshtein

ratio_seq = Levenshtein.seqratio("pqrst", "pqrtu")
print(f"The sequence ratio between 'pqrst' and 'pqrtu' is : {ratio_seq}")

Output:

The sequence ratio between 'pqrst' and 'pqrtu' is : 0.6

11. Levenshtein.setratio

Used to find similarity ratio by treating strings as sets.

This can be used in the comparison of tokens in the NLP tasks such as search when order is irrelevant and only the presence of token in one set in the other is important.

Python

import Levenshtein

set_ratio = Levenshtein.setratio("grape", "grapes")
print(f"The set similarity ratio between 'grape' and 'grapes' is : {set_ratio}")

Output:

The set similarity ratio between 'grape' and 'grapes' is : 0.9090909090909091

12. Levenshtein.editops

This method just returns the list of the operations needed to edit one string in to another string.

Used in applications such as text editors which display the required sequence of edits needed to convert one string to another to help the users compare the differences.

Python

import Levenshtein

modify = Levenshtein.editops("kitten", "sitting")
print(f"The edit operations between 'kitten' and 'sitting' is : {modify}")

Output:

The edit operations between 'kitten' and 'sitting' is : [('replace', 0, 0), ('replace', 4, 4), ('insert', 6, 6)]

13. Levenshtein.opcodes

This method gives out a list of operation that is insertions, deletions and replacements needed to get from one string to the next.

It can be employed in creating text diff tools (for instance, the Git diff) that help in identifying the differences within the two versions of the text.

Python

import Levenshtein

list_op = Levenshtein.opcodes("apple", "mapple")
print(f"The opcodes between 'apple' and 'mapple' is : {list_op}")

Output:

The opcodes between 'apple' and 'mapple' is : [('insert', 0, 0, 0, 1), ('equal', 0, 5, 1, 6)]

14. Levenshtein.inverse

This method basically returns the inverse of a string transformation.

This can be used in situations where they are needed to revert the string back to the previous state before an editing was done.

Python

import Levenshtein

edit_op = Levenshtein.editops("kitten", "sitting")
inv_op = Levenshtein.inverse(edit_op)
print(f"The inverse of {edit_op} is : {inv_op}")

Output:

The inverse of [('replace', 0, 0), ('replace', 4, 4), ('insert', 6, 6)] is :
[('replace', 0, 0), ('replace', 4, 4), ('delete', 6, 6)]

15. Levenshtein.apply_edit

Performs a sequence of edit operations on the specified string to modified string and returns result.

Common in text editing programs for undo/redo purpose where users can apply certain changes or reverse them.

Python

import Levenshtein

edit_op = Levenshtein.editops("kitten", "sitting")
edit_applied = Levenshtein.apply_edit(edit_op, "kitten", "sitting")
print(f"The applied edit: {edit_applied}")

Output:

The applied edit: sitting

16. Levenshtein.matching_blocks

It basically returns a list of matching blocks between two strings.

Appropriate in solving problems, in which it is necessary to find two sequences that are built from the same characters as blocks.

Python

from Levenshtein import matching_blocks, editops

a = "apple"
b = "mapple"
matching_block = matching_blocks(editops(a, b), a, b)
print(f"The matching blocks: {matching_block}")

matching_block = matching_blocks(editops(a, b), len(a), len(b))
print(f"The matching blocks using length: {matching_block}")

Output:

The matching blocks: [MatchingBlock(a=0, b=1, size=5), MatchingBlock(a=5, b=6, size=0)]
The matching blocks using length: [MatchingBlock(a=0, b=1, size=5), MatchingBlock(a=5, b=6, size=0)]

17. Levenshtein.subtract_edits

It calculates the difference of edit operations between these two strings.

This can be applied in software that records the change history of text to know which changes has been done and which are still in the queue to be made.

Python

from Levenshtein import editops, apply_edit, subtract_edit

a = "man"
b = "woman"

e = editops(a, b)
print('e:', e)
e1 = e[:2]
print('e1:', e1)

app_edit = apply_edit(e1, 'man', 'woman')
print('app_edit:', app_edit)

sub_edit = apply_edit(subtract_edit(e, e1), app_edit, 'scotsman')
print('edit:', sub_edit)

Output:

e: [('insert', 0, 0), ('insert', 0, 1)]
e1: [('insert', 0, 0), ('insert', 0, 1)]
app_edit: woman
edit: woman

Example Use Cases of Python-`Levenshtein`

The python-Levenshtein module can be used in various practical scenarios:

1. Spelling Correction

If we're building a simple spelling correction system, we can use the Levenshtein distance to find the closest match to a misspelled word from a dictionary of known words.

Python

import Levenshtein

word = "speling"
dictionary = ["spelling", "speaking", "spell", "splitting"]

closest_match = min(dictionary, key=lambda x: Levenshtein.distance(word, x))
print(f"Closest match for '{word}': {closest_match}")

Output:

Closest match for 'speling': spelling

2. Fuzzy String Matching

Suppose we have two datasets with similar but not identical text data (e.g., names or product titles), and we want to find pairs that are likely to be the same. The Levenshtein ratio or Jaro-Winkler similarity can help identify these pairs.

Python

import Levenshtein

name1 = "Jonathon"
name2 = "Jonathan"

similarity = Levenshtein.ratio(name1, name2)
if similarity > 0.8:
    print(f"The names '{name1}' and '{name2}' are likely the same.")

Output:

The names 'Jonathon' and 'Jonathan' are likely the same.

3. DNA Sequence Comparison

Levenshtein distance is also useful for comparing DNA sequences to find similarities, which is important in bioinformatics.

Python

import Levenshtein

dna1 = "ACTGAG"
dna2 = "ACGGAG"

distance = Levenshtein.distance(dna1, dna2)
print(f"Levenshtein distance between DNA sequences: {distance}")

Output:

Levenshtein distance between DNA sequences: 1

Conclusion

In this article we have described that what is Levenshtein distance, what is python Levenshtein Distance, and how to install and implement in code then why to prefer it as it is efficient, reliable that can handle large data and what are the common and advance use cases of this module and some limitation is consume more space so memory management is very important and we need to render c complier to be installed in the environment other than Summing up it is possible to state that compared to the string operations the Levenshtein module in python language is more effective in many aspects.

nrudrz5sc

Improve

Introduction to Python Levenshtein Module

What is Levenshtein Distance?

Python-Levenshtein Module

Key Features of python-Levenshtein

Uses of the Python-Levenshtein Module

1. Levenshtein.distance

2. Levenshtein.ratio

3. Levenshtein.hamming

4. Levenshtein.jaro

5. Levenshtein.jaro_winkler

6. Levenshtein.median

7. Levenshtein.median_improve

8. Levenshtein.quickmedian

9. Levenshtein.setmedian

10. Levenshtein.seqratio

11. Levenshtein.setratio

12. Levenshtein.editops

13. Levenshtein.opcodes

14. Levenshtein.inverse

15. Levenshtein.apply_edit

16. Levenshtein.matching_blocks

17. Levenshtein.subtract_edits

Example Use Cases of Python-Levenshtein

1. Spelling Correction

2. Fuzzy String Matching

3. DNA Sequence Comparison

Conclusion

Similar Reads

What kind of Experience do you want to share?

Key Features of `python-Levenshtein`

Example Use Cases of Python-`Levenshtein`