Sunday, June 21, 2015

Fuzzy String Matching or Searching with Python Code

[Image: hoRwHam.png]
Hello everyone, I have recently been reading about Text Mining and Natural Language Processing (NLP), so I will be sharing a few articles on Text Mining from now on, in the hope that you take an interest in this beautiful field of Computer Science.

Prerequisites

In order to understand the main content of the article you need some prerequisites, which are as follows :- 

  • Familiarity with programming.
  • Knowledge of at least one programming language.
  • Basics of Python (optional but recommended).
  • Some prior exposure to data structures and algorithmic techniques like backtracking, dynamic programming etc.
  • Some gray cells and patience.
  • Passion for Computer Science.
Let's Start,


The first question that should arise in your mind when you hear the term Text Mining is: what is Text Mining and what does it do? So we will first understand what Text or Data Mining is. Let's proceed.

Text Mining or Data Mining

Text mining, also referred to as text data mining, roughly equivalent to text analytics, refers to the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning.

Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output.

To explain it simply

Data Mining is the process of locating important expressions, phrases, sentences or events in data or text, which can be processed further to get more useful information. Even though Data Mining tries to collect information from data, it is different from Information Retrieval; Data Mining, Information Retrieval and Natural Language Processing are often lumped together and misunderstood as one field. I will explain how they differ some other time, not here. If you want to develop an application based on text, or one where you have to deal with a large amount of data and need an intelligent system that classifies the data, sorts it into different categories or runs other processes, then you will need these concepts a lot.

Real life examples of Data Mining

1. Google Search : Whenever you start typing a query, it automatically suggests possible search queries. How does it do that? It uses these concepts to derive the most probable search query.

2. Spell Correction : You might have seen that if you type a wrong spelling, most search engines or word processors mark it with a red underline, which means you have a spelling error. MS Word and Android apps have automatic spell correction. They use these NLP and Text Mining concepts. See the image; it will clear a lot of doubts. It shows how Google does spell correction, and the search results are for the query I entered.



3. Amazon or Online Shopping Stores : Most of you must have shopped online a lot. When you visit a site for the first time and search for a specific product, you see something labelled "Recommended for you". The next time you visit that site, you might be astonished to see that the products shown on the front page are similar to what you searched for on your last visit, but more refined. So how does it work? The concepts mentioned earlier are used for this: the site collects your browsing history, processes the data and finally shows you the most recommended products.

There are thousands of examples, just Google for it :P
I hope by now you have somewhat understood what data or text mining is. Now we come to today's real discussion, which is fuzzy string searching. So let's continue.



Today's Discussion

In this section we discuss string matching, or string similarity, algorithms. We will discuss the following algorithms today.

1. Damerau-Levenshtein Distance.
2. Levenshtein Distance.
3. Wagner-Fischer Algorithm.


Strings are among the most important elements of computer science; there are literally hundreds of research papers on them. String searching and matching is one of the most important parts of text processing and text analysis, since all the data fed to intelligent systems comes in as strings. A string is nothing but a sequence of characters joined together.

Fuzzy String Searching or Fuzzy String Matching

Fuzzy string search algorithms are algorithms used to match one string with another, either exactly or partially. So, what exactly does fuzzy mean? 

From the word itself we understand that fuzzy refers to things that aren't clear-cut. In short, when there is a set of n elements and another set of m elements, and the two sets partially share elements, we say that the relation between them is fuzzy.

(Funny example: most women have a fuzzy suspicion that their men are cheating on them, but instead of treating it as fuzzy they take it as fact :P)

By fuzzy string searching (also known as approximate string matching) we identify strings that are similar to a set of other strings. A number of conclusions can be drawn from this, like similarity coefficients, similar words, etc.

The problem of approximate string matching is typically divided into two sub-problems: finding approximate sub-string matches inside a given string and finding dictionary strings that match the pattern approximately.

Fuzzy search algorithms (also known as similarity search algorithms) serve as a base for spell-checkers and search engines like Google or Bing or Yahoo etc. For example, these algorithms are used to provide the "Did you mean ..." functionality that is very common.

Example :- 

If you ask a machine to check two words for similarity and correct them if required, say "Looks" and "Lookes", the machine will return true for the first string "Looks", as it is correct, and will return the corrected string "Looks" for the erroneous string "Lookes".
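As a quick taste of this kind of functionality, Python's standard library ships `difflib.get_close_matches`, which ranks candidate words by a similarity ratio rather than an edit distance, but serves the same "did you mean" purpose. A minimal sketch, using a made-up candidate list:

```python
import difflib

# A toy dictionary of candidate words (hypothetical, for illustration only).
dictionary = ["Looks", "Books", "Locks", "Cool"]

# Suggest corrections for the misspelled word "Lookes".
# Candidates below the cutoff similarity ratio (0.6) are dropped.
suggestions = difflib.get_close_matches("Lookes", dictionary, n=3, cutoff=0.6)
print(suggestions)  # "Looks" ranks first, being the closest match
```

Note that `get_close_matches` is based on `difflib.SequenceMatcher`'s ratio, which is not a metric in the mathematical sense discussed below; the edit-distance algorithms in this article are the more principled tool.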

Simple Formulation of Fuzzy String Problems :-

"Find in the text or dictionary of a finite size n for all the words that match a word given as input (or starts with the given word), taking into account m possible differences (errors)."

Fuzzy search algorithms are characterized by a metric: a function giving the distance between two words, which provides a measure of their similarity. A strict mathematical definition of a metric includes the requirement to satisfy the triangle inequality (p is the metric function, Z a set of words):

[Image: dbh35lX.png]

LaTeX code for above equation.



p(x,y) \leq p(x,k) + p(k,y), \quad \text{where } x, y, k \in Z
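As a quick sanity check, the sketch below uses a small memoized Levenshtein implementation (this metric is introduced formally later in the article) and confirms the triangle inequality on a sample word triple:

```python
from functools import lru_cache

def levenshtein(s, t):
    """Plain Levenshtein distance (a true metric), used here only to
    illustrate the triangle inequality p(x, y) <= p(x, k) + p(k, y)."""
    @lru_cache(maxsize=None)
    def lev(i, j):
        # Distance between the first i chars of s and first j chars of t.
        if i == 0:
            return j
        if j == 0:
            return i
        return min(lev(i - 1, j) + 1,                          # deletion
                   lev(i, j - 1) + 1,                          # insertion
                   lev(i - 1, j - 1) + (s[i - 1] != t[j - 1])) # substitution
    return lev(len(s), len(t))

x, y, k = "kitten", "sitting", "mitten"
assert levenshtein(x, y) <= levenshtein(x, k) + levenshtein(k, y)
print(levenshtein(x, y))  # 3
```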

What is Edit-Distance ?

Definition From Wikipedia

Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Edit distances find applications in natural language processing, where automatic spelling correction can determine candidate corrections for a misspelled word by selecting words from a dictionary that have a low distance to the word in question.
Simply put, the closeness of a match is measured in terms of the number of primitive operations necessary to convert the string into an exact match. This number is called the edit distance between the string and the pattern.

The primitive operations performed are insertion, deletion and substitution. Example :- 

1. Mad -- > Made, Insertion Operation Performed
2. Loooks -- > Looks, Deletion Operation Performed
3. Lick -- > Sick, Substitution Operation Performed

Now we will move to discuss the algorithms.



Damerau-Levenshtein Distance

Damerau-Levenshtein Distance is a distance (string metric) between two strings, say String A and String B, which gives the minimum number of edit operations needed to transform String A into String B. Damerau's algorithm can be used for spell correction with at most 1 edit distance.

There are four edit operations that can be performed with this algorithm. They are: 1. Insertion, 2. Deletion, 3. Substitution, and 4. Transposition. We have already seen the first three operations, so let me explain the last one, transposition.

Transposition is the swapping of two adjacent characters to bring a string to a certain form.
Example :-

String A := olko
String B := look

When the above two strings are fed to a machine that computes the Damerau-Levenshtein Distance, it finds that the edit distance is two. 

So how does it operate?
We have two string transpositions here :-

1.) olko -> loko 
2.) loko -> look

Hence we get the final result as :- look
And the number of edits performed is : 2

Please note that the Damerau-Levenshtein Distance doesn't satisfy the triangle inequality. You can see that in the image below :-

[Image: MnpRetD.jpg]
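We can also verify this violation in code. The helper below is a compact, self-contained version of the restricted Damerau-Levenshtein (optimal string alignment) computation used in this article, applied to the classic counterexample triple "CA", "AC", "ABC":

```python
def osa(s1, s2):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    d = {}
    # Boundary values: distance from the empty prefix.
    for i in range(-1, len(s1) + 1):
        d[i, -1] = i + 1
    for j in range(-1, len(s2) + 1):
        d[-1, j] = j + 1
    for i in range(len(s1)):
        for j in range(len(s2)):
            cost = 0 if s1[i] == s2[j] else 1
            d[i, j] = min(d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1,         # insertion
                          d[i - 1, j - 1] + cost)  # substitution
            # Transposition of two adjacent characters counts as one edit.
            if i and j and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + 1)
    return d[len(s1) - 1, len(s2) - 1]

print(osa("CA", "ABC"))                    # 3
print(osa("CA", "AC") + osa("AC", "ABC"))  # 1 + 1 = 2
```

Here d("CA", "ABC") = 3 while d("CA", "AC") + d("AC", "ABC") = 2, so the triangle inequality fails.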

"All Theory and No Code Makes Psycho a Dull guy"

I am giving the working function for now. You can see the pseudocode on Wikipedia; it is well written.

Here's the code in Python. I would encourage you to write it yourself too.
def damerau_levenshtein(s1, s2):
    """
    A function to calculate the Damerau-Levenshtein Distance.

    :param s1: Parent String
    :param s2: Pattern String
    :rtype : int returns the damerau-levenshtein distance
    """
    d = {}
    len_s1 = len(s1)
    len_s2 = len(s2)

    # Boundary values: distance from the empty prefix.
    for i in range(-1, len_s1 + 1):
        d[i, -1] = i + 1
    for j in range(-1, len_s2 + 1):
        d[-1, j] = j + 1

    for i in range(len_s1):
        for j in range(len_s2):
            cost = 0 if s1[i] == s2[j] else 1

            d[i, j] = min(
                d[i - 1, j] + 1,  # Deletion
                d[i, j - 1] + 1,  # Insertion
                d[i - 1, j - 1] + cost,  # Substitution
            )

            # Transposition of two adjacent characters
            if i and j and s1[i] == s2[j - 1] and s1[i - 1] == s2[j]:
                d[i, j] = min(d[i, j], d[i - 2, j - 2] + cost)
    return d[len_s1 - 1, len_s2 - 1]


You can get the complete source code here : https://github.com/AnimeshShaw/TextMin...nshtein.py
To see the Code in Execution :- http://code.hackerearth.com/0356acE
You can visualize how the algorithm works step by step here : Click Me

Levenshtein Distance

Levenshtein distance is also a string metric that calculates the edit distance between two strings. It has found a lot of applications, and most real-world systems use an efficient, modified form of Levenshtein distance, including various NLP toolkits like LingPipe, NLTK or GATE. They use different data structures which reduce the time complexity.

The only difference between Levenshtein and Damerau-Levenshtein is that this algorithm doesn't support transposition; we can perform only three operations, namely insertion, deletion and substitution.

Mathematically, it can be written as :-

[Image: W82K5YX.jpg]

LaTeX code for above equation

\operatorname{lev}_{s,t}(i,j) = \begin{cases}
  \max(i,j) & \text{if } \min(i,j)=0, \\
  \min \begin{cases}
          \operatorname{lev}_{s,t}(i-1,j) + 1 \\
          \operatorname{lev}_{s,t}(i,j-1) + 1 \\
          \operatorname{lev}_{s,t}(i-1,j-1) + 1_{(s_i \neq t_j)}
       \end{cases} & \text{otherwise.}
\end{cases}

Another significant difference between Levenshtein Distance and Damerau-Levenshtein Distance is that the former satisfies the triangle inequality whereas the latter does not. This is visible from the image below, where the numbers represent the edit distance between two nodes :-
[Image: 9aRw1lA.jpg]

Okay, let's do some programming now. Following the above mathematical relation, we can formulate a recursive solution :-

def recursivelevenshtein(s1, len_s1, s2, len_s2):
    """
    A function to calculate the Levenshtein Distance using recursion.

    :param s1: Parent String
    :param len_s1: Length Of s1
    :param s2: pattern String
    :param len_s2: length of s2
    :rtype : int returns the levenshtein distance
    """
    if len_s1 == 0:
        return len_s2
    if len_s2 == 0:
        return len_s1
    return min(
        recursivelevenshtein(s1, len_s1 - 1, s2, len_s2) + 1,  # Deletion
        recursivelevenshtein(s1, len_s1, s2, len_s2 - 1) + 1,  # Insertion
        recursivelevenshtein(s1, len_s1 - 1, s2, len_s2 - 1) + (s1[len_s1 - 1] != s2[len_s2 - 1])  # Substitution
    )

Full Source :- https://github.com/AnimeshShaw/TextMin...nshtein.py
See Execution :- http://code.hackerearth.com/4ac2f4u

(I haven't given the Python visualization link here, but you can try it yourself.)

I hope you understood the process. But this recursive technique is very inefficient for larger strings: without memoization its running time grows exponentially with the string lengths, and the recursion depth adds space overhead as well. To improve the performance we use the Wagner-Fischer algorithm.

Wagner-Fischer Algorithm
The Wagner-Fischer algorithm presents a dynamic programming approach that solves the Levenshtein distance problem in quadratic time, O(m * n), where m and n are the lengths of the two input strings.

Let's get to the code already :-

def wagner_fisher(s1, s2):
    """
    A function to calculate the Levenshtein Distance using the
    Wagner-Fischer algorithm.

    :param s1: Parent String
    :param s2: Pattern String
    :rtype : int returns the levenshtein distance
    """
    d = {}
    len_s1 = len(s1)
    len_s2 = len(s2)

    # Keep s1 the shorter string; a row-based variant of this table
    # would then need only O(min(n, m)) space.
    if len_s1 > len_s2:
        s1, s2 = s2, s1
        len_s1, len_s2 = len_s2, len_s1

    # Boundary values: distance from the empty prefix.
    for i in range(-1, len_s1 + 1):
        d[i, -1] = i + 1
    for j in range(-1, len_s2 + 1):
        d[-1, j] = j + 1

    for i in range(len_s1):
        for j in range(len_s2):
            if s1[i] == s2[j]:
                d[i, j] = d[i - 1, j - 1]
            else:
                d[i, j] = min(
                    d[i - 1, j] + 1,  # deletion
                    d[i, j - 1] + 1,  # insertion
                    d[i - 1, j - 1] + 1,  # substitution
                )
    return d[len_s1 - 1, len_s2 - 1]

Full Source :- https://github.com/AnimeshShaw/TextMin...-Fisher.py
See Execution :- http://code.hackerearth.com/75dc9aJ

(I haven't given the Python visualization link here, but you can try it yourself.)

Let's see how the lookup matrix is generated for the strings "LAD" and "LLUDO".
Here's the final matrix; just follow the algorithm step by step and you should get the same one. If you don't, you have made a mistake somewhere.



+------+---+---+---+---+
| S1 → |   | L | A | D |
+------+---+---+---+---+
| S2 ↓ | 0 | 1 | 2 | 3 |
+------+---+---+---+---+
| L    | 1 | 0 | 1 | 2 |
+------+---+---+---+---+
| L    | 2 | 1 | 1 | 2 |
+------+---+---+---+---+
| U    | 3 | 2 | 2 | 2 |
+------+---+---+---+---+
| D    | 4 | 3 | 3 | 2 |
+------+---+---+---+---+
| O    | 5 | 4 | 4 | 3 |
+------+---+---+---+---+
The bottom-right element gives the Levenshtein edit distance. This process is better than the recursive one. Now my question to you: can we improve it further? Well, I leave this to you. Find a method and write code to compute the edit distance that performs better than the present algorithm, and comment in the section below.
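If you'd rather not fill in the table by hand, here is a small standalone sketch that builds the same lookup matrix (rows for S2, columns for S1) as a list of lists and prints it:

```python
def levenshtein_matrix(s1, s2):
    """Build the full Wagner-Fischer lookup matrix.
    Rows correspond to s2, columns to s1, matching the table above."""
    rows, cols = len(s2) + 1, len(s1) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j  # distance from the empty prefix of s2
    for i in range(rows):
        d[i][0] = i  # distance from the empty prefix of s1
    for i in range(1, rows):
        for j in range(1, cols):
            if s2[i - 1] == s1[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j] + 1,      # deletion
                              d[i][j - 1] + 1,      # insertion
                              d[i - 1][j - 1] + 1)  # substitution
    return d

for row in levenshtein_matrix("LAD", "LLUDO"):
    print(row)
# The bottom-right entry, 3, is the edit distance.
```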

Using NLTK

If you're using NLTK, you don't need to write these algorithms yourself; it ships a built-in edit distance function you can call directly.
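A minimal sketch, assuming NLTK is installed (`pip install nltk`); `nltk.edit_distance` computes the Levenshtein distance, with an optional `transpositions` flag for Damerau-style matching:

```python
from nltk import edit_distance  # third-party; pip install nltk

# Plain Levenshtein distance: one deletion fixes "Lookes".
print(edit_distance("Lookes", "Looks"))    # 1

print(edit_distance("kitten", "sitting"))  # 3

# With transpositions enabled, adjacent swaps count as a single edit.
print(edit_distance("olko", "look", transpositions=True))
```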

Conclusion

So, that's all folks, we have come to the end of our first encounter with text processing. If you made it to the end, you have probably understood a nice algorithm. I will be posting more articles on these topics as soon as I have time. If you have some doubt then please ask, but before that go through the tutorial properly, execute the code yourself and practice.

References

http://en.wikipedia.org/wiki/Levenshtein_distance
http://en.wikipedia.org/wiki/Wagner-Fischer_algorithm
http://en.wikipedia.org/wiki/Fuzzy_string_searching
http://en.wikipedia.org/wiki/Damerau%E2%...n_distance
https://www8.cs.umu.se/kurser/TDBA77/VT0...NODE46.HTM

Online Tools Used

Online LaTeX : http://www.sciweavers.org/free-online-latex-equation-editor
Draw Diagrams : https://www.draw.io/
Text Table Generator : http://www.tablesgenerator.com/text_tables

Credits to @XxGreenLanternxX for the header banner image.


Thank you,
Sincerely,
Psycho_Coder.
