Happy holidays everyone! Hope everyone had an awesome Christmas and is getting excited for a fun New Years Eve and then a great 2010.
Anyway, since the sun never sets on the Setfive empire I was actually doing some coding earlier when I ran across an interesting little problem. What I was looking to do was “match” a string input against a set of acceptable strings. The caveat was that the inputs might have spelling mistakes or typos. For example, an input might be “onnlinee ad” matching against [“online ad”, “video”, “news”, “online”] with the goal of matching “online ad”.
Unfortunately, you can’t simply iterate over the two strings matching letters because a single wrong letter will cause you to miss all of the rest. Remembering back to some old engineering courses I found my way over to the Hamming distance article on Wikipedia. From there, I made my way over to the Levenshtein distance article which proved extremely useful.
So, at this point I figured I wanted to minimize the Levenshtein distance and that would be my matching string. Fortunately enough, PHP has a built in function to calculate Levenshtein distances! levenshtein() The Levenshtein distance works pretty well for what I was looking to do. In addition, PHP has another built in function – similar_text() for comparing two strings. similar_text will return the number of matching characters in the two input strings.
Anyway, the only thing to be aware of is that both these functions have really bad running times. similar_text clocks in at O(n^3) where n is the length of the longest string and levenshtein runs at O(m*n) where m and n are the lengths of the input strings.
Well that’s it for now. Happy string comparing.