Friday, February 20, 2015

The anagram of Samarang is Anagrams

A few days ago I randomly stumbled upon the "check if two strings are anagrams" interview question, and it piqued my interest.

While it is immediately obvious there is an O(n-logn) algorithm for this -- sort the character arrays and compare them -- after a few moments I flashed on the O(n) algorithm for comparing two strings and determining if they are anagrams of each other.

This got me thinking, and I remembered the fun I had about 15 years ago assembling a huge list of english words and checking to see if the .com domain names were available. A little quick googling found some interesting word lists and I was off to the races.

I ended up implementing the O(n) comparison algorithm along with a data structure that chops up the word list by word length and then hashes the words using a hashing algorithm that causes potential anagrams to hash to the same bucket. However, checking all the words in each hash bucket is O(n^2)

Thus, it is typically much slower than the simpler nlogn algorithm in Python, because the lengths of the strings and the typical hash bucket are small, the python list.sort() function is highly optimized, and once you do have the strings and bucket sorted, the search for anagrams is O(n).

And O(n^2-logn) < O(n^2-n)

The moral of this story is, in order to figure out the best algorithm, you need to consider the whole problem. I fell in love with the O(n) comparison, but didn't notice for a while that changing the problem from "check if two strings are anagrams" to "find all the anagrams in a word list" made a big difference.

Still, it was a fun evening of coding, and my python is getting marginally less horrible.

You can find the python code, word lists and sample output here.

PS: Samarang is a place in Indonesia