Tom Offermann

Google Code Jam: Ungarbling a Garbled Email

Google Code Jam offers up some pretty challenging problems. Here's where I break down my approach to solving one of the tougher ones: the "Garbled Email" problem of Round 1B, 2013.

I participated in Google Code Jam for the first time this year, and while I was able to pass the qualification round fairly easily, Round 1 proved to be much tougher. In this round, you get 2.5 hours to complete 3 problems, each getting progressively harder, and it was the time limit that proved to be the biggest challenge for me. When I competed in Round 1C, I took over 2 hours to complete the first problem, but a contestant named "xiaowuc1" cracked it in under 2 minutes! That's insane...I mean, who are those guys?

Anyway, even though I was unable to finish them in the allotted time, the problems would stick with me, and I would be compelled to chip away at them until I was able to figure out my own solution. Only then would I look through the successful code submissions to see how some of the winners of the round solved them.

Below is my description of how I tackled the most difficult problem of Round 1B. Perhaps it's a tad verbose...but no one can argue that it's not thorough!

Problem

I'll start by giving you my description of the problem, but you can find Google's wording here.

A friend writes you an email, using only words that can be found in a supplied dictionary of 521,196 words. Unfortunately, sometime after your friend hits the send button, the email becomes corrupted, so what you receive is a garbled version of the original.

How is it garbled? First, all the spaces have been removed, so it's just one long string of letters. Second, some of the letters have been changed to other random letters, but, for any 2 letters that have been changed, there are at least 5 characters of separation between them. So, if the original word is 'fiesta', it's possible that it could appear as 'ziesty' in your garbled version, since the changed 'y' appears 5 characters after the changed 'z'. However, 'fiesta' could not be corrupted to become 'zyesta', since the 'z' and 'y' are right next to each other.

Lastly, in the dictionary your friend used, some of the entries are familiar English words, but most of them are nonsense. Which means that even after you successfully ungarble the email, it still looks like gibberish!

Now, given the above rules for email garbling, what is the minimum number of letters that could have been changed from the original to produce the version that we receive?

My Solution in Code

Garbled Email solution on GitHub.

My Algorithm (without the exhaustive explanation)

  1. Create wildcards dictionary.
  2. Use trie data structure for wildcards dictionary.
  3. Find all possible words starting with each character of email.
  4. Starting with last character of email, find best candidates with least number of wildcards for substring from character to end.
  5. When you reach the first character of email, choose candidate with least number of wildcards and return count of wildcards.

Wait, I want the exhaustive explanation!

I know, I know. The "tl;dr" summary isn't much fun...so, let's walk through the algorithm in a bit more detail, shall we?

You're gonna need a bigger dictionary

The first thing to notice is that even though we don't have spaces between words to tell us where the word breaks are in the garbled email, we can still narrow down the possibilities of what the original words were by looking up substrings of letters in the dictionary. For instance, if the garbled email is 'pumpkinhead', we know that the original email could have started with the words 'pu', 'pump', or 'pumpkin', all of which are in the dictionary. But, it could not have started with 'pum', or 'pumpk', since they aren't.

But, that doesn't cover all of the possibilities, since we haven't taken into account the randomly changed letters. What if the original word was 'gum' (a valid dictionary entry), but the 'g' became a 'p' when it was corrupted? We need to handle those cases as well, but just doing lookups in the supplied dictionary isn't going to help us with that.

What we really need is a bigger, more complete dictionary...one that's filled with the original list of words, plus all of the valid wildcard variations for each word.

"Say what? Valid wildcard variations?"

Right, since 'gum' could be garbled to be 'pum', we really should have an entry for 'pum' in our complete dictionary. Plus, an entry for 'aum', and 'bum'...and well, you get the idea. Lots of new entries.

But, instead of adding a new entry for every 'a' through 'z' substitution possibility, let's just represent a substitution as a wildcard, denoted with an asterisk. So if we look at the original entry of 'gum', we would add the following entries to our new dictionary: 'gum', '*um', 'g*m', and 'gu*'.

For longer words, some new entries would have multiple wildcards: 'spatchcock' would produce '*patc*cock' and 's*atch*ock' (among many others), but not '*pat*hcock', since according to the garbling rules, the wildcards wouldn't have enough separation.
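Since this "are the wildcards far enough apart?" question comes up constantly, it's worth having a little helper for it. Here's a minimal sketch of the kind of check I mean (the is_valid() function that appears in the code later works along these lines):

def is_valid(candidate):
    # Positions of all wildcards in the candidate string.
    positions = [i for i, c in enumerate(candidate) if c == '*']
    # Every adjacent pair of wildcards must be at least 5 apart.
    return all(b - a >= 5 for a, b in zip(positions, positions[1:]))

So is_valid('*patc*cock') is True, while is_valid('*pat*hcock') is False.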

So, to produce our new, complete, wildcard-filled dictionary, we just run through the original dictionary, and for each of the 521,196 words, we produce a list of valid wildcard variations using the word_variations() function. Since we'll get a lot of duplicate entries in our new dictionary, we dedupe the results, and since we only want to do this work one time, we save the new dictionary list to a file.
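For the curious, here's a sketch of how a word_variations() could be written. Treat it as an illustrative version built on the separation rule above, rather than a copy of the code in my repo:

def word_variations(word):
    # Grow variations left to right; at each character we either keep
    # it, or replace it with '*' if the previous wildcard (if any) is
    # at least 5 positions back.
    results = ['']
    for i, ch in enumerate(word):
        extended = []
        for prefix in results:
            extended.append(prefix + ch)
            last_star = prefix.rfind('*')
            if last_star == -1 or i - last_star >= 5:
                extended.append(prefix + '*')
        results = extended
    return results

For 'gum', this produces exactly the four entries listed above: 'gum', 'gu*', 'g*m', and '*um'.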

I call it wildcards_dictionary.txt and it contains 3,879,650 entries.

That's a lot of lookups

What kind of data structure makes sense for our dictionary of wildcard variations?

My first thought is to use a set. Each entry is unique, and I'm primarily concerned with checking if a substring is in the wildcards dictionary. While that works great, what it can't help me with is telling me when to stop checking substrings...and I quickly realize that I could be doing a lot of lookups.

For instance, say the garbled email starts with 'xyz'. Just to figure out what words could have started with the first character of the email, I need to check all variations of 1 character substrings, then all variations of 2 character substrings, and so on. The complete list of substrings to check would look like this: 'x', '*', 'xy', 'x*', '*y', 'xyz', 'xy*', 'x*z', '*yz'.

Now, imagine that the email is 1000 characters long. To check all possibilities, I would need to check substrings from 1, 2, 3...all the way up to 1000 characters in length. And that's just using the 1st character as starting point. Every character could also be a starting point, so I would have to repeat the process for each one.

One way to shortcut this process would be to take the length of the longest word in the dictionary and use that length as the max length of a substring. But, I went in a different direction.

Instead, I decided to organize the dictionary as a tree data structure.

You say "tree", I say "trie"

Imagine a tree structure where each node of the tree contains a single letter. To store a word, you would create a series of nodes, one for each letter, where the first letter of the word is the parent node, and each following letter is the child node of the preceding letter node.

As I learned from Wikipedia, this specific kind of tree data structure is called a "trie" (derived from the word "retrieval"). Wikipedia has a helpful diagram of a trie containing 'tea', 'ted', 'ten', and a few other words (although it shows multiple characters per node in order to display the complete word in the last node at the end of the path).
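Before reaching for a library, it may help to see how little machinery a trie actually needs. Here's a toy version using nested dicts, purely for illustration (not the implementation I actually used):

def insert(node, word):
    for ch in word:
        # Walk down the path of letters, creating nodes as needed.
        node = node.setdefault(ch, {})
    node['$'] = True  # mark that a complete word ends here

toy_trie = {}
for w in ['tea', 'ted', 'ten']:
    insert(toy_trie, w)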

What's so great about using a trie here? Two things. The first, which is not super important, is that the common prefixes are stored one time, rather than stored as part of multiple strings in a set, so we achieve some space savings. Notice how the words 'tea', 'ted' and 'ten' all share the same prefix path: 't'->'e'.

But, the real reason the trie is so great is that it can tell us when to stop searching. If we're examining the word 'tarpit' using the above diagram, we can see that the prefix 't' exists as a path in the trie, but that 'ta' does not. And since the prefix 'ta' isn't present, we now know that words that start with that prefix (like 'tar' and 'tarp') aren't present in the dictionary either, which means we can stop searching right there.
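Sticking with the toy dict-of-dicts trie for a moment, that early-exit prefix check is just a short walk down the tree (again, purely illustrative):

def has_prefix(node, prefix):
    for ch in prefix:
        if ch not in node:
            return False  # no word in the trie starts this way
        node = node[ch]
    return True

has_prefix(toy_trie, 't')   # True: keep searching longer substrings
has_prefix(toy_trie, 'ta')  # False: 'tar' and 'tarp' can't be in there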

Datrie is da man

A trie isn't a data structure that's available as part of the Python standard library, but luckily for me, an implementation is just a simple pip install away:

$ pip install datrie

I learned about datrie from an interesting survey of "Fast Non-Standard Data Structures for Python" (a blog post written by the author of datrie), where it was pointed out that since it's a wrapper around the libdatrie library written in C, it's much faster than a pure Python implementation.

For my purposes, it was super simple to use:

Build a trie

import string

import datrie

def make_trie(filename):
    # The trie's keys can contain lowercase letters plus the wildcard.
    valid_chars = string.ascii_lowercase + '*'
    trie = datrie.BaseTrie(valid_chars)
    with open(filename) as f:
        for line in f:
            # datrie keys must be unicode (this is Python 2 code).
            word = line.strip().decode('utf-8')
            trie[word] = 0
    return trie

trie = make_trie('wildcards_dictionary.txt')

Note: datrie requires you to declare, up front, the full set of characters that keys may contain, which is why valid_chars includes the asterisk along with 'a' through 'z'.

Check if word exists

if word in trie:
    print "It's in the dictionary!"

Check if prefix exists

if trie.has_keys_with_prefix(substring):
    print "Must keep searching!"

So many possibilities

Now that we've created the wildcards dictionary, and we understand the data structure we're going to use, it's time to attack the problem.

Since we don't know where the word breaks are in the email, we need to consider that every character in the email could be the start of a word, so let's compile a list of possibilities for each character. If we use the word 'bread' as an example, here are the possible words for each letter.

b: ['*', 'b*', 'b*e', 'b*ea', 'b*ead'], 
r: ['*', '*e', '*ea', '*ead'], 
e: ['*', '*a', '*ad', 'e*', 'e*d', 'ea', 'ea*'], 
a: ['*', '*d', 'a', 'a*', 'ad'], 
d: ['*']

Here's my function to produce an array of arrays of all possibilities:

def indexed_variations(word, trie):
    """
    Build data structure of matching words from trie.

    Return array of arrays:
        d[index][list of matching words in trie]

    """
    wlist = list(word)
    data = [[] for x in range(len(wlist))]

    for i in range(len(wlist)):
        # Seed the search with the variations of the single character
        # at position i: the character itself and '*'.
        segment = wlist[i].decode('utf-8')
        segments = word_variations(segment)

        while segments:
            s = segments.pop()
            len_s = len(s)
            if s in trie:
                data[i].append(s)
            # Extend only while characters remain and some dictionary
            # entry still starts with this prefix.
            if i + len_s < len(wlist) and trie.has_keys_with_prefix(s):
                segments.append(s + wlist[i + len_s])
                wild_seg = s + '*'
                if is_valid(wild_seg):
                    segments.append(wild_seg)
    return data

From the top

Now, it seems like all we have to do is just make combinations of all the possible words, and then determine which of the valid possibilities has the fewest number of wildcards. Sounds simple, right?

Again, using 'bread' as an example, if we try '*' as our first word, then the next word would have to be one of the 'r' (2nd letter) possibilities: ['*', '*e', '*ea', '*ead']. However, none of these possibilities would be valid, because the wildcards would be too close together, so we throw out '*' as a possibility.

Next, if we try 'b*', we see we can make a valid combination with a word from one of the 'e' (3rd letter) choices: 'ea'. But, the only possibility for the last word would be '*', and again that would be invalid.

In fact, it turns out that the only possible valid combination for 'bread' is 'b*ead', so our answer for the minimum number of letters that could have been changed is 1.

That works pretty well for a short input, like 'bread', but what happens when the input email is much longer?

Trouble, that's what.

This is where my computer's fan kicks in

I can always tell that I've picked an inefficient way to attack a Code Jam problem when my laptop's fan starts up before I get an answer. Unfortunately, my "try all combinations" idea above suffers from this problem.

Some ballpark math can show us just how slow this idea can be for large inputs. Say the email is 1000 characters long. A rough estimate is that producing our array of arrays of possible words would result in about 4 words per character, and say each word is 3 letters long. That would mean the total number of word combinations we would have to test for validity is about 4 ** (1000/3).
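If you want to see just how hopeless that is, two lines of Python will put a number on it:

combos = 4 ** (1000 // 3)
print(len(str(combos)))  # 201 -- the number of combinations is 201 digits long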

In Big-O notation, I believe that's considered to be really f'in slow...

The last shall be first

Rather than attempting to solve the problem from the beginning, sometimes it helps to flip it around and start working from the end.

Why would that help? Well, it allows you to find the best solutions for a substring starting at a certain letter and going to the end of the email. Then, when you search for best solutions for earlier letters, you can consider only the best solutions you've already found, rather than testing all possible solutions.

Make sense? No? Right, we need another example.

Here's the array of arrays of possibilities for the word 'ungarble':

u: ['*', '*n', '*ng', '*nga', '*ngar*', 'u*', 'u*g',
    'u*ga', 'un', 'un*', 'un*a', 'ung', 'ung*', 'unga',
    'unga*', 'ungar*'],
n: ['*', '*g', '*ga', 'n*', 'n*a'],
g: ['*', '*a', 'g*', 'ga', 'ga*', 'ga*b', 'ga*ble'],
a: ['*', 'a', 'a*', 'a*b', 'a*ble', 'ar*le', 'arb*e', 'arbl*'],
r: ['*', '*b', '*ble'],
b: ['*', '*l', '*le', 'b*', 'b*e', 'bl*'],
l: ['*', '*e', 'l*', 'le'],
e: ['*']

Now, let's make a list of best solutions for each starting letter, starting with the last letter. For each letter, we'll call the find_best() function to get the list of best solutions, then append it to the solutions list.

For the last letter 'e', '*' is the only possible solution, since 'e' isn't in the original dictionary.

So, solutions looks like this:

solutions = [['*']]

For the second-to-last letter 'l', we have 4 starting possibilities: '*', '*e', 'l*', and 'le'. Of those, 'le' reaches the end of the email with no wildcards at all, so it beats the 1-wildcard alternatives, and '*' can't form a valid combination at all, since pairing it with the lone 'e' solution ('*') would put two wildcards right next to each other.

That means we can throw out the '*e' and 'l*' possibilities, and after 2 letters, solutions looks like this:

solutions = [['*'], ['le']]

The next iteration reveals a few subtleties, and once we understand those, it's just a matter of repeatedly adding the best solutions for each letter until we reach the beginning of the email. Let's consider 'b', which has 6 start possibilities: '*', '*l', '*le', 'b*', 'b*e', and 'bl*'.

Now, solutions looks like this:

solutions = [['*'], ['le'], ['*le', 'b*e', 'bl*']]

There can be only one best solution, right?

Notice that we have 3 different "best" solutions for the 'ble' substring at the end of the email. Now, you may be wondering, couldn't we reduce it further? Since all 3 solutions contain 1 wildcard, aren't they all equivalent? Can't we just pick one as the best and toss away the others?

I thought so at first, but it turns out, you can't.

The reason for that has to do with the wildcard garbling rules. Say we chose '*le' as the only 3-letter solution. Then no preceding word could have a wildcard in any of its last 4 characters, because combining it with '*le' would put two wildcards too close together. For example, if you tried to append '*le' to '*ar', the result ('*ar*le') would be invalid. But if you combined '*ar' with 'bl*' instead, the result ('*arbl*') would be just fine.
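With the is_valid() sketch from earlier, you can check both junctions directly:

is_valid('*ar' + '*le')  # False: wildcards at positions 0 and 3
is_valid('*ar' + 'bl*')  # True: wildcards at positions 0 and 5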

That means that, for the purposes of combining words, a wildcard in the first four indexes of a solution can make the difference between a valid and an invalid combination, so solutions that differ there aren't interchangeable.

So, the rules for adding a "best" solution are a bit more complicated:

  1. If there's a solution with no wildcards that reaches the end, that's the sole best solution.

  2. If there are any solutions with no wildcards in the first four indexes of the word, then select the one from this group that has the lowest number of wildcards. After all, for the purposes of combining words in valid combinations, it doesn't matter if there is a wildcard in the 5th index spot or the 10th. The result will be valid regardless.

  3. If multiple solutions have a wildcard in the same location in the first four indexes of the word, keep only the one with the fewest total wildcards for that location.

Because '*le', 'b*e', and 'bl*' all have wildcards at different indexes, they all need to be considered as best solutions.
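To make those rules concrete, here's a sketch of what a find_best() along these lines can look like. Consider it my illustrative reconstruction rather than the definitive code from my repo: it assumes the is_valid() check from earlier, and it relies on the fact that, while we're processing a given position, solutions[-k] holds the best solutions for the suffix beginning k characters further along:

def find_best(candidates, solutions):
    complete = []
    for word in candidates:
        if len(word) == len(solutions) + 1:
            # This word reaches the end of the email on its own.
            complete.append(word)
        else:
            # Splice the word onto each best solution for the suffix
            # that starts right after it, keeping valid combinations.
            for suffix in solutions[-len(word)]:
                combined = word + suffix
                if is_valid(combined):
                    complete.append(combined)

    best = {}
    for sol in complete:
        if '*' not in sol:
            return [sol]             # Rule 1: wildcard-free wins outright
        first = sol.find('*', 0, 4)  # wildcard in the first 4 indexes, or -1
        if first not in best or sol.count('*') < best[first].count('*'):
            best[first] = sol        # Rules 2 and 3: fewest wildcards per slot
    return list(best.values())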

Finis

With that last explanation out of the way, hopefully it's now clear just how the solve() function is doing its work.

def solve(word, trie):
    variations = indexed_variations(word, trie)

    solutions = []
    while variations:
        # Work backwards: take the possibilities for the last remaining
        # position and find the best solutions for the suffix starting
        # there, reusing the suffix solutions already computed.
        last = variations.pop()
        best = find_best(last, solutions)
        solutions.append(best)
    # The final entry covers the entire email, starting at character 0.
    finals = solutions[-1]
    return min([x.count('*') for x in finals])
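Tying it all together, usage looks something like this (assuming all of the helpers above are defined):

trie = make_trie('wildcards_dictionary.txt')
print(solve('ungarble', trie))  # 1 for the example above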

If you've reached the end of this post, and somehow, you want to read still more about this problem, be sure to check out the Contest Analysis from Google. And, this great write-up as well.