Training a Classifier/Predictor

Rather than hand-coding the rules used by a rule-based classifier or predictor, we can learn them from data.

For example, for the gender classifier described in a previous section, instead of hard-coding the rules, you could try to learn them. Suppose that you were given long lists of male and female names, like the ones below, but with many more examples.
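For instance, such lists might look like this (these particular names are just illustrative stand-ins; real training lists would contain many thousands of names):

male_names = ['John', 'David', 'Michael', 'James', 'Robert']
female_names = ['Mary', 'Linda', 'Naomi', 'Lori', 'Patti']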

In machine learning, we refer to a set of labeled examples like this as training data. You could then consider various features, such as last letters or first letters, or the presence of certain letter combinations. For each of those features, you could count up how many of the male and female names in the training set match the feature (e.g., have the last letter ‘i’). If the feature is much more common in examples labeled female than in examples labeled male, then you could add a new rule to the rule set, outputting “female” when the feature is present.
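Here is a minimal sketch of that counting, assuming the illustrative male_names and female_names lists above and a single feature, ends-in-‘i’ (the 2x threshold for “much more common” is an arbitrary choice for the sketch):

def ends_in_i(name):
    # the feature: is the name's last letter 'i'?
    return name[-1].lower() == 'i'

# count how many names with each label match the feature
male_count = sum(1 for n in male_names if ends_in_i(n))
female_count = sum(1 for n in female_names if ends_in_i(n))

# if the feature is much more common among female names,
# add a rule that outputs "female" when the feature is present
if female_count > 2 * male_count:
    print("new rule: if the name ends in 'i', output female")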

Let’s use training data to create rules for a rule-based predictor for the Shannon Game. The interesting twist is that the training data will consist not only of some publicly available English texts, but also of the sequence of letters that have already been revealed from the current text. The features will be the most recent letter or letters that have been revealed. For example, from publicly available English texts you might learn a rule that the letter following a ‘q’ is almost always ‘u’. From the previously revealed letters of the current text, you might learn that if the most recent letters are “Mr. “, then the next letter is likely to be “C”.

First, let’s make a base rule to fall back on: it guesses all the letters of the alphabet, but rather than guessing them in alphabetical order, it guesses them in order of how frequently they appear in our training data of English text.

Suppose that we have an English corpus stored in the string train_txt. We first count how frequently each letter occurs, accumulating the results in a dictionary. We then sort the letters by frequency, using the pattern in Sorting Dictionaries.
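A sketch of that counting and sorting might look like the following, assuming train_txt already holds the corpus and treating upper- and lowercase letters as the same:

letter_counts = {}
for c in train_txt.upper():
    if c.isalpha():
        # accumulate a count for each letter we see
        letter_counts[c] = letter_counts.get(c, 0) + 1

# sort the letters from most frequent to least frequent
freq_order = sorted(letter_counts.keys(),
                    key=lambda c: letter_counts[c],
                    reverse=True)
print(freq_order)   # for English text, typically starts ['E', 'T', 'A', 'O', ...]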

How much better do we do at guessing the next letter, just by guessing letters in order of their frequency in the training text?
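The exact scoring method isn’t spelled out here, but one plausible sketch is to replay a held-out text (call it test_txt, a hypothetical variable), guess letters in freq_order until the right one turns up, and average log2 of the number of guesses to get a rough bits-per-character estimate:

import math

def guesses_needed(actual, guess_list):
    # 1-based position of the actual letter in our ordered guesses;
    # assumes every letter of the test text appears in guess_list
    return guess_list.index(actual) + 1

def estimate_bits(test_txt, guess_list):
    total_bits = 0
    n = 0
    for c in test_txt.upper():
        if c.isalpha():
            total_bits += math.log2(guesses_needed(c, guess_list))
            n += 1
    return total_bits / n   # average bits per character

print(estimate_bits(test_txt, freq_order))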

Now suppose that we try to learn some context-sensitive rules, like guessing ‘U’ after a ‘Q’.

Let’s create a nested dictionary to store the frequency of letters that follow each letter in the alphabet, using the input training text. The structure of the dictionary we will create looks like this:

{'A': {'B': 2, 'C': 4},
 'D': {'E': 6, 'A': 7}}

This would reflect training data where ‘B’ came after ‘A’ 2 times in the training text, ‘C’ came after ‘A’ 4 times, ‘E’ came after ‘D’ 6 times, and so on.
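A sketch of building that nested dictionary from train_txt (resetting the context at non-letters rather than counting across them is a design choice for this sketch, not something the text mandates):

follow_counts = {}
prev = None
for c in train_txt.upper():
    if c.isalpha():
        if prev is not None:
            # get (or create) the inner dictionary for the previous letter
            inner = follow_counts.setdefault(prev, {})
            # count this letter as a follower of the previous one
            inner[c] = inner.get(c, 0) + 1
        prev = c
    else:
        prev = None   # a non-letter breaks the letter-follows-letter chain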

Given that counts dictionary, we can create a different guessing rule for each possible previous letter. The keys of the outer dictionary are the possible previous letters. The keys in each inner dictionary are the possible next letters, and the value associated with each key is how frequently that next letter occurred. If we sort an inner dictionary’s keys based on their associated values, we get a sorted list of next guesses to make.
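Here is a sketch of that step, building on the follow_counts and freq_order variables from the earlier sketches; letters that never followed a given context letter are appended in overall frequency order, so every letter eventually gets guessed:

next_guesses = {}
for prev_letter, inner in follow_counts.items():
    # letters that followed prev_letter, most frequent first
    ordered = sorted(inner.keys(),
                     key=lambda c: inner[c],
                     reverse=True)
    # fall back to overall frequency order for the remaining letters
    ordered += [c for c in freq_order if c not in ordered]
    next_guesses[prev_letter] = ordered

print(next_guesses['Q'][:3])   # with English training text, 'U' should come first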

That brings another big improvement: our estimated entropy is down to 2.61 bits per character. Still, that’s quite a bit higher than Shannon’s estimate of around 1.3 bits per character, based on people doing the predicting. We could go even further in training our predictor, generating additional rules in various ways. For example, with a larger training corpus we could sort our next guesses based on what follows the previous two characters instead of just the last one. Or, once we get to the end of a word, we might estimate frequencies of complete words. Within a word, we might maintain a dictionary of possible completions of the current word. Challenge: see how much additional improvement you can make!
