Spelling Corrector program in Cobra

by **callisto** » Sat Oct 29, 2011 3:54 am

Hi,

For my first reasonably substantial program, I wrote a Cobra version of Peter Norvig's spelling corrector: http://norvig.com/spell-correct.html

My version is several times longer than the Python version, mostly because I couldn't translate the set-builder expressions. Perhaps this is where LINQ will help?

For example, the Python version has:

Code: Select all: def known_edits2(word): return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

Whereas I have:

def known_edits2(word as String) as Set<of String>
        result = Set<of String>()
        for e1 in .edits1(word)
            for e2 in .edits1(e1)
                if e2 in _model.keys
                    result.add(e2)
        return result

Also, the program takes three times longer to run on my machine, but this is because Python is freakishly fast at reading and splitting up the input text file. (I tried processing the file line-by-line instead of using readAllText, but that made things slower.)

I've attached my program. Any comments on style/performance welcome!

by **Charles** » Sat Oct 29, 2011 11:36 am

Thanks, I'll take a look today. Even my I/O bound programs tend to be 2X faster in Cobra than in Python, so it will be interesting to see what's going on there.

by **Charles** » Sat Oct 29, 2011 5:48 pm

I haven't played with the code much, but I'm getting faster results with Cobra out of the box:

Using http://norvig.com/spell.py :

$ time python spell.py
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 8, 'pct': 74, 'n': 270}
real 0m9.596s
user 0m9.539s
sys 0m0.051s

$ time cobra spelling.cobra
Spelling corrector
Dictionary contains 29157 words
Correcting word produces word
Correcting pease produces peace
Correcting guenberg produces gutenberg
Correcting inlcuded produces included
Correcting appple produces apple
Correcting wrod produces word
Correcting korrecter produces corrected

real 0m5.038s
user 0m4.898s
sys 0m0.142s

So that's almost 2X faster in Cobra. Now with -turbo:

$ time cobra -turbo spelling.cobra
...
real 0m4.954s

Not much improvement. Now just the .exe that is left from compilation:

$ time mono spelling.exe
...
real 0m3.053s

I'm on Mac OS X 10.6.8, 2.66GHz Quad-Core Intel Xeon. Cobra is the latest from subversion and Python is 2.6.6.

Let me know if I'm using the same Python program you are, what op sys and ver you are on, Python ver, etc.

by **Charles** » Sat Oct 29, 2011 6:05 pm

You can definitely tighten up the code like so:

return Set<of String>(for w in words where w in _model.keys)

But that makes a list and then makes a set. Your original version is more efficient because it just makes a set. Also, with a little formatting it can be condensed:

result = Set<of String>()
for w in words, if w in _model.keys, result.add(w)
return result

Of course, in this sample program you can't really feel the difference in speed, so use whichever you like.

It would be nice if Cobra's for expression produced a set when given a set. I discovered this earlier and it's on my never-ending to-do list. Then we would just have:

return for w in words where w in _model.keys

since words is already a set.

by **Charles** » Sat Oct 29, 2011 6:30 pm

Cobra supports set literals like:

print {1, 2, 3}

So this:

candidates = .known(Set<of String>([word]))

# can be done as:
candidates = .known({word})

FYI.

by **Charles** » Sat Oct 29, 2011 6:37 pm

Here is a tweaked version. Your code was fine though. There wasn't much to change or improve. I'm still curious about the performance differences. I'm sure we can figure that out if we drill down on the details.

use System.Text.RegularExpressions

class Spelling
    """
    Implementation of Peter Norvig's spelling corrector.
    <!-- m --><a class="postlink" href="http://norvig.com/spell-correct.html">http://norvig.com/spell-correct.html</a><!-- m -->
    """

    # store a map from words to their frequency
    var _model = Dictionary<of String, int>()

    var _re = Regex(r'[^abcdefghijklmnopqrstuvwxyz]+', RegexOptions.Compiled)

    def train(words as String[]?)
        """ Given an array of words, put their counts into the _model """
        for word in words
            if word.length > 0 
                if _model.containsKey(word), _model[word] += 1
                else, _model[word] = 1

    def readWords(filename as String)
        """
        Given a filename, retrieve all words, defined as sequences of a-z
        and train _model
        """
        text = File.readAllText(filename)
        # .train(Regex.split(text.toLower, r"[^abcdefghijklmnopqrstuvwxyz]+"))
        .train(_re.split(text.toLower))

    def known(words as Set<of String>) as Set<of String>
        """ Given a set of words, retrieve the subset which is known. """
        return Set<of String>(for w in words where w in _model.keys)

    def edits1(word as String) as Set<of String>
        """
        Given a word, return a set of strings at edit 1 distance
        """
        alphabet = 'abcdefghijklmnopqrstuvwxyz'
        result = Set<of String>()
        for i in word.length # delete one letter
            result.add(word[:i]+word[i+1:])
        for i in word.length # insert one letter
            for c in alphabet
                result.add(word[:i]+c.toString+word[i:])
        for i in word.length-1 # transpose adjacent letters
            result.add(word[:i]+word[i+1:i+2]+word[i:i+1]+word[i+2:])
        for i in word.length # replace a letter
            for c in alphabet
                result.add(word[:i]+c.toString+word[i+1:])
        return result

    def known_edits2(word as String) as Set<of String>
        """ Given a word, return a set of known words at edit 2 distance. """
        result = Set<of String>()
        for e1 in .edits1(word), for e2 in .edits1(e1), if e2 in _model.keys, result.add(e2)
        return result

    def correct(word as String) as String
        """
        Given a word, return its more frequent nearest neighbour,
        or else the word itself
        """
        candidates = .known({word})
        if candidates.count == 0, candidates = .known(.edits1(word))
        if candidates.count == 0, candidates = .known_edits2(word)
        if candidates.count == 0, candidates = {word}
        # best = candidates.toList.get(0)
        # more efficient:
        for best in candidates, break
        # ^ but would probably read nicer if put in an extension utility method
        bestCount = 0
        for word in candidates
            if word in _model.keys and _model[word] > bestCount
                best, bestCount = word, _model[word]
        return best

    def main
        start = DateTime.now
        print "Spelling corrector"
        .readWords("big.txt")
        print "Dictionary contains [_model.keys.count] words"
        for word in 'word pease guenberg inlcuded appple wrod korrecter'.split
            print "Correcting [word] produces [.correct(word)]"
        duration = DateTime.now.subtract(start)
        print duration.totalSeconds

by **callisto** » Sun Oct 30, 2011 4:34 am

Thanks for the tweaked version, that was instructive. I hadn't found the string/dictionary literals, and those make a big difference to readability.

About the timings, first I should have made clear my code was not running the tests that are in Norvig's complete version. My first attempt just read the dictionary in and ran a few test words. I've now added in the test dictionary (which became a copy-and-paste exercise after learning about the literal syntax!), and (after fixing an error) I get the same output and results as http://norvig.com/spell.py.

I have also rewritten the method edits1, to follow Norvig's pattern of creating the list of splits once, which is more efficient. After the change, the timings are more comparable.

My timings are:

$ cobra -c -turbo spelling2.cobra
Compilation succeeded
$ time ./spelling2.exe
Spelling corrector
Dictionary contains 29157 words
bad=68, n=270, pct=74.814814814814814814814814815, unknown=15
26.60473

real 0m26.777s
user 0m25.678s
sys 0m0.288s

$ time python spell.py
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 19, 'pct': 74, 'n': 270}

real 0m21.362s
user 0m20.633s
sys 0m0.164s

Cobra is from subversion yesterday, Python 2.7.1+.
Mono JIT compiler version 2.6.7 (Debian 2.6.7-5ubuntu3)
My computer's speed is 2.8GHz, and I'm using Ubuntu 11.04.

This problem is playing very much to Python's strengths: reading from a file, and constructing a dictionary datatype. I find Python faster by 4 seconds in the initial read of the data file. After that, it is hard to call!

by **Charles** » Sun Oct 30, 2011 10:53 am

Thanks for the clarifications. Are you able to test with Mono 2.10?

Spelling Corrector program in Cobra

Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Re: Spelling Corrector program in Cobra

Who is online