Forums

Spelling Corrector program in Cobra

General discussion about Cobra. Releases and general news will also be posted here.
Feel free to ask questions or just say "Hello".

Spelling Corrector program in Cobra

Postby callisto » Sat Oct 29, 2011 3:54 am

Hi,

For my first reasonably substantial program, I wrote a Cobra version of Peter Norvig's spelling corrector: http://norvig.com/spell-correct.html

My version is several times longer than the Python version, mostly because I couldn't translate the set-builder expressions. Perhaps this is where LINQ will help?

For example, the Python version has:
Code: Select all
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

Whereas I have:
def known_edits2(word as String) as Set<of String>
result = Set<of String>()
for e1 in .edits1(word)
for e2 in .edits1(e1)
if e2 in _model.keys
result.add(e2)
return result


Also, the program takes three times longer to run on my machine, but this is because Python is freakishly fast at reading and splitting up the input text file. (I tried processing the file line-by-line instead of using readAllText, but that made things slower.)

I've attached my program. Any comments on style/performance welcome!
Attachments
spelling.cobra
Spelling Corrector program in Cobra
(2.62 KiB) Downloaded 403 times
callisto
 
Posts: 10

Re: Spelling Corrector program in Cobra

Postby Charles » Sat Oct 29, 2011 11:36 am

Thanks, I'll take a look today. Even my I/O bound programs tend to be 2X faster in Cobra than in Python, so it will be interesting to see what's going on there.
Charles
 
Posts: 2515
Location: Los Angeles, CA

Re: Spelling Corrector program in Cobra

Postby Charles » Sat Oct 29, 2011 5:48 pm

I haven't played with the code much, but I'm getting faster results with Cobra out of the box:

Using http://norvig.com/spell.py :

$ time python spell.py
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 8, 'pct': 74, 'n': 270}
real 0m9.596s
user 0m9.539s
sys 0m0.051s

$ time cobra spelling.cobra
Spelling corrector
Dictionary contains 29157 words
Correcting word produces word
Correcting pease produces peace
Correcting guenberg produces gutenberg
Correcting inlcuded produces included
Correcting appple produces apple
Correcting wrod produces word
Correcting korrecter produces corrected

real 0m5.038s
user 0m4.898s
sys 0m0.142s

So that's almost 2X faster in Cobra. Now with -turbo:

$ time cobra -turbo spelling.cobra
...
real 0m4.954s

Not much improvement. Now just the .exe that is left from compilation:

$ time mono spelling.exe
...
real 0m3.053s

I'm on Mac OS X 10.6.8, 2.66GHz Quad-Core Intel Xeon. Cobra is the latest from subversion and Python is 2.6.6.

Let me know if I'm using the same Python program you are, what op sys and ver you are on, Python ver, etc.
Charles
 
Posts: 2515
Location: Los Angeles, CA

Re: Spelling Corrector program in Cobra

Postby Charles » Sat Oct 29, 2011 6:05 pm

You can definitely tighten up the code like so:
return Set<of String>(for w in words where w in _model.keys)

But that makes a list and then makes a set. Your original version is more efficient because it just makes a set. Also, with a little formatting it can be condensed:
result = Set<of String>()
for w in words, if w in _model.keys, result.add(w)
return result

Of course, in this sample program you can't really feel the difference in speed, so use whichever you like.

It would be nice if Cobra's for expression produced a set when given a set. I discovered this earlier and it's on my never-ending to-do list. Then we would just have:
return for w in words where w in _model.keys
since words is already a set.
Charles
 
Posts: 2515
Location: Los Angeles, CA

Re: Spelling Corrector program in Cobra

Postby Charles » Sat Oct 29, 2011 6:30 pm

Cobra supports set literals like:
print {1, 2, 3}

So this:
candidates = .known(Set<of String>([word]))

# can be done as:
candidates = .known({word})

FYI.
Charles
 
Posts: 2515
Location: Los Angeles, CA

Re: Spelling Corrector program in Cobra

Postby Charles » Sat Oct 29, 2011 6:37 pm

Here is a tweaked version. Your code was fine though. There wasn't much to change or improve. I'm still curious about the performance differences. I'm sure we can figure that out if we drill down on the details.

use System.Text.RegularExpressions

class Spelling
"""
Implementation of Peter Norvig's spelling corrector.
<!-- m --><a class="postlink" href="http://norvig.com/spell-correct.html">http://norvig.com/spell-correct.html</a><!-- m -->
"""

# store a map from words to their frequency
var _model = Dictionary<of String, int>()

var _re = Regex(r'[^abcdefghijklmnopqrstuvwxyz]+', RegexOptions.Compiled)

def train(words as String[]?)
""" Given an array of words, put their counts into the _model """
for word in words
if word.length > 0
if _model.containsKey(word), _model[word] += 1
else, _model[word] = 1

def readWords(filename as String)
"""
Given a filename, retrieve all words, defined as sequences of a-z
and train _model
"""
text = File.readAllText(filename)
# .train(Regex.split(text.toLower, r"[^abcdefghijklmnopqrstuvwxyz]+"))
.train(_re.split(text.toLower))

def known(words as Set<of String>) as Set<of String>
""" Given a set of words, retrieve the subset which is known. """
return Set<of String>(for w in words where w in _model.keys)

def edits1(word as String) as Set<of String>
"""
Given a word, return a set of strings at edit 1 distance
"""
alphabet = 'abcdefghijklmnopqrstuvwxyz'
result = Set<of String>()
for i in word.length # delete one letter
result.add(word[:i]+word[i+1:])
for i in word.length # insert one letter
for c in alphabet
result.add(word[:i]+c.toString+word[i:])
for i in word.length-1 # transpose adjacent letters
result.add(word[:i]+word[i+1:i+2]+word[i:i+1]+word[i+2:])
for i in word.length # replace a letter
for c in alphabet
result.add(word[:i]+c.toString+word[i+1:])
return result

def known_edits2(word as String) as Set<of String>
""" Given a word, return a set of known words at edit 2 distance. """
result = Set<of String>()
for e1 in .edits1(word), for e2 in .edits1(e1), if e2 in _model.keys, result.add(e2)
return result

def correct(word as String) as String
"""
Given a word, return its more frequent nearest neighbour,
or else the word itself
"""
candidates = .known({word})
if candidates.count == 0, candidates = .known(.edits1(word))
if candidates.count == 0, candidates = .known_edits2(word)
if candidates.count == 0, candidates = {word}
# best = candidates.toList.get(0)
# more efficient:
for best in candidates, break
# ^ but would probably read nicer if put in an extension utility method
bestCount = 0
for word in candidates
if word in _model.keys and _model[word] > bestCount
best, bestCount = word, _model[word]
return best

def main
start = DateTime.now
print "Spelling corrector"
.readWords("big.txt")
print "Dictionary contains [_model.keys.count] words"
for word in 'word pease guenberg inlcuded appple wrod korrecter'.split
print "Correcting [word] produces [.correct(word)]"
duration = DateTime.now.subtract(start)
print duration.totalSeconds
Attachments
spelling.cobra
(2.8 KiB) Downloaded 386 times
Charles
 
Posts: 2515
Location: Los Angeles, CA

Re: Spelling Corrector program in Cobra

Postby callisto » Sun Oct 30, 2011 4:34 am

Thanks for the tweaked version, that was instructive. I hadn't found the string/dictionary literals, and those make a big difference to readability.

About the timings, first I should have made clear my code was not running the tests that are in Norvig's complete version. My first attempt just read the dictionary in and ran a few test words. I've now added in the test dictionary (which became a copy-and-paste exercise after learning about the literal syntax!), and (after fixing an error) I get the same output and results as http://norvig.com/spell.py.

I have also rewritten the method edits1, to follow Norvig's pattern of creating the list of splits once, which is more efficient. After the change, the timings are more comparable.

My timings are:

$ cobra -c -turbo spelling2.cobra
Compilation succeeded
$ time ./spelling2.exe
Spelling corrector
Dictionary contains 29157 words
bad=68, n=270, pct=74.814814814814814814814814815, unknown=15
26.60473

real 0m26.777s
user 0m25.678s
sys 0m0.288s

$ time python spell.py
{'bad': 68, 'bias': None, 'unknown': 15, 'secs': 19, 'pct': 74, 'n': 270}

real 0m21.362s
user 0m20.633s
sys 0m0.164s

Cobra is from subversion yesterday, Python 2.7.1+.
Mono JIT compiler version 2.6.7 (Debian 2.6.7-5ubuntu3)
My computer's speed is 2.8GHz, and I'm using Ubuntu 11.04.

This problem is playing very much to Python's strengths: reading from a file, and constructing a dictionary datatype. I find Python faster by 4 seconds in the initial read of the data file. After that, it is hard to call!
Attachments
spelling2.cobra
(8.4 KiB) Downloaded 398 times
callisto
 
Posts: 10

Re: Spelling Corrector program in Cobra

Postby Charles » Sun Oct 30, 2011 10:53 am

Thanks for the clarifications. Are you able to test with Mono 2.10?
Charles
 
Posts: 2515
Location: Los Angeles, CA


Return to Discussion

Who is online

Users browsing this forum: No registered users and 25 guests