Automatic language classification, the slow way
#!/usr/bin/python2
import sys
import bz2

def classify(text, langs=('english', 'german', 'french')):
    results = {}
    for lang in langs:
        # Each language needs a reference corpus named <lang>.txt.
        with open(lang + '.txt') as f:
            corpus = f.read()
        compressed = len(bz2.compress(corpus))
        # Extra bytes needed to compress the text once the corpus is known.
        results[lang] = len(bz2.compress(corpus + text)) - compressed
    # Smallest increase first, i.e. the best-matching language.
    return sorted(results, key=results.__getitem__)

if __name__ == '__main__':
    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()
$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py
Most likely English.
$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py
Most likely German.
$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py
Most likely French.
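For anyone who wants to try the idea without downloading corpora, here is a minimal Python 3 sketch of the same trick. It is not the script above: the `CORPORA` dictionary, the `extra_bytes` helper and the `compress` parameter are all made up for illustration, and the sample sentences are far too short to be a serious corpus. The compressor is passed in so that alternatives such as `zlib.compress` can be swapped for `bz2.compress`.

```python
import bz2
import zlib

# Tiny made-up corpora so the sketch runs without any files (assumption:
# the original script reads english.txt, german.txt and french.txt instead).
CORPORA = {
    "english": b"the quick brown fox jumps over the lazy dog while "
               b"the cat sleeps in the warm sun by the window ",
    "german": b"der schnelle braune fuchs springt ueber den faulen hund "
              b"waehrend die katze in der sonne am fenster schlaeft ",
    "french": b"le renard brun rapide saute par-dessus le chien paresseux "
              b"pendant que le chat dort au soleil pres de la fenetre ",
}


def extra_bytes(corpus, text, compress):
    """Bytes the compressor needs for `text` after it has seen `corpus`."""
    return len(compress(corpus + text)) - len(compress(corpus))


def classify(text, corpora=CORPORA, compress=bz2.compress):
    """Return language names ordered from most to least likely."""
    costs = {lang: extra_bytes(corpus, text, compress)
             for lang, corpus in corpora.items()}
    # The language whose corpus "explains" the text best adds the fewest bytes.
    return sorted(costs, key=costs.get)


if __name__ == "__main__":
    sample = b"the cat sleeps in the warm sun while the quick brown fox " \
             b"jumps over the lazy dog"
    print(classify(sample, compress=zlib.compress))
```

With inputs this small the per-stream overhead of any compressor is a large part of the result, so treat the ranking as a toy demonstration rather than a measurement.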
Anonymous said,
May 10th, 2011 at 1:02 am
from reverend.thomas import Bayes
Coren said,
May 10th, 2011 at 9:11 am
Ah, yes. I’ve seen this type of compression-based method used for authorship attribution.
It’s nifty how you can solve a complex task using such simple methods (though obviously someone had to think up the compression algorithm first).
Anonymous said,
June 22nd, 2011 at 2:56 pm
bz2 divides the input into independent 900 KB chunks, making a 4 MB corpus unscientific and ultimately destructive.
Anonymous said,
July 3rd, 2011 at 8:09 am
Saw this on /prog/, are you the author? This blew my beginning programmer’s mind