Rosio Pavoris a blog

Automatic language classification, the slow way

#!/usr/bin/python2

import sys
import bz2

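def classify(text, langs=('english', 'german', 'french')):
    """Rank langs by how few extra bytes text adds to each compressed corpus."""
    results = {}
    for lang in langs:
        # Each language needs a reference corpus in <lang>.txt.
        with open(lang + '.txt', 'rb') as f:
            corpus = f.read()

        # Size of the compressed corpus alone, then the growth once text is appended.
        compressed = len(bz2.compress(corpus))
        results[lang] = len(bz2.compress(corpus + text)) - compressed

    # Smallest growth first: the best-matching language is at index 0.
    return sorted(results, key=results.__getitem__)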

if __name__ == '__main__':
    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()

$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py
Most likely English.
$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py
Most likely German.
$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py
Most likely French.
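
The script does not say why this works, so here is a rough sketch of the idea: a compressor that has already seen a corpus in one language should need fewer extra bytes to encode more text in that same language. The version below is a Python 3 rewrite that returns the per-language cost instead of only the winner. The names `extra_bytes` and `rank_languages` are made up for illustration, and the corpora are passed in as bytes rather than read from `<lang>.txt`.

```python
import bz2


def extra_bytes(corpus: bytes, text: bytes) -> int:
    """Bytes that appending text adds to the bz2-compressed corpus."""
    return len(bz2.compress(corpus + text)) - len(bz2.compress(corpus))


def rank_languages(text: bytes, corpora: dict) -> list:
    """Return (language, extra_bytes) pairs, cheapest first.

    corpora maps a language name to its reference corpus as bytes.
    """
    scores = {lang: extra_bytes(corpus, text) for lang, corpus in corpora.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

To mirror the script above, one would read `english.txt`, `german.txt` and `french.txt` in binary mode into the `corpora` dict and pass `sys.stdin.buffer.read()` as `text`; the first pair in the result is the guess.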

4 Comments

  1. Anonymous said,

    from reverend.thomas import Bayes

  2. Coren said,

    Ah, yes. I’ve seen this type of compression-based method used for authorship attribution.
    It’s nifty how you can solve a complex task using such simple methods (though obviously someone had to think up the compression algorithm first).

  3. Anonymous said,

    bz2 divides the input into independent 900KB chunks, making a 4MB corpus unscientific and ultimately destructive.

  4. Anonymous said,

    Saw this on /prog/, are you the author? This blew my beginning programmer’s mind.
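
The block-size point in comment 3 can be made concrete. If bz2 compresses each block independently, then only the corpus bytes that share a block with the appended text can influence the score, and the rest of a multi-megabyte corpus is dead weight. A minimal Python 3 sketch, assuming the nominal 900KB block at bz2's default level 9, is to keep only as much of the corpus tail as fits in one block together with the text. `corpus_tail` and `extra_bytes_single_block` are hypothetical helper names, and the limit is approximate because bzip2 applies a run-length step before filling a block.

```python
import bz2

# Assumption: nominal bzip2 block size at compresslevel=9 (the bz2 default).
BLOCK_BYTES = 900000


def corpus_tail(corpus: bytes, text: bytes, block_bytes: int = BLOCK_BYTES) -> bytes:
    """Return the end of corpus that fits in one block alongside text."""
    room = block_bytes - len(text)
    if room <= 0:
        # The text alone fills a block, so no corpus bytes can share it.
        return b""
    return corpus[-room:]


def extra_bytes_single_block(corpus: bytes, text: bytes,
                             block_bytes: int = BLOCK_BYTES) -> int:
    """Growth in compressed size when text follows the trimmed corpus."""
    tail = corpus_tail(corpus, text, block_bytes)
    return len(bz2.compress(tail + text)) - len(bz2.compress(tail))
```

Passing a `block_bytes` a little below 900000 leaves a safety margin. Whether trimming actually improves accuracy over the original script is untested here.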
