Rosio Pavoris a blog

Χριστούγεννα

I gets presents! All of them came from ThinkGeek, because apparently nobody else ships to Belgium in less than three months.
I got an Einstein action figure, Tetris magnets (weakest magnets I’ve ever seen, though they’ll hold up a single piece of paper adequately), an RFID-blocking wallet thing, two of these (brain cell (which isn’t technically a microbe) and sleeping sickness), and these, which look really good on my copy of Origin but smell like death.

I was also supposed to get Real World Haskell, but apparently Barnes & Noble sucks and it hasn’t arrived “yet”. This only goes to confirm my suspicion that Haskell isn’t a real programming language, but rather a very elaborate running joke. You nearly had me with GHC, guys.

Anyway, I updated progscrape to address two nitpicking complaints and one marginally more serious one (though that post is mine), as well as fix the fact that it didn’t extract e-mails at all (despite claiming to).
The thread-parsing code could still do with a desucking, but I’m not going to do it.

The new code is here (diff), and since running that on your old database will cause all kinds of interesting errors, a new database is here (15.0 MB, 59.9 MB uncompressed). I could have written a script to just siphon data from the old database into the new one, but instead I rebuilt it from scratch so I could time how long it took:

real	123m19.681s
user	4m45.306s
sys	0m22.145s

That’s for 6,826 threads and 173,390 posts.

The main advantage is that it’s now possible to do things like

select * from posts where thread = 1228980536 and id = 50;

which used to require

select * from posts where thrid = 1228980536 order by time limit 1 offset 49;

which is considerably less secksy.

Ceterum censeo Rasmus Lerdorf is to be beaten to death with a hardcover copy of SICP.

Permalink 2 Comments

This semester was too long

But at least it’s over now. Exams start January 8th, Dynamic Websites project is due December 31st. I’ve done all my Christmas shopping, and the first family Christmas party isn’t until some time around New Year. I’m just not going to leave my bed for a week, I think.

In unrelated news, apparently the people recommending Snow Crash were trolling.
That may be too generous an assumption, but the alternative (that someone would actually like this book) is a bit too hard to believe. It reads like Jargon File fan fiction co-authored by Cory Doctorow and Shii.

The protagonist is a weeaboo half-Asian nigra Hacker™ (the last one, naturally) who wears a black leather kimono and carries around two Japanese Nipponese swords, being the best swordsman in Second Life the Metaverse (he doesn’t get credit for “predicting” Second Life; this was written after the invention of the Internet and right in the middle of the first 3D gaming hype). His sidekick is that classic cyberpunk archetype, the 15-year-old technoslut who takes off her clothes in her very first character development scene. The storyline is basically masturbatory techomysticism catering to 15-year-old wannabe Hackers™.
It’s one of those books where the author tries to disguise the fact that he has the vocabulary of a third-grader by making up half the words, and where “deliberate” shit writing is supposed to mean that he has a self-deprecating sense of humor. I’m sure the over-the-top stupidity is just parody, or something.

I really want to like cyberpunk. The concept is great in theory, and it tends to make for pretty visuals. If this and William Gibson are the best it can do, though, maybe it’s time for the genre to die.

Permalink 5 Comments

The Sieve of Eratosthenes

Prime numbers are ridiculously important in cryptography, and I recently found myself needing a way to generate them (for a toy implementation of the Diffie-Hellman key exchange algorithm). It keeps amazing me how few languages actually have libraries for generating prime numbers, or even just for primality testing.
Because I didn’t feel like thinking, and I didn’t need incredibly massive primes, I just wrote a quick implementation of the sieve of Eratosthenes, which is an interesting enough algorithm that I’d thought I’d share it (this is first-year stuff even for us, but I guess I have non-programmers in my audience).

The basic idea is that you start with a list of integers from 2 up to the number you want to find the largest prime smaller than. You then take the smallest number in this list (2), and cross off all multiples of this number (except for 2 itself). You repeat this for every other number, in order, and you’ll be left with just a list of primes.

Wikipedia has a visualisation which could be clearer but is still pretty good:

Sieve of Eratosthenes

The implementation (in Python) is dead simple:

def sieve1(n):
    l = range(n)
    l[0], l[1] = None, None
    for i in xrange(int(__import__('math').sqrt(n)) + 1):
        if l[i] is not None:
            j = i * 2
            while j < n:
                l[j] = None
                j += i
    return filter(None, l)

We could just use range(2, n) and then work with an offset, but the convenience of having the indices equal to the values at those indices is too useful to complain about the extra memory required for two additional integers. Note also the obvious optimisation of only going up to sqrt(n) + 1 (no apologies for the unorthodox import).
filter(None, l) gets rid of the Nones in our list. Passing None as the function argument just applies the identity function, which has the effect of getting rid of None, False, 0, and empty lists, tuples, and dictionaries (anything which would evaluate to False in an if statement).

This is nice, but it does have the disadvantage of having to generate the whole list every time you want more primes. You can’t just input a list of pregenerated primes and have it continue from there.
The fix is easy enough:

def sieve2(n, primes = [2, 3]):
    if n <= primes[-1]:
        return filter(lambda m: m <= n, primes)

    offset = primes[-1] + 1
    l = range(offset, n)

    max = int(__import__('math').sqrt(n))
    for p in primes:
        if p > max:
            break
        i = p * 2 - offset
        while i < 0:
            i += p
        while i < len(l):
            l[i] = None
            i += p

    for k in xrange(max - offset + 1):
        if l[k] is not None:
            i = k * 2 + offset
            while i < len(l):
                l[i] = None
                i += l[k]

    l = filter(None, l)
    primes.extend(l)

    return primes

This function uses memoisation, and takes advantage of a feature of Python’s function defaults which some believe to be a bug: if mutable default arguments are altered, they will remain altered for the next time the function is called.
This is a side-effect of the fact that functions have a single field that holds the default arguments (a tuple called func_defaults in the function’s namespace), which isn’t duplicated into a working copy when the function is called. Most default arguments will be things like integers and strings, which are immutable, so most people never even run into this. Being able to use it for memoisation is really handy, though. Alternatively, you could just use a global variable and access it from inside the function.

The algorithm is basically the same, except that if we’ve already generated the primes requested, we can just return them. Then we construct a list of all integers larger than our largest prime, and sieve it with our existing primes (now we really do have to work with an offset). Then we just sieve the rest classically. At the end, we add our newly sieved list to our old primes, and return that.
Next time we invoke the function, primes will still have the new primes. You can access them from outside the function at sieve2.func_defaults[0].

Evaluating sieve2(100) and then sieve2(1000) will be slower than just evaluating sieve1(1000), but it will be faster than evaluating sieve1(100) and then sieve1(1000). And if you happen to have a large number of pregenerated primes, you can save some more time.

The sieve of Eratosthenes is just one of a number of sieve-based prime number generators, but it’s the oldest one, and it compares favorably to modern improvements.

Permalink 2 Comments