Rosio Pavoris a blog

Automatic language classification, the slow way

#!/usr/bin/python2

import sys
import bz2

def classify(text, langs=('english', 'german', 'french')):
    results = {}
    for lang in langs:
        with open(lang + '.txt') as f:
            corpus = f.read()

        compressed = len(bz2.compress(corpus))
        results[lang] = len(bz2.compress(corpus + text)) - compressed

    return sorted(results, key=results.__getitem__)

if __name__ == '__main__':
    print "Most likely %s." % classify(sys.stdin.read())[0].capitalize()

$ wget -qO - http://www.gutenberg.org/ebooks/31469.txt.utf8 | ./classific.py
Most likely English.
$ wget -qO - http://www.gutenberg.org/ebooks/22367.txt.utf8 | ./classific.py
Most likely German.
$ wget -qO - http://www.gutenberg.org/ebooks/4968.txt.utf8 | ./classific.py
Most likely French.

Permalink 4 Comments

Processing constraints is easy

Alright, we’ve covered search trees in some detail, and they work great for problems where we have clear states and rules of production to move from one state to the next. Sometimes that’s not a very convenient way to state a problem, though, and a more natural way to think about things is as a bunch of variables which can take values in a certain domain, and a number of constraints which describe the relationships of these variables to each other.
The canonical example here is Dijkstra’s eight queens problem. However, that’s been done to death, so let’s instead have two queens and seven knights, and instead of the usual 8×8 chess board, let’s have a 6×6 one.

not queens problem

Read the rest of this entry »

Permalink 2 Comments

Towards a better BBCode

Everyone knows BBCode is a pain to work with, and while WordPress supports limited HTML in user comments, it should be obvious HTML is no better. The unnecessary repetition of SGML-based languages and the insistence on the proper nesting of tags makes them all hideous and unnecessarily error-prone. We can do better.
The discussions of learned societies on the subject have been less than satisfactory, so I decided to just implement my own mark-up language, based on the venerable S-expression:

{b This} is {i {u expert} {o mark-up}}.

This will turn into:

This is expert mark-up.

The immediate effect is that nesting problems and text redundancy immediately disappear. The syntax also lends itself to easy function composition:

{b.i.o.u EXPERT}

EXPERT

Finally, for this first version,0 we also support function iteration:

{sup*3 To the moon}{sub*3 and back.}

To the moonand back.

It goes without saying this can be combined with function composition in arbitrarily complex expressions, with the iteration operator having a higher precedence than the function composition operator.

I’ve elected to use curly braces rather than the more typical parentheses, because curly braces barely see any use in natural language, which is where this mark-up would generally be used. If you do need literal curly braces, you can escape them with a backslash (and if you need a literal \{, you can escape your backslash with a backslash).

As a proof of concept, and because I eat my own dog food, I’ve written (and enabled) a WordPress plugin that enables this SexpCode in blog comments. For sanity, iteration doesn’t go beyond *3. Supported tags are b, i, u, s, o, sub, sup, code, spoiler, quote, blockquote, and m. If you want to use it yourself, adding more tags or changing their definitions should be straightforward.
Trying to use an unsupported or empty tag, or having unbalanced braces (except for closing braces at the end), will assume you’re actually trying to post C-like code, and disable SexpCode for your comment.

Ladies and gentlemen, BBCode was our COBOL. This is our Lisp.

Edit: People who want to implement this themselves should be following this document rather than this post.

Edit again: Play with it!

Edit again again: More implementations:

Know of another implementation (SexpCode+ or SexpCode)? Let me know!


0 Future versions of the language are expected to add support for function arguments (for things like url, img, and colour) and the ability to define aliases (for example, {define exp b.i.o.u}, which would let you use a new exp function as if it were b.i.o.u).

Permalink 48 Comments

Julia settee

I’ve written this, so I might as well share it.

In my post on the Mandelbrot set earlier, I mentioned the Julia sets of the quadratic polynomial fc(z) = z2 + c where c is a given (constant) complex number and z are the points of the complex plane. Because I wanted to visualise how those Julia sets changed as c varied, I’ve written a short program to do that for me.
You can find it here. As usual, you’ll need Allegro, and the compilation instruction is on the first line.

What it does is take two complex numbers as parameters, plus the number of steps it should take to go from the first to the second. At each step, it will calculate and display the Julia set of the quadratic polynomial with that complex number as c, and hopefully your computer is fast enough that the successive Julia sets look like an animation.
For example, if you invoke it as:

./julia -0.8 -1 -0.8 1 200

you’ll see the following:

Though probably not at the same speed. I’ve made no effort to maintain a certain frame rate; the whole thing moves as quickly as your CPU can keep up, because I just wanted a visualisation of how Julia sets change, not a screensaver. If it’s moving too slowly for you, you can try reducing the number of steps, or lowering the numbers in the ZOOM or ITERS #defines, though the first one will make the image smaller and the second will make it darker. If you aren’t interested in the window title, you can also remove the snprintf and set_window_title steps for a significant speed-up.
If it’s too fast, you can do the reverse, or you can build in a delay with Allegro’s install_timer and rest, or POSIX’s usleep or nanosleep.

(Once it’s done, it will just pause at the last Julia set. Press any key to close it. If you want to close it before it’s done, you’ll have to kill it manually.)

The interesting points to explore are the ones inside the Mandelbrot set, as anything else will just be Fatou dust (though you’ll probably still be able to see it because of the grey). For those points, the salient area is the one within 2 unit lengths of the origin, which is why the field displayed ranges from (-2, 2) in the top left corner to (2, -2) in the bottom right (or probably (-2, -2) to (2, 2), I don’t remember). If you need a bigger plane, replace all instances of ZOOM * 4 with ZOOM * (bigger number), and all instances of ZOOM * 2 with ZOOM * (half of bigger number) (if you want to keep the origin in the center of the window).
If you actually want to save the animations, man 3alleg save_bitmap and assemble the images yourself in something like the GIMP. I initially started out doing it this way, but animated GIFs get really big really quickly, so I went with this instead.

Enjoy.

Permalink Comments

Playing games is easy

People who take an active interest in AI are quite unlikely to have very many friends, so it should come as no surprise that trying to get computers to play games has always been a popular subfield of AI. Traditionally that game has mostly been chess, but I feel chess has a grinding tedium to it, so we’re going to look at tic-tac-toe instead, because that at least has the benefit of being over quickly.

Read the rest of this entry »

Permalink 4 Comments

Mandelbrots

I was bored, so I made this.

Rorschach test on fire

Basic introduction to the Mandelbrot set and what this image represents follows.

Read the rest of this entry »

Permalink 11 Comments

Optimal search is easy

Last time we looked at how to solve the eight puzzle using the hill climbing algorithm, which gave us a result much more quickly than a blind depth-first search did, but we wondered if the solution we found was the best we could do, and we asked if there was a way to use heuristics to find not just a solution, but the best solution. Today, we’ll see that there is, and it’s actually really straightforward.

Read the rest of this entry »

Permalink 6 Comments

Heuristics are easy

(This post assumes you read the previous one.)

Today we’ll be looking at the hill climbing algorithm, which is just a plain old depth-first search with heuristics added.
“Heuristics” is a fancy word (from the Greek εὑρίσκω, “I discover”) for a very simple concept. In the context of search trees, it simply means that at a every node, you’re going to look at each possible branch, and take the one that looks the most promising first, instead of just one at random. “Most promising” can be a tricky concept, though.0
Our river-crossing example isn’t necessarily the best one to demonstrate the concept, so let’s go with another classic: the 8 puzzle.

Read the rest of this entry »

Permalink 3 Comments

Search trees are easy

A decent proportion of my readers are noobie programmers or people who aren’t in a position to receive a formal CS education, so I thought I’d cover the basics of a fundamental concept most people cover in their first semester of algorithms or AI today: search trees. The fact that my college considers this to be third-year material so advanced they cannot in good faith make the class compulsory is neither here nor there.

Consider the famous problem of the farmer who wants to cross a river with his fox, goose, and grain, though the only boat can only carry himself and one of these three possesions. Ignore for a moment why a farmer would own a fox, and let’s stretch credibility a bit more by assuming that while the fox and goose are well-trained enough not to wander off in the absence of the farmer, they are not trained not to eat the goose or the grain, respectively, in said absence. How can he safely get to the other side without losing his goose or grain?

Oh noes!

Read the rest of this entry »

Permalink 6 Comments

Xlib hates me

Having finished another popsci book on chaos theory recently (Ian Stewart’s Does God Play Dice?), I thought it’d be an interesting exercise to visualise the Lorenz attractor, and since it’s been a while since I’ve done anything new in programming, to take the opportunity to get into Xlib, the X Window System C library. Results aren’t very encouraging.

I mean, I got something to work easily enough, but any attempt at introducing color beyond black and white for clarity fails miserably and in non-deterministic ways. Eventually I gave up and redid it using something I know.
Compare:

Lorenz attractor (X11) Lorenz attractor (Allegro)

(It’s prettier animated, so do compile the code yourself and see.)
In both cases, the screen represents the Cartesian plane (X-axis horizontal, Y-axis vertical, origin right in the center; one unit is ten pixels). In the Xlib version (left) the Z component is ignored entirely (so it’s really a projection of the attractor onto the Cartesian plane), in the Allegro version (right) some attempt at representing it using shades of gray has been made, with z=0 being black and z=55 being white (though because it is drawn with no real care, it will happily scribble dark lines over light ones if it has to).
You can mess with the variables and starting condition to see how it behaves, or swap around some Xs and Ys and Zs to get different angles, and at least in the Allegro version, messing with color is trivial enough.

Which brings me to my question: does anyone know of decent introductions to Xlib? The Internet is full of tutorials, and as usual, all of them seem to suck. I know Xlib isn’t really supposed to be used directly, but I want to.

Permalink 3 Comments

1 Kings 7:23

Permalink 6 Comments

Cisco sucks at crypto

I’m in a class called Netwerkbeheer (Network Management), which spans two semesters and is a transparent excuse to peddle CCNA certifications. As a result, I spend a lot of time playing with Cisco routers and switches, and one of the many, many things that annoy me about Cisco’s IOS is their cavalier attitude towards security and cryptosystems. A particularly egregious example of this is Cisco’s type 7 encryption.
If you’ve ever configured a Cisco router, you’ve probably encountered it. When the misleadingly named service password-encryption is running, setting a password with the enable password command “encrypts” the password, so that when you issue the show running-config command, you’ll see a line like

enable password 7 08314940000A

instead of the plaintext password, which you’d see if the so-called “password-encryption” was turned off.
Type 7 “encryption” manifests itself in a few other places, including in FTP passwords and various routing protocol authentication passwords.

Type 7 has been known to be broken for a decade and a half now,0 but people continue to use it, almost always for bad reasons.1,2 To drive home just how broken type 7 is, let’s look at it in detail.

The general form of the type 7 “ciphertext” is (0[0-9]|1[0-5])([0-9A-F]{2})+. Some experimenting finds that the length of the “ciphertext” is always twice the length of the plaintext, plus two. Can you guess why?

The “encryption” key is always a number in the range 0-15, which would be easy enough to bruteforce, but that turns out to be unnecessary, since it’s provided (in decimal form) as the first two characters of the “ciphertext”.
That key determines the starting point in a table of twenty-six secondary keys (which, incidentally, is dsfd;kfoA,.iyewrkldJKDHSUB; I don’t know why the table has 26 entries instead of 16), which are XORed in turn with the characters in the plaintext. If the key is, say, 7, the first character in the plaintext is XORed with the seventh character in the table, the second character in the plaintext is XORed with the eighth character in the table, the third with the ninth, &c.
Each resulting character is then converted to two hexadecimal digits (the input can only be ASCII, of course) and appended to the ciphertext.

And that’s seriously all there’s to it. The result is a “cipher” that’s either slightly less or slightly more secure than writing out your passwords in permanent marker on the outside of the door of the server room, depending on how you manage your configuration files.
Because I know this is going to be an issue at some point, I’ve written a simple utility that encrypts and decrypts passwords using type 7, which you can find here.

You’d think this would be a moot point because people should realise their configuration files are sensitive information, but people are, of course, idiots. In that sense, type 7 isn’t just worthless, but actively harmful, because it gives people a false sense of security.


0 http://insecure.org/sploits/cisco.passwords.html

1 The original intent of type 7 was apparently to foil shoulder-surfers, who might see your configuration file as it scrolls by on your screen. Cisco’s official stance (now) is that if security is an issue, the router configuration file itself should be treated as vulnerable data, not just the passwords that may or may not be displayed in it. That would be fair enough, if it wasn’t at odds with Cisco’s default way of saving and loading configuration files, which is through plain TFTP over the regular network, with no options for encryption of either the config or the passwords themselves. But, you know.
(The claim that type 7 is so weak because the router has to be able to reverse it is bullshit, of course. At most it’s true for PAP authentication, but anyone who considers PAP passwords secret information has no business being anywhere near a router.)

2 Cisco themselves now advise against using it, instead suggesting people use type 5, which isn’t encryption, but just hashing with MD5. Which is also broken, of course. The CCNA materials also state that at least type 7 is “better than no encryption”, but I’d argue that it’s worse, because its security is equivalent to plaintext, while also giving idiot network admins the impression that it’s not.
I’m told a type 6 exists now, which is based on AES and supposed to be better. AFAIK our routers don’t support it, and I’m not holding my breath either way.

Permalink 13 Comments

Afrikaners don’t know what the hell they’re talking about

I’m not even talking about apartheid or AIDS. I’m talking about language. Quite apart from the fact that they incomprehensibly keep pretending their dialect is a language in its own right, they keep applying the wrong words to things, and then passing them on to other languages.
Three examples.


Exhibit A: Meerkat

meerkatmeerkat

This is probably the most famous example, and also an odd one, because mainstream Dutch is in the wrong here too. Examine the pictures above.

The one on the left is Suricata suricatta, a member of the mongoose family. In Afrikaans this has been called a meerkat, and this has been adopted into English. In regular Dutch it’s a stokstaartje (“little stick tail”).

The one on the right is a vervet monkey (Chlorocebus pygerythrus), one of thirty-five species of Old World monkey in the tribe Cercopithecini, which in Dutch are collectively called meerkatten.

This is, of course, an absurd name for either of those, as meerkat means “lake cat”. If anything should be called this, it should be Prionailurus viverrinus, which currently labors under the descriptive but utterly boring name fishing cat.

fishing cat

This medium-sized cat is semi-aquatic, and while it prefers streams and swamps to actual lakes, at least the name would be sort of appropriate. Being medium-sized, it’s also meer kat than the housecat.

Let’s just agree to call Suricata suricatta suricates, okay?


Exhibit B: Steenbok

steenboksteenbok

The one on the left is Raphicerus campestris, a small antelope native to southern and eastern Africa. For some reason, it was named “steenbok” after the one on the right, Capra ibex, the ibex, one of a few goat species called steenbok in Dutch. I have no idea what Afrikaners call the actual steenbok (the one on the right, that is; I know Germans call it Steinbock, and the other one Steinböckchen—that is to say, the diminutive), but the Dutch have taken to calling R. campestris “steenbokantilope”, which at least is fair enough.

Steenbok, of course, translates to stone buck (as in a male goat), which makes sense for the ibex because it lives in the Alps. It very much does not make sense for an antelope that spends its days in grassland.

Since it’s closely related to the two species of grysbok (which by rights should be spelled grijsbok), call it the fancy grysbok and stop confusing people.
Even though it’s not grey.


Exhibit C: Eland

elandeland

The last one is particularly ridiculous. Yes, that’s a moose. In Dutch, meese (elk, in European, though the North-American elk is something else; that’s a different discussion) are called eland. Afrikaners named two species of antelope eland because apparently they’re blind. Even the giant eland (Taurotragus derbianus, pictured) doesn’t even come close to Alces alces in size. The common eland, Taurotragus oryx, is even smaller.

The common eland (just eland in Afrikaans) is called the elandantilope in Dutch. The giant eland is reuzenelandantilope (the prefix reuzen- meaning “giant”). If you’re going to keep the dumb name, “eland antilope” and “giant eland antilope” seem like a good compromise.

I hate Afrikaans.


Having said that, there are some words that made it into English that they get right (boomslang, for example, means tree snake, and it’s exactly that), and a lot that, while dumb, aren’t confusing (aardvark (“earth pig”), aardwolf (“earth wolf”), wildebeest (“wild beast”), hartebeest (more correctly hertebeest, “deer-type animal”); all of these are at least vaguely misspelled by modern standards). Many non-animal words that made it into English are even fully accurate: spoor, veld, trek, and, of course, apartheid.
The three listed examples, though, are bunk. The animals are awesome enough to deserve decent names of their own.

And I still say Afrikaans is just a dialect of Dutch. It’s closer to standard Dutch than, say, Limburgs or West-Vlaams, and while there’s a movement to have those recognised as separate languages, that’s a tiny, tiny minority. The only real difference is that Afrikaans has a standardised spelling and good reasons to hate the Dutch.

(But then, so do we.)

Permalink 3 Comments

Quadratic spline interpolation

You’ve had this problem before: you have a bunch of data points, and you want to interpolate between them.
For various reasons, higher order polynomial interpolation (where you try to find an nth-degree polynomial through n + 1 of your data points) can be a bad idea, so you decide that rather than using a simple equation, you’ll use a series of them to connect your data points. These equations are splines, and the simplest form of spline interpolation is just, well, connecting your data points directly:

That’s pretty ugly, though. Is there a way to achieve something like this:

instead?
Yes, obviously, and one of those ways is to use quadratic splines instead. Let’s use a simpler example, though. Suppose we only have four data points, (x0, y0) through (x3, y3):

linear spline interpolation

The black dots are our actual data points, the red lines are our linear splines. What we’d actually like, though, is this:

quadratic spline interpolation

Turns out that’s not that hard to do. As you can see, every spline is a quadratic equation, which obviously is of the form f(x) = ax² + bx + c. So each spline equation has three unknowns (a, b, and c), and there are three splines, for a total of nine unknowns (let’s call them a1 through a3 and so on).

Since two points are known for each spline equation, that gives us the following six equations:

To solve for nine unknowns, obviously we need nine equations. So what else do we know?

Well, the reason the linear spline interpolation looks like crap is because of the sharp breaks at the spline edges, so we would like our neighboring quadratic splines to have the same slope in the point that they share. In other words, if our spline equations are f, g, and h, we want the derivative of f to equal the derivative of g in the point (x1, y1), and we want the derivative of g to equal the derivative of h in the point (x2, y2).
The derivative is easy enough to find:

Filling in, this gives us two more equations:

Or equivalently:

Which brings our total to eight equations. We aren’t going to squeeze another legitimate equation out of this, so let’s just fill in one of the unknowns ourselves. If we make one of the as equal to 0, one of the quadratic splines becomes a linear spline, which is fine. Let’s take a1 for simplicity’s sake.
This enables us to construct the following matrix:

The first three columns are the as, the next three the bs, the next three the cs, and the final column will hold the solutions after reduction.
Filled in and solved for our particular dataset:

Which gives us the following equations for our splines:

Obviously this is a lot of work, but it’s mechanical work that doesn’t require a lot of judgement. Which is why I’ve written this Python script to do it for you. Feed it data points and it’ll produce gnuplot code to plot your splines:

$ python qsi.py < data.txt
plot 1.000000 <= x && x <= 3.000000 ? 0.000000 * x * x + 1.500000 * x + 1.500000 :\
     3.000000 <= x && x <= 5.000000 ? -1.250000 * x * x + 9.000000 * x + -9.750000 :\
     5.000000 <= x && x <= 9.000000 ? 1.125000 * x * x + -14.750000 * x + 49.625000 : 0/0 notitle

As you can tell, it’s not necessarily gorgeous, but it (probably) works, and it’s not like anyone has to see the code itself.
Format for the input file is as you’d expect: two numbers per line, first being x and second y, sorted. If gnuplot‘s output is jagged, increase the sampling (set sample 1000).
And if it doesn’t work, fix it.

Edit: In light of overwhelming demand, this is a script that interpolates using a higher-order polynomial, as mentioned above. Here’s how the approaches compare for our sample dataset:

This script will fail if you only have one datapoint and its x value is 0, but everything else should work.

Permalink 6 Comments

Literate Tripcrackers

There’s been some interest in tripcode crackers lately, so I thought I’d write on in Haskell. I mentioned this before, but I’ve improved it a bit since.
I’ll be discussing the code step by step in this post. By the end, we should have a working application that takes a POSIX regex as an argument, and then outputs tripcodes that match it.

If everything goes right, this post should be literate Haskell, but I can’t promise that it’ll actually work, what with WordPress being what it is.
Let’s get started.

The tripcode algorithm is relatively straightforward: the input is converted to SJIS, there are some XML character entity substitutions, then a salt is calculated, and the whole thing is passed to Unix crypt.
We won’t be dealing with SJIS conversions, since our input will be ASCII only, which (with one exception) is a subset of SJIS anyway, and our target board will probably be Shiichan or Futallaby anyway, neither of which even does it. We also won’t be writing our own crypt implementation in Haskell, so we’ll have to get it from a C library. To do this, we use the Foreign Function Interface language extension, like so:

> {-# LANGUAGE ForeignFunctionInterface #-}

The usual boilerplate:

> module Main (main) where

> import Char (chr, ord)
> import Data.List (inits)
> import Foreign.C
> import System (getArgs, exitFailure)
> import System.IO.Unsafe (unsafePerformIO)
> import Text.Regex.Posix ((=~))

It’ll become clear why we need all of these as we go along.
Let’s import our C crypt:

> foreign import ccall unsafe "DES_crypt" crypt :: CString -> CString -> CString

We’re using OpenSSL’s implementation, because GNU crypt is slow as fuck, and this thing is going to be slow enough as it is. That bit of code is just saying to take a C function named DES_crypt from a linked library, and expose it as a Haskell function named crypt, with the listed type signature.

We said the tripcode algorithm involves some XML character entity substitutions, so let’s write a function to do that. The canonical algorithm just escapes ", <, and >. If yours escapes more (or fewer), just add (or remove) them here.

> xmlescape :: String -> String
> xmlescape [] = []
> xmlescape (x:xs) = case x of
>     '"' -> (++) "&quot;" $ xmlescape xs
>     '<' -> (++) "&lt;"   $ xmlescape xs
>     '>' -> (++) "&gt;"   $ xmlescape xs
>     otherwise -> (:) x   $ xmlescape xs

Straightforward enough.
Next, we’ll need a function to generate the salt. crypt‘s salt is string of length 2 whose characters are in the range [a-zA-Z0-9.]. The tripcode function obtains this by appending H.. to the input and taking the second and third characters, performing some transformations to ensure they’re in the allowed range:

> salt :: String -> String
> salt t  =
>     map f . take 2 . tail $ t ++ "H.."
>     where
>         f c
>             | notElem c ['.'..'z'] = '.'
>             | elem c [':'..'@'] = chr $ ord c + 7
>             | elem c ['['..'`'] = chr $ ord c + 6
>             | otherwise = c

Now we’re ready for the actual tripcode.
This will happen in the IO monad, because we’re dealing with the FFI. We don’t have to use unsafePerformIO to escape from it, but to keep our algorithm conceptually cleaner, we will anyway.1

> tripcode :: String -> String
> tripcode tr = unsafePerformIO $ do
>     trip <- newCString t
>     salt <- newCString $ salt t
>     peekCString (crypt trip salt) >>= return . drop 3
>     where t = xmlescape tr

Great. Now that we have everything we need to calculate the tripcode for a given input, we need a way to generate inputs.
Let’s first start by specifying the characters we want to allow. We’re leaving out # and ! because they’re usually separator characters and as such illegal in tripcodes (though most boards will allow !), and \ because that’s the aforementioned ASCII/SJIS exception.

If you’re targetting a specific algorithm and you know which characters are fine and which aren’t, you can edit this. For Shiichan, for instance, the only forbidden character is #.

> allowedChars :: [Char]
> allowedChars = filter (\c -> c `notElem` "#!\\") [' '..'~']

You can avoid the call to xmlescape altogether by disallowing characters that would be escaped, of course; that’s what I did for my C implementation, too. That will obviously reduce your search space, though.

Now we can use these characters to generate our (infinite) list of inputs:

> ins :: [String]
> ins = (inits . repeat) allowedChars >>= sequence

This will generate all possible combinations, from "" to strings of infinite length. Since crypt disregards input over eight characters wide, it will generate far more combinations than you’ll ever need. Since it will take almost forever to even get up to that point, though, the issue is kind of moot.2

Now that we have our infinite input list, we can turn this into the input/output combinations we need; first we’ll team up each input with its corresponding output with zip, and then we’ll filter it with our regular expression. Obviously, this is where Haskell’s laziness is really handy:

> tripPairs :: String -> [(String, String)]
> tripPairs regex = filter (\(a, b) -> b =~ regex) $ zip ins $ map tripcode ins

It’s cute that (=~) is used for regular expressions. It’s also handy that it takes a string as its second argument and we don’t have to dick around with compiling regular expressions and what have you.
(=~) actually has a pretty complicated type signature, but its return value is a polymorphic value that converts into a boolean without complaints, so we don’t have to worry about that.

So what do we have now? We have our tripcode function, our list of inputs, and a function that filters this list based on a regex argument. We’re pretty much done. Now we only need the main function:

> main :: IO ()
> main = getArgs >>= f
>     where
>         f []       = putStrLn " USAGE: tripcode [regex]" >> exitFailure
>         f (arg:xs) = mapM_ (putStrLn . show) $ tripPairs arg

When no argument is passed on the command line, we display a short usage note and return failure, otherwise we generate our list of matching pairs, turn them into displayable strings, and then print them to stdout. Try it, it works!3

It’ll be slow as fuck, though,4 for a number of reasons. The first is that OpenSSL’s crypt, while much faster than GNU crypt, still isn’t very fast. The second is that Haskell’s Text.Regex.Posix is very slow (if you know how, I suggest you use another regex library; I went with Text.Regex.Posix because most casual Haskellers (which I am too) aren’t likely to have any of the others). The third is that it can only use one processor at a time, whereas most others are multithreaded or multiprocess affairs. The fourth is that a high-level language like Haskell is obviously going to be slower than tripcode crackers written in C and Sepples by moon people. The fifth is that I suck at Haskell.

Still, at least it’s written in the world’s leading fictional programming language.

Edit: Just to make it absolutely clear, because this post gets a lot of hits: you’d have to be an idiot to actually use this to look for tripcodes. Write your own cracker in C. It’s easy and will be much faster. (Or if you don’t want to, I wrote one that was posted elsewhere in the thread you got this link from.)


1 If we didn’t, the type signature would be String -> IO String, of course.

2 If you only want to check eight-character inputs, though, you can do something like ins = sequence . take 8 $ repeat allowedChars instead.

3 Copy/paste this post into a file named Tripfind.lhs, then compile it using ghc -lssl Tripfind.lhs

4 Slower than my C one almost by an order of magnitude, even after adjusting for the fact that my C one can use both of my processors. And since my C one is already slower than, say, Tripcode Explorer by an order of magnitude…

Permalink 3 Comments

Strange attractor

You know Sierpiński gaskets, right? I used one in my Christmas tree last December. They’re fractals created by taking a triangle, connecting the midpoints of the sides to divide it into four, removing the middle one, and then repeating that on the remaining triangles, ad infinitum (literally). They have an area of 0 and a Hausdorff dimension of log2(3).

There’s another, more interesting way of constructing them, though: take the three corner points of a triangle, and a random starting point x. Roll a three-sided die1 to select a random corner point, and mark the midpoint between that point and x. Then, this midpoint becomes x. Repeat forever.

It turns out the Sierpiński gasket is the attractor for this system. I’ve written a Python script to save you some paper and a large number of pencils.2 Here is the result I got after 10,000 iterations:

Sierpiński gasket attractor

To run the script, you’ll need to have the Python Imaging Library installed. It takes three optional arguments: the side of the triangle in pixels (defaults to 1,000), the number of iterations (default 10,000), and the output filename (default out.png).

Strange attractors are fun. Coming up: more of the same.

Edit: If you’d prefer something you can see on the screen to something that dumps to a file, this may interest you. You can change the WIDTH and HEIGHT #defines to your actual resolution if you like (or basically any value, really; it should produce an equilateral gasket for any realistic resolution, though it might not for odd ones, including ones that are higher than they are wide).
You’ll need (besides a C compiler) the Allegro libraries. If you’re using a Debian-based distro, the package is liballegro-dev, IIRC, or you can get them here.


1 Or a six-sided one where you divide the result by two, rounding down, if you like.

2 It’s not exactly the same thing: raster images don’t have an infinite resolution, IEEE floating point numbers don’t have infinite precision, and you (probably) don’t have the patience to let your computer run forever.3

3 If you do, consider cracking tripcodes instead.

Permalink 1 Comment

Forced indentation of Huffman encoding

Inspired by rmuser’s Youtube videos on information theory (and specifically the one about Huffman encoding), I wrote a Python script to calculate a Huffman encoding for text.

It reads input from stdin (preferably in ASCII), calculates a Huffman mapping, and shows it to you. It also calculates how long the text would be if encoded with that mapping, and how many bytes you’ve saved compared to ASCII1, which just uses a byte for each character, regardless of how frequently it’s used.

Here’s the result of running it on itself:

$ python huffman.py < huffman.py
Symbol	Freq	Encoding
' '	560	10
'e'	262	111
't'	138	1100
's'	134	1101
'r'	125	00001
'\n'	110	00011
'f'	105	00100
'n'	98	00110
'l'	96	00111
'a'	75	01100
'i'	72	01110
'o'	67	000000
'd'	61	000100
'.'	54	001010
'('	48	010000
')'	48	010001
'c'	46	010011
','	46	010100
'q'	45	010101
'm'	36	011011
'b'	35	011110
'u'	32	0000010
':'	30	0001010
'_'	26	0001011
'='	26	0010110
'y'	24	0010111
'p'	24	0100100
'h'	22	0101100
'g'	20	0101110
'['	11	01001011
'%'	11	01011010
'#'	11	01011011
'1'	10	01101000
'"'	10	01101001
'0'	9	01101010
"'"	9	01101011
']'	9	01111100
'F'	9	01111101
'\\'	8	000001110
'T'	5	010111100
'+'	5	010111101
'v'	5	010111110
'w'	5	010111111
'/'	4	0000011010
'R'	4	011111100
'3'	4	011111101
'x'	4	011111110
'k'	4	011111111
'I'	3	0100101000
'2'	3	0100101001
'j'	3	0100101010
'S'	3	0100101011
'>'	2	00000110000
'C'	2	00000110010
'A'	2	00000110011
'E'	2	00000111100
'B'	2	00000111101
'P'	2	00000111110
'H'	2	00000111111
'*'	1	000001100010
'!'	1	000001100011
'8'	1	000001101100
'-'	1	000001101101
'W'	1	000001101110
'L'	1	000001101111

Encoded message length: 12237 bits (1529.62 bytes)
This message contained 2634 characters. Huffman encoding saved 1104 bytes
compared to ASCII.

As you can see, the most frequently-used characters have the shortest encoding, while the rarest have the longest. I’m assuming that means it’s working the way it should.

Simple toy, but it beats paying attention in class.


1 If the input isn’t in ASCII, it should still come up with a correct mapping, but that last bit will be off by a bit.

Permalink 2 Comments

Here’s a math problem for you

\frac{64^8!}{64^{8n}(64^8-n)!}=\frac{1}{2}

Solve for n.

Edit: This is basically the reverse birthday problem, with the fixed probability being 50%. Just applying that approximating formula is a bit easier than working out the problem above:

n(p)\approx\sqrt{2\times 64^8\times \ln\left(2\right)}\approx 19753662

Which is much lower than I expected and also crap. It means that the algorithm I’m using for my work-in-progress distributable tripcode searcher is broken. There are some obvious ways to fix it, but finding a way that’s both good and not likely to slow things down too much1 (though obviously the bottleneck is still going to be crypt itself) requires some thought.


1 Right now it’s running at about 330,000 tripcodes per second on my laptop, which is over four times the speed of tripper+ and over sixteen times the speed of my Haskell tripcode implementation, so I guess some slow-down wouldn’t kill it.
Though Asztal claimed 4 million tripcodes per second with Tripcode Explorer on his machine, so I suppose there’s still room for improvement, too.

Permalink 6 Comments

Dawkins on race

One thing that continues to annoy me whenever my internets get into a discussion about race is that invariably, very nearly everyone gets it wrong. The most recent example of this is, of course, when some Stormfront morons declared war on Pharyngula.

On the one hand you have the common racists, which are wrong for obvious and uninteresting reasons, but on the other you have the “enlightened” people who claim race is entirely a social construct, or at least of no significance whatsoever. They’re wrong too.

The following is an excerpt from Richard Dawkins’ The Ancestor’s Tale, which I finished a few weeks ago. It may be the clearest explanation I’ve seen so far.

Dick the DawkIt is genuinely true that, if you measure the total variation in the human species and then partition it into a between-race component and a within-race component, the between-race component is a very small fraction of the total. Most of the variation among humans can be found within races as well as between them. Only a small admixture of extra variation distinguishes races from each other. That is all correct. What is not correct is the inference that race is therefore a meaningless concept. This point has been clearly made by the distinguished Cambridge geneticist A. W. F. Edwards in the recent paper called ‘Human genetic diversity: Lewontin’s fallacy’. R. C. Lewontin is an equally distinguished Cambridge (Mass.) geneticist, known for the strength of his political convictions and his weakness for dragging them into science at every possible opportunity. Lewontin’s view of race has become near-universal orthodoxy in scientific circles. He wrote, in a famous paper of 1972:

It is clear that our perception of relatively large differences between human races and subgroups, as compared to the variation within these groups, is indeed a biased perception and that, based on randomly chosen genetic differences, human races and populations are remarkably similar to each other, with the largest part by far of human variation being accounted for by the differences between individuals.

This is, of course, exactly the point I accepted above, not surprisingly since what I wrote was largely based on Lewontin. But see how Lewontin goes on:

Human racial classification is of no social value and is positively destructive of social and human relations. Since such racial classification is now seen to be of virtually no genetic or taxonomic significance either, no justification can be offered for its continuance.

We can happily agree that human racial classification is of no social value and is positively destructive of social and human relations. That is one reason why I object to ticking boxes in forms and why I object to positive discrimination in job selection. But that doesn’t mean that race is of ‘virtually no genetic or taxonomic significance’. This is Edward’s point, and he reasons as follows. However small the racial partition of the total variation may be, if such racial characteristics as there are are highly correlated with other racial characteristics, they are by definition informative, and therefore of taxonomic significance.

It’s not surprising that Lewontin’s1 views are most popular in the US, where casual racism is so common many smart people are so eager to dissociate themselves from it they swing too far in the other direction.

Dawkins then goes on to say that if we have a person and we are told about his sex, we immediately know more about the shape of his genitals, though not with absolute certainty. That is to say, our uncertainty about some of his attributes is reduced. Similarly, if we are told this person is black, our uncertainty about a number of his attributes, such as (but not exclusively) the color of his skin, is reduced as well, so it’s intuitively obvious that race cannot be exclusively a social construct.

The whole thing is worth reading, though the book as a whole is not his best. If you’re going to buy it, buy the hardcover version. It’s expensive, but the book relies on pictures too much for the paperback to be very useful.

Incidentally, contrary to what aforementioned Stormfront morons claim, there is no conclusive causative link between race and IQ. It’s true that blacks on average have a lower IQ than whites in the US, but that difference disappears once you adjust for class (the lower classes tend to have lower IQs than the upper classes, of course, given the strong correlation between IQ and education levels), and the fact that blacks on average tend to be lower class than whites seems to be more of a result of discrimination based on racism than it is of anything inherent in blacks.2

Either way, this whole discussion makes me tired. Talking to either side in it is like talking to a brick wall.


1 It should also not be surprising that Lewontin is an erstwhile compatriot of Gould’s, and a longtime opponent of the straw-man “genetic determinism” of evolutionary psychology.

2 And to the “other side”, before you try to dispute the validity of IQ testing, I suggest you at least read this article (Wikipedia has an article about that article).

Permalink 10 Comments

The Sieve of Eratosthenes

Prime numbers are ridiculously important in cryptography, and I recently found myself needing a way to generate them (for a toy implementation of the Diffie-Hellman key exchange algorithm). It keeps amazing me how few languages actually have libraries for generating prime numbers, or even just for primality testing.
Because I didn’t feel like thinking, and I didn’t need incredibly massive primes, I just wrote a quick implementation of the sieve of Eratosthenes, which is an interesting enough algorithm that I’d thought I’d share it (this is first-year stuff even for us, but I guess I have non-programmers in my audience).

The basic idea is that you start with a list of integers from 2 up to the number you want to find the largest prime smaller than. You then take the smallest number in this list (2), and cross off all multiples of this number (except for 2 itself). You repeat this for every other number, in order, and you’ll be left with just a list of primes.

Wikipedia has a visualisation which could be clearer but is still pretty good:

Sieve of Eratosthenes

The implementation (in Python) is dead simple:

def sieve1(n):
    l = range(n)
    l[0], l[1] = None, None
    for i in xrange(int(__import__('math').sqrt(n)) + 1):
        if l[i] is not None:
            j = i * 2
            while j < n:
                l[j] = None
                j += i
    return filter(None, l)

We could just use range(2, n) and then work with an offset, but the convenience of having the indices equal to the values at those indices is too useful to complain about the extra memory required for two additional integers. Note also the obvious optimisation of only going up to sqrt(n) + 1 (no apologies for the unorthodox import).
filter(None, l) gets rid of the Nones in our list. Passing None as the function argument just applies the identity function, which has the effect of getting rid of None, False, 0, and empty lists, tuples, and dictionaries (anything which would evaluate to False in an if statement).

This is nice, but it does have the disadvantage of having to generate the whole list every time you want more primes. You can’t just input a list of pregenerated primes and have it continue from there.
The fix is easy enough:

def sieve2(n, primes = [2, 3]):
    if n <= primes[-1]:
        return filter(lambda m: m <= n, primes)

    offset = primes[-1] + 1
    l = range(offset, n)

    max = int(__import__('math').sqrt(n))
    for p in primes:
        if p > max:
            break
        i = p * 2 - offset
        while i < 0:
            i += p
        while i < len(l):
            l[i] = None
            i += p

    for k in xrange(max - offset + 1):
        if l[k] is not None:
            i = k * 2 + offset
            while i < len(l):
                l[i] = None
                i += l[k]

    l = filter(None, l)
    primes.extend(l)

    return primes

This function uses memoisation, and takes advantage of a feature of Python’s function defaults which some believe to be a bug: if mutable default arguments are altered, they will remain altered for the next time the function is called.
This is a side-effect of the fact that functions have a single field that holds the default arguments (a tuple called func_defaults in the function’s namespace), which isn’t duplicated into a working copy when the function is called. Most default arguments will be things like integers and strings, which are immutable, so most people never even run into this. Being able to use it for memoisation is really handy, though. Alternatively, you could just use a global variable and access it from inside the function.

The algorithm is basically the same, except that if we’ve already generated the primes requested, we can just return them. Then we construct a list of all integers larger than our largest prime, and sieve it with our existing primes (now we really do have to work with an offset). Then we just sieve the rest classically. At the end, we add our newly sieved list to our old primes, and return that.
Next time we invoke the function, primes will still have the new primes. You can access them from outside the function at sieve2.func_defaults[0].

Evaluating sieve2(100) and then sieve2(1000) will be slower than just evaluating sieve1(1000), but it will be faster than evaluating sieve1(100) and then sieve1(1000). And if you happen to have a large number of pregenerated primes, you can save some more time.

The sieve of Eratosthenes is just one of a number of sieve-based prime number generators, but it’s the oldest one, and it compares favorably to modern improvements.

Permalink 3 Comments