Rosio Pavoris a blog

Forced indentation of cryptanalysis

Alright, so it’s about programming again.
I’ve been meaning to write some tools to help me encrypt text using some classical ciphers, and also to break those ciphers. I’ve also been meaning to using Python more, because holy fuck Python is awesome.
Maybe it’s just the fact that I only really have any experience with seriously shit languages (Java, PHP, COBOL; I’ve used other languages, some of which didn’t suck, but never to the extent I’ve been forced to use those three), but after I wrote a Python script to deal with that ball algorithm in a third of the time it took me to write it in Java, which is probably the language I’ve used the most in my life, I redid some of the Project Euler problems, which I’d previously also mostly done in Java (some in Scheme), and I was amazed how much less painful it was. If this puts me on the same level as Randall Munroe I apologise.

Anyway.
The obvious place to start was with the Caesar cipher, which is the one where you take each letter and move it ahead X places in the alphabet. ROT13 is a Caesar cipher with the key set to 13.
There are two obvious attacks here: one is frequency analysis, where you count the number of letters and compare it to a chart of letter frequencies (the most common letter in the ciphertext is likely to be E if it’s an English (or Dutch, for that matter) text, for example); the other is bruteforcing, where you just go through all possible keys. This is possible because there are really only 25 possible keys, plus one (26, or 0) which produces a ciphertext identical to the plaintext.

Frequency analysis seemed overkill, since bruteforcing is so easy and much more fool-proof, especially for short ciphertexts, so I went with an algorithm that encrypts the ciphertexts with all 26 possible keys, and uses a dictionary file to count the number of dictionary words in the resulting texts. The one with the highest number of dictionary words is likely to be the plaintext, so that’s then returned.
You can find the code here.

A sample session (text in bold is entered by the user):

$ ./caesar.py
Phrase to encrypt? All that we see or seem is but a dream within a dream.
Shift by? 12
Encrypted: Mxx ftmf iq eqq ad eqqy ue ngf m pdqmy iuftuz m pdqmy.
Decrypting...
Decrypted: All that we see or seem is but a dream within a dream.
Success!

It’s slow as fuck because of the way it uses the dictionary file, of course (but that shouldn’t be too hard to optimise), and because the decrypting function doesn’t take punctuation into account it won’t count words that might actually be in the dictionary, which can be a problem for really short sentences with a lot of punctuation:

$ ./caesar.py
Phrase to encrypt? A ghost!
Shift by? 11
Encrypted: L rszde!
Decrypting...
Decrypted: L rszde!
Fail!

It also doesn’t work if punctuation and—more importantly—spaces have been stripped, though, as was usual for real-life applications of this cipher, and that’s more of a problem.
The easiest way around this is to let the user identify which the decrypted string is, of course, or you could look at random parts of the text to try to guess where words end and see if anything matches a dictionary word, but that’s a lot of work. It may be easier to solve this using frequency analysis.

Like I said, frequency analysis counts the number of letters in a word and guesses what they’re likely to map to based on how often they normally appear in a text written in the same language as the plaintext.
This sort of attack can break not just Caesar ciphers, but any simple substitution cipher, though for a polygraphic one (that is, one that replaces groups of letters rather than individual letters), you’ll need a polygraph frequency chart, of course. It doesn’t do too well with polyalphabetic substitution ciphers (where a plaintext glyph can map to multiple ciphertext ones), but given enough ciphertext, it can break Vigenère ciphers as well.

I wrote a function that does just this: it counts the number of times each letter appears in a ciphertext (ignoring punctuation and spacing, if any), and then naively replaces the letters in the ciphertext based on a letter frequency chart.
It works a lot faster than having to look up words in a dictionary, but the main problem seems to be finding a good letter frequency chart. ETAOIN SHRDLU is a famous sequence, of course, and if you look at the source you’ll see I used that first. My ciphertext was this speech by Karl Popper, ROT13′d (because it was the first large block of text that I could find that had been written in modern English).

With Linotype’s ETAOIN SHRDLU, the “decrypted” version started off like this:

Dr. Rujior, Dr. Mufg, Lfmtuz fgm Nugiludug

Io ipu 3rm Afjeliq oa Dumtjtgu oa ipu Jpfrluz Egtvurztiq fgm tiz Mufg, Hroauzzor Pouzjpl, fgm fll oa qoe kpo pfvu jodu puru iomfq, T ktzp io uxhruzz dq muuhlq auli ipfgwz aor ipu nrufi pogoer kptjp qoe fru jogaurrtgn og du.

Then I tried the one used by Samuel F. B. Morse to determine which letters should map to the least amount of dots and dashes in his code: E IT SAN HURDM WGVLFBK OPJXCZ YG.
This was the result:

Kr. Ruohlr, Kr. Dufv, Wfdsuc fvd Guvhwukuv

Hl heu 3rd Bfoawhz lb Kudsosvu lb heu Oefrwuc Avsnurcshz fvd shc Dufv, Mrlbucclr Elucoew, fvd fww lb zla iel efnu olku euru hldfz, S isce hl ujmrucc kz duumwz buwh hefvxc blr heu grufh elvlar iesoe zla fru olvburrsvg lv ku.

And finally, the one given by Robert Edward Lewand in Cryptographical Mathematics:

Dr. Rutlor, Dr. Wufg, Jfwiuc fgw Nugljudug

Lo lhu 3rw Aftejlq oa Duwitigu oa lhu Thfrjuc Egivurcilq fgw ilc Wufg, Proauccor Houcthj, fgw fjj oa qoe mho hfvu todu huru lowfq, I mich lo uxprucc dq wuupjq aujl lhfgkc aor lhu nrufl hogoer mhith qoe fru togaurrign og du.

In the first case, it got 47 letters right (not counting how many letters out of the 26 it matched up right, just counting how many letters in our “decrypted” paragraph match the original, ignoring spaces, punctuation, and numbers (out of 209)) and managed to turn “honour” into “pogoer”.
The second only got 25 right. For shame, Mr. Morse.
The third got a whopping 67 right, though “hogoer” isn’t really a word.

Unsurprisingly, the cryptography frequency chart came closest, but none of them did particularly well. Presumably they’d do better with more text, but it still surprised me.
It’s particularly surprising how few vowels they got right: both ETAOIN SHRDLU and the cryptography one got O, and the cryptography one also got I. And that’s it. Morse didn’t get a single one, and none of them got A, E, or U.
I suspect the “decrypted” text would be much easier to use as a starting point for a human code breaker if more vowels had been found. Maybe Popper is just really atypical, I don’t know.

Either way, this is just a taste of what I’ve been doing since my last exam on Friday. More to come, I’m sure.

Permalink Comments

In which /prog/ makes me feel slightly better about having a French exam tomorrow

                                 ,
                             ____|____
                          .-`         `.
                       .-`              `.
                    .-`                 .`
    '-._         .-`             _______`
        `.__     `-_________---`` ,-.`-,         AVEZ-VOUS LU
            `''-------'          ( p )  `._       VOTRE SIPI
                                  `-'      \     AUJOURD'HUI?
                                     `\   ,-\
                                  .    ```  \
                                   \---..,--'
       ................._           --...--,
                         `-.._         _.-'
                              `'-----''

I’ve probably been complaining about how ridiculous it is that I even have a French exam too much lately, but that doesn’t mean it isn’t true.

I’ve also been posting too much about programming lately, but most of the sciencey things I’ve been dealing with lately have been about the evolution of sex and sexual selection, and too much of that requires too much background, or doesn’t have straightforward answers that can be summed up in two or three sentences, so a blog isn’t a terribly good medium for discussing it.
I’ll see if I can think of something interesting, though, but it’ll have to wait until after tomorrow (French is my last exam) either way.

Also, today is an anniversary, but it’d surprise me if anyone remembered which one.

Permalink 2 Comments

Hurf durf Java

This surprised me, mostly because I didn’t even know Java had labels. Turns out their only purpose is to label loops so you can break and continue outer loops from within inner loops, like so:

hurf:
    for (int i = 0; i < 10; i++) {
        for (int j = 0; j < 10; j++) {
            System.out.print("!");
            if (j == 4)
                break hurf;
        }
    }

This will print five exclamation marks rather than fifty, because the outer loop will only run once before being interrupted from the inner one.

The guy’s next post had more “quirks”, including one involving auto-boxing which I’d actually run in to myself a few months ago.
To understand it, you’ll need some background.

If you’ve read this far, you presumably already know that Java, rather than being a real OO language, wusses out and has some primitives, like int, char, boolean, and so on. The developers, being aware of how Smalltalk has primitives that are objects but not being bright enough to come up with something like that themselves, instead opted to create a slew of classes that wrapped around each of these primitives, presumably because they were afraid that real OO languages would beat them up and take their lunch money otherwise.
The result is that the Java standard library also has an Integer class that wraps an int, a Character class that wraps a char, a Boolean class that wraps a boolean, and so on. Some of these classes make an attempt to provide useful functionality (Character can tell whether its char is uppercase or lowercase, for instance, and can convert between those and titlecase and the like), but mostly they’re a pain in the ass to work with because every time you wanted to do something with the primitive values in them you had to retrieve it with a method, do what you wanted to do, and then (if you changed it) update it with another method.

So Java 1.5.0 (or Java 5, if you prefer) added auto-boxing and auto-unboxing, which made this a bit easier while also moving them further away from having a real OO language.
What auto-unboxing did was allow the programmer to use instances of the primitive wrapper classes to be used as if they were the primitives they wrap in certain circumstances. Auto-boxing, on the other hand, could automatically wrap primitives in instances of their respective classes. It became possible to do things like this:

Character c = new Char('c');
char d = 'd';
c = d; // Auto-boxing of d

Double f = 3.14; // Auto-boxing of primitive 3.14
f -= new Float(2.324f); // Auto-unboxing of f and the new Float, and auto-
                        // boxing of the result of the subtraction.

And, of course, it became possible to compare Integers as if they were ints, which was his example:

Integer n = 128;
Integer m = 128;
assert n <= m; // This works.1
assert n >= m; // This works.
assert n == m; // This fails.

The first two asserts work as you’d expect, because the variables are both being unboxed automatically and compared as if they were primitive ints. The third, however, does not, because == has meaning for objects, and if auto-unboxing worked here, that would break backward compatibility. <, <=, >=, and > comparisons all auto-unbox because those operators don’t work on regular objects (because that crosses some imaginary line at which it becomes too low-level).
As you may or may not know, non-primitive variables in Java are actually pointers (well, references) to objects, which are all passed around by reference, so what == does is compare those pointers to find out if they point to the same memory address. In this case, m and n do not, even though they are otherwise identical objects.2

All well and good. A possible pitfall, perhaps, but at least it’s clear why it behaves the way it does, and if you’re going to have auto-unboxing this is probably the way it should work.
However, this is where they go off the deep end:

If the value p being boxed is true, false, a byte, a char in the range \u0000 to \u007f, or an int or short number between -128 and 127, then let r1 and r2 be the results of any two boxing conversions of p. It is always the case that r1 == r2.

Why only for those values? Who the fuck knows~

Just to demonstrate that it’s not just PHP that seems to suffer from ad-hoc brain damage.3


1 assert is a keyword that throws an exception when the condition following it is false. It’s used for debugging, but it’s not as popular as it is in the languages it was stolen from; I didn’t even know it existed until a few weeks ago myself.

2 The difference between reference equality (which is tested through ==) and object equality (which is tested through the equals() method), which was actually a question on my Java exam for the third semester.

3 Which reminds me of another thing I saw a while ago. The following is valid PHP code and will do what you’d expect it to:

System.out.print("Failsauce");

Why? It becomes more obvious when you turn error reporting up to the maximum:

Notice: Use of undefined constant System - assumed 'System' in Command line code on line 1
Notice: Use of undefined constant out - assumed 'out' in Command line code on line 1

PHP treats all unknown identifiers as constants, and all undefined constants as if their contents were the same as their name. Remember also that the period is the concatenation operator.
What this statement is doing is basically concatenating the string “System” to the string “out”, and that to the return value of print("Failsauce"), which is a function that returns nothing but has side-effects (specifically, it prints the argument). It then takes this string (“Systemout”) and does nothing with it.

Isn’t PHP grand?

Permalink Comments

ADD 1 TO COBOL GIVING CANCER

You wouldn’t believe how glad I am never to have to touch COBOL ever again.
To give you some idea of the cancerousness of the language, this is part of what I wrote on my exam today, as pieced together from my notes:

Read the rest of this entry »

Permalink 2 Comments

JAEV

I know programmers can expect to maintain at least one cancerously designed codebase in their careers (especially Java programmers), but it seems to me a class trying to get across the principles of black-box abstraction and encapsulation is a relatively horrible place to prepare people for that.
I guess I could see the point if it was used to discuss exactly what was wrong with it and how to avoid making those kinds of empty-headed mistakes, but apparently the teacher (and I use that word in the broadest possible sense) is under the impression it’s actually good code.
In other words: my Java exam was painful.

Also, people who feel threatened by students knowing a fuck of a lot more about their chosen field than they do shouldn’t ask broad theory questions on their exams.
Though those were really the only questions that weren’t a total pain to answer. Even though one of them was about design patterns.

(The fact that he insists on “syntactically correct” UML diagrams without apparently having a clue what they are didn’t help.)

Permalink Comments

Mojave is an admission of defeat

This is a few weeks old at this point, but people won’t shut up about it.
The Mojave Experiment is a marketing stunt by Microsoft that involves taking people off the street and presenting them with an OS called “Mojave” and asking for their opinion on it. Afterwards, it is revealed that this OS is actually Windows Vista.
Anyone with half a brain can tell you why this is a bullshit way of doing things, but unfortunately Microsoft fanboys (which do exist, for some reason) don’t tend to be in possession of even that much.

The obvious problem is that installing Vista on powerful hardware and putting people in front of it for five minutes isn’t the same thing as making people pay for that hardware (and for the OS), and then having them use it for a few weeks. What’s being evaluated here is the at-a-glance flashy wank, not the actual OS.
Yes, Vista is rather pretty, given the expensive hardware required to run it. We already knew that.

The test subjects weren’t, however, given a chance to see how it interacts with hardware that isn’t carefully selected to be compatible with it, or to use software not explicitly designed for Vista. They didn’t even get to touch the computers themselves; everything was demonstrated by a salesman.

And of course, Microsoft states that these people had never seen Vista in action before. How many of you haven’t seen a computer running Vista by now? None of my machines run any version of Windows and everyone I interact with IRL runs Linux, and even I have seen Vista machines.
Nobody who knows enough about computers to be in any position to judge an operating system won’t have seen Vista well over a year and a half after its release.

Anyway, all of this is obvious and can, indeed, also be learned from the Wikipedia article. Even if the “experiment” didn’t suffer from these flaws, though, here’s why it would still be bullshit:

The operating system market is very much a lemons market. That is to say, it’s not immediately obvious to the purchaser (the user) whether or not a product is any good, and it may not become clear until quite some time after the time of purchase if it isn’t.

So the consumer has to rely on market signals to tell good products from bad ones: quality labels (none to be had, since Microsoft considers itself a standard unto itself), money-back guarantees (nope; at least, not for users who already accepted the EULA, which is a requirement for using it), third-party reviews, &c..
Third-party reviews are pretty much the most important market signal here, be it from professionals or even just from friends who’ve tried it. Of course, they’re almost universally negative.

This marketing campaign neatly gets around that by removing the critical, informed voices and instead bestowing authority on clueless laypersons who’ve been force-fed their opinions by slick marketroids.
What Microsoft is doing with the Mojave Experiment is admitting that Vista cannot compete without subverting market signals and suckering inexperienced users into buying their crap.

You’d wonder why they even bother, since their stranglehold on the OEMs means that it’s next to impossible to get a PC without Vista preinstalled anyway1, and anyone stupid enough to fall for this sort of thing isn’t going to be installing Vista on his XP machine himself.

Though if he does, of course, he’ll soon become a market signal himself.


1 Which, incidentally, is the only reason Vista adoption rates are as high as they are, but only among household consumers. If you look at businesses using Windows, where the people making these decisions are generally slightly more informed, XP still dominates.

Permalink Comments

Lessons from America

Alternative title: democracy is wasted on the ignorant.

Today I learned that if you don’t really like a candidate, you have no choice but to endorse his opponent, even if his political views are diametrically opposed to yours. Even if he’s a hypocritical bigot with anger management issues. Even if he would take away the rights your ancestors fought and died for. Even if he would kill your family and friends given half a chance. Even if he would invade Poland.
If you don’t really like the other guy (regardless, of course, of that guy’s political positions), you have to endorse him.

I also learned that the US is the most powerful and influential country in the world, whose policies, both foreign and domestic, impact the lives of billions of people, but its presidential elections don’t concern people who don’t live within its borders.
If you’re wondering why even many people who haven’t spent most of their life living under the rule of tyrants instated by the US or being terrorised by American or American-funded occupiers don’t much like the US, I think you might find one or two hints here.

(Speaking of which, I paid our hosting bill today, which, because our server is in Michigan, was in dollars. After conversion, it came to 75 €. Last year, it was 88 €. While that’s a nice side-effect of the comical incompetence of the self-proclaimed “fiscal conservatives” when it comes to running an economy, US recessions tend to cross the ocean as well, which isn’t good for my investments.
While the US is working very hard to dismantle its influence on the global stage, everything that goes on there still quite directly affects me, and given the circumstances much of it affects me more directly than it does many American citizens. So quit telling me US politics are none of my business just because you don’t know enough about them to defend your own ill-considered positions, hypothetical Republican/Libertarian reader.)

Permalink 5 Comments