Forced indentation of cryptanalysis
Alright, so it’s about programming again.
I’ve been meaning to write some tools to help me encrypt text using some classical ciphers, and also to break those ciphers. I’ve also been meaning to using Python more, because holy fuck Python is awesome.
Maybe it’s just the fact that I only really have any experience with seriously shit languages (Java, PHP, COBOL; I’ve used other languages, some of which didn’t suck, but never to the extent I’ve been forced to use those three), but after I wrote a Python script to deal with that ball algorithm in a third of the time it took me to write it in Java, which is probably the language I’ve used the most in my life, I redid some of the Project Euler problems, which I’d previously also mostly done in Java (some in Scheme), and I was amazed how much less painful it was. If this puts me on the same level as Randall Munroe I apologise.
Anyway.
The obvious place to start was with the Caesar cipher, which is the one where you take each letter and move it ahead X places in the alphabet. ROT13 is a Caesar cipher with the key set to 13.
There are two obvious attacks here: one is frequency analysis, where you count the number of letters and compare it to a chart of letter frequencies (the most common letter in the ciphertext is likely to be E if it’s an English (or Dutch, for that matter) text, for example); the other is bruteforcing, where you just go through all possible keys. This is possible because there are really only 25 possible keys, plus one (26, or 0) which produces a ciphertext identical to the plaintext.
Frequency analysis seemed overkill, since bruteforcing is so easy and much more fool-proof, especially for short ciphertexts, so I went with an algorithm that encrypts the ciphertexts with all 26 possible keys, and uses a dictionary file to count the number of dictionary words in the resulting texts. The one with the highest number of dictionary words is likely to be the plaintext, so that’s then returned.
You can find the code here.
A sample session (text in bold is entered by the user):
$ ./caesar.py
Phrase to encrypt? All that we see or seem is but a dream within a dream.
Shift by? 12
Encrypted: Mxx ftmf iq eqq ad eqqy ue ngf m pdqmy iuftuz m pdqmy.
Decrypting...
Decrypted: All that we see or seem is but a dream within a dream.
Success!
It’s slow as fuck because of the way it uses the dictionary file, of course (but that shouldn’t be too hard to optimise), and because the decrypting function doesn’t take punctuation into account it won’t count words that might actually be in the dictionary, which can be a problem for really short sentences with a lot of punctuation:
$ ./caesar.py
Phrase to encrypt? A ghost!
Shift by? 11
Encrypted: L rszde!
Decrypting...
Decrypted: L rszde!
Fail!
It also doesn’t work if punctuation and—more importantly—spaces have been stripped, though, as was usual for real-life applications of this cipher, and that’s more of a problem.
The easiest way around this is to let the user identify which the decrypted string is, of course, or you could look at random parts of the text to try to guess where words end and see if anything matches a dictionary word, but that’s a lot of work. It may be easier to solve this using frequency analysis.
Like I said, frequency analysis counts the number of letters in a word and guesses what they’re likely to map to based on how often they normally appear in a text written in the same language as the plaintext.
This sort of attack can break not just Caesar ciphers, but any simple substitution cipher, though for a polygraphic one (that is, one that replaces groups of letters rather than individual letters), you’ll need a polygraph frequency chart, of course. It doesn’t do too well with polyalphabetic substitution ciphers (where a plaintext glyph can map to multiple ciphertext ones), but given enough ciphertext, it can break Vigenère ciphers as well.
I wrote a function that does just this: it counts the number of times each letter appears in a ciphertext (ignoring punctuation and spacing, if any), and then naively replaces the letters in the ciphertext based on a letter frequency chart.
It works a lot faster than having to look up words in a dictionary, but the main problem seems to be finding a good letter frequency chart. ETAOIN SHRDLU is a famous sequence, of course, and if you look at the source you’ll see I used that first. My ciphertext was this speech by Karl Popper, ROT13′d (because it was the first large block of text that I could find that had been written in modern English).
With Linotype’s ETAOIN SHRDLU, the “decrypted” version started off like this:
Dr. Rujior, Dr. Mufg, Lfmtuz fgm Nugiludug
Io ipu 3rm Afjeliq oa Dumtjtgu oa ipu Jpfrluz Egtvurztiq fgm tiz Mufg, Hroauzzor Pouzjpl, fgm fll oa qoe kpo pfvu jodu puru iomfq, T ktzp io uxhruzz dq muuhlq auli ipfgwz aor ipu nrufi pogoer kptjp qoe fru jogaurrtgn og du.
Then I tried the one used by Samuel F. B. Morse to determine which letters should map to the least amount of dots and dashes in his code: E IT SAN HURDM WGVLFBK OPJXCZ YG.
This was the result:
Kr. Ruohlr, Kr. Dufv, Wfdsuc fvd Guvhwukuv
Hl heu 3rd Bfoawhz lb Kudsosvu lb heu Oefrwuc Avsnurcshz fvd shc Dufv, Mrlbucclr Elucoew, fvd fww lb zla iel efnu olku euru hldfz, S isce hl ujmrucc kz duumwz buwh hefvxc blr heu grufh elvlar iesoe zla fru olvburrsvg lv ku.
And finally, the one given by Robert Edward Lewand in Cryptographical Mathematics:
Dr. Rutlor, Dr. Wufg, Jfwiuc fgw Nugljudug
Lo lhu 3rw Aftejlq oa Duwitigu oa lhu Thfrjuc Egivurcilq fgw ilc Wufg, Proauccor Houcthj, fgw fjj oa qoe mho hfvu todu huru lowfq, I mich lo uxprucc dq wuupjq aujl lhfgkc aor lhu nrufl hogoer mhith qoe fru togaurrign og du.
In the first case, it got 47 letters right (not counting how many letters out of the 26 it matched up right, just counting how many letters in our “decrypted” paragraph match the original, ignoring spaces, punctuation, and numbers (out of 209)) and managed to turn “honour” into “pogoer”.
The second only got 25 right. For shame, Mr. Morse.
The third got a whopping 67 right, though “hogoer” isn’t really a word.
Unsurprisingly, the cryptography frequency chart came closest, but none of them did particularly well. Presumably they’d do better with more text, but it still surprised me.
It’s particularly surprising how few vowels they got right: both ETAOIN SHRDLU and the cryptography one got O, and the cryptography one also got I. And that’s it. Morse didn’t get a single one, and none of them got A, E, or U.
I suspect the “decrypted” text would be much easier to use as a starting point for a human code breaker if more vowels had been found. Maybe Popper is just really atypical, I don’t know.
Either way, this is just a taste of what I’ve been doing since my last exam on Friday. More to come, I’m sure.
Post a Comment
RSS feed for comments on this post · TrackBack URL