On Hash Functions
Cryptography is becoming increasingly important in our daily lives, but there’s still a lot of confusion over even very basic cryptographic concepts in the media, so I thought I’d write a bit about some general cryptography-related topics for a bit.
This may be the first post in a series, or I may get tired of it after this one. We’ll see.
Hash functions have lots of uses beyond cryptography, making it particularly important for people to understand. The media generally seem fond of confusing them with encryption, which they aren’t.
A hash function is any function that takes data as input and converts this input into fixed-length string, or hash. They differ from encryption in that they are meant to be irreversible, while encryption is always meant to be reversed.
MD5
An example, maybe.
A popular hashing algorithm on the internets is MD5. It was developed by Ronald Rivest (who did some other interesting things, as well; he is the R in the popular asymmetric cipher RSA, for instance), and was the fifth in his Message Digest series of algorithms (hence the name; message digest is sometimes used as a synonym for hash function for reasons we’ll get to). It takes input of any length, and outputs a 128-bit hash.
If the input is MD5 is an algorithm developed by Ronald Rivest., the output will be f2bb0b1298911d71ed56e6b548032e5e, which is 32 characters long. Each character is a hexadecimal digit representing a 4-bit binary number, so that comes to 128 bits.
If we change one character of the input string so that it becomes MD4 is an algorithm developed by Ronald Rivest. (which is also true), the hash becomes e575262f48922eedf0bffabe8a29e433 (which, you’ll note, is a completely different hash, due to a cascade effect; MD5 works bitwise, so you could say a character difference in the input string is actually eight bits differences, but in ASCII, 4 and 5 are only a bit apart).
If you want to know how MD5 works, you can find a full description of the algorithm here.
Uses
There are tons of uses for hashes.
In cryptography, they’re often used to guarantee the integrity of a message, to ensure it hasn’t been tampered with. The sender sends a message (encrypted or otherwise) and a hash of the message (along another channel, obviously), and the receiver can apply the same hash function to the message and compare hashes to make sure nobody’s changed the message.
Another cryptographic use is storing passwords in a database. If you store passwords as plaintext, and an attacker somehow gains access to that database, all accounts are, of course, compromised. Not only that, since users often use the same password for different applications, the attacker could potentially fuck them over much more severely.
If the passwords are hashed, though, they’re useless to the attacker (unless the hashes are used in a retarded way, of course; for instance, if a web application sets a cookie containing username and password hash upon login, and uses that cookie for authentication, the accounts are just as compromised as they would have been if the passwords had been stored as plaintext).
And there’s no reason not to store them as hashes, of course; when the user logs in, the server can just hash the password the user sent and compare it to the hash in the database.
A popular non-cryptographic use for hash functions is to guarantee data integrity. This is related to the first use I described, of course, except that the only thing that messed with the message, in this case, would be a lossy network.
If you’ve ever downloaded a Linux OS or similar things that come as large .iso files, you’ll have noticed many of them come with hashes. This is so you can check the .iso wasn’t corrupted in transit, simply by applying a hash function to the entire .iso and comparing it to the hash that came with it. You’d notice if it was corrupted if you tried to burn it to a CD, of course, but you’d also waste a CD in the process.
It’s possible the hash is corrupted as well, of course, but much less likely, since the hash will be much, much smaller.
Hashes in this context are also called checksums.
And the use that’s probably most familiar to programmers: hash tables. This is a specialised use that takes a while to explain and won’t be interesting to anyone who doesn’t already know what it is, probably, so I’m just going to refer to Wikipedia.
It’s not very relevant to cryptographic hashes anyway.
Attacks
The fact that hash functions output a fixed-length string is what makes them irreversible (since most inputs are longer than the output, some information is inevitably destroyed; if they’re shorter, padding is added and there’s no true way of telling which information is padding and which was original), but it also makes them theoretically weaker.
If you’ve heard about hash functions in the media, maybe you’ve heard about collisions. This is when two different inputs produce the same output. Note that this is always possible, since a good hash function can accept input of any length (and thus infinite variety), but can only output a finite number of hashes (even though in practice, that number is big enough that collisions aren’t likely to occur anyway; for MD5, for instance, it’s 2128, or 340 sextillion (long scale; 3.4 * 1038)).
This hints at one attack against hash functions: looking for collisions.
This isn’t just going through zillions of inputs looking for ones that produce the same hash, of course; generally you’d look for a weakness in the algorithm’s structure that makes it more likely to output some hashes than others.
In fact, MD5 is considered cracked because it’s relatively easy to find two inputs that produce the same output, even if it’s not quite possible to reverse an MD5 hash into its (likely) input at this point. “Cracked” is a relative term, in cryptography, and for most purposes, it’s still quite alright to use MD5.
Another attack against hash functions are so-called rainbow tables.
What this essentially boils down to is that some guy spent some (well, more than some, usually) time in advance generating strings, hashing them, and putting the string/hash pair in a database somewhere (there are some additional steps here that distinguish rainbow tables from “regular” lookup tables, but they’re mostly optimisations that don’t change the nature of the thing; Wikipedia has more information, if you’re interested).
Then, all it takes to “reverse” a hash is to look it up in the database.
Obviously, this only works if the database is massive, typically on the order of several terabytes. Rainbow tables are essentially a brute-force attack1, except that the brute-forcing is done in advance, greatly reducing the effort involved in cracking more than one hash.
Both of these attacks lose much of their bite if salting is used. Salting is essentially appending (or prepending) a secret string (the salt) to the input string before it’s hashed.
For instance, if your input string is “penis“, the string you would actually hash would be “penislol” (if your salt is “lol“). Even if the attacker could find “penislol” (which isn’t that likely; he’d be more likely to find any of the infinite other inputs that produce the same hash), he has no way of knowing where the original string stops and where the salt begins. Of course, if he had more than one hash using the same salt, he might be able to figure out the salt.
This is a weaksauce example, but in practice, a good salt can make your hashes assloads more secure.
So yes
Muffins! uses hashes for authentication. The user’s password is stored in the database as a salted MD52. When the user logs in, he sends his password to the server, which salts and hashes it and compares it to the stored hash. If it matches, the server generates a SHA-13 hash based on the MD5 hash and salted again with a random number (in the range 0-RAND_MAX); this is used as a type of session token, and is stored in the database and sent back to the user for use in a cookie. For the rest of the session, it compares the token in the cookie to the token in the database (and several other factors). A token is only valid for the length of the session, and only from the IP the session was started from.
It’s not epically secure, but it’s secure enough for a game like Muffins!, and there’s no chance of the user’s password being exposed (except through packet sniffing during the actual login, but that’s only because I don’t have the ability to implement TLS on our server).
So now you (more or less) know about hash functions. Was this interesting?
1 A brute-force attack is simply one that goes through all possible options until it hits one that works. It’s quite popular with script kiddies, but for some algorithms, it can be the only one that’s feasible. For instance, asymmetric cryptography depends on the fact that it’s hard to factor large numbers. Even though there are plenty of ways to factor numbers (many of which even a child could come up with), factoring the numbers involved would take more time than brute-forcing it would (quantum computing and Shor’s algorithm aside).
2 So no, I can’t see your password. This is also why I can’t just send you your password if you forgot it; instead, a new password has to be generated randomly and e-mailed to you. I’d advise you not to forget your password, since e-mail is still less secure than just writing your information on a postcard and giving it to a random stranger to mail, especially if the stranger looks somewhat trustworthy.
3 Secure Hash 1 is the “other” popular hashing algorithm on the internet. It was developed by the NSA, and it’s now also considered cracked by the cryptographic community. As always, though, “cracked” is a relative term, and it’s still quite safe to use. I used it because it and MD5 are the only two cryptographic hash functions included in vanilla PHP (as md5() and sha1(); there’s one other hash function, crc32(), but it’s intended purely as a checksum function), and they’re still very adequate.
Terras said,
November 20th, 2007 at 6:47 am
I agree.
That’s interesting.
echomikeromeo said,
November 20th, 2007 at 6:57 am
Cool, I think I get it.
And I *love* your use of footnotes. Actually, that might be more interesting than cryptography.
Cairnarvon said,
November 20th, 2007 at 9:23 am
I should write a WordPress plugin to make footnoting easier. Messing around with anchor tags and IDs gets old quickly.
echomikeromeo said,
November 20th, 2007 at 11:26 pm
I concur.