From Real-World Cryptography by David Wong
This article explores hash functions: what they are and how they are used to increase software security.
In front of you a download button is taking a good chunk of the page. You can read the letters DOWNLOAD, and clicking on them seems to redirect you to a different website containing the file. Below it lies a long string of unintelligible letters:
This is followed by what looks like an acronym of some sort:
sha256sum. Sounds familiar? You’ve probably downloaded something in your past life that was also accompanied with such an odd string.
If you’ve ever wondered what was to be done with it:
- Download the file.
- Use the SHA-256 algorithm to hash the file.
- Compare the output (the digest) with the long string displayed on the webpage.
NOTE: The output of a hash function is often named a digest or a hash. I use the two words interchangeably throughout this article. Note that others might call it a checksum or a sum, which I avoid as it’s primarily used by non-cryptographic hash functions and could lead to more confusion. Keep that in mind when different codebases or documents use different terms.
On MacOS for example, this can be done by opening the terminal and writing the following line:
$ shasum -a 256 downloaded_file
We used SHA-256 (a hash function) to transform the input (the file) into a unique identifier (the digest). What do these extra steps provide? Integrity and Authenticity. It tells you that your downloaded file is the file you were meant to download.
We can verify the authenticity of our file thanks to a security property of the hash function called second pre-image resistance. This math-inspired term means that from the long output of the hash function
f63e…, you can’t find another file that “hashes” to the same output. In practice it means that this digest is closely tied to the file you’re downloading, and no attacker should be able to fool you by giving you a different file.
NOTE: By the way, the long output string
f63e… represents binary data displayed in hexadecimal (a base 16 encoding using numbers from
9 and letters from
f to represent several bits of data). We could have displayed the binary data with 0s and 1s (base 2) but it takes more space.
Instead, the hexadecimal encoding allows us to write two alphanumeric characters for every 8 bits (1 byte) encountered. It’s somewhat readable by humans and takes less space. Other ways can be used to encode binary data, but the two most widely used encodings are hexadecimal and base64. The larger the base, the less space it takes to display a binary string, but at some point, we run out of human-readable characters 🙂
Note that this long digest is controlled by the owner of the webpage, and it could easily be replaced by anyone who controls this page. (If you aren’t convinced about this take a moment to think about it.) This means that we need to trust the page that gave us the digest, its owners and the mechanism used to retrieve the page (although we don’t need to trust the page that gave us the file we downloaded). In this sense, the hash function alone doesn’t provide integrity. The integrity and authenticity of the downloaded file comes from the digest combined with the trusted mechanism which gave us the digest (HTTPS in this case, imagine that it magically allows you to communicate securely with a website).
Back to our “hash function”, which can be visualized as a black box in figure 2 which takes a single input, and gives out a single output.
The input of this function can be of any size. It can even be empty.
The output is always of the same length and deterministic: it always produces the same result if given the same input. In our example, SHA-256 always provides an output of 256 bits (32 bytes), which is always encoded as 64 alphanumeric characters in hexadecimal. One major property of a hash function is that you can’t revert the algorithm, meaning that you shouldn’t be able to find the input from only the output. We say that hash functions are oneway.
To illustrate how a hash function work in practice, we hash different inputs with SHA-256 in pseudo-code:
// hashing the same input produces the same result
// a tiny change in the input completely changes the output
// the output is always of the same size, no matter the input size SHA-256("this is a very very very very very very very very very very very long sentence") -> 009e286e0261a8d0eca95649cf795db3572c515fe9dc7e319ece5af8f133637a
Security Properties of a Hash Function
Hash functions in applied cryptography are constructions which were commonly defined to provide three specific security properties. This definition has changed over time, as we’ll see in the next sections. For now, let’s define the three foundations.
The first one is pre-image resistance. This property ensures that no one can reverse the hash function in order to, given an output, recover the input. In figure 3 we illustrate this one-wayness by imagining that our hash function is like a blender, making it impossible to recover the ingredients from the produced smoothie.
The second one is second pre-image resistance. We’ve already seen this security property when we wanted to protect the integrity of a file. The property says the following: if I give you an input and the digest it hashes to, you should be unable to find a different input that hashes to the same digest. This is illustrated in figure 4.
Note that we don’t control the first input, this emphasis is important for the third security property.
Finally, the third property is collision-resistance. It guarantees that no one can produce two different inputs that hash to the same output (as seen in figure 5). Here an attacker can choose the two inputs, unlike the previous property that fixes one of the inputs.
In addition, hash functions are usually designed such that their digests are unpredictable and random. This is useful because one can’t always prove a protocol to be secure thanks to one of the security properties of a hash function we’ve talked about (like collision resistance for example). Many protocols are instead proven in the random oracle model where a fictive and ideal participant called a random oracle is used.
In this type of protocol, one can send any inputs as requests to that random oracle which is said to return completely random outputs in response, and like a hash functions, giving it the same input twice returns the same output twice. Proofs in this model are somewhat controversial, as we don’t know for sure if we can replace these random oracles with real hash functions in practice. Yet, many legitimate protocols are proven secure using this method where hash functions are seen as more ideal than they probably are.
Security Considerations for Hash Functions
We’ve seen three security properties of a hash function:
- pre-image resistance
- second pre-image resistance
These security properties are often meaningless on their own, and they depend on how you make use of the hash function. Nonetheless, it’s important that we understand some limitations here before we look at some of the real-world hash functions.
First, these security properties assume that you’re reasonably using the hash function. Imagine that I either hash the word “yes” or the word “no” and then publish the digest. If you have some idea of what I was doing, you can hash both of the words and compare the result with what I gave you. Because there are no secrets involved, and because the hashing algorithm we’ve used is public, you’re free to do that. And indeed, one could think this breaks the pre-image resistance of the hash function, but we’ll argue that your input wasn’t “random” enough. Furthermore, because a hash function accepts an arbitrary-length input and always produces an output of the same length, there are also an infinite number of inputs that hash to the same output.
Second, the size of the parameters matters. This isn’t a peculiarity of hash functions by any mean, all cryptographic algorithms must care about the size of their parameters in practice. Let’s imagine the following extreme example, we have a hash function which produces outputs of length two bits in a uniformly random fashion (meaning that it outputs
00 25% of the time,
01 25% of the time, etc.) You don’t need to do too much work to produce a collision: after hashing a few random input strings you should be able to find two that hash to the same output. For this reason, there’s a minimum output size which a hash function must produce in practice: 256 bits (or 32 bytes). With this large an output, collisions should be out of reach unless a breakthrough happens in computing.
How was this number obtained? In real world cryptography, algorithms aim for a minimum of 128 bits of security. It means that an attacker who wants to break an algorithm (providing 128-bit security) would have to perform around 2128 operations (for example, trying all the possible input strings of length 128-bit would take 2128 operations). For a hash function to provide all three security properties we mentioned earlier, it needs to provide at least 128 bits of security against all three attacks. The easiest attack is usually to find collisions, due to the birthday bound.
The birthday bound takes its roots from probability theory, in which the birthday problem reveals some unintuitive results: how many people do you need in a room when there’s at least a 50% chance two people share the same birthday (this is a collision). It turns out that twenty-three people taken at random are enough to reach these odds. In practice, when we are randomly generating strings from a space of 2N possibilities, you can expect someone to find a collision with 50% chance after having generated 2N/2 strings.
If our hash function generates random outputs of 256 bits, the space of all outputs is of size 2256. This mean that collisions can be found with good probability after generating 2128 digests. This is in the number we’re aiming for, and this is why hash functions at a minimum must provide 256-bit outputs.
Certain constraints sometimes push developers to reduce the size of a digest by truncating it (removing some of its bytes). In theory this is possible, but can greatly reduce security. In order to achieve 128-bit security at a minimum, a digest must not be truncated under:
- 256-bit for collision-resistance.
- 128-bit for pre-image and second pre-image resistance.
This means that depending on which property one relies on, the output of a hash function could be truncated to obtain a shorter digest.
Hash Functions in practice
As we’ve said earlier, in practice hash functions are rarely used alone. They’re most often combined with other elements to either create a cryptographic primitive, or a cryptographic protocol. Many examples of using hash functions to build more complex objects can be found in this article, but here are a few different ways hash functions are used in the real world:
Commitments. Imagine that you know that a stock in the market will increase in value and reach $50 in the coming month, but you can’t tell your friends about it (for some reason). You still want to be able to tell your friends you knew about it, after the fact. What you can do is commit to a sentence like “stock X will reach $50 next month“. To do this, hash the sentence and give your friends the output. A month later, reveal the sentence. Your friends can hash the sentence to observe that indeed, it produces the same
output. Can you tell which security property of the hash function prevents you from revealing a different sentence once you commit to a digest (by sending it to them)? What property prevents your friends from using the digest to recover the message?
Blockchains. Cryptocurrencies like Bitcoin keep track of all financial transactions (since the beginning of the currency) in a large digital ledger which is shared across all users. The pages of that ledger (containing the transactions) are called blocks, and each block authenticates the previous one by hashing it and carrying the digest. Hence, if you’ve the most current block, you can retrieve the previous block and verify if it hashes to the digest contained in the most current block. By doing this recursively, one can verify the entire chain of block (hence the name blockchain).
Tor. The Tor browsers’ goal is to give individuals the ability to browse the Internet anonymously. Another feature is the ability to create hidden webpages, which physical locations are difficult to track. Connections to these pages are secured via a protocol that uses the webpage’s keypair. For example, Silk Road which used to be the eBay of drugs until it got seized by the FBI, was accessible via
silkroad6ownowfk.onion in the Tor browser. How could we have trusted the public key presented by the webpage at the time? The ingenuity of these “onion addresses” was that they were also the output of the public key being hashed; by knowing the address we could verify the public key of the page.
In all of these examples, a hash function was used to provide content integrity in situations where:
- the produced digest was securely communicated to us
- the content being hashed might have been tampered with
We sometimes also say that we authenticate something or someone. It’s important to understand that if the hash isn’t obtained securely, then anyone can replace it with the hash of something else, and it doesn’t provide integrity by itself.