Random Number Generation#
Many cryptographic algorithms and protocols require randomness as an input. We saw several instances of this in the chapter on block cipher modes: randomness is one way to create secure IVs and nonces. Key generation is another very common application of randomness.
The quality of the randomness is of great importance in these applications[HDWH12]. In several cases, if an attacker is able to predict bits used by another algorithm that are supposed to be randomly generated, it can cause catastrophic security failure. In this chapter, we’ll look at how bits can be randomly generated in a way that is suitable for cryptographic use.
Randomness#
We’ve used the term “random” casually in the course so far, and it’s now time to define it more precisely.
In casual usage, if data is “random”, it means there is no obvious pattern to the data. In cryptography, it is a stronger property: given part of a randomly-generated sequence of bits, it is not feasible to predict other bits in the sequence with accuracy better than 50%.
This definition is impossible to apply directly to decide whether a given sequence of bits is random. How can one prove that it is not possible to predict other bits in the sequence? You could say that no known techniques succeed in making correct predictions more than 50% of the time, but what’s to say there aren’t other techniques that aren’t yet known?
In addition, randomness isn’t a binary, yes-or-no property. As an analogy, consider flipping a coin. If you flip a coin 100 times and it comes up heads 49 times, tails 51 times, is it a fair coin? It seems reasonable to say yes. What if it comes up heads zero times and tails 100 times? It seems reasonable to say no, that’s not a fair coin. What about 43 heads, 57 tails? Then it’s less clear. Maybe it’s reasonable to say the coin is not obviously unfair, but you’re less confident that it’s fair than you were in the 49-51 case. It’s not possible, in general, to take the result of 100 coin flips and translate that into a yes/no determination of whether the coin is fair; there is a spectrum of fairness.
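To put rough numbers on that intuition, here is a short Python calculation (our own illustration, not a formal statistical test) of how likely a fair coin is to produce a result at least as lopsided as each of those outcomes:

```python
from math import comb

def prob_at_most(k: int, n: int = 100) -> float:
    """Probability of k or fewer heads in n flips of a fair coin."""
    return sum(comb(n, i) for i in range(k + 1)) / 2**n

# Probability of a result at least as lopsided as the one observed (in either direction):
for heads in (49, 43, 0):
    lopsided = 2 * prob_at_most(min(heads, 100 - heads))
    print(f"{heads} heads out of 100: {lopsided:.3g}")
```

The 49-heads outcome is entirely unremarkable (probability about 0.92), zero heads is essentially impossible for a fair coin (about 10⁻³⁰), and 43 heads sits in between (about 0.19), which is exactly the spectrum described above.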
In practice, there are a variety of statistical tests that can be applied to a supposedly random sequence of bits, to determine how random it is. One very simple test is that the sequence should have roughly equal numbers of 0-bits and 1-bits. Another is that the lengths of continuous runs of 0-bits and 1-bits should follow a certain statistical pattern. Some tests get much more complicated.
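As a rough sketch of the two simple tests just mentioned, the following Python functions compute the fraction of 1-bits in a sequence and the lengths of its runs; real test suites apply far more sophisticated statistics:

```python
def monobit_fraction(bits: str) -> float:
    """Fraction of 1-bits; should be close to 0.5 for a random-looking sequence."""
    return bits.count("1") / len(bits)

def run_lengths(bits: str) -> list:
    """Lengths of maximal runs of identical bits.

    In a random-looking sequence, roughly half the runs have length 1,
    a quarter have length 2, an eighth have length 3, and so on.
    """
    runs, length = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            length += 1
        else:
            runs.append(length)
            length = 1
    runs.append(length)
    return runs

sample = "1101001110101000110111010010"
print(monobit_fraction(sample))
print(run_lengths(sample))
```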
There is a popular software tool called TestU01 [LEcuyerS09] which contains a huge variety of tests and random number generators; this is commonly used to evaluate methods of randomly generating bits.
Numbers Aren’t Random#
It’s important to understand that in the phrase “random number generator”, the word “random” describes the generator, not the numbers. A number, in and of itself, can’t be “random”. What matters is where the number came from.
As such, in this reading, we’ll be careful not to talk about “random numbers” or “random bits”, and to emphasize that the randomness applies to the process of generating the numbers[Ska04].
True Randomness#
Computers are deterministic devices. They cannot produce true randomness within themselves: they need to collect it from outside sources, by interacting with the physical world.
Because true randomness requires interacting with the physical world via hardware, software needs support from the operating system to get it. Modern operating systems offer simple APIs for doing so, and we’ll discuss those later in this chapter.
True randomness comes from unpredictable physical phenomena like background radiation picked up from wires and ports. Another common source is the motion of a physical mouse, or delays between key presses, which are unpredictable enough to contain some true randomness; however, this obviously can’t be done on computers that don’t have mice or keyboards attached, like servers.
Because true randomness comes from measuring physical phenomena, the amount of it that is available to the operating system can vary, and sometimes it can be insufficient for what applications need. (For example, right after a computer starts up, it may not have had time to gather enough measurements.) To deal with that, there are ways to stretch a small number of bits generated from true randomness into a larger number of bits; see the next section.
Pseudorandomness#
A pseudorandom number generator (commonly abbreviated to PRNG) is an algorithm that is deterministic, yet outputs a sequence of bits that perform well on statistical tests of randomness. You can repeat the process of generating a sequence of bits from a PRNG, and the exact same sequence will be produced.
It’s very important to note that even if the output of a PRNG does well on statistical tests, that emphatically does not mean that it’s hard to accurately predict other bits of the output. On the contrary, with some PRNGs, it is possible to reverse engineer them completely, given a certain amount of output.
A PRNG has a seed: a value that it is initialized with before it can generate any output. A PRNG will always produce exactly the same output for a given seed.
Every PRNG has a state: some variables that it uses to generate its output. To produce output, a PRNG will derive the output from its current state, and then transform its state so that the next output will be different. All of this is deterministic: for any given state, the output from it, and the next state, will always be the same.
Because its state is finite, there are finitely many possible states: \(n\) bits of state means at most \(2^n\) possible states. That means that a PRNG’s output is guaranteed to repeat after a certain number of outputs are generated, because it must eventually return to a state it has been in before. The number of outputs generated before the sequence repeats is called the period of the PRNG. The period of a PRNG is very important in cryptography, because of the potentially disastrous consequences of repeating a sequence of bits that are supposed to be unique.
Pseudorandomness has a lot of applications in computing, both cryptographic and otherwise. One very useful application is in software testing. You can pseudorandomly generate a large volume of inputs to some code under test: this creates more, and more varied, test cases than could feasibly be created by hand. By using a fixed seed, the generation of the test cases is repeatable, so that the code is tested in the same way every time the tests are run. Then, if one of the tests fails, a developer can easily reconstruct the failing test case to debug it.
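As a small illustration of this pattern (the function and data here are hypothetical), seeding Python’s random module PRNG with a fixed value makes the generated test inputs repeatable:

```python
import random

def make_test_cases(seed: int, count: int) -> list:
    """Pseudorandomly generate lists of integers to use as test inputs."""
    rng = random.Random(seed)   # fixed seed, so the same cases are produced every run
    cases = []
    for _ in range(count):
        length = rng.randint(0, 20)
        cases.append([rng.randint(-1000, 1000) for _ in range(length)])
    return cases

# Running this twice with the same seed yields identical test data,
# so a failing case can be reconstructed exactly.
assert make_test_cases(seed=42, count=5) == make_test_cases(seed=42, count=5)
```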
Another common application of pseudorandomness is in games. Solitaire is an example: shuffling the virtual deck of cards needs randomness, and using a PRNG means that the game can exactly recreate a given deal (Windows Solitaire has this feature; each deal is numbered, and the number seeds the PRNG).
Example PRNGs#
We’ll quickly illustrate what some PRNGs look like, to make this discussion more concrete.
The simplest type of PRNG is called a linear congruential generator (LCG). An LCG is defined by three numbers: a multiplier called \(a\), an increment called \(c\), and a modulus called \(m\). It starts with a seed, which we will call \(X_0\). To generate the \(n\)-th number in the sequence:
Multiply \(X_{n-1}\) by \(a\), the multiplier.
Add \(c\), the increment.
Reduce modulo \(m\).
The result is \(X_n\), the next number in the sequence.
In short:
\[X_n = (a \cdot X_{n-1} + c) \Mod m\]
Let’s do an example, using \(a = 3\), \(c = 5\), \(m = 23\), and \(X_0 = 10\). Starting from the seed, the generator produces \(X_1 = 12\), \(X_2 = 18\), \(X_3 = 13\), \(X_4 = 21\), \(X_5 = 22\), \(X_6 = 2\), \(X_7 = 11\), \(X_8 = 15\), \(X_9 = 4\), \(X_{10} = 17\), \(X_{11} = 10\).
With \(X_{11}\), we get a repeat: it equals \(X_0\), so the sequence starts over. If we start with a different seed, for example 3, we will get a different sequence: 14, 1, 8, 6, 0, 5, 20, 19, 16, 7, and then 3 again.
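Here is a minimal Python version of this LCG (the generator function is our own sketch) that reproduces both sequences:

```python
from itertools import islice

def lcg(seed: int, a: int = 3, c: int = 5, m: int = 23):
    """Linear congruential generator: X_n = (a * X_{n-1} + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

print(list(islice(lcg(10), 12)))  # [12, 18, 13, 21, 22, 2, 11, 15, 4, 17, 10, 12]  (repeats)
print(list(islice(lcg(3), 12)))   # [14, 1, 8, 6, 0, 5, 20, 19, 16, 7, 3, 14]       (repeats)
```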
The state of an LCG is simply \(X_{n-1}\), the last number it generated; that is the only thing that changes. That means that the period of an LCG is at most \(m\) (since that is how many different values \(X_{n-1}\) could possibly take on), and possibly less, as we see above.
Another very popular PRNG is called the Mersenne Twister. We won’t go into detail on it, but it is popular mostly because of its incomprehensibly long period of \(2^{19937} - 1\). It’s commonly implemented as the PRNG of programming languages’ standard libraries (Java is an exception; java.util.Random uses an LCG).
In Cryptography#
To generate bits used in cryptography, there is a special class of PRNGs called cryptographically secure pseudorandom number generators, or CSPRNGs. These have the additional property that, given some output bits, it’s not feasible to predict any other output bits with accuracy better than 50%. This is the case even if the predictor knows the algorithm (but not the seed or the state).
The Mersenne Twister, despite its popularity, is not cryptographically secure. If you can get 624 consecutive outputs of Mersenne Twister, you can completely reconstruct its state, and thus predict all future outputs perfectly. LCGs are also not cryptographically secure.
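You can see that state directly in CPython, whose random module uses the Mersenne Twister; its internal state is exactly those 624 32-bit words, plus an index:

```python
import random

version, internal_state, gauss_next = random.getstate()
print(len(internal_state))   # 625 on CPython: 624 Mersenne Twister state words plus an index
```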
Even if you use a CSPRNG, its security is only as good as that of its seed. A common pattern in programs is to seed PRNGs with a timestamp; i.e. some representation of the current time. This is not secure for cryptographic purposes, because the range of possible seeds is quite small — even if you use nanosecond-resolution timestamps — and the CSPRNG seed could thus be brute-forced. The only secure way to get a seed for a CSPRNG is from a source of true randomness.
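To see how small that search space is, here is a sketch of brute-forcing a timestamp seed. The victim program and the one-hour window are hypothetical, and we use Python’s non-cryptographic random module for brevity, but the same attack applies to any generator whose seed is a timestamp:

```python
import random
import time

def token_from_seed(seed: int) -> bytes:
    """Hypothetical victim: derives a 16-byte token from a PRNG seeded with a timestamp."""
    rng = random.Random(seed)
    return rng.getrandbits(128).to_bytes(16, "big")

observed = token_from_seed(int(time.time()))   # the "secret" token the attacker sees

# The attacker guesses the token was generated within the last hour and
# simply tries every possible one-second timestamp in that window.
now = int(time.time())
for guess in range(now - 3600, now + 1):
    if token_from_seed(guess) == observed:
        print("recovered seed:", guess)
        break
```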
Furthermore, a CSPRNG should be reseeded after generating some number of outputs. In reseeding, additional bits from true randomness are mixed into the state. If an attacker could somehow learn the state of the algorithm, they would be able to compute forward to predict future output, which is why it’s important to reseed periodically from a source of true randomness.
In an ideal CSPRNG, an attacker who has learned the algorithm’s state should not be able to work backwards to compute outputs that the algorithm produced before reaching that state.
Cryptographically Secure PRNGs#
Most CSPRNGs used in practice are based on hash functions or block ciphers, rather than purpose-built algorithms. Those types of algorithm have properties similar to what is required from a CSPRNG (e.g. hash functions’ preimage resistance), so there are relatively simple ways to use them as CSPRNGs.
In addition, hash functions and block ciphers have already gotten plenty of analysis, so we can be reasonably confident in their security when used as CSPRNGs. Also, almost any practical cryptographic system uses either a block cipher or a hash function or both, so there would be an implementation already available to use as part of a CSPRNG.
It’s not particularly important to know all the details of how CSPRNGs are built from hash functions and block ciphers, so we’ll only look at those at a high level.
Hash Functions#
A hash function can be used to build a CSPRNG. The recommended way is to use a construction called HASH_DRBG, standardized by NIST[nis15]. (“DRBG” stands for “deterministic random bit generator”.)
There are two parts to HASH_DRBG’s internal state, called \(V\) and \(C\). \(V\) changes with every chunk of output; \(C\) changes only during reseeds. They are both of the same length, which depends on the hash function being used: if the output length is 256 bits or less (e.g. SHA-256), then \(V\) and \(C\) are each 440 bits; otherwise, they are 888 bits.
Each time bits are requested from the generator, \(V\) is hashed together with an incrementing counter until enough bits have been produced. Then \(V\) itself is hashed and \(C\) is added to it; the resulting value replaces \(V\).
To reseed, before generating the output bits, \(V\) is hashed together with new bits from true randomness. Then this new \(V\) is hashed and the result replaces \(C\).
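To make this structure concrete, here is a heavily simplified Python sketch that follows the description above. It is not the actual HASH_DRBG from the NIST specification (which adds length encodings, a reseed counter, and other details), so treat it as illustration only:

```python
import hashlib
import os

HASH = hashlib.sha256
SEEDLEN = 440  # bit length of V and C when the hash output is 256 bits or less

class SimplifiedHashDRBG:
    """Simplified sketch of a hash-based DRBG; not the real NIST HASH_DRBG."""

    def __init__(self, entropy: bytes):
        self.V = int.from_bytes(HASH(b"V" + entropy).digest(), "big")
        self.C = int.from_bytes(HASH(b"C" + entropy).digest(), "big")

    def generate(self, nbytes: int) -> bytes:
        v = self.V.to_bytes(SEEDLEN // 8, "big")
        out, counter = b"", 0
        # Hash V together with an incrementing counter until enough bits are produced.
        while len(out) < nbytes:
            out += HASH(v + counter.to_bytes(4, "big")).digest()
            counter += 1
        # Advance the state: hash V itself and add C; the result replaces V.
        self.V = (int.from_bytes(HASH(v).digest(), "big") + self.C) % 2**SEEDLEN
        return out[:nbytes]

    def reseed(self, entropy: bytes):
        # Mix new true-randomness bits into V, then derive a new C from the new V.
        self.V = int.from_bytes(HASH(self.V.to_bytes(SEEDLEN // 8, "big") + entropy).digest(), "big")
        self.C = int.from_bytes(HASH(self.V.to_bytes(SEEDLEN // 8, "big")).digest(), "big")

drbg = SimplifiedHashDRBG(os.urandom(32))   # seeded from the OS's true randomness
print(drbg.generate(32).hex())
```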
Also standardized by NIST is HMAC_DRBG, which is also based on a hash function, but in an HMAC construction. The HMAC key is part of the internal state, and it is updated with every chunk of output.
Both of these CSPRNGs are considered secure, as long as the underlying hash function is secure.
Block Ciphers#
NIST also standardizes CTR_DRBG, which is a construction that uses a block cipher in CTR mode. This construction is considered less secure, and easier to misuse, than either of the hash-based options.
One particular problem with it is that if entire blocks of the cipher’s output are returned as output bits, they are distinguishable from truly random bits: a CTR-mode block cipher never outputs the same block twice under one key (because its input is just an incrementing counter plus a nonce), whereas a truly random source of bits could repeat a block. This means that the only secure way to use CTR_DRBG is to truncate the output of the block cipher by a few bits; however, the standard does not require this.
Fortuna is another CSPRNG that uses a block cipher in CTR mode[FSK10], as well as a hash function for reseeding. Its procedures for generating bits and reseeding are considerably simpler than CTR_DRBG’s.
After each time Fortuna generates output, it generates an additional 256 bits and uses those as the block cipher’s key for the next chunk of output; this prevents working backwards from a compromised state. To solve the above problem of a CTR-mode block cipher not generating the same block twice, Fortuna restricts the number of bits that can be generated in a single call to \(2^{20}\) (131 kilobytes), which is low enough that the statistical deviation from truly random is not noticeable.
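The generator portion of Fortuna can be sketched in a few lines of Python. This sketch assumes the third-party pyca/cryptography package for AES, and it leaves out Fortuna’s entropy pools, reseed schedule, and the per-request output limit just described; it only shows the rekey-after-output idea:

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

class FortunaGenerator:
    """Sketch of Fortuna's generator: AES-256 in CTR mode with rekeying after each request."""

    def __init__(self, key: bytes):
        self.key = key       # 32 bytes (256 bits); real Fortuna derives this from hashed entropy
        self.counter = 1     # 128-bit counter used as the CTR-mode block input; never reset

    def _keystream(self, nblocks: int) -> bytes:
        ctr = modes.CTR(self.counter.to_bytes(16, "big"))
        encryptor = Cipher(algorithms.AES(self.key), ctr).encryptor()
        self.counter += nblocks
        return encryptor.update(b"\x00" * (16 * nblocks))  # encrypting zeros yields the keystream

    def generate(self, nbytes: int) -> bytes:
        out = self._keystream((nbytes + 15) // 16)[:nbytes]
        # Rekey: generate 256 extra bits and use them as the next key, so that a
        # later compromise of the state cannot be used to recover this output.
        self.key = self._keystream(2)
        return out

gen = FortunaGenerator(os.urandom(32))
print(gen.generate(48).hex())
```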
Fortuna is actually more than just a CSPRNG. Unlike the NIST standards, Fortuna includes a good amount of detail on how to combine various sources of true randomness, in a way that is secure even if some of those sources are compromised by attackers. It also includes detail on how to recover after reboots, when the computer’s hardware may not have collected any true randomness yet — Fortuna has a way to save some data to a disk file and use it to initialize the generator after rebooting.
Blum Blum Shub#
Blum Blum Shub (named after its inventors Lenore Blum, Manuel Blum, and Michael Shub), commonly abbreviated to BBS, is a CSPRNG based on different math altogether. It is based on the difficulty of finding the prime factors of large numbers. We will look at this problem in much more detail in a later chapter.
BBS is extremely simple. It works as follows[BBS86]:
1. Choose two large primes, called \(p\) and \(q\), that both leave a remainder of 3 when divided by 4. Compute \(M = pq\). Choose a seed value \(x_0\), a positive integer that is not divisible by either \(p\) or \(q\), and is not equal to 1 or 0.
2. To generate a bit, compute:
\[x_{n+1} \leftarrow x_n^2 \Mod M\]
Then the output bit is the least-significant bit of \(x_{n+1}\).
There are other options for how to produce an output bit from \(x_{n+1}\) in step 2. Taking the least-significant bit is what the original paper specified, but it could also use the parity of the number of 1-bits in the base-2 representation of \(x_{n+1}\).
The cycle length of a BBS generator depends on the choice of \(p\) and \(q\). To ensure a long cycle length, it should be the case that:
\(p = 2a + 1\) where \(a\) is prime, and \(q = 2b + 1\) where \(b\) is prime.
The greatest common divisor of \(a-1\) and \(b-1\) is small.
If the first condition is true, then the cycle length is \(\frac{(a-1)(b-1)}{\gcd(a-1,\, b-1)}\).
Making sure \(\gcd(a-1,b-1)\) is small makes the cycle length longer.
We won’t go into the mathematical details of how this is derived, or why BBS’s security is equivalent to the difficulty of factoring; the math is somewhat advanced.
Let’s go through an example. We will choose \(p=11, q=23\) (you can verify that these fulfill the criteria above). Therefore \(M = 253\). Let \(x_0 = 124\). Then we run the generator until an \(x\) value is repeated: it produces 196, 213, 82, 146, 64, 48, 27, 223, 141, 147, 104, 190, 174, 169, 225, 25, 119, 246, 49, and then 124 again, which is the same as \(x_0\).
There are 20 distinct values before the sequence repeats. We can confirm this with the cycle length formula above, noting that \(p = 2\cdot5 + 1\) and \(q = 2\cdot 11 + 1\): \(\frac{(5-1)(11-1)}{\gcd(4,\,10)} = \frac{40}{2} = 20\).
Remember that the output of the CSPRNG isn’t the \(x\) values; it’s their least-significant bits. So the actual pseudorandom sequence of bits output by this generator is: 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, after which the pattern repeats.
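A few lines of Python (our own sketch, not a production implementation) reproduce this example:

```python
def bbs_bits(p: int, q: int, seed: int, count: int) -> list:
    """Blum Blum Shub: repeatedly square modulo M = p*q and output the low bit."""
    M = p * q
    x = seed
    bits = []
    for _ in range(count):
        x = (x * x) % M
        bits.append(x & 1)
    return bits

# One full cycle of the example above (p = 11, q = 23, seed 124):
print(bbs_bits(11, 23, 124, 20))
# [0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
```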
BBS is not often used in practice, because it is slower than other CSPRNGs: you must multiply two large numbers together to produce just a single bit of output. It can still be a reasonable choice when speed is not a priority.
DUAL_EC_DRBG#
DUAL_EC_DRBG is a CSPRNG based on the math of elliptic curves, formerly part of the same NIST standards document as HASH_DRBG etc. We’ll see elliptic curves in a later chapter, so we won’t talk about them here. We are only mentioning DUAL_EC_DRBG here because it’s historically significant. Do not use it; it is fully insecure.
DUAL_EC_DRBG was designed by the NSA, and it was their most brazen attempt in many years to push broken cryptography on the public. About a year after its publication, two academic researchers found that the algorithm could contain a backdoor: a way by which an attacker with the right knowledge can reverse-engineer the internal state of the PRNG from only 32 bytes of output[SF07].
The algorithm, by the nature of elliptic curves, contains a lot of large constants, and there is no explanation for how they were generated. There could have been; the constants could have been nothing-up-my-sleeve numbers. But they aren’t, and this is where the backdoor would be. During the standardization process, the NSA was insistent that it be the sole editor of the DUAL_EC_DRBG part. If they really did introduce a backdoor, they were not subtle about it; if they didn’t, then they acted very suspiciously for no apparent reason[Sch07]. Bottom line: this algorithm can’t be trusted.
Implementations#
In practice, you should never have to choose a CSPRNG. Whatever operating system, programming language, or software tool you’re using should make the choice for you. The following APIs are provided by common OSes and languages; they all handle choosing a CSPRNG, and seeding and reseeding it using true randomness from hardware.
Windows has an API called CryptGenRandom. Its exact method of operation isn’t published, but it is known to be based on a hash function.
Unix-like OSes such as Linux and Apple OSes have a special file at the path /dev/urandom, which pseudorandomly produces bits when you read from it. Linux uses an algorithm based on ChaCha20 [lin16], which we saw in an earlier chapter. Apple OSes use Fortuna[app21b].
Java has java.security.SecureRandom, which has a variety of possible implementations, but on most platforms it delegates to the OS (CryptGenRandom or /dev/urandom).
Python has os.urandom, which delegates to the OS.
JavaScript in browsers has window.crypto.getRandomValues, and Node.JS has crypto.randomBytes, both of which delegate to the OS.
Poor random number generation in JavaScript (using Math.random) has led to a widespread vulnerability in older Bitcoin wallets, called “Randstorm”[ran22]. Using JavaScript to do cryptography in web browsers is generally fraught with peril and best avoided when possible.
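In Python, for example, either of the following returns bytes from the operating system’s CSPRNG:

```python
import os
import secrets

key = os.urandom(32)             # 32 bytes straight from the operating system's CSPRNG
token = secrets.token_bytes(16)  # the secrets module also draws from the OS CSPRNG
print(key.hex(), token.hex())
```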
Key Takeaways#
Randomness is crucial for many cryptographic algorithms, primarily in generating keys and nonces. If an attacker can predict future random inputs to cryptographic algorithms, or learn past ones, this can easily cause catastrophic security failure.
There is a distinction between true randomness, which comes from measuring unpredictable physical phenomena, and pseudorandomness, which is random-looking output from a deterministic algorithm.
Not all pseudorandom number generators are suitable for cryptographic use. It must not be possible to take part of the output from a CSPRNG and compute or predict other parts of the output.
Operating systems provide CSPRNGs, and they are the only way to get cryptographically secure randomness in software. Programming libraries offer APIs to access operating system CSPRNGs.
CSPRNGs are commonly built from hash functions and block ciphers, in constructions called HASH_DRBG, HMAC_DRBG, and CTR_DRBG.