What’s the best general purpose random number generating algorithm available?

In this article I’ll present a RandomGenerator protocol and use it to implement 8 different random number generating algorithms. Implementations will include wrappers around Mac/iOS built-in algorithms, my own implementations in Swift of some popular algorithms and some corresponding C implementations for comparison. The implementations of the RandomGenerator protocol all seed from /dev/urandom by default, can generate data of arbitrary size (although I’ll focus on 64-bit integer generation) and offer conversion to Double (preserving up to 52-bits of randomness in the significand).

As an aside, my Swift implementation of the Mersenne Twister ended up 20% faster than the official mt19937-64.c implementation. Curious to understand what I had done, I ended up “fixing” the C version to be just as fast as the Swift version. Yes, it’s true: with a little tuning, C can be just as fast as Swift.

Welcome to C with love.

## Use case

My primary use for random numbers is in fuzz testing; deliberately sending mixed and garbled inputs to my functions (to look for data handling errors) or running large numbers of threads with different timing offsets and data sizes (to look for timing or thread-safety bugs).

It’s difficult to know exactly what performance or quality I require from random numbers in this scenario – most “good quality” options would probably suffice – but what I do require is:

• no shared global data (since I run multiple tests in parallel)
• the ability to set the initial seed (since I want to reproduce bugs when I find them)

Historically, I’ve used a C implementation of the Mersenne Twister. I don’t have any particular problems with it but the algorithm is nearing its 20th birthday so I was curious to see what else was around.

## Built-in sources of randomness on Mac and iOS

The C standard library on the Mac and iOS contains a few different functions for random number generation:

1. rand()
2. random()
3. [d|e|j|l|m]rand48() et al
4. arc4random()
5. /dev/[u]random

In iOS 10, macOS 10.12 and later, the first 3 are all “unavailable” in Swift, leaving us with just arc4random() and /dev/urandom as default sources of random numbers in Swift.

### /dev/urandom

If you need cryptographically secure random numbers, the only option you should consider is /dev/[u]random.

On Mac and iOS, both /dev/random and /dev/urandom are identical and use the Yarrow algorithm in conjunction with bits accumulated from hardware entropy sources. The existence of two different names is largely for cross-compatibility with Linux where the different devices have historically had a complicated range of distinct security considerations with general advice leaning towards /dev/urandom. I’ve used /dev/urandom to minimize portability problems.

The problem is that reading from /dev/urandom is slow. In my testing, it is between 100 and 1000 times slower than other generators and uses additional system resources on top of the userspace resources of typical random number generators.

The end result is that /dev/urandom, while useful for setting initial values, is a poor choice for a general-use random number generator.

### The “allegedly” RC4 generator

The only “general purpose” random number generator available by default in Swift is arc4random. It is generally high quality but it is slow due to the use of large amounts of state, locks on the global data and periodic mixing of additional entropy from /dev/urandom.

Like /dev/urandom, arc4random can’t be seeded, so you can’t regenerate previous sequences. This means that, while useful for random generation, it is harder to test and less useful for dynamically generated content, distributions or modelling.

The “arc4” in the name is because the algorithm is “allegedly” compatible with the RC4 algorithm developed by RSA Labs. Like the official RC4, arc4random was originally intended for use as a cryptographic random number generator. While the arc4random implementation in Mac/iOS doesn’t suffer the same problems that made the RC4 implementation in WEP vulnerable to trivial attacks, the output of arc4random should be treated as no longer cryptographically secure.

What does that mean?

arc4random provides good quality but is about 5 times slower than algorithms of equivalent quality. It uses global state and can’t be directly seeded for debugging purposes or other situations requiring repeatability.

## Other general-purpose random number generators

I’m going to look at some high quality, simple, fast random number generators, implement them in Swift and see how they compare.

### Linear feedback shift register

A linear-feedback shift register (LFSR) is just a series of shift and XOR operations. Wikipedia has a very clear animated GIF of the operation generating a random sequence from a 4-bit number.

By carefully selecting the feedback points, the shift amounts and combining multiple registers, you can get very long cycles, good distribution and low predicatability. Researcher Pierre L’Ecuyer gives some tables of values for “Tausworthe” style linear-feedback shift register generators in his paper Tables of Maximally-Equidistributed Combined LFSR Generators.

In the code I present as part of this article, I give two variants, Lfsr176 and Lfsr258 – with periods approximately equal to ${2}^{176}$ and ${2}^{258}$ respectively.

### Mersenne Twister

The Mersenne Twister was a huge advancement when it was introduced by M Matsumoto and T Nishimura in 1997 and it remains the random generator that newer non-cryptographic generators are compared against (the algorithm isn’t cryptographic because you can observe just 624 values and from that point, predict the sequence).

So the Mersenne Twister fails the “unpredictable” test but it is well tested, has a good distribution, very long period and is fairly fast. But there are some caveats.

In a tight loop, the Mersenne Twister is within a factor of 2 of the fastest algorithms tested. However, the standard Mersenne Twister uses 2496 bytes of internal storage. That might not seem like a lot of space on a modern computer but it is big enough to put additional burden on your L1 cache.

On the quality front: the Mersenne Twister has some known problems with entering “zero” states (situations where its internal state contains a large number of zeros and the generator gets “stuck”).

### Xoroshiro

There has been an academic contest in the last three years between Melissa O’Neill (creator of PCG) and Sebastiano Vigna (creator of xorshift+ and xorshift*). Both authors have relied on automated statistical quality tests to experiment with different variations on themes to find heavily refined versions of their respective approaches – O’Neill applying block ciphers on top of linear congruential generators and Vigna performing variations on xorshift.

The latest release from Sebastiano Vigna (in collaboration with David Blackman) is xoroshiro128plus – a specialized linear-feedback shift register. Xoroshiro claims to be the fastest algorithm to fare well on the TestU01 set of quality tests for random number generators and claims to provide a better statistical distribution on the PracRand tests than previous xorshift algorithms.

While the period of the generator is fairly low (${2}^{128}$), it’s easily high enough for any common purpose.

## Implementations

I defined the following protocol and provided implementations of the protocol for each of the algorithms:

public protocol RandomGenerator {
public init()
mutating public func randomize(buffer: UnsafeMutablePointer<Void>, size: Int)
mutating public func random64() -> UInt64
mutating public func random32() -> UInt32

mutating public func random64(max: UInt64) -> UInt64
mutating public func random32(max: UInt32) -> UInt32

/// Generates a double with a random 52 bit significand on the half open range [0, 1)
mutating public func randomHalfOpen() -> Double

/// Generates a double with a random 52 bit significand on the closed range [0, 1]
mutating public func randomClosed() -> Double

/// Generates a double with a random 51 bit significand on the open range (0, 1)
mutating public func randomOpen() -> Double
}

and a derived protocol:

public protocol RandomWordGenerator: RandomGenerator {
associatedtype WordType
mutating func randomWord() -> WordType
}

The advantage with these protocols: a generator need only implement randomize(_, size:) or randomWord() and all the remaining functions are automatically provided (although I’ve implemented optimized versions of random64() in all cases to ensure a fair comparison).

The implementations are:

• Arc4Random – 64 bit integers generated with arc4random_buf
• DevRandom – data read from /dev/urandom
• Lfsr176 – a 3 register linear-feedback shift register with period ${2}^{176}$
• Lfsr258 – a 5 register linear-feedback shift register with period ${2}^{258}$
• MersenneTwister – a Swift implementation of MT19937_64
• MT19937_64 – the Mersenne Twister as generated by mt19937-64.c by Takuji Nishimura and Makoto Matsumoto.
• Xoroshiro – 64-bit integers generated by this custom xor/shift generator
• xoroshiro128plus – the original C implementation of Xoroshiro by David Blackman and Sebastiano Vigna

and

• ConstantNonRandom – baseline that returns a constant 64-bit number

## Performance

All algorithms were used to generate 100 million 64-bit UInt64 values. These are the timing results:

Seconds
DevRandom 113.250
Arc4Random 4.359
Lfsr258 0.717
MT19937_64 0.535
MersenneTwister 0.533
Lfsr176 0.498
xoroshiro128plus 0.352
Xoroshiro 0.347
ConstantNonRandom 0.211
Time taken to generate 100 million 64-bit values

Tests were performed on a 2.67Ghz Nehalem Mac Pro, using Swift 3.0. CwlRandom.swift was statically linked with the testing bundle (rather than dynamically linked through the CwlUtils.framework) for performance reasons. The MT19937_64 and xoroshiro128plus implementations were implemented in the same file as the tests so the extra function call layer for these tests would be inlined away. The testing bundle was compiled with whole module optimization off to ensure they weren’t optimized into the core loop.

performance ranges from 300 million per second for Xoroshiro to 813 thousand per second for DevRandom.

These numbers show the Swift version of Xoroshiro faster than xoroshiro128plus but from run to run, they trade places. Any difference between them is within the margin of error of this test setup.

## MersenneTwister in C versus Swift

So the Swift implementation of MersenneTwister ended up 20% faster than the mt19937-64.c C implementation. Hooray, Swift is the fastest!

### Why?

The Mersenne Twister is made up of two parts:

1. The “xor-shift-mask” steps performed on every iteration
2. The “twist” steps performed every 312 iterations

The “xor-shift-mask” has very little wiggle room. Here’s the C implementation:

x = ctx->state[ctx->index++];
x ^= (x >> 29) & 0x5555555555555555ULL;
x ^= (x << 17) & 0x71D67FFFEDA60000ULL;
x ^= (x << 37) & 0xFFF7EEE000000000ULL;
x ^= (x >> 43);
return x;

There is a minor performance issue with this code but there’s no significant room for refactoring here. The real difference between my Swift code and the C is in the “twist” steps.

A simple Swift implementation of “twist” would look like this:

for i in 0..<stateCount {
let x = (state[i] & upperMask) | (state[(i + 1) % stateCount] & lowerMask)
int xA = (x & 1 == 0) ? (x >> 1) : ((x >> 1) ^ 0xB5026F5AA96619E9)
state[i] = state[(i + (stateCount / 2)) % stateCount] ^ xA
}

This code would work but unfortunately, % is not a fast operation and at 100 million per second speeds, the ternary conditional operator ?: is also too slow (don’t worry about the stateCount / 2 part, that’s a constant and is optimized away). Avoiding these requires restructuring the loop so that they’re not required.

The mt19937-64.c implementation breaks the loop apart into two halves and follows up with an epilogue to handle the final position:

for (i = 0; i < (stateCount / 2); i++) {
ctx->state[i] = ctx->state[i + (stateCount / 2)] ^ (x >> 1) ^ mag01[x & 1];
}
for (; i < stateCount - 1; i++) {
ctx->state[i] = ctx->state[i - (stateCount / 2)] ^ (x >> 1) ^ mag01[x & 1];
}
ctx->state[stateCount - 1] = ctx->state[(stateCount / 2) - 1] ^ (x >> 1) ^ mag01[x & 1];

(I’ve renamed the variables and reformatted this to look more like my Swift code, to aid comparison.)

This code contains no division or modulo (remember: the stateCount / 2 is a constant and optimized away) and no conditionals or ternary operators (except the loops themselves). But there’s a weird mag01 array (which is 0 at index 0 and 0xB5026F5AA96619E9 at index 1) and worse: the ctx->state is fully iterated multiple times since the ctx->state[i + (stateCount / 2)] and ctx->state[i - (stateCount / 2)] accesses each walk the values from the other loop.

I took a different approach to optimize the “twist” code for my implementation. My code walks both halves of the loop simultaneously, using two different indexes, offset by stateCount / 2:

let (a, mid, stateMid) = (0xB5026F5AA96619E9, stateCount / 2, state[stateCount / 2])
var (i, j) = (0, mid)
repeat {
let x1 = (state[i] & upperMask) | (state[i &+ 1] & lowerMask)
state[i] = state[i &+ mid] ^ (x1 >> 1) ^ ((state[i &+ 1] & 1) &* a)
let x2 = (state[j] & upperMask) | (state[j &+ 1] & lowerMask)
state[j] = state[j &- mid] ^ (x2 >> 1) ^ ((state[j &+ 1] & 1) &* a)
(i, j) = (i &+ 1, j &+ 1)
} while i != mid &- 1

let x3 = (state[mid &- 1] & upperMask) | (stateMid & lowerMask)
state[mid &- 1] = state[stateCount &- 1] ^ (x3 >> 1) ^ ((stateMid & 1) &* a)
let x4 = (state[stateCount &- 1] & upperMask) | (state[0] & lowerMask)
state[stateCount &- 1] = state[mid &- 1] ^ (x4 >> 1) ^ ((state[0] & 1) &* a)

The epilogue needs to handle two indexes but the state array is only traversed once, we’re hitting half as many loop conditions and a simple multiply by 0 or 1 (optimized to bitwise arithmetic) is used to eliminate the ternary operator conditional.

If there were no other elements at play, this alone would improve performance 7% versus the mt19937-64.c implementation.

But there’s another advantage: with these two iterations (using the i index and the j index) side-by-side in the same loop, the compiler can automatically optimize to SIMD instructions to give us an extra 10% boost.

There’s another 3% performance difference but to understand that, we’ll need to make the C and Swift code more similar.

### What happens if the C code does the same thing?

Swift is faster but it’s not an apples-to-apples comparison. Is it possible to make a true comparison? Where both C and Swift follow the same logic?

Let’s try changing the C code to:

for (i = 0, j = mid; i != mid - 1; i++, j++) {
ctx->state[i] = ctx->state[i + mid] ^ (x >> 1) ^ ((ctx->state[i + 1] & 1) * a);
ctx->state[j] = ctx->state[j - mid] ^ (y >> 1) ^ ((ctx->state[j + 1] & 1) * a);
}
ctx->state[mid - 1] = ctx->state[stateCount - 1] ^ (x >> 1) ^ ((stateMid & 1) * a);
ctx->state[stateCount - 1] = ctx->state[mid - 1] ^ (y >> 1) ^ ((ctx->state[0] & 1) * a);

As expected, this brings the C to within 3% of Swift.

So what’s the cause of the remaining difference?

At this point, the only differences are a few int types in C that are Int in Swift. We need to move the C code to a more consistent 64-bit everywhere with unsigned long long instead of int.

The remaining difference? The use of the postincrement operator I showed in the first line of the C “xor-shift-mask” code. We need to change:

x = ctx->state[ctx->index++];

to

x = ctx->state[ctx->index];
ctx->index = ctx->index + 1;

It might seem like this shouldn’t make any difference but it does affect the generated assembly and appears to result in a 1-2% difference.

Both the C and Swift are now the same. I don’t just mean “they take the same time”, I mean they are compiled to literally the same instructions.

Hooray, C is also the fastest!

## Usage

The project containing these RandomGenerator implementations is available on github: mattgallagher/CwlUtils.

The CwlRandom.swift file is fully self-contained so you can just copy the file, if that’s all you need.

Otherwise, the ReadMe.md file for the project contains detailed information on cloning the whole repository and adding the framework it produces to your own projects.

The mt19937-64.c file contains my changes to the Takuji Nishimura and Makoto Matsumoto file of the same name. You’re welcome to use this in your own C projects, if desired.

License note: the mt19937-64.c file is “Copyright © 2004, Makoto Matsumoto and Takuji Nishimura” and contains its own BSD 3-clause license however, this file is not built into the CwlUtils.framework, it is used only by the testing bundle for validation and performance testing.

## Conclusion

It looks like Xoroshiro is the best general purpose algorithm currently available. Low memory (just 128 bits of storage), extremely high performance (1.2 nanoseconds per 64-bit number, after subtracting baseline overheads) and very well distributed (beating other algorithms on a range of automated tests). Mersenne Twister might still be a better choice for highly conservative projects unwilling to switch to such a new algorithm, but the current generation of statistically tested algorithms brings a baseline of assurance from the outset that previous generations lacked.

Of course, if you only need a few random numbers in your program and you don’t really care about multithreading or repeatability then these alternative algorithms are unnecessary – there’s no problem with the built-in arc4random or even reading from /dev/urandom (you could generate hundreds of numbers from /dev/urandom per second without overheads reaching 1%). However, the RandomGenerator protocol I’ve presented makes it easier to seed and generate 64-bit values or Double from these sources too.

Getting C-level performance in Swift for numerical algorithms is quirky but not particularly difficult. If you limit yourself to value types (no classes or existentials), use unsafe pointers and tuples instead of arrays, use overflow discarding operators &+/&-/&* instead of normal +/-/*, use while or repeat/while for your loops, then Swift and clang C will generally compile to identical instructions.

It’s not as though C is maximally performant without a little contortion. Using int types for indexes on 64-bit systems should be avoided and so should common idioms like inline use of the ++ postincrement operator.