Mutexes and closure capture in Swift

I want to briefly talk about the absence of threading and thread synchronization language features in Swift. I’ll discuss the “concurrency” proposal for Swift’s future and how, until this feature appears, Swift threading will involve traditional mutexes and shared mutable state.

Using a mutex in Swift isn’t particularly difficult but I’m going to use the topic to highlight a subtle performance pitfall in Swift: dynamic heap allocation during closure capture. We want our mutex to be fast but passing a closure to execute inside a mutex can reduce performance by a factor of 10 due to memory allocation overhead. I’ll look at a few different ways to solve the problem.

An absence of threading in Swift

When Swift was first announced in June 2014, I felt there were two obvious omissions from the language:

  • error handling
  • threading and thread synchronization

Error handling was addressed in Swift 2 and was one of the key features of that release.

Threading remains largely ignored by Swift. Instead of language features for threading, Swift includes the Dispatch module (libdispatch, aka Grand Central Dispatch) on all platforms and implicitly suggests: use Dispatch instead of expecting the language to help.

Delegating responsibility to a bundled library seems particularly strange compared to other modern languages like Go and Rust that have made threading primitives and strict thread safety (respectively) core features of their languages. Even Objective-C’s @synchronized and atomic properties seem like a generous offering compared to Swift’s nothing.

What’s the reasoning behind this apparent omission in Swift?

Future “concurrency” in Swift

The answer is, somewhat tersely, discussed in the “Concurrency” proposal in the Swift repository.

I mention this proposal to highlight that the Swift developers would like to do something around concurrency in the future but keep in mind what Swift developer Joe Groff points out: “that document is only a proposal and not an official statement of direction.”

This proposal appears to describe a situation where, as in Cyclone or Rust, references can’t be shared between threads. Whether or not the result ends up resembling those languages, the plan appears to be for Swift to eliminate shared memory between threads except for types that implement Copyable and are passed through strictly governed channels (called Streams in the proposal). There would also be a form of coroutine (called Tasks in the proposal) which appears to behave like a pausable/resumable asynchronous dispatch block.

The proposal goes on to claim that most common threading language features (Go-like channels, .NET-like async/await, Erlang-style actors) could then be implemented in libraries on top of the Stream/Task/Copyable primitives.

It all sounds great but when is Swift’s concurrency expected? Swift 4? Swift 5? Not soon.

So it doesn’t help us right now. In fact, it kind of gets in the way.

Impact of future features on the current library

The problem right now is that Swift is avoiding simple concurrency primitives in the language or thread-safe versions of language features on the grounds that they would be replaced or obviated by future features.

You can find explicit examples of this occurring by watching the Swift-Evolution mailing list, where proposals for thread-safety and synchronization features are regularly deferred until the future concurrency model takes shape.

There are plenty of common features that don’t even get as far as the mailing list: language syntax for mutexes, synchronized functions, spawning threads, and everything else “threading” related that currently requires a library in Swift.

Trying to find a fast, general purpose mutex

In short: if we want multi-threaded behavior, we need to build it ourselves using pre-existing threading and mutex features.

The common advice for mutexes in Swift is usually: use a DispatchQueue and invoke sync on it.
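
In case the pattern isn’t familiar, here is a minimal sketch of that advice (the queue label, state and caller are placeholders of my own, not from the original article):

import Dispatch

let queue = DispatchQueue(label: "com.example.protected-state")
var protectedState = 0

func increment() {
   // All access to protectedState is funnelled through the serial queue.
   queue.sync {
      protectedState += 1
   }
}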

I love libdispatch but, in most cases, using DispatchQueue.sync as a mutex is in the slowest tier of solutions to the problem – more than an order of magnitude slower than other options due to unavoidable closure capture overhead for the closure passed to the sync function. This happens because a mutex closure needs to capture surrounding state (specifically, it needs to capture a reference to the protected resource) and this capturing involves a heap allocated closure context. Until Swift gains the ability to optimize non-escaping closures to the stack, the only way to avoid the heap allocation overhead for closures is to ensure that they are inlined – unfortunately, this is not possible across module boundaries, like the Dispatch module boundary – making DispatchQueue.sync an unnecessarily slow mutex in Swift.

The next most commonly suggested option I see is objc_sync_enter/objc_sync_exit. While faster (2-3 times) than libdispatch, it’s still a little slower than ideal (because it is always a re-entrant mutex) and relies on the Objective-C runtime (so it’s limited to Apple platforms).
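
For reference, objc_sync_enter/objc_sync_exit are used as a pair of calls around the critical section, with any Objective-C-compatible object acting as the lock token. A sketch (the Counter class is a placeholder of my own):

import Foundation

final class Counter {
   private var count = 0

   func increment() {
      // objc_sync_enter/exit use a recursive lock keyed on the object passed in; here, self.
      objc_sync_enter(self)
      defer { objc_sync_exit(self) }
      count += 1
   }
}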

The fastest option for a mutex is OSSpinLock – more than 20 times faster than DispatchQueue.sync. Other than the common limitations of a spin-lock (high CPU usage if multiple threads actually try to enter simultaneously), there are some serious scheduling problems that make it a bad idea on macOS and totally unusable on iOS.

The fastest realistic option is os_unfair_lock – slower than OSSpinLock but still well ahead of the remaining options (see the timing table in the appendix). However, this lock is not first-in-first-out (FIFO). Instead, the mutex is given to an arbitrary waiter (hence, “unfair”). You need to decide whether that matters for your program but, in general, it means os_unfair_lock shouldn’t be your first choice for a general purpose mutex.
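
If you do choose os_unfair_lock, note that the C API expects the lock to have a stable address, so a common approach in Swift is to allocate it rather than store it directly in a struct. A sketch of one such wrapper (the name and shape are my own, and the same closure capture caveats discussed later apply to its sync method):

import os

final class UnfairLock {
   // Allocate the lock so it has a stable address, as the C API requires (macOS 10.12/iOS 10 and later).
   private let lockPtr: UnsafeMutablePointer<os_unfair_lock> = {
      let ptr = UnsafeMutablePointer<os_unfair_lock>.allocate(capacity: 1)
      ptr.initialize(to: os_unfair_lock())
      return ptr
   }()

   deinit {
      lockPtr.deinitialize(count: 1)
      lockPtr.deallocate()
   }

   func sync<R>(execute: () throws -> R) rethrows -> R {
      os_unfair_lock_lock(lockPtr)
      defer { os_unfair_lock_unlock(lockPtr) }
      return try execute()
   }
}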

All these problems leave pthread_mutex_lock/pthread_mutex_unlock as the only reasonably performant, portable option.

Mutexes and closure capture pitfalls

Like most things in plain C, pthread_mutex_t has a pretty clunky interface, so it helps to use a Swift wrapper around it (particularly for construction and automatic cleanup). Additionally, it’s helpful to have a “scoped” mutex – one which accepts a function and runs that function inside the mutex, ensuring a balanced “lock” and “unlock” on either side of the function.

I’ll call my wrapper PThreadMutex. Here’s an implementation of a simple scoped mutex function on this wrapper:

public func sync<R>(execute: () -> R) -> R {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   return execute()
}
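
For context, the wrapper this method lives on might look roughly like the following – a sketch only; the real CwlUtils class offers more configuration and cleanup handling:

import Foundation

public final class PThreadMutex {
   // Kept public so the file-local extension shown later in the article can lock it directly.
   public var m = pthread_mutex_t()

   public init() {
      pthread_mutex_init(&m, nil)
   }

   deinit {
      pthread_mutex_destroy(&m)
   }
}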

This sync function was supposed to be high performance but it’s not. Can you see why?

The problem occurs because I implement reusable functions like this in my separate CwlUtils module, leading to exactly the same problem that DispatchQueue.sync had: closure capture causing heap allocation. Due to heap allocation overhead, this function will be more than 10 times slower than it needs to be (around 2 seconds for 10 million invocations, versus an ideal 0.17 seconds).

What exactly is being “captured”? Let’s look at the following example:

mutex.sync { doSomething(&protectedMutableState) }

In order to do something useful inside the mutex, a reference to protectedMutableState must be stored in the “closure context”, which is heap allocated data.

This might seem innocent enough (after all, capturing is what closures do). But if the sync function can’t be inlined into its caller (because it’s in another module, or it’s in another file and whole module optimization is turned off) then the capture will involve a heap allocation.

We don’t want heap allocation. Let’s avoid it by passing the relevant parameter into the closure, instead of capturing it.

WARNING: the next few code examples get increasingly goofy and I don’t suggest doing this in most cases. I’m doing this to demonstrate the extent of the problem. Read through to the section titled “A different approach” to see what I actually use in practice.

public func sync_2<T>(_ p: inout T, execute: (inout T) -> Void) {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   execute(&p)
}
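
At the call site, the protected value is now passed in and received as a parameter instead of being captured. Reusing the placeholder names from the earlier example, a call might look like:

mutex.sync_2(&protectedMutableState) { state in
   doSomething(&state)
}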

That’s better… now the function runs at nearly full speed (0.18 seconds for the 10 million invocation test).

We’ve only solved the problem for values passed into the function. There’s a similar problem with returning a result. The following function:

public func sync_3<T, R>(_ p: inout T, execute: (inout T) -> R) -> R {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   return execute(&p)
}

adds a return type and slightly decreases performance (0.19 seconds for 10 million invocations).
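
The timing table in the appendix also lists a sync_4 variant with dual inout parameters. A plausible sketch following the same pattern – one inout parameter for the protected state and one for the result location – is shown below, though this exact signature is my guess, not necessarily the version that was measured:

public func sync_4<T, U>(_ p1: inout T, _ p2: inout U, execute: (inout T, inout U) -> Void) {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   execute(&p1, &p2)
}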

A different approach

One of the advantages with closure capture is how effortless it seems. Items inside the closure have the same names inside and outside the closure and the connection between the two is obvious. When we avoid closure capture and instead try to pass all values in as parameters, we’re forced to either rename all our variables or shadow names – neither of which helps comprehension – and we still run the risk of accidentally capturing a variable and losing all efficiency anyway.

Let’s set aside all of this and solve the problem another way.

We can encourage the compiler to inline the closure and eliminate the need to capture by copying the code into the file where it is used.

Let’s add this extension on PThreadMutex to our test file:

extension PThreadMutex {
   private func sync<R>(execute: () throws -> R) rethrows -> R {
      pthread_mutex_lock(&m)
      defer { pthread_mutex_unlock(&m) }
      return try execute()
   }
}

Being in the same file allows the compiler to inline the closure and treat the self parameter as @guaranteed, eliminating both the heap allocated closure context and the retain/release overhead, and we’re finally down to the baseline 0.17 seconds.
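
With that file-local copy in place, the call site can go back to ordinary closure capture syntax without the performance penalty – for example (protectedMutableState is the same placeholder as before):

let mutex = PThreadMutex()
var protectedMutableState = 0

mutex.sync {
   protectedMutableState += 1
}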

Semaphores, not mutexes?

After I originally wrote this article, a few people asked… why not use a DispatchSemaphore instead? The advantage with the semaphore’s wait and signal is that no closure is required – they are separate, unscoped calls.

You could use a DispatchSemaphore to create a mutex-like construct as follows:

public struct DispatchSemaphoreWrapper {
   let s = DispatchSemaphore(value: 1)
   public init() {}
   public func sync<R>(execute: () throws -> R) rethrows -> R {
      _ = s.wait(timeout: DispatchTime.distantFuture)
      defer { s.signal() }
      return try execute()
   }
}

It turns out that it’s approximately a third faster than a pthread_mutex_lock/pthread_mutex_unlock mutex (0.14 seconds versus 0.17 seconds) but, despite the speed increase, a semaphore is a poor choice for a general purpose mutex.

Semaphores are prone to a number of mistakes and problems. The most serious are forms of priority inversion. Priority inversion is the same type of problem that makes OSSpinLock unusable on iOS; the problem is a little more complicated for semaphores, but it remains a real concern that can lead to deadlocks.

Ultimately, semaphores are a good way to communicate completion notifications between threads (something that isn’t easily done with mutexes) but semaphores have design complications and risks and should be limited to completion notification signalling, rather than guarding contended resources.

All of this might seem a little esoteric – since you probably don’t deliberately create threads of different priorities in your own programs. However, the Cocoa frameworks add a little twist here that you need to consider: they use dispatch queues pervasively and every dispatch queue has a “QoS class” which may result in the queue running at a different thread priority. Unless you know how every task in your program is queued (including user-interface and other tasks queued by the Cocoa frameworks), you might find yourself in a multiple-thread-priority scenario that you didn’t plan for. It’s best to avoid this risk.

Usage

The project containing the PThreadMutex and DispatchSemaphoreWrapper implementations is available on github: mattgallagher/CwlUtils.

The CwlMutex.swift file is fully self-contained so you can just copy the file, if that’s all you need.

Otherwise, the ReadMe.md file for the project contains detailed information on cloning the whole repository and adding the framework it produces to your own projects.

Conclusion

The best, safe option for a mutex across both Mac and iOS in Swift remains pthread_mutex_t. In the future, Swift will probably add the ability to optimize non-escaping closures to the stack or to inline across module boundaries. Either of these will fix the inherent problems with DispatchQueue.sync, likely making it a better option, but until that point it is needlessly inefficient.

While semaphores and other “lightweight” locks are valid approaches in some scenarios, they are not good general-use mutexes and carry additional design considerations and risks.

No matter your choice of mutex machinery, you need to be careful to ensure inlining for maximum performance – otherwise the overhead of closure capture will slow the mutex down by a factor of 10. In the current version of Swift, that might mean copying and pasting the code into the file where it’s used.

Threading, inlining and closure optimization are all areas we can hope will improve dramatically beyond the Swift 5 timeframe, but current Swift users need to get work done now – and this article describes the current behavior you’ll encounter when trying to get maximum performance from a scoped mutex.

Appendix: performance numbers

I ran a simple loop, 10 million times, entering a mutex, incrementing a counter, and leaving the mutex. The “slow” versions of DispatchSemaphoreWrapper and PThreadMutex are compiled as part of a dynamic framework, separate from the test code.
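
For reference, the shape of the loop was roughly as follows. The real tests live in CwlMutexPerformanceTests.swift and use XCTest’s measurement facilities; this standalone approximation is mine:

import Foundation

let mutex = PThreadMutex()
var counter = 0

// Time 10 million lock/increment/unlock cycles through the scoped sync function.
let start = Date()
for _ in 0..<10_000_000 {
   mutex.sync { counter += 1 }
}
print("Counted to \(counter) in \(Date().timeIntervalSince(start)) seconds")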

These are the timing results:

Mutex variant                                          Seconds (Swift 3.0, MacPro4,1)   Seconds (Swift 4.2, MacBookPro15,1)
DispatchQueue.sync                                     3.530                            3.687
PThreadMutex.sync (capturing closure)                  3.124                            2.015
objc_sync_enter                                        0.833                            0.446
PThreadMutex.sync_4 (dual inout params)                0.310                            0.192
PThreadMutex.sync_3 (returning result)                 1.364                            0.187
PThreadMutex.sync_2 (single inout param)               0.284                            0.184
PThreadMutex.sync_1 (inlined function in same file)    0.265                            0.175
direct pthread_mutex_lock/unlock calls                 0.263                            0.175
os_unfair_lock                                         0.187                            0.130
OSSpinLockLock                                         0.108                            0.075

The test code used is part of the linked CwlUtils project but the test file containing these performance tests (CwlMutexPerformanceTests.swift) is not linked into the test module by default and must be deliberately enabled.