Mutexes and closure capture in Swift

I want to briefly talk about the absence of threading and thread synchronization language features in Swift. I’ll discuss the “concurrency” proposal for Swift’s future and how, until this feature appears, Swift threading will involve traditional mutexes and shared mutable state.

Using a mutex in Swift isn’t particularly difficult but I’m going to use the topic to highlight a subtle performance nuisance in Swift: dynamic heap allocation during closure capture. We want our mutex to be fast but passing a closure to execute inside a mutex can reduce the performance by a factor of 10 due to memory allocation overhead. I’ll look at a few different ways to solve the problem.

An absence of threading in Swift

When Swift was first announced in June 2014, I felt there were two obvious omissions from the language:

  • error handling
  • threading and thread synchronization

Error handling was addressed in Swift 2 and was one of the key features of that release.

Threading remains largely ignored by Swift. Instead of language features for threading, Swift includes the Dispatch module (libdispatch, aka Grand Central Dispatch) on all platforms and implicitly suggests: use Dispatch instead of expecting the language to help.

Delegating responsibility to a bundled library seems particularly strange compared to other modern languages like Go and Rust that have made threading primitives and strict thread safety (respectively) core features of their languages. Even Objective-C’s @synchronized and atomic properties seem like a generous offering compared to Swift’s nothing.

What’s the reasoning behind this apparent omission in Swift?

Future “concurrency” in Swift

The answer is, somewhat tersely, discussed in the “Concurrency” proposal in the Swift repository.

I mention this proposal to highlight that the Swift developers would like to do something around concurrency in the future but keep in mind what Swift developer Joe Groff points out: “that document is only a proposal and not an official statement of direction.”

This proposal appears to describe a situation where, like in Cyclone or Rust, references can’t be shared between threads. Whether or not the result ends up resembling those languages, the plan appears to be for Swift to eliminate shared memory between threads except for types that implement Copyable and are passed through strictly governed channels (called Streams in the proposal). There would also be a form of coroutine (called Tasks in the proposal) which appears to behave like a pausable/resumable asynchronous dispatch block.

The proposal goes on to claim that most common threading language features (Go-like channels, .NET-like async/await, Erlang-style actors) could then be implemented in libraries on top of the Stream/Task/Copyable primitives.

It all sounds great but when is Swift’s concurrency expected? Swift 4? Swift 5? Not soon.

So it doesn’t help us right now. In fact, it kind of gets in the way.

Impact of future features on the current library

The problem right now is that Swift is avoiding simple concurrency primitives in the language or thread-safe versions of language features on the grounds that they would be replaced or obviated by future features.

You can find explicit examples of this occurring by watching the Swift-Evolution mailing list.

There are plenty of common features that don’t even get as far as the mailing list, including language syntax for mutexes, synchronized functions, spawning threads and everything else “threading” related that requires a library to achieve in Swift.

Trying to find a fast, general purpose mutex

In short: if we want multi-threaded behavior, we need to build it ourselves using pre-existing threading and mutex features.

The common advice for mutexes in Swift is usually: use a DispatchQueue and invoke sync on it.
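
For reference, that pattern looks roughly like this (a minimal sketch; the queue label and protected value are placeholders):

import Dispatch

let queue = DispatchQueue(label: "com.example.protected-state")
var protectedCount = 0

// Every access to protectedCount is funnelled through the serial queue
queue.sync {
   protectedCount += 1
}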

I love libdispatch but, in most cases, using DispatchQueue.sync as a mutex is in the slowest tier of solutions to the problem – more than an order of magnitude slower than other options due to unavoidable closure capture overhead for the closure passed to the sync function. This happens because a mutex closure needs to capture surrounding state (specifically, it needs to capture a reference to the protected resource) and this capturing involves a heap allocated closure context. Until Swift gains the ability to optimize non-escaping closures to the stack, the only way to avoid the heap allocation overhead for closures is to ensure that they are inlined – unfortunately, this is not possible across module boundaries, like the Dispatch module boundary – making DispatchQueue.sync an unnecessarily slow mutex in Swift.

The next most commonly suggested option I see is objc_sync_enter/objc_sync_exit. While faster (2-3 times) than libdispatch, it’s still a little slower than ideal (because it is always a re-entrant mutex) and relies on the Objective-C runtime (so it’s limited to Apple platforms).
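
Used directly, it looks something like this (a sketch; any Objective-C-compatible object can serve as the lock token):

import Foundation

final class SynchronizedCounter {
   private var count = 0
   func increment() {
      // Re-entrant lock keyed on the object's identity
      objc_sync_enter(self)
      defer { objc_sync_exit(self) }
      count += 1
   }
}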

The fastest option for a mutex is OSSpinLock – more than 20 times faster than dispatch_sync. Beyond the common limitations of a spin-lock (high CPU usage if multiple threads actually try to enter simultaneously), there are serious priority inversion problems on iOS that make it totally unusable on that platform, leaving it a Mac-only option.

If you’re targeting iOS 10 or macOS 10.12, or newer, then you can use os_unfair_lock_t. This should have performance close to OSSpinLock while avoiding its most serious problems. However, this lock is not first-in-first-out (FIFO). Instead, the mutex is given to an arbitrary waiter (hence, “unfair”). You need to decide whether this is a problem for your program but in general, this means that it shouldn’t be your first choice for a general purpose mutex.
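
A minimal wrapper might look like this (a sketch, assuming the deployment targets above; the lock is kept in a class so it has a stable memory location):

import os.lock

final class UnfairLock {
   private var unfairLock = os_unfair_lock()
   func sync<R>(execute: () throws -> R) rethrows -> R {
      os_unfair_lock_lock(&unfairLock)
      defer { os_unfair_lock_unlock(&unfairLock) }
      return try execute()
   }
}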

All these problems leave us with pthread_mutex_lock/pthread_mutex_unlock as the only reasonably performant, portable option.

Mutexes and closure capture pitfalls

Like most things in plain C, pthread_mutex_t has a pretty clunky interface, so it helps to use a Swift wrapper around it (particularly for construction and automatic cleanup). Additionally, it’s helpful to have a “scoped” mutex – one which accepts a function and runs that function inside the mutex, ensuring a balanced “lock” and “unlock” either side of the function.

I’ll call my wrapper PThreadMutex. Here’s an implementation of a simple scoped mutex function on this wrapper:

public func sync<R>(execute: () -> R) -> R {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   return execute()
}
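
For context, the wrapper class around that function – the m it locks – might look roughly like the following. This is a simplified sketch, not the full CwlUtils implementation:

import Darwin

public final class PThreadMutex {
   // A class, not a struct, so the pthread_mutex_t is never copied after initialization
   public var m = pthread_mutex_t()
   public init() {
      pthread_mutex_init(&m, nil)
   }
   deinit {
      pthread_mutex_destroy(&m)
   }
}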

This sync function was supposed to be high performance but it’s not. Can you see why?

The problem occurs because I implement reusable functions like this in my separate CwlUtils module, leading to exactly the same problem that DispatchQueue.sync had: closure capture causing heap allocation. Due to heap allocation overhead, this function will be more than 10 times slower than it needs to be (3.124 seconds for 10 million invocations, versus an ideal 0.263 seconds).

What exactly is being “captured”? Let’s look at the following example:

mutex.sync { doSomething(&protectedMutableState) }

In order to do something useful inside the mutex, a reference to the protectedMutableState must be stored in the “closure context”, which is heap allocated data.

This might seem innocent enough (after all, capturing is what closures do). But if the sync function can’t be inlined into its caller (because it’s in another module, or it’s in another file and whole module optimization is turned off) then the capture will involve a heap allocation.

We don’t want heap allocation. Let’s avoid it by passing the relevant parameter into the closure, instead of capturing it.

WARNING: the next few code examples get increasingly goofy and I don’t suggest doing this in most cases. I’m doing it here to demonstrate the extent of the problem. Read through to the section titled “A different approach” to see what I actually use in practice.

public func sync_2<T>(_ p: inout T, execute: (inout T) -> Void) {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   execute(&p)
}
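
Invocation now passes the protected value in as an inout parameter instead of capturing it:

// Assuming `protectedMutableState` is a mutable value in the current scope
mutex.sync_2(&protectedMutableState) { doSomething(&$0) }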

That’s better… now the function runs at full speed (0.282 seconds for the 10 million invocation test).

We’ve only solved the problem with values passed in to the function. There’s a similar problem with returning a result. The following function:

public func sync_3<T, R>(_ p: inout T, execute: (inout T) -> R) -> R {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   return execute(&p)
}

slides most of the way back towards the sluggish speed of the original, even when the closure captures nothing (1.371 seconds – roughly five times slower than the ideal). This closure is performing a heap allocation to handle its result.

We can fix this by making the result an inout parameter too.

public func sync_4<T, U>(_ p1: inout T, _ p2: inout U, execute: (inout T, inout U) -> Void) -> Void {
   pthread_mutex_lock(&m)
   defer { pthread_mutex_unlock(&m) }
   execute(&p1, &p2)
}

and invoke like this:

// Assuming `mutableState` and `result` are valid, mutable values in the current scope
mutex.sync_4(&mutableState, &result) { $1 = doSomething($0) }

We’re back to full speed, or close enough (0.307 seconds for 10 million invocations).

A different approach

One of the advantages of closure capture is how effortless it seems. Captured variables have the same names inside and outside the closure and the connection between the two is obvious. When we avoid closure capture and instead try to pass all values in as parameters, we’re forced to either rename all our variables or shadow names – neither of which helps comprehension – and we still run the risk of accidentally capturing a variable and losing all the efficiency gains anyway.

Let’s set aside all of this and solve the problem another way.

We could create a free sync function in our file that takes the mutex as a parameter:

private func sync<R>(mutex: PThreadMutex, execute: () throws -> R) rethrows -> R {
   pthread_mutex_lock(&mutex.m)
   defer { pthread_mutex_unlock(&mutex.m) }
   return try execute()
}
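
At the call site, the mutex is passed as an argument and the closure is free to capture again:

// Assuming `mutex` is a PThreadMutex and `protectedMutableState` is in scope
sync(mutex: mutex) { doSomething(&protectedMutableState) }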

By placing this in the file where it is called, it almost works. The heap allocation overhead is gone, bringing the time taken from 3.043 seconds down to 0.374 seconds. But we still haven’t reached the baseline 0.263 seconds of calling pthread_mutex_lock/pthread_mutex_unlock directly. What’s going wrong now?

It turns out that despite being a private function in the same file – where Swift can fully inline the function – Swift is not eliminating redundant retains and releases on the PThreadMutex parameter (which is a class type to avoid breaking the pthread_mutex_t by copying it).

We can force the compiler to avoid these retains and releases by making the function an extension on PThreadMutex, rather than a free function:

extension PThreadMutex {
   private func sync<R>(execute: () throws -> R) rethrows -> R {
      pthread_mutex_lock(&m)
      defer { pthread_mutex_unlock(&m) }
      return try execute()
   }
}

This forces Swift to treat the self parameter as @guaranteed, eliminating retain/release overhead and we’re finally down to the baseline 0.264 seconds.

Semaphores, not mutexes?

After I originally wrote the article, a few people asked… why not use a dispatch_semaphore_t instead? The advantage with dispatch_semaphore_wait and dispatch_semaphore_signal is that no closure is required – they are separate, unscoped calls.

You could use dispatch_semaphore_t to create a mutex-like construct as follows:

public struct DispatchSemaphoreWrapper {
   let s = DispatchSemaphore(value: 1)
   init() {}
   func sync<R>(execute: () throws -> R) rethrows -> R {
      _ = s.wait(timeout: DispatchTime.distantFuture)
      defer { s.signal() }
      return try execute()
   }
}

It turns out that it’s approximately a third faster than a pthread_mutex_lock/pthread_mutex_unlock mutex (0.168 seconds versus 0.264 seconds) but despite the speed increase, using a semaphore for a mutex is a poor choice for a general mutex.

Semaphores are prone to a number of mistakes and problems. Most serious are forms of priority inversion. Priority inversion is the same type of problem that made OSSpinLock unusable on iOS but the problem for semaphores is a little more complicated.

With a spin lock, priority inversion involves:

  1. High priority thread is active, spinning, waiting for the lock held by a lower priority thread
  2. Lower priority thread never releases the lock because it is starved by the higher priority thread

With a semaphore, priority inversion involves…

  1. High priority thread waits on a semaphore
  2. Medium priority thread does work unrelated to semaphore
  3. Low priority thread is expected to signal the semaphore so the high priority thread can continue

The medium priority thread will starve the low priority thread (that’s okay, that’s what thread priority does) but since the high priority thread is waiting for the low priority thread to signal the semaphore, the high priority thread is also starved by the medium priority thread. Ideally, this should never happen.

If a proper mutex were used instead of a semaphore, the high priority thread’s priority would be donated to the low priority thread while the high priority thread waits on the mutex – letting the low priority thread finish its work and unblock the high priority thread. However, a semaphore is not “held” by any particular thread, so no priority donation can occur.

Ultimately, semaphores are a good way to communicate completion notifications between threads (something that isn’t easily done with mutexes) but semaphores have design complications and risks and should be limited to situations where you know all the threads involved and know their priorities in advance – where the waiting thread is known to be equal or lower priority than the signalling thread.

All of this might seem a little esoteric – since you probably don’t deliberately create threads of different priorities in your own programs. However, the Cocoa frameworks add a little twist that you need to consider: they use dispatch queues pervasively and every dispatch queue has a “QoS class” which may result in the queue running at a different thread priority. Unless you know how every task in your program is queued (including user-interface and other tasks queued by the Cocoa frameworks), you might find yourself in a multiple thread priority scenario that you didn’t plan. It’s best to avoid this risk.
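
As a concrete illustration, work submitted to two queues created like the following runs at different underlying thread priorities (a sketch; the labels are arbitrary):

import Dispatch

let maintenance = DispatchQueue(label: "com.example.maintenance", qos: .utility)
let userActions = DispatchQueue(label: "com.example.user-actions", qos: .userInitiated)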

Usage

The project containing these PThreadMutex and DispatchSemaphoreWrapper implementations is available on github: mattgallagher/CwlUtils.

The CwlMutex.swift file is fully self-contained so you can just copy the file, if that’s all you need.

Otherwise, the ReadMe.md file for the project contains detailed information on cloning the whole repository and adding the framework it produces to your own projects.

Conclusion

The best, safe option for a mutex across both Mac and iOS in Swift remains pthread_mutex_t. In the future, Swift will probably add the ability to optimize non-escaping closures to the stack, or to inline across module boundaries. Either of these would fix the inherent problems with DispatchQueue.sync, likely making it a better option, but until that point, it is needlessly inefficient.

While semaphores and other “lightweight” locks are valid approaches in some scenarios, they are not general use mutexes and carry additional design considerations and risks.

No matter your choice of mutex machinery, you need to be careful to ensure inlining for maximum performance – otherwise the overhead of closure capture will slow the mutex down by a factor of 10. In the current version of Swift, that might mean copying and pasting the code into the file where it’s used.

Threading, inlining and closure optimizations are all topics we can expect to change dramatically beyond the Swift 3 timeframe but current Swift users need to get work done in Swift 2.3 and Swift 3 – and this article describes current behavior in these versions when trying to get maximum performance from a scoped mutex.

Appendix: performance numbers

I ran a simple loop, 10 million times, entering a mutex, incrementing a counter, and leaving the mutex. The “slow” versions of DispatchSemaphore and PThreadMutex are compiled as part of a dynamic framework, separate to the test code.
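
In outline, each variant was measured with a loop of roughly the following shape (a sketch using the PThreadMutex wrapper from above; the real harness is in the CwlMutexPerformanceTests.swift file mentioned below):

let mutex = PThreadMutex()
var counter = 0
for _ in 0..<10_000_000 {
   mutex.sync { counter += 1 }
}
// counter should equal 10_000_000 once the loop completes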

These are the timing results:

Mutex variant                                 Seconds (Swift 2.3)   Seconds (Swift 3)
PThreadMutex.sync (capturing closure)         3.043                 3.124
DispatchQueue.sync                            2.330                 3.530
PThreadMutex.sync_3 (returning result)        1.371                 1.364
objc_sync_enter                               0.869                 0.833
sync(PThreadMutex) (function in same file)    0.374                 0.387
PThreadMutex.sync_4 (dual inout params)       0.307                 0.310
PThreadMutex.sync_2 (single inout param)      0.282                 0.284
PThreadMutex.sync (inlined non-capturing)     0.264                 0.265
direct pthread_mutex_lock/unlock calls        0.263                 0.263
OSSpinLockLock                                0.092                 0.108

The test code used is part of the linked CwlUtils project but the test file containing these performance tests (CwlMutexPerformanceTests.swift) is not linked into the test module by default and must be deliberately enabled.