SBCL Windows threads implementation notes, part 1

Since I've Windows threads patches to SBCL for review, I'm publishing my implementation notes.

Thread-local storage

For each thread, Windows allocates a Thread Information Block (ref. 1, ref. 2, ref. 3). Thread can access its own Thread Information Block through the FS register (e.g., %FS:0 is the first field in TIB). TIB contains frequently used thread-specific data such as Last Error Number, pointer to SEH (structured exception handling) frame, thread id, and TLS backing store. This structure is documented in ref. 2 and ref. 4. TIB has an ‘Arbitrary’ field which is described as:

The 14h DWORD pvArbitrary field is theoretically available for applications to use however they want. It's almost like an extra thread-local storage slot for you to use, although I've never seen an application use it.

This sound just like what is needed for implementing thread-local storage - a memory location that we can freely use and that is easily accessible. Unfortunately, there is a possibility that some libraries would also use this field for their purposes (and there are examples of such libraries). And when some other conflicting code will run, it will be hard to detect this.

So clearly, a better way of storing the TLS pointer is needed. TIB also contains the first 64 slots for Windows thread-local storage. There are several options for implementing TLS with that:

  1. Windows allows executables to preallocate TLS slots. We can take, e.g., TLS slot 0, and all access to lisp's TLS will go through TLS at a fixed offset. But it would somewhat complicate the initialization of lisp runtime and the build process.
  2. We can grab some other fixed slot number by successively allocating TLS slots (TlsAlloc) until we get the slot we want. After that, we free the slots we allocated that we don't need. This way, we can use TLS slot 63 and lisp's TLS pointer will be at a known offset. It is quite safe to use this slot; libraries on Windows commonly use no more than one TLS slot (and ‘system’ libraries don't even use TLS - they have their own fields in TIB).
  3. We can allocate the TLS slot in a normal Windows way and store it in a global variable. This way, we don't need to do any weird things, but TLS access will have one more indirection. This would further complicate SBCL's TLS access because macros and code generation assume that TLS does not require any indirection. This is the best way to go but it requires more changes.

Currently, TLS is implemented as option 2. On initialization, we take slot 63 (or fail if we couldn't - but I can't imagine the situation where this can happen). When the windows-threads branch will be merged to SBCL, we should consider option 3 because if implemented now, the change would be more than necessary for Windows threading support.

TLS notes:

  1. This only applies to Win32. Win64 is very different

  2. Clozure CL, on the other hand, uses ES segment register and undocumented Win32 functions to allocate a segment and store it in a segment register. This is problematic because WOW64 (the Windows subsystem to run Win32 applications in Windows 64) does not preserve the value of ES register during context switches (including thread preemption). This is the reason why the 32-bit version of Clozure CL does not work on Windows 64.

Thread suspension and interruption

Lisp code in SBCL runs in a managed environment – SBCL needs to be able to safely suspend threads (because it uses the stop-the-world garbage collector) and interrupt them (i.e., call some function in a specific thread).

On Unix-like systems, suspending and interrupting threads is simple (but correctly synchronizing threads and writing code that is tolerable to asynchronous interruptions is not simple). This is achieved by using POSIX signals.

Windows API, on the other hand, does not provide equivalent asynchronous interruptions. They have to be emulated. I'll list several ways of emulating them:

  1. ‘Thread hijacking’. Windows lets stop a thread, examine its context, modify it, and restart a thread. This way, another thread may modify the stack and registers in such a way that thread will execute some other function and then return to where it was.

    This might seem to be the way to emulate signals, but it is not. This is racy with Windows internals in several ways (ref. 5).

  2. QueueUserAPCEx. Windows has support for Asynchronous Procedure Calls (APCs), but only ‘kernel APCs’ can interrupt a thread when it is in user mode. ‘kernel APCs’ are available from drivers and there is the QueueUserAPCEx project (ref. 6) that provides support for them.

    This has a very big drawback in that it requires a kernel-mode driver to be installed to function correctly. Because of that, I haven't tried it. By the way, the pthreads_win32 (ref. 7) project uses QueueUserAPCEx if it's available.

  3. Thread polling. Thread periodically checks if it should suspend or interrupt. This does not require any special support from an operating system. But it requires injecting polling into the code and to tolerate potentially infinite delays (they may occur when a thread executes a blocking operation).

Among the noted approaches, the only one that really works is the last one. So it's clear that we have no other choice but to implement polling in SBCL threads. Luckily, I didn't even need to do this myself – Paul Khuong implemented ‘gc safepoints’ for SBCL (ref. 8, ref. 9). The safepoints are implemented by reading a specific memory location and discarding the result (e.g., with test %eax, GC_POLL_PAGE_ADDR instruction). In normal conditions, this instruction will not affect the running code. But if the memory page is read-protected or unmapped, this will cause an exception and we will land in the exception handler where we can analyze what has happened and what to do next.

In the next post, I'll write about how threads are synchronized for timely suspension and interruption.