SBCL's Windows threads implementation notes

Thread-local storage

Windows allocates a special thread-specific structure, the Thread Information Block (TIB) [1, 2, 3]. The TIB contains thread-specific variables such as the last error number, a pointer to the SEH (structured exception handling) frame, the thread identifier, and the TLS backing store. This structure is documented in [2] and [4]. A thread can access its own TIB via the FS segment register: %fs:0 is the first field of the TIB.

The first 64 TLS slots are stored in the TLS array in the TIB at offset 0xE10 (slots with numbers higher than 63 are stored in another memory location; a pointer to that location is also stored in the TIB). This location is documented and is used by several C/C++ compilers for thread-local variables [TODO: need a reference]. Windows provides the TlsAlloc function to allocate a slot in this array; we can allocate one slot from the TLS array and use it to store a pointer to struct thread.
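
For illustration, here is a minimal sketch (assuming a GCC-style toolchain on 32-bit Windows; not SBCL's actual code) of why a slot index below 64 gives a fixed, directly addressable location in the TIB:

    #include <windows.h>

    /* Read TLS slot `slot` (slot < 64) directly from the TLS array in the TIB,
       which lives at offset 0xE10 on 32-bit Windows.  Equivalent to calling
       TlsGetValue(slot), but without going through a function call. */
    static void *read_tls_slot_directly(DWORD slot)
    {
        void *value;
        __asm__("movl %%fs:0xE10(,%1,4), %0" : "=r"(value) : "r"(slot));
        return value;
    }

There are several ways to obtain such a slot for the pointer to struct thread: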

  1. Windows lets executables control the initial allocation of TLS slots. We can claim TLS slot 0 and use it to store the pointer to struct thread. However, controlling the initial allocation of TLS slots would complicate the initialization of the Lisp runtime and the build process, and it would be a barrier to embedding SBCL.
  2. We can take the last TLS slot (slot 63) by allocating TLS slots (TlsAlloc) until we get the slot we want, then freeing the slots we don't need. This way the pointer to struct thread will be at a known memory location. It is safe to assume that slot 63 is free when SBCL is initialized: libraries on Windows commonly use no more than one TLS slot (and system libraries like Winsock don't even use TLS, as they have their own fields in the TIB), so when sbcl.exe is started, only a few TLS slots are taken. Even if we fail to own slot 63, we will know this at initialization time and will not crash silently somewhere later.
  3. We can allocate a TLS slot in the normal Windows way (by calling TlsAlloc) and store the slot index in a global variable. The initialization process is easy: just call TlsAlloc. But every TLS access gains one more indirection. I don't consider this a performance issue, but SBCL's compiler uses macros like (pseudo-atomic &body body) and (load-tl-symbol-value reg symbol), and it would be necessary to allocate temporary register(s) for resolving the indirection.

The last option is clearly the best one, but it would require changes to code generation, so I'll leave it for later.

Currently TLS is implemented as option 2. On initialization, we take slot 63 (or fail if we can't, though I can't imagine a situation where that would happen). When windows-threads is merged into SBCL, we should consider option 3; implementing it now would be a bigger change than is necessary for Windows threading support.
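
A minimal sketch of the slot 63 initialization, with assumed names and simplified error handling (the real runtime code may differ):

    #include <windows.h>

    #define OUR_TLS_INDEX 63

    static void claim_tls_slot_63(void)
    {
        DWORD scratch[TLS_MINIMUM_AVAILABLE];
        int n = 0;

        for (;;) {
            DWORD slot = TlsAlloc();
            if (slot == TLS_OUT_OF_INDEXES || slot > OUR_TLS_INDEX) {
                /* We didn't get slot 63: fail loudly at initialization
                   instead of crashing mysteriously later. */
                ExitProcess(1);
            }
            if (slot == OUR_TLS_INDEX)
                break;
            scratch[n++] = slot;      /* slots allocated only to skip them */
        }
        while (n > 0)                 /* free the slots we didn't want */
            TlsFree(scratch[--n]);
    }

With slot 63 owned, each thread stores its struct thread pointer there with TlsSetValue, and generated code can read it directly from the TLS array in the TIB at the fixed location %fs:0xE10 + 4*63.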

TLS notes:

  1. This only applies to Win32; Win64 is very different.
  2. Clozure CL uses the ES segment register and undocumented Windows NT functions to allocate a segment and store it in the Local Descriptor Table. This is problematic because WOW64 (the Windows subsystem used to run Win32 applications on 64-bit Windows) does not preserve the value of the ES register across context switches (including thread preemption) and syscalls. This is the reason why the 32-bit version of Clozure CL does not work on 64-bit Windows.
  3. In previous versions of windows-threads, the pointer to struct thread was stored in the TIB's pvArbitrary field. Windows lets libraries freely use the pvArbitrary field, so there is a potential conflict that is hard to detect. Anton Kovalenko persuaded me that using a known slot of the TLS array is better.

Thread suspension and interruption

Lisp code in SBCL runs in a managed environment: SBCL needs to be able to safely suspend threads (because it uses a stop-the-world garbage collector) and interrupt them (interruption means stopping a thread and making it call some function).

On Unix-like systems, suspending and interrupting threads is conceptually simple: by sending a Unix signal to a thread we can suspend it or interrupt it. Despite the conceptual simplicity, writing code that uses signals to synchronize threads is hard, because signals arrive asynchronously.

Windows, on the other hand, does not provide an equivalent asynchronous mechanism. There are several ways of emulating it:

  1. «Thread hijacking». Windows' debugging API has the functions SuspendThread (stop a thread), GetThreadContext and SetThreadContext (examine and modify the context of a thread), and ResumeThread (resume a thread). We can push a frame onto the suspended thread's stack and point the thread context's EIP register at the signal handler, imitating the delivery of a signal. This method is called «thread hijacking» or «EIP hijacking».

    The bad news is that thread hijacking can lead to race conditions inside Windows internals [5], so it should be avoided.

  2. Asynchronous Procedure Calls. An APC in Windows acts like a Unix signal: it executes some function in the context of a thread and then returns to whatever the thread was doing.

    There are two kinds of APCs, «user APCs» and «kernel APCs». They differ in that kernel APCs can only be issued from kernel mode and are delivered immediately, while user APCs are delivered only to threads in an «alertable» state. A thread is in an alertable state only for the duration of certain syscalls such as ReadFileEx, SleepEx, WaitForSingleObjectEx, and a few others. Only kernel code can issue a kernel APC, but the QueueUserAPCEx[6] project lets applications use kernel APCs by installing a custom kernel driver.

    Since QueueUserAPCEx requires installing a kernel-mode driver, I haven't tried it. Incidentally, the pthreads_win32[7] project uses QueueUserAPCEx to implement pthread_kill and pthread_cancel when it is available. (A small demonstration of user-APC delivery follows this list.)

  3. Periodic polling. The thread periodically checks whether it should suspend or interrupt itself. Polling does not require special support from the operating system, but it requires modifying the generated code so that it polls at regular intervals, and SBCL must be prepared for potentially long or even infinite delays between polls (they may occur when a thread executes a blocking operation).
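
To illustrate the limitation of user APCs mentioned in option 2, here is a small stand-alone demonstration (not SBCL code): the queued APC is delivered only when the target thread enters an alertable wait.

    #include <stdio.h>
    #include <windows.h>

    static void CALLBACK apc_fn(ULONG_PTR arg)
    {
        printf("APC delivered with argument %lu\n", (unsigned long)arg);
    }

    static DWORD WINAPI worker(LPVOID arg)
    {
        (void)arg;
        Sleep(1000);                 /* non-alertable: a queued APC waits here */
        SleepEx(INFINITE, TRUE);     /* alertable: the APC runs and we wake up */
        return 0;
    }

    int main(void)
    {
        HANDLE th = CreateThread(NULL, 0, worker, NULL, 0, NULL);
        QueueUserAPC(apc_fn, th, 42);
        WaitForSingleObject(th, INFINITE);
        CloseHandle(th);
        return 0;
    }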

Among these methods, only polling really works, so it is clear that we have no choice but to implement polling in SBCL threads. Luckily, I didn't even need to do this myself: Paul Khuong implemented GC safepoints for SBCL[8, 9]. A safepoint is implemented by reading a known fixed memory location: test %eax, GC_POLL_PAGE_ADDR, where GC_POLL_PAGE_ADDR is some fixed address. Under normal circumstances this instruction has no side effects on the running code (it only modifies the EFLAGS register, and it is not inserted inside condition-checking code). But if the memory page is read-protected or unmapped, the instruction causes a page fault, which is handled by an exception handler. The exception handler puts the thread to sleep or calls the interruption function.
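
A minimal sketch of catching the safepoint fault; the names, the use of a vectored exception handler, and the stub handle_safepoint are assumptions for illustration, not necessarily how the runtime actually wires it up.

    #include <windows.h>

    static void *gc_poll_page;    /* page read by every safepoint instruction */

    static void handle_safepoint(void)
    {
        /* In the real runtime this marks the thread as STATE_SUSPENDED_BRIEFLY
           and blocks until the master thread lets it continue (or runs a queued
           interruption).  Left as a stub in this sketch. */
    }

    static LONG CALLBACK safepoint_fault_handler(EXCEPTION_POINTERS *info)
    {
        EXCEPTION_RECORD *rec = info->ExceptionRecord;

        if (rec->ExceptionCode == EXCEPTION_ACCESS_VIOLATION &&
            rec->ExceptionInformation[1] == (ULONG_PTR)gc_poll_page) {
            /* The thread read the poll page while it was protected: do the
               safepoint work, then retry the faulting instruction (which
               succeeds once the page is readable again). */
            handle_safepoint();
            return EXCEPTION_CONTINUE_EXECUTION;
        }
        return EXCEPTION_CONTINUE_SEARCH;
    }

    void install_safepoint_handler(void)
    {
        gc_poll_page = VirtualAlloc(NULL, 4096, MEM_COMMIT | MEM_RESERVE,
                                    PAGE_READWRITE);
        AddVectoredExceptionHandler(1 /* call this handler first */,
                                    safepoint_fault_handler);
    }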

Implementation of thread suspension

To signal all threads that GC is coming and they should pause, GC_POLL_PAGE is read-protected. After that, threads will start receiving page faults (unless they are blocked in foreign code). We must consider that:

  1. The reaction to read-protecting GC_POLL_PAGE is not immediate: it might take some time for a thread to reach a safepoint.
  2. If a thread is executing foreign code or a blocking system call, it will not reach a safepoint in the foreseeable future.
  3. Even if a thread has reached a safepoint, it does not mean that GC can start. The thread may be inside a WITHOUT-GCING section or doing some operation that temporarily breaks invariants. The thread must have a chance to finish what it is doing; but if such a thread is simply resumed, it will retry the safepoint instruction forever.

From these considerations we can draw some conclusions:

  1. A thread will always reach a safepoint in finite time, unless it is running foreign code or doing a blocking operation (e.g., reading from a file or waiting on a synchronization object).
  2. Every thread that can reach a safepoint must reach it before GC can proceed.
  3. Every thread that cannot reach a safepoint must not interfere with GC if it suddenly returns to Lisp code.
  4. After all threads have reached a safepoint, we must wait for all threads to be ready for GC.

Thread suspension proceeds in two phases. In the first phase, threads are notified that they should pause. In the second phase, threads are actually stopped. This separation is needed because we must know when to make the GC poll page readable again.

Phase 1:

  1. GC poll page is mapped as unreadable
  2. The master thread (the thread that is doing the GC) enumerates the threads; for each thread that is running Lisp code, it waits until that thread's state becomes STATE_SUSPENDED_BRIEFLY.

The thread state STATE_SUSPENDED_BRIEFLY means that the thread has reached a safepoint and is waiting for the second phase of the suspension process. A thread enters this state even if it is in some state that prevents garbage collection.

Phase 2:

  1. The GC poll page is mapped as readable so that threads can prepare for GC.
  2. Threads that have reached a safepoint are resumed (because reaching a safepoint does not necessarily mean readiness for GC)
  3. The master thread waits for every thread to be ready for GC. This is achieved by waiting for the state of every thread to become STATE_SUSPENDED (except for threads that are already ready for GC).

    A thread is ready for GC when it has marked itself as GC-safe: it is running non-Lisp code, blockable signals are unblocked, and it is not inside WITHOUT-GCING or WITHOUT-INTERRUPTS.

    To track whether a thread that cannot reach a safepoint allows garbage collection, a special variable *GC-SAFE* is introduced. It is guaranteed that a thread cannot 'slip through' while garbage collection is commencing.
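
Putting the two phases together, here is a rough sketch of the stop-the-world sequence. The thread structure, state handling, and wait primitive are hypothetical simplifications (and the wakeup of briefly suspended threads is not shown); this is not the actual SBCL runtime code.

    #include <windows.h>

    enum { STATE_RUNNING, STATE_SUSPENDED_BRIEFLY, STATE_SUSPENDED };

    struct thread_rec {               /* drastically simplified */
        struct thread_rec *next;
        volatile LONG state;
        int runs_lisp_code;           /* currently executing Lisp code         */
        int gc_safe;                  /* cannot reach a safepoint, but has
                                         declared itself safe for GC (*GC-SAFE*) */
    };

    static struct thread_rec *all_threads;
    static void *gc_poll_page;        /* page read by every safepoint          */

    static void protect_poll_page(int readable)
    {
        DWORD old;
        VirtualProtect(gc_poll_page, 4096,
                       readable ? PAGE_READONLY : PAGE_NOACCESS, &old);
    }

    static void wait_for_state(struct thread_rec *th, LONG state)
    {
        while (th->state != state)
            Sleep(0);                 /* placeholder for a real wait primitive */
    }

    void stop_the_world(void)
    {
        struct thread_rec *th;

        /* Phase 1: read-protect the poll page so every thread running Lisp
           code traps at its next safepoint and parks in
           STATE_SUSPENDED_BRIEFLY. */
        protect_poll_page(0);
        for (th = all_threads; th; th = th->next)
            if (th->runs_lisp_code)
                wait_for_state(th, STATE_SUSPENDED_BRIEFLY);

        /* Phase 2: make the page readable again, release the briefly
           suspended threads so they can finish WITHOUT-GCING or
           pseudo-atomic work (wakeup not shown), then wait until each
           thread is actually ready for GC. */
        protect_poll_page(1);
        for (th = all_threads; th; th = th->next)
            if (!th->gc_safe)
                wait_for_state(th, STATE_SUSPENDED);
    }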

Interruption

Thread interruption is similar, but we only need to wait for the thread that we are interested in.

Phase 1:

  1. GC poll page is remapped as unreadable
  2. If the thread is in pthread_cond_wait, a spurious wakeup is generated; this, for example, wakes the thread from futex_wait. In future versions, other wakeups should be added: e.g., sleep should be cancelled, I/O should be cancelled.
  3. The master thread checks the interrupted thread: if it is running Lisp code, it waits until the thread reaches a safepoint. A thread is considered to have reached a safepoint when its state is STATE_SUSPENDED_BRIEFLY.

Phase 2:

  1. The GC poll page is made readable again.
  2. All threads that have reached a safepoint are released
  3. The interrupted thread runs the interruption function, or queues it if it cannot run it right now.
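
A small sketch of that last step, with hypothetical names and a simplified per-thread structure (the real runtime tracks this through INTERRUPT_PENDING and the interruption queue):

    struct interruption {
        void (*fun)(void *);
        void *arg;
        struct interruption *next;
    };

    struct thread_interrupt_data {        /* simplified per-thread data */
        int interrupts_enabled;
        struct interruption *pending;     /* drained at a later safepoint */
    };

    static void deliver_or_queue(struct thread_interrupt_data *self,
                                 struct interruption *i)
    {
        if (self->interrupts_enabled) {
            i->fun(i->arg);               /* run in the context of this thread */
        } else {
            i->next = self->pending;      /* remember it for later delivery */
            self->pending = i;
        }
    }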

Safepoint code

«Safepoint» is code that performs some bookkeeping activities for a thread. It is called:

  1. When a thread reaches a safepoint.
  2. When a thread enters or leaves non-Lisp code.
  3. At the beginning of an exception handler.
  4. When a thread changes its state variables, such as INTERRUPTS-ENABLED, IN-WITHOUT-GCING, GC-PENDING, and others.

Safepoint code does the following:

  1. If a safepoint was hit, this is noted by changing the state to STATE_SUSPENDED_BRIEFLY and waiting for the state to change.
  2. If the thread should suspend, it suspends if possible. If that is not possible, it sets STOP_FOR_GC_PENDING and INTERRUPT_PENDING (and pseudo_atomic_interrupted, if inside a pseudo-atomic section).
  3. If the thread should be interrupted, it executes the interruption if possible. If that is not possible, it sets INTERRUPT_PENDING and pseudo_atomic_interrupted.
  4. If GC is pending and the thread can do GC, it runs the GC.
  5. If an interrupt is pending and the thread can execute it, it executes it.

On some occasions, the runtime is in a very fragile state and cannot do any of the things a safepoint must do (e.g., change the thread state, execute GC, execute an interruption), for example while operating on the Lisp thread synchronization objects. To control this, the *DISABLE-SAFEPOINTS* variable is used.

GC code is run inside a safepoint, and safepoint code itself is not reentrant, yet GC code itself contains safepoints (SUB-GC is a normal Lisp function; it calls Lisp synchronization routines and switches to/from foreign code several times). To prevent reentering safepoint code, the *IN-SAFEPOINT* variable is used.
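
To tie the list above together, here is a condensed sketch of the safepoint bookkeeping. The flags are stand-ins for *DISABLE-SAFEPOINTS*, *IN-SAFEPOINT*, WITHOUT-GCING, GC-PENDING and INTERRUPT-PENDING, and the actions are left as comments; this is an illustration, not the actual runtime code.

    enum { STATE_RUNNING, STATE_SUSPENDED_BRIEFLY, STATE_SUSPENDED };

    struct sp_thread {
        int state;
        int disable_safepoints;        /* *DISABLE-SAFEPOINTS*               */
        int in_safepoint;              /* *IN-SAFEPOINT*: prevents reentry   */
        int without_gcing;             /* inside WITHOUT-GCING               */
        int interrupts_enabled;
        int pseudo_atomic;
        int pseudo_atomic_interrupted;
        int stop_for_gc_pending;       /* STOP_FOR_GC_PENDING                */
        int gc_pending;                /* GC-PENDING                         */
        int interrupt_pending;         /* INTERRUPT-PENDING                  */
    };

    void safepoint(struct sp_thread *self, int hit_poll_page,
                   int suspend_requested, int interruption_requested)
    {
        if (self->disable_safepoints || self->in_safepoint)
            return;                    /* runtime state is too fragile now   */
        self->in_safepoint = 1;

        if (hit_poll_page) {
            self->state = STATE_SUSPENDED_BRIEFLY;
            /* ... block here until the master thread lets us continue ...   */
        }
        if (suspend_requested) {
            if (!self->without_gcing && !self->pseudo_atomic) {
                self->state = STATE_SUSPENDED;      /* suspend right away    */
            } else {
                self->stop_for_gc_pending = 1;      /* remember the request  */
                self->interrupt_pending = 1;
                if (self->pseudo_atomic)
                    self->pseudo_atomic_interrupted = 1;
            }
        }
        if (interruption_requested) {
            if (self->interrupts_enabled && !self->pseudo_atomic) {
                /* run the interruption function in this thread              */
            } else {
                self->interrupt_pending = 1;
                if (self->pseudo_atomic)
                    self->pseudo_atomic_interrupted = 1;
            }
        }
        if (self->gc_pending && !self->without_gcing) {
            /* run the GC                                                    */
            self->gc_pending = 0;
        }
        if (self->interrupt_pending && self->interrupts_enabled) {
            /* run the queued interruption                                   */
            self->interrupt_pending = 0;
        }

        self->in_safepoint = 0;
    }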