Thread synchronization of atomic invariants in .NET 4.5
I've written before about multi-threaded programming in .NET (C#). Spinning up threads and executing code on another thread isn't really the hard part. The hard part is synchronization of data between threads.
Most of what I've written about has been from a processor-agnostic point of view, grounded in the historical fact that .NET supports many processors with varying memory models. The stance has generally been that you're programming to the .NET memory model and not to a particular processor's memory model.
But, that's no longer entirely true. In 2010 Microsoft basically dropped support for Itanium in both Windows Server and Visual Studio (http://blogs.technet.com/b/windowsserver/archive/2010/04/02/windows-server-2008-r2-to-phase-out-itanium.aspx). In VS 2012 there is no "Itanium" choice in the project Build options. As far as I can tell, Windows Server 2008 R2 is the only Windows operating system still in support that supports Itanium, and even Windows Server 2008 R2 for Itanium is not supported for .NET 4.5 (http://msdn.microsoft.com/en-us/library/8z6watww.aspx).
So, what does it mean to have only the context of running on x86/x64? Well, if you really read the documentation and research the Intel x86 and x64 memory model, this could have an impact on how you write multi-threaded code with regard to shared data synchronization. The x86 and x64 memory models include guarantees like "In a multiple-processor system…Writes by a single processor are observed in the same order by all processors." but also include guarantees like "Loads May Be Reordered with Earlier Stores to Different Locations". What this really means is that a store or a load to a single location won't be reordered with regard to a load or a store to the same location across processors. That is, we don't need fences to ensure a store to a single memory location is "seen" by all threads, or that a load from memory loads the "most recent" value stored. But it does mean that in order for multiple stores to multiple locations to be viewed by other threads in the same order, a fence is necessary (or the group of store operations is invoked as an atomic action through the use of synchronization primitives like Monitor.Enter/Exit, lock, Semaphore, etc.). (See section 8.2 Memory Ordering of the Intel Software Developer's Manual Volume 3A found here.) But that deals with non-atomic invariants, which I'll detail in another post.
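Just to make the multiple-locations point concrete before then, here's a minimal sketch of the classic publish-via-flag pattern; the field and method names are mine, purely for illustration. Two stores to two locations are involved, so on a weaker model like Itanium (or against JIT reordering) a fence, or a lock around both stores, is what keeps them ordered for other threads:

    using System;
    using System.Threading;

    public static class Publisher {
        private static int s_data;  // hypothetical payload
        private static int s_ready; // hypothetical "published" flag

        public static void Publish() {
            s_data = 42;
            Thread.MemoryBarrier(); // keep the flag store from being observed before the data store
            s_ready = 1;
        }

        public static void Consume() {
            if (s_ready == 1) {
                Thread.MemoryBarrier(); // don't read s_data before seeing the flag
                Console.WriteLine(s_data);
            }
        }
    }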
To be clear, you could develop to just x86 or just x64 prior to .NET 4.5 and have all the issues I'm about to detail.
Prior to .NET 4.5 you really programmed to the .NET memory model. This has changed over time since ECMA defined it around .NET 2.0; but that model was meant to be a "supermodel" that dealt with the fact that .NET could be deployed to different CPUs with disparate memory models. Most notable was the Itanium memory model. That model is much looser than the Intel x86 memory model and allowed things like a store without a release fence and a load without an acquire fence. This meant that a load or a store might be done only in one CPU's memory cache and wouldn't be flushed to memory until a fence. It also meant that other CPUs (e.g. other threads) might not see the store or might not get the "latest" value with a load. You can explicitly cause release and acquire fences in .NET with things like Monitor.Enter/Exit (lock), the Interlocked methods, Thread.MemoryBarrier, Thread.VolatileRead/VolatileWrite, etc. So, it wasn't a big issue for .NET programmers to write code that would work on an Itanium. For the most part, if you simply guarded all your shared data with a lock, you were fine. lock is expensive, so you could optimize things with Thread.VolatileRead/VolatileWrite if your shared data was inherently atomic (like a single int, a single Object, etc.) or you could use the volatile keyword (in C#). The conventional wisdom has been to use Thread.VolatileRead/VolatileWrite rather than decorate a field with volatile, because you may not need every access to be volatile and you don't want to take the performance hit when an access doesn't need to be volatile.
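As a baseline, a minimal sketch of that guard-everything-with-a-lock approach might look like the following (the names here are purely illustrative); the acquire and release fences come from the Monitor.Enter/Exit that lock compiles to:

    using System.Threading;

    public static class Counter {
        private static readonly object s_lock = new object();
        private static int s_count; // shared data, only touched under s_lock

        public static void Increment() {
            lock (s_lock) { // Monitor.Enter: acquire fence
                s_count++;
            }               // Monitor.Exit: release fence
        }

        public static int Read() {
            lock (s_lock) {
                return s_count;
            }
        }
    }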
The following example (borrowed from Jeffrey Richter, but slightly modified) shows synchronizing a static field with Thread.VolatileRead/VolatileWrite:
using System;
using System.Threading;

public static class Program {
    private static int s_stopworker;

    public static void Main() {
        Console.WriteLine("Main: letting worker run for 5 seconds");
        Thread t = new Thread(Worker);
        t.Start();
        Thread.Sleep(5000);
        Thread.VolatileWrite(ref s_stopworker, 1);
        Console.WriteLine("Main: waiting for worker to stop");
        t.Join();
    }

    public static void Worker(object o) {
        Int32 x = 0;
        while (Thread.VolatileRead(ref s_stopworker) == 0) {
            x++;
        }
    }
}
Without the call to Thread.VolatileWrite, the processor could reorder the write of 1 to s_stopworker to after the read (assuming we're not developing to one particular processor memory model and we're including Itanium). In terms of the compiler, without Thread.VolatileRead it could cache the value being read from s_stopworker into a register. For example, removing the Thread.VolatileRead, the compiler optimizes the comparison of s_stopworker to 0 in the while into a single register (on x86):
00000000 push ebp
00000001 mov ebp,esp
00000003 mov eax,dword ptr ds:[00213360h]
00000008 test eax,eax
0000000a jne 00000010
0000000c test eax,eax
0000000e je 0000000C
00000010 pop ebp
00000011 ret
The loop is 0000000c to 0000000e (really just testing that the eax register is 0). Using Thread.VolatileRead, we'd always get a value from a physical memory location:
00000000 push ebp
00000001 mov ebp,esp
00000003 lea ecx,ds:[00193360h]
00000009 call 71070480
0000000e test eax,eax
00000010 jne 00000021
00000012 lea ecx,ds:[00193360h]
00000018 call 71070480
0000001d test eax,eax
0000001f je 00000012
00000021 pop ebp
00000022 ret
The loop is now 00000012 to 0000001f, which shows calling Thread.VolatileRead each iteration (location 00000018). But, as we've seen from the Intel documentation and guidance, we don't really need to call VolatileRead; we just don't want the compiler to optimize the memory access away into a register access. This code works, but we take the hit of calling VolatileRead, which forces a memory fence through a call to Thread.MemoryBarrier after reading the value. For example, the following code is equivalent:
while (s_stopworker == 0)
{
    Thread.MemoryBarrier();
    x++;
}
This works just as well as using Thread.VolatileRead, and compiles down to:
00000000 push ebp
00000001 mov ebp,esp
00000003 cmp dword ptr ds:[002A3360h],0
0000000a jne 0000001A
0000000c lock or dword ptr [esp],0
00000011 cmp dword ptr ds:[002A3360h],0
00000018 je 0000000C
0000001a pop ebp
0000001b ret
The loop is now 0000000c to 00000018. As we can see, at 0000000c we have an extra "lock or" instruction, which is what the compiler optimizes a call to Thread.MemoryBarrier into. This instruction really just ORs 0 with what esp is pointing to (i.e. "nothing": zero OR'ed with something else does not change the value), but the lock prefix forces a fence and is less expensive than instructions like mfence. Based on what we know of the x86/x64 memory model, though, we're only dealing with a single memory location here and we don't need that lock prefix; the inherent memory guarantees of the processor mean that our thread can see any and all writes to that memory location without this extra fence. So, what can we do to get rid of it? Well, using volatile actually results in code that doesn't generate that lock or instruction. For example, if we change our code to make s_stopworker volatile:
using System;
using System.Threading;

public static class Program {
    private static volatile int s_stopworker;

    public static void Main() {
        Console.WriteLine("Main: letting worker run for 5 seconds");
        Thread t = new Thread(Worker);
        t.Start();
        Thread.Sleep(5000);
        s_stopworker = 1;
        Console.WriteLine("Main: waiting for worker to stop");
        t.Join();
    }

    public static void Worker(object o) {
        Int32 x = 0;
        while (s_stopworker == 0) {
            x++;
        }
    }
}
We tell the compiler that we don't want accesses to s_stopworker optimized. This then compiles down to:
00000000 push ebp
00000001 mov ebp,esp
00000003 cmp dword ptr ds:[00163360h],0
0000000a jne 00000015
0000000c cmp dword ptr ds:[00163360h],0
00000013 je 0000000C
00000015 pop ebp
00000016 ret
The loop is now 0000000c to 00000013. Notice that we're simply getting the value from memory on each iteration and comparing it to 0. There's no lock or: one less instruction and no extra memory fence. In many cases it doesn't matter (i.e. you might only do this once, in which case an extra few milliseconds won't hurt and this might be a premature optimization), but using lock or is about 992% slower than the volatile version when measured on my computer (i.e. volatile is about 91% faster than using Thread.MemoryBarrier, and probably a bit faster still than using Thread.VolatileRead). This is actually contrary to conventional wisdom with respect to a .NET memory model that supports Itanium. If you want to support Itanium, every access to a volatile field is tantamount to Thread.VolatileRead or Thread.VolatileWrite, in which case, yes, in scenarios where you don't really need the field to be volatile, you take a performance hit.
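If you want to check that kind of number yourself, a rough sketch along the following lines (the harness is illustrative, not the one used for the numbers above; timings will vary by machine and JIT) shows the gap between taking a fence on every iteration and doing a plain volatile read:

    using System;
    using System.Diagnostics;
    using System.Threading;

    public static class FenceTiming {
        private static int s_plain;
        private static volatile int s_volatile;

        public static void Main() {
            const int iterations = 100000000;
            int x = 0;

            Stopwatch sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; ++i) {
                if (s_plain == 0) x++;
                Thread.MemoryBarrier(); // compiles down to "lock or" on x86/x64
            }
            sw.Stop();
            Console.WriteLine("MemoryBarrier: {0} ms ({1})", sw.ElapsedMilliseconds, x);

            x = 0;
            sw.Restart();
            for (int i = 0; i < iterations; ++i) {
                if (s_volatile == 0) x++; // volatile read: just a mov on x86/x64
            }
            sw.Stop();
            Console.WriteLine("volatile:      {0} ms ({1})", sw.ElapsedMilliseconds, x);
        }
    }

Writing x out at the end keeps the loops from being optimized away entirely.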
In .NET 4.5 where Itanium is out of the picture, you might be thinking "volatile all the time then!". But, hold on a minute, let's look at another example:
 1: static void Main()
 2: {
 3:     bool complete = false;
 4:     var t = new Thread(() =>
 5:     {
 6:         bool toggle = false;
 7:         while (!complete)
 8:         {
 9:             Thread.MemoryBarrier();
10:             toggle = !toggle;
11:         }
12:     });
13:     t.Start();
14:     Thread.Sleep(1000);
15:     complete = true;
16:     t.Join();
17: }
This code (borrowed from Joe Albahari) will block indefinitely at the call to Thread.Join (line 16) without the call to Thread.MemoryBarrier() (at line 9).
This code blocks indefinitely without Thread.MemoryBarrier() on both x86 and x64; but this is due to compiler optimizations, not the processor's memory model. We can see this in the disassembly of what the JIT produces for the thread lambda (on x86):
00000000 push ebp
00000001 mov ebp,esp
00000003 movzx eax,byte ptr [ecx+4]
00000007 test eax,eax
00000009 jne 0000000F
0000000b test eax,eax
0000000d je 0000000B
0000000f pop ebp
00000010 ret
Notice the loop (0000000b to 0000000d): the compiler has cached complete in a register and doesn't update that register from memory (toggle has been optimized away entirely), identical to what we saw with the static field above. If we look at the disassembly when using MemoryBarrier (this listing happens to be x64):
00000000 movzx eax,byte ptr [rcx+8]
00000004 test eax,eax
00000006 jne 0000000000000020
00000008 nop dword ptr [rax+rax+00000000h]
00000010 lock or dword ptr [rsp],0
00000015 movzx eax,byte ptr [rcx+8]
00000019 test eax,eax
0000001b je 0000000000000010
0000001d nop dword ptr [rax]
00000020 rep ret
We see that the loop testing complete (instructions from 00000010 to 0000001b) reloads the value from memory into eax each iteration and loops until it's non-zero (i.e. until complete is true). MemoryBarrier has been optimized into "lock or" here as well.
What we're dealing with here is a local variable, so we can't use the volatile keyword. We could use the lock keyword to get a fence, but it couldn't be around the comparison (the while): that would enclose the entire while block, we'd never exit the lock to get the memory fence, and the compiler would still treat reads of complete as unguarded by lock's implicit fences. We'd have to wrap the assignment to toggle to get the release fence before and the acquire fence after, like so:
var lockObject = new object();
while (!complete)
{
    lock (lockObject)
    {
        toggle = !toggle;
    }
}
Clearly this lock block isn't really a critical section, because the lockObject instance isn't shared amongst threads; anyone reading this code is likely going to think "WTF?". But we do get our fences, the compiler will not optimize access to complete into just a register, and our code will no longer block at the call to Thread.Join. Thread.MemoryBarrier is the better choice in this scenario: it's more readable and doesn't look like poorly written code (i.e. code that depends only on side effects).
But, you still take the performance hit on "lock or". If you want to avoid that, then refactor the local complete variable into a field and decorate it with volatile.
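A minimal sketch of that refactoring, hoisting the local out of Main into a volatile static field (s_complete is just my name for it):

    using System;
    using System.Threading;

    public static class Program {
        private static volatile bool s_complete; // was the local "complete"

        public static void Main() {
            var t = new Thread(() => {
                bool toggle = false;
                while (!s_complete) { // volatile read: not cached in a register, no "lock or"
                    toggle = !toggle;
                }
            });
            t.Start();
            Thread.Sleep(1000);
            s_complete = true; // volatile write
            t.Join();
        }
    }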
Although some of this seems like micro-optimization, it's not. You have to be careful to "synchronize" shared atomic data with respect to compiler optimizations, so you might as well pick the best way that works.
In the next post I'll get into synchronizing non-atomic invariants shared amongst threads.