Atomic C

If you've ever tried working with multiple threads you'll know that inter-thread communication can be a real problem when it comes to synchronisation and shared data. As soon as you start to use mutexes or other locking primitives at scale, you start to wonder where all your CPU cycles are going, and why you're only seeing a fraction of the performance you thought you'd get when you worked it all out on paper.

On modern CPUs there is, however, a better way. The specs for the GNU wrappers (the __sync builtins) are in the GCC documentation, but essentially they provide a facility to do a full lock-change-unlock on a variable at CPU level, which is rather quicker than doing the same thing inside the threads library.

Consider the following example ring-buffer code:

void ring_push(ring_buffer_t* ring, void* data)
{
  ring_buffer_entry_t* entry;
  /* allocate and zero a new entry, then hang the caller's payload off it */
  entry = (ring_buffer_entry_t*)malloc(sizeof(ring_buffer_entry_t));
  bzero(entry,sizeof(ring_buffer_entry_t));
  entry->data = data;
  /* append it to the tail of the list */
  ring->tail->next = entry;
  ring->tail = entry;
  ring->entries++;
}

void* ring_pop(ring_buffer_t* ring)
{
  ring_buffer_entry_t* entry;
  if(!ring->entries) return NULL;
  ring->entries--;
  /* the current head acts as a sentinel; free it and advance */
  entry = ring->head;
  ring->head = ring->head->next;
  free(entry);
  /* the new head now carries the popped payload */
  return ring->head->data;
}
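
For reference, the struct definitions aren't shown above; they presumably look something along these lines (field and type names are inferred from the code, so treat this as a sketch rather than the original definitions):

typedef struct ring_buffer_entry {
  void*                     data;    /* caller's payload */
  struct ring_buffer_entry* next;    /* singly linked list */
} ring_buffer_entry_t;

typedef struct ring_buffer {
  ring_buffer_entry_t* head;         /* head and tail presumably both start out pointing at a dummy sentinel entry */
  ring_buffer_entry_t* tail;
  int                  entries;      /* the counter this post is all about */
} ring_buffer_t;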

When run in a single thread, this code works fine as-is, but if you try to use it to pass data between two threads, there is nothing to stop the (ring->entries++) and (ring->entries--) instructions being executed on two CPU cores at the same time, both referencing the same memory location.

If you deep-dive into what those instructions actually do, you'll find that although there are single auto-increment and auto-decrement machine code instructions, below machine-code level each of them has to do a "memory fetch", a "register operation" and then a "memory put". If you have two cores running in parallel, they can both get the same value on the "fetch", both carry out their "register operation" (one being +1, one -1), and then both try to write; one write simply gets overwritten by the other, and either way the resulting value in (ring->entries) ends up off by 1. The real killer is that it's only going to happen occasionally, and it's only going to happen under load.
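
If you want to see the effect for yourself, here's a minimal stand-alone sketch (nothing to do with the ring buffer itself; the file name and loop count are made up purely for illustration) that hammers a plain counter from two threads. On a multi-core machine the final value is rarely, if ever, the 0 you'd expect:

/* race_demo.c - two threads on one plain int; the result should be 0, but usually isn't */
#include <stdio.h>
#include <pthread.h>

#define ITERATIONS 10000000

static volatile int counter = 0;   /* volatile so the compiler doesn't collapse the loops */

static void* incrementer(void* arg)
{
  (void)arg;
  for(int i = 0; i < ITERATIONS; i++) counter++;   /* fetch, +1, store - not atomic */
  return NULL;
}

static void* decrementer(void* arg)
{
  (void)arg;
  for(int i = 0; i < ITERATIONS; i++) counter--;   /* fetch, -1, store - not atomic */
  return NULL;
}

int main(void)
{
  pthread_t t1, t2;
  pthread_create(&t1, NULL, incrementer, NULL);
  pthread_create(&t2, NULL, decrementer, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %d (should be 0)\n", counter);
  return 0;
}

Compile with something like gcc -O2 -pthread race_demo.c and run it a few times; the amount it drifts by changes from run to run, which is exactly the "only occasionally, only under load" behaviour described above.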

My first inclination was to think, "but, but, auto-increment and decrement are supposed to be atomic operations!", and of course they are .. on any given CPU core (!) A nice example of "it would've been good code when my hair wasn't so gray.." [or when a CPU was a CPU, rather than a whole bunch of CPUs].. So anyway, there's a really easy way to fix things: replace the auto-increment and decrement with the following:

__sync_fetch_and_add(&ring->entries,1);
__sync_fetch_and_sub(&ring->entries,1);
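
Dropping those same builtins into the earlier stand-alone counter sketch (again, just an illustration, not the ring-buffer code) makes the drift disappear. The thread bodies become the following, and the final count comes out at 0 on every run:

static void* incrementer(void* arg)
{
  (void)arg;
  for(int i = 0; i < ITERATIONS; i++)
    __sync_fetch_and_add(&counter, 1);   /* atomic read-modify-write; returns the old value, which we ignore */
  return NULL;
}

static void* decrementer(void* arg)
{
  (void)arg;
  for(int i = 0; i < ITERATIONS; i++)
    __sync_fetch_and_sub(&counter, 1);   /* ditto, but subtracting */
  return NULL;
}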

Problem solved! .. of course these aren't quite as quick as pure auto-increments, but when compared to threading mutexes you want these every time, just so long as you're running on a fairly modern CPU .. :-)
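
If you want to put a rough number on that, something along these lines will do it. It's a deliberately crude single-threaded sketch (the file name, loop count and output format are made up for illustration), and the results will vary a lot with CPU, toolchain and whether the mutex is contended, so treat it as indicative only:

/* cost_sketch.c - rough per-operation cost of plain vs builtin vs mutex increments */
#include <stdio.h>
#include <time.h>
#include <pthread.h>

#define LOOPS 100000000

static volatile int plain_counter  = 0;
static volatile int atomic_counter = 0;
static int          mutex_counter  = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static double elapsed(struct timespec a, struct timespec b)
{
  return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
  struct timespec t0, t1;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for(long i = 0; i < LOOPS; i++) plain_counter++;
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("plain ++             : %.2fs\n", elapsed(t0, t1));

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for(long i = 0; i < LOOPS; i++) __sync_fetch_and_add(&atomic_counter, 1);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("__sync_fetch_and_add : %.2fs\n", elapsed(t0, t1));

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for(long i = 0; i < LOOPS; i++)
  {
    pthread_mutex_lock(&lock);
    mutex_counter++;
    pthread_mutex_unlock(&lock);
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("mutex lock/unlock    : %.2fs\n", elapsed(t0, t1));

  return 0;
}

Compile with gcc -O2 -pthread cost_sketch.c (older glibc versions may also want -lrt for clock_gettime).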

Caveat: that's not to say that some modern threading implementations don't use these same atomic operations under the hood, but at the very least the builtins should resolve to direct assembler, rather than a threading mutex which will ultimately involve at the very least a shared library call, if not a system call.