AIO part deux - HA

So I forgot to mention the use-case that initially attracted me to AIO, and that was High Availability. If you have (say) 4 hard drives and you're spreading data across them for RAID-type functionality, then when you get a read request (say for 8 blocks) you'll be making a number of reads across all four devices.

The traditional method would look something like this;

  seek(dev1,block1)
  read(dev1,buffer1,4096)  // blocks here
  seek(dev2,block2)
  read(dev2,buffer2,4096)  // blocks here
  .. etc ..

So if the bandwidth of each drive is 150MB per second, you would see a total throughput across all four drives of only 150MB per second, because the reads happen one after another. Moreover, if one of the drives developed a fault, you would see a retry timeout on one of the operations that could take between 20 seconds and a minute .. effectively your application would freeze. Not really a viable solution.
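
In real code the seek and the read collapse into a single pread() per device, but the blocking behaviour is exactly the same. A minimal sketch of the above, assuming four already-open file descriptors and 4096-byte blocks:

#include <unistd.h>

#define BLKSZ 4096

ssize_t read_blocks_serially(int fds[4], const long long blocks[4], char bufs[4][BLKSZ])
{
    ssize_t total = 0;
    for (int i = 0; i < 4; i++) {
        /* pread() folds the seek and the read into one call; it blocks here
           until this one drive has answered */
        ssize_t n = pread(fds[i], bufs[i], BLKSZ, blocks[i] * BLKSZ);
        if (n < 0)
            return -1;   /* a faulty drive can stall this call for 20s or more */
        total += n;
    }
    return total;
}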

Back to AIO again and we have;

{
  array[0] = (dev1,block1,buffer1,4096,callback)
  array[1] = (dev2,block2,buffer2,4096,callback)
  ..
  aio_read(array)
  => program continues with no blocking
}
function callback(array_element) {
  // do something with the data
}

Using this mechanism, not only do we avoid blocking, but at OS level Linux is able to submit requests to all four devices in parallel, so we'll be getting close to 600MB per second. And of course, instead of 16 system calls (8 seeks, 8 reads) we're actually only making 1, and system calls are the other overhead to consider from a resource standpoint.
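
For concreteness, here's roughly what that submission step looks like against the real Linux interface (libaio's io_setup / io_prep_pread / io_submit). The device paths, offsets and 4096-byte block size are purely illustrative assumptions; treat it as a minimal sketch rather than production code.

/* link with -laio (needs the libaio development package) */
#define _GNU_SOURCE          /* for O_DIRECT */
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define NDEV  4
#define BLKSZ 4096

int main(void)
{
    /* illustrative device paths and offsets -- substitute your own */
    const char *paths[NDEV] = { "/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd" };
    io_context_t ctx = 0;
    struct iocb iocbs[NDEV], *list[NDEV];
    void *bufs[NDEV];

    if (io_setup(NDEV, &ctx) < 0) { perror("io_setup"); return 1; }

    for (int i = 0; i < NDEV; i++) {
        int fd = open(paths[i], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }
        if (posix_memalign(&bufs[i], BLKSZ, BLKSZ)) return 1;  /* O_DIRECT wants aligned buffers */
        /* one 4K read per device, each at its own offset */
        io_prep_pread(&iocbs[i], fd, bufs[i], BLKSZ, (long long)i * BLKSZ);
        list[i] = &iocbs[i];
    }

    /* a single system call queues all four reads; the kernel drives the
       devices in parallel and we carry on without blocking */
    int submitted = io_submit(ctx, NDEV, list);
    printf("submitted %d requests\n", submitted);

    /* ... later: io_getevents() collects the completions ... */
    io_destroy(ctx);
    return 0;
}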

HA

So, adapting this for High Availability, there is a very nice system call available on Linux called io_getevents. Taking the context of a RAID-type read over a number of devices with error detection and recovery, we can have something like this; (I'm pseudo-coding obviously)

{
  array[0] = (dev1,block1,buffer1,4096)
  array[1] = (dev2,block2,buffer2,4096)
  ..
  aio_read(array)
  => program continues with no blocking
}
thread read_handler() {
  timeout = <maximum time we will wait for a complete read>;
  int events = io_getevents(8,8,array,timeout);
  if(events != 8) do_error_and_recovery();
  else // do something with the data
}

In reality, io_getevents fills an array with the result of each completed operation, so you can scan that array to find out which operation failed, retry accordingly, and presumably mark that particular device as unavailable.
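
A minimal sketch of that completion side, assuming the eight reads have already been submitted on ctx as in the earlier example; the 200ms budget and the recovery hook are placeholder assumptions.

#include <libaio.h>
#include <stdio.h>
#include <time.h>

void reap_reads(io_context_t ctx)
{
    struct io_event events[8];
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 200 * 1000 * 1000 };  /* illustrative 200ms budget */

    /* wait until all 8 reads have completed, or the timeout expires */
    int completed = io_getevents(ctx, 8, 8, events, &timeout);

    if (completed < 8)
        fprintf(stderr, "only %d of 8 reads completed -- error and recovery path\n", completed);

    for (int i = 0; i < completed; i++) {
        struct iocb *req = events[i].obj;    /* the original request */
        long res = (long)events[i].res;      /* bytes read, or -errno */
        if (res < 0) {
            fprintf(stderr, "read on fd %d failed: %ld\n", req->aio_fildes, res);
            /* mark_device_unavailable(req->aio_fildes);   -- hypothetical recovery hook */
        } else {
            /* res bytes of valid data are sitting in req->u.c.buf */
        }
    }
}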

And the relevance today ...

Ok, so you don't end up writing low-level RAID drivers all that often. That said, traditional High Availability tends to involve multiple servers and failover of an entire service in the event of an outage. That procedure developed from an era of legacy system designs that predated the Internet and High Availability as concerns, with no consideration for commodity hardware, frequent connectivity issues or potentially millions of users.

HA through server failover is a bit of a dead duck these days; a better way to do failover is at application level.

What we're really looking at is Internet-aware applications with HA as a built-in feature. While there will always (currently) be a single point of failure (to an extent) for the application itself, failures in dependencies on other machines, or indeed on local devices, should be transparent to the end user.

We can get away from the single point of failure (mostly) using things like Round Robin DNS and keepalived to distribute multiple IPs over multiple servers (which also provides load-balancing functionality), but for front-end servers that will to a degree be dependent on back-end servers, we're still looking for some form of software-defined HA .. and we're back to AIO again.

Let's say we have a user-facing application that needs to reference a data store to obtain data for the user. We'll assume here that there are a bunch of replicated data stores in the background that we'll be accessing via TCP.

We can do this one of two ways, depending on the level of performance required for the user versus the level of resource used on the back-end.

Method # 1 - conserve resource

submit read request to server # 1
int events = io_getevents(maxwait=<x>ms)
if( !events ) {
  submit read request to server # 2
  events = io_getevents(maxwait=<x>ms)
}
if( !events ) // no back-end servers available
else // do something with the data

Method #1 is the most efficient, but if server #1 goes down, you'll get some added read latency on every read request, as each request tries server #1 first and only then goes to server #2. In reality you would maintain the servers in an array and, on detecting the first outage, just switch the order so that the last responding server is always tried first on the next pass. In that case you would only see an (x)ms delay at the point where one of the servers fails.
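
Something like the following sketch, where submit_read() and wait_events() are hypothetical wrappers around whatever async machinery you're actually using (io_submit/io_getevents or otherwise), and WAIT_MS stands in for the (x)ms budget.

#include <stddef.h>

/* hypothetical async primitives -- stand-ins for io_submit()/io_getevents()
   or whatever actually carries the request to the back-end */
int submit_read(int server, void *buf, size_t len);
int wait_events(int max_wait_ms);   /* returns number of completions within the window */

#define WAIT_MS 50                  /* assumed per-server wait budget, the <x>ms above */

int read_with_failover(int servers[], int nservers, void *buf, size_t len)
{
    for (int i = 0; i < nservers; i++) {
        submit_read(servers[i], buf, len);
        if (wait_events(WAIT_MS) > 0) {
            if (i > 0) {
                /* a server further down the list answered: promote it to the
                   front so the next request pays no extra latency */
                int good = servers[i];
                for (int j = i; j > 0; j--)
                    servers[j] = servers[j - 1];
                servers[0] = good;
            }
            return 0;               /* data is in buf */
        }
        /* nothing within WAIT_MS: fall through and try the next server */
    }
    return -1;                      /* no back-end servers available */
}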

Method # 2 - perfect HA

submit read request to server # 1
submit read request to server # 2
int events = io_getevents(maxwait=<x>ms)
if( !events ) // no back-end servers available
else // do something with the data

Method #2 is very wasteful, as it throws away half of the data it reads; however, it will give the user maximum performance at all times, even if one of the servers is down, slow, faulty or in some way not doing what it should in the timeframe it should. It's generally not a length you would need to go to, unless you're working in real time over very lossy connections.
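
As a sketch, reusing the same hypothetical submit_read()/wait_events() primitives and WAIT_MS budget from the Method #1 sketch:

int read_duplicated(int server1, int server2, void *buf1, void *buf2, size_t len)
{
    /* both requests go out immediately */
    submit_read(server1, buf1, len);
    submit_read(server2, buf2, len);

    /* the first completion within the window is the one we use; the slower
       reply (if it ever arrives) is simply discarded */
    if (wait_events(WAIT_MS) == 0)
        return -1;                  /* no back-end servers available */
    return 0;                       /* data is in whichever buffer completed first */
}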

Going a little further, if you're reading large amounts of information in this scenario, you could always request half from one server and half from the other, and apply the same mechanism as above in the event of a failure. In this instance, assuming you run the connections to the two servers over different network trunks, you could expect to roughly double your application's performance under normal conditions, at least in the context of acquiring data over the network.
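
Sketched with those same hypothetical primitives, striping the read across the two back-ends and falling back to the Method #1 behaviour if one half doesn't arrive in time:

int read_striped(int server1, int server2, char *buf, size_t len)
{
    size_t half = len / 2;

    submit_read(server1, buf, half);                /* first half from server 1 */
    submit_read(server2, buf + half, len - half);   /* second half from server 2 */

    if (wait_events(WAIT_MS) == 2)
        return 0;                                   /* both halves arrived: ~2x throughput */

    /* one half is missing: fall back to Method #1 and re-request the missing
       range from the server that did respond */
    return -1;
}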