To replicate, or not to replicate

Often a very important question when designing your database and applications... but what if it wasn't?

Current, common database systems like MySQL, Postgres, MongoDB and so forth all seem to have adopted a modern paradigm which involves disconnecting applications from the data they are intended to operate on, typically by way of a TCP socket or, at the very least, a Unix domain socket. This is great for security, partitioning of data and code, multi-user environments, modern micro-service design and PaaS twelve-factor app implementation.

What often seems to be overlooked is performance. While at a low level, over a local socket, most of these databases are pretty swift, once you extend access out onto a network, put all queries through a parser, then access the bundle from a p-code language, things start to look a little less rosy.

Back in the good old days (!) a database was a file, and to get hold of your data all you needed to do was read it off disk - and the data was yours! The problem there (typically) was that your data was available to one local application only, which really didn't play well in a network scenario, let alone on the Internet.

So, given we really want the speed of being able to access data locally without the network / multi-user overheads, what if it were possible to replicate that local file, independently of any applications using it, and at the same time supply a network locking mechanism which would provide for ACID transactions on replicated data?

Then, if we're going to access the data via a high-level -> C interface, what if we could code the majority of the database logic in the high-level language, making it easy to write and making the database part of your application, rather than having your application call out to an external database connection? After all, if each application has local access to the data, why would we want access via a socket?

The benefit of being able to read your data from a potentially memory-mapped buffer, read directly from a local SSD, rather than via a serialised network connection, should not be underestimated. While this is not a model for a general-purpose multi-user database system, it does look attractive from the perspective of a micro-service designed to act as a gateway to a specific dataset, providing access via a REST (or WAMP) API.

You're already thinking, "nah, what sort of idiot would try to write a database in Python thinking it would be faster than something written in 'C'/'C++' ?!" - and I'm with you, it would be daft to expect a database written in Python to outperform MySQL in the real world. But then again, I'd potentially be wrong right alongside you!

An implementation

My base implementation is written in Python and is built on PyMamba. The low-level database is provided by LMDB, a high-performance key-value store with a number of unique features, including the ability to handle ACID transactions across multiple sub-tables, and to facilitate coherent access by multiple concurrent processes.

Replication setup

First of all, write access inside PyMamba is wrapped such that anything that changes the database also writes that change into a table called 'binlog'. This write takes place 'inside' the user's existing transaction, so the real-world overhead is negligible. As part of each transaction commit, PyMamba increments a POSIX semaphore (again, the overhead is negligible) and that's pretty much the end of the story as far as the application is concerned. When writing 20k records, the combined overhead of appending to the 'binlog' and setting the semaphore isn't really measurable, so enabling replication essentially incurs no overhead.
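To make the shape of this concrete, here's a minimal, in-memory sketch of the idea - this is not PyMamba's actual code. A plain dict stands in for the main LMDB table, a list stands in for the 'binlog' sub-table, and `threading.Semaphore` stands in for the POSIX semaphore; all the names are my own.

```python
import json
import threading
import time


class ReplicatedStore:
    """Toy stand-in for a PyMamba-style store with a binlog."""

    def __init__(self):
        self._data = {}                         # stands in for the main table
        self.binlog = []                        # stands in for the 'binlog' table
        self.pending = threading.Semaphore(0)   # replication wake-up signal

    def put(self, key, value):
        """Apply a write, record it in the binlog 'inside' the same
        transaction, then bump the semaphore to wake the replicator."""
        entry = {"ts": time.time(), "op": "put", "key": key, "value": value}
        self._data[key] = value
        self.binlog.append(json.dumps(entry))   # negligible extra cost
        self.pending.release()                  # increment the semaphore

    def get(self, key):
        return self._data.get(key)


store = ReplicatedStore()
store.put("price:1", 250000)
store.put("price:2", 310000)
# A replication 'syncer' would block on store.pending.acquire() and then
# tail store.binlog for entries it hasn't yet shipped to a peer.
```

The point of the semaphore is that the replication process can sleep on it rather than polling the binlog, while the application never blocks on replication.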

Da Management

Moving on to the Manager processes: these are Python daemons employing the multiprocessing module. This ensures we don't get hung up on the GIL, and performance scales with the number of cores we provide rather than the CPU frequency. Note in the diagram above that there is no connection between the Manager processes and the applications - the only thing they have in common is access to the LMDB database files.

The Manager is responsible for maintaining a connection to all peers. Technically you could have any one of many network topologies, from a star to a ring, but currently I like the idea of a mesh. With a mesh, any single node failure will have zero effect, as every node has at least two peers. Multiple failures will still have minimal effect, as the dynamic routing (node to node) will find a way to replicate via any connected nodes.
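That resilience claim is easy to sanity-check. The sketch below (my own illustration, not part of the Manager code) builds the 3x3 grid mesh used later in this post, then removes each node in turn and confirms the surviving eight remain mutually reachable.

```python
from itertools import product


def grid_mesh(rows, cols):
    """Peers are horizontal/vertical neighbours, as in the A..I diagram."""
    peers = {}
    for r, c in product(range(rows), range(cols)):
        node = chr(ord("A") + r * cols + c)
        neigh = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
        peers[node] = {chr(ord("A") + nr * cols + nc)
                       for nr, nc in neigh
                       if 0 <= nr < rows and 0 <= nc < cols}
    return peers


def connected(peers, down=None):
    """Breadth-first reachability over the surviving nodes."""
    alive = set(peers) - {down}
    start = next(iter(alive))
    seen, todo = {start}, [start]
    while todo:
        node = todo.pop()
        for p in peers[node] & alive:
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen == alive


mesh = grid_mesh(3, 3)
assert all(len(p) >= 2 for p in mesh.values())     # every node has >= 2 peers
assert all(connected(mesh, down=n) for n in mesh)  # any single failure survivable
```

Corner nodes have two peers, edges three, and the centre four - so no single failure can partition the mesh.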

So each peer connection has two components: a listener and a syncer. The listener waits for updates from the peer and, on receipt, writes them to the local database (and binlog), while the syncer tails the binlog and sends any changes out to the peer. Each Manager can have (n) peer processes, each independently tailing the binlog for changes. When messages are passed forward, the nodes the message has already visited are included in the message, so when the next peer comes to re-broadcast the update, previously updated nodes are excluded. When a message arrives at a common node via two independent paths, the first is written to the database and the second is dropped. While this 'sounds' inefficient, network bandwidth (these days) is rarely the bottleneck in database replication, so this mechanism ensures maximum fault tolerance.
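The visited-list flooding can be sketched in a few lines - again, hypothetical names rather than the actual Manager code. Each copy of an update carries the set of nodes it has already been sent to; a node re-broadcasts only to peers not in that set, and drops any copy of an update it has already applied.

```python
from collections import deque

# The 3x3 mesh from the diagram below: peers are grid neighbours.
PEERS = {
    "A": {"B", "D"}, "B": {"A", "C", "E"}, "C": {"B", "F"},
    "D": {"A", "E", "G"}, "E": {"B", "D", "F", "H"}, "F": {"C", "E", "I"},
    "G": {"D", "H"}, "H": {"E", "G", "I"}, "I": {"F", "H"},
}


def flood(origin, update_id):
    """Simulate one update flooding out from `origin`.

    Returns (applied, dropped): which updates each node wrote,
    and how many duplicate arrivals were discarded."""
    applied, dropped = {}, 0
    queue = deque([(origin, {origin})])   # (node, nodes already visited)
    while queue:
        node, visited = queue.popleft()
        if update_id in applied.get(node, set()):
            dropped += 1                  # later arrival: drop, don't re-send
            continue
        applied.setdefault(node, set()).add(update_id)   # first arrival: write
        for peer in PEERS[node] - visited:
            # forward, telling the peer about everyone else we're sending to
            queue.append((peer, visited | PEERS[node]))
    return applied, dropped


applied, dropped = flood("A", "update-1")
# Every node applies the update exactly once; duplicate arrivals are dropped.
```

The redundant copies are the price of the fault tolerance: if any intermediate node dies mid-flood, some other copy of the update still gets through.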


So without diving too deeply into the code, the above mechanism is working to the point where it can be benchmarked. The initial implementation was based entirely on the Python multiprocessing module with shared Queues for message passing, but this was more of an exercise in 'what can it do' than a serious attempt. As expected, shared Queues in the multiprocessing module suck a little bit, so the second revision employs ZeroMQ for message passing - and this is a whole other kettle of fish.

For testing I'm currently using an Open Source database file courtesy of the UK Land Registry; it's around 17MB of CSV with 106,178 records. If I import this into a database with replication set up, my CSV parser, writing each record as a single isolated transaction, goes down to 11k transactions per second. If I batch up multiple records (say 50) into individual LMDB transactions, I can drive this up to ~21k transactions per second. At the same time, watching the replication thread, it is able to dynamically batch updates and distribute them at a rate of between 20k and 30k transactions per second, so realistically, running multiple applications against one database, I'm expecting a potential replication throughput of something towards 30k transactions per second.
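The batching trick is worth spelling out, since it roughly doubled my import rate. Here's a stand-alone sketch: instead of one commit per CSV row, group rows into chunks and commit each chunk as a single transaction. `commit` here is a stand-in callable, not the real LMDB transaction API.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def bulk_load(records, commit, batch_size=50):
    """Commit `records` in batches; returns the number of transactions."""
    txns = 0
    for chunk in batched(records, batch_size):
        commit(chunk)      # one transaction covering the whole chunk
        txns += 1
    return txns


store = []
count = bulk_load(range(106178), store.extend, batch_size=50)
# 106,178 records in batches of 50 -> 2,124 commits instead of 106,178
```

The win comes from amortising the per-commit cost (fsync, lock churn, semaphore bump) over 50 writes rather than paying it on every record.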

It doesn't sound like telephone numbers, but if you take a look at some online benchmarks, like MySQL High Availability, these numbers would seem competitive - not least as the replication is single-threaded and running on a six-year-old workstation.

Anyway, it still needs a good tidy-up and I still need to work on the network record locking, but in principle I think it has legs - not least as there is a lot of scope for re-coding parts of PyMamba in Cython, which could provide an interesting speed boost in many areas.


Here's a quick test of a 9-node mesh, all running on my workstation. Note that this heavily overloads the old six-core machine, i.e. it's doing 9x more work than a node is designed to do. All the same, loading in 106k records happens at around 7k records per second, with maybe a 1-2 second overrun on the synchronisation. All things considered - not least that it's all in pure Python with no C in sight - that's 1M (record) writes (over 9 databases) with fully meshed, resilient, auto-routed replication.

The mesh (logically) looks like this, with (for testing purposes) the data being injected into node "A".

   A - B - C        ; terminal sessions laid out to match topology
   |   |   |        ; A peers with B & D
   D - E - F        ; D peers with A & G & E
   |   |   |        ; E peers with B & D & H & F
   G - H - I        ; etc ...

An example route for the data to take to arrive at "I" might be any of:

  • A->B->C->F->I
  • A->B->E->F->I
  • A->D->E->F->I
  • A->D->G->H->I

The successful route is simply whichever path delivers the update first.
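Those candidate routes can be enumerated mechanically. The sketch below (my own illustration, not the Manager's routing code) runs a breadth-first search over the mesh and finds every shortest A-to-I route - six four-hop paths in total, the four listed above plus A->B->E->H->I and A->D->E->H->I.

```python
# Peers per the topology diagram: grid neighbours in the 3x3 mesh.
PEERS = {
    "A": {"B", "D"}, "B": {"A", "C", "E"}, "C": {"B", "F"},
    "D": {"A", "E", "G"}, "E": {"B", "D", "F", "H"}, "F": {"C", "E", "I"},
    "G": {"D", "H"}, "H": {"E", "G", "I"}, "I": {"F", "H"},
}


def shortest_paths(src, dst):
    """Enumerate all shortest routes from src to dst via BFS."""
    found, best = [], None
    queue = [[src]]
    while queue:
        path = queue.pop(0)                 # FIFO => paths in length order
        if best is not None and len(path) > best:
            break                           # past the shortest length: stop
        node = path[-1]
        if node == dst:
            best = len(path)
            found.append("->".join(path))
            continue
        for peer in sorted(PEERS[node] - set(path)):
            queue.append(path + [peer])
    return found


routes = shortest_paths("A", "I")
```

In the real mesh there's no central route computation, of course - all the equal-length paths race, and the first copy to arrive wins.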