Replication - for real

A couple of weeks ago I reached a point with PyNNDB where 'replication' reached the top of the to-do list. As v2 of the library has been specifically written with replication in mind, and given I managed to cobble together a sort of working version for the version #1 library, I thought "hey, couple of days ...". Hmm.

My first iteration managed to duplicate a database to another local database; that looked good and only took a couple of days. My second iteration attempted the duplication over network sockets (ZeroMQ); that took a few more days. But then came the transformation into mesh replication, which has taken a couple of weeks, and while there's still a good bit of packaging to do, the core design now seems to cope with all the edge cases I can come up with, and pytest tells me it's working thus far.

Writing the code so that an entire mesh network of (n) nodes can be tested in its entirety using pytest, with no mocking, has proven to be quite interesting; it certainly feels nicer than writing PgSQL mocks or running unit tests against SQLite.
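The shape of such a test - a toy sketch only, with stand-in classes rather than PyNNDB's real Controller / replication machinery - looks something like this: real node objects, real message passing, real assertions, and no mocks anywhere.

```python
class ToyNode:
    """Illustrative stand-in for a mesh node; not PyNNDB's implementation."""
    def __init__(self):
        self.data = {}
        self.peers = []

    def connect(self, other):
        self.peers.append(other)

    def put(self, key, value):
        self.data[key] = value
        for peer in self.peers:          # "replicate" the change to peers
            peer.data[key] = value

def test_two_node_sync():
    # spin up real (in-process) nodes, write to one, assert the other converges
    a, b = ToyNode(), ToyNode()
    a.connect(b)
    a.put('name', 'Tom Jones')
    assert b.data == {'name': 'Tom Jones'}

test_two_node_sync()
```

The appeal of the pattern is that the whole topology under test is real code exercising real data flow, just confined to one process.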

Replication model - the story so far

While the diagram above shows just two nodes to demonstrate the communication pattern, we're working from the following principles;

  • Each mesh node is stateless and all configuration is dynamic via command channels
  • Node controllers and services are not connected to applications, just the data
  • Communication between nodes is PUB/SUB, so node numbers are 'flexible'
  • All socket based communication is via ZeroMQ (i.e. it's quick)
  • Nodes can be removed from the Mesh at any time
  • Nodes can be added to the Mesh at any time and will automatically re-sync
  • Nodes will automatically re-sync following a disconnect / reconnect
  • Each node can connect to one or more other nodes and each connection is "hot", this provides zero down-time and zero recovery latency in the event of a single connection or node failure.
  • Each node can source one or more databases
  • Each node can replicate one or more databases
  • Each replica can be a 1-1 read-only, 1-1 read-write, or n-1 read-only, the latter can be used as a real-time aggregation function to replicate changes from multiple remote databases into one local database.
  • All nodes talk to each other via the command channel, so connecting to the command channel via a local node allows you to interact with any node on the mesh with regards to adding replicas, starting and stopping replication, gathering status and monitoring information etc.
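The command-routing idea in that last point can be sketched with a stdlib-only toy (MeshNode and its flooding logic are illustrative stand-ins, not PyNNDB's actual implementation, which rides on ZeroMQ PUB/SUB sockets): each node re-publishes any command it hasn't already seen, so a command injected at one local node reaches every node on the mesh.

```python
import uuid

class MeshNode:
    """Toy mesh node: floods commands to peers, de-duplicating by message id."""
    def __init__(self, label):
        self.label = label
        self.peers = []
        self.seen = set()
        self.received = []

    def connect(self, other):
        self.peers.append(other)
        other.peers.append(self)

    def publish(self, command, msg_id=None):
        msg_id = msg_id or uuid.uuid4().hex
        if msg_id in self.seen:          # already handled - stop the flood
            return
        self.seen.add(msg_id)
        self.received.append(command)
        for peer in self.peers:          # re-publish to every neighbour
            peer.publish(command, msg_id)

# 'a' publishes a command; 'd' hears it despite no direct link to 'a'
a, b, c, d = (MeshNode(x) for x in 'abcd')
a.connect(b); b.connect(c); c.connect(d)
a.publish('STATUS')
print(d.received)   # ['STATUS']
```

The de-duplication set is what keeps the flood from looping forever once the topology contains cycles.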

Just to translate this into a bit of code, we can create (for local / testing purposes) a connected two node network with;

nodea = Controller(7000, 7002, 7004, label='nodea').start()
nodeb = Controller(7100, 7102, 7104, label='nodeb').start()

We can then get access to the command interface on one of the nodes with;

cmd = Plugin(nodea)

So far so good. Next we need a couple of databases to play with; as we're working locally this is relatively easy to demonstrate;

manager = Manager()
db1 = manager.database(path1, config={'replication': True})
db2 = manager.database(path2, config={'replication': True})

Now we're going to tell "nodea" to replicate its database to "nodeb", and we're going to do this by talking to the command interface we just set up, which is connected to "nodea".

# the uuid of the database we're replicating
uuid = db1.replication.uuid

# acquire the id of the 'other' peer ('nodeb')
reply ='LOOKUP', {'name': 'nodeb'})
peer = reply.get('peer')

# register (locally) the database we want to replicate'REGISTER', {'path': db1._path})

# register (remotely) the target for our replication'REPLICATE', {'path': db2._path, 'uuid': uuid}, target=peer)

Services will start automatically and data will sync automatically, so we should now be up and running. For two-way replication this code is just duplicated in reverse, if that makes sense; and because the REGISTER is essentially a publisher it only needs to be done once, so adding more nodes just means more REPLICATE commands.
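That REGISTER-once / REPLICATE-many relationship can be sketched as a single publisher fanning out to any number of subscribers (Publisher here is a hypothetical stand-in for the registered source, not a class from the library):

```python
class Publisher:
    """One REGISTERed source fanning changes out to any number of replicas."""
    def __init__(self):
        self.subscribers = []

    def replicate_to(self, replica):     # one 'REPLICATE' per new replica
        self.subscribers.append(replica)

    def append(self, record):            # writes flow to every subscriber
        for replica in self.subscribers:
            replica.append(record)

source = Publisher()                     # the single 'REGISTER'
replica_b, replica_c = [], []
source.replicate_to(replica_b)
source.replicate_to(replica_c)           # more nodes = more REPLICATE calls
source.append({'name': 'Tom Jones'})
print(replica_b == replica_c == [{'name': 'Tom Jones'}])   # True
```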

Just to test this all makes sense we can add some data. The command module has a useful testing feature called "diff" which, when supplied with the uuids of two database instances (i.e. a source and a replica), will stream the contents of each database into a local memory buffer, then use "deepdiff" to compare the two copies to make sure they're identical. So;

# Create a table, add an index, then add a record
people = db1.table('people')
people.ensure('by_name', '{name}', duplicates=True)
people.append(Doc({'name': 'Tom Jones'}))

# This will take (ns) but just to be sure replication is complete

# Now see what we have
source_uuid = db1.replication.uuid
target_uuid = db2.replication.uuid
assert cmd.diff([source_uuid, target_uuid]) == {}, 'database mismatch!'
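Conceptually the "diff" feature boils down to something like the following sketch, with deepdiff swapped for a minimal home-grown comparison and plain dicts standing in for the streamed database snapshots:

```python
def snapshot(db):
    """Stream a database's tables into a plain in-memory structure."""
    return {table: dict(records) for table, records in db.items()}

def diff(a, b):
    """Minimal stand-in for deepdiff: {} means the two copies are identical."""
    changes = {}
    for key in set(a) | set(b):
        if a.get(key) != b.get(key):
            changes[key] = (a.get(key), b.get(key))
    return changes

# two in-memory "databases" standing in for source and replica
source  = {'people': {1: {'name': 'Tom Jones'}}}
replica = {'people': {1: {'name': 'Tom Jones'}}}
print(diff(snapshot(source), snapshot(replica)))   # {}
```

An empty result means the replica has fully converged, which is exactly what the assert above relies on.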

So if we follow the white rabbit, how far does "mesh" take us with regards to a workable network topology? The answer is, I don't know yet, but it's certainly going to be "quite a way". One of the interesting features of this mechanism is that it copes with disparate connection lengths and multiple hybrid topological configurations; consider;

Example of two different mesh networks that could easily be just one ...

So mesh #1 would be an example of a resilient four node network that could survive an outage of up to two connections per node while still maintaining 100% capacity and near 100% performance. Mesh #2 is a slightly less intelligent nine node network with a different topology ... but if you were to connect node C to node A, and D to G, you would have one mesh with two distinct configurations within it, and the latter two links could easily be long-distance / intercontinental.
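Assuming mesh #1 is fully connected (each of the four nodes linked to the other three - an assumption, since the diagram isn't reproduced here), the resilience claim is easy to check by brute force: remove every possible pair of links and confirm the graph never partitions.

```python
from itertools import combinations

def connected(nodes, edges):
    """Simple reachability check: is the graph still in one piece?"""
    if not nodes:
        return True
    seen, queue = set(), [next(iter(nodes))]
    while queue:
        node = queue.pop()
        if node in seen:
            continue
        seen.add(node)
        queue += [b if a == node else a
                  for a, b in edges if node in (a, b)]
    return seen == set(nodes)

# Mesh #1 assumed fully connected: 4 nodes, 6 links
nodes = 'ABCD'
edges = list(combinations(nodes, 2))

# Kill every possible pair of links - the mesh always stays connected
survives = all(
    connected(nodes, [e for e in edges if e not in dead])
    for dead in combinations(edges, 2)
)
print(survives)   # True
```

A fully connected four node graph is 3-edge-connected, so any two link failures leave every node reachable; only a third simultaneous failure could isolate a node.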

Just to clarify how the above mesh would work: you could, for example, REGISTER a database on NODEB in MESH1, then create REPLICAs on MESH1/NODEA, MESH2/NODEE, and MESH2/NODEI, and the mesh routing would take care of all the connectivity.