So a couple of years ago, while playing around with microservices, I came across a requirement for a local, fast, key-value database that could handle a few more advanced features, like automatic secondary indexes and compound keys. The net result was PYNNDB.

Now as far as it goes this is fine: everything is stored on disk as JSON, so the programmer essentially reads and writes Python dicts, while at the same time having access to a feature set that approximates something like "dbase".

One remaining issue however is speed. If we're doing a linear scan through a table, the speed limit is still in the low hundreds of thousands of rows per second, primarily because once a row has been read, you need to convert the serialised JSON data into a Python dict (via "json.loads") before Python can use it. This can be mitigated a little by using the likes of "ujson", however there is still the fundamental issue that converting JSON to a Python dict is a time-consuming process, not least because it creates a bunch of Python objects, an operation that in itself becomes significant once you start repeating it millions of times per second.
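To make that cost concrete, here's a minimal stdlib-only sketch (my own, not part of PYNNDB) timing the per-row pattern: every access to a field pays for a full "json.loads", i.e. a brand new dict plus new Python objects for every key and value.

```python
import json
import time

# A representative row as it would be stored on disk (serialised JSON).
row = json.dumps({'name': 'Fred Bloggs', 'age': 21})

# Per-row cost: every access pays for a full json.loads.
n = 100_000
beg = time.perf_counter()
total = 0
for _ in range(n):
    total += json.loads(row)['age']   # deserialise the row, read one field
elapsed = time.perf_counter() - beg

print(f'{n / elapsed / 1_000_000:.2f}M rows/sec, total={total}')
```

The rate you see is dominated by object construction inside "loads", not by reading the bytes, which is exactly the bottleneck described above.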

Enter ZLMDB. This is kind of cool and runs along the same lines as PYNNDB, but has the ability to use flatbuffers to serialise data. The benefit here is that you can read data out of a "flatbuffers" record without de-serialising the entire record, i.e. you can read out just the data you need, which makes it "much" faster. The serious downside is that flatbuffers requires a compiled schema, so it would not typically be classified as a NOSQL implementation.

So .. the solution would seem to be: we want to be able to read from a serialised structure, but at the same time we need that structure to honour schema-less semantics, so that we still adhere to NOSQL principles. It turns out that not only is this possible, it's also pretty quick; indeed, although it's not "finished", it's currently looking a little faster than the compiled "flatbuffers" solution.
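To illustrate the idea (this is a toy sketch of my own; the names, byte layout and type tags here are invented for illustration and are not gpack's actual format): if each field in the packed record carries its own key, type tag and length, then no schema is needed, and a single field can be read straight out of the buffer by skipping over the entries we don't care about, without ever building a dict.

```python
import struct

def pack(record: dict) -> bytes:
    """Pack a flat dict into length-prefixed entries.
    Schema-less: each entry carries its own key, type tag and lengths."""
    out = bytearray()
    for key, value in record.items():
        k = key.encode()
        if isinstance(value, int):
            v, tag = struct.pack('<q', value), b'i'
        else:
            v, tag = str(value).encode(), b's'
        out += struct.pack('<HH', len(k), len(v)) + tag + k + v
    return bytes(out)

def lazy_get(buf: bytes, key: str):
    """Read one field directly from the buffer -- no dict is built,
    and entries we don't want are skipped, not decoded."""
    want = key.encode()
    pos = 0
    while pos < len(buf):
        klen, vlen = struct.unpack_from('<HH', buf, pos)
        tag = buf[pos + 4:pos + 5]
        k = buf[pos + 5:pos + 5 + klen]
        if k == want:
            v = buf[pos + 5 + klen:pos + 5 + klen + vlen]
            return struct.unpack('<q', v)[0] if tag == b'i' else v.decode()
        pos += 5 + klen + vlen
    return None

buf = pack({'name': 'Fred Bloggs', 'age': 21})
print(lazy_get(buf, 'age'), lazy_get(buf, 'name'))
# -> 21 Fred Bloggs
```

A real implementation (like gpack's C extension) would use a hash lookup rather than a linear scan, but the principle is the same: the serialised form is directly readable, with no compiled schema in sight.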

Raw Speed

The first thing to establish is: if we create a structure to effectively replace the "dict" we would have received from "json.loads", how much slower is it going to be? Currently I have a module called "gpack" that implements three different classes;


    # GHASHTABLE - a C extension implementing a fast C-level hash lookup table
    # GCODEC - converts a Python dict into our new packed format
    # GOBJECT - a wrapper around our packed data
    
    JSON = {
        'name': 'Fred Bloggs',
        'age': 21
    }
    htab = gpack.GHASHTABLE()
    codec = gpack.GCODEC(htab)
    assert codec.encode(JSON) is True
    obj = gpack.GOBJECT(htab)
    obj._setbuffer(codec.buffer)
    
    print(obj.name, obj.age)
    
    => Fred Bloggs 21

So typically, in a database scenario looping through the rows of a table, we would read a row, use "_setbuffer" to point at the data read from the row, and "obj" would then be the top-level object through which we access the attributes within that row. So, removing all the encoding, decoding and database machinery from the equation, how does our new class compare to a raw dict in terms of speed;

        JSON = {
            'name': 'Fred Bloggs',
            'age': 21,
            'mylist': ["one","two","three","four",5],
            'mydict': {
                'a': 1,
                'b': 2,
                'c': {
                    'nested': 'this is a long string',
                    'num': 9321,
                    'list': [{'a': 1}, {'b': 2, 'c':3}]
                }
            },
            'amt': 1.23,
            'end': '****',
        }
        htab = gpack.GHASHTABLE()
        codec = gpack.GCODEC(htab)
        assert codec.encode(JSON) is True
        obj = gpack.GOBJECT(htab)
        obj._setbuffer(codec.buffer)
    
        beg = time.time()
        total = 0
        max = 1000000
        for x in range(max):
            total += obj.age
        end = time.time()
        rate1 = max/(end-beg)/1000000
        print(f'Time1: "{end-beg:2.6}" , Cycles: {rate1:2.02}M/sec')

        encoded = ujson.dumps(JSON)
        obj = ujson.loads(encoded)   # parse once up front; we're timing raw dict access
        beg = time.time()
        total = 0
        max = 1000000
        for x in range(max):
            total += obj['age']
        end = time.time()
        rate2 = max/(end-beg)/1000000
        print(f'Time2: "{end-beg:2.6}" , Cycles: {rate2:2.02}M/sec')
        print(f'Compare RAW dict speed: {round(rate1/rate2,2)}x')

Gives us;

Time1: "0.146585" , Cycles: 6.8M/sec
Time2: "0.145299" , Cycles: 6.9M/sec
Compare RAW dict speed: 0.99x

So for our test case, GOBJECT (which reads data directly from within a serialised object) comes in at pretty much the same speed as a raw Python dictionary .. so we're not really losing anything speed-wise by substituting our new custom object for dicts.

Compared to JSON "loads"

So now, what happens when we emulate reading from a database? With JSON encoding, we need to "loads" the buffer to de-serialise it before we can access it, whereas with GOBJECT we can read data directly from the serialised structure. (note we're doing 10x fewer iterations for the JSON loop, just to speed things up a little)

        beg = time.time()
        total = 0
        max = 1000000
        for x in range(max):
            obj._setbuffer(codec.buffer)
            total += obj.age
        end = time.time()
        rate1 = max/(end-beg)/1000000
        print(f'Time1: "{end-beg:2.6}" , Cycles: {rate1:2.02}M/sec')

        encoded = ujson.dumps(JSON)
        beg = time.time()
        total = 0
        max = 100000
        for x in range(max):
            obj = ujson.loads(encoded)
            total += obj['age']
        end = time.time()
        rate2 = max/(end-beg)/1000000
        print(f'Time2: "{end-beg:2.6}" , Cycles: {rate2:2.02}M/sec')
        print(f'Best Speed increase: {round(rate1/rate2,2)}x')

Gives us;

Time1: "0.420047" , Cycles: 2.4M/sec
Time2: "0.335924" , Cycles: 0.3M/sec
Best Speed increase: 8.0x

So if you allow for the fact that GOBJECT is still being tweaked, there's already an 8x performance boost in there .. which is pretty substantial. By its very nature GOBJECT is mutable rather than immutable, which means you can modify attributes within the structure without creating a new Python object; this also carries some huge performance benefits when you need to update your data.
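The update benefit is easy to demonstrate with a rough stdlib sketch (again my own illustration, not gpack's code): rewriting a fixed-width field in place inside a buffer, versus the JSON route where changing one field means rebuilding the entire record.

```python
import json
import struct
import time

n = 100_000

# Mutable path: a fixed-width integer field at a known offset can be
# rewritten in place inside a bytearray -- no new objects are created.
buf = bytearray(struct.pack('<q', 21))    # stand-in for a packed 'age' field
beg = time.perf_counter()
for i in range(n):
    struct.pack_into('<q', buf, 0, i)
t_inplace = time.perf_counter() - beg

# Immutable path: with JSON, updating one field means rebuilding the
# whole record -- loads, mutate, dumps -- every single time.
encoded = json.dumps({'name': 'Fred Bloggs', 'age': 21})
beg = time.perf_counter()
for i in range(n):
    record = json.loads(encoded)
    record['age'] = i
    encoded = json.dumps(record)
t_rebuild = time.perf_counter() - beg

print(f'in-place update is ~{t_rebuild / t_inplace:.0f}x faster')
```

The exact ratio depends on record size (the bigger the record, the worse the rebuild path gets), but the in-place path is effectively constant-time per field.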

Next steps ...

It'll be interesting to see how fast PYNNDB gets once it too is coded as a C extension. So far, for the above test (i.e. sum(age)), it looks like a scan rate of around 10M rows per second (on my test machine) is potentially achievable.