If it walks like a Database ...

And crawls like a Database, then as likely as not it's a modern Database. To be specific, I've spent the last few years working with MongoDB, partly because it's a new-style NoSQL database that suits web work better than SQL DBs, and partly because it interfaces really well with Python (despite its Javascript semantics) and doesn't need an intermediate ORM-type layer to make its use palatable. A few years back I tried to use MongoDB for a high-performance application, but to be honest the demands were so high that no database was likely to cut it. More recently I've been working on something with relatively modest performance requirements, and MongoDB still isn't there - which led me to look at alternatives.

Back to basics with LMDB

LMDB is a relatively straightforward KV store that was developed as part of the OpenLDAP project. It was sufficiently successful that it acquired a life of its own, turning into a generic storage engine which (based on my historical testing) beats pretty much everything else hands down within the scope of small to medium-sized general databases. There are other potential engines in this arena, but having evaluated a number of them, implementation aside I've not seen anything else that can match the LMDB design. I did spend quite a while evaluating LevelDB (the Google solution) and couldn't believe some of the shortcomings I encountered. Take a quick peek at the historical benchmarks just for an idea ... (I know, lies, damned lies and statistics, but it's a starting point..)
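
If you've not come across it, LMDB at this level really is just keys and values; a minimal session via the py-lmdb binding looks something like this (path and map size picked arbitrarily):

import lmdb

# open (or create) the memory-mapped database file
env = lmdb.open('/tmp/lmdb-demo', map_size=2**24)

# all access happens inside transactions
with env.begin(write=True) as txn:
    txn.put(b'user:1', b'{"name": "Fred Bloggs"}')

# readers get a consistent snapshot of the data
with env.begin() as txn:
    print(txn.get(b'user:1'))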

But a KV store isn't a Database ...

No, but it doesn't take a vast amount of work to wrap it in something that looks like a database. Now if you need a big centralised database system that's accessible over the web, maybe MongoDB or MariaDB is the solution for you; however, if you want a relatively self-contained database with a small footprint, accessed primarily from Python, there might be a more appropriate alternative.
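
To give you an idea of how little wrapping is involved, here's a toy sketch (emphatically not PyMamba's actual code) of a document "table" sitting on top of LMDB - all it really needs is key generation and serialisation:

import json
import uuid
import lmdb

class Table:
    """A toy document table over an LMDB key/value store."""

    def __init__(self, env):
        self._env = env

    def append(self, doc):
        # give each document a unique primary key
        doc.setdefault('_id', uuid.uuid4().hex)
        with self._env.begin(write=True) as txn:
            txn.put(doc['_id'].encode(), json.dumps(doc).encode())
        return doc['_id']

    def find(self):
        # iterate the whole table in primary key order
        with self._env.begin() as txn:
            for _, value in txn.cursor():
                yield json.loads(value)

env = lmdb.open('/tmp/table-demo', map_size=2**24)
users = Table(env)
users.append({'name': 'Fred Bloggs', 'age': 21})
for doc in users.find():
    print(doc)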

PyMamba

This is something I've put together for the project I'm currently working on, but in a form that can be reused elsewhere. Essentially it's a small library of functions to emulate the functionality I would expect from (specifically) MongoDB, linked to the LMDB storage engine and presented as a relatively consistent API. The code is available here on github under an MIT license.

Pros and Cons

Bear in mind that this is all from "inside" Python; if you want or need faster access you can use the C interface, which is so quick it'll blow your socks off.

  • Read performance - it's faster than MongoDB. Your mileage may vary depending on your data, but on my workstation a for loop over a recordset of 200-character items runs at around 200k records per second (see the rough timing sketch after this list).
  • Write performance - real-world tests indicate it's more than 10x faster than MongoDB, and yes, this is testing against a very recent version of Mongo using the WiredTiger storage engine (30,000+ writes per second on a table with indexes).
  • Memory footprint - looking at my system now I have a MySQL instance consuming 131MB of RAM, a tuned-down Mongo instance consuming 27MB, and a PyMamba instance with a .. no wait, no server needed, no overhead at all. So if you're running inside a Virtual Machine where memory is literally money, this can be a fairly critical consideration.
  • Compatibility - to be fair I've not had any problems getting MySQL or MariaDB running on any system in recent memory, and version-to-version compatibility is usually fairly solid. MongoDB on the other hand is a bit of a challenge to get going and needs different startup scripts for different versions of your distribution; getting version 3 running on recent versions of Ubuntu has been a royal PITA on a number of occasions. PyMamba on the other hand should run on pretty much any distribution without any special consideration.
  • Drivers - both Mongo and MySQL have drivers that allow them to be used from a number of different languages, but at this time PyMamba is purely Python 3.5+ (although you can easily get at the raw data directly via the C API).
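
For what it's worth, the read figure above comes from nothing more sophisticated than a timed loop; something along these lines (database and table names are hypothetical, and the table is assumed to be already populated):

import time
from pymamba import Database

db = Database('demo')
users = db.table('users')

count = 0
start = time.time()
for doc in users.find():    # full scan in primary key order
    count += 1
elapsed = time.time() - start
print('{} records in {:.2f}s ({:.0f}/sec)'.format(count, elapsed, count / elapsed))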

Ease of Use

Always an emotive and/or subjective question, but I'm going to go with PyMamba being more intuitive and easier to use, while MongoDB is more powerful and feature-rich. That said, the two aren't a million miles apart, and porting my current application from MongoDB to PyMamba was measured in minutes rather than hours. For a simple comparison, this is an example script to transfer tables from Mongo to Mamba.

from pymongo import MongoClient
from pymamba import Database

mongo_client = MongoClient('localhost')

databases = ['Livestats']
tables = ['users', 'sessions', 'servers']

for database in databases:
    src_db = mongo_client[database]    # source: the MongoDB database
    dst_db = Database(database)        # destination: the PyMamba database

    for table in tables:
        src_table = src_db[table]
        dst_table = dst_db.table(table)
        dst_table.empty()              # start with a clean destination table
        for doc in src_table.find():
            dst_table.append(doc)

So, things to note:

  • Both databases deal in terms of Python objects, i.e. both databases read / write Python dicts
  • Both databases are keyed off a unique ObjectId
  • Both databases work with the concept of databases and tables (or collections) [as does MySQL]
  • Indexing is completely transparent

In this instance, if I wanted to add indexes to the users table (where dst_users is the handle returned by dst_db.table('users')), it would simply be:

dst_users.index('by_name', '{name}')                     # index by field 'name'
dst_users.index('by_age', '{age:03}', duplicates=True)   # index by field 'age' (allow duplicates)
dst_users.index('by_age_name', '{age:03}{name}')         # compound index using age,name
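
Note that the second argument is just a Python format string applied to each record to generate the key: '{age:03}' zero-pads the age to three digits so that, compared as strings, the keys sort in numeric order, and concatenating fields as in '{age:03}{name}' yields a compound key.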

And to search there are many options:

# scan table in name order
for doc in dst_users.find('by_name'):  
    print(doc)

# scan for anyone who's 21
for doc in dst_users.seek('by_age', {'age': 21}):  
    print(doc)

# everyone aged between 18 and 21 in name order
for doc in dst_users.range('by_age_name',  
        {'age': 18, 'name': ''},
        {'age': 21, 'name': 'zz'}
    ):
    print(doc)

# everyone called Fred Bloggs
for doc in dst_users.seek('by_name', {'name': 'Fred Bloggs'}):  
    print(doc)

This IS different to the way Mongo does things and, in terms of the explicit naming of indexes, different to MySQL too. The reason is that in practice, the main sources of poor DB performance I've come across all stem from a reliance on the inbuilt query optimiser of the database concerned. Whether it's the index not being used, people misunderstanding how to use compound indexes, or simply indexes not being created, all can have dire consequences for the overall performance of the finished article. Forcing the programmer to explicitly choose the index for any given operation does mean the programmer needs to understand the data they're working with, but it also means you're going to get a higher quality and more predictable result.

Yes .. I am working on inter-machine replication ... in between other stuff .. :)
