I did some work a couple of years ago on a disk caching application for virtual machines, at the time inspired by some Open Source released by Facebook called FlashCache. As I saw it, although a nice tool (and I used it for a while) there were two major shortcomings.

  • It was client side only
  • It relied on an LRU cache, as do most caching applications

Not something often thought about, you hear the word "cache" and you immediately think of more speed, efficiency etc etc. However, there is a massive difference both in edge case performance and in implementation.


Least Recently Used. The implementation is simple, maintain a list of block numbers in the cache, when you read a new block, free the block at the head of the list and pop the head, then append the new block to the tail of the list. When you re-read a block pull it out of the list and stick it on the end again.

  • The good; if you are reading the same small subset of blocks repeatedly, this works well and provides a huge benefit.
  • The bad; Linux already has a page cache, if you have free memory, then you actually need quite a large cache to get any benefit at all. And of course LRU had no insight into how the cache is actually used, for example if you copy a large file (virtual machine image for example) it's likely to be larger than your cache, in which case, not only will the cache have no benefit, and not only will it wipe out your cache while copying, but the additional work it's doing will actually make it slower.

Least Frequently Used. Somewhat more complex, for a start you need to maintain a count against each cached block to indicate how often it's accessed. Not a major hassle, the issue comes when you want to expire a block, you need to locate the block with the lowest reference count.

  • The good; blocks that are referenced a lot are always cached as the cache is effectively aware of the systems usage patterns.
  • The bad; code becomes far more complex and uses more memory.
Is it worth the complexity?

It depends. If you have enough free memory then there is an argument for saying that you can survive without caching or with an LRU cache. That said, the page table sort of functions as an LRU cache already, so with enough memory an LRU cache is pretty much a waste of space. If you're going to add a cache, then it's really only worth going for the complex option.


Well the first requirement is that the cache needs an O(1) mapping of block number to cache slot, once you start trying to be clever the lookup time (i.e. getting from block number to cache slot) starts to become an issue.

Secondly, when we want a new block, this also needs to be an O(1) mapping to the next free slot because again, we're going to be doing this for every uncached block read.

Thirdly, we need to be able to increment read counts for any given block, again we need an O(1) because we're doing this for every read. (cached or uncached)

This is an awful lot of O(1)'s to be maintaining, however, there is actually a very simple and relatively elegant solution. The method goes like this;

  • Maintain a hash table for the main cache using block => cache slot, this gives you a fast initial lookup.
  • Hold a linked list of cached blocks storing the read frequency count for each block in the linked list node and indexing the list by frequency
  • Critically, link the cache slot to the linked list node (!)

So, for critical procedures;

  1. Add a block, we simply read the end of the linked list to recover the lowest read count and pop the list tail to give a free slot. When we insert the block, it had a read count of 1 so goes onto the end of the list.
  2. Read a block, we use the hash to recover the cache slot based on block number, then we use the link directly into the linked list to increment the read count for that block. We then compare the read count to the next highest entry in the linked list and if it's higher, swap the two entries - so we automatically maintain the sort order - with no sorting! (i.e. it's effectively an O(1) sorting algorithm!)

Indeed I went on to apply the same mechanism to writing and if you assume that your cache is running on SSD (which mine was) using it's persistency with buffered write-back gives a massive 'write' performance boost.

I initially had concerns about performance, not least as my first attempt to do this using Python and MongoDB topped out at around 1MB per second (well, it would've been a nice in theory). However when coded up in 'C' and using multiple SSD's on a RAID-0 stripe I was easily getting speeds of ~ 1GB / second on a single desktop CPU core.

Going overboard ..

Addressing the second weakness of client side caches, I ended up using my own server protocol and found an interesting extension to client side caching. If when you write a block, or indeed flush a block, you save the cached read count to a buffer and periodically save this buffer server side, you can enhance the read process by recovering the historical read frequency when a block is read, effectively pre-loading the read count for every block. This makes your cache usage patterns persistent so there's no learning curve required for an empty cache.

The downside here is that you need to convert the list into a tree so you can insert based on frequency count rather than a tail append, and you loose the efficiency of the tail pop for a new block - so it becomes a toss-up between overall performance and persistent cache usage intelligence.

Still thinking about that one ... the (now long-dead) associated site is here, but fundamentally the next time someone says cache it's worth noting the difference becauce 9 times out of 10, they'll be talking LRU, and the next time someone says "my hybrid drive never performs as well as I expect it to" ... well .. :-)