Thursday, August 16, 2012

Augmented Data Structures: LRU Cache

New post after a long absence. I have been busy with my job search but it seems to have come to an end, so I have time to write some more and share some code. Here's a post about augmenting data structures. Enjoy!
Augmenting a data structure lets us add new operations to a textbook data structure such as a linked list, a tree, or a hash table. The basic operations of the structure should keep their original asymptotic running time, and the new operations should be efficient. A very good introduction to the topic can be found in the CLR textbook.
This post focuses on a simple data structure that basically mixes a hash table with a linked list. It can be used to implement a simple LRU cache. The idea is to cache only the N most recently used objects and discard the rest, i.e., the least recently used (LRU) ones. The runtime complexity of data retrieval from the cache should be constant: from an object design point of view, a cache object of the form Cache<Key, Data> could provide a get(Key) method that returns the corresponding Data object in O(1) time.
The first idea that comes to mind is to have a hash table do the heavy lifting of storing (Key, Data) pairs, which gives constant retrieval time. The LRU requirement is trickier. The first thing one might think of is to use a queue to store the N most recently accessed keys: when the size of the queue exceeds N, we pop it and remove that key from the table as well. This strategy, however, does not work. What if the same key were accessed N times? We would end up removing the only object in the cache! And if we store the same key in the queue N times, we are really only remembering the last object used, not the last N. Replacing the queue with a linked list and scanning the whole list on every access might fix this, but it makes the access time linear in the size of the cache, i.e., O(N), which completely defeats the purpose.
There are probably multiple ways to fix this, but I think the most elegant and efficient solution is to maintain a doubly linked list of (Key, Data) pairs and to store the nodes of the list in a hash table indexed by key. When the get method is called, the node can be retrieved from the table and moved to the front of the linked list. We can easily make sure that the list never exceeds N elements by cutting its tail (and removing the corresponding entries from the table), since the tail contains the least recently used keys.
Here's a Java implementation of this idea. The interface DataSource simply defines a get(Key) method returning a Data object.
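The interface itself isn't shown here; a minimal sketch of what it might look like, as assumed by the implementation below, is:
 public interface DataSource<Key, Data> {
      // look up the Data object associated with the given key
      Data get(Key key);
 }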
 import java.util.Hashtable;  
 public class LRUCache<Key, Data> implements DataSource<Key, Data> {  
      private final DataSource<Key, Data> source;  
      private final Hashtable<Key, Node> cache;  
      private final int maxRecords;  
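      // head holds the most recently used entry, tail the least recently used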
      private Node head = null, tail = null;  
      public LRUCache(DataSource<Key, Data> source, int maxRecords) {  
           this.source = source;  
           this.maxRecords = maxRecords;  
           cache = new Hashtable<>();  
      }  
      @Override  
      public Data get(Key key) {  
           if (cache.containsKey(key)) { // the data is in the cache  
                Node node = cache.get(key);  
                if (node != head) {  
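                     // unlink the node from its current position in the list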
                     if (node == tail)  
                          tail = tail.prev;  
                     if (node.prev != null)  
                          node.prev.next = node.next;  
                     if (node.next != null)  
                          node.next.prev = node.prev;  
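                     // re-link the node at the head of the list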
                     node.prev = null;  
                     node.next = head;  
                     if (head != null)  
                          head.prev = node;  
                     else  
                          tail = node;  
                     head = node;  
                }  
                // return the data  
                return node.data;  
           }  
           // retrieve the data from the source  
           Data data = source.get(key);  
           // put the data at the head of the list and into the table  
           Node node = new Node(key, data);  
           cache.put(key, node);  
           node.next = head;  
           if (head != null)  
                head.prev = node;  
           else  
                tail = node;  
           head = node;  
           // evict the least recently used entries from the tail of the list  
           while (cache.size() > maxRecords && tail != null) {  
                cache.remove(tail.key);  
                if (tail.prev != null)  
                     tail.prev.next = null;  
                else // the list is now empty  
                     head = null;  
                tail = tail.prev;  
           }  
           return data;  
      }  
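      // a node of the doubly linked list, holding one (Key, Data) pair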
      private class Node {  
           public Node(Key key, Data data) {  
                this.key = key;  
                this.data = data;  
           }  
           final Key key;  
           final Data data;  
           Node prev;  
           Node next;  
      }  
 }  
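
As a quick usage sketch (the Integer/String types and the anonymous source are made up purely for illustration), the cache simply wraps any other DataSource:
 public class Example {
      public static void main(String[] args) {
           // a trivial stand-in for a slow source such as a database
           DataSource<Integer, String> slowSource = new DataSource<Integer, String>() {
                @Override
                public String get(Integer key) {
                     return "value-" + key; // pretend this call is expensive
                }
           };
           DataSource<Integer, String> cached = new LRUCache<Integer, String>(slowSource, 100);
           System.out.println(cached.get(42)); // miss: fetched from the source, then cached
           System.out.println(cached.get(42)); // hit: served from the cache in O(1)
      }
 }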

It's worth noting that this implementation is NOT concurrent. Designing a concurrent LRU cache is a little trickier: if we synchronize the whole get() method, the cache might block for a long time while accessing the data source (say, a database), even though it could be serving clients that are asking for data that is already cached. A simple idea would be to synchronize only the blocks of code that touch the cache inside get(), and leave to the external DataSource object the task of synchronizing access to external resources. In the code above, the if statement in get() would have to be synchronized, and so would the block going from the creation of the new node to just before the return. If a concurrent hash map were used, the if condition itself wouldn't need to be synchronized, only the if block. A minimal sketch of this idea is shown below.
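
Here is what get() could look like under that scheme. This is only a sketch under the assumptions above: moveToHead(), addToHead() and evictTail() are hypothetical private helpers containing the same pointer bookkeeping as the code above, factored out for brevity.
      @Override
      public Data get(Key key) {
           synchronized (this) { // guard the table and the list together
                Node node = cache.get(key);
                if (node != null) {
                     moveToHead(node); // hypothetical helper: the re-linking code above
                     return node.data;
                }
           }
           // outside the lock: a slow source does not block clients asking for cached data
           // (note that two threads missing on the same key may both query the source)
           Data data = source.get(key);
           synchronized (this) {
                Node node = new Node(key, data);
                cache.put(key, node);
                addToHead(node); // hypothetical helper: the insertion code above
                evictTail();     // hypothetical helper: the eviction loop above
           }
           return data;
      }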

1 comment:

  1. Just to add to my own post, it's instructive to look at the Google Guava cache implementation. It provides different cache strategies (and combinations of them) and it allows setting a concurrency level for the cache, which basically creates a striped lock over the data, so that many clients can each lock only a part of the cache rather than the whole cache at the same time.
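
    For example, a rough sketch using Guava's CacheBuilder, CacheLoader and LoadingCache (from com.google.common.cache) might look like this; Key, Data and source are the same as in the post, and the size and concurrency level are just illustrative:
     LoadingCache<Key, Data> cache = CacheBuilder.newBuilder()
          .maximumSize(1000)   // keep at most ~1000 entries (approximately LRU eviction)
          .concurrencyLevel(4) // guides how many internal segments can be updated concurrently
          .build(new CacheLoader<Key, Data>() {
               @Override
               public Data load(Key key) {
                    return source.get(key); // fall back to the data source on a miss
               }
          });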
