Wednesday, March 28, 2012

Cache Solutions - The Overview

I've been following cache solutions for quite some time. The reasons being:
  • I need them
  • I see them as a big upgrade to the infrastructure of a system
  • They cost a lot of money and I'm determined to find a cheap working alternative.

This led me to do some research for a team in the company that I work for, BMC Software. I've also been corresponding with Johan Eriksson, from whom I've taken some very interesting insights that are part of this post. Finally, I'm presenting a session about cache solutions in AlphaCSP's The Edge seminar on March 29th, and I want to give the audience a place to go back and read about what I talked about, in case my Hebrew-speaking abilities were especially horrible that day. You can find the Prezi presentation here.

Let's get started! As with most things we do in life, everything starts because we have a problem. So our dynamics here will be:

I have a problem
     I have a dream - Non distributed caching
          My dream has problems
               I have a much more awesome dream - Distributed Caching
                    What is the market offering
                         Final Considerations


The problem: Why do we need caching?

The reason we start thinking about caching is speed. Databases are quite often the bottleneck of our applications. Databases can scale, but it's hard to obtain the response times we need in order to build a smooth application with the level of responsiveness we are after.


The dream: Non-distributed caching

So the first step is straightforward: take part of the data we use all the time and save it somewhere in our heap so it can be accessed extremely fast. We can do it ourselves with a Map that we maintain in one of our application instances, or we can use something like Ehcache, an open source solution. So why would we prefer Ehcache over the good old Map? Mainly because it manages all those in-memory maps for us, it avoids the possibility of generating memory leaks, and it supports various eviction policies, among other reasons.
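To make the comparison concrete, here is what the "good old Map" approach looks like as a minimal LRU cache built on the JDK's LinkedHashMap. This is a sketch only; Ehcache gives you the same behavior plus memory management, statistics and pluggable eviction policies:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // A naive LRU cache built on LinkedHashMap's access-order mode.
    // Not thread-safe: wrap with Collections.synchronizedMap for real use.
    public class NaiveLruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        public NaiveLruCache(int maxEntries) {
            super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries; // evict the least-recently-used entry
        }
    }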


My dream's problem

In small applications, the non-distributed solution can work quite well. However, if we are building a high-load enterprise application, we may encounter some issues:

GC issues

Garbage collection is one of the things we love about Java. But believe me, it can give you some serious headaches. When we store data in our heap, we occupy more and more space that the GC has to manage in order to dispose of our dead objects. Also, if we store and evict objects all the time, there will be a lot of promotions between the generations in our heap, and all of that is work for the GC. We'll find ourselves growing our heap to several gigabytes, and we can end up having full GCs that take more than 10 seconds, freezing our application completely. We were looking for high speed and suddenly we don't respond for more than 10 seconds? My dream has a problem.

Lack of failover

We start storing some valuable information in our cache and suddenly our Java process crashes (not because of Java, of course; someone must have logged into the server and used "kill -9"). We are facing two problems here: if our data wasn't persisted on disk, we just lost it. If it does exist somewhere else (e.g. the database), we can start our server again, but it will take some time to come online and then even more time to warm up while it fills the cache again.

We'll see that replicating our data provides a solution to these issues. And with replication comes something extremely important in the cloud era: high availability. In this day and age we don't have the liberty of letting our applications go offline anymore. If a server goes down, another one, completely synchronized with the state of the working one, should be ready to take over instantly.

Doesn't scale with my application or web servers

So we did our homework: we have a cluster of application servers that can scale along with our traffic. But these app servers make heavy use of our cache, and suddenly it can't keep up with the pace as we scale out. If we want to maintain our throughput, we need our cache to grow along with our app server cluster.

Can’t share information between servers

Every server requires more or less the same information. Because I can't share this data among servers, I end up imposing a big load on every server's heap to store almost the same information, which doesn't make a lot of sense. I may also be interested in sharing some data among my servers as a way to orchestrate parallel job execution (i.e. I have jobs to be executed and several servers that know how to run them, and I want the servers to pull jobs from a shared queue, as sketched below).
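As an illustration of that shared queue, here is a minimal sketch using the distributed queue of Hazelcast, one of the products covered below (API as in Hazelcast 3.x; the queue name and job format are made up):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import java.util.concurrent.BlockingQueue;

    public class JobWorker {
        public static void main(String[] args) throws InterruptedException {
            // Every server running this joins the same cluster
            // and sees the same distributed queue.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            BlockingQueue<String> jobs = hz.getQueue("jobs");

            jobs.offer("resize-image-42");  // any node can produce jobs...
            String job = jobs.take();       // ...and any node can consume them
            System.out.println("Processing " + job);
        }
    }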


My new (much more awesome) dream: Distributed Caching

It's clear that non-distributed caching is useful, but it doesn't provide the features required to build an enterprise-level caching system. Let's go through the goals we want to achieve using a distributed cache, and then the features we need to evaluate while choosing the right solution for our system.

Goals

Replication (High Availability, Failure resilience)

As we said, we can't allow ourselves to be offline anymore. $#!t happens and a server can go down, but hardware is relatively cheap (we can even pay only for what we use), so there is no excuse not to be prepared for such a situation. We want to replicate the data we hold in one node to other nodes of our caching cluster. That way, if a machine goes down, the cluster itself takes care of replicating the data again so the information stays safe. If that process runs smoothly, we've built ourselves a failover mechanism, and at the end of the day we can be sure that we'll be online no matter what happens.

Don't forget that this replication can happen with a server sitting in the same rack or with a server located in a datacenter far away from the one we are operating in. This way, we are building the infrastructure for a disaster recovery site that can take over seamlessly in case our main datacenter experiences issues.

Elasticity

We want to be able to scale out as our workload increases and then shrink when we are idle, in order to pay less for our servers. We also want to keep pace with the size of our application server cluster. Like the belly of a pregnant woman, we want to be elastic!

Geographical distribution

Quite often the systems we write have to be geographically distributed. How do I keep everything synchronized across my datacenters while achieving good performance at every site? Replication and partitioning are the answer a distributed cache offers for these requirements.

Communication/messaging backbone

It's nothing new that since we started scaling out (more servers) instead of scaling up (one more powerful server), the role of a communication backbone between our servers has become especially important. And here we have basically two options: a message broker or shared memory.

Message brokers have been around for a long time. WebSphere MQ is probably their grandfather, and more modern implementations such as SpringSource's RabbitMQ are widely used. Nevertheless, if we already have a caching solution plugged into our system, we can use this shared memory to achieve the same functionality.

Why is it so important to have a communication backbone between our servers? Because we understood that we don't want to buy a huge machine and have it working at 10% of its capacity just to be able to handle that 90% spike once a month. We'd rather buy a cluster of virtualized hardware that can grow or shrink depending on our needs; best of all, we pay for what we use but always keep the ability to grow as much as we want. And in a cluster of varying size, we need something to orchestrate that unknown number of servers working together, and that something can be our cache.
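To give a taste of the shared-memory style of messaging, here is a hedged sketch of publish/subscribe over a distributed topic, again with a Hazelcast 3.x-style API (the topic name and message are illustrative):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ITopic;

    public class ClusterEvents {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            ITopic<String> topic = hz.getTopic("cluster-events");

            // Every subscribed node receives every published message.
            topic.addMessageListener(message ->
                    System.out.println("Received: " + message.getMessageObject()));

            topic.publish("node-joined");
        }
    }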


Features

By this point in the post (if you made it here, of course), we should be clear about the benefits we can expect from a distributed cache solution. Let's go over the features that will make up our evaluation form when we go shopping for one of the solutions out there.


Built-in marshalling

This is, at least for me, an extremely important feature. Every time we think of an object coming in or out of our JVM process, the issue is the same: serialization. In our ideal world, we want to send objects away and get them back as alive as they were in our heap, without investing any work in it. That's why we want our caching solution to take care of the marshalling and unmarshalling. And once our cache is structure-aware, we also want to store more than a map in it: we want queues, and maybe lists, or whatever structure fits our needs, and that too can be a requirement for the solution.
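As a minimal illustration, with most Java solutions it's enough for the cached value to be serializable and the cluster handles the wire format. A sketch using a Hazelcast-style distributed map (the class, key and map names are invented):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import java.io.Serializable;
    import java.util.Map;

    public class MarshallingDemo {
        // The value only needs to be serializable; the cache does the rest.
        static class Account implements Serializable {
            final String owner;
            final long balanceInCents;
            Account(String owner, long balanceInCents) {
                this.owner = owner;
                this.balanceInCents = balanceInCents;
            }
        }

        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            Map<String, Account> accounts = hz.getMap("accounts");
            accounts.put("il-1234", new Account("Dana", 100_000)); // marshalled on put
            Account back = accounts.get("il-1234");                // unmarshalled on get
            System.out.println(back.owner);
        }
    }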

Replication and Partitioning

We talked about replication: it's the ability to have the same information stored in more than one node. We can do it to bring the data closer to the consumer, and also for high availability and failure resilience.

Partitioning is choosing which node is going to store which information. Let's say, for instance, that we are a bank with branches in Tel Aviv and Buenos Aires, and we want to cache the account information of our clients. The closer the information is, the quicker the access to it. Does it make any sense to store the accounts of our Tel Aviv customers in our Buenos Aires datacenter, or vice versa? The answer is no. Proximity to the information has a big say in how fast we can access it, and this kind of topology can be achieved only by partitioning our data in an adequate way (see the sketch below).
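How you express that affinity is vendor-specific. As one hedged example, Hazelcast lets a key control its own placement through the PartitionAware interface (the key class below is invented for the bank example):

    import com.hazelcast.core.PartitionAware;
    import java.io.Serializable;

    // Keys returning the same partition key land on the same member,
    // so all of a branch's accounts stay co-located.
    public class AccountKey implements PartitionAware<String>, Serializable {
        final String accountId;
        final String branch; // e.g. "TLV" or "BUE"

        public AccountKey(String accountId, String branch) {
            this.accountId = accountId;
            this.branch = branch;
        }

        @Override
        public String getPartitionKey() {
            return branch; // co-locate entries by branch
        }

        // equals() and hashCode() omitted for brevity; a real cache key needs them
    }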

Off the heap

The bigger our heap, the more impact it has on the performance of the garbage collector. We want our cache to grow without the freeze times that accompany the garbage collection of a big heap (bigger than 4 GB), and for that we need to take all those cached objects outside our application heap. That imposes the maintenance of a new server, a cache server (or, more probably, several servers), but it's worth it if we want to grow our cache to a size where it provides a real performance boost for our application.
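To give a feel for the mechanism (this is not how any particular vendor implements it), off-the-heap stores typically serialize entries into memory the collector never scans, for example a direct buffer:

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;

    public class OffHeapDemo {
        public static void main(String[] args) {
            // A direct buffer lives outside the Java heap: the GC tracks only
            // the small ByteBuffer handle, never the 64 MB of cached bytes.
            ByteBuffer store = ByteBuffer.allocateDirect(64 * 1024 * 1024);

            byte[] value = "serialized-account-record".getBytes(StandardCharsets.UTF_8);
            int offset = store.position();
            store.put(value); // a real store would keep an index of (key -> offset, length)

            byte[] copy = new byte[value.length];
            store.position(offset);
            store.get(copy); // reading back means deserializing from the raw bytes
            System.out.println(new String(copy, StandardCharsets.UTF_8));
        }
    }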

The other side of the coin is embedding. Some solutions keep the cache small enough to hold it in the heap of our own application. Having the ability to embed the cache server in our own application server means we can bring it up at bootstrap time and it won't require any further administration, while still providing the rest of the features mentioned in this list.

Compute grid

If we already have so much useful information in a distributed cluster, and techniques exist for running distributed jobs on such a cluster, why not make use of it? This is the idea behind providing an executor service that runs code using the map/reduce paradigm across all our nodes and crunches our data on the cluster we already have working. This is what some vendors propose, turning our data cluster into a compute grid.
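As a hedged sketch of the idea, here is a task submitted to a Hazelcast 3.x-style distributed executor; the task itself is invented and simply runs on whichever member the cluster chooses:

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IExecutorService;
    import java.io.Serializable;
    import java.util.concurrent.Callable;
    import java.util.concurrent.Future;

    public class GridJob {
        // The task is serialized and shipped to a member of the cluster.
        static class CountTask implements Callable<Integer>, Serializable {
            @Override
            public Integer call() {
                return 42; // a real task would crunch the data cached on that member
            }
        }

        public static void main(String[] args) throws Exception {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IExecutorService executor = hz.getExecutorService("default");
            Future<Integer> result = executor.submit(new CountTask());
            System.out.println("Result: " + result.get());
        }
    }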

Persistence

Whether it is for disaster recovery purposes, or because we are using our cache as the main storage and persisting to disk with a write-behind strategy, we want our data to be stored in the file system or a database. That is what persistence is about.
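As an example of how write-behind gets wired up, Hazelcast delegates persistence to a user-supplied MapStore. A bare-bones sketch (signatures as in Hazelcast 3.x; the SQL is left as comments, and write-behind itself is enabled in the map configuration, not in this class):

    import com.hazelcast.core.MapStore;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;

    // With write-behind enabled, the cache calls store()/delete() asynchronously
    // and in batches, so the database stops being the bottleneck.
    public class AccountMapStore implements MapStore<String, String> {
        @Override public void store(String key, String value) { /* INSERT or UPDATE row */ }
        @Override public void storeAll(Map<String, String> map) { map.forEach(this::store); }
        @Override public void delete(String key) { /* DELETE row */ }
        @Override public void deleteAll(Collection<String> keys) { keys.forEach(this::delete); }
        @Override public String load(String key) { return null; /* SELECT row */ }
        @Override public Map<String, String> loadAll(Collection<String> keys) { return Collections.emptyMap(); }
        @Override public Iterable<String> loadAllKeys() { return Collections.emptyList(); }
    }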

Transactions - Locking

Following the same idea of our cache becoming the storage the application talks to, our operations will often require transaction support. Locking objects and making several changes atomically is a requirement we will end up demanding from our cache solution.
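For instance, a Hazelcast-style distributed map can lock individual keys across the whole cluster, which gives us the atomic read-modify-write described above (a hedged sketch; the map and key names are illustrative):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class AtomicUpdate {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, Long> balances = hz.getMap("balances");
            balances.putIfAbsent("il-1234", 100L);

            balances.lock("il-1234"); // cluster-wide lock on this single key
            try {
                long balance = balances.get("il-1234");
                balances.put("il-1234", balance - 10); // read-modify-write, atomically
            } finally {
                balances.unlock("il-1234");
            }
        }
    }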

Multiple language support

Very often our system will be composed of components built in different programming languages. The data of our organization will (hopefully) be consolidated in a single datastore, and that's the one we want to cache. However, we want to give all the relevant components of our system access to this data, and that's why we'll find it very handy if our cache knows how to speak to different technologies.

Open source

Although it depends on the culture of each organization, the industry tends to go open source when it can. Whether for economical reasons, ideological ones, or just quality, open source is always the first option when we just want to prototype, and if the product is good enough, there is a big chance it will end up being part of our system.


What is the market offering?

Now that it's much clearer what we should evaluate in a caching solution, let's take a look at what the market has to offer.

I've chosen the group of caching solutions that, as far as I can tell, the community talks about the most. Please feel free to propose other relevant ones in the comments section so they can be included in the post.


Ehcache was one of the first, and is probably the most popular, non-distributed cache solution. Bought by Terracotta in 2009, it offers seamless integration with Terracotta's BigMemory, which adds all the off-the-heap and distributed features.


Bought by Oracle in April 2007, Coherence is the preferred robust caching solution, at least for those who can afford it.


Taking over from JBoss Cache, Infinispan showed up in 2008. Using the very well known JGroups for communication between nodes, Infinispan is JBoss' answer to distributed caching. It's distributed under an open source license.


Hazelcast is a Turkish-developed product that started in 2008. They offer an open source community edition as well as a paid enterprise one. They have generated a lot of buzz in recent months and are the first to present a list of clients using their product in production.


Focused on being a big data compute grid, GridGain also provides the features of a caching solution. Their efforts go into providing what they call "real time big data".


Memcached is one of the oldest, most robust and most widely used cache solutions. It is intended to store small chunks of arbitrary data (no marshalling). It's distributed free under an open source license, and although it lacks many features, it has been the platform on which many other products have been built.


Time to compare

As we are engineers (yes, I know, I sat way too long in a classroom and even longer at home), we love tables, so let's compare these products using the features we've already mentioned.


                           EhCache  Coherence  Infinispan        Hazelcast  GridGain  memcached
Compute grid               No       Yes        Yes               Yes        Yes       No
Search capabilities        Yes      Yes        Yes               Yes        Yes       No
Partitioning               Yes      Yes        Yes               Yes        Yes       No
Replication                Yes      Yes        Yes               Yes        Yes       No
Off the heap               Yes      Yes        Yes (not stable)  Yes        Yes       Yes
Transactions               Yes      Yes        Yes               Yes        Yes       No
Built-in marshalling       Yes      Yes        Yes               Yes        Yes       No
Multiple language support  Yes      Yes        Yes


How much is it going to cost me?


            Open Source / Community Edition  Enterprise Edition
EhCache     Yes                              Yes
Coherence   No                               Yes
Infinispan  Yes                              No
Hazelcast   Yes                              Yes
GridGain    Yes                              Yes
Memcached   Yes                              No



Final considerations

Maybe you just scrolled down to the end of the post because you saw that it's too long.

My personal advice on choosing a solution: give Infinispan and Hazelcast a try. As you can see, all the cache solutions are quite young, so there is still no dominant player. These two give us products with almost all the features we are looking for, with all the benefits of being open source. And if you need support, Hazelcast has an enterprise offering, and Infinispan is backed by JBoss, so I assume there will be an option for it too.

A distributed cache solution running as part of your system will provide many features that you'll need to offer sooner or later as part of your cutting-edge product. High availability, elasticity, a communication backbone, replication and geographical distribution, all for free because you were looking for a boost in your data access: that sounds to me like a great reason to dive into this world. Enjoy the ride, and please share your experience.

4 comments:

  1. Any reason you left GemFire out of the discussion? Would round the otherwise impressively detailed article off quite nicely :).

  2. Nice blog post! I work for SpringSource/VMware, so I'd be remiss if I didn't mention GemFire, a distributed data cache worthy of your consideration. However, I also wanted to encourage developers to check out Spring 3.1 and its fantastic cache abstraction which makes integrating many of the above caches - as well as any cache that lets you interface with it as a java.util.Map - as simple as declaratively annotating your service tier business methods. In addition, the abstraction will support JSR109 and it supports other caches not covered here like Redis through the Spring Data Redis integration. Thanks!

  3. Hey Oliver and Josh! I attended your sessions at the Spring I/O 2012 so it's a great surprise for me to find your comments here.

    I agree that GemFire can be added to the comparison and I plan to start working on it soon. To make things clearer for the readers: VMware acquired GemStone (the company behind GemFire) in May 2010 and integrated the product into their vFabric Cloud Platform, although SpringSource distributes GemFire for free.

    Josh, I think that when you mentioned Spring's cache abstraction you were thinking of JSR 107 (not 109), and I agree that until the API becomes part of Java, it's a good idea to use Spring's abstraction so we can switch cache solutions without introducing modifications to our code.

    Thanks guys!

  4. Thanks, very nice post.
