Avoid Caching

After you’ve developed software professionally for a while, you notice certain repeating patterns: Some solutions work repeatedly, while others repeatedly cause errors. One very popular but dangerous technique is caching. Naturally, caching is necessary in certain situations to achieve satisfactory performance. However, it can also cause a lot of problems when used poorly. By definition, cached data is potentially outdated data, as it isn’t read or calculated every time it is queried. This can lead to hard-to-understand bugs if the cache is not invalidated when the underlying data changes. Unfortunately, deciding when to invalidate a cache is proverbially one of the two hardest things in software engineering. As a result, bugs caused by outdated cache data are quite common.

How can we mitigate this problem? First of all, we should only cache data if it is absolutely necessary. Many developers are in love with premature optimization and introduce caches even though no data indicates a performance problem. Software performance is notoriously difficult to gauge by reasoning about the code in a vacuum, as there are too many variables. Instead, it needs to be measured under conditions that are as realistic as possible to get a clear picture. It is also critical to repeat the measurement multiple times, as a single measurement can easily be distorted. Once you have a decent number of data points, you can calculate the mean, the median, and the standard deviation. With this information in hand, you can see whether you actually have a performance issue. In the best-case scenario, you’ll discover that your worries were unjustified, and you can skip adding the cache completely.
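The measurement procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: `load_report` is a hypothetical stand-in for whatever operation you suspect of being slow, and the number of repetitions is an arbitrary assumption.

```python
import statistics
import time


def measure(func, repetitions=30):
    """Run func repeatedly and return the individual durations in seconds."""
    durations = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Condense the raw measurements into mean, median, and standard deviation."""
    return {
        "mean": statistics.mean(durations),
        "median": statistics.median(durations),
        "stdev": statistics.stdev(durations),  # needs at least two data points
    }


def load_report():
    # Hypothetical stand-in for the operation under suspicion.
    return sum(i * i for i in range(10_000))


durations = measure(load_report)
summary = summarize(durations)
print(summary)
```

A large standard deviation relative to the mean is a hint that the individual measurements are being distorted and that more repetitions, or a quieter test environment, are needed before drawing conclusions.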

If you do have a performance problem, you should add the cache and then repeat the measurements to find out how much performance you actually gain. You should only keep the cache if the gains are substantial, as small improvements aren’t worth the trouble of having to deal with cache invalidation and any potential bugs caused by outdated cache data.

Once you’re absolutely sure that you want a cache, you need to decide on a suitable cache level. The higher the cache level, the wider the scope of the cache, and the longer its time to live becomes. An example of a cache with a low cache level is a cache which only affects a single HTTP request made to a server by a single user. This can be realized, for example, by using a class member as a cache and by recreating the object when a new request is received. A dependency injection framework like Spring can make this very easy. If request-based caching isn’t enough to solve your performance problem, you need a cache on a higher cache level, e.g., on the session level or even on the application server level. As soon as the cache is shared between different users, data separation between these users becomes a must. Otherwise, users may read each other’s data, and invalidation becomes too frequent because the actions of a single user invalidate the cache for every other user. If your application is shared by different customers via some kind of tenant feature, you have to consider this in your cache as well. As you can see, things become more error-prone and complex on higher cache levels.
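To make the request-level cache more concrete, here is a minimal sketch in plain Python without any framework. The names `RequestContext`, `load_user_from_database`, and `handle_request` are made up for illustration. The point is only that the cache is a member of an object that is created anew for every request.

```python
class RequestContext:
    """Created once per incoming request and discarded afterwards."""

    def __init__(self, load_user):
        self._load_user = load_user  # the expensive lookup, e.g. a database query
        self._users = {}             # the cache lives only as long as this object

    def get_user(self, user_id):
        if user_id not in self._users:
            self._users[user_id] = self._load_user(user_id)
        return self._users[user_id]


calls = []  # records every real lookup so the effect of the cache is visible


def load_user_from_database(user_id):
    # Hypothetical stand-in for a slow query.
    calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


def handle_request(user_id):
    context = RequestContext(load_user_from_database)  # fresh cache per request
    context.get_user(user_id)
    context.get_user(user_id)  # served from the cache, no second lookup
    return context.get_user(user_id)


handle_request(7)
handle_request(7)
print(len(calls))  # one real lookup per request
```

Because the cache dies with the request, stale data can survive for at most the duration of a single request, which is why this level usually gets away without explicit invalidation.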

What can you do if caching on the application server level isn’t enough? It is not uncommon to have multiple application servers in a web application, and it would be great if cached data were available to every application server as soon as it has been calculated once. To achieve this, you need a caching service like Redis, which becomes the single source of truth for all application servers. To achieve suitable performance as well as suitable robustness, you probably want to use a whole cluster of Redis servers instead of a single one. However, using a cluster introduces another potential problem: Clusters that replicate data asynchronously, as Redis does by default, are only eventually consistent. If you’re unlucky, your query might hit a node that has not yet received the latest write and get back an outdated result. Hence, any data that has to be consistent must not be cached in a clustered caching server. This limits what you can achieve with a clustered caching service, but correctness trumps performance. As an alternative to a caching service, you can also use a persisted cache in a shared database. This only makes sense if the cached data is so expensive to calculate on the fly that reading a stored result from the database is clearly faster.

Once you’ve determined what cache level you need, you have to think about cache invalidation. In the best case, cache invalidation is not necessary at all because the cached data never changes. For example, if you’re caching the mapping from International Standard Book Numbers (ISBNs) to book titles, then you’ll never have to update the cache at all. However, if you’re caching the employment status of an employee in a human capital management application, you have to assume that it will change. Whenever the data has changed, you’ll have to invalidate the cache. The higher the cache level, the longer the time to live will be, and the more important explicit invalidation becomes. A very low-level cache, e.g., on a request level, might not need any cache invalidation at all, as its short time to live makes bugs caused by stale data unlikely. Caches on high cache levels, however, will probably have a long time to live, as their performance benefits would be negated otherwise, and will therefore most likely require explicit invalidation. This is especially true for caches in the database: As data in the database doesn’t get invalidated automatically, you’ll have to explicitly update it after each write to the input data. If the cache table is not updated in the same database transaction as the write, you might run into timing issues here as well, because readers can briefly see the new input data next to the old cached value.
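A minimal sketch of a database-backed cache that is updated in the same transaction as the write could look as follows. It uses an in-memory SQLite database purely for illustration; the tables `order_items` and `customer_totals_cache` as well as the function names are assumptions and not taken from any real application.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE order_items (customer_id INTEGER, amount INTEGER);
    CREATE TABLE customer_totals_cache (customer_id INTEGER PRIMARY KEY, total INTEGER);
""")


def add_order_item(customer_id, amount):
    # "with connection" commits both statements together or rolls both back,
    # so the cache table can never lag behind the input data.
    with connection:
        connection.execute(
            "INSERT INTO order_items VALUES (?, ?)", (customer_id, amount)
        )
        connection.execute(
            "INSERT OR REPLACE INTO customer_totals_cache "
            "SELECT customer_id, SUM(amount) FROM order_items "
            "WHERE customer_id = ? GROUP BY customer_id",
            (customer_id,),
        )


def get_total(customer_id):
    # Reads only the cache table, never the (potentially large) input table.
    row = connection.execute(
        "SELECT total FROM customer_totals_cache WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else 0


add_order_item(1, 500)
add_order_item(1, 250)
add_order_item(2, 100)
print(get_total(1), get_total(2))
```

The price of this approach is that every write becomes slower, as it also has to maintain the cache table. If that is not acceptable and the cache is updated asynchronously instead, you are back to dealing with a window in which the cached value is stale.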

It is also a good idea to periodically (e.g., every week) refresh all caches on a high cache level even if no explicit data change was detected, just to minimize the chance of bugs caused by stale data. It should also be easy to refresh these caches manually, if necessary, as this makes it much easier to diagnose a caching problem as well as to fix it. Some kind of self-service for the support team might be a good idea.
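Both ideas, the periodic refresh and the manual refresh, can be combined in one small class. The following is a sketch under simplifying assumptions: `RefreshableCache` is a made-up name, the cache is a plain in-process dictionary, and the clock is injectable only so that the behavior can be demonstrated without waiting a week.

```python
import time

ONE_WEEK_IN_SECONDS = 7 * 24 * 3600


class RefreshableCache:
    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader        # computes the value for a key
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}           # key -> (value, time it was loaded)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or self._clock() - entry[1] >= self._max_age:
            return self.refresh(key)  # missing or too old: reload
        return entry[0]

    def refresh(self, key):
        """Reload one entry immediately, e.g. triggered by a support tool."""
        value = self._loader(key)
        self._entries[key] = (value, self._clock())
        return value

    def refresh_all(self):
        """Reload every entry that is currently cached."""
        for key in list(self._entries):
            self.refresh(key)


# Demonstration with a fake clock and a dictionary as the "real" data source.
now = [0.0]
source = {"emp-1": "active"}
cache = RefreshableCache(source.get, ONE_WEEK_IN_SECONDS, clock=lambda: now[0])

cache.get("emp-1")                # loads "active" into the cache
source["emp-1"] = "terminated"    # the data changes without any invalidation
stale = cache.get("emp-1")        # the cache still answers with the old value
now[0] += ONE_WEEK_IN_SECONDS     # a week passes
fresh = cache.get("emp-1")        # the entry is too old and gets reloaded
print(stale, fresh)
```

The `refresh` and `refresh_all` methods are the hooks a self-service page for the support team could call.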

To sum up, we should follow these guidelines to get the maximum value out of caches with the minimum amount of pain:

Only cache when you have proof that it is necessary.
Ideally, only cache data which cannot change or rarely changes.
Choose the lowest cache level which could possibly work.
Make sure to use tenant and user as part of the key for caches on a higher cache level.
Make sure that cache keys never clash between different tenants.
Never store data that needs to be consistent in clustered caches.
Periodically evict data from caches with a high cache level.
Make it easy to refresh any cache with a long time to live manually in case any problems occur.
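The two guidelines about tenants and users can be illustrated with a small sketch of a shared cache. `CacheKey` and `SharedCache` are hypothetical names. The key is a structured tuple rather than a concatenated string, on the assumption that this is the simplest way to rule out clashes such as "ab" + "c" versus "a" + "bc".

```python
from typing import NamedTuple


class CacheKey(NamedTuple):
    tenant_id: str
    user_id: str
    name: str


class SharedCache:
    """A cache shared by all users and tenants of one application server."""

    def __init__(self):
        self._entries = {}

    def put(self, tenant_id, user_id, name, value):
        self._entries[CacheKey(tenant_id, user_id, name)] = value

    def get(self, tenant_id, user_id, name):
        return self._entries.get(CacheKey(tenant_id, user_id, name))

    def invalidate_user(self, tenant_id, user_id):
        # Only the entries of this one user are dropped; everyone else keeps theirs.
        stale = [
            key for key in self._entries
            if (key.tenant_id, key.user_id) == (tenant_id, user_id)
        ]
        for key in stale:
            del self._entries[key]


shared = SharedCache()
shared.put("acme", "alice", "dashboard", "acme dashboard for alice")
shared.put("globex", "alice", "dashboard", "globex dashboard for alice")

shared.invalidate_user("acme", "alice")
print(shared.get("acme", "alice", "dashboard"))
print(shared.get("globex", "alice", "dashboard"))
```

Even though both tenants have a user called "alice", their entries never collide, and invalidating one of them leaves the other untouched.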
