January 2021

When Best Practices for caching goes wrong

One of the more challenging bugs I've encountered is widespread, industry-standard and looks like normal working code but with consequences serious enough slow to a crawl well provisioned big-enterprise systems.

With caching a common pattern I see is to check to see if the cache contains a value for a key. If it exists, get it from the cache, otherwise fetch the data from the database then save it into the cache.

In Node.js using Redis's node library and SQL such as PostgreSQL, this pattern looks like this:

redisClient.exists(key, (err, exists) => {
  if (exists) {
    redisClient.get(key, (err, value) => {
      resolve(JSON.parse(value));
    })
  } else {
    postgres.query({text: "SELECT * FROM accounts"}).then((results) => {
      redisClient.set(key, JSON.stringify(results.rows),  'EX',  '3600');
      resolve(results.rows);
    })
  }
})

'EX', '3600' in our redisClient.set() tells Redis our key has a time to live (TTL) of 1 hour. A better way to write this would be:

redisClient.get(key, (err, value) => {
  if (value !== null) {
    resolve(JSON.parse(value));
  } else {
    sql.query({text: "SELECT * FROM accounts"}).then((results) => {
      redisClient.set(key, JSON.stringify(results.rows), 'EX',  '3600');
      resolve(results.rows);
    })
  }
})

We've removed the if (exists), which minimizes calls to Redis, but also eliminates the possibility that our key expires in the short time between our redisClient.exists() call and our redisClient.get().

This is usually a good enough solution ā€“ I would recommend stopping here unless you've profiled and found your app needs even more performance.

Let's say the above code exists in a function called getOrSetCache(). Get it if it's in the cache, set it if it's not, resolve with the value. This getOrSetCache() implementation does have one surprisingly common edge case where our caching falls short and we query the database more often than we need to.

for (let i=0; i<5000; i++) {
  getOrSetCache(key);
}

It's a cache right? You couldn't fault somebody for assuming wrongly that these 5000 calls should only result in 1 database query.

The issue is in these lines:

redisClient.get(key, (err, value) => {
    ...
    sql.query({text: "SELECT * FROM accounts"}).then((results) => {
        ...
        redisClient.set(key, JSON.stringify(results.rows), 'EX', '3600');

Each one of these callback functions (err, value) => {} is delayed as we make the network requests to Redis and SQL. That means that in between the short amount of time we call get(), query the db, then set(), other processes calling the same code will execute their get() call and find the cache empty, triggering more database calls.

Node.js' event loop will execute the get() part of our function 5000 times, each of them sending network requests to Redis who diligently finds nothing. Then SQL would be queried, 5000 times. This sort of bug can bottleneck the database, especially on repeated complex queries.

This sort of performance bug isn't unique to Node.js. By using a cache that interfaces via network/api calls, a delay is introduced. Even if we used a synchronous in-memory cache, there's a delay while we wait on our database query, any thread checking the cache will not see it populated.

Sometimes in an app, especially one that handles data sources from multiple databases, a column from one database might reference an UUID for rows stored in a completely separate database. In these cases, it's easy to end up writing code that loops over values in this column, and ends up making duplicated database calls to the other database.

User input can also lead to cascading database calls. One of the more interesting performance bugs Iā€™ve seen in my career was while I was contracting with Microsoft, which I'll get to in a moment.

First, how do we solve this caching issue? We need to add batching to our cache. Thankfully, this isn't hard to do with promises.

const batched = {}
function batch(key, promiseFunction, args, thisObj) {
  return new Promise(async (resolve, reject) => {
    if (batched[key] === undefined) {
      batched[key] = [{resolve: resolve, reject: reject}];
      try {
        const value = await promiseFunction();
        batched[key].forEach((promise) => {
          promise.resolve(value);
        })
        delete batched[key]
      } catch (e) {
        batched[key].forEach((promise) => {
          promise.reject(e);
        })
        delete batched[key];
      }
    } else {
      batched[key].push({resolve: resolve, reject: reject});
    }
  })
}

The first call to batch(key) will call promiseFunction and await for it to resolve with a value. Any calls to batch(key) in the meanwhile with the same key will also wait on and resolve when that first promiseFunction resolves.

It might be tempting to add batching to the code that queries the database, but the proper place is as part of the cache like so:

async function getOrSetCache(key, dbQueryPromise) {
  return await batch(key, () => {
    return new Promise((resolve, reject) => {
      redisClient.get(key, async (err, value) => {
        if (err) {
          reject(err);
          return
        }

        if (value !== null) {
          resolve(JSON.parse(value));
        } else {
          try {
            const results = await dbQueryPromise();
            redisClient.set(key, JSON.stringify(results.rows), "EX", "3600", (err) => {
              reject(err);
            });
            resolve(results.rows);
          } catch (e) => {
            reject(e);
          }
        }
      })
    })
  })
}

The inside bulk of the code should look familiar. It's called like this:

const rows = await getOrSetCache('key', async () => {
  return await sql.query({text: "SELECT * FROM accounts"});
})

You can even have the key be the query string to the database itself.

const queryString = "SELECT * FROM accounts";
const rows = await getOrSetCache(queryString, async () => {
  return sql.query({text: queryString});
})

Redis supports arbitrary strings as keys, so your queries will be cached with the query string as the key.

For PostgreSQL users, it can be helpful to have Node.js log all of your database calls to determine how many calls your app makes.

Only batch on the cache

It's also a bad idea to over-use batching. Generally you should only pair it with a cache. Why? I'll illustrate using an extreme example: What if you batch all database calls?

It sounds very appealing. Wouldn't you save in performance if duplicate queries only triggered one database call?

On reflection it's obvious why it's bad. Repeated INSERT will disappear. Transactions may not reflect updated states of the database.

It only takes a few lines of modern javascript to monkeypatch every query we make to be batched:

// DO NOT use this code, your queries will lose ordering
// postgres.oldQuery = postgres.query;
// postgres.query = async (...args) => {
//   return await batch(args[0].text, async () => {
//     return await postgres.oldQuery(...args);
//   })
// }

postgres.query({text: "SELECT username FROM accounts;"}).then();
postgres.query({text: "INSERT INTO accounts (username) VALUES ('Jimmy Page')"}).then();
postgres.query({text: "SELECT username FROM accounts;"}).then();

The 2nd SELECT here could get batched, returning with the same values as the 1st SELECT. Potentially we may not see user Jimmy Page in our 2nd SELECT. Although our database may have levels of isolation, our queries do not.

However there is one place in our applications we already make the assumption that our data may be out-of-date. The cache. Use batching on the cache, where it belongs.

In the real world

At Microsoft, there is a dashboard used by hundreds of employees on a daily basis. Users complained about slowness so I was brought in to help.

These dashboards would query various databases whose results were cached in Redis with a default 24 hour time to live. Many of these queries were -large-, written by data scientists and researchers and could take upwards of 15 minutes to complete. Even worse, each page would often have duplicate queries because each page contained up to several dozen charts, each one customized to trigger a database call on load.

The 24 hour ttl on the cache keys would naturally synchronize to people's work schedule so they would start expiring around 7-8am, causing cache misses... and database calls. Lots of them.

The psychological effect of this was disastrous. The first person to load the dashboard in the mornings, could find it took a few minutes for the first chart to load. But the code, even with a cache, was showing loading spinners on other widgets that all make the exact same database call. So what does the user do? They hit refresh. Because they've been trained by the UI to expect a refresh to populate these empty charts.

These refresh's would trigger an even bigger cascade of database calls for the remaining cache misses. Imagine being the 4th person in the morning to load the page and see cache misses. The first 3 user's queries are still in progress. They are waiting at the end of an enormous queue of queries, and their charts won't pull cached data until they hit refresh enough times that they don't see a cache miss.

In the meanwhile, the poor databases were chugging along, fulfilling large workload duplicate requests that could have been batched away into a single call.

Using a combination of server-side batching and an optional cache auto-refreshing, our team eliminated the bottleneck, to the delight of the research teams.

Summary

Caching is an optimizing technique. It's best done once you have metrics on performance. Consider batching with your cache as a solution to large volumes of the same database call.

Sometimes code can't be cached, because your app needs to be up-to-date with what's in the database. While you may still optimize your code to remove redundant queries, know that in these cases, these successive reads are a good thing. Your app is maintaining its state against the database, doing what it's supposed to be doing.