At Honeybadger, we use Redis a lot. It's our Swiss Army knife: a cache, a single source of truth, a store for background jobs, and more. Basically, Redis is one of those services that should never fail.
I was pondering the DevOps apocalypse recently, as one does (could Redis be one of the four horsemen?), which led me to jump into our #ops channel to ask Ben a simple question:
josh [13:58]
what are the risks if someone executed flushall on our redis instances?

Ben [13:58]
you just gave me a micro heart attack

josh [13:59]
rofl
Sorry :)
I didn’t do it, for the record

Ben [14:00]
all kinds of badness would happen... the job queues would be flushed, there would be a potential for duplicate notifications, timeline charts would start hitting ES instead of being cached, and more :)
it would make for a very bad day :)
Don't worry, Ben recovered after a few hours, and is mostly back to his old self again. I should have prefaced my question; I didn't mean to suggest that I'd actually flushed our Redis cluster. Still, it kind of proved my point. Maybe trolling your SREs and measuring their sweat is a good way to plan for catastrophes...
SO. I'd identified a potential issue. It would be really bad if Redis got flushed. What are the risks of that happening?
Our Redis clusters are deployed with primary and secondary instances across multiple availability zones in AWS, with automatic failover to the secondary in case of primary failure. That's a pretty rock-solid Redis deployment; we can lose entire instances without losing data or even impacting other services.
Unfortunately, preventing human error is much harder, and for some reason, Redis makes it dead-simple to delete all your data with a single command entered in the wrong console. Our architecture did not guard against that. While our team is aware of this, there's a pretty good chance that a future developer could make this mistake.
In fact, my friend Molly Struve, Sr. SRE at Kenna Security, remembers a situation where something similar happened:
We made a change to some code that caused the old cache values in Redis to break with the new code. So we would request the old value and it was not what the code was expecting. Rather than rollback the code, one of our engineers thought it would be fine to run `Rails.cache.data.flushdb` and just start with a fresh cache.
Like Honeybadger, Kenna uses Redis for several things; one is the cache backend for their Ruby on Rails application. The command Molly mentioned, `Rails.cache.data.flushdb`, is the Ruby on Rails equivalent of opening up a Redis console and calling `FLUSHDB` (which deletes all data in the current database).
Unfortunately, Redis was also being used to cache report data from Elasticsearch (something that we also do at Honeybadger, incidentally), and that's where things went wrong. When the Redis database was flushed, the cache had to be rebuilt from scratch, which overwhelmed their Elasticsearch cluster:
We have a "Dashboard" page where clients can load part of ALL their reports (think hundreds), and when clients started to hit that without the cache, Elasticsearch lit up like a Christmas Tree. CPU maxed out on all nodes across the board. In the end it was a mad scramble to open multiple consoles to re-cache the reports.
After Kenna's systems were restored, Molly worked with the development team to identify steps they could take to prevent the same thing from happening in the future. They came up with a creative safeguard for new developers who might not realize that clearing the Rails cache is a destructive action: they made all production application consoles read-only by default.
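To make the idea concrete, here's a minimal sketch of one way to approximate that kind of safeguard: wrap the cache store so destructive operations are refused unless an operator explicitly opts in. This is not Kenna's actual implementation; the `GuardedCache` class and `ALLOW_CACHE_CLEAR` variable names are my own illustrative assumptions.

```ruby
# A hedged sketch (not Kenna's actual implementation): wrap the cache so
# that clearing it fails loudly unless an operator explicitly opts in.
# The GuardedCache class and ALLOW_CACHE_CLEAR env var are illustrative.
class GuardedCache
  def initialize(store)
    @store = store
  end

  # Reads and writes pass straight through to the underlying store.
  def read(key)
    @store.read(key)
  end

  def write(key, value)
    @store.write(key, value)
  end

  # Destructive: refuse unless the operator has explicitly opted in.
  def clear
    unless ENV["ALLOW_CACHE_CLEAR"] == "1"
      raise "Refusing to clear the cache; set ALLOW_CACHE_CLEAR=1 to override"
    end
    @store.clear
  end
end

# Demonstration with a simple in-memory stand-in for the real cache store:
class InMemoryStore
  def initialize
    @data = {}
  end

  def read(key)
    @data[key]
  end

  def write(key, value)
    @data[key] = value
  end

  def clear
    @data.clear
  end
end

cache = GuardedCache.new(InMemoryStore.new)
cache.write("report:1", "cached report data")

begin
  cache.clear # raises: the guard refuses without the override flag
rescue RuntimeError => e
  puts e.message
end

puts cache.read("report:1") # the data survived the attempted clear
```

The point isn't the specific mechanism; it's that a destructive action should require deliberate, unmistakable intent rather than a muscle-memory keystroke.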
Luckily, Kenna's incident was not catastrophic—they were able to recover from it after just a little downtime. It would have been much worse had the unlucky developer accidentally called `FLUSHALL`, which flushes all Redis databases, instead of `FLUSHDB`. It would have been an easy mistake under pressure, especially when exception reports are already rolling in (did I mention they also use Honeybadger?).
Let me ask you a quick question: what would happen if someone called `FLUSHALL` on your Redis console?
If the answer is "all hell would break loose", then you might consider taking preventative action. Here's what we went with; of course, this is us (I'm more than a little paranoid)—your mileage may vary.
First, access to Redis through clients (i.e., in the Rails console) should disallow use of the `FLUSHALL` and `FLUSHDB` commands entirely. Developers never need to run these commands in production; doing so would cause serious problems, so why have them at all?
If you're a Ruby/Rails user, feel free to steal this gist. If you want something a bit more comprehensive, see Molly's gist. If you use a different programming language, hopefully there is a way to disable these commands, but even if there isn't, don't worry, I got you.
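As a rough sketch of what such a guard can look like (the module name and the stand-in client class below are my own illustrations, not taken from either gist), you can prepend a module to the Redis client class that raises whenever one of the destructive commands is called:

```ruby
# A rough sketch of a client-side guard; the module and class names here
# are illustrative, not taken from the gists mentioned above.
module DisableDestructiveCommands
  %i[flushall flushdb].each do |cmd|
    define_method(cmd) do |*_args|
      raise NotImplementedError, "#{cmd.upcase} is disabled in this environment"
    end
  end
end

# In a Rails initializer you might do something like:
#   Redis.prepend(DisableDestructiveCommands)
# To keep this sketch self-contained, here's a stand-in client instead:
class StubRedisClient
  def flushall; "OK"; end
  def flushdb;  "OK"; end
  def get(key); "value:#{key}"; end
end
StubRedisClient.prepend(DisableDestructiveCommands)

client = StubRedisClient.new
begin
  client.flushall
rescue NotImplementedError => e
  puts e.message # the guard intercepts the call before it reaches Redis
end
puts client.get("jobs") # non-destructive commands still work normally
```

Because `Module#prepend` puts the guard ahead of the client's own methods in the lookup chain, the destructive calls never reach Redis at all, while every other command behaves as usual.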
The Redis config that everyone should use
Mike Perham, the creator of Sidekiq (by far the most popular Ruby background job system, with an awesome business model), knows a thing or two about Redis. Sidekiq is built on top of Redis to provide an incredibly reliable and efficient job system.
I asked Mike what best-practices he recommends to his customers, many of which have mission-critical Redis deployments (think Netflix and Oracle). He told me that users who are concerned about the safety of their Redis data should disable destructive commands entirely via Redis's configuration file.
This approach has the added benefit that the commands are disabled everywhere, including in `redis-cli` consoles. The following config should be added to `redis.conf`:
```
rename-command FLUSHALL ""
rename-command FLUSHDB ""
rename-command CONFIG ""
rename-command SWAPDB ""
```
Renaming the above commands to empty strings means that they will no longer exist as Redis commands. If you still want to be able to call them in rare (and intentional) circumstances, you can rename them to something secret:
```
rename-command FLUSHALL SUDO_FLUSHALL_222ed15a
rename-command FLUSHDB SUDO_FLUSHDB_2a3bdd5e
```
For instance, you could put those commands in a doomsday ops playbook which only your operations team has access to. Treat them like your company's nuclear codes.
Of course, there's always a caveat 🤦
We use Amazon's ElastiCache service to host our Redis clusters, and after some research, I learned that ElastiCache does not provide direct access to `redis.conf`, nor does it provide a Redis configuration parameter for `rename-command`. So unfortunately, while our application consoles are safe, we still must handle `redis-cli` with care.
In the end, we added a note about this to our internal Redis playbook, and we will revisit the ElastiCache documentation occasionally to see if Amazon gives us access to `rename-command`.
Failures are inevitable
If it were possible to prevent 100% of failures before they occur, our jobs would be much easier. We wouldn't need on-call rotations or postmortems, and we could all code full-time. Unfortunately, we live in the real world, where chaos rules, and entropy ensures that our systems are constantly deteriorating.
There's risk inherent in everything we do. To ship stable applications, we should take actions which minimize risk. In doing so, we reduce (but not eliminate) the potential for failures.
Being able to evaluate the risks associated with your actions dramatically increases your value as a developer.
Molly's story and others convinced me that the same thing could easily happen to us; the (perceived) risk was high. The solution—disabling potentially destructive commands, or making them extremely difficult to execute—was relatively easy.
There's a name for the combination of high value and minimum effort: low-hanging fruit. In the context of software, it's the idea that if you can gain a lot by making a small change, it's probably worth doing. This felt like it was in the sweet spot of eliminating a big risk for a small amount of effort. Kind of like the first time I installed Honeybadger... ;)