Background tasks are one of the core pillars of building web applications that scale. The basic idea is simple: a client makes a request to your web application, and in handling that request, your app performs several time-expensive tasks. To respond to the client faster, the app enqueues a background job with a background processing system. That system is then tasked with all the heavy lifting, like computations or I/O operations. Leaning on background jobs effectively is one of the most important building blocks when scaling your web applications.
As Rails developers, we are blessed with several fantastic libraries to choose from, all with different advantages, disadvantages, and even backend databases. These libraries make it easy to offload any heavy lifting, allowing our applications to respond faster and serve more users with fewer resources.
Until recently, we used Sidekiq to perform the majority of Honeybadger's background processing jobs. It has been instrumental in helping us maintain a blazing-fast user experience and a robust pipeline for processing the massive amounts of data we ingest.
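To make the pattern concrete, here's a minimal Sidekiq-style sketch (illustrative only, not our actual worker): the request handler enqueues the job and responds right away, while the worker does the heavy lifting out of band.

class NoticeWorker
  include Sidekiq::Job

  def perform(payload)
    # ... parse, persist, and fan out the payload ...
  end
end

# Somewhere in the request path:
NoticeWorker.perform_async(payload)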
Where does Sidekiq fall short for us?
Despite its reliability, Sidekiq's dependency on Redis has some limitations. During peak traffic, Redis can face significant memory pressure, leading to potential data loss. By default, Redis/ElastiCache uses the volatile-lru eviction policy.
The consequence is that under high memory pressure, Redis will evict the least recently used keys that have a TTL set. At one point, Honeybadger's memory usage climbed high enough that Redis started evicting data when we didn't intend it to. Fortunately, the evicted data was reproducible (hence the TTL), so we didn't lose any permanent data.
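If you want to confirm which policy an instance is running, the redis gem can ask for it directly. A quick sketch (the connection URL is a placeholder, and the exact return format varies between gem versions):

require "redis"

# Placeholder connection; point this at your own instance.
redis = Redis.new(url: ENV.fetch("REDIS_URL", "redis://localhost:6379"))

# Returns something like {"maxmemory-policy"=>"volatile-lru"}.
redis.config(:get, "maxmemory-policy")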
Still, this left us with questions that needed to be addressed: What happens if a similar event occurs and we start losing data we can't rebuild? How can we ensure that we never lose customer data? Handling customer error data is the core of our business, so we need a system that is resilient to data loss.
Using Kafka for ingesting data
Kafka is a distributed event pipeline that offers both scalability and resiliency. With our recent launch of Insights, we gained plenty of experience standing up Kafka as the infrastructure to process our event data. After that, we wanted to use the same technology stack to process our error ingestion data. By using Kafka, we aimed to achieve redundant storage, better expandability, and a more affordable cost.
Since we were already running our own AWS MSK cluster for Insights, we had the infrastructure and autoscaling in place. This meant we "just" had to set up a few topics and create a few consumers that ran the same code as our Sidekiq workers did. The concept was pretty simple, which let us focus more on the fine-tuning of our Kafka consumers.
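For illustration, a minimal karafka.rb along these lines wires a topic to a consumer; the broker settings, client id, and topic names below are placeholders rather than our production configuration:

# karafka.rb
class KarafkaApp < Karafka::App
  setup do |config|
    # Placeholder settings; not our production values.
    config.client_id = "honeybadger_ingestion"
    config.kafka = { "bootstrap.servers": ENV.fetch("KAFKA_BROKERS", "localhost:9092") }
  end

  routes.draw do
    # Each topic gets a consumer that runs the same code the Sidekiq
    # workers did.
    topic :"ingestion.notices" do
      consumer NoticeConsumer
    end
  end
end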
Migrating from Sidekiq to Karafka
Honeybadger is architected as a majestic monolith, and Karafka helps us keep that same architecture. We have already been using Karafka to process some of our Insights data, so adding on a few new consumers was a simple task.
One of the main differences between Karafka and Sidekiq is how jobs are retrieved. With Karafka, jobs are batched and processed together in a single consumer run. In the consumer, we can iterate over the array of messages and run the Sidekiq worker inline:
class NoticeConsumer < ApplicationConsumer
  def consume
    messages.each do |message|
      NoticeWorker.new.perform(message.payload)
    end
  end
end
Another difference we had to consider is how error handling works. With Sidekiq, each job is atomic, so a worker handles its own errors through retries and failure callbacks. With Kafka's batching behavior, there are more options for handling errors. Most notably, Karafka provides a mechanism called the Dead Letter Queue, which allows you to specify error handling on a batch or individual basis.
dead_letter_queue(
  topic: "ingestion.errors.dead",
  max_retries: 5,
  independent: true
)
Whenever the Karafka consumer fails to process an individual message, it'll attempt to reprocess it up to 5 times. If the 5th attempt also fails, the message is sent to the specified topic. The independent: true option tells the consumer that only the failed message needs to be sent to the DLQ rather than the entire batch.
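For context, the dead_letter_queue call lives inside the topic definition in the routing. A sketch with placeholder names:

routes.draw do
  topic :"ingestion.notices" do
    consumer NoticeConsumer

    # Failed messages are retried up to 5 times, then shipped to this topic.
    dead_letter_queue(
      topic: "ingestion.errors.dead",
      max_retries: 5,
      independent: true
    )
  end
end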
Monitoring and scaling Karafka
As it turns out, monitoring and scaling Karafka consumers is pretty complicated. There are many things you can track from both AWS/MSK and Karafka, and many knobs you can turn to tune your system. It requires careful attention to what your code is doing and how your data flows.
With AWS CloudWatch we monitor a lot of things, but here are a few Kafka-specific metrics we look at (a sketch of querying one of them follows the list):
- SumOffsetLag — For a specified topic and consumer group, this is the sum of the offset lag across all partitions.
- EstimatedMaxTimeLag — For a specified topic and consumer group, this is the estimated time it'll take to catch up to the current offset across all partitions.
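As a rough sketch, pulling SumOffsetLag out of CloudWatch with the aws-sdk-cloudwatch gem might look like this (the cluster, consumer group, and topic names are placeholders):

require "aws-sdk-cloudwatch"

cloudwatch = Aws::CloudWatch::Client.new

# MSK publishes consumer lag metrics under the AWS/Kafka namespace.
resp = cloudwatch.get_metric_statistics(
  namespace: "AWS/Kafka",
  metric_name: "SumOffsetLag",
  dimensions: [
    { name: "Cluster Name", value: "my-msk-cluster" },
    { name: "Consumer Group", value: "ingestion-consumers" },
    { name: "Topic", value: "ingestion.notices" }
  ],
  start_time: Time.now - 300,
  end_time: Time.now,
  period: 60,
  statistics: ["Maximum"]
)

resp.datapoints.map(&:maximum)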
Karafka also provides some great instrumentation, but you'll need to publish this data and store it yourself (one way to do that is sketched after the list):
- processing_lag — This value is available for every consumed batch of messages. It tells you the time it took for Karafka to take the messages from Kafka and start processing them.
- consumption_lag — This value is similar to processing_lag, except that it's the time from when the last message of the batch entered the Kafka system up until your consumer starts to process it.
- duration — This is the time it takes for a consumer to process an entire batch of messages.
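Here's a rough sketch of publishing these by subscribing to Karafka's monitor, assuming your Karafka version exposes processing_lag and consumption_lag on the batch metadata; STATSD is a stand-in for whatever metrics client you use:

# In an initializer. STATSD is a placeholder for your metrics client.
Karafka.monitor.subscribe("consumer.consumed") do |event|
  consumer = event[:caller]
  metadata = consumer.messages.metadata
  tags = { topic: metadata.topic, partition: metadata.partition }

  STATSD.timing("karafka.processing_lag", metadata.processing_lag, tags: tags)
  STATSD.timing("karafka.consumption_lag", metadata.consumption_lag, tags: tags)
  STATSD.timing("karafka.consume.duration", event[:time], tags: tags)
end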
Scaling Sidekiq processes is also very different from scaling Karafka consumer processes. When increasing Sidekiq's parallelism, you can add as many processes as your Redis instance can handle. With Kafka, you can have at most one consumer process per partition in your topic. As a general rule of thumb, you'll want more partitions than you plan to need, since a single Kafka consumer can be assigned more than one partition.
Another thing to remember is that scaling Kafka consumers up or down can be a very lengthy operation. Adding or removing consumers from a consumer group requires the group to rebalance itself, reassigning partitions as necessary. During reassignment, consumers stop processing messages. Although you can mitigate this issue somewhat with cooperative-sticky assignment, you generally want to avoid rebalancing if you can by over-provisioning resources.
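Opting into that rebalance protocol is a matter of passing the corresponding librdkafka setting through Karafka's config, roughly like so (the broker list is a placeholder):

class KarafkaApp < Karafka::App
  setup do |config|
    config.kafka = {
      "bootstrap.servers": ENV.fetch("KAFKA_BROKERS", "localhost:9092"),
      # Use the cooperative-sticky rebalance protocol.
      "partition.assignment.strategy": "cooperative-sticky"
    }
  end
end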
We are currently monitoring SumOffsetLag as one of our scaling metrics. An important thing to note is that during rebalancing, this metric does not get reported. As you can imagine, the lag keeps accumulating while it goes unreported, so the metric spikes drastically once the rebalancing finishes. This is another reason it is important to keep scaling events to a minimum.
What's next for Karafka at Honeybadger?
We've been running our Kafka/Karafka implementation at 100% for over a month now and it's safe to say we're pretty satisfied. Still, it's great to know that we can always fall back to Sidekiq at the push of a button if we ever need to. This gives us even greater resiliency when maintenance work is required on any of these systems.
In the process of migrating from Sidekiq to Karafka, we also learned a lot more about working with Kafka and Karafka. If you haven't updated to the most recent version of the Honeybadger gem, you should check it out! I added some new features to the karafka plugin. When you have Insights enabled, our gem will start tracking some of the important stats to give you a better view of the overall health of your Kafka system.
In addition, we now have an Insights Karafka Dashboard to help you visualize this data and give you a better understanding of how your Kafka consumers are behaving. The Karafka Dashboard requires metrics to be enabled for the plugin. To do that, add the following config to your honeybadger.yml:
karafka:
  insights:
    metrics: true
Here's a sample of what the dashboard looks like:
We're excited to see how our customers use this data to improve their own Kafka systems. If you have any questions about how we migrated from Sidekiq to Karafka, or how we use Kafka in general, feel free to reach out!