
Cache Outage

January 19, 2023 at 11:58 PM UTC

Application API Cache Worker

Resolved after 12h 48m of downtime. January 20, 2023 at 12:46 PM UTC

dddice experienced a major outage that affected dice rolling and user sessions. During the incident, users were periodically logged out of their accounts, and rolls either did not appear or took a long time to appear. This affected all dddice plugins and the main website.

The incident lasted 12 hours and 48 minutes.

Background

dddice uses Redis as our main cache backend, which stores user sessions, queued rolls, and queued assets awaiting processing.
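
As an illustration only, here is a minimal sketch of that pattern using redis-py; the key names and payloads are hypothetical stand-ins, not our actual schema:

    import json
    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Sessions are ordinary keys with a TTL (key names are illustrative).
    r.setex("session:user:42", 3600, json.dumps({"user_id": 42}))

    # Rolls and asset jobs wait in Redis lists until a queue worker drains them.
    r.rpush("queue:rolls", json.dumps({"roll": "2d20", "room": "abc"}))
    r.rpush("queue:assets", json.dumps({"asset_id": "example-dice-set"}))

    # A worker blocks until a job is available on either queue.
    queue, job = r.blpop(["queue:rolls", "queue:assets"])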

During a traffic spike, our queue workers were overwhelmed, causing a backup in the queues that filled Redis to >90% of its storage capacity. New rolls could not be queued and partial page caches could not be generated, which left some sections of the site inaccessible and prevented dice from rolling.
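
For context, Redis exposes its memory pressure through the INFO command. Below is a hedged sketch of the kind of check that can detect this condition; the 90% threshold and connection details are illustrative:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # INFO reports both current usage and the configured ceiling.
    info = r.info("memory")
    used = info["used_memory"]
    limit = info["maxmemory"]  # 0 means no ceiling is configured

    if limit and used / limit > 0.9:
        # Near the maxmemory ceiling, new writes (e.g. queued rolls)
        # can start failing, depending on the eviction policy.
        print(f"Redis is at {used / limit:.0%} of its {limit}-byte limit")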

Problem

Our worker uses a “main” process that monitors “child” processes. The worker runs on a single VM with 2GB of memory. Its initial configuration assigned 1.5GB to the “main” process and 256MB to each “child” process, and allowed the worker to manage a maximum of 10 “child” processes.
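
Spelling that budget out makes the failure mode clear. The numbers below are taken directly from the configuration just described:

    # Worst-case memory budget implied by the initial worker configuration.
    MAIN_MB = 1536       # 1.5GB reserved for the "main" process
    CHILD_MB = 256       # per "child" process
    MAX_CHILDREN = 10
    VM_MB = 2048         # total memory on the 2GB VM

    worst_case_mb = MAIN_MB + MAX_CHILDREN * CHILD_MB
    print(worst_case_mb, VM_MB)  # 4096 vs. 2048 -- double what the VM has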

Incorrect assumptions were made about the worker's configuration, which allowed the worker VM to consume all of its resources; as the budget above shows, the worst case demanded roughly twice the VM's available memory. This halted all queues, including rolls.

Each time the worker was restarted, the queues would begin processing again and the backlog would immediately overwhelm the Redis cache.

In short, Redis was overwhelmed because of an invalid configuration in our worker process.

Solution

This incident resulted in two fixes.

  1. Redis memory was increased from 100MB to 256MB to handle the increase in traffic.
  2. The worker configuration was updated so the “main” process consumes less memory and the “child” processes consume more. Child processes were scaled down from 10 to 3 in order to stay within the VM's resource limits. Careful analysis of how our worker operates showed this configuration to be more effective (see the sketch after this list).
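
A minimal sketch of both changes follows, assuming redis-py for the Redis side; the new worker figures are hypothetical stand-ins, since the exact values are internal:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # Fix 1: raise Redis's memory ceiling from 100MB to 256MB.
    r.config_set("maxmemory", "256mb")

    # Fix 2: rebalance the worker. These figures are hypothetical stand-ins
    # that satisfy the constraint described above: less memory for "main",
    # more per "child", and only 3 children so the total fits the 2GB VM.
    MAIN_MB = 512        # was 1536
    CHILD_MB = 448       # was 256
    MAX_CHILDREN = 3     # was 10

    assert MAIN_MB + MAX_CHILDREN * CHILD_MB <= 2048, "must fit the 2GB VM"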

Timeline

Below is a timeline of the incident and the solutions attempted to resolve it.

Future Solutions

We are actively monitoring the fix and have further thoughts on how to improve our response times, our transparency with the community, and the performance of the site.

Last updated: January 20, 2023 at 2:02 PM UTC