Slack uses a job queue system for business logic that is too time-consuming to run in the context of a web request. This system is a critical component of our architecture, used for every Slack message post, push notification, URL unfurl, calendar reminder, and billing calculation. On our busiest days, the system processes over 1.4 billion jobs at a peak rate of 33,000 per second. Job execution times range from a few milliseconds to (in some cases) several minutes.
The previous job queue implementation, which dates back to Slack’s earliest days, has seen us through growth measured in orders of magnitude and has been adopted for a wide range of uses across the company. Over time we continued to scale the system when we ran into capacity limits on CPU, memory, and network resources, but the original architecture remained mostly intact.
However, about a year ago, Slack experienced a significant production outage due to the job queue. Resource contention in our database layer led to a slowdown in the execution of jobs, which caused Redis to reach its maximum configured memory limit.
This incident led to a re-evaluation of the job queue as a whole. What follows is a story of how we made a significant change in the core system design, with minimal disruption to dependent systems, no “stop the world” changeovers or one-way migrations, and room for future improvements.
Read the full article at: slack.engineering