[stripe.com] Learning to operate Kubernetes reliably

January 10, 2018

We recently built a distributed cron job scheduling system on top of Kubernetes, a new platform for container orchestration. Kubernetes is very popular right now and makes a lot of exciting promises: one of the most appealing is that engineers don’t need to know or care what machines their applications run on.

Distributed systems are really hard, and managing services on distributed systems is one of the hardest problems operations teams face. Breaking in new software in production and learning how to operate it reliably is something we take very seriously. As an example of why learning to operate

In this post, we’ll explain why we chose to build on top of Kubernetes. We’ll examine how we integrated Kubernetes into our existing infrastructure, our approach to building confidence in (and improving) our Kubernetes cluster’s reliability, and the abstractions we’ve built on top of Kubernetes.

