Veit's Blog

Some notes on on-call

2026-03-16

As organizations scale, some of them realize they need an on-call structure. Often, when I watch one get set up, the people doing it seem to believe their situation is unique enough to warrant building it from scratch. In my experience, it rarely is, and the bespoke approach rarely works. The problems are almost always the same, and the solutions are well-documented (the Google SRE book alone covers most of the fundamentals).

As I keep seeing the same things, though, I figured I’d write them down. If you see different failure modes, let me know.

What on-call is

On-call means assigning someone responsibility for production systems during a defined window. When something breaks, they get paged, they assess, and they either fix it or escalate.

It’s not the same as insisting everyone be always reachable. It’s not an implicit expectation that people check Slack after dinner[1]. And it’s not the most junior person on the team getting stuck with the pager because nobody else wants it. If you don’t have a defined rotation with clear expectation boundaries, you don’t really have an on-call system I’d be confident relying upon.
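
To make "defined rotation with clear boundaries" a bit more concrete, here is a minimal Python sketch. It is not how any particular paging tool works. The names (Shift, build_rotation, who_is_on_call), the weekly shift length, and the engineers are all made up for illustration. The only point is that every instant either has exactly one owner or is explicitly nobody's, rather than vaguely everybody's.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Shift:
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def build_rotation(engineers, first_start, shift_length, count):
    """Create `count` back-to-back shifts, assigned round-robin.

    Hypothetical helper: consecutive shifts share a boundary, so there
    are no overlaps and no gaps inside the generated range.
    """
    shifts = []
    for i in range(count):
        start = first_start + i * shift_length
        engineer = engineers[i % len(engineers)]
        shifts.append(Shift(engineer, start, start + shift_length))
    return shifts


def who_is_on_call(shifts, at):
    """Return the engineer whose window contains `at`, or None.

    None means nobody is expected to respond at that time, which is a
    gap you want to know about before something breaks.
    """
    for shift in shifts:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None


# Example: three (made-up) engineers, weekly shifts, six weeks of coverage.
rotation = build_rotation(
    ["ana", "bo", "chris"],
    datetime(2026, 3, 16, 9, 0, tzinfo=timezone.utc),
    timedelta(days=7),
    6,
)
```

Asking who_is_on_call(rotation, some_time) then gives a single name or None. Whatever tool you actually use, I'd want it to be able to answer that question just as unambiguously.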

Pitfalls

Here are some things I’ve seen go wrong more than once:

Non-negotiables

A few things I wouldn’t compromise on:

Fin

On-call should be boring. The default should be that nothing happens. If that’s not the case, it usually points to something that needs attention: the release process, the monitoring or alerting configuration, maybe the staffing. The goal is a setup where the pager rarely goes off, and when it does, the person holding it knows what to do and can actually do it.

These practices are easy to get right in principle and remarkably easy to get wrong in practice. I think the gap is mostly about whether someone decided to care about the details.

Footnotes

1. Though I’ve seen this exact pattern more often than I’d like. It tends to happen “naturally” if nobody actively prevents it, and once it’s the norm it’s surprisingly hard to walk back.

2. Your on-call engineers are not a substitute for a test harness. I feel like this should go without saying, but somehow I feel compelled to say it anyway.