Why developers are quitting over on-call contracts [Q&A]
The US Labor Department recently reported that 4.5 million workers left their jobs in November 2021, the highest exodus on record. The Great Resignation has become a hot topic in the tech world as the pandemic, new virtual team dynamics and other factors have created new waves of attrition.
In the tech industry -- where developer talent is a make-or-break factor in a company's success -- HR departments, hiring managers and software team leads are scrambling to rethink the developer happiness factors they can control to stem the tide.
One phenomenon that is increasingly being linked to developer unhappiness is so-called 'on-call' teams -- the practice of requiring software engineers to be available for extended hours to respond to service outages.
Kit Merker (@KitMerker), COO at the software reliability startup Nobl9, made waves with his October 27, 2021 tweet: "People used to quit managers, now they quit on-call rotations."
We spoke with Merker to learn more about how this general dissatisfaction with on-call rotations is leading to developer burnout, and how companies are rethinking the practice in the midst of the Great Resignation.
BN: Walk us through exactly why these on-call rotations are so despised?
KM: The reason why on-call on its face is so bad is that it's interrupting your personal life and your family. It's hard to find work-life balance when your time away from work is restricted by the possibility that you may have to rush to fix an emergency. Often there are explicit requirements, like being within 15 minutes of your computer at all times during on-call.
Sometimes organizations will give engineers hazard pay or bonuses for these on-call rotations, but it doesn’t make up the difference in lost peace of mind and family time.
But the on-call rotation itself is not the fundamental issue. Developers know they have to look after the software and fix it if it goes wrong, and understand that is part of the gig.
The fundamental issue is recurring problems that developers have already called out. Every code base has known legacy debt that developer teams already know needs attention. What's incredibly frustrating in this on-call context is how these issues get pushed to the backlog due to the never-ending pressure of new feature development. That is what makes developers want to quit -- responding to useless pages, and exhausting postmortems where bugs never get fixed.
BN: There's major tension between the need to ship features, and software reliability. How does that play out at most companies? Why is it not handled better?
KM: Every engineering team faces huge pressure for new features, because that’s what wins new business and gets people promoted -- shiny new things that win new customers and satisfy existing customers. So businesses put reliability on hold in favor of innovation.
Reliability is invisible until it isn't. Reliability issues are actually the consequence of shipping good features. You create new features that attract new customers, and now you run into reliability issues that happen only at a certain scale. Without consistent maintenance, software systems degrade and entropy takes its course.
This general phenomenon of known reliability issues is commonly referred to as 'tech debt', and I would say most engineering teams are extremely aware they have tech debt that deserves prioritization, but struggle to push back against the business leaders who are cracking the whip on features.
BN: Your background includes managing products at Microsoft and Google. How do they handle on-call rotations and avoid this type of burnout scenario differently from other enterprises?
KM: When I was working on Google Kubernetes Engine (GKE) at Google, the same engineers that created GKE were also responsible for all of the on-call rotation. So that provided a healthy stick to make sure it worked correctly.
But the carrot was working every week with the broader Google SRE team to review what are called 'service level objectives' (SLOs), which basically define the desired behavior of the services, so we could set basic thresholds for when system reliability needed to be favored over shipping new features. I would say that is the fundamental difference between the web-scale companies and mainstream enterprises -- this investment in tools and teams to understand and model out the appropriate level of reliability. This lets teams reason about the right pace for feature sprints, and about when to call a timeout to address critical issues before they create outages and emergencies for on-call teams.
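To make the threshold idea concrete, here is a minimal sketch of one common way to express it, an error budget: the share of failures an SLO still allows in a given window. This is an illustration only, not how Google or Nobl9 implement it; the `SLO` class, the function names, the "checkout" service and the request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical service level objective for one measurement window."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should succeed


def error_budget_remaining(slo: SLO, total_requests: int, failed_requests: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total_requests <= 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo.target) * total_requests
    if allowed_failures == 0:
        # A 100% target leaves no budget at all; any failure overspends it.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def should_pause_features(slo: SLO, total_requests: int, failed_requests: int,
                          freeze_below: float = 0.0) -> bool:
    """Assumed policy: favor reliability work once the budget drops to the threshold."""
    return error_budget_remaining(slo, total_requests, failed_requests) <= freeze_below


# Illustrative numbers only: a 99.9% target over one million requests
# allows roughly 1,000 failures in the window.
checkout = SLO(name="checkout", target=0.999)
healthy = should_pause_features(checkout, 1_000_000, 400)     # budget mostly intact
overspent = should_pause_features(checkout, 1_000_000, 1200)  # budget exceeded
```

In this sketch the team keeps shipping while budget remains and switches to reliability work once it is used up. The `freeze_below` threshold is a policy choice a team would agree on in advance, not a fixed rule.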
BN: What are some practical steps that companies can take to make on-call practices less painful for their dev teams?
KM: The number one step is to be willing to put features on hold in favor of real reliability issues. When your engineering team is vocal about a reliability issue, don't blame the team for releasing bad software. Set clear reliability targets with SLOs to make sure you're balancing new feature creation, software reliability issues and the sanity of your engineering team.
When engineers are looking at new jobs today, they are asking specifically about how the hiring company deals with this balance. How do you put features on hold? How do you measure, improve and manage the pressure to run production up against the pressure to ship new features?