Welcome Back to Evil Tux!
Our last post explored the model development lifecycle’s most famous and exciting parts: training, serving, and retiring models. Those stages get the most attention because they deliver value to the business and restart the cycle once a model is retired. Yet we portrayed this “transfer of power” from one model to the next as a peaceful, calm, and simple transition: the old model was no longer performing at its best, so the team trained a new one and moved the old one out of production. It was a beautiful MLOps fairy tale with a happy ending, but fine folks with models in production know the story doesn’t always pan out that way. Unlike perfectly curated MLOps stories filled with sugar plums and perfect handoffs, reality is often disappointing and messy.
Anyone who’s been on-call knows that even the best environments run by the most brilliant people are subject to the chaos that is reality. Heck, the concept of a blameless post-mortem is the egoless exploration of what went wrong and how we can automate away the chaos for next time. The process is blameless because pinpointing the exact cause is genuinely difficult, and shaming a particular engineer for a mistake does more harm than good. Outages and on-call rotations are paths well carved out for site reliability, software, and support engineers, but what about on-call data scientists, machine learning engineers, and other ML/AI system maintainers? What does that healthy tension between ML and Ops look like? Who’s accountable when something goes wrong with a model? We will explore all these topics in this blog titled “DevOps VS. MLOps: Who Holds the Pager?!”.
Also, since the Wild West doesn’t have software developers, we will have to drop the Wild West theme a bit. Hopefully, this is not too jarring, and I promise I will try to throw some in.
“It Was the Best of Ops, It Was the Worst of Ops”
DevOps (a blend of development and operations) and Machine Learning Operations (MLOps) are not identical practices, yet MLOps was inspired by DevOps philosophies. DevOps emerged to align developers and infrastructure teams toward a common goal and improve production stability. In Western terms: how can the miners efficiently hand off the gold to the refineries so more gold products reach the market? Classic books like “The Phoenix Project” and “Modern Software Engineering” emerged and helped fuel the DevOps buzz. Anyone who’s been to a KubeCon has seen talk after talk on DevOps and its best practices. For those interested in learning more, check out Minimum Viable CD!
MLOps grew from the need to bring the automation practices required for stable service and software deployment to the model lifecycle. Many DevOps engineers feared that MLOps spelled the death of DevOps, but the two are often part of one giant team, because a model is, after all, still software. Let’s unpack these teams.
A DevOps team consists of:
Developers: Focused on creating, maintaining, and improving code to solve business problems and deliver customer value. They prioritize consistent environments, efficient collaboration, and reliable deployment to production. Developers continuously review and update applications to meet evolving needs, striving to ship features quickly while balancing stability and performance.
Operations: Focused on orchestrating and maintaining infrastructure to ensure business-critical applications meet production demands. Operations teams are often responsible for “holding the pager” (late-night escalations caused by network misconfigurations or resource exhaustion). Their primary goal is to build automation that prevents or resolves failures, ensuring stability and reproducibility. These teams rely on tools like containerization and orchestration platforms (e.g., Kubernetes) to enforce idempotency—ensuring applications run predictably regardless of deployment.
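The idempotency that operations teams lean on can be illustrated with a tiny sketch: a reconcile function that converges a system toward a declared desired state, so applying the same state once or a hundred times produces the same result. This is a toy analogy for what platforms like Kubernetes do, not real Kubernetes code; the names `reconcile`, `current`, and `desired` are purely illustrative.

```python
def reconcile(current: dict, desired: dict) -> dict:
    """Converge `current` toward `desired` and return the resulting state."""
    converged = dict(current)
    converged.update(desired)           # create or update anything missing or stale
    for key in set(current) - set(desired):
        converged.pop(key)              # prune anything no longer declared
    return converged

desired = {"replicas": 3, "image": "myapp:v2"}
state = reconcile({"replicas": 1, "image": "myapp:v1", "debug": True}, desired)
assert state == desired
# Applying the same desired state again changes nothing -- that's idempotency.
assert reconcile(state, desired) == state
```

Because the function describes *what* the end state should be rather than *how* to get there, re-running it after a partial failure is always safe, which is exactly why operations teams prefer declarative, idempotent tooling.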
The tension between the two arises because operations teams must be cautious about deploying new code whose requirements they don’t yet fully understand, so they prioritize minimal change to reduce risk. That focus on stability can feel counterintuitive to developers who aim to ship new features quickly. As Greg Brockman of OpenAI puts it, “Code is a liability, not an asset”.
When we were introduced to MLOps, the teams got more complicated! We won’t go into every persona, but the image below shows one example of an MLOps team.

The More Data We Come Across, the More Problems We See
So… when did DevOps become MLOps, and where did things go so wrong (or right!)? The major wrench in the old DevOps machinery was the introduction of data and, therefore, data scientists, data engineers, and the data systems they support! One thing often overlooked is that data scientists are NOT engineers. They are brilliant people focused on algorithms and techniques to create models that can accurately predict, classify, or generate insights from data. Scaling infrastructure, code lifecycle management, and cloud engineering are beyond the scope of their responsibilities. Statistical analysis and data insights are enough! Asking otherwise is like asking a geologist who specializes in finding new gold mines to also work the mines themselves! As for MLOps teams, there is now not one but TWO (or more, to be perfectly honest) walls of confusion (the wall representing work handed off in ways the next team member is unfamiliar with; see below).

All these professionals have their own tools and priorities, yet they rely on each other to specialize and scale their systems. However, as stated above, data scientists are NOT software engineers. So, how did the industry solve this? By creating the new role we alluded to earlier: the machine learning engineer!
Machine Learning Engineer Rise!
A machine learning engineer bridges the gap between data scientists’ prototypes and production-ready systems. They optimize models, transform them into production-grade software, and integrate them with existing IT infrastructure. MLOps supports their role by providing tools for version control, CI/CD pipelines for models, and monitoring systems, ensuring models are deployable, maintainable, and scalable in production environments.
This is where DevOps and MLOps converge—transforming a prototype into a production-ready system is like refining raw gold into a valuable piece of jewelry. In traditional development, a developer identifies patterns and codes them into an application, akin to separating impurities from ore. In machine learning, a data scientist uses data to train a model to discover patterns and generate predictions, like applying heat and precision to extract the purest gold.
Think of the process as building a machine-learning-powered system: the model is the brain, crafted and refined by data scientists and machine learning engineers to make intelligent predictions. The application, built by software engineers, acts as the body, utilizing the model’s insights to function and interact with the world. Finally, the data serves as the lifeblood, fueling the entire system and keeping it rich with insights, much like oxygen sustains every cell. Together, these elements form a seamless, interdependent ecosystem, polished and ready to shine in production (see below).

Still, the question remains with all these new team members: who holds the pager?
To Page the Data Scientist or Not to Page the Data Scientist? That is the Question.
We have finally reached the end of this blog, where we want to find out who holds the pager. In true solution engineering fashion, I will say, “It depends”. Realistically, retraining a model and putting it in production is difficult without oversight. Many organizations are not continuously training or deploying models because of the sheer amount of metrics and automation required. Not to mention that model drift is not the same as a golden-signal-style outage. Operations teams will still hold pagers when API calls fail or applications stop responding (per SLAs, SLIs, and SLOs). However, model and concept drift have different thresholds than traditional services. A model may be taken offline if it’s underperforming, but rarely will a data scientist be woken up to troubleshoot a model.
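To make the contrast with golden signals concrete, here is a minimal sketch of one common drift check: the Population Stability Index (PSI), which compares how model inputs or scores are distributed today against a training-time baseline. The bucket values and the 0.2 rule of thumb below are illustrative assumptions, not universal thresholds.

```python
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """Population Stability Index between two binned distributions (as shares summing to 1)."""
    eps = 1e-6  # guard against log(0) on empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

# Share of traffic per score bucket: training baseline vs. this week.
baseline = [0.25, 0.25, 0.25, 0.25]
this_week = [0.10, 0.20, 0.30, 0.40]

score = psi(baseline, this_week)
# A common rule of thumb treats PSI > 0.2 as significant drift -- a ticket
# for the data science team during business hours, not a 3 a.m. page.
if score > 0.2:
    print(f"Drift detected (PSI={score:.2f}); open a ticket, don't page.")
```

Notice the response on breach: unlike a latency SLO violation, a drift alert has no quick remediation, so it routes to a queue rather than a pager.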
The data scientist works at the pace of science; after all, high-quality data is hard to acquire. They must capture the poorly performing inputs, predictions, and confidence scores. Then, they must build training and test datasets and a strategy to move the new model into production. Essentially, the mismatch is that data science does not respond at the speed of traditional IT operations or software development, and is therefore most likely not the best use of a late-night on-call rotation. An application-side bug or a failing API call may be a quick fix; revising the model’s logic is not.
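The capture step described above can be sketched in a few lines: a queue that records only the predictions the model was unsure about, so data scientists can review them later and fold them into the next training set. The class name `ReviewQueue` and the 0.6 confidence threshold are illustrative assumptions, not part of any real framework.

```python
from dataclasses import dataclass, field

@dataclass
class ReviewQueue:
    """Collects inputs the model was unsure about for later review and retraining."""
    threshold: float = 0.6
    items: list = field(default_factory=list)

    def record(self, features: dict, prediction: str, confidence: float) -> None:
        # Low-confidence predictions flow back to the data scientists
        # asynchronously, during business hours -- not via a pager.
        if confidence < self.threshold:
            self.items.append({
                "features": features,
                "prediction": prediction,
                "confidence": confidence,
            })

queue = ReviewQueue()
queue.record({"amount": 42.0}, prediction="legit", confidence=0.95)   # confident: skipped
queue.record({"amount": 9000.0}, prediction="fraud", confidence=0.41)  # uncertain: queued
assert len(queue.items) == 1
```

In a real system the queue would land in durable storage (a table or object store) rather than memory, but the shape of the workflow is the same: capture now, curate later.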
So, to answer once and for all: who holds the pager? Most likely the development and operations teams, but ultimately, it depends on the organization’s specialties and requirements! Does MLOps maturity change this decision? MLOps can smooth the handoff between teams so that models move quickly in and out of production, but it still relies on access to high-quality data and the processes supporting its curation! Mature teams may automate the entire lifecycle with humans in the loop for validation, but that still does not mean we are waking up a data scientist.
Coming Soon
Now that we’ve covered the general structure of MLOps teams and the reality of their on-call behaviors, we will move on to what it takes to ensure teams running MLOps systems can efficiently manage models and share their work in the next blog, “Composable, Scalable, and Portable OH MY!”. In the meantime, we’d love to know your experience on an MLOps team. Have you held a pager as a Data Scientist? As a machine learning engineer, are you on-call? Do you support continuous training and retraining!? We’d love to hear from you and hope to see you here again to discuss the next post!
About the Author
Chase Christensen is a machine learning solutions engineer who lives at the intersection of business value and technical execution. He’s not just here to talk about what’s possible—he’s focused on making it real. With a background in open source and a hands-on approach, Chase works alongside teams to connect real-world problems to the right ML tools and infrastructure. He’s fluent in both the boardroom and the terminal, but knows that business problems are easy to spot—it’s delivering the solution that counts.
