Research Explainer

The Hidden Work of Keeping AI Running

Everyone talks about building AI models. Almost nobody talks about keeping them alive, healthy, and useful in the real world. This study went straight to the people who do that work every single day.

Shankar, Garcia, Hellerstein & Parameswaran · UC Berkeley · 2022
18 in-depth interviews · 12 industries · 1 massive question: How does anyone actually do MLOps?

The Problem

Building the Model Is Like Writing a Song. MLOps Is Running the Concert Every Night.

Think of it this way: creating an AI model is like composing a great song in a studio. You fiddle with notes, remix tracks, and eventually produce something that sounds amazing. But deploying that model to the real world? That's like performing that song live, every night, in different cities, with unpredictable weather, changing audiences, and instruments that occasionally go out of tune on their own.

The field that manages all of this is called MLOps — short for Machine Learning Operations. It's the set of practices for deploying and maintaining AI models in production reliably and efficiently.

90% of ML models reportedly never make it to production
85% of AI projects are said to fail to deliver business value

But are these scary numbers actually a problem? Or are they a natural side effect of how experimental ML is? This study set out to find the real answer — by talking to the humans behind the curtain.

The Research

Talking to 18 Engineers Who Keep AI Alive

The researchers conducted semi-structured ethnographic interviews — think of it as a guided but open conversation, lasting 45–75 minutes each, with ML engineers (MLEs) who actively deploy and maintain models in production. They used a method called grounded theory, which means they let the patterns emerge from the conversations rather than imposing assumptions upfront.

ID   Role         Company Size        Application
P1   MLE Manager  Large (1000+)       Autonomous Vehicles
P2   MLE          Medium (100-1000)   Autonomous Vehicles
P3   MLE          Small (<100)        Computer Hardware
P4   MLE          Medium              Retail
P5   MLE Manager  Large               Ads
P6   MLE          Large               Cloud Computing
P7   MLE          Small               Finance
P8   MLE          Small               NLP
P10  MLE          Small               OCR + NLP
P11  MLE Manager  Medium              Banking
P12  MLE          Large               Cloud Computing
P13  MLE          Small               Bioinformatics
P14  MLE          Medium              Cybersecurity
P15  MLE          Medium              Fintech
P16  MLE          Small               Marketing & Analytics
P17  MLE          Medium              Website Builder
P18  MLE          Large               Recommender Systems
P19  MLE Manager  Large               Ads

Company Size Distribution: 6 small, 6 medium, and 6 large companies.

Recruitment happened in rounds over the 2021–2022 academic year, 3–5 candidates per round, reached via professional networks and open calls on MLOps communities (Discord, Slack, Twitter). Interviews were transcribed via Zoom, redacted of personal info, then analyzed using MaxQDA qualitative analysis software.

Two researchers independently coded each transcript using open and axial coding (both top-down from theory and bottom-up from passages). They produced 1,766 coded segments with 600 unique codes. Coding rounds were repeated until reaching convergence.

The sample was 22% female-identifying. The researchers openly recruited female-identifying MLEs to mitigate sampling bias. Recruitment continued until the findings reached saturation — meaning new interviews weren't producing new themes.

The Workflow

Four Tasks ML Engineers Do Every Day

Imagine maintaining a garden. You plant seeds (collect data), experiment with fertilizers (run experiments), check which plants are healthy enough to transplant outside (evaluate and deploy), and keep checking for pests all season long (monitor). That's the ML lifecycle — and it never stops.

🗃️ Data Collection & Labeling
🧪 Experimentation
✅ Evaluation & Deployment
📊 Monitoring & Response
This cycle runs continuously, not once.
🗃️ Data Collection & Labeling

Sourcing new data, wrangling it into a centralized repository, cleaning it, and labeling it — sometimes outsourced (e.g., Mechanical Turk), sometimes done by in-house annotators.

🧪 Experimentation

Improving ML performance through data-driven (new features) or model-driven (architecture changes) experiments. Think "trial and error, but systematic."

✅ Evaluation & Deployment

Computing metrics on validation datasets, code review, staged rollouts (1% → 10% → 100% of users), A/B testing, and keeping rollback records.
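
A staged rollout like the 1% → 10% → 100% progression above is often implemented by hashing a stable user ID into a bucket. The study does not prescribe a mechanism, so treat this as a minimal sketch of one common approach; the function name `in_rollout` and the salt string are hypothetical.

```python
import hashlib

def in_rollout(user_id: str, percent: float, salt: str = "model-v2") -> bool:
    """Return True if this user falls inside the rollout percentage.

    Hashing (salt + user_id) gives each user a stable bucket in [0, 10000),
    so a user admitted at 1% is still admitted at 10% and at 100%.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100

# Hypothetical user population, used only to illustrate the split.
users = [f"user-{i}" for i in range(10_000)]
share_at_10 = sum(in_rollout(u, 10) for u in users) / len(users)
```

Because the bucket depends only on the salt and the user ID, widening the percentage never removes anyone who already had the new model, which keeps the user experience consistent between stages.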

📊 Monitoring & Response

Tracking live metrics via dashboards, slicing data to investigate failures, patching with heuristics, adding live failures to evaluation sets for the future.

The Core Framework

The Three Vs That Govern MLOps Success

Across all 18 interviews, three variables kept surfacing as the make-or-break factors. Think of them like the three legs of a stool — if any one is missing, the whole thing topples.

🚀 Velocity

Like a chef who can taste, adjust, and re-plate quickly, ML engineers need to go from a new idea to a trained model fast — ideally within a day. High velocity means rapid prototyping, fast debugging, and minimal friction between "I have an idea" and "I can see if it works."

Cited by: P1, P3, P6, P10, P11, P14, P18

🛡️ Validation

Like a quality inspector at a factory, the earlier you catch a defective product, the cheaper it is to fix. For ML, errors caught during development cost pennies; errors found by real users cost a fortune. Validating early — through tests, staged deployments, and sandbox environments — is critical.

Cited by: P1, P2, P5, P6, P7, P10, P14, P15, P18

"The general theme, as we moved up in maturity, is: how do you do more of the validation earlier, so the iteration cycle is faster?"

— P1, MLE Manager, Autonomous Vehicles

📦 Versioning

Like a writer who saves every draft of their manuscript, ML engineers need to keep track of multiple versions of models, datasets, and configurations. When something breaks in production, you need to be able to "undo" — instantly reverting to a previous working version.

Cited by: P6, P8, P10, P14, P15, P18
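
To make the "undo" idea concrete, here is a minimal sketch of a registry that remembers every promoted version so the active one can be reverted. The `ModelRegistry` class and its method names are hypothetical and are not taken from the study or from any particular tool.

```python
class ModelRegistry:
    """Keeps every promoted version so the active one can be reverted."""

    def __init__(self):
        self._history = []    # version ids, in the order they were promoted
        self._artifacts = {}  # version id -> stored model artifact

    def promote(self, version, artifact):
        """Store the artifact and make this version the active one."""
        self._artifacts[version] = artifact
        self._history.append(version)

    def rollback(self):
        """Drop the active version and return the one that is active now."""
        if len(self._history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._history.pop()
        return self._history[-1]

    @property
    def active(self):
        return self._history[-1]

registry = ModelRegistry()
registry.promote("v1", {"weights": [0.2, 0.8]})
registry.promote("v2", {"weights": [0.5, 0.5]})
restored = registry.rollback()  # v2 misbehaves, so revert to v1
```

A real setup would also version the dataset and configuration that produced each model, since the section above treats all three as things to track.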

⚖️ Explore the Tension Between the Three Vs

These three forces often pull against each other, so prioritizing one affects the others. In a balanced setup, for example, you iterate fast and validate thoroughly, but documentation suffers.

Findings

What the Interviews Revealed

The researchers uncovered three overarching findings.

"It's OK That 90% of Models Don't Make It"

Remember the scary statistic that 90% of models never reach production? The interviewees said this is actually fine, and even expected. ML engineering is like panning for gold: you sift through a lot of sand to find a few nuggets. The key isn't reducing the sand; it's sifting faster.

💬 Ideas Come From Collaborators

The best experiment ideas came from conversations with domain experts, data scientists, and analysts — not from engineers working in isolation. Cross-team collaboration was crucial.

📊 Iterate on Data, Not Model

Rather than changing the model architecture, most successful improvements came from adding new features or better data. "Start with a fixed model because it means faster iterations."

📉 Account for Diminishing Returns

Ideas that look great offline often lose their edge in staged deployment. Engineers learned to kill low-gain ideas early. End-to-end deployments could take 3+ months.

🔧 Keep Changes Small

Small, config-driven changes (editing a JSON/YAML file instead of code) reduced bugs, sped up code review, and prevented production surprises.
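
A config-driven change might look like the sketch below: the experiment edits a small JSON override rather than the training code. The keys in `DEFAULTS` and the `load_config` helper are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical defaults for a training pipeline.
DEFAULTS = {"learning_rate": 0.1, "max_depth": 6, "features": ["age", "income"]}

def load_config(text: str) -> dict:
    """Merge a JSON override onto the defaults, rejecting unknown keys."""
    overrides = json.loads(text)
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # A typo in a key fails loudly instead of being silently ignored.
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

# The whole "change" under review is this one-line override.
config = load_config('{"max_depth": 8}')
```

Rejecting unknown keys is a design choice here: it turns a misspelled setting into an immediate error, which fits the goal of preventing production surprises.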

"I'm gonna start with a [fixed] model because it means faster [iterations]. And often, like most of the time empirically, it's gonna be something in our data that we can use to push the boundary."

— P11, MLE Manager, Banking

Evaluation Is a Living, Breathing Process

In textbooks, you evaluate a model once against a fixed test set. In the real world, evaluation is a constantly evolving organizational effort. It's like grading a student — except the curriculum keeps changing.

🔄 Dynamic Validation Datasets

Engineers continuously update their test sets to reflect new failure modes discovered in the wild. For autonomous vehicles, this means curating specific scenarios: pedestrians, cyclists, roundabouts. For chatbots, it means capturing every weird user input.

"You can't hit pedestrians, right. You can't hit cyclists. You need to work in roundabouts... what you need to be able to do is go very quickly from user-recorded bug to not only fixing it, but driving improvements by changing your data."

— P1, MLE Manager, Autonomous Vehicles

📐 Standardized Evaluation

Different engineers cloning and modifying evaluation notebooks led to chaos. Organizations learned to standardize their evaluation scripts, even though it slowed things down.

🎚️ Multi-Stage Deployment

Models went through 1–4 stages: test → shadow → canary (small % of users) → full production. Each stage had its own evaluation. "Shadow mode" — where a model runs live but users don't see its predictions — was especially valued.
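
Shadow mode can be sketched in a few lines: the production model answers the request while the candidate's prediction is only logged for later comparison. This is an illustrative sketch, not the setup of any interviewee; `serve_with_shadow` and the toy models are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[float], float]
ShadowLog = List[Tuple[float, float, Optional[float]]]

def serve_with_shadow(x: float, production: Model, shadow: Model,
                      log: ShadowLog) -> float:
    """Serve the production prediction; record the shadow one on the side."""
    served = production(x)
    try:
        shadow_prediction: Optional[float] = shadow(x)
    except Exception:
        # A failing shadow model must never affect the live response.
        shadow_prediction = None
    log.append((x, served, shadow_prediction))
    return served

log: ShadowLog = []
result = serve_with_shadow(3.0, lambda x: 2 * x, lambda x: 2 * x + 1, log)
```

The log of (input, served, shadow) triples is what an engineer would later compare to decide whether the candidate is ready for the canary stage.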

📈 Product Metrics > ML Metrics

Engineers learned to tie evaluation to business outcomes (click-through rates, revenue, churn) rather than just accuracy or precision. As P11 put it: "Let's actually show the same business metrics that everyone else is held accountable to."

Keeping Models Alive Takes Old-School Engineering

Once a model is in production, sustaining it requires a mix of surprisingly simple strategies. Think of it like maintaining a car: regular oil changes, a spare tire in the trunk, and safety sensors.

🔄 Frequent Retraining

Retraining cadences ranged from hourly (P18) to every few months (P17). No one used a scientific procedure to decide the frequency — they just matched it to what was operationally easy.

⏪ Fallback Models

Always keep a previous model version ready to switch to. "If the production model drops, we fall back to the calibration model until someone fixes it."
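
One simple reading of the fallback idea is a wrapper that switches models when the primary one errors out or returns an invalid score. The interviews describe falling back when performance drops, which would need live metrics; this sketch assumes the narrower case of crashes and out-of-range outputs, and all names in it are hypothetical.

```python
def predict_with_fallback(features, primary, fallback):
    """Return (score, source), using the fallback if the primary misbehaves."""
    try:
        score = primary(features)
        # Scores are assumed to be probabilities; this check also rejects NaN.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        return score, "primary"
    except Exception:
        return fallback(features), "fallback"

healthy = predict_with_fallback({"amount": 40}, lambda f: 0.7, lambda f: 0.5)
degraded = predict_with_fallback({"amount": 40}, lambda f: float("nan"),
                                 lambda f: 0.5)
```

Returning the source alongside the score makes it easy to count how often the fallback fires, which is itself a useful monitoring signal.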

🚧 Heuristic Guardrails

Rule-based filters layered on top of models. A chatbot might filter out replies that mention specific times (the model doesn't actually know store hours). "We have a lot of filters."
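
The store-hours example could be implemented as a small rule-based filter sitting after the model. The sketch below is an assumption about how such a filter might look; the regular expression, the canned reply, and the `apply_guardrail` name are all hypothetical.

```python
import re

# Matches clock times such as "9am", "9 pm", "10:30", or "10:30 PM".
TIME_PATTERN = re.compile(
    r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b|\b\d{1,2}:\d{2}\b",
    re.IGNORECASE,
)

SAFE_REPLY = "I'm not sure about that. Please check the store's website."

def apply_guardrail(reply: str) -> str:
    """Replace any model reply that mentions a specific time."""
    return SAFE_REPLY if TIME_PATTERN.search(reply) else reply

blocked = apply_guardrail("We open at 9am every day.")
allowed = apply_guardrail("We sell 12 kinds of bread.")
```

The filter knows nothing about the model, so it keeps working unchanged when the model is retrained or replaced.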

🔍 Data Validation

Schema checks, bounds on values, completeness monitoring. Basic but essential — like checking that a recipe's ingredients haven't been swapped.
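
The three checks named above (schema, bounds, completeness) fit in one short function. This is a minimal sketch with a hypothetical two-column schema and an assumed 5% null-rate limit; production systems typically use a dedicated validation library.

```python
from typing import Any, Dict, List

# Hypothetical expectations for an incoming batch of feature rows.
SCHEMA = {"age": int, "income": float}
BOUNDS = {"age": (0, 120), "income": (0.0, 1e7)}

def validate_batch(rows: List[Dict[str, Any]],
                   max_null_rate: float = 0.05) -> List[str]:
    """Return a list of human-readable problems; empty means the batch passed."""
    problems = []
    for name, expected in SCHEMA.items():
        values = [row.get(name) for row in rows]
        # Completeness: too many missing values in this column?
        nulls = sum(v is None for v in values)
        if rows and nulls / len(rows) > max_null_rate:
            problems.append(f"{name}: null rate {nulls / len(rows):.0%} exceeds limit")
        low, high = BOUNDS[name]
        for v in values:
            if v is None:
                continue
            # Schema: wrong type. Bounds: value outside the allowed range.
            if not isinstance(v, expected):
                problems.append(f"{name}: expected {expected.__name__}, "
                                f"got {type(v).__name__}")
                break
            if not low <= v <= high:
                problems.append(f"{name}: value {v} outside [{low}, {high}]")
                break
    return problems

clean = [{"age": 30, "income": 50000.0}, {"age": 41, "income": 72000.0}]
bad = [{"age": -3, "income": 50000.0}, {"age": 41, "income": None}]
problems = validate_batch(bad)
```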

🎯 Keep It Simple

Many preferred tree-based models over deep learning for maintainability. "I can probably do the same with neural nets. But after deployment, it just doesn't make sense."

👩‍💼 On-Call Rotations

Like doctors on call, engineers took 1–2 week shifts monitoring production models, creating tickets for bugs, and writing incident reports.

⏱️ Retraining Frequency Explorer

How often should you retrain? It depends. One example from the interviews:

Daily (P14): "You don't really need to worry about if your model has gone stale if you're retraining it every day."

Challenges

The Biggest Pain Points Nobody Has Solved

Even seasoned engineers struggle with recurring headaches. The researchers organized these into four major themes, each representing a tension or synergy between the Three Vs.

🔀 The Lab vs. The Factory

Imagine practicing a speech in your quiet bedroom, then delivering it in a noisy stadium. The environment mismatch between development (where engineers experiment) and production (where models serve real users) is a constant source of bugs.

Data Leakage

The model "cheats" by using information during training that won't exist when making real predictions. Like studying with the answer key, then being surprised when you can't solve new problems.
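
Leakage takes many forms, and one common guard against the time-based kind is to split data by date so the model never trains on rows from the evaluation period. The sketch below illustrates only that one case; the `temporal_split` helper and the toy rows are hypothetical.

```python
from datetime import date

def temporal_split(rows, cutoff):
    """Train only on rows strictly before the cutoff; evaluate on the rest."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test

# Ten hypothetical daily records from January 2022.
rows = [{"day": date(2022, 1, d), "label": d % 2} for d in range(1, 11)]
train, test = temporal_split(rows, cutoff=date(2022, 1, 8))
```

A random shuffle of the same rows would mix future and past, which is exactly the "answer key" situation described above.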

The Jupyter Notebook War

Engineers were split: some loved notebooks in production (easy debugging), others hated them (sloppy code, wrong inputs). Both sides had valid points.

Code Quality Gaps

Experimental code is messy — and that's fine when you're exploring. But the transition to production-grade code is fraught. Code review felt "not worth the effort" to many, yet skipping it invited bugs.

"Model developers don't follow software engineering practices — not because they're lazy, but because [those practices are] contradictory to the agility of analysis and exploration."

— P6, MLE, Cloud Computing

💡 Tool Opportunity

Virtualized ML environments where dev and production share the same foundation but support different iteration speeds. Track discrepancies automatically.

📊 A Spectrum of Data Problems

Data errors aren't one thing — they exist on a spectrum from obvious to invisible, and each type demands a different response.

Hard Errors (obvious) → Soft Errors (hard to spot) → Drift (nearly invisible)

Hard errors: Swapped columns, negative ages, schema violations. Obvious and crash-worthy.

Soft errors: A few null values in a feature. The model still produces "reasonable" predictions, making these insidiously hard to catch.

Drift: Live data slowly diverges from training data — like fashion trends changing. Frequent retraining helped, but hand-curated features could silently corrupt.
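
The study does not prescribe a drift metric. As one common choice, the Population Stability Index (PSI) compares how a feature is distributed in training data versus live data. The sketch below uses equal-width bins and synthetic Gaussian data, both of which are assumptions made for illustration.

```python
import math
import random

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip outliers to edge bins
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(0)  # seeded so the example is repeatable
train = [rng.gauss(0, 1) for _ in range(5000)]
live_same = [rng.gauss(0, 1) for _ in range(5000)]
live_shifted = [rng.gauss(1, 1) for _ in range(5000)]
```

Here `psi(train, live_same)` stays close to zero while `psi(train, live_shifted)` is much larger. A PSI above roughly 0.25 is often treated as a significant shift, though that cutoff is a rule of thumb rather than anything the interviewees reported using.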

🚨 The Alert Fatigue Epidemic

Engineers tracked so many metrics that false-positive alerts became overwhelming. By P18's estimate, about 90% of alerts did not need immediate attention, and most were ignored. One company even built a model to predict which alerts to act on!

"You typically ignore most alerts... I guess on record I'd say 90% of them aren't immediate."

— P18, MLE, Recommender Systems

The Data Issue Hierarchy (Most → Least Frequent)

  1. Feedback delays — Ground-truth labels arrive late (sometimes 2+ weeks). "Nobody is solving the label lag problem."
  2. Unnatural drift — Sudden data corruption, missing data, COVID-like disruptions.
  3. Natural drift — Slow, expected changes over time. Most handled this well with retraining.

💡 Tool Opportunity

Self-tuning alert systems that balance precision and recall for detecting real performance drops. Automatically adjust thresholds as data drifts naturally.

🐛 Every Bug Feels Unique

Unlike traditional software where you can write test cases for known categories of bugs, ML bugs form a "long tail" — a vast collection of rare, seemingly one-of-a-kind issues. Debugging is ad-hoc: "I just poked around until I figured it out."

Example bugs from interviews:

  • Accidentally flipping labels (P1, P3, P6, P11)
  • Forgetting to set random seeds (P1, P12, P13)
  • Forgetting to drop special characters in NLP (P8)
  • Corrupted imputation values for missing features (P6)
  • Half the keys missing in a JSON feature column (P18)
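
The random-seed bug in the list above has a simple guard: pass the seed explicitly and use a local generator. This is a minimal sketch with a hypothetical `shuffled_split` helper, not a fix described by the interviewees.

```python
import random

def shuffled_split(items, test_fraction, seed):
    """Shuffle with a local, seeded generator so the split is repeatable."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_a, test_a = shuffled_split(range(100), 0.2, seed=42)
train_b, test_b = shuffled_split(range(100), 0.2, seed=42)
```

Two runs with the same seed produce the same split, so a change in metrics can be attributed to the experiment rather than to a different shuffle.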

The Silver Lining: Predictable Symptoms

While the bugs were unique, their symptoms were not. A big gap between offline and live accuracy almost always pointed to data issues. And the debugging method was consistently "slice and dice" — cutting data by different customer groups or categories to find where the model fails.
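
"Slice and dice" amounts to computing the same metric per group and looking for the group where it collapses. The sketch below assumes records with `label` and `prediction` fields and a hypothetical `region` column to slice on.

```python
from collections import defaultdict

def accuracy_by_slice(records, key):
    """Compute accuracy separately for each value of the slicing column."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["label"] == r["prediction"])
    return {k: hits[k] / totals[k] for k in totals}

# Hypothetical logged predictions.
records = [
    {"region": "US", "label": 1, "prediction": 1},
    {"region": "US", "label": 0, "prediction": 0},
    {"region": "EU", "label": 1, "prediction": 0},
    {"region": "EU", "label": 0, "prediction": 0},
]
by_region = accuracy_by_slice(records, "region")
worst = min(by_region, key=by_region.get)  # the slice to investigate first
```

Overall accuracy here is 75%, which hides the fact that one region is at 100% and the other at 50%.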

The Paranoia Effect

After enough trauma from bespoke bugs, engineers developed a sense of paranoia — obsessively checking code, even when everything looked fine.

"ML bugs don't get caught by tests... and just silently cause slight reductions in performance. This is why you need to be paranoid when you're writing ML code."

— P1, MLE Manager, Autonomous Vehicles

💡 Tool Opportunity

Break the chicken-and-egg problem: precise performance monitoring → map drops to bugs → categorize the long tail into actionable groups.

🐌 Three Months From Idea to Deployment

End-to-end staged deployments could take months. During that time, user behavior changes, business priorities shift, and initially promising ideas fizzle out.

40–50% of ideas make it to initial launch (P19)
~50% of launched experiments are dropped for legal/privacy/complexity
3+ months for a single new feature idea to fully deploy

"Metrics keep on rotating based on the company's priorities. Is it the revenue? The total installs? Or clicks? They keep on changing with the company's roadmap."

— P18, MLE, Recommender Systems

💡 Tool Opportunity

Streamline multi-stage deployments, minimize wasted work, and help practitioners predict end-to-end value of experiment ideas before committing months of effort.

Anti-Patterns

Four Common Traps MLEs Fall Into

Beyond pain points, the researchers identified anti-patterns — common behaviors that seem reasonable but lead to trouble.

  1. Industry-Classroom Mismatch
  2. Keeping GPUs Warm
  3. Retrofitting Explanations
  4. Undocumented Tribal Knowledge

🎓 Anti-Pattern 1: School Doesn't Prepare You

Multiple engineers said their education left them unprepared for production ML. "I learned a lot of data science in school, but none of it was quite like all these things," said P7. Key production skills — monitoring, debugging drift, managing deployment stages — were learned entirely on the job.

🔥 Anti-Pattern 2: "Keeping GPUs Warm"

This is the compulsion to run as many experiments as possible just because the computational resources exist. It's like cooking every recipe in a cookbook simultaneously — you run out of attention long before you run out of stove burners.

The better approach? Guided search over random search. Focus on one idea per week instead of five in parallel. "Developer time and energy is the limiting reagent," not GPU hours.

"You can be overly concerned with keeping your GPUs warm, so much so that you don't actually think deeply about what the highest-value experiment is."

— P5, MLE Manager, Ads

🧩 Anti-Pattern 3: Making Up a Story After the Fact

Engineers would find something that works, deploy it, and then construct an elegant-sounding explanation for why it works. Sometimes these explanations are useful for team alignment and customer trust — but they can also be misleading.

"People just try everything and then backfit some nice-sounding explanation for why it works."

— P1, MLE Manager, Autonomous Vehicles

📜 Anti-Pattern 4: Tribal Knowledge Trap

Critical pipeline knowledge lives in people's heads, not documentation. When those people leave, their pipelines become black boxes. "Some of our models are pretty old and not well documented, so I don't have great expectations for what they should be doing," said P17.

The root cause: high velocity creates many versions, making documentation perpetually outdated. "We learn faster than we can document."

For Tool Builders

The Four-Layer MLOps Stack

Think of the MLOps tool ecosystem like a building. Each floor serves a different purpose, and changes happen at very different speeds depending on which floor you're on.

🏃 Run Layer: A record of each pipeline execution. Tools: Weights & Biases, MLflow, Hive metastores. Changes most frequently, daily or even hourly.
🔗 Pipeline Layer: The dependency graph between computations. Tools: Airflow, TFX, SageMaker, dbt. Changes less often than runs, more often than components.
🧩 Component Layer: Individual scripts or functions (feature generation, model training). Tools: PyTorch, TensorFlow, scikit-learn. Some orgs maintain shared component libraries.
🏗️ Infrastructure Layer: Cloud, GPUs, Docker, Kubernetes. Changes rarely, but each change has wide-ranging consequences. Tools: AWS, GCP, Docker.

↑ Changes happen fastest here   |   ↓ Changes are rare but high-impact

A key insight for tool builders: most day-to-day changes happen at the Run Layer (tweaking hyperparameters, running experiments). Engineers modify Dockerfiles only occasionally. Don't build tools for the wrong layer!

Engineers preferred tools that provided "10x" improvements in one of the Three Vs: experiment trackers boosted velocity, feature stores improved versioning, and staging platforms improved validation.

Research Data

What the Engineers Talked About Most

The researchers coded every interview transcript, identifying recurring themes. Topics were ranked by how many different interviews mentioned them, so the ranking reflects breadth across participants rather than raw frequency.

Total coded segments: 1,766. Unique codes: 600. The coding was done using MaxQDA software with both top-down (theory-driven) and bottom-up (data-driven) approaches. Two researchers coded independently, then met to discuss and reconcile.

Interview transcripts ranged from ~400 to ~650 sentences each, with 40–160 coded segments per transcript. Codes were organized into hierarchies and analyzed for co-occurrence (codes appearing within 20 sentences of each other).
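
The co-occurrence rule (two codes appearing within 20 sentences of each other) can be sketched as a pairwise count. The researchers used MaxQDA for this, so the code below only illustrates the rule; the input format and the code names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(segments, window=20):
    """Count pairs of distinct codes that appear within `window` sentences.

    `segments` is a list of (sentence_index, code) tuples for one transcript.
    """
    counts = Counter()
    for (i, a), (j, b) in combinations(sorted(segments), 2):
        if a != b and abs(i - j) <= window:
            counts[tuple(sorted((a, b)))] += 1
    return counts

# Hypothetical coded segments from a single transcript.
segments = [(3, "drift"), (10, "retraining"), (40, "alerts"), (55, "drift")]
pairs = cooccurrences(segments)
```

In this toy transcript, "drift" co-occurs once with "retraining" (7 sentences apart) and once with "alerts" (15 sentences apart), while the two "drift" segments are not paired with each other.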

The Bottom Line

Five Things to Remember

If you walked away from this page and had to explain this research over coffee, here's what to say:

1. MLOps is governed by three forces: Velocity, Validation, Versioning

Speed of iteration, catching errors early, and keeping track of every version. These three forces are often in tension — optimizing one can undermine another.

2. It's okay for most experiments to fail

The 90% "failure rate" isn't a crisis — it's the natural cost of experimentation. What matters is failing fast and cheap, not failing less.

3. Evaluation must be an active, ongoing effort

Static test sets aren't enough. Validation datasets should evolve, evaluation should be standardized, and ML metrics must tie to business value.

4. Simple beats clever in production

Frequent retraining, fallback models, rule-based guardrails, and on-call rotations keep AI systems reliable. Nobody is using cutting-edge techniques to handle drift — they just retrain more often.

5. The biggest unsolved problems are human, not technical

Alert fatigue, tribal knowledge, the industry-classroom gap, and the cognitive overload of managing parallel experiments — these are people problems dressed as engineering problems.