Everyone talks about building AI models. Almost nobody talks about keeping them alive, healthy, and useful in the real world. This study went straight to the people who do that work every single day.
Think of it this way: creating an AI model is like composing a great song in a studio. You fiddle with notes, remix tracks, and eventually produce something that sounds amazing. But deploying that model to the real world? That's like performing that song live, every night, in different cities, with unpredictable weather, changing audiences, and instruments that occasionally go out of tune on their own.
The field that manages all of this is called MLOps — short for Machine Learning Operations. It's the set of practices for deploying and maintaining AI models in production reliably and efficiently.
A widely repeated statistic holds that 90% of models never reach production. Is that scary number actually a problem, or is it a natural side effect of how experimental ML is? This study set out to find the real answer by talking to the humans behind the curtain.
The researchers conducted semi-structured ethnographic interviews — think of it as a guided but open conversation, lasting 45–75 minutes each, with ML engineers (MLEs) who actively deploy and maintain models in production. They used a method called grounded theory, which means they let the patterns emerge from the conversations rather than imposing assumptions upfront.
| ID | Role | Company Size | Application |
|---|---|---|---|
| P1 | MLE Manager | Large (1000+) | Autonomous Vehicles |
| P2 | MLE | Medium (100-1000) | Autonomous Vehicles |
| P3 | MLE | Small (<100) | Computer Hardware |
| P4 | MLE | Medium | Retail |
| P5 | MLE Manager | Large | Ads |
| P6 | MLE | Large | Cloud Computing |
| P7 | MLE | Small | Finance |
| P8 | MLE | Small | NLP |
| P10 | MLE | Small | OCR + NLP |
| P11 | MLE Manager | Medium | Banking |
| P12 | MLE | Large | Cloud Computing |
| P13 | MLE | Small | Bioinformatics |
| P14 | MLE | Medium | Cybersecurity |
| P15 | MLE | Medium | Fintech |
| P16 | MLE | Small | Marketing & Analytics |
| P17 | MLE | Medium | Website Builder |
| P18 | MLE | Large | Recommender Systems |
| P19 | MLE Manager | Large | Ads |
Recruitment happened in rounds of 3–5 candidates over the 2021–2022 academic year. Candidates were reached via professional networks and open calls in MLOps communities (Discord, Slack, Twitter). Interviews were transcribed via Zoom, stripped of personal information, then analyzed using MaxQDA qualitative analysis software.
Two researchers independently coded each transcript using open and axial coding (both top-down from theory and bottom-up from passages). They produced 1,766 coded segments with 600 unique codes. Coding rounds were repeated until reaching convergence.
The sample was 22% female-identifying. The researchers openly recruited female-identifying MLEs to mitigate sampling bias. Recruitment continued until the findings reached saturation — meaning new interviews weren't producing new themes.
Imagine maintaining a garden. You plant seeds (collect data), experiment with fertilizers (run experiments), check which plants are healthy enough to transplant outside (evaluate and deploy), and keep checking for pests all season long (monitor). That's the ML lifecycle — and it never stops.
Data collection: Sourcing new data, wrangling it into a centralized repository, cleaning it, and labeling it, sometimes through outsourcing (e.g., Mechanical Turk) and sometimes with in-house annotators.
Experimentation: Improving ML performance through data-driven (new features) or model-driven (architecture changes) experiments. Think "trial and error, but systematic."
Evaluation and deployment: Computing metrics on validation datasets, code review, staged rollouts (1% → 10% → 100% of users), A/B testing, and keeping rollback records.
Monitoring: Tracking live metrics via dashboards, slicing data to investigate failures, patching with heuristics, and adding live failures to evaluation sets for the future.
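The staged rollout mentioned above (1% → 10% → 100% of users) needs a stable way to decide which users see the new model. The study does not say how participants did this; the sketch below assumes one common approach, hashing the user ID into a fixed bucket so that a user admitted at 1% stays admitted at 10%. The names `bucket`, `uses_new_model`, and `ROLLOUT_STAGES` are hypothetical.

```python
import hashlib

# Hypothetical rollout stages, mirroring the 1% -> 10% -> 100% progression.
ROLLOUT_STAGES = [0.01, 0.10, 1.00]


def bucket(user_id: str) -> float:
    """Map a user ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a uniform-looking integer.
    return int(digest[:8], 16) / 0x100000000


def uses_new_model(user_id: str, fraction: float) -> bool:
    """Return True if this user falls inside the current rollout fraction."""
    return bucket(user_id) < fraction


users = [f"user-{i}" for i in range(10_000)]
for fraction in ROLLOUT_STAGES:
    share = sum(uses_new_model(u, fraction) for u in users) / len(users)
    print(f"target {fraction:.0%}, observed {share:.1%}")
```

Because the bucket depends only on the user ID, widening the fraction never removes a user who already had the new model, which keeps each stage's evaluation comparable with the previous one.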
Across all 18 interviews, three variables kept surfacing as the make-or-break factors: velocity, validation, and versioning (the "Three Vs"). Think of them like the three legs of a stool: if any one is missing, the whole thing topples.
Like a chef who can taste, adjust, and re-plate quickly, ML engineers need to go from a new idea to a trained model fast — ideally within a day. High velocity means rapid prototyping, fast debugging, and minimal friction between "I have an idea" and "I can see if it works."
Cited by: P1, P3, P6, P10, P11, P14, P18
Think of a quality inspector at a factory: the earlier a defective product is caught, the cheaper it is to fix. For ML, errors caught during development cost pennies; errors found by real users cost a fortune. Validating early, through tests, staged deployments, and sandbox environments, is critical.
Cited by: P1, P2, P5, P6, P7, P10, P14, P15, P18
"The general theme, as we moved up in maturity, is: how do you do more of the validation earlier, so the iteration cycle is faster?"
Like a writer who saves every draft of their manuscript, ML engineers need to keep track of multiple versions of models, datasets, and configurations. When something breaks in production, you need to be able to "undo" — instantly reverting to a previous working version.
Cited by: P6, P8, P10, P14, P15, P18
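To make the "undo" idea concrete, here is a minimal in-memory sketch of a version registry that ties a model to the dataset and configuration behind it and can revert to the previous entry. It is an illustration under assumptions, not a tool any participant described; `ModelVersion`, `Registry`, and all identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ModelVersion:
    """One deployable unit: a model plus the data and config behind it."""
    model_id: str
    dataset_id: str
    config_id: str


@dataclass
class Registry:
    history: List[ModelVersion] = field(default_factory=list)

    def deploy(self, version: ModelVersion) -> None:
        self.history.append(version)

    @property
    def current(self) -> ModelVersion:
        return self.history[-1]

    def rollback(self) -> ModelVersion:
        """Revert to the previous version and return it."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.current


registry = Registry()
registry.deploy(ModelVersion("model-v1", "data-2022-01", "config-a"))
registry.deploy(ModelVersion("model-v2", "data-2022-02", "config-b"))
restored = registry.rollback()
print(restored.model_id)  # model-v1
```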
The researchers uncovered three overarching findings.
Remember the scary statistic that 90% of models never reach production? The interviewees said this is actually fine — and even expected. ML engineering is like a gold prospector panning for nuggets: you sift through a lot of sand. The key isn't reducing the sand; it's sifting faster.
The best experiment ideas came from conversations with domain experts, data scientists, and analysts — not from engineers working in isolation. Cross-team collaboration was crucial.
Rather than changing the model architecture, most successful improvements came from adding new features or better data. "Start with a fixed model because it means faster iterations."
Ideas that look great offline often lose their edge in staged deployment. Engineers learned to kill low-gain ideas early. End-to-end deployments could take 3+ months.
Small, config-driven changes (editing a JSON/YAML file instead of code) reduced bugs, sped up code review, and prevented production surprises.
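As a rough illustration of a config-driven change, the sketch below keeps every experiment knob in a JSON document and validates it before anything runs, so a reviewer only has to read the config diff. The keys, values, and the `load_config` helper are hypothetical and not taken from the study.

```python
import json

# Hypothetical experiment config; in practice this would live in its own
# JSON or YAML file and be the only thing that changes between experiments.
CONFIG_TEXT = """
{
  "model": "gradient_boosted_trees",
  "learning_rate": 0.05,
  "features": ["age", "days_since_signup", "num_purchases"]
}
"""

REQUIRED_KEYS = {"model", "learning_rate", "features"}


def load_config(text: str) -> dict:
    """Parse the config and fail fast on missing keys or bad values."""
    config = json.loads(text)
    missing = REQUIRED_KEYS - set(config)
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    if not 0 < config["learning_rate"] <= 1:
        raise ValueError("learning_rate must be in (0, 1]")
    return config


config = load_config(CONFIG_TEXT)
print(config["model"], config["features"])
```

Trying a new feature then means adding one string to `features` rather than editing pipeline code.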
"I'm gonna start with a [fixed] model because it means faster [iterations]. And often, like most of the time empirically, it's gonna be something in our data that we can use to push the boundary."
In textbooks, you evaluate a model once against a fixed test set. In the real world, evaluation is a constantly evolving organizational effort. It's like grading a student — except the curriculum keeps changing.
Engineers continuously update their test sets to reflect new failure modes discovered in the wild. For autonomous vehicles, this means curating specific scenarios: pedestrians, cyclists, roundabouts. For chatbots, it means capturing every weird user input.
"You can't hit pedestrians, right. You can't hit cyclists. You need to work in roundabouts... what you need to be able to do is go very quickly from user-recorded bug to not only fixing it, but driving improvements by changing your data."
Different engineers cloning and modifying evaluation notebooks led to chaos. Organizations learned to standardize their evaluation scripts, even though it slowed things down.
Models went through 1–4 stages: test → shadow → canary (small % of users) → full production. Each stage had its own evaluation. "Shadow mode" — where a model runs live but users don't see its predictions — was especially valued.
Engineers learned to tie evaluation to business outcomes (click-through rates, revenue, churn) rather than just accuracy or precision. As P11 put it: "Let's actually show the same business metrics that everyone else is held accountable to."
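Shadow mode, described above, can be sketched as a wrapper that always returns the live model's prediction while quietly logging what the candidate would have said. This is a minimal sketch under assumptions: both models are stand-in functions, and `serve` and `shadow_log` are hypothetical names.

```python
from typing import Callable, List, Tuple

Model = Callable[[float], int]


def production_model(x: float) -> int:
    """Stand-in for the model currently serving users (hypothetical)."""
    return int(x > 0.5)


def candidate_model(x: float) -> int:
    """Stand-in for the new model being evaluated in shadow (hypothetical)."""
    return int(x > 0.4)


shadow_log: List[Tuple[float, int, int]] = []


def serve(x: float, live: Model = production_model,
          shadow: Model = candidate_model) -> int:
    """Return the live prediction; record the shadow prediction on the side."""
    live_pred = live(x)
    try:
        shadow_pred = shadow(x)
        shadow_log.append((x, live_pred, shadow_pred))
    except Exception:
        # A failing shadow model must never affect what users see.
        pass
    return live_pred


for x in [0.1, 0.45, 0.6, 0.9]:
    serve(x)

disagreements = [row for row in shadow_log if row[1] != row[2]]
print(f"{len(disagreements)} of {len(shadow_log)} requests disagreed")
```

The disagreement log is what gets evaluated at this stage; users only ever see `live_pred`.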
Once a model is in production, sustaining it requires a mix of surprisingly simple strategies. Think of it like maintaining a car: regular oil changes, a spare tire in the trunk, and safety sensors.
Retraining cadences ranged from hourly (P18) to every few months (P17). No one used a scientific procedure to decide the frequency — they just matched it to what was operationally easy.
Always keep a previous model version ready to switch to. "If the production model drops, we fall back to the calibration model until someone fixes it."
Rule-based filters layered on top of models. A chatbot might filter out replies that mention specific times (the model doesn't actually know store hours). "We have a lot of filters."
Schema checks, bounds on values, completeness monitoring. Basic but essential — like checking that a recipe's ingredients haven't been swapped.
Many preferred tree-based models over deep learning for maintainability. "I can probably do the same with neural nets. But after deployment, it just doesn't make sense."
Like doctors on call, engineers took 1–2 week shifts monitoring production models, creating tickets for bugs, and writing incident reports.
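Two of the strategies above, fallback models and rule-based filters, are simple enough to sketch in a few lines. The accuracy floor, the regular expression, and the function names are assumptions made for illustration; the study does not give the participants' actual rules.

```python
import re
from typing import Optional

# Hypothetical accuracy floor below which the fallback model takes over.
ACCURACY_FLOOR = 0.80


def choose_model(live_accuracy: float) -> str:
    """Pick which model version should serve traffic."""
    return "production" if live_accuracy >= ACCURACY_FLOOR else "fallback"


# Matches clock times such as "9am", "9 am", "10:30", or "5:15 PM".
TIME_PATTERN = re.compile(
    r"\b\d{1,2}(:\d{2})?\s?(am|pm)\b|\b\d{1,2}:\d{2}\b", re.IGNORECASE
)


def filter_reply(reply: str) -> Optional[str]:
    """Drop replies that mention a specific time; pass the rest through."""
    if TIME_PATTERN.search(reply):
        return None
    return reply


print(choose_model(0.72))                        # fallback
print(filter_reply("We open at 9am tomorrow."))  # None
print(filter_reply("Happy to help with your order."))
```

A dropped reply would typically be replaced by a safe default answer; that part is left out here.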
Even seasoned engineers struggle with recurring headaches. The researchers organized these into four major themes, each representing a tension or synergy between the Three Vs.
Imagine practicing a speech in your quiet bedroom, then delivering it in a noisy stadium. The environment mismatch between development (where engineers experiment) and production (where models serve real users) is a constant source of bugs.
Data leakage: The model "cheats" by using information during training that won't exist when making real predictions. It's like studying with the answer key, then being surprised when you can't solve new problems.
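One way to guard against this kind of cheating is a point-in-time check: every feature value carries the time it became known, and anything recorded after the prediction was due is flagged. The sketch below assumes such timestamps are available; `FeatureValue` and `leaked_features` are hypothetical names.

```python
from datetime import datetime
from typing import List, NamedTuple


class FeatureValue(NamedTuple):
    name: str
    value: float
    observed_at: datetime  # when this value became known


def leaked_features(features: List[FeatureValue],
                    prediction_time: datetime) -> List[str]:
    """Return names of features recorded after the prediction was due."""
    return [f.name for f in features if f.observed_at > prediction_time]


prediction_time = datetime(2022, 3, 1, 12, 0)
row = [
    FeatureValue("num_purchases_30d", 4.0, datetime(2022, 3, 1, 9, 0)),
    FeatureValue("refund_issued", 1.0, datetime(2022, 3, 5, 16, 0)),
]
print(leaked_features(row, prediction_time))  # ['refund_issued']
```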
Engineers were split: some loved notebooks in production (easy debugging), others hated them (sloppy code, wrong inputs). Both sides had valid points.
Experimental code is messy — and that's fine when you're exploring. But the transition to production-grade code is fraught. Code review felt "not worth the effort" to many, yet skipping it invited bugs.
"Model developers don't follow software engineering practices — not because they're lazy, but because [those practices are] contradictory to the agility of analysis and exploration."
Tool opportunity: Virtualized ML environments where dev and production share the same foundation but support different iteration speeds, with discrepancies between the two tracked automatically.
Data errors aren't one thing — they exist on a spectrum from obvious to invisible, and each type demands a different response.
Hard errors: Swapped columns, negative ages, schema violations. Obvious and crash-worthy.
Soft errors: A few null values in a feature. The model still produces "reasonable" predictions, making these insidiously hard to catch.
Drift: Live data slowly diverges from training data — like fashion trends changing. Frequent retraining helped, but hand-curated features could silently corrupt.
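The first two categories above lend themselves to simple automated checks. The sketch below treats schema and bounds violations as hard errors and monitors the null rate of a column as a soft-error signal. The column names and the 5% threshold are assumptions chosen for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]

EXPECTED_COLUMNS = {"age", "income"}  # hypothetical schema
MAX_NULL_FRACTION = 0.05              # hypothetical soft-error threshold


def hard_errors(rows: List[Row]) -> List[str]:
    """Schema and bounds violations that should stop the pipeline."""
    errors = []
    for i, row in enumerate(rows):
        if set(row) != EXPECTED_COLUMNS:
            errors.append(f"row {i}: unexpected columns {sorted(row)}")
        elif row["age"] is not None and row["age"] < 0:
            errors.append(f"row {i}: negative age {row['age']}")
    return errors


def null_fraction(rows: List[Row], column: str) -> float:
    """Share of rows where the column is missing (a soft-error signal)."""
    if not rows:
        return 0.0
    return sum(row.get(column) is None for row in rows) / len(rows)


batch = [
    {"age": 34.0, "income": 52_000.0},
    {"age": -1.0, "income": 48_000.0},
    {"age": 29.0, "income": None},
    {"age": 41.0, "income": 61_000.0},
]
print(hard_errors(batch))
rate = null_fraction(batch, "income")
print(f"income null rate {rate:.0%}, alert={rate > MAX_NULL_FRACTION}")
```

Hard errors return a list that can fail the run outright, while the null rate is only a number to watch, which reflects how differently the two categories are handled.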
Engineers tracked so many metrics that false-positive alerts became overwhelming. P18 estimated that about 90% of alerts needed no immediate action, and most were simply ignored. One company even built a model to predict which alerts to act on!
"You typically ignore most alerts... I guess on record I'd say 90% of them aren't immediate."
Tool opportunity: Self-tuning alert systems that balance precision and recall for detecting real performance drops and automatically adjust thresholds as data drifts naturally.
Unlike traditional software where you can write test cases for known categories of bugs, ML bugs form a "long tail" — a vast collection of rare, seemingly one-of-a-kind issues. Debugging is ad-hoc: "I just poked around until I figured it out."
Example bugs from interviews:
While the bugs were unique, their symptoms were not. A big gap between offline and live accuracy almost always pointed to data issues. And the debugging method was consistently "slice and dice" — cutting data by different customer groups or categories to find where the model fails.
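The "slice and dice" method boils down to recomputing a metric per group and looking at the worst groups first. Here is a minimal sketch with made-up records; the slice names and the `accuracy_by_slice` helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (slice key, true label, predicted label). Hypothetical data.
records: List[Tuple[str, int, int]] = [
    ("enterprise", 1, 1), ("enterprise", 0, 0), ("enterprise", 1, 1),
    ("small_business", 1, 0), ("small_business", 0, 0),
    ("consumer", 1, 1), ("consumer", 0, 1), ("consumer", 0, 1),
]


def accuracy_by_slice(rows: List[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute accuracy separately for each slice key."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for key, label, prediction in rows:
        total[key] += 1
        correct[key] += int(label == prediction)
    return {key: correct[key] / total[key] for key in total}


# Worst slices first, since that is where to start digging.
by_slice = accuracy_by_slice(records)
for key, acc in sorted(by_slice.items(), key=lambda kv: kv[1]):
    print(f"{key:15s} {acc:.2f}")
```

An overall accuracy of 62.5% on this toy data hides the fact that one slice is perfect and another is right only a third of the time.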
After enough trauma from bespoke bugs, engineers developed a sense of paranoia — obsessively checking code, even when everything looked fine.
"ML bugs don't get caught by tests... and just silently cause slight reductions in performance. This is why you need to be paranoid when you're writing ML code."
Tool opportunity: Break the chicken-and-egg problem: precise performance monitoring → map drops to bugs → categorize the long tail into actionable groups.
End-to-end staged deployments could take months. During that time, user behavior changes, business priorities shift, and initially promising ideas fizzle out.
"Metrics keep on rotating based on the company's priorities. Is it the revenue? The total installs? Or clicks? They keep on changing with the company's roadmap."
Tool opportunity: Streamline multi-stage deployments, minimize wasted work, and help practitioners predict the end-to-end value of experiment ideas before committing months of effort.
Beyond pain points, the researchers identified anti-patterns — common behaviors that seem reasonable but lead to trouble.
Multiple engineers said their education left them unprepared for production ML. "I learned a lot of data science in school, but none of it was quite like all these things," said P7. Key production skills — monitoring, debugging drift, managing deployment stages — were learned entirely on the job.
"Keeping GPUs warm" is the compulsion to run as many experiments as possible just because the computational resources exist. It's like cooking every recipe in a cookbook simultaneously: you run out of attention long before you run out of stove burners.
The better approach? Guided search over random search. Focus on one idea per week instead of five in parallel. "Developer time and energy is the limiting reagent," not GPU hours.
"You can be overly concerned with keeping your GPUs warm, so much so that you don't actually think deeply about what the highest-value experiment is."
Engineers would find something that works, deploy it, and then construct an elegant-sounding explanation for why it works. Sometimes these explanations are useful for team alignment and customer trust — but they can also be misleading.
"People just try everything and then backfit some nice-sounding explanation for why it works."
Critical pipeline knowledge lives in people's heads, not documentation. When those people leave, their pipelines become black boxes. "Some of our models are pretty old and not well documented, so I don't have great expectations for what they should be doing," said P17.
The root cause: high velocity creates many versions, making documentation perpetually outdated. "We learn faster than we can document."
Think of the MLOps tool ecosystem like a building. Each floor serves a different purpose, and changes happen at very different speeds depending on which floor you're on: fastest at the top, rare but high-impact at the bottom.
A key insight for tool builders: most day-to-day changes happen at the Run Layer (tweaking hyperparameters, running experiments). Engineers modify Dockerfiles only occasionally. Don't build tools for the wrong layer!
Engineers preferred tools that provided "10x" improvements in one of the Three Vs: experiment trackers boosted velocity, feature stores improved versioning, and staging platforms improved validation.
The researchers coded every interview transcript, identifying recurring themes. Here are the top topics by how many different interviews mentioned them — not just raw frequency, but breadth across participants.
Total coded segments: 1,766. Unique codes: 600. The coding was done using MaxQDA software with both top-down (theory-driven) and bottom-up (data-driven) approaches. Two researchers coded independently, then met to discuss and reconcile.
Interview transcripts ranged from ~400 to ~650 sentences each, with 40–160 coded segments per transcript. Codes were organized into hierarchies and analyzed for co-occurrence (codes appearing within 20 sentences of each other).
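The co-occurrence analysis can be pictured as counting pairs of codes whose sentences fall within a 20-sentence window. The sketch below is one possible reading of that definition, not the MaxQDA procedure itself; it assumes the window is inclusive and that only pairs of distinct codes count, and the data is invented.

```python
from collections import Counter
from itertools import combinations
from typing import List, Tuple

WINDOW = 20  # sentences, following the study's co-occurrence definition

# (sentence index, code) pairs for one transcript. Hypothetical data.
segments: List[Tuple[int, str]] = [
    (12, "velocity"), (18, "debugging"), (30, "validation"),
    (95, "velocity"), (110, "validation"), (300, "versioning"),
]


def co_occurrences(coded: List[Tuple[int, str]],
                   window: int = WINDOW) -> Counter:
    """Count pairs of distinct codes whose sentences lie within the window."""
    counts: Counter = Counter()
    for (pos_a, code_a), (pos_b, code_b) in combinations(coded, 2):
        if code_a != code_b and abs(pos_a - pos_b) <= window:
            counts[tuple(sorted((code_a, code_b)))] += 1
    return counts


pairs = co_occurrences(segments)
for pair, count in pairs.most_common():
    print(pair, count)
```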
If you walked away from this page and had to explain this research over coffee, here's what to say:
Three things matter most: speed of iteration, catching errors early, and keeping track of every version. These three forces are often in tension, and optimizing one can undermine another.
The 90% "failure rate" isn't a crisis — it's the natural cost of experimentation. What matters is failing fast and cheap, not failing less.
Static test sets aren't enough. Validation datasets should evolve, evaluation should be standardized, and ML metrics must tie to business value.
Frequent retraining, fallback models, rule-based guardrails, and on-call rotations keep AI systems reliable. Nobody is using cutting-edge techniques to handle drift — they just retrain more often.
Alert fatigue, tribal knowledge, the industry-classroom gap, and the cognitive overload of managing parallel experiments — these are people problems dressed as engineering problems.