Recently, METR published a randomized controlled trial showing that AI tooling (Cursor Pro with Claude 3.5/3.7 Sonnet) slowed experienced open-source maintainers by 19%. For those familiar with practical AI adoption in complex codebases, this result wasn't surprising.
Once you examine the study setup, the outcome becomes easier to understand. The participants were highly experienced maintainers working in deeply familiar, mature repos: environments where generalized AI suggestions are rarely a clean fit. No ramp-up time, limited task diversity, outdated models, and surface-level integration all contributed to a scenario where AI was bound to underperform.
That said, this kind of outcome doesn't reflect what's possible when AI assistants are rolled out with structure and intention. At DataArt, we've seen dramatic productivity gains in real-world projects, especially when the setup includes documentation, context, and a deliberate onboarding phase. This post explores what might go wrong, why some AI rollouts stall, and what patterns have emerged from setups that consistently deliver results.
Why the METR Study Was Bound to Show Slowdowns
Worst-Case Task Selection
The METR trial results make more sense when you look at the task setup. Developers were working in deeply familiar, highly customized codebases, the kind where experienced maintainers rely on years of accumulated knowledge.
In these environments, AI suggestions rarely land well right away. Critical context often lives in the developer's head: custom build flags, test harness quirks, undocumented conventions. None of it is visible to the model unless it's been clearly documented or included in training. Without that scaffolding, the AI is left to guess.
Missing Ramp-Up Periods
What was missing was the preparation that makes success possible. Two ramp-ups matter:
- Ramp-up for the developer — learning how to work with AI, write better prompts, and integrate it naturally into workflow
- Ramp-up for the agent — either by refining the model (like Amazon Q Developer Customization), or by unlocking knowledge from your repo through documentation
Tools like Cursor aren't plug-and-play. In the trial, most developers had completed only ~20 AI-assisted tasks. That's early days. There was no structured onboarding, no prompt refinement, and no context setup.
At DataArt, we've learned it takes longer. A few quick tries don't do much. But once AI becomes part of the daily flow — after two to three weeks of steady use — things shift. That's when you begin to recognize the assistant's strengths, guide it more effectively, and see it adapt to your codebase.
Outdated Models & Low Acceptance Rates
The experiment used Claude 3.5/3.7 Sonnet, models that have since been surpassed by newer releases like Opus 4 and agentic tools like Claude Code. The trial saw a 39% acceptance rate on suggestions — meaning over 60% of the AI's output was rejected or required heavy editing. That kind of friction adds up.
At DataArt, we often use agents in auto-accept mode — but only once they've structured the problem and scoped the task properly. In those cases, the assistant iterates and tests autonomously with high success rates, particularly for peripheral or "safe" code paths.
Incentive & Measurement Biases
The study design also introduced subtle distortions. With an hourly pay structure and no incentive to move faster, some participants reportedly waited for AI output while checking email. AI wasn't integrated — just available.
Results were based mostly on task duration, which might miss other important effects like scaffolding, architectural improvements, or reduced mental load. At DataArt, we use AI not just to write code, but to generate runbooks, maintain documentation, and explain tricky system parts — all of which save time across the team, even if they don't speed up an individual commit.
Real-World Example: What Onboarding Actually Looks Like
At DataArt, we build data pipelines using declarative configurations in AILA — DataArt's AI Lake Accelerator — to keep things modular, reusable, and simple. When I first asked an agent to help configure a new pipeline, I expected it to understand that pattern.
Instead, it gave me something completely off-track: a mix of new Python and Terraform code. Useful in theory, but it undermined the whole point of our low-code, config-driven approach. It basically said: "write new code from scratch."
That's when it became clear: the agent needed better context.
So I followed OpenAI's prompt engineering guide and built a meta-agent prompt that rewrote my inputs into clearer, best-practice instructions. I used it to generate documentation for our repo by combining code comments, architecture diagrams, and platform principles spread across different sources.
Then, I created a prompt that required the agent to "read" this documentation first, before answering. And so, no more generic code dumps. The agent started generating accurate, declarative JSON configurations aligned with our standards.
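The docs-first pattern can be sketched as a small prompt builder. This is an illustrative sketch, not DataArt's actual tooling: the `docs/` folder, file names, and prompt wording are all assumptions, and any LLM client can consume the resulting string.

```python
from pathlib import Path

# Illustrative context files; real names depend on your repo.
ONBOARDING_DOCS = ["architecture.md", "pipeline-conventions.md", "config-schema.md"]

def build_docs_first_prompt(task: str, docs_dir: str = "docs") -> str:
    """Prepend onboarding docs so the agent reads them before answering."""
    sections = []
    for name in ONBOARDING_DOCS:
        path = Path(docs_dir) / name
        if path.exists():
            sections.append(f"## {name}\n{path.read_text()}")
    context = "\n\n".join(sections)
    return (
        "Read the project documentation below before answering. "
        "Follow its conventions; prefer declarative JSON configuration "
        "over writing new Python or Terraform code.\n\n"
        f"{context}\n\n# Task\n{task}"
    )
```

The key design choice is that the documentation travels with every request, so the agent never answers from generic training priors alone.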
After a few iterations, we layered in validation checks and self-reflection steps. The agent became part of the workflow. Not just a code generator, but a thinking partner. See how it works in our demo video.
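The validate-and-retry loop described above can be sketched like this. Everything here is an assumption for illustration: `agent` stands in for any LLM call, and `REQUIRED_KEYS` is a toy schema, not AILA's actual configuration format.

```python
import json

REQUIRED_KEYS = {"source", "destination", "transformations"}  # toy schema

def validate_config(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the config passed."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    missing = REQUIRED_KEYS - config.keys()
    return [f"missing keys: {sorted(missing)}"] if missing else []

def generate_with_reflection(agent, task: str, max_rounds: int = 3) -> str:
    """Ask the agent for a config; feed validation errors back until it passes."""
    prompt = task
    for _ in range(max_rounds):
        raw = agent(prompt)
        problems = validate_config(raw)
        if not problems:
            return raw
        # Self-reflection step: show the agent what failed and ask for a fix.
        prompt = f"{task}\nYour last answer failed validation: {problems}. Fix it."
    raise RuntimeError("agent did not produce a valid config")
```

The point of the loop is that the agent's mistakes become structured feedback instead of silent failures, which is what turns a code generator into a workflow participant.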
The Missing Piece: Structured Agent Onboarding
This is where most successful AI rollouts diverge from failed ones. Before anyone expects value from an AI assistant, there needs to be a foundation, one that mirrors how a senior hire would be onboarded.
At DataArt, we call this "agent onboarding," and it's become essential to our AI workflow. Here's what that foundation looks like:
Low-Level Documentation (Code & API)
- Module responsibilities, extension points, data models
- Run/test instructions: configs, test harnesses, dependencies
- Coding conventions, formatting, naming patterns, refactor idioms
Mid-Level (Components & Flows)
- Feature guides with component interaction flows
- Sequence diagrams for key request/response cycles
- Configuration references: flags, environments, presets
High-Level (Architecture & Domain)
- System overview and business context
- Domain-driven events and state boundaries
- Rationale for design choices, known pitfalls
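The three layers above can live in a single context file the assistant loads at the start of every session. The skeleton below is purely illustrative — the module names, commands, and conventions are placeholders for whatever your repo actually contains:

```markdown
# agent.md — assistant context (illustrative skeleton)

## Low-level (code & API)
- Modules: `ingest/` reads sources, `transform/` applies declarative rules
- Run tests: `make test` (requires the `dev` config preset)
- Conventions: snake_case, no new top-level dependencies without review

## Mid-level (components & flows)
- Feature guides with component interaction flows
- Sequence diagrams for key request/response cycles

## High-level (architecture & domain)
- System overview, business context
- Rationale for design choices, known pitfalls
```

Keeping this file versioned next to the code means it evolves through the same reviews as the code it describes.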
Even the best engineers take weeks or months to ramp up in unfamiliar code. AI agents are no different. They require context, examples, conventions, and time. The teams seeing real productivity gains are the ones who front-load this investment, and then compound it by keeping it alive.
DOs & DON'Ts for AI-Assisted Development
Setup & Ramp-Up
✅ DO
- Create an AI Onboarding Bundle: README.md, design docs, prompt libraries. Keep it versioned and close to your code
- Allocate 1–2 weeks of agent ramp-up: refine prompts, test suggestions, build trust
- Train or customize your model when possible (e.g., Amazon Q Developer Customization) to reflect your domain and coding patterns
❌ DON'T
- Drop AI into an undocumented repo and expect instant results
- Expect productivity boosts on day one
- Assume generic models will perform well on domain-specific or high-context tasks out of the box
Usage & Integration
✅ DO
- Use auto-accept or autonomous loops for scoped, low-risk tasks (e.g., UI scaffolding, test helpers)
- Mirror human onboarding: use checklists, buddy reviews, and quick-start guides
- Share and evolve prompt libraries across the team to reduce friction and improve consistency
❌ DON'T
- Blindly trust AI output for core business logic without review checkpoints
- Forget that AI agents, like humans, need structure and repetition to perform well
- Hardcode prompts into private IDEs — they'll disappear with the developer
Knowledge & Documentation
✅ DO
- Keep docs alive: maintain Claude.md, agent.md, inline comments, and session summaries that the assistant can reference
- Co-locate context files with your codebase: architecture guides, environment configs, key decisions
❌ DON'T
- Treat docs as one-time deliverables — stale docs confuse both humans and agents
- Expect AI to infer system design or domain knowledge without access to it
Measurement & Expectations
✅ DO
- Measure long-term impact: bug rate, review speed, code clarity, onboarding time
- Give AI workflows time to settle and scale. Track the curve over weeks, not hours
❌ DON'T
- Rely only on stopwatch-style task duration metrics
- Ignore the learning phase for both the human and the tool
The Bottom Line
AI agents and assistants do work, even at this early stage of the technology, but only when they're treated as part of the team, not as a magical shortcut.
At DataArt, the biggest gains come from structure: clear documentation, intentional onboarding, feedback loops, and thoughtful task design. The teams seeing consistent impact are the ones willing to invest upfront, stay disciplined, and give both the developer and the assistant time to learn.
There's no instant magic, but with the right setup, the payoff grows fast.