The 'Devin' Aftermath: How AI Software Engineers Are Actually Being Used in Production


When Cognition Labs announced Devin AI in March 2024, the pitch was extraordinary: a fully autonomous AI software engineer that could plan, write, debug, and deploy code end to end. Fourteen months later, autonomous coding looks more fragmented and less autonomous than those demos promised.
Table of Contents
- The Devin Hype vs. the Devin Reality
- What Devin (and Its Competitors) Can Actually Do Today
- Real Production Use Cases: What Teams Are Actually Doing
- The Metrics That Matter: Production Data from AI Agent Adoption
- The Failure Modes Nobody Talks About
- A Practical Framework for Integrating AI Agents Into Your Team
- What This Means for Software Engineers
The Devin Hype vs. the Devin Reality
Cognition Labs' initial claims were anchored on SWE-bench performance. The company reported that Devin resolved 13.86% of issues end to end when evaluated unassisted on a random 25% subset of the SWE-bench test set, a number that far exceeded any prior system at the time. When comparing this figure to other systems, note which SWE-bench variant (Lite, Full, or Verified) and which sampling methodology is being used, as the denominator and difficulty differ significantly. The demo showed Devin working inside its own sandboxed environment, spinning up a shell, browsing documentation, writing code, running tests, and submitting pull requests, all from a single natural language prompt.
The backlash was swift. Independent developers attempted to reproduce the demonstrated tasks and published detailed debunking videos. Several of the Upwork examples turned out to be significantly simpler than portrayed, and the SWE-bench results, while real, came with caveats about task selection and evaluation methodology that Cognition's marketing materials glossed over. Hacker News threads and Reddit discussions dissected the demos frame by frame, identifying cases where the agent's apparent reasoning was closer to pattern matching than genuine problem-solving.
Where things stand now: Devin entered general availability in late 2024 at $500/month for teams. Meanwhile, GitHub shipped Copilot agent mode integrated directly into VS Code and GitHub's web interface; Google launched Jules as a code-generation agent with Gemini underpinnings; Cursor deepened its agent capabilities within its IDE fork; and OpenAI released Codex as an asynchronous coding agent operating within ChatGPT's ecosystem (note: this is a distinct product from OpenAI's original Codex model). The market fragmented fast, and the conversation has shifted from "Can an AI write code?" to "Under what conditions is autonomous code generation actually worth deploying?"
What Devin (and Its Competitors) Can Actually Do Today
Devin's Core Architecture and Workflow
Devin operates inside a fully sandboxed cloud environment that includes its own shell, code editor, and web browser. Engineers interact with it through Slack integration or direct IDE connections, assigning tasks asynchronously. A typical workflow involves posting a natural language description of a task (often linked to a Jira or GitHub issue), after which Devin generates a plan, executes that plan step by step, runs tests if a test suite exists, and submits a pull request for human review.
The term "autonomous" deserves scrutiny. In practice, Devin's autonomy means it can iterate through multiple attempts without a human in the loop, but it still operates within a single session context. As of mid-2025, it does not maintain long-term memory across sessions, and its ability to reason about a codebase is bounded by what it can load into its context window during a given session. Think of it as a capable but amnesiac contractor who needs fresh onboarding every time.
The Competitive Field in Mid-2025
| Feature | Devin | GitHub Copilot Agent Mode | Cursor Agent | Google Jules | OpenAI Codex |
|---|---|---|---|---|---|
| Autonomy Level | Full session autonomy | Task-scoped within IDE | Multi-file agent edits | Async issue resolution | Async task completion |
| Environment | Sandboxed cloud VM | GitHub.com, VS Code, Visual Studio | Local IDE (Cursor fork) | Cloud-based | Cloud-based (ChatGPT) |
| Shell/Browser Access | Yes | Limited | Terminal access | Yes | Yes |
| Integration | Slack, IDE, GitHub | GitHub-native | IDE-native | GitHub | ChatGPT, API |
| Pricing | $500/mo (team) | Included in Copilot Enterprise | Cursor Pro ($20/mo+) | Free tier + paid | ChatGPT Plus/Pro |
| Best Suited For | Async ticket resolution | In-flow code assistance | Complex multi-file edits | Issue-level fixes | Prototyping, exploration |
Pricing as of mid-2025; verify current pricing on each vendor's website before making purchasing decisions.
The market fragmented quickly because autonomous coding occupies a spectrum rather than a single product category. Teams optimizing for deep IDE integration gravitate toward Cursor or Copilot. Teams wanting fully hands-off async task completion lean toward Devin or Jules. The tools are converging on similar capabilities, but the workflow assumptions embedded in each product remain distinct.
Real Production Use Cases: What Teams Are Actually Doing
Automated Bug Triage and Low-Complexity Fixes
Multiple engineering teams have documented workflows where autonomous agents handle well-defined bug tickets pulled directly from backlogs. The pattern is consistent: a ticket with clear reproduction steps and a narrow scope gets assigned to an agent, which generates a PR that a human engineer reviews.
A typical Devin-generated fix for a null-check issue in a TypeScript codebase might look like this:
```typescript
// NotFoundError must be imported from your project's error module; e.g.:
import { NotFoundError } from './errors';

// Before: agent-identified bug — unhandled null in user lookup
async function getUserProfile(userId: string): Promise<{ name: string; email: string }> {
  const user = await db.users.findOne({ id: userId });
  return { name: user.name, email: user.email }; // throws if user is null
}

// After: Devin-generated fix with error handling
async function getUserProfile(userId: string): Promise<{ name: string; email: string }> {
  let user;
  try {
    user = await db.users.findOne({ id: userId });
  } catch (err) {
    throw new Error(`Database error looking up user ${userId}: ${(err as Error).message}`);
  }
  if (!user) {
    throw new NotFoundError(`User not found: ${userId}`);
  }
  return { name: user.name, email: user.email };
}
```
The fix itself is clean. In practice, human reviewers report adjusting an estimated 30–40% of such PRs (based on anecdotal team reports; no controlled study is cited here), most commonly to align error handling patterns with team conventions (for example, switching from a generic Error to a project-specific NotFoundError class, as shown above), or to add missing edge cases the agent did not consider. The success rate drops sharply when tickets lack precise reproduction steps or involve cross-service dependencies.
Migration and Refactoring Tasks
Framework upgrades, dependency bumps, and API migration boilerplate represent a sweet spot for autonomous agents. These tasks are repetitive, pattern-heavy, and span many files, exactly the kind of work that exhausts human attention but maps well to an agent's tireless consistency.
Where agents break down: migrations that require understanding deep domain context, such as a payment processing flow where field renames carry regulatory implications. The agent will mechanically update the code without understanding that renaming tax_id to vat_number has compliance consequences in EU-facing services.
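The mechanical nature of that failure is easy to sketch. `applyRenames` and the rename map below are hypothetical illustrations of what an agent effectively does across hundreds of files, not any vendor's actual migration tooling:

```javascript
// Hypothetical sketch of a mechanical, agent-style rename pass.
function applyRenames(source, renameMap) {
  let result = source;
  for (const [oldName, newName] of Object.entries(renameMap)) {
    // Word-boundary match so identifiers like "tax_id_legacy" are left alone
    result = result.replace(new RegExp(`\\b${oldName}\\b`, 'g'), newName);
  }
  return result;
}

const renamed = applyRenames(
  'const invoice = { tax_id: user.tax_id };',
  { tax_id: 'vat_number' }
);
// The rename is textually correct, but nothing in this loop knows whether
// vat_number carries different regulatory meaning in EU-facing services.
```

The code illustrates the point: the transformation is trivially consistent, which is why agents excel at it, and entirely semantics-blind, which is why a human must own the compliance review.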
Test Generation and Documentation
Generating tests for untested legacy code is one of the most commonly cited production uses. Consider an untested utility function and the agent-generated test suite:
```javascript
// Utility function (existing, untested)
function parseQueryParams(url) {
  const params = {};
  const withoutHash = url.split('#')[0];
  const queryString = withoutHash.split('?')[1];
  if (!queryString) return params;
  queryString.split('&').forEach(pair => {
    if (!pair) return; // skip empty segments from && or trailing &
    const eqIndex = pair.indexOf('=');
    if (eqIndex === -1) {
      // key-only param: ?flag
      const key = decodeURIComponent(pair);
      if (key) params[key] = '';
      return;
    }
    const rawKey = pair.slice(0, eqIndex);
    const rawValue = pair.slice(eqIndex + 1);
    if (!rawKey) return; // skip =value pairs with no key
    try {
      const key = decodeURIComponent(rawKey);
      const value = decodeURIComponent(rawValue);
      params[key] = value;
    } catch {
      // malformed percent-encoding: skip silently or handle per app policy
    }
  });
  return params;
}
```
```javascript
// Agent-generated Jest test suite
describe('parseQueryParams', () => {
  test('parses single parameter', () => {
    expect(parseQueryParams('https://example.com?foo=bar'))
      .toEqual({ foo: 'bar' });
  });
  test('parses multiple parameters', () => {
    expect(parseQueryParams('https://example.com?a=1&b=2'))
      .toEqual({ a: '1', b: '2' });
  });
  test('returns empty object for URL without query string', () => {
    expect(parseQueryParams('https://example.com'))
      .toEqual({});
  });
  test('handles empty values', () => {
    expect(parseQueryParams('https://example.com?key='))
      .toEqual({ key: '' });
  });
  test('last value wins for duplicate keys', () => {
    // This exposes the function's silent last-value-wins behavior
    expect(parseQueryParams('https://example.com?a=1&a=2'))
      .toEqual({ a: '2' });
  });
  test('value containing = sign is preserved in full', () => {
    expect(parseQueryParams('https://example.com?data=a=b=c'))
      .toEqual({ data: 'a=b=c' });
  });
  test('hash fragment is excluded from last param value', () => {
    expect(parseQueryParams('https://example.com?foo=bar#section'))
      .toEqual({ foo: 'bar' });
  });
  test('malformed percent-encoding does not throw', () => {
    expect(() => parseQueryParams('https://example.com?a=%ZZ')).not.toThrow();
  });
  test('empty segment from double ampersand is ignored', () => {
    const result = parseQueryParams('https://example.com?a=1&&b=2');
    expect(result).toEqual({ a: '1', b: '2' });
    expect(Object.keys(result)).not.toContain('undefined');
  });
  test('key-only parameter (no = sign) is parsed as empty string value', () => {
    expect(parseQueryParams('https://example.com?flag'))
      .toEqual({ flag: '' });
  });
});
```
The generated tests cover the happy path and a range of edge cases, including duplicate keys (the function silently takes the last value), values containing = characters, hash fragments after query strings, malformed percent-encoding, and key-only parameters. Agent-generated test suites consistently provide a useful starting scaffold but rarely achieve the adversarial thinking that catches real production bugs without human guidance to expand coverage.
Prototyping and Throwaway Code
Teams report high satisfaction using agents for rapid proof-of-concept work, internal tools, and one-off scripts that will never ship to production as-is. The "disposable code" philosophy makes agent output more acceptable because the bar for correctness is lower and the iteration speed matters more.
"We use Devin for internal dashboards and data migration scripts. Nobody expects production-grade code. We expect a working starting point that saves four hours of boilerplate. It delivers on that consistently."
— Senior backend engineer, mid-stage SaaS startup (via Hacker News thread, April 2025)
That "four hours" figure is worth contextualizing. It aligns roughly with the 25–45 minute savings per task reported across multiple teams, extrapolated across a multi-component internal tool where the agent handles dozens of boilerplate files in a single session.
The Metrics That Matter: Production Data from AI Agent Adoption
A note on the data in this section: no controlled studies of AI agent productivity exist as of mid-2025. The figures below are drawn from informal, self-reported data in public engineering discussions (Hacker News, engineering blogs, X/Twitter threads). Treat all numbers as directional estimates, not benchmarks.
PR Acceptance Rates and Revision Cycles
Approximately 20–30% of agent-generated PRs for well-scoped tasks merge with no significant revisions. Another 40–50% merge after one round of human feedback. The remaining 20–30% are either substantially rewritten or closed entirely.
For comparison, junior developer PRs at similar companies typically merge after 1.2–1.8 review cycles on average. Agent-generated PRs average 1.5–2.3 review cycles, placing them roughly at junior-developer level for well-defined tasks but worse for anything requiring architectural judgment.
Time Savings vs. Supervision Overhead
The net time calculation is more nuanced than vendors acknowledge. Teams report saving 25–45 minutes per well-scoped bug fix or test generation task. However, prompt crafting, session monitoring, and reviewing agent output adds 10–20 minutes of overhead per task. The net gain is real but modest: roughly 15–30 minutes per task for suitable work.
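The arithmetic above is worth making explicit. The per-task numbers below are the midpoints of the self-reported ranges in this section; the sprint throughput and agent-suitable fraction are hypothetical assumptions for illustration:

```javascript
// Worked version of the net-savings arithmetic. All figures are midpoints
// of self-reported ranges or illustrative assumptions, not measured values.
const savedPerTaskMin = (25 + 45) / 2;    // 35 min saved on a suitable task
const overheadPerTaskMin = (10 + 20) / 2; // 15 min prompting + review overhead
const netPerTaskMin = savedPerTaskMin - overheadPerTaskMin; // 20 min net

// Over a sprint, only the agent-suitable fraction of tickets counts
const ticketsPerSprint = 40;        // hypothetical team throughput
const agentSuitableFraction = 0.25; // hypothetical share of well-scoped tasks
const netPerSprintHours =
  (ticketsPerSprint * agentSuitableFraction * netPerTaskMin) / 60;
// roughly 3.3 hours per sprint: real, but far from replacing an engineer
```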
"The 'babysitting tax' is real. For complex tasks, I spend more time steering the agent than I would have spent writing the code myself. But for the boring, repetitive stuff, the math works out."
— Full-stack developer at a Series B fintech company (via X/Twitter thread, May 2025)
Code Quality Indicators
Engineers anecdotally report defect rates in agent-generated code at roughly 1.5–2x higher than senior-developer-authored code for equivalent task complexity. These figures come from public engineering discussions and should not be treated as controlled study results. Linting and static analysis pass rates are generally high (agents produce syntactically clean code), but semantic correctness issues slip through automated gates. Agents introduce technical debt that shows up as overly verbose solutions, redundant null checks, or patterns that work but diverge from a codebase's established conventions.
The Failure Modes Nobody Talks About
The "Confident Hallucination" Problem
The most dangerous failure mode is not obviously broken code. It is plausible, well-structured code that does the wrong thing. Agents regularly generate calls to API methods that do not exist in the version of a library the project uses. They fabricate configuration options. They produce code that looks correct on casual review and even passes basic tests, but contains subtle logical errors that only surface under specific runtime conditions.
This is categorically more dangerous than a syntax error or a failed build, because it passes the first line of defense (automated checks and quick human review) and reaches production.
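A minimal hypothetical illustration of the pattern (this specific function and its bug are invented for this article, not taken from any agent's actual output): clean, plausible code that passes a casual review and a happy-path test, yet silently misbehaves on a legal input.

```javascript
// Hypothetical "confident hallucination": reads cleanly, passes the obvious test.
function applyDiscount(price, discountPercent) {
  // Bug: `||` treats a legitimate 0% discount as "missing" and substitutes
  // the 10% default. The correct operator here is `??` (nullish coalescing).
  const pct = discountPercent || 10;
  return price * (1 - pct / 100);
}

applyDiscount(200, 25); // → 150, so the happy-path test passes
applyDiscount(200, 0);  // → 180, silently charging a customer who was owed 0%
```

Nothing about this code fails a linter, a type check, or a basic test suite; only a reviewer who deliberately probes boundary values catches it.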
Context Window and Codebase Scale Limitations
Agent performance degrades on large monorepos, though no controlled benchmarks quantify the drop. When a codebase exceeds what the agent can hold in its context window, it begins losing track of architectural decisions, type definitions from distant modules, and cross-cutting concerns. Teams working with monorepos exceeding 500K lines of code (roughly corresponding to several hundred thousand tokens depending on language and whitespace density; verify against your model's context limit) report failure rates roughly double those seen on smaller, well-scoped repositories.
The common workaround is "fresh sessions" scoped to narrow directories or modules, but this forces humans to decompose the architecture up front, undermining the autonomy promise.
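That up-front decomposition can be approximated with a back-of-envelope check before opening a session. The chars-per-token heuristic and the 200k budget below are rough assumptions; substitute your model's documented limit:

```javascript
// Sketch of the pre-session sizing check teams do by hand: does this
// directory plausibly fit the agent's context budget? The ~4 chars/token
// heuristic and the 200k budget are assumptions, not model guarantees.
function fitsContextBudget(files, budgetTokens = 200_000) {
  const estimatedTokens = files.reduce(
    (sum, f) => sum + Math.ceil(f.sizeChars / 4),
    0
  );
  return { estimatedTokens, fits: estimatedTokens <= budgetTokens };
}

// Hypothetical module sizes in characters
const billingModule = [
  { path: 'billing/invoice.ts', sizeChars: 48_000 },
  { path: 'billing/tax.ts', sizeChars: 112_000 },
];
fitsContextBudget(billingModule); // ~40,000 estimated tokens, fits
```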
Security and Compliance Blind Spots
Agents regularly introduce patterns that would fail a security audit. Consider this Express.js handler:
```javascript
// EXAMPLE OF VULNERABLE CODE — DO NOT USE IN PRODUCTION
// Agent-generated: subtle SQL injection vulnerability
app.get('/api/users', async (req, res) => {
  const { role } = req.query;
  const users = await db.query(`SELECT * FROM users WHERE role = '${role}'`);
  res.json(users);
});
```
The agent-generated version above concatenates user input directly into a SQL query, a textbook injection vector. Below is the corrected version with parameterized queries, input validation, and error handling:
```javascript
// Corrected: parameterized query with input validation and error handling
// PostgreSQL (pg driver) syntax; use ? placeholder for MySQL or SQLite
app.get('/api/users', async (req, res) => {
  const { role } = req.query;
  const VALID_ROLES = ['admin', 'user', 'moderator']; // expand as needed
  if (!role || !VALID_ROLES.includes(role)) {
    return res.status(400).json({ error: 'Invalid or missing role parameter' });
  }
  try {
    const result = await db.query(
      'SELECT * FROM users WHERE role = $1',
      [role]
    );
    return res.json(result.rows);
  } catch (err) {
    console.error({ msg: 'DB query failed', route: '/api/users', err });
    return res.status(500).json({ error: 'Internal server error' });
  }
});

// MySQL (mysql2) or SQLite (better-sqlite3) equivalent:
// const result = await db.query('SELECT * FROM users WHERE role = ?', [role]);
```
Most SAST tools flag this specific pattern, but subtler variants (such as using template literals with an ORM that does not auto-parameterize) can evade basic scanners. Not all teams have those gates in place. In compliance-sensitive environments (healthcare, finance), autonomous code generation introduces audit trail complications that go beyond code quality into regulatory territory.
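The reason a template literal defeats parameterization is mechanical: the interpolation happens in JavaScript before the driver ever sees the string. A mock driver makes this visible (`query` here is a stand-in for illustration, not any real ORM's API):

```javascript
// Mock driver: records what it would actually send to the database.
function query(sql, params = []) {
  return { sql, params }; // a real driver transmits these separately
}

const role = "admin' OR '1'='1"; // attacker-controlled input

// Unsafe: by the time query() runs, the payload is already baked into the SQL
const unsafe = query(`SELECT * FROM users WHERE role = '${role}'`);
// unsafe.sql now contains: ...WHERE role = 'admin' OR '1'='1'

// Safe: the SQL stays a constant string; the value travels out-of-band
const safe = query('SELECT * FROM users WHERE role = $1', [role]);
// safe.sql never contains the payload; it arrives only in safe.params
```

This is also why scanners struggle with the ORM variant: the template literal looks identical in both cases, and only the call signature reveals whether interpolation happened before or after the driver boundary.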
A Practical Framework for Integrating AI Agents Into Your Team
Step 1: Define Your Agent-Ready Task Taxonomy
Not every ticket belongs in an agent's queue. A practical classification:
Tasks with high agent suitability include bug fixes with clear repro steps, test generation for well-isolated functions, dependency version bumps, boilerplate scaffolding, and documentation updates.
Medium suitability covers refactoring with clear patterns, API migrations with straightforward mappings, and internal tool prototypes.
Low suitability applies to features requiring domain expertise, performance-critical optimizations, and cross-service integrations.
Some tasks should never go to an agent: security-sensitive authentication/authorization logic, financial calculation code, or anything touching PII handling without human authorship for audit purposes.
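The taxonomy above can be encoded as a simple routing function so triage is consistent rather than ad hoc. The label names and tiers below are illustrative assumptions, not a standard:

```javascript
// Sketch of the task taxonomy as a routing function. Label names are
// hypothetical; map them to your tracker's actual labels.
const SUITABILITY = {
  high: ['bug-with-repro', 'test-generation', 'dependency-bump',
         'boilerplate', 'docs'],
  medium: ['pattern-refactor', 'api-migration', 'internal-prototype'],
  never: ['auth', 'financial-calculation', 'pii'],
};

function agentSuitability(ticketLabels) {
  // The hard block wins regardless of any other label on the ticket
  if (ticketLabels.some(l => SUITABILITY.never.includes(l))) return 'never';
  if (ticketLabels.some(l => SUITABILITY.high.includes(l))) return 'high';
  if (ticketLabels.some(l => SUITABILITY.medium.includes(l))) return 'medium';
  return 'low'; // default: a human triages anything unclassified
}

agentSuitability(['bug-with-repro']);          // → 'high'
agentSuitability(['dependency-bump', 'auth']); // → 'never'
```

The ordering matters: checking the exclusion list first means a dependency bump that happens to touch auth code still gets a human author.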
Step 2: Establish Review and Guardrail Protocols
Agent-generated PRs need stricter automated gates than human-authored code. Here is a GitHub Actions workflow that triggers additional security scanning and coverage checks on agent-labeled PRs:
Prerequisites: This workflow assumes your repository has a Node.js project with a package-lock.json (replace npm ci with yarn install --frozen-lockfile or pnpm install --frozen-lockfile if using a different package manager), Jest configured as the test runner, and GitHub Advanced Security enabled (required for CodeQL; free for public repos, paid for private repos). PRs must be labeled agent-generated for the workflow to trigger — enforce this labeling convention on your team or the guardrail is silently bypassed.
```yaml
name: Agent PR Guardrails

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  agent-checks:
    if: contains(github.event.pull_request.labels.*.name, 'agent-generated')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - name: Install dependencies
        run: npm ci
      - name: Run full test suite with coverage
        run: npm test -- --coverage --passWithNoTests
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: javascript
      - name: Autobuild
        uses: github/codeql-action/autobuild@v3
      - name: SAST security scan
        uses: github/codeql-action/analyze@v3
      - name: Check for hardcoded secrets
        env:
          TRUFFLEHOG_VERSION: "3.78.0"
          TRUFFLEHOG_SHA256: "REPLACE_WITH_OFFICIAL_SHA256_FOR_THIS_VERSION"
        run: |
          curl -sSfL \
            "https://github.com/trufflesecurity/trufflehog/releases/download/v${TRUFFLEHOG_VERSION}/trufflehog_${TRUFFLEHOG_VERSION}_linux_amd64.tar.gz" \
            -o trufflehog.tar.gz
          echo "${TRUFFLEHOG_SHA256}  trufflehog.tar.gz" | sha256sum --check
          tar -xzf trufflehog.tar.gz trufflehog
          chmod +x trufflehog
          ./trufflehog filesystem ./ --fail --no-update
```
Note on TruffleHog: Replace REPLACE_WITH_OFFICIAL_SHA256_FOR_THIS_VERSION with the checksum published on the TruffleHog releases page for the pinned version. Pinning to a specific version with hash verification prevents supply-chain attacks from a compromised install script and ensures the --fail flag (a v3-specific feature) behaves as expected.
Note on coverage thresholds: Rather than passing --coverageThreshold as a CLI argument (which is prone to shell-quoting issues), define your thresholds in jest.config.js. Jest reads this file automatically when running npm test -- --coverage; the build fails if thresholds are not met:
```javascript
// jest.config.js
module.exports = {
  coverageThreshold: {
    global: {
      branches: 80,
      functions: 80,
      lines: 80,
      statements: 80,
    },
  },
};
```
Important: If your project defines Jest configuration in package.json instead of jest.config.js, move the coverageThreshold block into the "jest" key in package.json, or ensure jest.config.js takes precedence. Without coverageThreshold in the active Jest config, the coverage step reports numbers but never fails the build.
Beyond CI/CD gates, establish human review checklists specific to agent output: verify that imported modules actually exist at the pinned version, confirm error handling matches team conventions, and check that no new dependencies were added without justification.
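The first checklist item, verifying that imported modules actually exist, can itself be partially automated. The sketch below uses a regex over source text for brevity; a production version would parse imports with the TypeScript compiler API, and the package names in the fixture are hypothetical:

```javascript
// Sketch of an "imports actually exist" check against package.json deps.
function findUnknownImports(source, dependencies) {
  const importRe = /from\s+['"]([^'"./][^'"]*)['"]/g; // skips relative paths
  const unknown = [];
  for (const match of source.matchAll(importRe)) {
    const spec = match[1];
    // Map scoped or deep imports like '@scope/pkg/sub' to the package name
    const pkg = spec.startsWith('@')
      ? spec.split('/').slice(0, 2).join('/')
      : spec.split('/')[0];
    if (!(pkg in dependencies)) unknown.push(spec);
  }
  return unknown;
}

// Hypothetical agent-authored diff with one hallucinated package
const agentPr = `
import { z } from 'zod';
import { leftPad } from 'left-pad-pro';
import { helper } from './local/helper';
`;
findUnknownImports(agentPr, { zod: '^3.23.0' }); // → ['left-pad-pro']
```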
Step 3: Measure and Iterate
From day one, track: PR merge rate without revision, average review cycles, net time saved per task category, and post-merge defect rate for agent-generated code. Run a weekly 15-minute retrospective focused solely on agent usage. Expand the scope of agent-eligible tasks only when the data supports it. Pull back immediately when defect rates spike or review overhead exceeds time savings.
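Those four metrics reduce to a few lines of aggregation over PR records. The fixture below is hypothetical; in practice the records come from your Git host's API:

```javascript
// Sketch of the tracking loop over agent-labeled PR records.
function agentMetrics(prs) {
  const merged = prs.filter(p => p.merged);
  const cleanMerges = merged.filter(p => p.reviewCycles <= 1).length;
  const avgCycles =
    merged.reduce((s, p) => s + p.reviewCycles, 0) / (merged.length || 1);
  return {
    mergeRate: merged.length / prs.length,       // did it land at all?
    cleanMergeRate: cleanMerges / prs.length,    // landed without rework
    avgReviewCycles: avgCycles,                  // supervision overhead proxy
  };
}

agentMetrics([
  { merged: true, reviewCycles: 1 },
  { merged: true, reviewCycles: 3 },
  { merged: false, reviewCycles: 2 }, // closed without merging
]);
// mergeRate ≈ 0.67, cleanMergeRate ≈ 0.33, avgReviewCycles = 2
```

Post-merge defect rate needs a separate join against incident or bug-report data, which is why it lags the other three and is the easiest to skip; resist skipping it, since it is the metric that catches confident hallucinations.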
What This Means for Software Engineers
The Shifting Role: From Code Writer to Code Reviewer and System Architect
The engineers extracting the most value from autonomous agents are not the ones delegating the hardest problems. They decompose work into agent-suitable chunks, review AI output with the same rigor they would apply to a junior developer's code, and make architectural decisions that agents cannot. The senior engineer's value increasingly lies in judgment, system design, and the ability to spot the confident hallucination that passes every automated check.
Junior engineers should learn agent-assisted workflows now. The emerging skills (designing prompts, managing sessions, evaluating output) are becoming as fundamental as code review or understanding build tooling.
The 12-Month Outlook
The trajectory points toward longer context windows, multi-agent orchestration (where one agent writes code and another reviews it), and deeper IDE-native autonomy that reduces the friction of session management. Gemini's 2M-token context window already allows agents to ingest far larger codebases in a single session; scaling that to full-repo context across competing models is plausible within 12 months. None of this eliminates the need for human engineers. It shifts where human time is most valuable: away from typing code, toward designing systems and verifying correctness.
Autonomous agents handle the mechanical, well-defined, low-risk work. Humans handle everything that requires taste, context, and accountability. The gap between those two categories is narrowing, but it is not closing. Teams that build their workflows around this division, rather than expecting full automation or dismissing the tools entirely, are the ones reporting measurable productivity gains on the order of 15–30 net minutes per suitable task.