Testing AI-Generated Code: Strategies That Actually Work


How to Test AI-Generated Code
- Write failing tests that define expected behavior before prompting the AI for code.
- Generate the implementation by passing your spec to the AI as a constrained prompt.
- Run unit tests immediately and record every failure the AI's first pass produces.
- Add property-based tests with fast-check to surface edge cases you didn't anticipate.
- Validate UI and integration points with Playwright end-to-end smoke tests.
- Review the output with an adversarial AI prompt targeting hallucinated APIs and off-by-one errors.
- Fix failures, re-run the full suite, and confirm all tests pass before committing.
- Gate merges in CI so property-based and E2E test failures block deployment automatically.
Table of Contents
- Prerequisites
- The Testing Dilemma: Code You Didn't Write (and Don't Fully Understand)
- Why Standard Testing Isn't Enough for AI Code
- Strategy 1: Write Tests Before You Prompt
- Strategy 2: Property-Based Testing to Catch What You Didn't Think Of
- Strategy 3: End-to-End Validation with Playwright
- Strategy 4: Use AI to Test AI (Adversarial Review Prompts)
- The AI Code Testing Checklist
- Putting It All Together: A Realistic Workflow
- Key Takeaways
Prerequisites
The examples in this article use the following tools and minimum versions. Pin these in your project to ensure reproducibility:
- Node.js ≥ 18 LTS
- Jest ≥ 29
- fast-check ≥ 3
- Playwright ≥ 1.40
All code samples use CommonJS (require / module.exports). If your project sets "type": "module" in package.json, replace require with import and module.exports with export.
To run the examples, create a project directory and install the required dependencies:
```shell
mkdir ai-testing-demo && cd ai-testing-demo
npm init -y
npm install --save-dev jest fast-check @playwright/test
npx playwright install
```
Create a minimal jest.config.js in the project root:
```javascript
// jest.config.js
module.exports = {
  testEnvironment: 'node',
  roots: ['<rootDir>'],
};
```
All source and test files should live in the same directory (the project root) unless you adjust the roots configuration above.
The Testing Dilemma: Code You Didn't Write (and Don't Fully Understand)
AI code from tools like GitHub Copilot and Claude Code arrives syntactically polished, well-commented, and structured in ways that look production-ready. That's exactly what makes it dangerous. It doesn't signal its weaknesses the way a junior developer's pull request does, with messy formatting or uncertain naming. The output is confidently wrong in ways that resist casual review.
The failure modes differ from typical human bugs. AI models commonly hallucinate methods that don't exist in current library versions. They produce plausible-but-wrong logic that handles the happy path flawlessly while silently breaking on edge cases. They reference outdated API signatures drawn from stale training data. What follows is not theory but a practical, tool-backed workflow for catching these failures before they reach production.
Why Standard Testing Isn't Enough for AI Code
The Overconfidence Problem
When code arrives well-formatted and idiomatically correct, teams assume it works. The "vibe coding" testing trap is real: the code "looks right," so the instinct to rigorously verify it diminishes. Standard code review habits evolved for human-written code, where structural messiness often correlates with logical errors. AI-generated code breaks that heuristic. The structure is always clean. The logic may not be.
Common AI Code Failure Patterns
AI-generated code fails in four recurring ways:
- Hallucinated function signatures. The AI references a method that existed in a previous major version of a library, or never existed at all. These calls compile and read correctly but throw at runtime.
- Off-by-one and boundary errors. Loop bounds, array slicing, and range checks are frequently wrong by one position. These errors rarely surface in happy-path manual testing because they only trigger at the edges of input ranges.
- Happy-path-only logic. The generated code handles typical inputs correctly but fails on empty strings, null values, Unicode characters, or unexpectedly large datasets.
- Outdated patterns from stale training data. The AI uses deprecated APIs or superseded configuration approaches because its training data predates the current release.
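The second pattern is easy to see in miniature. Here is a hypothetical pagination helper of the kind AI assistants routinely produce, written as a sketch purely to illustrate the boundary mistake (getPageBroken and getPageFixed are illustrative names, not from any library):

```javascript
// Intent: return page `n` (0-indexed) of `size` items.
function getPageBroken(items, n, size) {
  const start = n * size;
  // BUG: the end index should be start + size, not start + size - 1,
  // so every page silently drops its last item.
  return items.slice(start, start + size - 1);
}

function getPageFixed(items, n, size) {
  const start = n * size;
  return items.slice(start, start + size);
}

console.log(getPageBroken([1, 2, 3, 4], 0, 2)); // [ 1 ]     (item 2 lost)
console.log(getPageFixed([1, 2, 3, 4], 0, 2));  // [ 1, 2 ]
```

A spot check with typical inputs would likely miss this: pages still render, they are just one item short.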
Strategy 1: Write Tests Before You Prompt
Test-First Prompting with Jest
The most effective defense against AI code failures is inverting the typical workflow. Instead of prompting the AI for an implementation and then writing tests, define the expected behavior in test cases first. This anchors the AI's output to verifiable specifications and immediately exposes gaps. It transforms the AI from an autonomous code author into a constrained implementation engine that must satisfy a pre-existing contract.
Consider a parseQueryString utility function. The test suite below defines behavior for standard inputs, empty strings, duplicate keys, encoded characters, keys without values, values containing = characters, leading ? characters, and malformed percent-encoding, all written before you prompt for an implementation:
```javascript
// parseQueryString.test.js
const { parseQueryString } = require('./parseQueryString');

describe('parseQueryString', () => {
  test('parses simple key-value pairs', () => {
    expect(parseQueryString('name=alice&age=30')).toEqual({ name: 'alice', age: '30' });
  });

  test('returns empty object for empty string', () => {
    expect(parseQueryString('')).toEqual({});
  });

  test('handles duplicate keys by keeping the last value', () => {
    expect(parseQueryString('color=red&color=blue')).toEqual({ color: 'blue' });
  });

  test('decodes URI-encoded characters', () => {
    expect(parseQueryString('greeting=hello%20world')).toEqual({ greeting: 'hello world' });
  });

  test('handles keys with no value', () => {
    expect(parseQueryString('active&verbose')).toEqual({ active: '', verbose: '' });
  });

  test('handles values containing = characters', () => {
    expect(parseQueryString('token=abc%3Ddef')).toEqual({ token: 'abc=def' });
  });

  test('strips leading ? from query string', () => {
    expect(parseQueryString('?name=alice')).toEqual({ name: 'alice' });
  });

  test('does not throw on malformed percent-encoding', () => {
    expect(() => parseQueryString('x=%GG')).not.toThrow();
  });

  test('handles value containing multiple = signs (e.g. JWT)', () => {
    expect(parseQueryString('token=abc%3Ddef%3Dghi')).toEqual({ token: 'abc=def=ghi' });
  });

  test('handles empty key (=value)', () => {
    const result = parseQueryString('=orphan');
    expect(result['']).toBe('orphan');
  });
});
```
Now, here is a typical first-pass AI implementation, the kind that looks correct but fails under scrutiny:
```javascript
// parseQueryString.js — BROKEN first-pass (DO NOT USE)
// ⚠️ This example is intentionally broken to illustrate common AI failures.
function parseQueryString(qs) {
  if (!qs) return {};
  return qs.split('&').reduce((acc, pair) => {
    const [key, value] = pair.split('=');
    // BUG 1: 'value' is undefined when no '=' is present (e.g., 'active').
    // decodeURIComponent(undefined) coerces it to the string 'undefined',
    // so the caller silently receives { active: 'undefined' } instead of ''.
    // BUG 2: split('=') splits on every '=', and destructuring discards
    // everything after the second element, silently truncating values
    // that contain '=' (e.g., base64 strings, JWTs).
    // BUG 3: no try/catch, so malformed percent-encoding such as '%GG'
    // throws an uncaught URIError and crashes the caller.
    acc[decodeURIComponent(key)] = decodeURIComponent(value);
    return acc;
  }, {});
}

module.exports = { parseQueryString };
```
A typical first-pass AI implementation fails these tests in three ways. First, const [key, value] = pair.split('=') leaves value as undefined when no = is present, and decodeURIComponent(undefined) coerces it to the string 'undefined', so 'active' parses to { active: 'undefined' } instead of { active: '' }. Second, split('=') splits on every = and the destructuring discards everything after the second element, silently truncating values containing =. Third, there is no guard around decodeURIComponent, so malformed percent-encoding like '%GG' throws an uncaught URIError and crashes the caller. The test-first approach catches all three failures immediately.
Here is the corrected implementation that passes all tests, including guards against malformed percent-encoding and leading ? characters:
```javascript
// parseQueryString.js — CORRECTED
function parseQueryString(qs) {
  if (!qs) return {};
  if (qs.startsWith('?')) qs = qs.slice(1);
  return qs.split('&').reduce((acc, pair) => {
    const eqIdx = pair.indexOf('=');
    const rawKey = eqIdx === -1 ? pair : pair.slice(0, eqIdx);
    const rawVal = eqIdx === -1 ? '' : pair.slice(eqIdx + 1);
    try {
      acc[decodeURIComponent(rawKey)] = decodeURIComponent(rawVal);
    } catch (e) {
      // Malformed percent-encoding (e.g., '%GG'): store raw value
      // to avoid silent data loss or uncaught URIError
      acc[rawKey] = rawVal;
    }
    return acc;
  }, {});
}

module.exports = { parseQueryString };
```
The key differences: using indexOf('=') and slice instead of split('=') preserves everything after the first = in the value, correctly handling base64 strings, JWTs, and other values that contain =. Stripping a leading ? prevents it from becoming part of the first key. The try/catch around decodeURIComponent prevents malformed percent-encoding (e.g., %GG) from crashing the process with an uncaught URIError.
You can verify the corrected implementation with a quick sanity check:
```shell
node -e "
const {parseQueryString} = require('./parseQueryString');
const cases = [
  ['name=alice&age=30', JSON.stringify({name:'alice',age:'30'})],
  ['active&verbose', JSON.stringify({active:'',verbose:''})],
  ['token=abc%3Ddef', JSON.stringify({token:'abc=def'})],
  ['?greeting=hello%20world', JSON.stringify({greeting:'hello world'})],
  ['x=%GG', 'no throw'],
];
cases.forEach(([input, expected]) => {
  try {
    const r = JSON.stringify(parseQueryString(input));
    const pass = expected === 'no throw' || r === expected;
    console.log(pass ? 'PASS' : 'FAIL', input, '->', r, '(expected', expected + ')');
  } catch(e) {
    console.log('FAIL', input, '->', e.message);
  }
})"
```
Expected output (all lines begin PASS):
```
PASS name=alice&age=30 -> {"name":"alice","age":"30"} (expected {"name":"alice","age":"30"})
PASS active&verbose -> {"active":"","verbose":""} (expected {"active":"","verbose":""})
PASS token=abc%3Ddef -> {"token":"abc=def"} (expected {"token":"abc=def"})
PASS ?greeting=hello%20world -> {"greeting":"hello world"} (expected {"greeting":"hello world"})
PASS x=%GG -> {"x":"%GG"} (expected no throw)
```
Strategy 2: Property-Based Testing to Catch What You Didn't Think Of
Why Property-Based Testing Is Ideal for AI Code
Manual test cases only cover scenarios the developer anticipates. Property-based testing flips this constraint by defining invariants that must hold across randomly generated inputs: 100 runs per property by default in fast-check, configurable via the { numRuns } option. For testing AI-generated code this is invaluable, because the developer cannot anticipate every edge case the AI silently mishandled.
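The underlying idea can be sketched in plain Node without any library, to make clear what fast-check automates. This hand-rolled version (checkProperty is a hypothetical helper, not a fast-check API) runs an invariant against many random inputs; what it omits is shrinking, where fast-check reduces a failing input to a minimal counterexample:

```javascript
// Hand-rolled property check (sketch): assert an invariant over many
// random inputs. fast-check does this plus shrinking of failing inputs.
function checkProperty(genInput, invariant, runs = 100) {
  for (let i = 0; i < runs; i++) {
    const input = genInput();
    if (!invariant(input)) {
      throw new Error(`Property failed for input: ${JSON.stringify(input)}`);
    }
  }
}

// Generator: random integer arrays of length 0-19.
const randomIntArray = () =>
  Array.from({ length: Math.floor(Math.random() * 20) }, () =>
    Math.floor(Math.random() * 1000) - 500
  );

// Invariant: sorting never changes an array's length.
checkProperty(
  randomIntArray,
  (arr) => Array.from(arr).sort((a, b) => a - b).length === arr.length,
  1000 // equivalent in spirit to fc.assert(prop, { numRuns: 1000 })
);
console.log('property held for 1000 random inputs');
```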
Implementing with fast-check
Install fast-check:
```shell
npm install --save-dev fast-check
```
The fast-check library integrates directly with Jest. The following tests validate an AI-generated sortNumbers function by asserting properties across integers, floats (including Infinity), and NaN handling. A related pitfall worth knowing: spreading a large array into a function call, as in Math.max(...arr), throws a RangeError once the array exceeds the engine's argument limit; copying with Array.from(arr) or an array-literal spread like [...arr] is safe, since both iterate rather than pass arguments.
First, create the implementation file (or replace with your AI-generated version):
```javascript
// sortNumbers.js
function sortNumbers(arr) {
  return Array.from(arr).sort((a, b) => a - b);
}

module.exports = { sortNumbers };
```
Then the property-based tests:
```javascript
// sortNumbers.test.js
const fc = require('fast-check');
const { sortNumbers } = require('./sortNumbers');

// Floats without NaN; Infinity and -Infinity can still be generated.
const floatNoNaN = fc.float({ noNaN: true, noDefaultInfinity: false });

describe('sortNumbers property-based tests', () => {
  test('output length always equals input length', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        expect(sortNumbers(arr)).toHaveLength(arr.length);
      })
    );
  });

  test('output is always sorted in ascending order (integers)', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        const sorted = sortNumbers(arr);
        for (let i = 1; i < sorted.length; i++) {
          expect(sorted[i]).toBeGreaterThanOrEqual(sorted[i - 1]);
        }
      })
    );
  });

  test('output is sorted for floats, including Infinity (NaN excluded)', () => {
    fc.assert(
      fc.property(fc.array(floatNoNaN), (arr) => {
        const sorted = sortNumbers(arr);
        for (let i = 1; i < sorted.length; i++) {
          expect(sorted[i]).toBeGreaterThanOrEqual(sorted[i - 1]);
        }
      })
    );
  });

  test('does not mutate the original array', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        const copy = [...arr];
        sortNumbers(arr);
        expect(arr).toEqual(copy);
      })
    );
  });

  test('NaN values in input do not cause silent data loss', () => {
    fc.assert(
      fc.property(fc.array(fc.integer()), (arr) => {
        const withNaN = [...arr, NaN];
        const sorted = sortNumbers(withNaN);
        expect(sorted).toHaveLength(withNaN.length);
        expect(sorted.some(Number.isNaN)).toBe(true);
      })
    );
  });
});
```
The mutation test is critical: it captures the input array before sorting and asserts it is unchanged after the call. Without it, a mutating sort implementation (using arr.sort() instead of Array.from(arr).sort()) silently passes the first two tests. The float and NaN tests are equally important: fc.integer() alone never generates NaN, Infinity, or -Infinity, leaving the most dangerous numeric edge cases untested. Note that NaN ordering in Array.prototype.sort with (a, b) => a - b is implementation-defined. V8 typically places NaN at the end, but this is not guaranteed by the spec. The NaN test above verifies that values are not silently dropped, regardless of their final position.
An integration test verifies that sortNumbers handles a large array without throwing and returns the correct extremes:
```javascript
// sortNumbers.integration.test.js
const { sortNumbers } = require('./sortNumbers');

test('sortNumbers handles large array without RangeError', () => {
  const large = Array.from({ length: 100_000 }, (_, i) => 100_000 - i);
  let result;
  expect(() => { result = sortNumbers(large); }).not.toThrow();
  expect(result[0]).toBe(1);
  expect(result[result.length - 1]).toBe(100_000);
});
```
When AI-generated sorting functions use custom comparators incorrectly (a common failure pattern where a - b is replaced with string comparison), fast-check surfaces the failure with a minimal counterexample. With default configuration (100 runs), this typically completes in under a second on modern hardware.
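The comparator failure is easy to reproduce by hand. JavaScript's default sort coerces elements to strings, so a generated sortNumbers that omits the numeric comparator passes on small same-magnitude inputs and fails as soon as magnitudes mix, which is exactly the kind of input fast-check generates (sortNumbersBroken below is an illustrative sketch of the failure, not code from any tool):

```javascript
// A broken variant: omitting the comparator falls back to string comparison.
function sortNumbersBroken(arr) {
  return Array.from(arr).sort(); // BUG: lexicographic order, not numeric
}

console.log(sortNumbersBroken([1, 2, 3])); // [ 1, 2, 3 ]  -- looks fine
console.log(sortNumbersBroken([2, 10]));   // [ 10, 2 ]    -- '10' < '2' as strings
```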
Strategy 3: End-to-End Validation with Playwright
When Unit Tests Aren't Enough
AI-generated UI components and API integration code frequently pass unit tests while failing at integration points. A form component might render correctly in isolation but submit malformed data. An API client might construct valid requests to an endpoint that no longer exists. E2E coverage is where automated testing of AI output proves its value.
Quick E2E Smoke Tests for AI-Generated Features
Install Playwright:
```shell
npm install --save-dev @playwright/test
npx playwright install
```
Configure baseURL, timeouts, and a webServer block in playwright.config.js so that the dev server starts automatically during testing. Without the webServer block, page.goto('/contact') will fail with net::ERR_CONNECTION_REFUSED in CI environments where no server is pre-running:
```javascript
// playwright.config.js
const { defineConfig } = require('@playwright/test');

module.exports = defineConfig({
  timeout: 10000,
  retries: 1,
  reporter: 'list',
  use: {
    baseURL: 'http://localhost:3000',
  },
  webServer: {
    command: 'npm run dev',
    url: 'http://localhost:3000',
    reuseExistingServer: !process.env.CI,
    timeout: 30000,
  },
});
```
⚠️ Important: The webServer.command above assumes your project has a dev script in package.json (e.g., "dev": "node server.js" or "dev": "next dev"). Adjust the command to match your project's dev server. In CI, reuseExistingServer is false, so Playwright will start and stop the server automatically.
The following Playwright test validates an AI-generated contact form by checking rendering, client-side validation, and successful submission:
```javascript
// contact.spec.js
const { test, expect } = require('@playwright/test');

test('AI-generated contact form validates and submits', async ({ page }) => {
  await page.goto('/contact');

  // Verify the form renders all expected fields
  await expect(page.getByLabel('Email')).toBeVisible();
  await expect(page.getByLabel('Message')).toBeVisible();

  // Submit empty form — guards against missing validation logic
  await page.getByRole('button', { name: 'Submit' }).click();
  await expect(page.getByText('Email is required')).toBeVisible();

  // Fill valid data and submit
  await page.getByLabel('Email').fill('test@example.com');
  await page.getByLabel('Message').fill('Integration test message');
  await page.getByRole('button', { name: 'Submit' }).click();

  // Wait for network idle to resolve async submission before asserting success state
  await page.waitForLoadState('networkidle');
  await expect(page.getByText('Message sent')).toBeVisible();
});
```
Each assertion targets a specific AI failure mode: missing field rendering, absent validation logic, and broken submission handlers. The waitForLoadState('networkidle') call after the final click prevents a race condition where the success message assertion runs before the async form handler completes. Without it, the test passes or fails non-deterministically depending on backend response time.
Strategy 4: Use AI to Test AI (Adversarial Review Prompts)
Prompting Claude Code or Copilot to Critique Its Own Output
After generating code, prompt the same or a different AI model to act as a hostile reviewer. The key is specificity. Fill in the actual package name and version from your package.json. A sample adversarial prompt template:
"Review this code as a senior engineer. Specifically check for: hallucinated methods that don't exist in [library name, e.g., axios] v[version, e.g., 1.7], off-by-one errors in all loops and array operations, missing null/undefined checks, and any deprecated API usage. List every issue with line numbers."
This technique supplements but cannot replace deterministic test execution. It catches hallucinated APIs and deprecated method usage reliably, but routinely misses off-by-one logic errors and other systematic reasoning failures. Always verify flagged issues against actual documentation.
Generating Test Cases with AI, Then Running Them Independently
AI models can generate edge-case test suites covering null, boundary, type-coercion, and Unicode inputs in under a minute of prompting. The critical rule: use AI to generate test cases but execute them in a standard CI pipeline. Never trust AI to both generate and validate. Generate and verify in separate steps.
The AI Code Testing Checklist
- ☐ Write failing tests before prompting for implementation
- ☐ Run property-based tests with randomized inputs
- ☐ Verify all imported methods/APIs actually exist in current library versions
- ☐ Execute E2E tests on any AI-generated UI or integration code
- ☐ Use adversarial AI prompts to review generated code
- ☐ Check edge cases: empty inputs, nulls, boundary values, Unicode, large datasets
- ☐ Diff AI output against your project's linting and type-checking rules
- ☐ Never ship AI-generated code with only AI-generated tests; include at least one human-written assertion
- ☐ Review AI code line-by-line before committing (treat it like a junior dev's PR)
- ☐ Track AI-generated code bugs separately to identify recurring failure patterns
Putting It All Together: A Realistic Workflow
The Test-Generate-Verify Loop
The integrated workflow follows a tight loop: write specification tests (unit and property-based), prompt the AI for implementation, run the full test suite (unit, property-based, and E2E), perform adversarial AI review, then commit. For the parseQueryString example, the full write-tests, generate, fix, re-run cycle took about three minutes. The overhead scales with E2E scope, but it prevents hours of debugging production issues caused by silently broken AI output.
CI Integration Tips
Property-based tests and E2E tests belong in the CI pipeline alongside traditional unit tests. Without automated gates, teams ship AI-generated code without verification, especially in fast-moving projects where skipping review becomes the default. Add a dedicated test:ai script in package.json that runs property-based and E2E suites as a separate CI stage, so failures block the merge.
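A minimal sketch of that gating setup. The script names and test selection are assumptions to adapt to your project; the point is that the property-based and E2E suites form their own stage whose nonzero exit blocks the merge:

```json
{
  "scripts": {
    "test:unit": "jest parseQueryString",
    "test:ai": "jest sortNumbers && playwright test"
  }
}
```

In a CI pipeline, run test:unit and test:ai as separate steps so a property-based or E2E failure is reported distinctly rather than buried in one aggregate job.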
Key Takeaways
- Test-first prompting constrains AI output to verifiable specifications and catches failures immediately.
- Property-based testing surfaces unknown unknowns by stress-testing with randomized inputs (100 by default; configurable higher).
- Never trust AI to validate its own output without independent execution in a real test runner and CI pipeline.