Debuggability in Local AI: Profiling WebGPU Memory Usage

Table of Contents
- The Hidden Cost of Local AI in the Browser
- Why WebGPU Memory Leaks Are Different from JavaScript Memory Leaks
- Tooling Setup: Chrome DevTools and chrome://gpu
- Profiling Session Walkthrough: Finding the Leak
- Automated Leak Detection in CI
- Testing the Tracking and Disposal Logic
- Key Takeaways and Checklist
The Hidden Cost of Local AI in the Browser
Running large language models directly in the browser through WebGPU has moved from experimental novelty to practical reality. Projects like web-llm and transformers.js with WebGPU backends now let developers ship inference capabilities without server-side infrastructure. But there is a quiet problem lurking in single-page applications that load and unload these models: WebGPU memory profiling reveals that GPU buffers allocated for model weights frequently survive teardown. After as few as five route changes, ~1 GiB of GPU memory can end up pinned, silently building until the tab crashes with an out-of-memory error.
This matters because GPU memory operates under fundamentally different rules than JavaScript heap memory. The garbage collector that quietly cleans up dereferenced objects on the JS side has no jurisdiction over GPUBuffer and GPUTexture allocations. Those must be explicitly released. When developers debug local AI workloads and overlook this distinction, the result is a staircase pattern of retained memory that degrades rendering performance and eventually triggers device loss.
This tutorial provides a concrete, reproducible workflow for diagnosing and fixing these memory leaks using Chrome DevTools internals, chrome://gpu diagnostics, buffer tracking patterns, and automated CI assertions. The examples below were tested with Chrome 125 on Windows 11 with an NVIDIA RTX 3060 (6 GiB VRAM), using Puppeteer 22. If you are using different versions, some flag names and API calls may differ.
Why WebGPU Memory Leaks Are Different from JavaScript Memory Leaks
The GPUBuffer Lifecycle
A GPUBuffer follows a strict lifecycle: createBuffer() allocates GPU-side memory, you use the buffer in compute or render passes, and .destroy() releases that allocation. This is explicit resource management, not garbage collection.
The critical nuance: dereferencing a GPUBuffer in JavaScript (letting it fall out of scope, nulling the variable) does not free the underlying GPU memory. The GC will eventually collect the JS wrapper object, but the GPU-side allocation persists until you call .destroy() or the GPUDevice itself is lost. Conversely, holding a JS reference to an already-destroyed buffer is perfectly safe; the JS object just becomes inert.
This asymmetry catches developers who rely on the GC for cleanup.
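The explicit-release discipline can be wrapped in a small helper. Here is a minimal sketch -- withBuffer and stubDevice are illustrative names, not WebGPU APIs; the stub stands in for a real GPUDevice so the pattern can run outside the browser:

```javascript
// Guarantee .destroy() runs even if the work using the buffer throws.
// withBuffer and stubDevice are illustrative helpers, not WebGPU APIs;
// the stub stands in for a GPUDevice so the pattern runs outside a browser.
function withBuffer(device, descriptor, fn) {
  const buffer = device.createBuffer(descriptor);
  try {
    return fn(buffer);
  } finally {
    buffer.destroy(); // explicit release -- the GC never does this for you
  }
}

// Stub device that counts live allocations
const stubDevice = {
  live: 0,
  createBuffer(desc) {
    this.live += 1;
    return { label: desc.label, destroy: () => { stubDevice.live -= 1; } };
  },
};

withBuffer(stubDevice, { label: "scratch", size: 4096 }, (buf) => {
  // ... use buf in a compute or render pass ...
});
console.log(`Live buffers after use: ${stubDevice.live}`); // 0
```

The try/finally guarantees the GPU-side allocation is released even when the work using the buffer throws, which is exactly the guarantee the garbage collector cannot provide.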
SPA-Specific Risks
Single-page applications amplify the problem in several ways. Route changes that unmount a model inference component often re-initialize on the new route without tearing down the previous session's buffers. The result compounds during development: each HMR cycle may allocate fresh buffers while old ones remain pinned. When multiple inference sessions share a single GPUDevice, the accumulated unreleased buffers from prior sessions eat into the device's memory budget.
Here is a minimal reproduction of the leak pattern:
async function simulateLeakyModelLoad() {
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("WebGPU not supported");
const device = await adapter.requestDevice();
const buffers = [];
// Simulate loading model weights into GPU buffers
try {
for (let i = 0; i < 50; i++) {
const buffer = device.createBuffer({
label: `weight-layer-${i}`,
size: 1024 * 1024 * 4, // 4 MiB per buffer
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
buffers.push(buffer);
}
} catch (err) {
// Destroy any buffers allocated before the failure
for (const b of buffers) b.destroy();
throw new Error(
`Buffer allocation failed at index ${buffers.length}: ${err.message}`
);
}
console.log(`Allocated ${buffers.length} buffers (~200 MiB GPU memory)`);
// Simulate SPA navigation: buffers array goes out of scope
// but .destroy() is never called on any buffer.
// GPU memory remains allocated.
}
// Call this 3-5 times to observe the staircase pattern
simulateLeakyModelLoad();
Each invocation allocates roughly 200 MiB of GPU memory that is never reclaimed. After five "navigations," approximately 1 GiB of GPU memory may be pinned. Some drivers reclaim memory under pressure, but the general staircase pattern holds across common desktop configurations (tested on Chrome 125 / Windows 11 / NVIDIA RTX 3060 6 GiB VRAM). A note on the try/catch above: the WebGPU spec reports out-of-memory and validation failures through error scopes (pushErrorScope('out-of-memory')) rather than as guaranteed synchronous exceptions, though implementations may still throw for some inputs (for example, a RangeError for sizes they cannot represent). The try/catch is therefore defensive: if createBuffer does throw, any buffers allocated before the failure are destroyed rather than leaked.
Tooling Setup: Chrome DevTools and chrome://gpu
Enabling WebGPU Developer Features
Navigate to chrome://flags/#enable-webgpu-developer-features and enable the flag. This unlocks enhanced error messages, validation layers, and additional diagnostic data in DevTools. After restarting Chrome (version 113 or later is required for stable WebGPU support), open chrome://gpu and scroll to the "Video Memory" and "WebGPU" sections (the exact section label may vary by Chrome version; look for GPU memory statistics). Use these as ground-truth baselines during profiling -- numbers that are otherwise invisible from JavaScript.
DevTools Performance Panel Configuration
In the Performance panel, enable the "GPU" lane by clicking the gear icon and checking the GPU checkbox. This surfaces GPU task timing alongside main-thread activity. The Memory panel's "Allocation instrumentation on timeline" mode correlates JS object allocation timestamps (including GPUBuffer wrappers) with specific function calls, making it possible to trace exactly when and where buffers are created.
Using the GPUDevice.lost Promise as a Canary
The device.lost promise resolves when the GPU device becomes unavailable -- this can happen from an OOM condition, a driver-level reset, or when device.destroy() is called intentionally (in which case info.reason will be "destroyed"). Wiring this up early provides a signal that memory pressure has become critical, but you should gate error reporting on info.reason !== 'destroyed' to avoid false alarms during normal cleanup.
The tracked-device helper below exposes its buffer registry on window for automated testing. To enable this in development, set window.__DEV__ = true before your application scripts load; in production builds, ensure the flag is absent or false so the registry is never exposed.
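For bundler-based setups, the flag can instead be defined at build time. A sketch assuming a Vite config (webpack's DefinePlugin is the equivalent):

```javascript
// vite.config.js -- sketch, assuming a Vite-based build. Vite's define
// option statically replaces the expression at build time, so the
// registry exposure is compiled out of production bundles entirely.
const config = {
  define: {
    "window.__DEV__": JSON.stringify(process.env.NODE_ENV !== "production"),
  },
};
module.exports = config; // or `export default config` in an ESM config
```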
async function createTrackedDevice() {
const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw new Error("No WebGPU adapter available");
// Clamp requested limits to what the adapter actually supports.
// requestDevice() rejects if requiredLimits exceed adapter maximums.
const desiredBufferSize = 256 * 1024 * 1024;
const desiredBindingSize = 128 * 1024 * 1024;
let device;
try {
device = await adapter.requestDevice({
requiredLimits: {
maxBufferSize: Math.min(
desiredBufferSize,
adapter.limits.maxBufferSize
),
maxStorageBufferBindingSize: Math.min(
desiredBindingSize,
adapter.limits.maxStorageBufferBindingSize
),
},
});
} catch (err) {
throw new Error(`requestDevice failed: ${err.message}`);
}
// Canary for device loss; also fires on explicit device.destroy()
// — gate error reporting on info.reason !== 'destroyed'
device.lost.then((info) => {
if (info.reason === "destroyed") {
console.log("GPUDevice intentionally destroyed.");
return;
}
console.error(`GPUDevice lost: ${info.message} (reason: ${info.reason})`);
// Report to error tracking service
});
// Development-only buffer registry
const bufferRegistry = new Map(); // label → { buffer, size }
function trackBuffer(label, size, usage) {
if (bufferRegistry.has(label)) {
throw new Error(
`[GPU] Duplicate buffer label "${label}". ` +
`Use unique labels per allocation.`
);
}
const buffer = device.createBuffer({ label, size, usage });
bufferRegistry.set(label, { buffer, size });
console.log(
`[GPU] Allocated "${label}" (${(size / 1024 / 1024).toFixed(1)} MiB). ` +
`Registry total: ${bufferRegistry.size} buffers`
);
return buffer;
}
// Expose for Puppeteer assertions in CI.
// Set window.__DEV__ = true in your development HTML entry point or
// bundler define config. Never set it in production builds.
if (typeof window !== "undefined" && window.__DEV__ === true) {
window.__gpuBufferRegistry = bufferRegistry;
}
return { device, bufferRegistry, trackBuffer };
}
Setting explicit requiredLimits during device creation is a deliberate choice: it surfaces allocation failures early rather than letting the application silently consume all available GPU memory before crashing. The limits are clamped to the adapter's reported maximums so that requestDevice() does not reject on hardware with lower caps than the desired values.
Profiling Session Walkthrough: Finding the Leak
Step 1: Establish a Baseline
Open the SPA in Chrome (version 113 or later) with WebGPU developer features enabled. Open DevTools, switch to the Performance panel with the GPU lane active, and take an initial heap snapshot in the Memory panel. Record the "Video Memory" value from chrome://gpu. Then load a quantized 125M-parameter model (web-llm works well for this purpose). Record the new memory values. The delta between these two readings is your model's GPU footprint.
Step 2: Trigger the Leak
Navigate away from the model page using a client-side route change, then navigate back. Repeat at least three times, or until chrome://gpu video memory exceeds the Step 1 baseline by more than 2x. After each cycle, check chrome://gpu video memory or query the buffer registry size. The signature pattern is a staircase: memory climbs with each load but never returns to baseline after unload. If using the trackBuffer utility from the tooling setup, the registry count will climb monotonically.
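If you record the after-unload readings per cycle, the staircase check can be automated. A sketch with a hypothetical helper -- the readings and tolerance below are illustrative, and chrome://gpu values currently have to be transcribed by hand:

```javascript
// Hypothetical helper: given "video memory after unload" readings (MiB)
// taken once per load/unload cycle, report whether they show the
// staircase leak signature: every reading climbs past the previous one
// by more than a noise tolerance, never returning toward baseline.
function hasStaircasePattern(readings, toleranceMiB = 16) {
  if (readings.length < 3) return false; // need several cycles to judge
  for (let i = 1; i < readings.length; i++) {
    if (readings[i] - readings[i - 1] <= toleranceMiB) return false;
  }
  return true;
}

console.log(hasStaircasePattern([412, 612, 812, 1012])); // leak: true
console.log(hasStaircasePattern([412, 430, 418, 425])); // stable: false
```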
Step 3: Isolate Unreleased Buffers
Use the buffer registry to diff the set of tracked buffers before and after teardown. Flag any entry that survives component unmount as a leak. In the heap snapshot, search for GPUBuffer objects in the class filter and inspect the retainer count; the "Detached" label applies to DOM nodes, not WebGPU objects, so filtering by class name is the correct approach here.
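The before/after diff can be a small utility. A sketch -- findLeakedBuffers is a hypothetical helper over the registry Map from the tooling setup:

```javascript
// Hypothetical helper: snapshot the registry's keys before loading a
// model, then after teardown flag every entry that was not present in
// the snapshot -- those are buffers the teardown failed to release.
function findLeakedBuffers(preLoadKeys, registryAfterTeardown) {
  const leaked = [];
  for (const [label, entry] of registryAfterTeardown) {
    if (!preLoadKeys.has(label)) {
      leaked.push({ label, sizeMiB: entry.size / 1024 / 1024 });
    }
  }
  return leaked;
}

// Example: one 8 MiB weight buffer survived teardown
const before = new Set(); // keys snapshotted before the model loaded
const after = new Map([["s1-model-weight-0", { size: 8 * 1024 * 1024 }]]);
console.log(findLeakedBuffers(before, after));
// → [ { label: 's1-model-weight-0', sizeMiB: 8 } ]
```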
Step 4: Pinpoint the Retention Path
If the delta from Step 1 is negligible, your model may not be loading onto the GPU at all -- verify that the WebGPU backend is active before continuing. Otherwise, select a leaked GPUBuffer in the heap snapshot and open the "Retainers" view. This traces the chain of references back to the GC root. Common culprits include closures captured in web worker message handlers that hold references to the buffer array, uncleared setInterval or requestAnimationFrame callbacks polling inference status, and event listeners on the GPUDevice that the code never removed.
Here is a ModelSession class showing both the leaking and fixed versions. The fixed version uses a per-session unique identifier and a private set of owned registry keys, so that multiple sessions sharing the same registry cannot accidentally delete each other's entries during disposal:
// LEAKING VERSION — no cleanup on dispose
class ModelSessionLeaky {
constructor(device) {
this.device = device;
this.buffers = [];
this.pollInterval = null;
}
async init() {
for (let i = 0; i < 30; i++) {
const buf = this.device.createBuffer({
label: `model-weight-${i}`,
size: 1024 * 1024 * 8,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
this.buffers.push(buf);
}
// Polling callback captures `this`, preventing GC
this.pollInterval = setInterval(() => this.checkStatus(), 1000);
}
checkStatus() {
/* inference polling logic */
}
dispose() {
// Bug: buffers are never destroyed, interval never cleared
this.buffers = [];
}
}
// FIXED VERSION — explicit teardown with per-session isolation
class ModelSession {
constructor(device, registry, sessionId) {
this.device = device;
this.registry = registry; // shared Map across sessions
this.sessionId = sessionId ?? crypto.randomUUID();
this._ownedKeys = new Set(); // track only this session's registry keys
this.buffers = [];
this.pollInterval = null;
this._disposed = false;
}
async init() {
for (let i = 0; i < 30; i++) {
const label = `${this.sessionId}-model-weight-${i}`;
const size = 1024 * 1024 * 8;
const buf = this.device.createBuffer({
label,
size,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
});
this.buffers.push(buf);
this.registry.set(label, { buffer: buf, size });
this._ownedKeys.add(label);
}
this.pollInterval = setInterval(() => this.checkStatus(), 1000);
}
checkStatus() {
/* inference polling logic */
}
dispose() {
// Guard against double-dispose
if (this._disposed) return;
this._disposed = true;
// 1. Destroy every GPU buffer owned by this session
for (const buf of this.buffers) {
buf.destroy();
}
this.buffers = [];
// 2. Remove only this session's entries from the shared registry
for (const key of this._ownedKeys) {
this.registry.delete(key);
}
this._ownedKeys.clear();
// 3. Clear the polling interval
clearInterval(this.pollInterval);
this.pollInterval = null;
console.log(`[GPU] Session ${this.sessionId} disposed. Registry size: ${this.registry.size}`);
}
}
The difference is mechanical but consequential. The fixed version iterates every buffer and calls .destroy(), removes only its own entries from the tracking registry using the private _ownedKeys set (so sessions sharing the same registry cannot interfere with each other), clears the interval that would otherwise keep this (and by extension the buffer references) alive, and guards against double-dispose with a _disposed flag. Each session's labels are prefixed with a unique sessionId, preventing label collisions when multiple sessions coexist.
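In an SPA, the remaining question is who calls dispose(). A framework-agnostic sketch -- onRouteEnter and onRouteLeave are placeholders for your router's lifecycle hooks, and StubSession stands in for ModelSession so the wiring can run outside the browser:

```javascript
// Placeholders: onRouteEnter/onRouteLeave represent your router's
// lifecycle hooks; StubSession stands in for the ModelSession class above.
class StubSession {
  constructor() { this.disposed = false; }
  init() { /* allocate GPU buffers in the real session */ }
  dispose() { this.disposed = true; }
}

let activeSession = null;

function onRouteEnter() {
  activeSession?.dispose(); // defensive: tear down anything left behind
  activeSession = new StubSession();
  activeSession.init();
}

function onRouteLeave() {
  activeSession?.dispose(); // safe to call twice -- dispose() is guarded
  activeSession = null;
}
```

The defensive dispose() at the top of onRouteEnter covers the HMR case described earlier, where a fresh session is created before the previous route's teardown ever ran.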
Automated Leak Detection in CI
Prerequisites
- Node.js 18 or later
- Puppeteer 22 (npm install puppeteer@22) -- API calls below assume this version
- Chrome 113 or later (Puppeteer 22 installs a compatible Chrome for Testing build)
- On headless CI runners without a discrete GPU, WebGPU falls back to a software adapter. Buffer creation and destroy semantics are defined by the WebGPU spec regardless of adapter type, so the leak-detection logic remains valid without hardware acceleration, though absolute memory figures will differ from those on hardware GPUs.
- A local SPA that exposes window.loadModel() and window.disposeModel() from its entry point (these functions must trigger model initialization and the dispose() teardown shown above)
- The SPA must set window.__modelReady = true when model loading completes and clear it when disposal begins, to enable reliable polling in the test script
- The wait-on package for CI server readiness (npm install --save-dev wait-on)
- npm scripts build and serve defined in your package.json
Puppeteer Script for Memory Assertions
Preventing regressions requires automated checking. Puppeteer can launch Chrome with WebGPU developer features, drive the SPA through load/unload cycles, and assert that the buffer registry returns to zero.
The script below uses a polling helper instead of fixed setTimeout delays, making it resilient to variable load times on different CI runners or hardware:
const puppeteer = require("puppeteer");
async function waitForCondition(
page,
conditionFn,
{ timeout = 10000, interval = 200 } = {}
) {
const start = Date.now();
while (Date.now() - start < timeout) {
const result = await page.evaluate(conditionFn);
if (result) return;
await new Promise((r) => setTimeout(r, interval));
}
throw new Error("Condition timed out");
}
(async () => {
const browser = await puppeteer.launch({
headless: "new",
args: ["--enable-webgpu-developer-features"],
});
const page = await browser.newPage();
page.on("pageerror", (err) => {
console.error("Page error:", err);
});
await page.goto("http://localhost:3000");
// Load and dispose the model 3 times
for (let i = 0; i < 3; i++) {
await page.evaluate(() => window.loadModel());
await waitForCondition(page, () => window.__modelReady === true, {
timeout: 15000,
});
await page.evaluate(() => window.disposeModel());
await waitForCondition(
page,
() => window.__gpuBufferRegistry?.size === 0,
{ timeout: 5000 }
);
}
const remaining = await page.evaluate(() => {
const reg = window.__gpuBufferRegistry;
if (reg == null) {
throw new Error(
"__gpuBufferRegistry not exposed — check DEV guard"
);
}
return reg.size;
});
console.log(`Buffers remaining after 3 cycles: ${remaining}`);
if (remaining !== 0) {
console.error(`LEAK DETECTED: ${remaining} buffers not released`);
await browser.close();
process.exit(1);
}
console.log("No GPU memory leaks detected.");
await browser.close();
})();
Integrating with GitHub Actions
A minimal workflow configuration runs this test against every pull request. The workflow includes a job timeout, waits for the dev server to be ready before running tests, and ensures the background server process is killed on both success and failure:
name: GPU Memory Leak Check
on: [pull_request]
jobs:
leak-test:
runs-on: ubuntu-latest
timeout-minutes: 10
steps:
- uses: actions/checkout@v4
- run: npm ci
- run: npm run build
- name: Start dev server
run: npm run serve &
- run: npx wait-on http://localhost:3000 --timeout 30000
- run: npx puppeteer browsers install chrome
- run: node tests/gpu-leak-test.js
- name: Kill server
if: always()
run: kill $(lsof -ti:3000) || true
Testing the Tracking and Disposal Logic
The following tests verify the core buffer tracking and session disposal behavior. They assume the createTrackedDevice function and ModelSession class from the sections above are importable from your source modules.
Unit Tests
// test/trackBuffer.test.js
import { createTrackedDevice } from "../src/gpu-device.js";
import { ModelSession } from "../src/model-session.js";
describe("trackBuffer", () => {
let ctx;
beforeEach(async () => {
ctx = await createTrackedDevice();
});
afterEach(() => {
ctx.device.destroy();
});
test("registers buffer on allocation", () => {
ctx.trackBuffer("test-buf", 1024, GPUBufferUsage.STORAGE);
expect(ctx.bufferRegistry.size).toBe(1);
expect(ctx.bufferRegistry.get("test-buf").size).toBe(1024);
});
test("throws on duplicate label", () => {
ctx.trackBuffer("dup-buf", 1024, GPUBufferUsage.STORAGE);
expect(() =>
ctx.trackBuffer("dup-buf", 1024, GPUBufferUsage.STORAGE)
).toThrow("Duplicate buffer label");
});
test("registry size stays zero after no allocations", () => {
expect(ctx.bufferRegistry.size).toBe(0);
});
});
describe("ModelSession.dispose()", () => {
test("is idempotent — second dispose does not throw", async () => {
const ctx = await createTrackedDevice();
const session = new ModelSession(ctx.device, ctx.bufferRegistry);
await session.init();
session.dispose();
expect(() => session.dispose()).not.toThrow();
ctx.device.destroy();
});
test("registry is empty after dispose", async () => {
const ctx = await createTrackedDevice();
const session = new ModelSession(ctx.device, ctx.bufferRegistry);
await session.init();
session.dispose();
expect(ctx.bufferRegistry.size).toBe(0);
ctx.device.destroy();
});
test("two sessions with shared registry do not interfere", async () => {
const ctx = await createTrackedDevice();
const reg = ctx.bufferRegistry;
const s1 = new ModelSession(ctx.device, reg, "s1");
const s2 = new ModelSession(ctx.device, reg, "s2");
await s1.init();
await s2.init();
s1.dispose();
// s2's entries must still be present
expect(reg.size).toBe(30);
s2.dispose();
expect(reg.size).toBe(0);
ctx.device.destroy();
});
});
Integration Test
// test/integration/leak-cycle.test.js
// Requires a running dev server on :3000 and window.__DEV__ = true in the page.
// Run with: node test/integration/leak-cycle.test.js
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({ headless: "new" });
const page = await browser.newPage();
page.on("pageerror", (err) => {
throw err;
});
await page.goto("http://localhost:3000");
for (let i = 0; i < 3; i++) {
await page.evaluate(() => window.loadModel());
await page.waitForFunction(() => window.__modelReady === true, {
timeout: 15000,
});
await page.evaluate(() => window.disposeModel());
await page.waitForFunction(
() => window.__gpuBufferRegistry?.size === 0,
{ timeout: 5000 }
);
}
const remaining = await page.evaluate(() => {
const reg = window.__gpuBufferRegistry;
if (reg == null) throw new Error("Registry not exposed");
return reg.size;
});
console.assert(remaining === 0, `LEAK: ${remaining} buffers retained`);
await browser.close();
process.exit(remaining === 0 ? 0 : 1);
})();
Sanity Check
You can verify the core registry disposal logic without a browser or GPU by running:
node -e "
const assert = require('assert');
const map = new Map();
const keys = new Set();
for (let i = 0; i < 30; i++) { map.set('s1-weight-'+i, i); keys.add('s1-weight-'+i); }
for (const k of keys) map.delete(k);
assert.strictEqual(map.size, 0, 'Registry not empty after dispose');
console.log('PASS: registry empties correctly');
"
Expected output:
PASS: registry empties correctly
Key Takeaways and Checklist
Runtime Safeguards
Call .destroy() on every GPUBuffer and GPUTexture during component or session teardown. The GC will not do this for you.
Maintain a runtime buffer registry (a simple Map of label to size) in development builds. Expose it on window for automated testing, but guard the assignment behind window.__DEV__ === true so it never ships to production.
Use unique labels per allocation. The trackBuffer utility throws on duplicate labels to prevent silent overwrites that leak the original buffer reference.
Isolate sessions with unique prefixes. When multiple ModelSession instances share a registry, each session should use a unique sessionId prefix and track its own keys in a private set, so that dispose() only removes entries belonging to that session.
Guard against double-dispose. A _disposed flag at the top of the dispose() method prevents redundant cleanup and avoids confusing log output.
Wire device.lost to your error tracking system, but filter on info.reason -- an unexpected device loss with reason "unknown" is almost always an OOM signal, while "destroyed" indicates intentional teardown and should not trigger alerts.
Clear all intervals, listeners, and worker message handlers during teardown. These are the most common retention paths for buffer references.
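One way to make that teardown hard to forget is to funnel every listener and timer through a single scope. A sketch -- TeardownScope is a hypothetical utility built on AbortController, which Node 18+ and all WebGPU-capable browsers support:

```javascript
// Hypothetical utility: one AbortController per session. Every listener
// registered with this scope's signal and every timer created through it
// is torn down by a single dispose() call.
class TeardownScope {
  constructor() {
    this.controller = new AbortController();
    this.timers = new Set();
  }
  get signal() { return this.controller.signal; }
  setInterval(fn, ms) {
    const id = setInterval(fn, ms);
    this.timers.add(id);
    return id;
  }
  dispose() {
    this.controller.abort(); // removes every signal-registered listener
    for (const id of this.timers) clearInterval(id);
    this.timers.clear();
  }
}

const scope = new TeardownScope();
const target = new EventTarget();
let hits = 0;
target.addEventListener("status", () => { hits += 1; }, { signal: scope.signal });
scope.setInterval(() => {}, 1000); // stand-in for inference polling

target.dispatchEvent(new Event("status")); // hits -> 1
scope.dispose();
target.dispatchEvent(new Event("status")); // listener gone; hits stays 1
console.log(`hits after dispose: ${hits}`); // 1
```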
CI and Tooling
Profile with chrome://gpu video memory across multiple navigation cycles. Look for the staircase pattern: memory that rises but never returns to baseline.
Add automated Puppeteer leak-detection tests in CI that assert the buffer registry empties after disposal. Use polling-based waits rather than fixed timeouts to avoid flaky results on slow runners. See the WebGPU specification's resource management section for the underlying destroy semantics, and Chrome's DevTools WebGPU documentation for profiling guidance. The teardown APIs in web-llm and transformers.js provide additional context on cleanup patterns specific to those libraries.
