Handling Large Model Downloads: UX Patterns for Client AI

Client-side AI promises local inference, data privacy, and zero server costs. In practice, it means asking a user to download a model file that can range from 500MB to well over 7GB before anything useful happens. Handling large model downloads with good UX patterns, service workers, and progressive loading is the difference between an application people actually use and one they abandon mid-download.

The Download Problem with Client-Side AI

Frameworks like WebLLM, Transformers.js, and MediaPipe have made it feasible to run models such as Llama, Whisper, and Stable Diffusion variants directly in the browser. The catch is the payload. A Q4_0 quantized Llama 2 7B model weighs in at around 3.8 GB. Even mid-size Whisper variants (such as Whisper small) are approximately 483 MB; the large-v2 variant exceeds 1.5 GB. Stable Diffusion checkpoints can exceed 7 GB.

Traditional fetch or XMLHttpRequest calls break down at this scale in several concrete ways. Some browsers and network proxies enforce timeouts on long-running requests, though the Fetch API specification itself imposes none. A user closing a tab or their device going to sleep kills the download entirely, with no way to resume. Buffering multi-gigabyte responses in memory before writing them to storage creates severe memory pressure on mobile devices especially, unless the Streams API pipes data directly to storage. There is no built-in resume capability. A failed download at 90% means starting over from zero.

This tutorial covers three layers that address these problems directly: background downloading that survives tab closures, caching and persistence that prevent re-downloads, and progressive UI feedback that keeps users informed and in control.

Prerequisites

  • Secure context required: Service workers require HTTPS in production or localhost for local development. The browser will refuse to register a service worker on plain HTTP origins.
  • Browser support: Cache API and Service Workers are supported in all modern browsers. Background Fetch API is limited to Chromium-based browsers (see the Background Fetch section below). BroadcastChannel, navigator.storage.estimate(), and navigator.storage.persist() have broad but not universal support; check MDN compatibility tables for your target browsers.
  • Server requirements: The server hosting model files must return Accept-Ranges: bytes in its response headers and serve HTTP 206 responses to Range requests for chunked/resumable downloads to work. It must also return a Content-Length header. Verify with:
curl -I --header 'Range: bytes=0-1' <model-url>

Confirm the response is HTTP/1.1 206 Partial Content and includes Accept-Ranges: bytes. If you are loading models from a different origin, the server must include appropriate CORS headers (Access-Control-Allow-Origin, and Access-Control-Allow-Headers must permit the Range header).
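The same check can run from JavaScript at startup, before offering the download at all. A sketch, assuming the model URL is readable under CORS from your origin; the decision logic is split out from the fetch so it can be tested offline:

```javascript
// Runtime equivalent of the curl check: issue a 1-byte Range request and
// inspect the status and headers before offering a chunked download.
async function supportsRangeRequests(modelUrl) {
  const resp = await fetch(modelUrl, { headers: { Range: 'bytes=0-1' } });
  return interpretRangeProbe(resp.status, resp.headers.get('Accept-Ranges'));
}

// Pure decision logic, separated so it can be tested without a network.
// Only 206 plus an explicit Accept-Ranges: bytes means resume will work.
function interpretRangeProbe(status, acceptRanges) {
  return status === 206 && acceptRanges === 'bytes';
}
```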

Using Service Workers to Manage Model Downloads

Registering a Service Worker for Model Caching

A service worker acts as a programmable network proxy that runs independently of the page lifecycle (it requires a secure context: HTTPS in production or localhost during development). Because it persists beyond a single page session, it is the right place to manage downloads that outlast any given tab. The Cache API, accessible from within a service worker, stores request/response pairs and can hold multi-gigabyte binary responses without a per-object size limit; the only constraint is the origin's storage quota. IndexedDB can also store Blobs natively, but its transactional API adds complexity, and the Cache API's request/response model maps more directly to network-fetched binary files. localStorage is synchronous and typically capped at 5-10 MB depending on the browser, making it unsuitable here.

// main.js
if ('serviceWorker' in navigator) {
  navigator.serviceWorker.register('/model-sw.js')
    .then(reg => console.log('SW registered, scope:', reg.scope))
    .catch(err => console.error('SW registration failed:', err));
}

// model-sw.js
const MODEL_CACHE = 'ai-models-v1';

self.addEventListener('fetch', event => {
  const url = new URL(event.request.url);
  if (!url.pathname.startsWith('/models/')) return;

  event.respondWith(
    caches.open(MODEL_CACHE).then(cache =>
      cache.match(event.request).then(cached => {
        if (cached) return cached;
        return fetch(event.request).then(response => {
          if (response.ok) {
            // Catch cache.put failures (e.g., QuotaExceededError) separately
            // so the page still receives the response even if caching fails.
            cache.put(event.request, response.clone())
              .catch(putErr => console.warn('Caching model failed:', putErr));
          }
          return response;
        }).catch(fetchErr => {
          console.error('Fetch failed for model request:', fetchErr);
          return new Response('Network error', { status: 503, statusText: 'Service Unavailable' });
        });
      })
    )
  );
});

This intercepts any request to a /models/ path, checks the Cache API first, and falls back to a network fetch that populates the cache on success. If cache.put fails (for example, due to a quota exceeded error), the error is caught so the page still receives the fetched response rather than a raw network error. For production use, verify model file integrity against a known SHA-256 hash before writing to cache, since a corrupted file from a CDN or network issue will otherwise be cached and served indefinitely. For small model files, this alone may suffice. For multi-gigabyte downloads, it is not enough.
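That integrity check can be a small helper around crypto.subtle.digest, run before committing the response to the cache. A sketch; the expected hash shipping alongside your app (for example, in a model manifest file) is an assumption, not something the frameworks above provide:

```javascript
// Compare the SHA-256 of a downloaded model against a known-good hash
// before caching it. `buffer` is an ArrayBuffer, `expectedHex` is the
// hash shipped with your app (assumed to exist, e.g. in a manifest).
async function verifySha256(buffer, expectedHex) {
  const digest = await crypto.subtle.digest('SHA-256', buffer);
  return toHex(digest) === expectedHex.toLowerCase();
}

// Render an ArrayBuffer as a lowercase hex string.
function toHex(buffer) {
  return [...new Uint8Array(buffer)]
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}
```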

Chunked Download with Resume Support

HTTP Range headers allow requesting specific byte ranges of a file. By downloading in chunks and tracking progress, a service worker can resume an interrupted download from the last successfully received byte rather than starting over.

Important: The server hosting the model file must support Range requests and return HTTP 206 Partial Content responses. If the server ignores the Range header and returns 200 with the full body, the resume logic will corrupt the assembled file. The code below explicitly validates both Accept-Ranges support and 206 status on every chunk. Verify server support as described in the Prerequisites section above.

// model-sw.js
const DEFAULT_CHUNK_SIZE = 10 * 1024 * 1024; // 10 MB
const MODEL_DOWNLOAD_CHANNEL = 'model-download';

async function chunkedModelFetch(url, cacheName, chunkSize = DEFAULT_CHUNK_SIZE) {
  const cache = await caches.open(cacheName);

  // Normalize the URL to an absolute string so cache.match and cache.put
  // always use the same key, regardless of whether the caller passes a
  // relative string, an absolute string, or a Request object. (A cached
  // request's .url is always absolute, so a relative input would otherwise
  // never match the chunk keys derived from it during reassembly.)
  const urlString = url instanceof Request ? url.url : new URL(url, self.location.href).href;

  // Use a stored metadata key for resume offset to avoid loading
  // the entire partial download into memory just to measure its size.
  const OFFSET_KEY = urlString + '__offset';
  const offsetEntry = await cache.match(OFFSET_KEY);
  let offset = offsetEntry ? parseInt(await offsetEntry.text(), 10) : 0;
  if (!Number.isFinite(offset) || offset < 0) offset = 0;

  const headResp = await fetch(urlString, { method: 'HEAD' });
  if (!headResp.ok) {
    throw new Error(`HEAD request failed: ${headResp.status} ${headResp.statusText}`);
  }

  // Verify the server supports Range requests before attempting chunked download.
  const acceptRanges = headResp.headers.get('Accept-Ranges');
  if (acceptRanges !== 'bytes') {
    throw new Error(
      'Server does not support Range requests (Accept-Ranges: bytes missing). ' +
      'Resume is not possible.'
    );
  }

  const totalSize = parseInt(headResp.headers.get('Content-Length'), 10);
  if (!Number.isFinite(totalSize) || totalSize <= 0) {
    throw new Error(
      'Server did not return a valid Content-Length. Range-based resumption requires a known file size.'
    );
  }

  // If the stored offset indicates the download was already complete, return it.
  if (offset >= totalSize) {
    return cache.match(urlString);
  }

  // Open the BroadcastChannel once before the loop; close it in the finally block.
  const channel = new BroadcastChannel(MODEL_DOWNLOAD_CHANNEL);

  try {
    while (offset < totalSize) {
      const end = Math.min(offset + chunkSize - 1, totalSize - 1);

      // Add a per-chunk timeout to prevent a stalled fetch from hanging
      // the service worker indefinitely.
      const controller = new AbortController();
      const timeoutId = setTimeout(() => controller.abort(), 60_000);

      let resp;
      try {
        resp = await fetch(urlString, {
          headers: { Range: `bytes=${offset}-${end}` },
          signal: controller.signal,
        });
      } finally {
        clearTimeout(timeoutId);
      }

      // Only 206 is acceptable. A 200 means the server ignored the Range
      // header and returned the full file, which would corrupt the assembly.
      if (resp.status !== 206) {
        throw new Error(
          `Expected 206 Partial Content but got ${resp.status}. ` +
          'Server may not support Range requests for this resource.'
        );
      }

      const chunk = await resp.blob();

      // Advance offset by actual bytes received, not by the requested range end.
      // A server may return fewer bytes than requested (short read).
      offset += chunk.size;

      // Store each chunk individually keyed by byte offset, rather than
      // reassembling all chunks into a single Blob on every iteration.
      // This avoids O(n²) memory pressure that would OOM-crash mobile
      // devices on large models. Note the stored status is 200: the Cache
      // API rejects put() calls with 206 Partial Content responses, so the
      // partial nature is recorded in custom headers instead.
      await cache.put(
        `${urlString}__chunk__${offset}`,
        new Response(chunk, {
          status: 200,
          headers: {
            'Content-Type': 'application/octet-stream',
            'X-Partial': 'true',
            'X-Chunk-End': String(offset),
          },
        })
      );

      // Persist current offset for resume across service worker restarts.
      await cache.put(OFFSET_KEY, new Response(String(offset)));

      channel.postMessage({
        type: 'progress',
        downloaded: offset,
        total: totalSize,
        percent: Math.round((offset / totalSize) * 100),
      });
    }

    // All chunks received. Reassemble into a single cached response once.
    // This is the only point where the full file is assembled in memory.
    const chunkEntries = [];
    const cacheKeys = await cache.keys();
    const chunkPrefix = `${urlString}__chunk__`;
    for (const request of cacheKeys) {
      if (request.url.startsWith(chunkPrefix)) {
        chunkEntries.push(request);
      }
    }

    // Sort chunk keys by their byte-offset suffix to ensure correct order.
    chunkEntries.sort((a, b) => {
      const aOffset = parseInt(a.url.slice(chunkPrefix.length), 10);
      const bOffset = parseInt(b.url.slice(chunkPrefix.length), 10);
      return aOffset - bOffset;
    });

    const allChunks = [];
    for (const key of chunkEntries) {
      const chunkResp = await cache.match(key);
      allChunks.push(await chunkResp.blob());
    }

    const finalBlob = new Blob(allChunks, { type: 'application/octet-stream' });

    // Optional: verify integrity before committing to cache.
    // const hashBuffer = await crypto.subtle.digest('SHA-256', await finalBlob.arrayBuffer());
    // const hashHex = [...new Uint8Array(hashBuffer)].map(b => b.toString(16).padStart(2, '0')).join('');
    // if (hashHex !== expectedSha256) throw new Error('Integrity check failed');

    await cache.put(
      urlString,
      new Response(finalBlob, {
        status: 200,
        headers: {
          'Content-Type': 'application/octet-stream',
          'Content-Length': String(finalBlob.size),
        },
      })
    );

    // Clean up chunk entries and the offset key.
    for (const key of chunkEntries) {
      await cache.delete(key);
    }
    await cache.delete(OFFSET_KEY);

    channel.postMessage({ type: 'complete', downloaded: totalSize, total: totalSize });
    return cache.match(urlString);
  } finally {
    channel.close();
  }
}

This function issues a HEAD request to determine total file size and verify Accept-Ranges support, checks for a stored resume offset in the cache, and resumes from the last successfully received byte. Each chunk is stored individually by its byte offset rather than reassembling the entire file on every iteration, which avoids the O(n²) memory pressure that would crash mobile devices on large models. The final assembly into a single Blob happens only once when all chunks arrive. The offset advances by the actual number of bytes received (chunk.size), not by the requested range end, so a short read from the server does not introduce a gap. Each chunk fetch has a 60-second timeout via AbortController to prevent a stalled request from hanging the service worker indefinitely. The BroadcastChannel is always closed in the finally block, regardless of how the function exits.

The 10 MB default chunk size balances memory overhead against request frequency. On mobile, 2-5 MB reduces peak memory usage. On desktop with fast connections, 25-50 MB reduces round trips. Profile on your target devices.
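One way to pick between those values at runtime is the Network Information API's effectiveType. That API is itself Chromium-only, so this is a heuristic with a safe fallback rather than a reliable signal, and the specific sizes are starting points, not tuned values:

```javascript
// Map the coarse effectiveType signal to a chunk size. These numbers are
// starting heuristics; profile on your own target devices.
function chunkSizeFor(effectiveType) {
  switch (effectiveType) {
    case 'slow-2g':
    case '2g':
    case '3g':
      return 2 * 1024 * 1024;   // constrained networks: low memory, cheap retries
    default:
      return 10 * 1024 * 1024;  // '4g' or no signal: the default used above
  }
}

// navigator.connection is Chromium-only, so fall through to the default
// when it is unavailable (Firefox, Safari, non-browser environments).
function adaptiveChunkSize() {
  const conn = typeof navigator !== 'undefined' ? navigator.connection : undefined;
  return chunkSizeFor(conn && conn.effectiveType);
}
```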

The Background Fetch API for Long-Running Downloads

Why Background Fetch Exists

The chunked approach above still has a fundamental limitation: if the browser terminates the service worker (which it can do at any time to conserve resources) or the user navigates away, the download stops. The Background Fetch API delegates the actual download to the browser itself, which manages it at the OS level. The download survives tab closures, device sleep, and service worker termination.

Browser support as of mid-2025 is limited to Chromium-based browsers: Chrome, Edge, and Opera. Samsung Internet's support status should be verified against the MDN Background Fetch API compatibility table before targeting it. Firefox and Safari do not support it. This makes a fallback strategy non-optional.

Implementing a Background Fetch for a Model File

From the client page, initiate a background fetch through the service worker registration:

// main.js
async function startModelDownload(modelUrl, modelName, totalBytes) {
  const reg = await navigator.serviceWorker.ready;

  if (!reg.backgroundFetch) {
    // Fallback to chunked SW fetch.
    // reg.active may be null if no SW has claimed the page yet (e.g., first load).
    const sw = reg.active || reg.installing || reg.waiting;
    if (!sw) {
      console.error('No active service worker available for chunked download.');
      return;
    }
    sw.postMessage({ type: 'chunked-download', url: modelUrl });
    return;
  }

  const bgFetch = await reg.backgroundFetch.fetch(`model-${modelName}`, [modelUrl], {
    title: `Downloading ${modelName}`,
    icons: [{ src: '/icons/model-download.png', sizes: '192x192', type: 'image/png' }],
    downloadTotal: totalBytes,
  });

  const onProgress = () => {
    const percent = Math.round((bgFetch.downloaded / bgFetch.downloadTotal) * 100);
    updateProgressUI(percent, bgFetch.downloaded, bgFetch.downloadTotal);
  };

  bgFetch.addEventListener('progress', onProgress);

  // Remove the progress listener when the fetch settles to avoid
  // accumulating listeners if startModelDownload is called multiple times.
  bgFetch.addEventListener(
    'success',
    () => bgFetch.removeEventListener('progress', onProgress),
    { once: true }
  );
  bgFetch.addEventListener(
    'failure',
    () => bgFetch.removeEventListener('progress', onProgress),
    { once: true }
  );
}

Note: Pass downloadTotal only when you have a verified Content-Length from a prior HEAD request. If the value is inaccurate, the browser's progress UI will be wrong or the fetch may be rejected. Omit the property if the total size is unknown.
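A small helper can fetch and validate that Content-Length before calling startModelDownload, returning undefined when the value cannot be trusted. The validation is split into a pure function so it can be tested without a network; the helper itself is a sketch around the same HEAD request used earlier:

```javascript
// Resolve a verified total size for downloadTotal, or undefined if the
// server does not report one. Callers should omit downloadTotal when
// this returns undefined.
async function fetchContentLength(modelUrl) {
  const resp = await fetch(modelUrl, { method: 'HEAD' });
  if (!resp.ok) return undefined;
  return parseContentLength(resp.headers.get('Content-Length'));
}

// Pure parsing/validation, kept separate for testability. Rejects missing,
// non-numeric, zero, and negative values.
function parseContentLength(headerValue) {
  const n = parseInt(headerValue, 10);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}
```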

Inside the service worker, handle completion and failure:

// model-sw.js
self.addEventListener('backgroundfetchsuccess', event => {
  const bgFetch = event.registration;

  // waitUntil must be called synchronously during event dispatch, so all
  // async work lives inside the promise it receives. Awaiting before the
  // waitUntil call risks an InvalidStateError once the event has settled.
  event.waitUntil((async () => {
    const cache = await caches.open(MODEL_CACHE);
    const records = await bgFetch.matchAll();

    if (!records.length) {
      throw new Error(
        `backgroundfetchsuccess fired for ${bgFetch.id} but matchAll() returned no records.`
      );
    }

    for (const record of records) {
      const response = await record.responseReady;
      await cache.put(record.request, response);
    }

    const channel = new BroadcastChannel(MODEL_DOWNLOAD_CHANNEL);
    channel.postMessage({ type: 'complete', id: bgFetch.id });
    channel.close();
  })());
});

self.addEventListener('backgroundfetchfailure', event => {
  const channel = new BroadcastChannel(MODEL_DOWNLOAD_CHANNEL);
  channel.postMessage({
    type: 'failed',
    id: event.registration.id,
    reason: event.registration.failureReason ?? 'unknown',
  });
  channel.close();
});

The failure handler surfaces failureReason from the registration object, providing actionable information (network error, quota exceeded, user cancellation) rather than an opaque failure.

Graceful Fallback When Background Fetch Is Unavailable

The feature detection is straightforward. The code above already demonstrates it, but here it is isolated for clarity:

const reg = await navigator.serviceWorker.ready;
if ('backgroundFetch' in reg) {
  // Use Background Fetch API
} else {
  // Fall back to chunked service worker download
}

Always implement both paths. Shipping a Background Fetch-only solution means excluding Firefox and Safari users entirely.

Progressive Loading UI Patterns

Communicating Download State to the User

Users need to see exactly three states at all times: not yet downloaded, downloading with real progress, and ready to use. Anything less and they will assume the app is broken.

BroadcastChannel provides a clean communication pipe between the service worker and any open client pages. The chunked download function above already sends progress messages over the channel. On the client side, listen for them:

// main.js
// The channel name must match MODEL_DOWNLOAD_CHANNEL in model-sw.js;
// that constant is scoped to the worker, so the string is repeated here.
const channel = new BroadcastChannel('model-download');
channel.addEventListener('message', event => {
  const { type, downloaded, total, percent } = event.data;
  if (type === 'progress') {
    document.querySelector('.progress-bar').style.width = `${percent}%`;
    document.querySelector('.progress-text').textContent =
      `${(downloaded / 1e6).toFixed(0)} MB / ${(total / 1e6).toFixed(0)} MB — ${percent}%`;
  }
});
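The listener above handles only progress messages, but the service worker also posts complete and failed on the same channel. One way to keep that dispatch testable is to reduce each message to a plain action object and apply it to the DOM separately; the action names here are illustrative, not part of any API:

```javascript
// Reduce a channel message to a UI action. Pure function with no DOM
// access, so the dispatch logic can be unit-tested on its own.
function handleDownloadMessage(msg) {
  switch (msg.type) {
    case 'progress':
      return { action: 'update-progress', percent: msg.percent };
    case 'complete':
      return { action: 'show-ready' };            // swap skeleton for live UI
    case 'failed':
      return { action: 'show-retry', reason: msg.reason };
    default:
      return { action: 'ignore' };
  }
}

// Wiring (in the page), reusing the channel created above:
// channel.addEventListener('message', e => applyAction(handleDownloadMessage(e.data)));
```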

Designing the Progress Indicator

An effective progress indicator for model downloads uses a multi-stage progress bar showing: download percentage, bytes downloaded out of total (e.g., "1,247 MB / 3,800 MB"), estimated time remaining based on rolling average throughput, and a clearly labeled cancel button. Always show total size upfront, before the user commits to the download. Showing only a percentage without byte counts leaves users guessing whether they are downloading 50MB or 5GB.

┌──────────────────────────────────────────────────┐
│  Downloading Llama-2-7B-Q4                       │
│  ████████████████░░░░░░░░░░  62%                 │
│  2,356 MB / 3,800 MB  —  ~4 min remaining       │
│                                      [Cancel]    │
└──────────────────────────────────────────────────┘
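The "~4 min remaining" figure comes from the rolling-average throughput mentioned above. A minimal estimator keeps the last few (bytes, timestamp) progress samples and divides the remaining bytes by the average rate over that window; this sketch is framework-free and assumes only the progress messages already shown:

```javascript
// Rolling-average ETA estimator. Feed it (downloadedBytes, timeMs) on each
// progress message; ask it for seconds remaining when rendering.
function createEtaEstimator(windowSize = 5) {
  const samples = []; // { bytes, timeMs }
  return {
    addSample(downloadedBytes, timeMs) {
      samples.push({ bytes: downloadedBytes, timeMs });
      if (samples.length > windowSize) samples.shift(); // keep a sliding window
    },
    secondsRemaining(totalBytes) {
      if (samples.length < 2) return null; // not enough data yet
      const first = samples[0];
      const last = samples[samples.length - 1];
      const elapsedSec = (last.timeMs - first.timeMs) / 1000;
      const bytesPerSec = (last.bytes - first.bytes) / elapsedSec;
      if (!(bytesPerSec > 0)) return null; // stalled or clock went backwards
      return (totalBytes - last.bytes) / bytesPerSec;
    },
  };
}
```

Call addSample from the 'progress' handler (using performance.now() for timeMs) and format secondsRemaining for display, treating null as "estimating...".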

Once complete, replace the progress bar with a confirmation state that includes the model name, size on disk, and an option to delete the cached model. Users need a way to manage storage, especially on devices with limited space.

┌──────────────────────────────────────────────────┐
│  ✓ Llama-2-7B-Q4 ready                          │
│  3,800 MB stored  —  Downloaded today            │
│                                  [Delete model]  │
└──────────────────────────────────────────────────┘
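The Delete model control above maps to a handful of Cache API calls. A sketch, assuming the cache name and the __chunk__/__offset key scheme from the chunked-download code earlier; the predicate is split out so the key matching can be tested on its own:

```javascript
// Remove a cached model plus any bookkeeping keys left by an interrupted
// chunked download, then log the refreshed storage estimate.
async function deleteModel(modelUrl) {
  const cache = await caches.open('ai-models-v1'); // same name as MODEL_CACHE in the SW
  const keys = await cache.keys();
  const doomed = keys.filter(req => isModelKey(modelUrl, req.url));
  await Promise.all(doomed.map(req => cache.delete(req)));
  if (navigator.storage && navigator.storage.estimate) {
    const { usage, quota } = await navigator.storage.estimate();
    console.log(`Storage after delete: ${usage} of ${quota} bytes used`);
  }
}

// Pure predicate: a key belongs to this model if it is the model URL itself
// or one of the __chunk__/__offset bookkeeping keys derived from it.
function isModelKey(modelUrl, keyUrl) {
  return keyUrl === modelUrl ||
    keyUrl.startsWith(modelUrl + '__chunk__') ||
    keyUrl === modelUrl + '__offset';
}
```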

Skeleton UI and Partial Functionality While Downloading

Let users explore every non-AI feature of the application while the model loads in the background. This is the most common pattern in shipping apps like Notion AI and GitHub Copilot Chat, where the AI component loads independently of the core interface. AI-dependent components render in a skeleton or disabled state with a contextual badge: "Downloading model: 43% complete." This transforms a blocking wait into a background process.

The layout during download keeps all navigation, settings, and non-AI features fully interactive, while the AI chat panel shows a skeleton state with the progress badge. After download completes, the skeleton is replaced with a fully active AI interface.

┌─────────────────────┬────────────────────────────┐
│  [Nav] [Settings]   │   AI Chat                  │
│                     │  ┌──────────────────────┐  │
│  Non-AI features    │  │  ░░░░░░░░░░░░░░░░░░  │  │
│  fully interactive  │  │  Downloading: 43%    │  │
│                     │  │  ░░░░░░░░░░░░░░░░░░  │  │
│                     │  └──────────────────────┘  │
└─────────────────────┴────────────────────────────┘

Handling Multiple or Quantized Model Variants

Offering quantized model variants is not just a nice-to-have; it directly affects whether users on slower connections or constrained devices will use the application at all. Present the choice explicitly with concrete numbers: "Standard (3.8 GB, higher quality) vs. Lite (1.2 GB, faster download, slightly lower accuracy)." Display estimated download time based on a brief bandwidth probe. Let users upgrade to the larger model later without losing the smaller one.
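The bandwidth probe can be as simple as timing a small Range request against the model host. A sketch; the 1 MB probe size is an assumption, and a short probe only roughly predicts multi-gigabyte throughput:

```javascript
const PROBE_BYTES = 1024 * 1024; // 1 MB probe: enough to smooth over TCP slow start, somewhat

// Time a small Range request and return observed bytes per second.
async function probeBandwidth(url) {
  const start = performance.now();
  const resp = await fetch(url, { headers: { Range: `bytes=0-${PROBE_BYTES - 1}` } });
  const blob = await resp.blob();
  const seconds = (performance.now() - start) / 1000;
  return blob.size / seconds;
}

// Pure helper for the variant picker: rough whole-second download estimate,
// or null when no usable throughput measurement exists.
function estimateDownloadSeconds(modelBytes, bytesPerSecond) {
  if (!(bytesPerSecond > 0)) return null;
  return Math.round(modelBytes / bytesPerSecond);
}
```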

Storage Management and Cache Eviction

Before initiating any multi-gigabyte download, check whether the device has room:

async function checkStorageAndDownload(modelUrl, modelName, requiredBytes) {
  if (!Number.isFinite(requiredBytes) || requiredBytes <= 0) {
    throw new RangeError(`requiredBytes must be a positive finite number; got ${requiredBytes}`);
  }

  if (navigator.storage && navigator.storage.estimate) {
    const { usage, quota } = await navigator.storage.estimate();
    const available = quota - usage;

    if (available < requiredBytes * 1.1) { // 10% buffer for filesystem metadata and Cache API overhead
      showStorageWarning(available, requiredBytes);
      return;
    }
  }

  if (navigator.storage && navigator.storage.persist) {
    const persisted = await navigator.storage.persist();
    if (!persisted) {
      console.warn('Storage persistence denied — model may be evicted');
      // Inform the user so they are not surprised by a re-download later.
      showPersistenceWarning(
        'Your browser may automatically clear the downloaded model to free space. You may need to re-download it.'
      );
    }
  }

  // Await so errors from the download propagate to the caller.
  await startModelDownload(modelUrl, modelName, requiredBytes);
}

navigator.storage.estimate() returns browser-estimated values; quota reflects the browser's self-imposed limit, which can fall well below actual disk capacity. Treat quota - usage as an upper bound on what the browser will permit, not a guarantee of available storage. The 10% buffer accounts for filesystem metadata, Cache API response wrapper overhead, and partial-chunk tail data. navigator.storage.persist() requests that the browser treat this origin's storage as durable, preventing automatic eviction under storage pressure. Without persistent storage, the browser can silently delete a cached 4GB model to free space, forcing a full re-download. The requiredBytes parameter is validated before use to guard against invalid external input.

If storage is insufficient, either warn the user with specific numbers ("Need 3.8 GB, only 2.1 GB available") or automatically suggest the smaller quantized variant.
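Building that warning string is straightforward. This hypothetical helper formats the concrete numbers and appends the Lite suggestion when the smaller variant (plus the same 10% buffer used earlier) would fit; displaying it is left to your own UI code:

```javascript
// Format a storage warning with concrete numbers, optionally suggesting
// a smaller quantized variant that would fit. All sizes in bytes.
function storageWarningMessage(requiredBytes, availableBytes, liteBytes) {
  const gb = b => (b / 1e9).toFixed(1);
  let msg = `Need ${gb(requiredBytes)} GB, only ${gb(availableBytes)} GB available.`;
  // Apply the same 10% buffer used in checkStorageAndDownload.
  if (liteBytes && liteBytes * 1.1 <= availableBytes) {
    msg += ` The Lite model (${gb(liteBytes)} GB) will fit.`;
  }
  return msg;
}
```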

Summary

  • Use a service worker for model downloads. Raw fetch calls for multi-GB files cannot resume and die on tab close.
  • The Background Fetch API is the most robust option for surviving tab closures and device sleep. Feature-detect it and always implement a chunked fallback.
  • Show real byte progress (downloaded/total), not indeterminate spinners. Users need to know whether they are waiting 30 seconds or 30 minutes.
  • Call navigator.storage.persist() in response to a user gesture (such as the user clicking "Download") to maximize the likelihood of the browser granting persistence. Without persistence, the browser can evict your cached model at any time.
  • Offer quantized model variants. Not every user has a fast connection with unlimited storage. Show the size and quality trade-off explicitly and let users choose.