Ollama for JavaScript Developers: Building AI Apps Without API Keys

5 min read

How to Use Ollama with JavaScript and Node.js

  1. Install Ollama from ollama.com and pull a model such as llama3.2:3b.
  2. Verify the Ollama server is running by hitting http://localhost:11434/api/tags.
  3. Scaffold a Node.js 18+ project with npm init -y — no additional dependencies needed.
  4. Send a non-streaming POST request to /api/chat using native fetch.
  5. Enable streaming by setting stream: true and reading the response with a ReadableStream reader.
  6. Accumulate conversation history in a messages array for multi-turn chat.
  7. Extend the pattern into a VS Code extension using the same fetch-based streaming approach.
  8. Optimize performance by choosing the right model size and tuning num_ctx and keep_alive.

Why Run LLMs Locally with Ollama?

Every JavaScript developer who has integrated an LLM into a project knows the friction: sign up for an API key, attach a credit card, worry about rate limits, watch usage costs climb, and accept that every prompt and response transits through a third-party server. For prototyping, internal tools, or privacy-sensitive applications, that overhead is hard to justify. Ollama for JavaScript developers changes the equation entirely. It provides one-command local model hosting with a REST API that any HTTP client can talk to, and that includes the native fetch already shipping in modern Node.js.

Ollama supports models at several parameter counts and specializations: Llama 3.2 in 1B and 3B parameter sizes, Mistral 7B, Phi-3 Mini, Gemma, and Code Llama for code-centric tasks. Just as important, the Node.js ecosystem already has mature HTTP and streaming primitives, so connecting to a local Ollama instance requires zero additional AI SDK dependencies. No OpenAI client library, no LangChain, no cloud provider SDK. Just HTTP requests to localhost.

Prerequisites and Setup

Installing Ollama

Head to ollama.com and grab the installer for your platform. On macOS, it ships as a standard .dmg application. On Linux, a single curl command handles it. Windows support is available as a preview installer.

Once installed, pull a model. Llama 3.2 at the 3B parameter size is a good starting point because it handles general chat and code tasks while fitting comfortably in 8 GB of RAM:

ollama pull llama3.2:3b

After the pull completes, verify the server is running:

curl http://localhost:11434/api/tags

That should return a JSON object listing llama3.2:3b among your local models. Ollama runs its HTTP server on port 11434 by default and starts automatically on macOS. On Windows (preview), auto-start depends on the installer version; if curl http://localhost:11434/api/tags fails, launch Ollama manually from the Start menu. On Linux, you may need to run ollama serve in a separate terminal.

Project Scaffolding

You need Node.js 18 or later because that is the version where the native fetch API became stable and globally available. Verify your version:

node --version

You should see v18.x.x or higher. If you are stuck on an older version, node-fetch v2 (npm install node-fetch@2) works as a CommonJS drop-in; node-fetch v3 is ESM-only, matching the .mjs modules used in this article. The code below assumes native fetch.

mkdir ollama-chat && cd ollama-chat
npm init -y

That is the entire dependency setup for the Node.js chat application. No packages to install. The VS Code extension section requires additional tooling.

Understanding the Ollama REST API

Key Endpoints

Ollama exposes a small, focused set of HTTP endpoints.

POST /api/generate accepts a single prompt string and returns a completion, which makes it the right choice for one-shot tasks like summarization or single-question answers. For multi-turn conversation, use POST /api/chat, which accepts a messages array with role-based entries (system, user, assistant). This is the endpoint used throughout the rest of this article.

You can list all locally available models with GET /api/tags. Finally, POST /api/embed generates vector embeddings for a given input, relevant for retrieval-augmented generation (RAG) workflows but outside the scope of this tutorial. (Note: older Ollama versions used /api/embeddings, which is deprecated.)
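Although embeddings are outside this tutorial's scope, a minimal /api/embed call follows the same fetch pattern as everything else in this article. The embed helper below is a sketch; its name and the injectable fetchImpl parameter (handy for testing with a stub) are this article's conventions, not part of Ollama itself:

```javascript
// Minimal sketch of an /api/embed call. The injectable fetchImpl parameter
// is a testing convenience; by default it uses the global fetch.
async function embed(model, input, { fetchImpl = fetch } = {}) {
  const response = await fetchImpl("http://localhost:11434/api/embed", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input })
  });
  if (!response.ok) {
    throw new Error(`Ollama error ${response.status}`);
  }
  // /api/embed responds with { embeddings: [...] } -- one vector per input.
  const data = await response.json();
  return data.embeddings;
}
```

Against a running Ollama instance, `await embed("llama3.2:3b", "hello world")` returns an array containing one embedding vector.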

Request and Response Shape

The /api/chat endpoint expects a JSON body containing model, messages, and optionally stream and options. The options object lets you tune temperature, top_p, num_ctx (context window size), and other inference parameters. When stream is false, the response is a single JSON object. When stream is true (the default), the response is newline-delimited JSON, where each line is a self-contained JSON object carrying a fragment of the generated text in message.content.

Here is a raw curl call so the shape is clear before writing any JavaScript:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2:3b",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Explain closures in JavaScript in two sentences." }
  ],
  "stream": false
}'

The response comes back as:

{
  "model": "llama3.2:3b",
  "message": {
    "role": "assistant",
    "content": "A closure is a function that retains access to variables from its enclosing lexical scope, even after the outer function has finished executing. This allows the inner function to 'remember' and manipulate those variables, which is fundamental to patterns like data privacy, callbacks, and factory functions in JavaScript."
  },
  "done": true,
  "total_duration": 1283000000
}

Note total_duration is in nanoseconds (1,283,000,000 ns ≈ 1.28 s). The done: true field signals the response is complete, which matters more in streaming mode where you need to detect the final chunk.
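To make the unit conversion concrete (the helper name is ours, not part of the API):

```javascript
// Ollama reports durations in nanoseconds; divide by 1e9 for seconds.
function durationSeconds(totalDurationNs) {
  return totalDurationNs / 1e9;
}
```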

Building a Node.js Chat App

Non-Streaming Request

Create a file called chat.mjs. The .mjs extension ensures Node.js treats it as an ES module, giving access to top-level await:

// chat.mjs
const MODEL = "llama3.2:3b";
const TIMEOUT_MS = 60_000;

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);

try {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      messages: [
        { role: "system", content: "You are a helpful coding assistant." },
        { role: "user", content: "Write a debounce function in JavaScript." }
      ],
      stream: false
    }),
    signal: controller.signal
  });

  if (!response.ok) {
    const err = await response.text();
    throw new Error(`Ollama error ${response.status}: ${err}`);
  }

  const data = await response.json();
  console.log(data.message.content);
} finally {
  clearTimeout(timeoutId);
}

Run it with node chat.mjs. The response arrives as a single block after the model finishes generating. For short prompts this works fine, but for generations over roughly 200 tokens, the wait before any output appears exceeds a few seconds and starts to feel unresponsive.

Adding Streaming Responses

Streaming changes the perceived latency dramatically. Instead of waiting for the entire generation to complete, tokens appear on screen as the model produces them. Switch stream to true and read the response body as a ReadableStream:

// chat-stream.mjs
const MODEL = "llama3.2:3b";
const TIMEOUT_MS = 60_000;

const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);

try {
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: MODEL,
      messages: [
        { role: "system", content: "You are a helpful coding assistant." },
        { role: "user", content: "Explain the event loop in Node.js." }
      ],
      stream: true
    }),
    signal: controller.signal
  });

  if (!response.ok) {
    const err = await response.text();
    throw new Error(`Ollama error ${response.status}: ${err}`);
  }

  if (!response.body) {
    throw new Error("Ollama returned a response with no body.");
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      // Keep the last (potentially incomplete) line in the buffer
      buffer = lines.pop() ?? "";

      for (const line of lines) {
        if (!line.trim()) continue;
        try {
          const json = JSON.parse(line);
          if (json.message?.content) {
            process.stdout.write(json.message.content);
          }
        } catch (e) {
          console.error("Unexpected JSON parse failure on complete line:", line, e);
        }
      }
    }

    // Flush any remaining buffer content
    if (buffer.trim()) {
      try {
        const json = JSON.parse(buffer);
        if (json.message?.content) {
          process.stdout.write(json.message.content);
        }
      } catch {
        // Truly incomplete final fragment — log in production
      }
    }
  } finally {
    reader.cancel();
  }

  console.log(); // newline after streaming completes
} finally {
  clearTimeout(timeoutId);
}

The key details here: you instantiate TextDecoder once and each .decode() call uses { stream: true }, which decodes multi-byte UTF-8 characters correctly even when they span chunk boundaries. Each chunk may contain one or more newline-delimited JSON objects, so the code splits on newlines and keeps the last (potentially incomplete) segment in the buffer, holding partial JSON lines until the next chunk completes them rather than discarding them. The try/catch around JSON.parse handles the case where a genuinely malformed line arrives. The process.stdout.write call avoids the trailing newline that console.log adds, letting text concatenate naturally. The reader.cancel() in the finally block ensures the reader lock is released even if an exception exits the loop early.

Maintaining Conversation History

Multi-turn conversation requires accumulating the messages array. After each assistant response, append it back into the array so the model sees the full context on the next turn. Combined with Node.js's built-in readline/promises module, this gives you a functional CLI chat:

// chat-loop.mjs
import { createInterface } from "node:readline/promises";
import { stdin as input, stdout as output } from "node:process";

const MODEL = "llama3.2:3b";
const TIMEOUT_MS = 60_000;
const MAX_MESSAGES = 20;

const rl = createInterface({ input, output });
const messages = [
  { role: "system", content: "You are a helpful assistant." }
];

try {
  while (true) {
    const userInput = await rl.question("\nYou: ");
    if (userInput.trim().toLowerCase() === "exit") break;

    messages.push({ role: "user", content: userInput });

    const controller = new AbortController();
    const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);

    try {
      const response = await fetch("http://localhost:11434/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model: MODEL, messages, stream: true }),
        signal: controller.signal
      });

      if (!response.ok) {
        const err = await response.text();
        console.error(`\nOllama error ${response.status}: ${err}`);
        messages.pop(); // remove the failed user message
        continue;
      }

      if (!response.body) {
        console.error("\nError: Ollama returned a response with no body.");
        messages.pop();
        continue;
      }

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let assistantMessage = "";
      let buffer = "";

      process.stdout.write("\nAssistant: ");
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;

          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split("\n");
          buffer = lines.pop() ?? "";

          for (const line of lines) {
            if (!line.trim()) continue;
            try {
              const json = JSON.parse(line);
              if (json.message?.content) {
                process.stdout.write(json.message.content);
                assistantMessage += json.message.content;
              }
            } catch (e) {
              console.error(
                "Unexpected JSON parse failure on complete line:",
                line,
                e
              );
            }
          }
        }

        // Flush any remaining buffer content
        if (buffer.trim()) {
          try {
            const json = JSON.parse(buffer);
            if (json.message?.content) {
              process.stdout.write(json.message.content);
              assistantMessage += json.message.content;
            }
          } catch {
            // Truly incomplete final fragment — log in production
          }
        }
      } finally {
        reader.cancel();
      }

      messages.push({ role: "assistant", content: assistantMessage });

      // Trim history: keep system prompt + last (MAX_MESSAGES - 1) turns
      if (messages.length > MAX_MESSAGES) {
        messages.splice(1, messages.length - MAX_MESSAGES);
      }
    } finally {
      clearTimeout(timeoutId);
    }
  }
} finally {
  rl.close();
}

The assistantMessage variable collects the full response so it can be appended to the messages array. Each subsequent request sends the complete conversation history, giving the model the context it needs for coherent multi-turn dialogue. The code bounds the history: when messages.length exceeds the MAX_MESSAGES threshold, it drops the oldest turns while preserving the system prompt at index 0. This prevents unbounded memory growth and keeps the conversation within the model's context window. The rl.close() call sits inside a finally block so the readline interface is properly cleaned up even if an exception occurs, preventing terminal state corruption.

Building a VS Code Extension with Ollama

Why a VS Code Extension?

Developers spend most of their working hours inside an editor. Bringing AI assistance inline eliminates context-switching between a chat UI and the codebase. The extension works offline since Ollama runs locally, requires no API key configuration from the end user, and keeps all code and prompts on the developer's machine.

Scaffolding the Extension

Use the Yeoman generator for VS Code extensions:

npm install -g yo generator-code
yo code

Select "New Extension (TypeScript)," name it ollama-assistant, and accept the defaults. In package.json, register a command:

{
  "contributes": {
    "commands": [
      {
        "command": "ollama-assistant.ask",
        "title": "Ollama: Ask Assistant"
      }
    ]
  }
}

In src/extension.ts, wire up the activation:

import * as vscode from 'vscode';

const MODEL = 'llama3.2:3b';
const TIMEOUT_MS = 60_000;

export function activate(context: vscode.ExtensionContext) {
  const outputChannel = vscode.window.createOutputChannel('Ollama Assistant');
  context.subscriptions.push(outputChannel);

  const disposable = vscode.commands.registerCommand(
    'ollama-assistant.ask',
    () => askOllama(outputChannel)
  );
  context.subscriptions.push(disposable);
}

export function deactivate() {}

Connecting to Ollama from the Extension

The VS Code extension host runs on Electron 21 or later, which provides a global fetch. If targeting older VS Code versions, add a compatibility check (typeof fetch !== 'undefined') or import node:http directly. The core function captures user input from an input box, streams the response from Ollama, and writes tokens to an Output Channel in real time:

async function askOllama(outputChannel: vscode.OutputChannel) {
  const userPrompt = await vscode.window.showInputBox({
    prompt: 'Ask Ollama something...',
    placeHolder: 'e.g., Refactor this function to use async/await'
  });

  if (!userPrompt) return;

  outputChannel.show(true);
  outputChannel.appendLine(`You: ${userPrompt}\n`);

  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), TIMEOUT_MS);

  try {
    const response = await fetch('http://localhost:11434/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: MODEL,
        messages: [
          { role: 'system', content: 'You are a helpful coding assistant.' },
          { role: 'user', content: userPrompt }
        ],
        stream: true
      }),
      signal: controller.signal
    });

    if (!response.ok) {
      const err = await response.text();
      outputChannel.appendLine(`Error ${response.status}: ${err}`);
      return;
    }

    if (!response.body) {
      outputChannel.appendLine('Error: No response body received from Ollama.');
      return;
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';

    try {
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() ?? '';

        for (const line of lines) {
          if (!line.trim()) continue;
          try {
            const json = JSON.parse(line);
            if (json.message?.content) {
              outputChannel.append(json.message.content);
            }
          } catch {
            // Malformed line — log in production
          }
        }
      }

      // Flush any remaining buffer content
      if (buffer.trim()) {
        try {
          const json = JSON.parse(buffer);
          if (json.message?.content) {
            outputChannel.append(json.message.content);
          }
        } catch {
          // Truly incomplete final fragment
        }
      }
    } finally {
      reader.cancel();
    }

    outputChannel.appendLine('\n--- Done ---');
  } finally {
    clearTimeout(timeoutId);
  }
}

Chunk boundaries do not always align with JSON object boundaries, so a partial line at the end of a chunk stays in the buffer until the next chunk completes it. The { stream: true } option on the .decode() call ensures multi-byte UTF-8 characters (such as accented characters, CJK text, or emoji) that span chunk boundaries decode correctly. The reader.cancel() in the finally block releases the reader lock if an exception exits the loop early, and the AbortController timeout prevents the extension host from freezing if Ollama becomes unresponsive.

Going Further: Inline Suggestions and Webview Panels

Two natural extensions of this pattern are worth mentioning. First, VS Code's InlineCompletionItemProvider API allows an extension to suggest code completions directly in the editor, similar to GitHub Copilot. The extension would capture the current file content and cursor position, send them to Ollama's /api/generate endpoint, and return the result as an inline suggestion. Second, a sidebar Webview panel can host a full chat UI built in HTML and JavaScript, communicating with the extension backend via postMessage. The companion GitHub repository includes a starter Webview implementation you can use as a reference for that approach.

Performance Tips and Model Selection

Choosing the Right Model

The right model depends on the task and the hardware available.

Llama 3.2 (3B) runs well on machines with 8 GB RAM and handles general chat and lightweight code tasks. It is the fastest of the four models listed here, which makes it the best choice when response latency matters more than output depth.

If you need higher-quality output and have 16 GB total RAM, Mistral 7B handles a wider range of tasks than the 3B alternatives, particularly multi-step instructions and longer-form explanations. It needs roughly 8 GB for model weights alone, so budget for 16 GB total to accommodate OS overhead.

Phi-3 Mini (3.8B) fits comfortably on 8 GB machines. In the author's testing it outperforms Llama 3.2 3B on multi-step reasoning tasks, making it a good pick when you want better logic without jumping to a 7B model.

For code-heavy workflows, Code Llama (7B variant) is purpose-built for code generation and explanation. Reach for it when the primary use case is writing or refactoring code.

A useful rule of thumb: 7B parameter models require approximately 8 GB for model weights; plan for 16 GB total RAM to accommodate OS overhead. 3B models run comfortably on machines with 8 GB total RAM.

Tuning for Speed

Three levers matter most. The num_ctx parameter controls the context window size; reducing it from the model's default (run ollama show llama3.2 to inspect the current value) to a smaller value yields a measurable but workload-dependent speedup, most noticeable on short prompts where the overhead of a large context window dominates. The keep_alive parameter controls how long a model stays loaded in memory between requests; setting it to "5m" or longer avoids the cold-start penalty of reloading weights. GPU offloading is automatic: Ollama detects NVIDIA GPUs and Apple Silicon and uses them without any configuration.
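These levers all ride along in the request body. As a concrete sketch, the helper below assembles a /api/chat payload with the tuning fields set; num_ctx, keep_alive, and temperature are real Ollama request fields, but the function itself is just this article's convenience wrapper:

```javascript
// Hypothetical helper: builds a /api/chat request body with tuning options.
// keep_alive is a top-level field; num_ctx lives inside options.
function buildChatRequest(model, messages, { numCtx = 2048, keepAlive = "5m", temperature = 0.7 } = {}) {
  return {
    model,
    messages,
    stream: true,
    keep_alive: keepAlive,   // how long the model stays loaded after this request
    options: {
      num_ctx: numCtx,       // context window size in tokens
      temperature
    }
  };
}
```

Passing the returned object to JSON.stringify in the fetch calls shown earlier applies the tuning without any other code changes.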

Common Pitfalls and Troubleshooting

Connection refused on localhost:11434. The Ollama server is not running. On macOS, launch the Ollama application. On Windows (preview), launch Ollama from the Start menu if it did not auto-start. On Linux, run ollama serve in a separate terminal. Also check that no firewall rules block localhost connections on that port.

Model not found errors. The API returns an error if you reference a model that has not been pulled. Always run ollama pull <model> before making API calls. Note that model tags matter: llama3.2 and llama3.2:3b can resolve to different quantizations depending on your Ollama version, so the snippets in this article use llama3.2:3b explicitly to avoid ambiguity.
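One way to fail fast is to check the /api/tags listing before making chat calls. The hasModel helper below is hypothetical, but the { models: [{ name }] } shape it reads matches what /api/tags returns:

```javascript
// Checks a parsed /api/tags response body for an exact model name.
function hasModel(tagsResponse, modelName) {
  return (tagsResponse.models ?? []).some((m) => m.name === modelName);
}
```

For example, `hasModel(await (await fetch("http://localhost:11434/api/tags")).json(), "llama3.2:3b")` tells you whether the model is pulled before you send a prompt.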

Slow first responses. The initial request after pulling a model, or after the keep_alive timeout, triggers model loading into GPU/CPU memory. Expect a 3-15 second delay depending on model size and whether GPU offload is active. Subsequent requests skip this loading step and respond much faster, so the fix is simply patience, or prewarming the model with a trivial request at startup.

Streaming parse errors. Chunk boundaries may split a JSON object across two reads. The streaming snippets in this article use a line buffer to handle this correctly: incomplete lines stay in the buffer until the next chunk completes them. Always use this buffering pattern rather than discarding parse failures, which causes silent token loss.
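The buffering pattern can be factored into a pure helper that is easy to unit-test in isolation. The function name is this article's, and for clarity it omits the try/catch guards the full snippets use around JSON.parse:

```javascript
// Splits accumulated NDJSON text into parsed complete lines plus a
// remainder. The trailing partial line is returned so the caller can
// prepend it to the next chunk instead of discarding it.
function drainNdjson(buffer) {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? "";
  const objects = [];
  for (const line of lines) {
    if (!line.trim()) continue;
    objects.push(JSON.parse(line));
  }
  return { objects, rest };
}
```

In the streaming loop you would call `({ objects, rest } = drainNdjson(rest + decoder.decode(value, { stream: true })))` on each chunk and emit each object's message.content.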

If Ollama becomes unresponsive (for example, due to resource exhaustion), a fetch call with no timeout blocks indefinitely. All snippets in this article use an AbortController with a 60-second timeout. For 3B models generating responses under 500 tokens, 60 seconds is generous. For 7B models generating long output, increase TIMEOUT_MS to 120 seconds or more.

Wrapping Up and Next Steps

The companion GitHub repository contains the complete chat application and VS Code extension template, ready to clone and extend.

From here, natural next steps include using the /api/embed endpoint for RAG pipelines, exploring function calling with tool-use models, or wrapping the Node.js backend in an Express server with a web-based chat UI.