Local Code Assistants: Replacing GitHub Copilot with Private AI


How to Set Up a Local Code Assistant as a Copilot Alternative
- Install Ollama on your machine using Homebrew (macOS), the official install script (Linux), or the Windows installer.
- Pull code-specialized models by running ollama pull codellama:7b-code and ollama pull starcoder2:3b.
- Verify Ollama works by running a test prompt from the terminal with ollama run.
- Install the Continue.dev extension in VS Code from the Extensions panel.
- Configure Continue.dev's config file to point at Ollama on localhost:11434, assigning CodeLlama 7B for chat and StarCoder2 3B for tab autocomplete.
- Test inline completions by typing a function signature and accepting ghost-text suggestions with Tab.
- Tune performance by creating a custom Modelfile to adjust context window size and GPU offloading.
Table of Contents
- Why Go Local with Your Code Assistant?
- The Local AI Coding Stack at a Glance
- Installing Ollama and Pulling a Code Model
- Setting Up Continue.dev in VS Code
- Using Your Local Code Assistant Day to Day
- CodeLlama vs. StarCoder: Choosing the Right Model
- Performance Tuning and Troubleshooting
- How It Compares to GitHub Copilot
- Wrapping Up
Why Go Local with Your Code Assistant?
Every character typed into a cloud-based code assistant, in most implementations, leaves your machine so a remote server can run inference on it. For developers working on proprietary codebases, that means fragments of trade secrets, authentication logic, and internal API designs flow through third-party infrastructure with every tab completion. A local code assistant, configured as a Copilot alternative, keeps all of that on your machine.
Compliance frameworks like HIPAA and SOC 2 place strict controls on where systems process sensitive data. Enterprise security policies frequently prohibit sending source code to external endpoints, blocking GitHub Copilot and similar services. Even teams without formal compliance mandates often restrict cloud AI tool usage for anything touching customer data or proprietary algorithms.
Then there is cost. GitHub Copilot runs $10 per month for individuals and $19 per user per month for business plans (pricing as of mid-2025; verify current pricing at github.com/features/copilot). A local code assistant built on open-weight models costs nothing beyond the hardware already sitting on a developer's desk.
Offline availability matters more than most people expect. Airplane mode, spotty conference Wi-Fi, or a VPN that blocks external APIs all become non-issues when inference runs locally. Latency drops too, since there is no round trip to a data center.
The trade-offs are real, though. Local models are smaller than what powers Copilot, which translates to measurably lower completion accuracy: CodeLlama 7B scores roughly 33% pass@1 on HumanEval, compared to reports of 40-50%+ for Copilot's underlying model on similar benchmarks. You also need a machine with decent RAM and ideally a GPU with sufficient VRAM. A 7B parameter model in 4-bit quantized form (as Ollama provides by default) requires roughly 4-6 GB of VRAM. Unquantized variants require approximately 14 GB. A 3B quantized model can squeeze into 3 GB. Developers on older laptops without discrete GPUs will feel the constraints.
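The VRAM figures above follow from simple arithmetic: each parameter occupies a fixed number of bits in memory. A quick sketch (the helper name `model_weight_gb` is ours, and the estimate covers weights only, ignoring the KV cache and runtime overhead that add roughly 1-2 GB in practice):

```python
def model_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameter count times bits per weight.

    Weights only; the KV cache, activations, and runtime overhead add
    roughly 1-2 GB on top of this figure in practice.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_weight_gb(7, 4))   # 7B at 4-bit: 3.5 GB for weights alone
print(model_weight_gb(7, 16))  # 7B unquantized fp16: 14.0 GB
print(model_weight_gb(3, 4))   # 3B at 4-bit: 1.5 GB
```

The 3.5 GB weights-only figure for a 4-bit 7B model lines up with the 4-6 GB total once overhead is included.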
The Local AI Coding Stack at a Glance
The architecture is straightforward: VS Code connects to the Continue.dev extension, which sends requests to Ollama running as a local API server, which loads and runs a code-specialized model like CodeLlama or StarCoder2.
We tested this guide with Ollama v0.6 and Continue.dev v0.8. Both projects release frequently with breaking configuration changes between versions. Always check release notes when using newer versions. You will also need VS Code installed.
| Component | Role |
|---|---|
| VS Code | Editor and IDE providing the user interface |
| Continue.dev | Open-source extension bridging the editor to any LLM backend |
| Ollama | Local model runtime exposing an OpenAI-compatible API on localhost |
| CodeLlama / StarCoder2 | Code-specialized language models optimized for completion and generation |
System requirements:
| Spec | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB+ |
| GPU VRAM | 3 GB (for 3B quantized models) | 8 GB+ (for 7B/13B quantized models) |
| Disk | 5 GB free | 20 GB+ for multiple models |
| OS | macOS 12+, Linux (glibc 2.31+), Windows 10+ | Same |
Ollama handles quantization and GPU offloading transparently, so developers do not need to manually configure backends in most cases. GPU acceleration requires appropriate drivers: CUDA for NVIDIA GPUs, ROCm for AMD GPUs on Linux, or Metal (built into macOS) for Apple Silicon. Ollama detects these automatically if installed.
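To make the request flow concrete, here is roughly the shape of what Continue.dev sends to Ollama. The body fields (`model`, `prompt`, `stream`, `options`) come from Ollama's documented /api/generate endpoint; the helper below is an illustration of ours and only builds the JSON body without sending it:

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """Build the JSON body for a POST to http://localhost:11434/api/generate.

    'model', 'prompt', 'stream', and 'options' are standard fields of
    Ollama's generate endpoint; stream=False requests a single JSON reply
    instead of a token-by-token stream.
    """
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return json.dumps(body)

payload = build_generate_request("codellama:7b-code", "def fib(n):")
```

The editor extension constructs bodies like this on every chat message and completion request, which is why everything in them stays on localhost.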
Installing Ollama and Pulling a Code Model
Install Ollama
Installation differs slightly by operating system.
macOS (Homebrew):
brew install ollama
Linux (official install script):
Review the install script before executing. Download it first, inspect it, then run it:
curl -fsSL https://ollama.com/install.sh -o ollama_install.sh
less ollama_install.sh
sh ollama_install.sh
The script installs the Ollama binary to /usr/local/bin.
Windows: Download the installer from ollama.com/download and run it. Ollama runs as a background service on Windows.
After installing, verify the installation:
ollama --version
This should return a version string (e.g., ollama version 0.x.y), confirming the install succeeded. If your shell cannot find the command, add the Ollama binary to your PATH and restart your shell.
Start the Ollama server if it is not already running:
ollama serve
On macOS and Windows, Ollama typically starts automatically as a background process after installation. You can verify with ollama ps.
Pull CodeLlama and StarCoder Models
Ollama hosts pre-quantized model variants. The key decision is model size, which directly maps to VRAM consumption and inference speed. All VRAM figures below assume the default quantized (typically 4-bit) variants that Ollama provides.
For CodeLlama, the 7B code-specialized variant is the sweet spot for most developer machines with a mid-range GPU. The 13B variant delivers better accuracy but requires roughly 10 GB of VRAM in quantized form. The 34B quantized model needs 20 GB+ and is impractical for most local setups.
StarCoder2 ships in 3B, 7B, and 15B sizes. The 3B variant handles single-line and short block completions at speeds comparable to models twice its parameter count, while fitting comfortably in 3 GB of VRAM (quantized). That makes it the practical choice for laptops with integrated GPUs or limited discrete memory.
ollama pull codellama:7b-code
ollama pull starcoder2:3b
ollama list
The ollama list command confirms both models are downloaded and shows their sizes on disk. Note that codellama:7b-code is approximately 3.8 GB and starcoder2:3b is approximately 1.7 GB, so initial downloads will take time depending on your connection speed.
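Download time is, to a first approximation, just size over bandwidth. A back-of-envelope helper (ours; it ignores protocol overhead and any server-side throttling):

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Rough download time in minutes for a file of size_gb gigabytes
    over a link of mbps megabits per second (1 GB = 8000 megabits)."""
    return size_gb * 8000 / mbps / 60

print(round(download_minutes(3.8, 100), 1))  # CodeLlama 7B at 100 Mbps: ~5 min
print(round(download_minutes(3.8, 20), 1))   # same file at 20 Mbps: ~25 min
```

On a fast connection both models arrive in under ten minutes; on hotel Wi-Fi, plan accordingly.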
Quick Smoke Test from the Terminal
Before configuring any editor integration, verify the model responds correctly. Piping the prompt via stdin avoids shell interpretation issues with special characters in the prompt string:
echo "Write a Python function that merges two sorted lists" | ollama run codellama:7b-code
The model should produce a syntactically valid Python function implementing the merge logic. The exact implementation may vary between runs. For example, it might produce something like:
def merge_sorted_lists(list1, list2):
    """Merge two sorted lists into a single sorted list.

    Both input lists must already be sorted in ascending order;
    a ValueError is raised otherwise.
    """
    if list1 != sorted(list1) or list2 != sorted(list2):
        raise ValueError("Input lists must be sorted in ascending order")
    merged = []
    i, j = 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged
The sorted() precondition guard adds O(n log n) overhead. For performance-critical paths, you may remove the runtime check and rely on the docstring contract alone.
If the model produces coherent code, Ollama is working correctly and ready for editor integration.
Setting Up Continue.dev in VS Code
Install the Extension
Open VS Code, go to the Extensions panel (Ctrl+Shift+X / Cmd+Shift+X), and search for "Continue". Install the extension published by Continue.dev. It is one of the most widely installed open-source AI code assistant extensions for VS Code.
On first launch, Continue opens a welcome panel offering to configure cloud providers. Skip this entirely. The goal is a fully local setup, and all configuration will be done manually through the config file.
Configure Continue.dev for Ollama
Continue.dev stores its configuration at ~/.continue/config.json (legacy) or ~/.continue/config.yaml (v0.8+) on macOS and Linux, and the equivalent %USERPROFILE%\.continue\ path on Windows. After installation, run ls ~/.continue/ to confirm which file is present and use the appropriate format.
If your installation uses config.json, open this file and replace its contents with the following:
{
"models": [
{
"title": "CodeLlama 7B",
"provider": "ollama",
"model": "codellama:7b-code",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b",
"apiBase": "http://localhost:11434"
}
}
If your installation uses config.yaml, consult the Continue.dev configuration docs for the equivalent YAML format.
This configuration uses CodeLlama 7B for the chat sidebar (where you ask questions, request refactors, generate code) and StarCoder2 3B for inline tab autocomplete. The split makes sense: tab autocomplete fires on nearly every keystroke and needs to be fast, so the smaller 3B model handles it. Chat interactions are less frequent and benefit from the larger model's better reasoning.
Key Configuration Options
The base configuration works, but several options improve the experience significantly:
{
"models": [
{
"title": "CodeLlama 7B",
"provider": "ollama",
"model": "codellama:7b-code",
"apiBase": "http://localhost:11434",
"parameters": {
"temperature": 0.2
}
},
{
"title": "StarCoder2 7B",
"provider": "ollama",
"model": "starcoder2:7b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b",
"apiBase": "http://localhost:11434"
},
"requestOptions": {
"timeout": 60
},
"contextProviders": [
{ "name": "file" },
{ "name": "codebase" },
{ "name": "terminal" }
]
}
The apiBase value http://localhost:11434 uses plain HTTP, which is safe when Ollama runs on your local machine. If you ever change apiBase to point at a remote host, switch to https:// to prevent code and prompts from traveling unencrypted over the network.
Be aware that the terminal context provider grants the model read access to terminal output. If your terminal session contains echoed secrets (API keys, tokens, passwords), those values will be included in prompts sent to the model. While Ollama processes these locally, consider removing terminal from this list if your workflow involves sensitive terminal output, or treat it as opt-in for specific sessions.
Place the timeout value (in seconds) in the top-level requestOptions block; Continue.dev v0.8+ reads it there. This prevents the extension from hanging indefinitely when a model takes too long or becomes unresponsive. Setting temperature to 0.2 in the model-level parameters block produces more focused, consistent completions by reducing randomness. For fully deterministic output, use temperature: 0.
The parameters block key name (parameters vs options) varies by Continue.dev version. Verify against your installed version's schema at the Continue.dev configuration docs. Placing temperature or timeout in the wrong block causes them to be silently ignored with no error message.
The contextProviders array enables referencing open files, the broader codebase, and terminal output in chat prompts using @file, @codebase, and @terminal respectively. Adding multiple model entries lets you switch between them in the Continue sidebar dropdown during a session.
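Because misplaced keys are silently ignored rather than reported, a small sanity-check script can save debugging time. This sketch (the function name is ours) flags two pitfalls covered in this guide: a trailing slash in apiBase and a timeout placed at the model level instead of in requestOptions:

```python
import json

def check_continue_config(raw: str) -> list:
    """Return a list of warnings for common config.json pitfalls.

    Checks the pitfalls described in this guide: a trailing slash in
    apiBase and a timeout key placed on a model entry are both silently
    ignored by Continue.dev rather than raised as errors.
    """
    cfg = json.loads(raw)
    warnings = []
    entries = list(cfg.get("models", []))
    if "tabAutocompleteModel" in cfg:
        entries.append(cfg["tabAutocompleteModel"])
    for entry in entries:
        title = entry.get("title", "?")
        if entry.get("apiBase", "").endswith("/"):
            warnings.append(f"{title}: apiBase has a trailing slash")
        if "timeout" in entry:
            warnings.append(f"{title}: timeout belongs in top-level requestOptions")
    return warnings

bad = json.dumps({"models": [
    {"title": "CodeLlama 7B", "apiBase": "http://localhost:11434/", "timeout": 60}
]})
print(check_continue_config(bad))  # two warnings
```

Running a check like this after every config edit is faster than discovering mid-session that a setting never took effect.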
Using Your Local Code Assistant Day to Day
Tab Autocomplete in Action
With Continue.dev configured and Ollama running, inline completions appear as ghost text while typing, similar to Copilot. Press Tab to accept a suggestion, Escape to dismiss it, or keep typing to refine.
Here is a typical flow. A developer types a function signature in a TypeScript file:
// Before: developer types this
function calculateDiscount(price: number, discountPercent: number): number {
The local model completes the body:
// After: StarCoder2 3B generates the completion
function calculateDiscount(price: number, discountPercent: number): number {
if (discountPercent < 0 || discountPercent > 100) {
throw new Error("Discount percent must be between 0 and 100");
}
return price - (price * discountPercent / 100);
}
Completion speed depends heavily on hardware. On a machine with an NVIDIA RTX 3060 (12 GB VRAM), StarCoder2 3B typically produces completions in under 500ms. Larger models on CPU-only inference may take 2-5 seconds, which is noticeable but still usable for block completions.
Chat-Based Code Generation and Refactoring
The Continue sidebar (Ctrl+L / Cmd+L) opens a chat interface connected to CodeLlama 7B. Select code in the editor and press Ctrl+L to send it to chat with a question.
For inline editing, select code and use Ctrl+I / Cmd+I to open an inline edit prompt (shortcut may vary by Continue.dev version). Here is an example of fixing a bug:
# Original function with an off-by-one error
def find_pairs(nums, target):
results = []
for i in range(len(nums)):
for j in range(i, len(nums)): # Bug: should start at i+1
if nums[i] + nums[j] == target:
results.append((nums[i], nums[j]))
return results
After highlighting and asking Continue to "fix the bug in this function," the model returns:
def find_pairs(nums, target):
"""Find all unique pairs in nums that sum to target.
Note: For inputs with duplicate values, duplicate pairs may appear
in the results. Input order determines pair order — (a, b) where a
appears before b in the list.
"""
results = []
for i in range(len(nums)):
for j in range(i + 1, len(nums)): # Fixed: start at i+1 to avoid pairing with self
if nums[i] + nums[j] == target:
results.append((nums[i], nums[j]))
return results
The @file and @codebase context commands are particularly useful. Typing @file utils.py in a chat prompt gives the model visibility into that file's contents for cross-file reasoning (syntax may vary by Continue.dev version; consult the context providers documentation).
Generating Documentation and Tests
Local models handle structured generation tasks like docstrings and unit tests reliably for straightforward functions. Select a function and prompt: "Write pytest tests for the selected function."
# Source function
def celsius_to_fahrenheit(celsius: float) -> float:
return celsius * 9 / 5 + 32
The model produces test cases covering standard boundaries:
# Generated pytest tests from CodeLlama 7B
# Requires: pip install pytest
# Replace 'your_module' with the actual module name containing celsius_to_fahrenheit
import pytest
from your_module import celsius_to_fahrenheit
def test_freezing_point():
assert celsius_to_fahrenheit(0) == pytest.approx(32.0)
def test_boiling_point():
assert celsius_to_fahrenheit(100) == pytest.approx(212.0)
def test_negative_temperature():
assert celsius_to_fahrenheit(-40) == pytest.approx(-40.0)
def test_body_temperature():
assert celsius_to_fahrenheit(37) == pytest.approx(98.6, rel=1e-3)
All float assertions use pytest.approx to avoid fragile exact-equality comparisons on floating-point results. For complex business logic with many edge cases, treat the model's output as a starting scaffold rather than a complete suite.
CodeLlama vs. StarCoder: Choosing the Right Model
| Criteria | CodeLlama 7B | StarCoder2 3B | StarCoder2 7B |
|---|---|---|---|
| Min VRAM | 6 GB | 3 GB | 6 GB |
| Best For | Python, C++ focused work | Fast multi-language autocomplete | Multi-language autocomplete + chat |
| FIM Support | Yes | Yes | Yes |
| License | Llama 2 Community License (commercial restrictions apply; review Meta's terms before enterprise use) | BigCode OpenRAIL-M (conditional open license; review permitted use cases before deployment) | BigCode OpenRAIL-M (conditional open license; review permitted use cases before deployment) |
| Training Languages | Focus on Python, C/C++, Java, others | 600+ languages | 600+ languages |
FIM (Fill-in-the-Middle) is a training technique enabling the model to complete code given both preceding and following context, improving inline autocomplete quality.
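Under the hood, a FIM prompt wraps the code before and after the cursor in sentinel tokens and asks the model to generate the missing middle. Continue.dev assembles this automatically; the sketch below only illustrates the idea, using the sentinel token names published for the StarCoder family (verify them against the model card of the exact variant you pull):

```python
def starcoder_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in the StarCoder token format.

    The model generates the missing middle after the <fim_middle> token,
    conditioned on both the code before the cursor (prefix) and after it
    (suffix). Token names follow the StarCoder family's published format.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Cursor sits between "return " and the call site below it
prompt = starcoder_fim_prompt(
    "def add(a, b):\n    return ",
    "\n\nprint(add(2, 3))",
)
```

Seeing the suffix is what lets a FIM-trained model complete a function body that must fit code already written below the cursor.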
CodeLlama descends from Meta's Llama 2 and was further trained on code-heavy datasets. Its instruction-tuned variants handle conversational code tasks (explaining, refactoring, debugging) competently for single-function scope, making it a strong chat model. Its Python-specialized variant excels in Python-heavy workflows. CodeLlama is licensed under the Llama 2 Community License, which restricts use by organizations exceeding 700 million monthly active users. All users should review the full license at Meta's repository before commercial deployment.
StarCoder2, developed by the BigCode project, was trained on The Stack v2, covering over 600 programming languages. The 3B variant is the standout choice for tab autocomplete due to its small footprint and fast inference. For teams working across JavaScript, TypeScript, Rust, Go, and other languages beyond Python, StarCoder2 covers more ground. StarCoder2 is released under the BigCode OpenRAIL-M license, which permits broad use but includes specific behavioral restrictions. Review the full license at bigcode-project.org before deployment.
Use StarCoder2 3B for autocomplete and CodeLlama 7B for chat. On machines with less than 6 GB VRAM, use StarCoder2 3B for both.
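That recommendation condenses into a tiny heuristic (the function name is ours; the 6 GB threshold follows the minimum-VRAM figures in the table above):

```python
def pick_models(vram_gb: float) -> dict:
    """Model selection heuristic from the guidance above: below 6 GB of
    VRAM, use StarCoder2 3B for everything; at 6 GB or more, split
    autocomplete and chat between the small and large models."""
    if vram_gb < 6:
        return {"autocomplete": "starcoder2:3b", "chat": "starcoder2:3b"}
    return {"autocomplete": "starcoder2:3b", "chat": "codellama:7b-code"}

print(pick_models(4))  # laptop with integrated graphics
print(pick_models(12))  # mid-range discrete GPU
```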
Performance Tuning and Troubleshooting
Speed Up Inference
Check whether Ollama is using your GPU:
ollama ps
This shows running models and how many layers are GPU-offloaded. Output format may vary by Ollama version. Look for a PROCESSOR column indicating gpu, cpu, or a split (e.g., 100% GPU). If all layers show CPU, GPU acceleration is not active. Verify that the appropriate GPU drivers (CUDA, ROCm, or Metal) are installed.
Create a custom Modelfile to control inference parameters:
FROM codellama:7b-code
# num_ctx: reduces context window from model default (4096 for codellama:7b-code)
# to 2048 to save VRAM. Remove or increase if full context is needed.
PARAMETER num_ctx 2048
# num_gpu 99: Ollama convention meaning "offload all layers to GPU".
# Not a literal layer count; will not error if model has fewer layers.
PARAMETER num_gpu 99
Build it with:
ollama create codellama-fast -f Modelfile
Setting num_ctx to 2048 reduces the context window from the model's default (4096 for codellama:7b-code; verify with ollama show <model>), reducing memory usage and speeding up inference. Setting num_gpu to 99 is a conventional way to tell Ollama to offload as many transformer layers to the GPU as available VRAM allows. It is not a literal count and will not error if your model has fewer layers.
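Most of the VRAM saved by lowering num_ctx comes from the KV cache, which grows linearly with context length. A rough estimate using the published Llama-2 7B architecture (the base of codellama:7b-code) and an fp16 cache; actual usage depends on the runtime's cache layout and any KV quantization:

```python
def kv_cache_gb(ctx_len: int, n_layers: int = 32, n_heads: int = 32,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """Estimate KV-cache memory for a given context length.

    Factor of 2 covers keys and values; defaults match the published
    Llama-2 7B architecture with an fp16 cache. Real usage varies by
    runtime.
    """
    bytes_total = 2 * n_layers * n_heads * head_dim * ctx_len * bytes_per_val
    return bytes_total / 1e9

print(round(kv_cache_gb(2048), 2))  # ~1.07 GB at num_ctx 2048
print(round(kv_cache_gb(4096), 2))  # ~2.15 GB at the default 4096
```

By this estimate, halving the context window from 4096 to 2048 frees roughly 1 GB of VRAM, which can be the difference between full and partial GPU offload on an 8 GB card.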
Common Issues and Fixes
If a model stops responding, verify Ollama is running:
curl --max-time 5 --connect-timeout 3 http://localhost:11434/api/tags
This should return a JSON list of available models. The --max-time 5 flag ensures the command times out after 5 seconds rather than hanging indefinitely if the server is unresponsive. If the connection is refused, start Ollama with ollama serve. If port 11434 is already in use by another process, ollama serve will fail. Check with lsof -i :11434 on macOS/Linux or netstat -ano | findstr :11434 on Windows.
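The same health check can be scripted, for example from an editor task or a startup hook. A minimal Python equivalent of the curl command above (/api/tags is Ollama's standard model-listing route; the function name is ours):

```python
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 3.0) -> bool:
    """Return True if the Ollama server answers /api/tags within `timeout`
    seconds; False on connection refusal or timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags",
                                    timeout=timeout) as resp:
            # URLError and socket timeouts both subclass OSError
            return resp.status == 200
    except OSError:
        return False
```

Like the curl flags, the explicit timeout ensures the check fails fast instead of hanging on an unresponsive server.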
Slow completions usually mean the model is too large for available VRAM, causing partial or full CPU inference. Switch to a smaller variant. In our testing, StarCoder2 3B on an RTX 3060 produced completions roughly 4x faster than CodeLlama 7B running on CPU-only inference on the same machine.
If Continue.dev does not detect Ollama, double-check the apiBase value in your Continue.dev configuration file. It must be http://localhost:11434 with no trailing slash.
How It Compares to GitHub Copilot
| Aspect | Local (Continue + Ollama) | GitHub Copilot |
|---|---|---|
| Privacy | Complete: no data leaves machine | Code sent to Microsoft/OpenAI servers |
| Cost | Free | $10-19/month per user |
| Offline Use | Full functionality | Requires internet |
| Accuracy | Reliable for boilerplate and single-function completions; weaker on multi-file reasoning | Stronger on multi-step generation and unfamiliar libraries (based on community benchmarks and reported HumanEval-style evaluations) |
| Completion Speed | Hardware-dependent (e.g., under 500ms for StarCoder2 3B on RTX 3060) | Usually 200-800ms per completion based on community reports; varies by network and prompt length; no published SLA |
| Multi-file Context | Supported via context providers | Native deep context |
| Customization | Full control over models, parameters | Limited configuration |
Copilot still outperforms local models on multi-step reasoning, multi-file context, and niche or recently released libraries, where its far larger training set gives it a clear edge. Its underlying model (a large-scale proprietary model; architecture not publicly disclosed by Microsoft) has been trained on far more data. But for boilerplate generation, single-function completions, and standard patterns, a local stack performs comparably while keeping every byte of your source code on your own hardware.
Wrapping Up
The configuration steps take roughly 15 minutes, plus time to download the models (approximately 5-6 GB total, depending on connection speed). Install Ollama, pull two models, install Continue.dev, and drop in a config file. The Continue.dev documentation and Ollama model library are the best starting points for tracking new models and configuration options as they ship.