Local Code Assistants: Replacing GitHub Copilot with Private AI


How to Set Up a Local Code Assistant as a Copilot Alternative
- Install Ollama on your machine using Homebrew (macOS), the official install script (Linux), or the Windows installer.
- Pull code-specialized models by running ollama pull codellama:7b-code and ollama pull starcoder2:3b.
- Verify Ollama works by running a test prompt from the terminal with ollama run.
- Install the Continue.dev extension in VS Code from the Extensions panel.
- Configure Continue.dev's config file to point at Ollama on localhost:11434, assigning CodeLlama 7B for chat and StarCoder2 3B for tab autocomplete.
- Test inline completions by typing a function signature and accepting ghost-text suggestions with Tab.
- Tune performance by creating a custom Modelfile to adjust context window size and GPU offloading.
Table of Contents
- Why Go Local with Your Code Assistant?
- The Local AI Coding Stack at a Glance
- Installing Ollama and Pulling a Code Model
- Setting Up Continue.dev in VS Code
- Using Your Local Code Assistant Day to Day
- CodeLlama vs. StarCoder: Choosing the Right Model
- Performance Tuning and Troubleshooting
- How It Compares to GitHub Copilot
- Wrapping Up
Why Go Local with Your Code Assistant?
Every character typed into a cloud-based code assistant, in most implementations, leaves your machine so a remote server can run inference on it. For developers working on proprietary codebases, that means fragments of trade secrets, authentication logic, and internal API designs flow through third-party infrastructure with every tab completion. A local code assistant, configured as a Copilot alternative, keeps all of that on your machine.
Compliance frameworks like HIPAA and SOC 2 place strict controls on where systems process sensitive data. Enterprise security policies frequently prohibit sending source code to external endpoints, blocking GitHub Copilot and similar services. Even teams without formal compliance mandates often restrict cloud AI tool usage for anything touching customer data or proprietary algorithms.
Then there is cost. GitHub Copilot runs $10 per month for individuals and $19 per user per month for business plans (pricing as of mid-2025; verify current pricing at github.com/features/copilot). A local code assistant built on open-weight models costs nothing beyond the hardware already sitting on a developer's desk.
Offline availability matters more than most people expect. Airplane mode, spotty conference Wi-Fi, or a VPN that blocks external APIs all become non-issues when inference runs locally. Latency drops too, since there is no round trip to a data center.
The trade-offs are real, though. Local models are smaller than what powers Copilot, which translates to measurably lower completion accuracy: CodeLlama 7B scores roughly 33% pass@1 on HumanEval, compared to reports of 40-50%+ for Copilot's underlying model on similar benchmarks. You also need a machine with decent RAM and ideally a GPU with sufficient VRAM. A 7B parameter model in 4-bit quantized form (as Ollama provides by default) requires roughly 4-6 GB of VRAM. Unquantized variants require approximately 14 GB. A 3B quantized model can squeeze into 3 GB. Developers on older laptops without discrete GPUs will feel the constraints.
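The VRAM figures above follow from simple arithmetic: each parameter occupies a fixed number of bits in memory. A quick sketch (the helper name `model_weight_gb` is ours, and the estimate covers weights only, ignoring the KV cache and runtime overhead that add roughly 1-2 GB in practice):

```python
def model_weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameter count times bits per weight.

    Weights only; the KV cache, activations, and runtime overhead add
    roughly 1-2 GB on top of this figure in practice.
    """
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_weight_gb(7, 4))   # 7B at 4-bit: 3.5 GB for weights alone
print(model_weight_gb(7, 16))  # 7B unquantized fp16: 14.0 GB
print(model_weight_gb(3, 4))   # 3B at 4-bit: 1.5 GB
```

The 3.5 GB weights-only figure for a 4-bit 7B model lines up with the 4-6 GB total once overhead is included.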
The Local AI Coding Stack at a Glance
The architecture is straightforward: VS Code connects to the Continue.dev extension, which sends requests to Ollama running as a local API server, which loads and runs a code-specialized model like CodeLlama or StarCoder2.
We tested this guide with Ollama v0.6 and Continue.dev v0.8. Both projects release frequently with breaking configuration changes between versions. Always check release notes when using newer versions. You will also need VS Code installed.
| Component | Role |
|---|---|
| VS Code | Editor and IDE providing the user interface |
| Continue.dev | Open-source extension bridging the editor to any LLM backend |
| Ollama | Local model runtime exposing an OpenAI-compatible API on localhost |
| CodeLlama / StarCoder2 | Code-specialized language models optimized for completion and generation |
System requirements:
| Spec | Minimum | Recommended |
|---|---|---|
| RAM | 8 GB | 16 GB+ |
| GPU VRAM | 3 GB (for 3B quantized models) | 8 GB+ (for 7B/13B quantized models) |
| Disk | 5 GB free | 20 GB+ for multiple models |
| OS | macOS 12+, Linux (glibc 2.31+), Windows 10+ | Same |
Ollama handles quantization and GPU offloading transparently, so developers do not need to manually configure backends in most cases. GPU acceleration requires appropriate drivers: CUDA for NVIDIA GPUs, ROCm for AMD GPUs on Linux, or Metal (built into macOS) for Apple Silicon. Ollama detects these automatically if installed.
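To make the request flow concrete, here is roughly the shape of what Continue.dev sends to Ollama. The body fields (`model`, `prompt`, `stream`, `options`) come from Ollama's documented /api/generate endpoint; the helper below is an illustration of ours and only builds the JSON body without sending it:

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 2048) -> str:
    """Build the JSON body for a POST to http://localhost:11434/api/generate.

    'model', 'prompt', 'stream', and 'options' are standard fields of
    Ollama's generate endpoint; stream=False requests a single JSON reply
    instead of a token-by-token stream.
    """
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }
    return json.dumps(body)

payload = build_generate_request("codellama:7b-code", "def fib(n):")
```

The editor extension constructs bodies like this on every chat message and completion request, which is why everything in them stays on localhost.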
Installing Ollama and Pulling a Code Model
Install Ollama
Installation differs slightly by operating system.
macOS (Homebrew):
brew install ollama
Linux (official install script):
Review the install script before executing. Download it first, inspect it, then run it:
curl -fsSL https://ollama.com/install.sh -o ollama_install.sh
less ollama_install.sh
sh ollama_install.sh
The script installs the Ollama binary to /usr/local/bin.
Windows: Download the installer from ollama.com/download and run it. Ollama runs as a background service on Windows.
After installing, verify the installation:
ollama --version
This should return a version string (e.g., ollama version 0.x.y), confirming the install succeeded. If your shell cannot find the command, add the Ollama binary to your PATH and restart your shell.
Start the Ollama server if it is not already running:
ollama serve
On macOS and Windows, Ollama typically starts automatically as a background process after installation. You can verify with ollama ps.
Pull CodeLlama and StarCoder Models
Ollama hosts pre-quantized model variants. The key decision is model size, which directly maps to VRAM consumption and inference speed. All VRAM figures below assume the default quantized (typically 4-bit) variants that Ollama provides.
For CodeLlama, the 7B code-specialized variant is the sweet spot for most developer machines with a mid-range GPU. The 13B variant delivers better accuracy but requires roughly 10 GB of VRAM in quantized form. The 34B quantized model needs 20 GB+ and is impractical for most local setups.
StarCoder2 ships in 3B, 7B, and 15B sizes. The 3B variant handles single-line and short block completions at speeds comparable to models twice its parameter count, while fitting comfortably in 3 GB of VRAM (quantized). That makes it the practical choice for laptops with integrated GPUs or limited discrete memory.
ollama pull codellama:7b-code
ollama pull starcoder2:3b
ollama list
The ollama list command confirms both models are downloaded and shows their sizes on disk. Note that codellama:7b-code is approximately 3.8 GB and starcoder2:3b is approximately 1.7 GB, so initial downloads will take time depending on your connection speed.
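Download time is, to a first approximation, just size over bandwidth. A back-of-envelope helper (ours; it ignores protocol overhead and any server-side throttling):

```python
def download_minutes(size_gb: float, mbps: float) -> float:
    """Rough download time in minutes for a file of size_gb gigabytes
    over a link of mbps megabits per second (1 GB = 8000 megabits)."""
    return size_gb * 8000 / mbps / 60

print(round(download_minutes(3.8, 100), 1))  # CodeLlama 7B at 100 Mbps: ~5 min
print(round(download_minutes(3.8, 20), 1))   # same file at 20 Mbps: ~25 min
```

On a fast connection both models arrive in under ten minutes; on hotel Wi-Fi, plan accordingly.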
Quick Smoke Test from the Terminal
Before configuring any editor integration, verify the model responds correctly. Piping the prompt via stdin avoids shell interpretation issues with special characters in the prompt string:
echo "Write a Python function that merges two sorted lists" | ollama run codellama:7b-code
The model should produce a syntactically valid Python function implementing the merge logic. The exact implementation may vary between runs. For example, it might produce something like:
def merge_sorted_lists(list1, list2):
    """Merge two sorted lists into a single sorted list.

    Both input lists must already be sorted in ascending order;
    a ValueError is raised otherwise.
    """
    if list1 != sorted(list1) or list2 != sorted(list2):
        raise ValueError("Input lists must be sorted in ascending order")
    merged = []
    i, j = 0, 0
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged
The sorted() precondition guard adds O(n log n) overhead. For performance-critical paths, you may remove the runtime check and rely on the docstring contract alone.
If the model produces coherent code, Ollama is working correctly and ready for editor integration.
Setting Up Continue.dev in VS Code
Install the Extension
Open VS Code, go to the Extensions panel (Ctrl+Shift+X / Cmd+Shift+X), and search for "Continue". Install the extension published by Continue.dev. It is one of the most widely installed open-source AI code assistant extensions for VS Code.
On first launch, Continue opens a welcome panel offering to configure cloud providers. Skip this entirely. The goal is a fully local setup, and all configuration will be done manually through the config file.
Configure Continue.dev for Ollama
Continue.dev stores its configuration at ~/.continue/config.json (legacy) or ~/.continue/config.yaml (v0.8+) on macOS and Linux, and the equivalent %USERPROFILE%\.continue\ path on Windows. After installation, run ls ~/.continue/ to confirm which file is present and use the appropriate format.
If your installation uses config.json, open this file and replace its contents with the following:
{
"models": [
{
"title": "CodeLlama 7B",
"provider": "ollama",
"model": "codellama:7b-code",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b",
"apiBase": "http://localhost:11434"
}
}
If your installation uses config.yaml, consult the Continue.dev configuration docs for the equivalent YAML format.
This configuration uses CodeLlama 7B for the chat sidebar (where you ask questions, request refactors, generate code) and StarCoder2 3B for inline tab autocomplete. The split makes sense: tab autocomplete fires on nearly every keystroke and needs to be fast, so the smaller 3B model handles it. Chat interactions are less frequent and benefit from the larger model's better reasoning.
Key Configuration Options
The base configuration works, but several options improve the experience significantly:
{
"models": [
{
"title": "CodeLlama 7B",
"provider": "ollama",
"model": "codellama:7b-code",
"apiBase": "http://localhost:11434",
"parameters": {
"temperature": 0.2
}
},
{
"title": "StarCoder2 7B",
"provider": "ollama",
"model": "starcoder2:7b",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "StarCoder2 3B",
"provider": "ollama",
"model": "starcoder2:3b",
"apiBase": "http://localhost:11434"
},
"requestOptions": {
"timeout": 60
},
"contextProviders": [
{ "name": "file" },
{ "name": "codebase" },
{ "name": "terminal" }
]
}
The apiBase value http://localhost:11434 uses plain HTTP, which is safe when Ollama runs on your local machine. If you ever change apiBase to point at a remote host, switch to https:// to prevent code and prompts from traveling unencrypted over the network.
Be aware that the terminal context provider grants the model read access to terminal output. If your terminal session contains echoed secrets (API keys, tokens, passwords), those values will be included in prompts sent to the model. While Ollama processes these locally, consider removing terminal from this list if your workflow involves sensitive terminal output, or treat it as opt-in for specific sessions.
Place the timeout value (in seconds) in the top-level requestOptions block; Continue.dev v0.8+ reads it there. This prevents the extension from hanging indefinitely when a model takes too long or becomes unresponsive. Setting temperature to 0.2 in the model-level parameters block produces more focused, consistent completions by reducing randomness. For fully deterministic output, use temperature: 0.
The parameters block key name (parameters vs options) varies by Continue.dev version. Verify against your installed version's schema at the Continue.dev configuration docs. Placing temperature or timeout in the wrong block causes them to be silently ignored with no error message.
The contextProviders array enables referencing open files, the broader codebase, and terminal output in chat prompts using @file, @codebase, and @terminal respectively. Adding multiple model entries lets you switch between them in the Continue sidebar dropdown during a session.
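Because misplaced keys are silently ignored rather than reported, a small sanity-check script can save debugging time. This sketch (the function name is ours) flags two pitfalls covered in this guide: a trailing slash in apiBase and a timeout placed at the model level instead of in requestOptions:

```python
import json

def check_continue_config(raw: str) -> list:
    """Return a list of warnings for common config.json pitfalls.

    Checks the pitfalls described in this guide: a trailing slash in
    apiBase and a timeout key placed on a model entry are both silently
    ignored by Continue.dev rather than raised as errors.
    """
    cfg = json.loads(raw)
    warnings = []
    entries = list(cfg.get("models", []))
    if "tabAutocompleteModel" in cfg:
        entries.append(cfg["tabAutocompleteModel"])
    for entry in entries:
        title = entry.get("title", "?")
        if entry.get("apiBase", "").endswith("/"):
            warnings.append(f"{title}: apiBase has a trailing slash")
        if "timeout" in entry:
            warnings.append(f"{title}: timeout belongs in top-level requestOptions")
    return warnings

bad = json.dumps({"models": [
    {"title": "CodeLlama 7B", "apiBase": "http://localhost:11434/", "timeout": 60}
]})
print(check_continue_config(bad))  # two warnings
```

Running a check like this after every config edit is faster than discovering mid-session that a setting never took effect.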
Using Your Local Code Assistant Day to Day
Tab Autocomplete in Action
With Continue.dev configured and Ollama running, inline completions appear as ghost text while typing, similar to Copilot. Press Tab to accept a suggestion, Escape to dismiss it, or keep typing to refine.
Here is a typical flow. A developer types a function signature in a TypeScript file:
// Before: developer types this
function calculateDiscount(price: number, discountPercent: number): number {
The local model completes the body:
// After: StarCoder2 3B generates the completion
function calculateDiscount(price: number, discountPercent: number): number {
if (discountPercent < 0 || discountPercent > 100) {
throw new Error("Discount percent must be between 0 and 100");
}
return price - (price * discountPercent / 100);
}
Completion speed depends heavily on hardware. On a machine with an NVIDIA RTX 3060 (12 GB VRAM), StarCoder2 3B typically produces completions in under 500ms. Larger models on CPU-only inference may take 2-5 seconds, which is noticeable but still usable for block completions.
Chat-Based Code Generation and Refactoring
The Continue sidebar (Ctrl+L / Cmd+L) opens a chat interface connected to CodeLlama 7B. Select code in the editor and press Ctrl+L to send it to chat with a question.
For inline editing, select code and use Ctrl+I / Cmd+I to open an inline edit prompt (shortcut may vary by Continue.dev version). Here is an example of fixing a bug:
# Original function with an off-by-one error
def find_pairs(nums, target):
results = []
for i in range(len(nums)):
for j in range(i, len(nums)): # Bug: should start at i+1
if nums[i] + nums[j] == target:
results.append((nums[i], nums[j]))
return results
After highlighting and asking Continue to "fix the bug in this function," the model returns:
def find_pairs(nums, target):
"""Find all unique pairs in nums that sum to target.
Note: For inputs with duplicate values, duplicate pairs may appear
in the results. Input order determines pair order — (a, b) where a
appears before b in the list.
"""
results = []
for i in range(len(nums)):
for j in range(i + 1, len(nums)): # Fixed: start at i+1 to avoid pairing with self
if nums[i] + nums[j] == target:
results.append((nums[i], nums[j]))
return results
The @file and @codebase context commands are particularly useful. Typing @file utils.py in a chat prompt gives the model visibility into that file's contents for cross-file reasoning (syntax may vary by Continue.dev version; consult the context providers documentation).
Generating Documentation and Tests
Local models handle structured generation tasks like docstrings and unit tests reliably for straightforward functions. Select a function and prompt: "Write pytest tests for the selected function."
# Source function
def celsius_to_fahrenheit(celsius: float) -> float:
return celsius * 9 / 5 + 32
The model produces test cases covering standard boundaries:
# Generated pytest tests from CodeLlama 7B
# Requires: pip install pytest
# Replace 'your_module' with the actual module name containing celsius_to_fahrenheit
import pytest
from your_module import celsius_to_fahrenheit
def test_freezing_point():
assert celsius_to_fahrenheit(0) == pytest.approx(32.0)
def test_boiling_point():
assert celsius_to_fahrenheit(100) == pytest.approx(212.0)
def test_negative_temperature():
assert celsius_to_fahrenheit(-40) == pytest.approx(-40.0)
def test_body_temperature():
assert celsius_to_fahrenheit(37) == pytest.approx(98.6, rel=1e-3)
All float assertions use pytest.approx to avoid fragile exact-equality comparisons on floating-point results. For complex business logic with many edge cases, treat the model's output as a starting scaffold rather than a complete suite.
CodeLlama vs. StarCoder: Choosing the Right Model
| Criteria | CodeLlama 7B | StarCoder2 3B | StarCoder2 7B |
|---|---|---|---|
| Min VRAM | 6 GB | 3 GB | 6 GB |
| Best For | Python, C++ focused work | Fast multi-language autocomplete | Multi-language autocomplete + chat |
| FIM Support | Yes | Yes | Yes |
| License | Llama 2 Community License (commercial restrictions apply; review Meta's terms before enterprise use) | BigCode OpenRAIL-M (conditional open license; review permitted use cases before deployment) | BigCode OpenRAIL-M (conditional open license; review permitted use cases before deployment) |
| Training Languages | Focus on Python, C/C++, Java, others | 600+ languages | 600+ languages |
FIM (Fill-in-the-Middle) is a training technique enabling the model to complete code given both preceding and following context, improving inline autocomplete quality.
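Under the hood, a FIM prompt wraps the code before and after the cursor in sentinel tokens and asks the model to generate the missing middle. Continue.dev assembles this automatically; the sketch below only illustrates the idea, using the sentinel token names published for the StarCoder family (verify them against the model card of the exact variant you pull):

```python
def starcoder_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt in the StarCoder token format.

    The model generates the missing middle after the <fim_middle> token,
    conditioned on both the code before the cursor (prefix) and after it
    (suffix). Token names follow the StarCoder family's published format.
    """
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# Cursor sits between "return " and the call site below it
prompt = starcoder_fim_prompt(
    "def add(a, b):\n    return ",
    "\n\nprint(add(2, 3))",
)
```

Seeing the suffix is what lets a FIM-trained model complete a function body that must fit code already written below the cursor.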
CodeLlama descends from Meta's Llama 2 and was further trained on code-heavy datasets. Its instruction-tuned variants handle conversational code tasks (explaining, refactoring, debugging) competently for single-function scope, making it a strong chat model. Its Python-specialized variant excels in Python-heavy workflows. CodeLlama is licensed under the Llama 2 Community License, which restricts use by organizations exceeding 700 million monthly active users. All users should review the full license at Meta's repository before commercial deployment.
StarCoder2, developed by the BigCode project, was trained on The Stack v2, covering over 600 programming languages. The 3B variant is the standout choice for tab autocomplete due to its small footprint and fast inference. For teams working across JavaScript, TypeScript, Rust, Go, and other languages beyond Python, StarCoder2 covers more ground. StarCoder2 is released under the BigCode OpenRAIL-M license, which permits broad use but includes specific behavioral restrictions. Review the full license at bigcode-project.org before deployment.
Use StarCoder2 3B for autocomplete and CodeLlama 7B for chat. On machines with less than 6 GB VRAM, use StarCoder2 3B for both.
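That recommendation condenses into a tiny heuristic (the function name is ours; the 6 GB threshold follows the minimum-VRAM figures in the table above):

```python
def pick_models(vram_gb: float) -> dict:
    """Model selection heuristic from the guidance above: below 6 GB of
    VRAM, use StarCoder2 3B for everything; at 6 GB or more, split
    autocomplete and chat between the small and large models."""
    if vram_gb < 6:
        return {"autocomplete": "starcoder2:3b", "chat": "starcoder2:3b"}
    return {"autocomplete": "starcoder2:3b", "chat": "codellama:7b-code"}

print(pick_models(4))  # laptop with integrated graphics
print(pick_models(12))  # mid-range discrete GPU
```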
Performance Tuning and Troubleshooting
Speed Up Inference
Check whether Ollama is using your GPU:
ollama ps
This shows running models and how many layers are GPU-offloaded. Output format may vary by Ollama version. Look for a PROCESSOR column indicating gpu, cpu, or a split (e.g., 100% GPU). If all layers show CPU, GPU acceleration is not active. Verify that the appropriate GPU drivers (CUDA, ROCm, or Metal) are installed.
Create a custom Modelfile to control inference parameters:
FROM codellama:7b-code
# num_ctx: reduces context window from model default (4096 for codellama:7b-code)
# to 2048 to save VRAM. Remove or increase if full context is needed.
PARAMETER num_ctx 2048
# num_gpu 99: Ollama convention meaning "offload all layers to GPU".
# Not a literal layer count; will not error if model has fewer layers.
PARAMETER num_gpu 99
Build it with:
ollama create codellama-fast -f Modelfile
Setting num_ctx to 2048 reduces the context window from the model's default (4096 for codellama:7b-code; verify with ollama show <model>), reducing memory usage and speeding up inference. Setting num_gpu to 99 is a conventional way to tell Ollama to offload as many transformer layers to the GPU as available VRAM allows. It is not a literal count and will not error if your model has fewer layers.
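Most of the VRAM saved by lowering num_ctx comes from the KV cache, which grows linearly with context length. A rough estimate using the published Llama-2 7B architecture (the base of codellama:7b-code) and an fp16 cache; actual usage depends on the runtime's cache layout and any KV quantization:

```python
def kv_cache_gb(ctx_len: int, n_layers: int = 32, n_heads: int = 32,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """Estimate KV-cache memory for a given context length.

    Factor of 2 covers keys and values; defaults match the published
    Llama-2 7B architecture with an fp16 cache. Real usage varies by
    runtime.
    """
    bytes_total = 2 * n_layers * n_heads * head_dim * ctx_len * bytes_per_val
    return bytes_total / 1e9

print(round(kv_cache_gb(2048), 2))  # ~1.07 GB at num_ctx 2048
print(round(kv_cache_gb(4096), 2))  # ~2.15 GB at the default 4096
```

By this estimate, halving the context window from 4096 to 2048 frees roughly 1 GB of VRAM, which can be the difference between full and partial GPU offload on an 8 GB card.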
Common Issues and Fixes
If a model stops responding, verify Ollama is running:
curl --max-time 5 --connect-timeout 3 http://localhost:11434/api/tags
This should return a JSON list of available models. The --max-time 5 flag ensures the command times out after 5 seconds rather than hanging indefinitely if the server is unresponsive. If the connection is refused, start Ollama with ollama serve. If port 11434 is already in use by another process, ollama serve will fail. Check with lsof -i :11434 on macOS/Linux or netstat -ano | findstr :11434 on Windows.
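The same health check can be scripted, for example from an editor task or a startup hook. A minimal Python equivalent of the curl command above (/api/tags is Ollama's standard model-listing route; the function name is ours):

```python
import urllib.request

def ollama_is_up(base_url: str = "http://localhost:11434",
                 timeout: float = 3.0) -> bool:
    """Return True if the Ollama server answers /api/tags within `timeout`
    seconds; False on connection refusal or timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags",
                                    timeout=timeout) as resp:
            # URLError and socket timeouts both subclass OSError
            return resp.status == 200
    except OSError:
        return False
```

Like the curl flags, the explicit timeout ensures the check fails fast instead of hanging on an unresponsive server.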
Slow completions usually mean the model is too large for available VRAM, causing partial or full CPU inference. Switch to a smaller variant. In our testing, StarCoder2 3B on an RTX 3060 produced completions roughly 4x faster than CodeLlama 7B running on CPU-only inference on the same machine.
If Continue.dev does not detect Ollama, double-check the apiBase value in your Continue.dev configuration file. It must be http://localhost:11434 with no trailing slash.
How It Compares to GitHub Copilot
| Aspect | Local (Continue + Ollama) | GitHub Copilot |
|---|---|---|
| Privacy | Complete: no data leaves machine | Code sent to Microsoft/OpenAI servers |
| Cost | Free | $10-19/month per user |
| Offline Use | Full functionality | Requires internet |
| Accuracy | Reliable for boilerplate and single-function completions; weaker on multi-file reasoning | Stronger on multi-step generation and unfamiliar libraries (based on community benchmarks and reported HumanEval-style evaluations) |
| Completion Speed | Hardware-dependent (e.g., under 500ms for StarCoder2 3B on RTX 3060) | Usually 200-800ms per completion based on community reports; varies by network and prompt length; no published SLA |
| Multi-file Context | Supported via context providers | Native deep context |
| Customization | Full control over models, parameters | Limited configuration |
Copilot still outperforms local models on multi-step reasoning, multi-file context, and niche or recently released libraries, where its far larger training set gives it a clear edge. Its underlying model (a large-scale proprietary model; architecture not publicly disclosed by Microsoft) has been trained on far more data. But for boilerplate generation, single-function completions, and standard patterns, a local stack performs comparably while keeping every byte of your source code on your own hardware.
Wrapping Up
The configuration steps take roughly 15 minutes, plus time to download the models (approximately 5-6 GB total, depending on connection speed). Install Ollama, pull two models, install Continue.dev, and drop in a config file. The Continue.dev documentation and Ollama model library are the best starting points for tracking new models and configuration options as they ship.