Skip to content
AI Lab Notes
Go back

Running Local LLMs on Linux: A Practical Guide

Running large language models on your own hardware gives you privacy, eliminates per-token API costs, and lets you experiment without rate limits. The tooling has matured significantly — you no longer need to compile CUDA kernels by hand or manage Python dependency hell. This guide covers the practical choices: which inference engines to use, which models to run, and how to match them to your GPU.

Table of contents

Open Table of contents

Why Run Local?

There are three main reasons to run LLMs locally rather than through an API:

  1. Privacy. Your prompts and data never leave your machine. This matters for proprietary code, personal documents, or any scenario where you do not want a third party processing your inputs.
  2. Cost. API calls add up quickly, especially for agentic workflows that make dozens of LLM calls per task. A local model running on your GPU has zero marginal cost per token.
  3. Latency and availability. No network round-trip, no rate limits, no outages. Local inference is especially fast for small models — you can get sub-second responses for simple tasks.

The tradeoff is capability. As of early 2026, the best local models (70B parameter class) are strong but still a step below frontier cloud models like Claude Opus or GPT-4o on complex reasoning tasks. The practical approach is to use local models for routine work and reserve API calls for tasks that demand the strongest reasoning.

The Two Inference Engines Worth Using

Ollama

Ollama is the fastest path from zero to running a local model. It is a single binary that handles model downloads, quantization, GPU offloading, and serving an OpenAI-compatible API.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama pull llama3.3
ollama run llama3.3

Ollama runs as a systemd service and listens on http://localhost:11434. Any tool that supports the OpenAI API format can connect to it.

Strengths:

Limitations:

Increasing the context window (important):

Ollama defaults to 4K context, which is too small for coding or document analysis. Create a Modelfile to override:

# Create a custom model with 32K context
cat > Modelfile <<EOF
FROM llama3.3
PARAMETER num_ctx 32768
EOF

ollama create llama3.3-32k -f Modelfile
ollama run llama3.3-32k

LM Studio

LM Studio is a desktop application with a GUI for downloading, managing, and running models. It also includes a CLI (lms) and an OpenAI-compatible API server.

# Start the API server via CLI
lms server start

# The server runs at http://localhost:1234

Strengths:

Limitations:

Which to Use?

Use both. They serve complementary roles:

Both serve OpenAI-compatible APIs, so any downstream tool works with either.

Model Selection by VRAM Budget

The most important constraint is your GPU’s VRAM. Models are distributed in quantized formats (Q4, Q5, Q6, Q8, FP16) that trade quality for memory usage. Here is what fits at each tier:

8 GB VRAM (RTX 3060, RTX 4060, RTX 3070)

At 8 GB, you are limited to small models or aggressive quantization of medium ones.

ModelParametersQuantizationVRAM UsageBest For
Llama 3.1 8B8BQ4_K_M~6 GBGeneral chat, simple coding
Phi 4 Mini3.8BQ8_0~5 GBLightweight tasks, fast responses
Gemma 3 4B4BQ6_K~4 GBCompact general-purpose
Nomic Embed Text137MFP16<1 GBEmbeddings for RAG

Practical note: With 8 GB, you can run one model at a time with room for the OS and other applications. Close GPU-intensive applications before inference.

12 GB VRAM (RTX 3060 12GB, RTX 4070)

The 12 GB tier opens up medium-sized models that are genuinely useful for coding and analysis.

ModelParametersQuantizationVRAM UsageBest For
Mistral Nemo 12B12BQ4_K_M~8 GBChat, instruction following
Codestral 22B22BQ4_K_M~14 GB*Code generation, completion
Gemma 3 12B12BQ6_K~10 GBGeneral purpose, multilingual

*Codestral at Q4 will partially offload to system RAM on a 12 GB card. It works but is slower.

16 GB VRAM (RTX 4080, RTX 5060 Ti, RTX A4000)

With 16 GB, you comfortably run 12-22B parameter models at higher quantization.

ModelParametersQuantizationVRAM UsageBest For
Codestral 22B22BQ4_K_M~14 GBCode generation (best-in-class local)
Mistral Nemo 12B12BQ6_K~11 GBHigh-quality chat
Llama 3.1 8B8BFP16~16 GBMaximum quality small model
Phi 414BQ4_K_M~10 GBReasoning, math

24 GB VRAM (RTX 3090, RTX 4090, RTX A5000)

This is the sweet spot for local LLM work. You can run 27B-70B parameter models that approach cloud model quality for many tasks.

ModelParametersQuantizationVRAM UsageBest For
Llama 3.3 70B70BQ4_K_M~40 GB*Best open general-purpose
Gemma 3 27B27BQ4_K_M~18 GBStrong all-rounder, multilingual
Codestral 22B22BQ6_K~18 GBCode at higher quality
GPT-OSS 20B20BQ6_K~17 GBOpenAI’s open model
Phi 414BQ8_0~16 GBReasoning at near-max quality

*70B models at Q4 need ~40 GB total and will spill into system RAM. Ollama and LM Studio handle this automatically (GPU + CPU split), but expect slower inference. For full GPU inference of 70B, you need 48+ GB VRAM (dual GPU or A6000).

Practical 24 GB setup:

Run Gemma 3 27B as your daily driver (fits entirely in VRAM, fast inference, strong quality). Keep Codestral 22B for coding tasks. Use Llama 3.3 70B when you need maximum local reasoning, accepting the speed penalty from RAM spill.

Essential Configuration

Flash Attention

Enable Flash Attention for faster inference and lower memory usage:

# For Ollama, set via environment variable
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_FLASH_ATTENTION=1"
sudo systemctl restart ollama

KV Cache Quantization

Reduce memory usage of the context window cache:

# In Ollama Modelfile
PARAMETER num_ctx 32768
# KV cache quantization (Q4_0 or Q8_0)
# Saves ~50% KV cache memory at minimal quality loss

Embedding Models for RAG

If you plan to build retrieval-augmented generation (RAG) pipelines, you need a separate embedding model. These are small and can run alongside your main LLM:

# Pull an embedding model
ollama pull nomic-embed-text

# Use via the API
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "Your text here"}'

Good embedding models for local use:

Adding a Chat UI

Running models from the command line gets old fast. Open WebUI gives you a ChatGPT-like interface that connects to Ollama or LM Studio:

# Run Open WebUI connecting to your local Ollama
docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Open WebUI supports conversations, document upload (RAG), image generation integration, multi-user accounts, and tool/function calling — all backed by your local models.

Model Selection Principles

A few rules of thumb that hold true across GPU tiers:

  1. Quantization matters less than model size. A 27B model at Q4 is almost always better than a 7B model at FP16. Prioritize larger models with lower quantization over smaller models with higher quantization.

  2. Q4_K_M is the sweet spot. This quantization level retains the vast majority of model quality while cutting memory usage roughly in half compared to FP16. Go lower (Q3, Q2) only if you absolutely must fit the model.

  3. Context length eats VRAM. Doubling the context window roughly doubles the KV cache memory. A model that fits at 4K context may not fit at 32K. Plan accordingly.

  4. Not all tasks need large models. Code completion, text formatting, and simple Q&A work well with 7-12B models. Save the 27B+ models for complex reasoning, analysis, and creative tasks.

  5. Test with your actual workloads. Benchmarks are helpful but your specific use case is what matters. Run your common prompts against a few candidate models and compare quality directly.

What Local Models Are Not Good At (Yet)

Be honest about limitations:

The practical approach: use local models for routine tasks and switch to cloud APIs when you hit a quality ceiling.

Next Steps

Once you have a model running:

  1. Connect it to your editor. Continue is a VS Code extension that works with both Ollama and LM Studio for code completion and chat.
  2. Try agentic coding. OpenCode and Aider are terminal-based coding agents that can use your local models.
  3. Build a RAG pipeline. Combine an embedding model with a vector database to query your own documents.
  4. Set up Open WebUI for a polished chat experience with document upload support.

Running LLMs locally is no longer a bleeding-edge exercise. The tooling works, the models are capable, and the barrier to entry is a decent GPU and a few terminal commands.


Share this post on:

Previous Post
Getting Started with Ghostty: A Modern GPU-Accelerated Terminal for Linux
Next Post
AI Coding Agents: Tools, Workflows, and What Actually Works