Running Local AI Agents: A Guide to Ollama and Open WebUI

The AI revolution has brought powerful language models to everyone's fingertips, but at a cost: your data travels to cloud servers, and API usage can quickly become expensive. What if you could run these models locally, on your own hardware, with complete privacy and no per-request fees?
Enter local LLMs—large language models that run entirely on your machine. With tools like Ollama and Open WebUI, setting up your own private AI assistant is now accessible to anyone with a modern computer.
Why Run AI Locally?
Before diving into the setup, let's understand why local AI is becoming increasingly attractive:
Privacy and Data Control
When you use cloud-based AI services like ChatGPT or Claude, your conversations are processed on remote servers. Even with privacy policies, you're trusting a third party with potentially sensitive information. Running models locally means your data never leaves your machine.
Cost Savings
Cloud AI services charge per token or request. For developers building AI-powered applications or users with high usage, these costs can add up quickly. Once you've invested in hardware, local models have no ongoing API fees.
Customization and Experimentation
Local models give you complete control. You can fine-tune models, experiment with different architectures, and integrate them into your homelab workflows without external dependencies.
Offline Capability
Local AI works without an internet connection, making it perfect for air-gapped environments or situations where connectivity is unreliable.
Understanding Ollama
Ollama is a tool that simplifies running large language models locally. It handles model downloads, optimization, and provides a simple API for interacting with models. Think of it as Docker for LLMs—it abstracts away the complexity of model management.
Key Features
- Easy Model Management: Download and run models with a single command
- Optimized Performance: Automatically optimizes models for your hardware
- Simple API: RESTful API for easy integration
- Cross-Platform: Works on Windows, macOS, and Linux
- GPU Support: Leverages GPU acceleration when available
Installing Ollama
Windows Installation
- Download the installer from ollama.com
- Run the installer and follow the setup wizard
- Ollama will start automatically as a background service
Linux Installation
```bash
# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Or using Homebrew on Linux
brew install ollama
```
macOS Installation
```bash
# Using Homebrew
brew install ollama

# Or download the installer from ollama.com
```
After installation, verify Ollama is running:
```bash
ollama --version
```
Your First Local Model
Ollama makes downloading and running models incredibly simple. Let's start with a popular, efficient model:
```bash
# Download and run Llama 3.2 (3B parameters - great for testing)
ollama pull llama3.2

# Start chatting with the model
ollama run llama3.2
```
You'll be dropped into an interactive chat session. Try asking it questions or giving it tasks. When you're done, type /bye to exit.
Popular Models to Try
For General Use:
- llama3.2 - Fast, efficient 3B model
- qwen2.5:7b - Excellent multilingual support
- mistral - Strong reasoning capabilities
For Coding:
- deepseek-coder - Specialized for code generation
- codellama - Meta's coding-focused model
For Advanced Tasks:
- llama3.1:70b - More capable but requires significant RAM/VRAM
- qwen2.5:72b - High-performance model
For Reasoning:
- deepseek-r1 - Advanced reasoning capabilities
To see all available models, visit the Ollama library.
Hardware Requirements
The hardware you need depends on the model size you want to run:
Minimum Requirements (Small Models)
- CPU: Modern multi-core processor
- RAM: 8GB (16GB recommended)
- Storage: 10GB free space
- GPU: Optional but recommended
Models like llama3.2 (3B) can run reasonably well on CPU-only systems with 8GB RAM, though responses will be slower.
Recommended Setup (Medium Models)
- CPU: Modern 6+ core processor
- RAM: 16GB (32GB for larger models)
- GPU: NVIDIA GPU with 8GB+ VRAM (or AMD equivalent)
- Storage: 50GB+ free space
With a dedicated GPU, models like qwen2.5:7b or mistral:7b run smoothly and respond quickly.
High-End Setup (Large Models)
- CPU: High-end processor
- RAM: 32GB+ system RAM
- GPU: NVIDIA GPU with 16GB+ VRAM (RTX 3090, RTX 4090, or better)
- Storage: 100GB+ free space
Large models (70B+) require substantial hardware, but they deliver the highest output quality.
VRAM vs System RAM
- VRAM (GPU Memory): Much faster, ideal for model weights. Models that fit entirely in VRAM run fastest.
- System RAM: Slower but more abundant. Models that don't fit in VRAM will use system RAM, which is slower but still functional.
Ollama automatically manages this, using GPU when available and falling back to CPU/RAM when needed.
Setting Up Open WebUI
While Ollama's command-line interface works, Open WebUI provides a beautiful, ChatGPT-like web interface for your local models. It's open-source, feature-rich, and easy to install.
Installation with Docker
The easiest way to run Open WebUI is with Docker:
```bash
# Create a directory for Open WebUI data
mkdir -p ~/open-webui
cd ~/open-webui

# Run Open WebUI (connects to local Ollama automatically)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
Now visit http://localhost:3000 in your browser. You'll see a setup screen where you can create your first admin account.
Manual Installation (Linux/macOS)
```bash
# Install via pip (Python 3.11 is the recommended runtime)
pip install open-webui

# Start the server (listens on port 8080 by default)
open-webui serve
```
Configuration
Open WebUI automatically detects your local Ollama instance. If Ollama is running on a different machine or port, you can configure it in the settings:
- Click your profile icon → Settings
- Navigate to "Connection" settings
- Set the Ollama Base URL (default: http://localhost:11434); a quick connectivity check is sketched below
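If you are not sure whether Open WebUI can actually reach Ollama, you can test the endpoint directly by querying Ollama's /api/tags route, which lists the models it has pulled. A minimal sketch, assuming Ollama is on its default address (change the base URL if you customized it):

```python
import requests

# Assumes Ollama's default address; change this if you customized the Base URL
OLLAMA_BASE_URL = "http://localhost:11434"

def check_ollama(base_url=OLLAMA_BASE_URL):
    """Return the list of locally available model names, or raise if unreachable."""
    response = requests.get(f"{base_url}/api/tags", timeout=5)
    response.raise_for_status()
    models = response.json().get("models", [])
    return [m["name"] for m in models]

if __name__ == "__main__":
    try:
        print("Ollama is reachable. Models:", check_ollama())
    except requests.RequestException as exc:
        print("Could not reach Ollama:", exc)
```

If this succeeds on the host but the UI still cannot connect, the issue is usually the Base URL as seen from inside the container: when Open WebUI runs in Docker and Ollama runs on the host, http://host.docker.internal:11434 is typically the address to use.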
Using Open WebUI
Open WebUI provides a polished interface similar to ChatGPT:
Basic Chat
- Select a model from the dropdown (top of the chat interface)
- Start typing your message
- The model responds in real-time
Advanced Features
Model Management:
- View all available models
- Download new models directly from the UI
- Switch between models mid-conversation
Chat Features:
- Multiple conversation threads
- Export conversations (Markdown, PDF, etc.)
- Share conversations with others
- Code syntax highlighting
Customization:
- Custom system prompts
- Adjustable temperature and other parameters (see the API sketch after this list)
- Custom instructions per model
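The system prompt and sampling settings you adjust in Open WebUI correspond to fields that Ollama's generate API also accepts, so a customized "persona" can be reproduced from a script. A minimal sketch using Ollama's documented system and options fields (the prompt text itself is just an example):

```python
import requests

# Reproduce a custom system prompt and temperature outside the UI
payload = {
    "model": "llama3.2",
    "system": "You are a terse assistant that answers in bullet points.",
    "prompt": "Summarize what a reverse proxy does.",
    "options": {"temperature": 0.2},  # lower temperature = more deterministic output
    "stream": False,
}

response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(response.json()["response"])
```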
API Integration
Ollama provides a REST API that makes it easy to integrate local AI into your applications:
Basic API Usage
```bash
# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in simple terms",
  "stream": false
}'
```
Python Integration
```python
import requests

def ask_ollama(prompt, model="llama3.2"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Example usage
answer = ask_ollama("What is Docker?")
print(answer)
```
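The example above sets stream to False and waits for the full answer. Ollama streams by default, returning one JSON object per line as tokens are generated, which is handy for showing output progressively. A sketch of consuming that stream:

```python
import json
import requests

def stream_ollama(prompt, model="llama3.2"):
    """Yield response chunks as Ollama generates them."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt}  # streaming is the default
    with requests.post(url, json=data, stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk.get("response", "")

# Example usage: print tokens as they arrive
for piece in stream_ollama("Explain what a container image is"):
    print(piece, end="", flush=True)
print()
```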
JavaScript/Node.js Integration
```javascript
async function askOllama(prompt, model = "llama3.2") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model,
      prompt: prompt,
      stream: false,
    }),
  });
  const data = await response.json();
  return data.response;
}

// Example usage
const answer = await askOllama("Explain REST APIs");
console.log(answer);
```
Performance Optimization
GPU Acceleration
Ollama automatically uses GPU when available. To verify GPU usage:
```bash
# Check if GPU is being used
ollama ps
```
For NVIDIA GPUs, ensure you have the latest drivers and CUDA installed. Ollama will automatically detect and use CUDA.
Model Quantization
Many models are available in quantized formats (reduced precision) that use less memory:
- llama3.2:3b - Default tag (on the Ollama library this is usually already a 4-bit quantized build)
- llama3.2:3b-q4_0 - Explicit 4-bit quantization variant (smaller, faster)
Quantized models trade some accuracy for significantly reduced memory usage and faster inference.
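A useful rule of thumb for whether a model will fit in VRAM: multiply the parameter count by the bytes per weight for the quantization level, then add roughly 20-30% for the KV cache and runtime overhead (the exact overhead depends on context length). The sketch below uses approximate figures, not exact file sizes:

```python
# Rough memory estimates for common quantization levels (approximate, not exact file sizes)
BYTES_PER_PARAM = {
    "fp16": 2.0,   # 16-bit weights
    "q8_0": 1.0,   # ~8 bits per weight
    "q4_0": 0.55,  # ~4.5 bits per weight including block metadata
}

def estimated_gb(params_billions, quant="q4_0", overhead=1.25):
    """Approximate memory needed to run a model, in GB."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb * overhead

for quant in ("fp16", "q8_0", "q4_0"):
    print(f"7B model at {quant}: ~{estimated_gb(7, quant):.1f} GB")
# fp16 needs roughly 17.5 GB (a large GPU), while q4_0 at roughly 4.8 GB fits in 8 GB of VRAM
```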
System Optimization
For CPU-only systems:
- Close unnecessary applications to free RAM
- Use smaller models (3B-7B parameters)
- Consider quantized models
For GPU systems:
- Ensure models fit in VRAM for best performance
- Use ollama ps to monitor resource usage
- Consider running multiple smaller models instead of one large model
Common Use Cases
1. Code Assistant
```bash
# Pull a coding-focused model
ollama pull deepseek-coder

# Use it for code generation and debugging
ollama run deepseek-coder "Write a Python function to sort a list of dictionaries by a key"
```
2. Documentation Generation
Local models excel at generating documentation, writing README files, and explaining code.
3. Content Creation
Use local models for drafting blog posts, emails, or creative writing without sending your drafts to cloud services.
4. Homelab Automation
Integrate Ollama into your homelab workflows using the API. For example, create scripts that use AI to analyze logs, generate reports, or assist with system administration tasks.
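As a concrete sketch of the log-analysis idea: the script below sends the tail of a log file to a local model and asks for a summary of anything that looks like an error. The log path and prompt wording are hypothetical placeholders; adjust them for your setup.

```python
import requests

LOG_PATH = "/var/log/syslog"          # hypothetical path; point this at your own log
OLLAMA_URL = "http://localhost:11434/api/generate"

def summarize_log(path=LOG_PATH, model="llama3.2", max_lines=200):
    """Ask a local model to summarize recent errors in a log file."""
    with open(path, "r", errors="replace") as f:
        tail = "".join(f.readlines()[-max_lines:])
    prompt = (
        "You are helping with system administration. "
        "Summarize any errors or warnings in the following log excerpt, "
        "and suggest one next step for each:\n\n" + tail
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    return response.json()["response"]

if __name__ == "__main__":
    print(summarize_log())
```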
5. Learning and Experimentation
Local models are perfect for learning about AI, experimenting with prompts, and understanding how language models work without worrying about API costs.
Troubleshooting
Model Won't Download
```bash
# Check your internet connection
# Verify you have enough disk space
# Try downloading a smaller model first
ollama pull llama3.2
```
Slow Performance
- Check if GPU is being used: Run ollama ps during inference
- Use a smaller model: Try llama3.2 instead of larger models
- Check system resources: Ensure you have enough RAM/VRAM
- Try quantized models: They're faster and use less memory
Out of Memory Errors
- Use smaller models
- Close other applications
- Consider quantized model variants
- Upgrade your hardware if consistently hitting limits
Open WebUI Can't Connect to Ollama
- Verify Ollama is running: ollama list
- Check the connection URL in Open WebUI settings
- Ensure both are on the same network (if using Docker)
- Check firewall settings
Security Considerations
While local AI is more private than cloud services, consider these security practices:
- Network Exposure: Ollama has no built-in authentication, so if you expose it beyond localhost, put it behind a reverse proxy, VPN, or firewall rules that control access
- Model Sources: Only download models from trusted sources (Ollama's official library)
- System Access: Limit who can access your Ollama instance
- Updates: Keep Ollama and Open WebUI updated for security patches
Next Steps
Now that you have Ollama and Open WebUI running, consider:
- Experiment with Different Models: Try various models to find what works best for your use cases
- Integrate with Your Homelab: Use the API to build AI-powered automations
- Fine-Tuning: Explore fine-tuning models on your specific data (advanced)
- Multi-Model Setup: Run multiple specialized models for different tasks
- Explore AI Agents: Build agents that use tools and interact with other systems
Conclusion
Running AI locally with Ollama and Open WebUI gives you the power of modern language models with complete privacy and no ongoing costs. Whether you're a developer building AI applications, a homelab enthusiast exploring new technologies, or someone who values data privacy, local LLMs offer a compelling alternative to cloud-based services.
The barrier to entry is lower than ever—if you have a modern computer, you can start experimenting with local AI today. As hardware becomes more powerful and models become more efficient, running sophisticated AI locally will only become more accessible.
For more advanced topics like building voice assistants or integrating local LLMs into complex workflows, check out resources on building fully local LLM architectures and local LLM best practices.



