Running Local AI Agents: A Guide to Ollama and Open WebUI

The AI revolution has brought powerful language models to everyone's fingertips, but at a cost: your data travels to cloud servers, and API usage can quickly become expensive. What if you could run these models locally, on your own hardware, with complete privacy and no per-request fees?
Enter local LLMs—large language models that run entirely on your machine. With tools like Ollama and Open WebUI, setting up your own private AI assistant is now accessible to anyone with a modern computer.
Why Run AI Locally?
Before diving into the setup, let's understand why local AI is becoming increasingly attractive:
Privacy and Data Control
When you use cloud-based AI services like ChatGPT or Claude, your conversations are processed on remote servers. Even with privacy policies, you're trusting a third party with potentially sensitive information. Running models locally means your data never leaves your machine.
Cost Savings
Cloud AI services charge per token or request. For developers building AI-powered applications or users with high usage, these costs can add up quickly. Once you've invested in hardware, local models have no ongoing API fees.
Customization and Experimentation
Local models give you complete control. You can fine-tune models, experiment with different architectures, and integrate them into your homelab workflows without external dependencies.
Offline Capability
Local AI works without an internet connection, making it perfect for air-gapped environments or situations where connectivity is unreliable.
Understanding Ollama
Ollama is a tool that simplifies running large language models locally. It handles model downloads, optimization, and provides a simple API for interacting with models. Think of it as Docker for LLMs—it abstracts away the complexity of model management.
Key Features
- Easy Model Management: Download and run models with a single command
- Optimized Performance: Automatically optimizes models for your hardware
- Simple API: RESTful API for easy integration
- Cross-Platform: Works on Windows, macOS, and Linux
- GPU Support: Leverages GPU acceleration when available
Installing Ollama
Windows Installation
- Download the installer from ollama.com
- Run the installer and follow the setup wizard
- Ollama will start automatically as a background service
Linux Installation
```bash
# Install using the official script
curl -fsSL https://ollama.com/install.sh | sh

# Or using Homebrew on Linux
brew install ollama
```
macOS Installation
```bash
# Using Homebrew
brew install ollama

# Or download the installer from ollama.com
```
After installation, verify Ollama is running:
```bash
ollama --version
```
Your First Local Model
Ollama makes downloading and running models incredibly simple. Let's start with a popular, efficient model:
```bash
# Download and run Llama 3.2 (3B parameters - great for testing)
ollama pull llama3.2

# Start chatting with the model
ollama run llama3.2
```
You'll be dropped into an interactive chat session. Try asking it questions or giving it tasks. When you're done, type /bye to exit.
Popular Models to Try
For General Use:
- llama3.2 - Fast, efficient 3B model
- qwen2.5:7b - Excellent multilingual support
- mistral - Strong reasoning capabilities
For Coding:
- deepseek-coder - Specialized for code generation
- codellama - Meta's coding-focused model
For Advanced Tasks:
- llama3.1:70b - More capable but requires significant RAM/VRAM
- qwen2.5:72b - High-performance model
For Reasoning:
- deepseek-r1 - Advanced reasoning capabilities
To see all available models, visit the Ollama library.
Hardware Requirements
The hardware you need depends on the model size you want to run:
Minimum Requirements (Small Models)
- CPU: Modern multi-core processor
- RAM: 8GB (16GB recommended)
- Storage: 10GB free space
- GPU: Optional but recommended
Models like llama3.2 (3B) can run reasonably well on CPU-only systems with 8GB RAM, though responses will be slower.
Recommended Setup (Medium Models)
- CPU: Modern 6+ core processor
- RAM: 16GB (32GB for larger models)
- GPU: NVIDIA GPU with 8GB+ VRAM (or AMD equivalent)
- Storage: 50GB+ free space
With a dedicated GPU, models like qwen2.5:7b or mistral:7b run smoothly and respond quickly.
High-End Setup (Large Models)
- CPU: High-end processor
- RAM: 32GB+ system RAM
- GPU: NVIDIA GPU with 16GB+ VRAM (RTX 3090, RTX 4090, or better)
- Storage: 100GB+ free space
Large models (70B+) require substantial hardware, but they deliver the highest output quality.
VRAM vs System RAM
- VRAM (GPU Memory): Much faster, ideal for model weights. Models that fit entirely in VRAM run fastest.
- System RAM: Slower but more abundant. Models that don't fit in VRAM will use system RAM, which is slower but still functional.
Ollama automatically manages this, using GPU when available and falling back to CPU/RAM when needed.
Setting Up Open WebUI
While Ollama's command-line interface works, Open WebUI provides a beautiful, ChatGPT-like web interface for your local models. It's open-source, feature-rich, and easy to install.
Installation with Docker
The easiest way to run Open WebUI is with Docker:
```bash
# Create a directory for Open WebUI data
mkdir -p ~/open-webui
cd ~/open-webui

# Run Open WebUI (connects to local Ollama automatically)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```
Now visit http://localhost:3000 in your browser. You'll see a setup screen where you can create your first admin account.
Manual Installation (Linux/macOS)
```bash
# Install via pip (Python 3.11 is the recommended runtime)
pip install open-webui

# Start the server (listens on port 8080 by default)
open-webui serve
```
Configuration
Open WebUI automatically detects your local Ollama instance. If Ollama is running on a different machine or port, you can configure it in the settings:
- Click your profile icon → Settings
- Navigate to "Connection" settings
- Set the Ollama Base URL (default: http://localhost:11434); a quick connectivity check is sketched below
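If you are not sure whether Open WebUI can actually reach Ollama, you can test the endpoint directly by querying Ollama's /api/tags route, which lists the models it has pulled. A minimal sketch, assuming Ollama is on its default address (change the base URL if you customized it):

```python
import requests

# Assumes Ollama's default address; change this if you customized the Base URL
OLLAMA_BASE_URL = "http://localhost:11434"

def check_ollama(base_url=OLLAMA_BASE_URL):
    """Return the list of locally available model names, or raise if unreachable."""
    response = requests.get(f"{base_url}/api/tags", timeout=5)
    response.raise_for_status()
    models = response.json().get("models", [])
    return [m["name"] for m in models]

if __name__ == "__main__":
    try:
        print("Ollama is reachable. Models:", check_ollama())
    except requests.RequestException as exc:
        print("Could not reach Ollama:", exc)
```

If this succeeds on the host but the UI still cannot connect, the issue is usually the Base URL as seen from inside the container: when Open WebUI runs in Docker and Ollama runs on the host, http://host.docker.internal:11434 is typically the address to use.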
Using Open WebUI
Open WebUI provides a polished interface similar to ChatGPT:
Basic Chat
- Select a model from the dropdown (top of the chat interface)
- Start typing your message
- The model responds in real-time
Advanced Features
Model Management:
- View all available models
- Download new models directly from the UI
- Switch between models mid-conversation
Chat Features:
- Multiple conversation threads
- Export conversations (Markdown, PDF, etc.)
- Share conversations with others
- Code syntax highlighting
Customization:
- Custom system prompts
- Adjustable temperature and other parameters (see the API sketch after this list)
- Custom instructions per model
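The system prompt and sampling settings you adjust in Open WebUI correspond to fields that Ollama's generate API also accepts, so a customized "persona" can be reproduced from a script. A minimal sketch using Ollama's documented system and options fields (the prompt text itself is just an example):

```python
import requests

# Reproduce a custom system prompt and temperature outside the UI
payload = {
    "model": "llama3.2",
    "system": "You are a terse assistant that answers in bullet points.",
    "prompt": "Summarize what a reverse proxy does.",
    "options": {"temperature": 0.2},  # lower temperature = more deterministic output
    "stream": False,
}

response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=120)
print(response.json()["response"])
```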
API Integration
Ollama provides a REST API that makes it easy to integrate local AI into your applications:
Basic API Usage
```bash
# Generate a completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Explain quantum computing in simple terms",
  "stream": false
}'
```
Python Integration
```python
import requests

def ask_ollama(prompt, model="llama3.2"):
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["response"]

# Example usage
answer = ask_ollama("What is Docker?")
print(answer)
```
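The example above sets stream to False and waits for the full answer. Ollama streams by default, returning one JSON object per line as tokens are generated, which is handy for showing output progressively. A sketch of consuming that stream:

```python
import json
import requests

def stream_ollama(prompt, model="llama3.2"):
    """Yield response chunks as Ollama generates them."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model, "prompt": prompt}  # streaming is the default
    with requests.post(url, json=data, stream=True, timeout=120) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            if chunk.get("done"):
                break
            yield chunk.get("response", "")

# Example usage: print tokens as they arrive
for piece in stream_ollama("Explain what a container image is"):
    print(piece, end="", flush=True)
print()
```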
JavaScript/Node.js Integration
```javascript
async function askOllama(prompt, model = "llama3.2") {
  const response = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: model,
      prompt: prompt,
      stream: false,
    }),
  });
  const data = await response.json();
  return data.response;
}

// Example usage
const answer = await askOllama("Explain REST APIs");
console.log(answer);
```
Performance Optimization
GPU Acceleration
Ollama automatically uses GPU when available. To verify GPU usage:
```bash
# Check if GPU is being used
ollama ps
```
For NVIDIA GPUs, ensure you have the latest drivers and CUDA installed. Ollama will automatically detect and use CUDA.
Model Quantization
Many models are available in quantized formats (reduced precision) that use less memory:
- llama3.2:3b - Default tag (on the Ollama library this is usually already a 4-bit quantized build)
- llama3.2:3b-q4_0 - Explicit 4-bit quantization variant (smaller, faster)
Quantized models trade some accuracy for significantly reduced memory usage and faster inference.
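A useful rule of thumb for whether a model will fit in VRAM: multiply the parameter count by the bytes per weight for the quantization level, then add roughly 20-30% for the KV cache and runtime overhead (the exact overhead depends on context length). The sketch below uses approximate figures, not exact file sizes:

```python
# Rough memory estimates for common quantization levels (approximate, not exact file sizes)
BYTES_PER_PARAM = {
    "fp16": 2.0,   # 16-bit weights
    "q8_0": 1.0,   # ~8 bits per weight
    "q4_0": 0.55,  # ~4.5 bits per weight including block metadata
}

def estimated_gb(params_billions, quant="q4_0", overhead=1.25):
    """Approximate memory needed to run a model, in GB."""
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return weights_gb * overhead

for quant in ("fp16", "q8_0", "q4_0"):
    print(f"7B model at {quant}: ~{estimated_gb(7, quant):.1f} GB")
# fp16 needs roughly 17.5 GB (a large GPU), while q4_0 at roughly 4.8 GB fits in 8 GB of VRAM
```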
System Optimization
For CPU-only systems:
- Close unnecessary applications to free RAM
- Use smaller models (3B-7B parameters)
- Consider quantized models
For GPU systems:
- Ensure models fit in VRAM for best performance
- Use ollama ps to monitor resource usage
- Consider running multiple smaller models instead of one large model
Common Use Cases
1. Code Assistant
```bash
# Pull a coding-focused model
ollama pull deepseek-coder

# Use it for code generation and debugging
ollama run deepseek-coder "Write a Python function to sort a list of dictionaries by a key"
```
2. Documentation Generation
Local models excel at generating documentation, writing README files, and explaining code.
3. Content Creation
Use local models for drafting blog posts, emails, or creative writing without sending your drafts to cloud services.
4. Homelab Automation
Integrate Ollama into your homelab workflows using the API. For example, create scripts that use AI to analyze logs, generate reports, or assist with system administration tasks.
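As a concrete sketch of the log-analysis idea: the script below sends the tail of a log file to a local model and asks for a summary of anything that looks like an error. The log path and prompt wording are hypothetical placeholders; adjust them for your setup.

```python
import requests

LOG_PATH = "/var/log/syslog"          # hypothetical path; point this at your own log
OLLAMA_URL = "http://localhost:11434/api/generate"

def summarize_log(path=LOG_PATH, model="llama3.2", max_lines=200):
    """Ask a local model to summarize recent errors in a log file."""
    with open(path, "r", errors="replace") as f:
        tail = "".join(f.readlines()[-max_lines:])
    prompt = (
        "You are helping with system administration. "
        "Summarize any errors or warnings in the following log excerpt, "
        "and suggest one next step for each:\n\n" + tail
    )
    response = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    return response.json()["response"]

if __name__ == "__main__":
    print(summarize_log())
```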
5. Learning and Experimentation
Local models are perfect for learning about AI, experimenting with prompts, and understanding how language models work without worrying about API costs.
Troubleshooting
Model Won't Download
```bash
# Check your internet connection
# Verify you have enough disk space
# Try downloading a smaller model first
ollama pull llama3.2
```
Slow Performance
- Check if GPU is being used: Run ollama ps during inference
- Use a smaller model: Try llama3.2 instead of larger models
- Check system resources: Ensure you have enough RAM/VRAM
- Try quantized models: They're faster and use less memory
Out of Memory Errors
- Use smaller models
- Close other applications
- Consider quantized model variants
- Upgrade your hardware if consistently hitting limits
Open WebUI Can't Connect to Ollama
- Verify Ollama is running: ollama list
- Check the connection URL in Open WebUI settings
- Ensure both are on the same network (if using Docker)
- Check firewall settings
Security Considerations
While local AI is more private than cloud services, consider these security practices:
- Network Exposure: Ollama has no built-in authentication, so if you expose it beyond localhost, put it behind a reverse proxy, VPN, or firewall rules that control access
- Model Sources: Only download models from trusted sources (Ollama's official library)
- System Access: Limit who can access your Ollama instance
- Updates: Keep Ollama and Open WebUI updated for security patches
Next Steps
Now that you have Ollama and Open WebUI running, consider:
- Experiment with Different Models: Try various models to find what works best for your use cases
- Integrate with Your Homelab: Use the API to build AI-powered automations
- Fine-Tuning: Explore fine-tuning models on your specific data (advanced)
- Multi-Model Setup: Run multiple specialized models for different tasks
- Explore AI Agents: Build agents that use tools and interact with other systems
Conclusion
Running AI locally with Ollama and Open WebUI gives you the power of modern language models with complete privacy and no ongoing costs. Whether you're a developer building AI applications, a homelab enthusiast exploring new technologies, or someone who values data privacy, local LLMs offer a compelling alternative to cloud-based services.
The barrier to entry is lower than ever—if you have a modern computer, you can start experimenting with local AI today. As hardware becomes more powerful and models become more efficient, running sophisticated AI locally will only become more accessible.
For more advanced topics like building voice assistants or integrating local LLMs into complex workflows, check out resources on building fully local LLM architectures and local LLM best practices.



