| Model | Daily | Monthly | Annual |
|---|---|---|---|
Ollama's hardware requirements are driven by one thing: the model's size on disk, which roughly equals the VRAM or RAM needed to run it. Ollama ships models quantized (compressed): the default Q4_K_M quantization shrinks a model to roughly 30% of its 16-bit size while keeping most of the quality.
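The size-to-memory relationship can be sketched as a back-of-the-envelope calculation: parameter count times bits per weight. The function name and the 4.5 bits-per-weight figure below are assumptions for illustration; the effective bits per weight of a given quantization varies by model, and this ignores context (KV cache) memory.

```python
def estimate_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# 16 bits per weight corresponds to unquantized FP16/BF16 weights.
# 4.5 is an assumed round figure for a 4-bit quantization, not an exact
# property of Q4_K_M.
for name, params in [("8B", 8), ("32B", 32), ("70B", 70)]:
    full = estimate_model_size_gb(params, 16)
    quantized = estimate_model_size_gb(params, 4.5)
    print(f"{name}: ~{full:.0f} GB at 16-bit, ~{quantized:.1f} GB at ~4.5 bits/weight")
```

Treat the result as a lower bound: longer context windows add memory on top of the weights.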
Three tiers:
Budget / Minimal ($300-500)
Run small models up to ~8B parameters. You need ~8 GB VRAM for comfortable single-model use.
Mid-Range ($800-1,500)
Run models up to ~32B parameters comfortably, or 70B+ with partial CPU offloading.
High-End ($2,500-5,000+)
Run 70B-120B+ models at full speed. Multi-GPU setups for the largest open models.
Raspberry Pi? Technically yes for tiny models (Gemma 3 4B quantized to Q2), but expect 1-3 tok/s — barely usable for chat. Not recommended unless you're just experimenting.
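The tiers above reduce to one comparison: does the model fit in VRAM, in VRAM plus system RAM, or not at all? A minimal sketch, assuming a hypothetical `placement` helper and ignoring the extra memory needed for context and the operating system:

```python
def placement(model_gb: float, vram_gb: float, ram_gb: float) -> str:
    """Classify how a model of a given size would run on a machine."""
    if model_gb <= vram_gb:
        return "full GPU"              # everything fits in VRAM
    if model_gb <= vram_gb + ram_gb:
        return "partial CPU offload"   # spills into system RAM, runs slower
    return "does not fit"


# Hypothetical machine: 24 GB of VRAM plus 64 GB of system RAM.
for size_gb in (5, 40, 595):
    print(f"{size_gb} GB model -> {placement(size_gb, vram_gb=24, ram_gb=64)}")
```

In practice you would leave a few GB of headroom on each side of the comparison, so models that land right on a boundary should be treated as belonging to the slower category.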
Key rules of thumb:
Model size on disk (shown by `ollama list`) ≈ VRAM needed for full GPU acceleration

A GPU under load draws 150-350W depending on the GPU model. With the rest of the system, figure 200-450W total while actively running inference. At idle or under light use, a modern GPU drops to ~10-25W.
Estimated monthly electricity cost (24/7 under load):
| Country | Rate (¢/kWh) | Budget GPU (~200W) | Mid-Range (~300W) | High-End (~450W) |
|---|---|---|---|---|
| Germany | ~40¢ | ~$58/mo | ~$87/mo | ~$130/mo |
| USA (avg) | ~16¢ | ~$23/mo | ~$35/mo | ~$52/mo |
| UK | ~29¢ | ~$42/mo | ~$63/mo | ~$94/mo |
| France | ~25¢ | ~$36/mo | ~$54/mo | ~$81/mo |
| Japan | ~26¢ | ~$38/mo | ~$56/mo | ~$85/mo |
| India | ~8¢ | ~$12/mo | ~$17/mo | ~$26/mo |
| China | ~8¢ | ~$12/mo | ~$17/mo | ~$26/mo |
Reality check: Most users don't run inference 24/7. A typical developer might use 2-4 hours/day, which cuts these costs to 8-17% of the 24/7 figures. Running 3 hours/day in the US on a mid-range setup comes to about $4/month in electricity (12.5% of ~$35).
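The figures in the table follow from power draw, hours of use, and the electricity rate. Here is a small sketch of that arithmetic; the function name and the 30-day month are assumptions, and the wattages and rates are the approximate ones from the table.

```python
def monthly_cost_usd(watts: float, rate_cents_per_kwh: float,
                     hours_per_day: float = 24.0, days: int = 30) -> float:
    """Electricity cost for one month of inference at a constant power draw."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * rate_cents_per_kwh / 100


# Mid-range setup (~300W) at the US average rate (~16 cents/kWh), running 24/7.
print(f"{monthly_cost_usd(300, 16):.2f}")  # 34.56

# Same setup, but only 3 hours of active inference per day.
print(f"{monthly_cost_usd(300, 16, hours_per_day=3):.2f}")
```

This counts only time under load. Idle draw for a machine left on all day would add a little on top.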
Break-even with Ollama Cloud: Compared with Ollama Cloud Pro ($20/mo), heavy daily use on a budget GPU costs less in electricity in most of the countries above. Germany is the exception: at ~40¢/kWh, about 8 hours/day under load already comes to roughly $19/mo. The comparison only favors your own hardware once the GPU cost is amortized. For light use, cloud is more convenient. For heavy sustained use, self-hosting wins, especially in countries with cheap electricity like India or China.
Ollama runs on macOS, Linux, and Windows. Here's the breakdown:
macOS (Apple Silicon) — Easiest setup, great for unified memory
Linux — Best for dedicated GPU servers, most flexible
`docker run ollama/ollama`
Windows — Works, but least recommended for serious use
Recommendation: MacBook for personal use, Linux (Ubuntu) for a dedicated GPU inference box. Avoid Windows for anything beyond casual experimentation.
Here's a realistic comparison of token/s throughput for popular models across different setups:
| Model | Model size | Budget GPU (RTX 3060) | Mid GPU (RTX 3090) | Mac M4 64GB | Ollama Cloud |
|---|---|---|---|---|---|
| Gemma 3 4B | 8.6 GB | ~35 tok/s | ~55 tok/s | ~30 tok/s | ~60-80 tok/s |
| Mistral 3 8B | 10.4 GB | ~28 tok/s | ~45 tok/s | ~22 tok/s | ~50-70 tok/s |
| Gemma 3 27B | 55 GB | ❌ | ~8 tok/s* | ~10 tok/s | ~40-60 tok/s |
| Devstral 24B | 51.6 GB | ❌ | ~10 tok/s* | ~8 tok/s | ~35-50 tok/s |
| DeepSeek V4 Flash | 140 GB | ❌ | ❌ | ~3 tok/s* | ~30-50 tok/s |
| Kimi K2.6 | 595 GB | ❌ | ❌ | ❌ | ~20-40 tok/s |
* With partial CPU-RAM offloading (slower than full GPU speed). ❌ = model won't fit or is impractically slow.
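To check numbers like these on your own hardware, you can compute throughput from the timing fields that Ollama's `/api/generate` endpoint returns in its final response object. A minimal sketch, assuming you already have that response parsed into a dict; the sample values are hand-written for illustration, not a real measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from the final JSON object of an /api/generate call."""
    # eval_count: number of tokens generated.
    # eval_duration: time spent generating them, in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


# Hand-written sample values, not measured on any real setup.
sample = {"eval_count": 290, "eval_duration": 10_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")  # 29.0 tok/s
```

Running a few prompts of different lengths and averaging gives a steadier figure than a single request.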
Key takeaways:
Bottom line: Self-hosting wins on privacy and long-term cost for small-to-medium models. Cloud wins on raw speed, convenience, and access to the largest models. Many users run both — small models locally for privacy-sensitive tasks, cloud for heavy lifting.
Let's do the math for a typical scenario: running a 27B model (like Gemma 3 27B) on a mid-range setup vs Ollama Cloud Pro ($20/mo).
Self-hosting costs (mid-range RTX 3090 setup):
Ollama Cloud Pro: $20/mo = $240/year
Break-even:
But if you're a heavy user hitting Ollama Cloud limits and need Max ($100/mo):
Verdict: If you're on the Free or Pro tier and use it casually, cloud is the better deal (no upfront cost, no maintenance). If you'd hit Max tier or run high volumes regularly, self-hosting pays for itself in under a year. GPU resale value also helps — a used RTX 3090 still sells for ~$600+ after 2 years.
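The break-even reasoning can be sketched as a single division: net hardware cost over the monthly saving versus cloud. The function name is hypothetical, and the $1,000 hardware cost and $5/month electricity below are placeholders, not figures from this article. Only the $20 and $100 plan prices come from the text.

```python
from typing import Optional


def break_even_months(hardware_cost: float, cloud_monthly: float,
                      electricity_monthly: float,
                      resale_value: float = 0.0) -> Optional[float]:
    """Months until cloud fees exceed net hardware cost plus electricity.

    Returns None when electricity alone costs as much as the cloud plan,
    in which case self-hosting never catches up.
    """
    monthly_saving = cloud_monthly - electricity_monthly
    if monthly_saving <= 0:
        return None
    return (hardware_cost - resale_value) / monthly_saving


# Placeholder inputs: $1,000 of hardware and $5/month of electricity.
for plan, price in [("Pro", 20), ("Max", 100)]:
    months = break_even_months(1000, price, 5)
    print(f"vs {plan} (${price}/mo): {months:.1f} months")
```

Passing an expected `resale_value` shortens the break-even time, which is how the resale point in the verdict enters the calculation.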