
🦞 AI Cost Calculator

"Why not Zoidberg? Calculate your AI costs!"

🔬 Diagnosis Parameters

Ollama Cloud uses subscription pricing, not per-token pricing. Cost is a flat monthly fee regardless of volume. Free: light usage · Pro: $20/mo (50× Free, 3 concurrent models) · Max: $100/mo (5× Pro, 10 concurrent models). Per-token pricing coming soon. Cloud models include DeepSeek V4 Flash, Kimi K2.6, Devstral, Nemotron, and more; see the Ollama Cloud docs for the full list.

📊 Cost Diagnosis

⚖️ Cross-Model Comparison

Model Daily Monthly Annual

📖 Frequently Asked Questions

How much does GPT-4o cost per million tokens?
OpenAI's GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens as of 2026. Output tokens are typically 3-5× more expensive than input tokens across all AI models. Use the calculator above with different token counts to see your real costs.
What is the cheapest AI model API for developers?
DeepSeek V3 ($0.27/1M input, $1.10/1M output) and Google Gemini 2.0 Flash Lite ($0.08/1M input, $0.20/1M output) are among the cheapest direct API models. OpenRouter and Ollama Cloud offer additional budget options starting from free tiers. Switch to "All Models" view and sort the comparison table to find the best deal for your workload.
How do I estimate my monthly AI API costs?
Multiply your daily requests by the input and output token counts per request, each at the model's per-token price, then multiply by 30 for a monthly estimate. The calculator above automates this: enter your usage volume, pick a model, and get instant daily/monthly/annual breakdowns with side-by-side comparisons across all 461 models.
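The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual code: the function name, the sample workload, and the choice of 365 days for the annual figure are assumptions made here; the prices are the GPT-4o rates quoted above.

```python
from dataclasses import dataclass


@dataclass
class CostBreakdown:
    daily: float
    monthly: float
    annual: float


def estimate_api_cost(requests_per_day, input_tokens, output_tokens,
                      input_price_per_m, output_price_per_m):
    """Return daily/monthly/annual cost in USD for a per-token priced model.

    Prices are in USD per 1M tokens; token counts are per request.
    """
    per_request = (input_tokens * input_price_per_m
                   + output_tokens * output_price_per_m) / 1_000_000
    daily = requests_per_day * per_request
    # Monthly uses 30 days as in the text; annual = 365 days is an assumption.
    return CostBreakdown(daily=daily, monthly=daily * 30, annual=daily * 365)


# Hypothetical workload: 1,000 requests/day, 500 input + 200 output tokens each,
# at GPT-4o rates ($2.50 input, $10.00 output per 1M tokens).
cost = estimate_api_cost(1_000, 500, 200, 2.50, 10.00)
print(f"daily ${cost.daily:.2f}, monthly ${cost.monthly:.2f}, annual ${cost.annual:.2f}")
```

For this sample workload the estimate comes to about $3.25 per day, or roughly $97.50 per month.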
What is per-token pricing for LLMs?
Per-token pricing means AI providers charge based on the number of tokens processed. Input tokens (your prompt + context) and output tokens (the model's response) are billed separately at different rates. One token ≈ 4 characters ≈ 0.75 English words. A typical 250-word page uses roughly 333 tokens. Most leading models now use per-token pricing, and the calculator supports it for all 461 models.
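As a rough sketch, the two rules of thumb above translate into simple estimators. These are approximations only; real tokenizers vary by model and language, and the function names are made up for this example.

```python
WORDS_PER_TOKEN = 0.75   # rule of thumb from the text: 1 token is about 0.75 English words
CHARS_PER_TOKEN = 4      # rule of thumb from the text: 1 token is about 4 characters


def tokens_from_words(word_count: int) -> int:
    """Estimate tokens from an English word count."""
    return round(word_count / WORDS_PER_TOKEN)


def tokens_from_text(text: str) -> int:
    """Estimate tokens from raw text length in characters."""
    return round(len(text) / CHARS_PER_TOKEN)


print(tokens_from_words(250))        # the 250-word page from the text
print(tokens_from_text("a" * 400))   # 400 characters of text
```

For billing-grade numbers, count tokens with the provider's own tokenizer rather than these estimates.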
How much does Claude (Anthropic) cost vs GPT (OpenAI)?
Claude Sonnet 4 costs $3.00/1M input and $15.00/1M output vs GPT-4o at $2.50/1M input and $10.00/1M output. Claude Opus 4 is $15.00/1M input and $75.00/1M output for maximum capability. Pricing changes frequently across all providers, so use the calculator to compare live rates.
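To see how these rates play out for one workload, here is a small sketch that ranks the models mentioned in the answers above by monthly cost. The price table is copied from this page and may be out of date; the workload (1,000 requests/day, 500 input and 200 output tokens each) and the function names are assumptions for illustration.

```python
# USD per 1M tokens as (input, output), taken from the FAQ answers above.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "Claude Sonnet 4": (3.00, 15.00),
    "Claude Opus 4": (15.00, 75.00),
    "DeepSeek V3": (0.27, 1.10),
    "Gemini 2.0 Flash Lite": (0.08, 0.20),
}


def monthly_cost(model, requests_per_day, input_tokens, output_tokens, days=30):
    """Monthly cost in USD for one model and a fixed per-request token profile."""
    in_price, out_price = PRICES[model]
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return requests_per_day * per_request * days


def rank_models(requests_per_day, input_tokens, output_tokens):
    """Return (model, monthly cost) pairs sorted from cheapest to most expensive."""
    costs = {m: monthly_cost(m, requests_per_day, input_tokens, output_tokens)
             for m in PRICES}
    return sorted(costs.items(), key=lambda item: item[1])


ranking = rank_models(1_000, 500, 200)
for model, cost in ranking:
    print(f"{model:<24} ${cost:>8.2f}/month")
```

Because output tokens cost more than input tokens, the ranking can shift for workloads with long responses, so it is worth rerunning with your own token profile.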
Can I run AI models on my own hardware instead of paying for cloud APIs?
Yes! Ollama lets you run open-source models locally on your own hardware — no API keys, no per-token charges, no monthly subscription fees. You download a model once and run it as much as you want. The tradeoff: you need sufficient hardware (primarily VRAM on a GPU, or system RAM for CPU inference), and performance depends heavily on your setup. Below we break down hardware tiers, costs, electricity, OS choices, and how local performance compares to cloud hosting.
What hardware do I need to run Ollama models locally?

Ollama's hardware requirement is driven by one thing: the model's size on disk (which roughly equals the VRAM/RAM needed). Models in Ollama are quantized (compressed): the default Q4_K_M quantization shrinks a model to roughly 30% of its 16-bit (FP16) size while keeping most of the quality.

Three tiers:

Budget / Minimal ($300-500)
Run small models up to ~8B parameters. You need ~8 GB VRAM for comfortable single-model use.

  • GPU: NVIDIA RTX 3060 12 GB ($250-300 used) — best budget pick; or RTX 4060 8 GB (~$280)
  • RAM: 16 GB system RAM
  • Storage: 256 GB NVMe SSD
  • Models: Gemma 3 4B, Mistral 3 8B/14B, GPT-OSS 20B (offloads partially to RAM)
  • Speed: ~25-40 tok/s (8B models), ~10-20 tok/s (14B models)

Mid-Range ($800-1,500)
Run models up to ~32B parameters comfortably, or 70B+ with partial CPU offloading.

  • GPU: NVIDIA RTX 3090 24 GB ($700-900 used) — the sweet spot; or RTX 4070 Ti Super 16 GB (~$750)
  • RAM: 32 GB system RAM
  • Storage: 512 GB NVMe SSD
  • Models: Gemma 3 27B, Nemotron 3 Nano 30B, Devstral 24B — and larger models partly offloaded to RAM
  • Speed: ~15-25 tok/s (27B models), ~5-12 tok/s (70B+ with offloading)

High-End ($2,500-5,000+)
Run 70B-120B+ models at full speed. Multi-GPU setups for the largest open models.

  • GPU: 2× NVIDIA RTX 3090 24 GB ($1,400-1,800 used) or RTX 4090 24 GB (~$1,600 new each)
  • RAM: 64 GB system RAM
  • Storage: 1-2 TB NVMe SSD (large models are 50-700+ GB each)
  • Models: DeepSeek V4 Flash 140B, Qwen3 80B, Devstral 123B, and beyond
  • Speed: ~10-20 tok/s (70B models on single 24 GB GPU), ~5-15 tok/s (120B+ on multi-GPU)

Raspberry Pi? Technically yes for tiny models (Gemma 3 4B quantized to Q2), but expect 1-3 tok/s — barely usable for chat. Not recommended unless you're just experimenting.

Key rules of thumb:

  • Model file size (from ollama list) ≈ VRAM needed for full GPU acceleration
  • If the model doesn't fit in VRAM, Ollama offloads layers to system RAM — much slower but functional
  • NVIDIA GPUs are strongly preferred (CUDA). AMD works via ROCm on Linux but setup is harder. Apple Silicon Macs work well via Metal — M2/M3/M4 Macs with 32+ GB unified memory are excellent mid-range options
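The first two rules of thumb can be expressed as a small check. This is a simplified sketch: it compares the model's file size against VRAM and system RAM only, and ignores context (KV cache) memory and whatever the operating system already uses, so real headroom is smaller. The function name and tier labels are made up here; the model sizes are the ones this page lists.

```python
def placement(model_size_gb: float, vram_gb: float, system_ram_gb: float) -> str:
    """Classify where a model of the given file size would run."""
    if model_size_gb <= vram_gb:
        return "full GPU"
    if model_size_gb <= vram_gb + system_ram_gb:
        return "partial offload to system RAM (slower)"
    return "does not fit"


# (VRAM GB, system RAM GB) for the budget and mid-range tiers described above.
TIERS = {
    "RTX 3060 12 GB + 16 GB RAM": (12, 16),
    "RTX 3090 24 GB + 32 GB RAM": (24, 32),
}
# Model sizes in GB as listed on this page.
MODELS = {"Gemma 3 4B": 8.6, "Mistral 3 8B": 10.4, "Gemma 3 27B": 55.0}

for tier, (vram, ram) in TIERS.items():
    for model, size in MODELS.items():
        print(f"{tier}: {model} ({size} GB) -> {placement(size, vram, ram)}")
```

In practice, leave a few GB of margin on top of the file size before counting on full GPU acceleration.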
How much does electricity cost to run Ollama 24/7?

A GPU under load draws 150-350W depending on the model. With the rest of the system, figure 200-450W total while actively inferencing. In idle/low-use, a modern GPU drops to ~10-25W.

Estimated monthly electricity cost (24/7 under load):

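Country   | Rate (¢/kWh) | Budget GPU (~200W) | Mid-Range (~300W) | High-End (~450W)
----------|--------------|--------------------|-------------------|-----------------
Germany   | ~40¢         | ~$58/mo            | ~$87/mo           | ~$130/mo
USA (avg) | ~16¢         | ~$23/mo            | ~$35/mo           | ~$52/mo
UK        | ~29¢         | ~$42/mo            | ~$63/mo           | ~$94/mo
France    | ~25¢         | ~$36/mo            | ~$54/mo           | ~$81/mo
Japan     | ~26¢         | ~$38/mo            | ~$56/mo           | ~$85/mo
India     | ~8¢          | ~$12/mo            | ~$17/mo           | ~$26/mo
China     | ~8¢          | ~$12/mo            | ~$17/mo           | ~$26/mo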
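The figures in this table follow from watts × hours × rate. A minimal sketch of that calculation is below; the function name and the 30-day month are assumptions made here, so results can differ from the table by a dollar or so due to rounding.

```python
def monthly_electricity_cost(watts, rate_cents_per_kwh,
                             hours_per_day=24.0, days_per_month=30.0):
    """Monthly electricity cost in USD for a system drawing `watts` while running."""
    kwh = watts / 1000 * hours_per_day * days_per_month
    return kwh * rate_cents_per_kwh / 100


# Approximate rates in cents per kWh, copied from the table above.
RATES = {"Germany": 40, "USA (avg)": 16, "UK": 29, "France": 25,
         "Japan": 26, "India": 8, "China": 8}

for country, rate in RATES.items():
    cost = monthly_electricity_cost(300, rate)  # mid-range setup, 24/7 under load
    print(f"{country:<10} ~${cost:.0f}/mo")
```

Passing a smaller `hours_per_day` scales the cost down proportionally, which is useful for modeling part-time use.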

Reality check: Most users don't run inference 24/7. A typical developer might use 2-4 hours/day, cutting these costs to 8-17% of the 24/7 figures. Running 3 hours/day in the US on a mid-range setup? About $4/month in electricity (3/24 of the ~$35 24/7 figure).

Break-even with Ollama Cloud: At Ollama Cloud Pro ($20/mo), even heavy daily use on a budget GPU in most countries (except Germany) runs cheaper on your own hardware once you've amortized the GPU cost. For light use, cloud is more convenient. For heavy sustained use, self-hosting wins — especially in countries with cheap electricity like India or China.

Which operating system should I use for Ollama?

Ollama runs on macOS, Linux, and Windows. Here's the breakdown:

macOS (Apple Silicon): Easiest setup, great for unified memory

  • Install: drag-and-drop app, done
  • Metal GPU acceleration built-in — no driver headaches
  • Unified memory means VRAM = system RAM (M2/M3/M4 with 32/48/64/128 GB can run huge models)
  • Best pick: M4 Mac Studio or Mac Pro with 64+ GB unified memory for serious model work
  • Downside: hardware is expensive and not upgradeable

Linux: Best for dedicated GPU servers, most flexible

  • CUDA support is first-class on NVIDIA — fastest GPU inference
  • ROCm support for AMD GPUs (works on Ubuntu; more limited distro support)
  • Run headless (no display needed) — ideal for always-on inference servers
  • Docker support: docker run ollama/ollama
  • Best pick: Ubuntu 22.04/24.04 LTS (most Ollama docs and community tested here)
  • Downside: more setup, especially NVIDIA drivers + CUDA toolkit

Windows: Works, but least recommended for serious use

  • CUDA works fine for NVIDIA GPUs
  • WSL2 integration available but adds overhead
  • AMD Radeon GPUs are supported through ROCm; Intel GPU support is limited (both slower than CUDA)
  • More issues with long-running processes, memory management
  • Fine for trying things out; switch to Linux if you're building a dedicated setup

Recommendation: MacBook for personal use, Linux (Ubuntu) for a dedicated GPU inference box. Avoid Windows for anything beyond casual experimentation.

How does local Ollama performance compare to Ollama Cloud?

Here's a realistic comparison of token/s throughput for popular models across different setups:

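Model             | Size    | Budget GPU (RTX 3060) | Mid GPU (RTX 3090) | Mac M4 64GB | Ollama Cloud
------------------|---------|-----------------------|--------------------|-------------|-------------
Gemma 3 4B        | 8.6 GB  | ~35 tok/s             | ~55 tok/s          | ~30 tok/s   | ~60-80 tok/s
Mistral 3 8B      | 10.4 GB | ~28 tok/s             | ~45 tok/s          | ~22 tok/s   | ~50-70 tok/s
Gemma 3 27B       | 55 GB   | ❌                    | ~8 tok/s*          | ~10 tok/s   | ~40-60 tok/s
Devstral 24B      | 51.6 GB | ❌                    | ~10 tok/s*         | ~8 tok/s    | ~35-50 tok/s
DeepSeek V4 Flash | 140 GB  | ❌                    | ~3 tok/s*          | ❌          | ~30-50 tok/s
Kimi K2.6         | 595 GB  | ❌                    | ❌                 | ❌          | ~20-40 tok/s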

* With partial CPU-RAM offloading (slower than full GPU speed). ❌ = model won't fit or is impractically slow.

Key takeaways:

  • Small models (≤8B): Local hardware can nearly match cloud speed. At ~30 tok/s, it's already faster than reading speed — you won't notice the difference.
  • Medium models (14-32B): Cloud is 2-4× faster, but local is still very usable (8-20 tok/s). For interactive chat, this feels fine. For batch processing many requests, cloud wins.
  • Large models (70B+): Unless you have multi-GPU setups or a Mac with 128+ GB unified memory, Ollama Cloud is the practical option. Local inference drops to 1-5 tok/s for these models — painfully slow for anything but patience-testing experiments.
  • Mega models (400B+): Kimi K2, DeepSeek V4 Pro, GLM-5.1 — these require 500-1,600 GB of memory. Self-hosting is only feasible with server clusters. Ollama Cloud or other cloud APIs are the only sensible route for individuals.
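To put these throughput numbers in perspective, the sketch below converts a reading speed into tokens per second and estimates how long a response takes to generate. The 250 words/minute reading speed and the 500-token response length are assumptions chosen for illustration; the 0.75 words-per-token ratio is the rule of thumb used elsewhere on this page.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 1 token is about 0.75 English words


def reading_speed_tokens_per_sec(words_per_minute: float = 250.0) -> float:
    """Convert a reading speed (assumed 250 wpm by default) to tokens per second."""
    return words_per_minute / WORDS_PER_TOKEN / 60


def generation_seconds(output_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a response of `output_tokens` at a given throughput."""
    return output_tokens / tokens_per_sec


print(f"reading speed: ~{reading_speed_tokens_per_sec():.1f} tok/s")
for label, speed in [("small model, local", 30), ("medium model, local", 8),
                     ("large model, offloaded", 3)]:
    secs = generation_seconds(500, speed)
    print(f"{label:<24} {speed:>2} tok/s -> {secs:.0f} s for a 500-token reply")
```

Under these assumptions, anything above roughly 6 tok/s keeps up with a reader, which is why 8-20 tok/s still feels fine for chat while 1-5 tok/s does not.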

Bottom line: Self-hosting wins on privacy and long-term cost for small-to-medium models. Cloud wins on raw speed, convenience, and access to the largest models. Many users run both — small models locally for privacy-sensitive tasks, cloud for heavy lifting.

When does self-hosting Ollama break even vs cloud pricing?

Let's do the math for a typical scenario: running a 27B model (like Gemma 3 27B) on a mid-range setup vs Ollama Cloud Pro ($20/mo).

Self-hosting costs (mid-range RTX 3090 setup):

  • Hardware: ~$900 (used RTX 3090 + existing PC, or ~$1,200 total build)
  • Electricity (3h/day, US rates): ~$4/mo (3/24 of the ~$35/mo 24/7 figure)
  • Electricity (3h/day, Germany): ~$11/mo (3/24 of the ~$87/mo 24/7 figure)

Ollama Cloud Pro: $20/mo = $240/year

Break-even:

  • US: $900 hardware ÷ ($20 - $4)/mo = ~56 months (~4.7 years)
  • Germany: $900 hardware ÷ ($20 - $11)/mo = ~100 months (~8 years)

But if you're a heavy user hitting Ollama Cloud limits and need Max ($100/mo):

  • US: $900 hardware ÷ ($100 - $4)/mo = ~9 months
  • Germany: $900 hardware ÷ ($100 - $11)/mo = ~10 months

Verdict: If you're on the Free or Pro tier and use it casually, cloud is the better deal (no upfront cost, no maintenance). If you'd hit Max tier or run high volumes regularly, self-hosting pays for itself in under a year. GPU resale value also helps — a used RTX 3090 still sells for ~$600+ after 2 years.
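The break-even formula used above (hardware cost divided by monthly savings) can be sketched as a function so you can plug in your own numbers. The function name, the optional resale adjustment, and the sample inputs are assumptions for illustration; in particular, the $5/mo electricity figure is a placeholder, and subtracting resale value from the hardware cost is a simplification.

```python
import math


def break_even_months(hardware_cost, cloud_fee_per_month,
                      electricity_per_month, resale_value=0.0):
    """Months until self-hosting is cheaper than a flat cloud subscription.

    Returns math.inf when electricity costs at least as much as the cloud fee,
    since the hardware is then never paid off.
    """
    monthly_saving = cloud_fee_per_month - electricity_per_month
    if monthly_saving <= 0:
        return math.inf
    net_hardware = max(hardware_cost - resale_value, 0.0)
    return net_hardware / monthly_saving


# Example: $1,200 total build, placeholder $5/mo electricity.
print(f"vs Pro ($20/mo):  {break_even_months(1200, 20, 5):.0f} months")
print(f"vs Max ($100/mo): {break_even_months(1200, 100, 5):.1f} months")
# Counting ~$600 of resale value against the purchase price:
print(f"vs Pro with resale: {break_even_months(1200, 20, 5, resale_value=600):.0f} months")
```

The result is most sensitive to which cloud tier you would otherwise pay for, which matches the verdict above.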