Self-Host Open-Source LLMs
Running your own LLM gives you full control, privacy, and no per-token costs. TurboGPU makes it easy — spin up a GPU, install Ollama, and start chatting.
VRAM Requirements
| Model | Parameters | Min VRAM | Recommended Tier |
|---|---|---|---|
| Llama 3 8B | 8B | 8 GB | Starter (RTX 3060) |
| Llama 3 70B (Q4) | 70B | 40 GB | Power (A6000) |
| Mixtral 8x7B | 46.7B | 24 GB | Standard (RTX 3090) |
| Code Llama 34B | 34B | 24 GB | Standard (RTX 3090) |
| Phi-3 Medium | 14B | 12 GB | Starter (RTX 3060) |
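The figures above follow a simple rule of thumb: weight memory is parameter count times bytes per parameter, plus headroom for activations and the KV cache. A minimal sketch of that estimate (the 20% overhead factor is an illustrative assumption, not a TurboGPU spec):

```python
def estimate_vram_gb(params_billion: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights at the given precision, plus margin.

    overhead=1.2 is an assumed ~20% allowance for activations and KV cache.
    """
    weight_gb = params_billion * bits_per_param / 8  # GB for weights alone
    return weight_gb * overhead

print(round(estimate_vram_gb(8), 1))                      # ~19.2 GB at FP16
print(round(estimate_vram_gb(70, bits_per_param=4), 1))   # ~42.0 GB at Q4
```

This is why the table lists 40 GB for the 70B model only at Q4: at 4 bits per weight the model shrinks to roughly a quarter of its FP16 footprint.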
Quick Start with Ollama
```bash
# SSH into your TurboGPU instance
ssh -p <port> user@<ip>

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Llama 3
ollama run llama3

# Or run Mixtral for code generation
ollama run mixtral
```
That's it. You're running a local LLM with full GPU acceleration in under 2 minutes.
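Beyond the interactive prompt, Ollama also serves a local REST API on port 11434, so you can script against the model. A minimal sketch calling the `/api/generate` endpoint with only the Python standard library (the prompt and host are placeholders):

```python
import json
import urllib.request

def build_generate_request(prompt: str, model: str = "llama3") -> dict:
    """Payload for Ollama's /api/generate; stream=False returns one JSON object."""
    return {"model": model, "prompt": prompt, "stream": False}

def ollama_generate(prompt: str, model: str = "llama3",
                    host: str = "http://localhost:11434") -> str:
    """POST to a local Ollama server and return the generated text."""
    data = json.dumps(build_generate_request(prompt, model)).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(ollama_generate("Why is the sky blue?"))
```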
For Production: vLLM
If you need an OpenAI-compatible API server:
```bash
pip install vllm

# Start an API server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --port 8000

# Now you can call it like OpenAI
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```
Benchmark Results
| Model | Tier | Tokens/sec | Latency (first token) |
|---|---|---|---|
| Llama 3 8B | Starter | 45 tok/s | 0.3s |
| Llama 3 8B | Standard | 78 tok/s | 0.2s |
| Mixtral 8x7B | Standard | 32 tok/s | 0.5s |
| Llama 3 70B Q4 | Power | 22 tok/s | 0.8s |
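Because vLLM speaks the OpenAI wire format, any OpenAI-compatible client can target your instance. A dependency-free sketch using only the Python standard library (the localhost URL assumes the vLLM server from the section above is running):

```python
import json
import urllib.request

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"

def build_chat_request(messages: list, model: str = MODEL) -> dict:
    """OpenAI-style chat.completions payload."""
    return {"model": model, "messages": messages}

def chat(messages: list, base_url: str = "http://localhost:8000/v1") -> str:
    """POST to the vLLM server and return the assistant's reply text."""
    data = json.dumps(build_chat_request(messages)).encode()
    req = urllib.request.Request(f"{base_url}/chat/completions", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat([{"role": "user", "content": "Hello!"}]))
```

Pointing the official `openai` Python package at `base_url="http://localhost:8000/v1"` works the same way, which means existing OpenAI integrations can switch to your self-hosted model with a one-line config change.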
Cost Comparison vs API Providers
Running Llama 3 8B for 8 hours on the Starter tier costs $3.20. At ~45 tokens/sec, that's roughly 1.3 million tokens generated. The same volume at OpenAI's GPT-4o pricing would cost $6.50 or more, and self-hosting keeps your data fully private.
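The arithmetic behind that comparison generalizes: tokens generated = throughput × runtime, which gives a per-million-token cost you can compare directly against any API rate. A quick sketch using the Starter-tier numbers from this page:

```python
def tokens_generated(tok_per_sec: float, hours: float) -> int:
    """Total tokens produced at a sustained throughput over a run."""
    return int(tok_per_sec * 3600 * hours)

def cost_per_million(total_cost: float, tokens: int) -> float:
    """Effective $/1M tokens for a fixed-price GPU rental."""
    return total_cost / (tokens / 1_000_000)

tokens = tokens_generated(45, 8)                  # Starter tier, 8 hours
print(tokens)                                     # 1296000 (~1.3M tokens)
print(round(cost_per_million(3.20, tokens), 2))   # ~$2.47 per million tokens
```

The effective rate drops further on faster tiers as long as throughput grows more than the hourly price does.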
