Embeddable LLM Runtime (stable)
llama.cpp
LLM inference in C/C++ for CPU and GPU
72.0K stars
800 contributors
Since 2023
License
MIT
Min RAM
2 GB
Min CPUs
1 core
Scaling
single node
Complexity
advanced
Performance
medium
Self-hostable
✓
K8s native
✕
Offline
✓
Pricing
fully free
Docs quality
good
Vendor lock-in
none
Use cases
- ✓ Run LLMs on CPU without GPU
- ✓ Embed AI in desktop/mobile apps (see the C sketch after this list)
- ✓ Quantized model inference for edge devices
- ✓ Backend for Ollama and other wrappers
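A minimal C sketch of the embedding path, assuming a local GGUF model file and the C API from llama.h. The names used below (llama_backend_init, llama_load_model_from_file, llama_new_context_with_model) match a recent snapshot of the API, which does change between releases, so verify them against the header you actually build against; tokenization, decoding, and sampling are left out, and the examples/simple program in the repository shows the full generation loop.

```c
// Sketch: load a quantized GGUF model on CPU via the llama.h C API.
// Build roughly as: cc embed.c -lllama (after building/installing llama.cpp).
// API names vary between releases -- check llama.h in your tree.
#include "llama.h"
#include <stdio.h>

int main(int argc, char **argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();  // one-time global init; CPU-only works out of the box

    // Load the model file (typically a quantized .gguf).
    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model *model = llama_load_model_from_file(argv[1], mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // Create an inference context; llama_tokenize / llama_decode / sampling
    // would follow here (see examples/simple for the complete loop).
    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context *ctx = llama_new_context_with_model(model, cparams);

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

This skeleton is essentially the entry point that wrappers such as Ollama and the language bindings drive under the hood.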
Anti-patterns / when NOT to use
- ✕ C++ knowledge needed for embedding
- ✕ Long model load times for large models
- ✕ Less user-friendly than Ollama
Technical specs
Language
C, C++
API type
SDK
Protocols
HTTP (bundled llama-server; client example below)
Deployment
source, binary
SDKs
C, C++, Python, JavaScript, Go, Rust, Swift
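The HTTP entry above refers to the bundled llama-server binary, which exposes a REST API (an OpenAI-compatible /v1/chat/completions route in recent builds, listening on 127.0.0.1:8080 by default). Below is a hedged client sketch in C using libcurl; the URL, port, and JSON shape are assumptions to check against the server documentation for your version.

```c
// Sketch: POST a chat completion request to a locally running llama-server.
// Assumes the server was started separately (e.g. llama-server -m model.gguf)
// and listens on the default 127.0.0.1:8080. Build with: cc client.c -lcurl
#include <curl/curl.h>
#include <stdio.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    const char *url  = "http://127.0.0.1:8080/v1/chat/completions";
    const char *body =
        "{\"messages\":[{\"role\":\"user\",\"content\":\"Say hello.\"}],"
        "\"max_tokens\":32}";

    struct curl_slist *hdrs =
        curl_slist_append(NULL, "Content-Type: application/json");

    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

    // libcurl writes the JSON response to stdout by default.
    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "request failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```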
Community
GitHub stars 72.0K
Contributors 800
Commit frequency daily
Plugin ecosystem none
Backing Georgi Gerganov
Funding community
Release
Latest version —
Last release —
Since 2023
Best fit
Team size
solo, small, medium
Industries
general