Embeddable LLM Runtime · stable

llama.cpp

LLM inference in C/C++ for CPU and GPU

72.0K stars · 800 contributors · Since 2023

License: MIT
Min RAM: 2 GB
Min CPUs: 1 core
Scaling: single node
Complexity: advanced
Performance: medium
Self-hostable
K8s native
Offline
Pricing: fully free
Docs quality: good
Vendor lock-in: none

Use cases

  • Run LLMs on CPU without GPU (see the sketch after this list)
  • Embed AI in desktop/mobile apps
  • Quantized model inference for edge devices
  • Backend for Ollama and other wrappers
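
A minimal sketch of the CPU / quantized-GGUF use case via the community llama-cpp-python bindings (one of the SDKs listed under Technical specs); the model path, thread count, and prompt are illustrative assumptions, and the binding's API can shift between releases:

    # Local, CPU-only inference of a quantized GGUF model through
    # the llama-cpp-python bindings (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-7b-q4_k_m.gguf",  # placeholder: any quantized GGUF file
        n_ctx=2048,    # context window
        n_threads=4,   # CPU threads; no GPU required
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])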

Anti-patterns / when NOT to use

  • Embedding it directly requires C/C++ knowledge
  • Large models can take a long time to load
  • Less user-friendly than Ollama for quick local setup

Replaces / alternatives to

  • OpenAI API (see the sketch after this list)
  • cloud LLM inference
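
Because the bundled llama-server exposes an OpenAI-compatible HTTP API, existing OpenAI SDK code can often be pointed at a local instance instead of the hosted service. A sketch assuming a server already running locally (host, port, API key, and model name are placeholders):

    # Reuse the standard OpenAI Python client against a local llama-server
    # (e.g. one started with: llama-server -m model.gguf --port 8080).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")  # key is not checked locally

    resp = client.chat.completions.create(
        model="local-model",  # placeholder; the server answers with whatever model it loaded
        messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    )
    print(resp.choices[0].message.content)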

Technical specs

Language: C, C++
API type: SDK
Protocols: HTTP (see the HTTP sketch below)
Deployment: source, binary
SDKs: C, C++, Python, JavaScript, Go, Rust, Swift
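
At the protocol level the same server can also be called over plain HTTP without any SDK; a sketch against its native completion endpoint, assuming a local instance on port 8080 (endpoint path and field names follow the server docs at the time of writing and may change between releases):

    # Direct HTTP call to llama-server's native /completion endpoint.
    import requests

    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": "The capital of France is", "n_predict": 32},  # n_predict = max new tokens
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["content"])  # generated text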

Community

GitHub stars: 72.0K
Contributors: 800
Commit frequency: daily
Plugin ecosystem: none
Backing: Georgi Gerganov
Funding: community

Release

Latest version
Last release
Since 2023

Best fit

Team size: solo, small, medium
Industries: general

Tags

  • llm-inference
  • cpu
  • gpu
  • quantization
  • gguf
  • local
  • embedded
  • server
  • metal
  • vulkan