Embeddable LLM Runtime · stable

llama.cpp

LLM inference in C/C++ for CPU and GPU

72.0K stars · 800 contributors · Since 2023

License: MIT
Min RAM: 2 GB
Min CPUs: 1 core
Scaling: single node
Complexity: advanced
Performance: medium
Self-hostable
K8s native
Offline
Pricing: fully free
Docs quality: good
Vendor lock-in: none

Use cases

  • Run LLMs on CPU without GPU (see the sketch after this list)
  • Embed AI in desktop/mobile apps
  • Quantized model inference for edge devices
  • Backend for Ollama and other wrappers
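
A minimal sketch of the CPU / quantized-GGUF use case via the community llama-cpp-python bindings (one of the SDKs listed under Technical specs); the model path, thread count, and prompt are illustrative assumptions, and the binding's API can shift between releases:

    # Local, CPU-only inference of a quantized GGUF model through
    # the llama-cpp-python bindings (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-7b-q4_k_m.gguf",  # placeholder: any quantized GGUF file
        n_ctx=2048,    # context window
        n_threads=4,   # CPU threads; no GPU required
    )

    out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
    print(out["choices"][0]["text"])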

Anti-patterns / when NOT to use

  • Embedding it directly requires C/C++ knowledge
  • Large models can take a long time to load
  • Less user-friendly than Ollama for quick local setup

Replaces / alternatives to

  • OpenAI API (see the sketch after this list)
  • cloud LLM inference
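
Because the bundled llama-server exposes an OpenAI-compatible HTTP API, existing OpenAI SDK code can often be pointed at a local instance instead of the hosted service. A sketch assuming a server already running locally (host, port, API key, and model name are placeholders):

    # Reuse the standard OpenAI Python client against a local llama-server
    # (e.g. one started with: llama-server -m model.gguf --port 8080).
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")  # key is not checked locally

    resp = client.chat.completions.create(
        model="local-model",  # placeholder; the server answers with whatever model it loaded
        messages=[{"role": "user", "content": "Summarize llama.cpp in one sentence."}],
    )
    print(resp.choices[0].message.content)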

Technical specs

Language: C, C++
API type: SDK
Protocols: HTTP (see the HTTP sketch below)
Deployment: source, binary
SDKs: C, C++, Python, JavaScript, Go, Rust, Swift
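
At the protocol level the same server can also be called over plain HTTP without any SDK; a sketch against its native completion endpoint, assuming a local instance on port 8080 (endpoint path and field names follow the server docs at the time of writing and may change between releases):

    # Direct HTTP call to llama-server's native /completion endpoint.
    import requests

    resp = requests.post(
        "http://localhost:8080/completion",
        json={"prompt": "The capital of France is", "n_predict": 32},  # n_predict = max new tokens
        timeout=60,
    )
    resp.raise_for_status()
    print(resp.json()["content"])  # generated text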

Community

GitHub stars: 72.0K
Contributors: 800
Commit frequency: daily
Plugin ecosystem: none
Backing: Georgi Gerganov
Funding: community

Release

Latest version
Last release
Since 2023

Best fit

Team size: solo, small, medium
Industries: general

Tags

  • llm-inference
  • cpu
  • gpu
  • quantization
  • gguf
  • local
  • embedded
  • server
  • metal
  • vulkan