LLMs & AI Infra · LLM Serving · stable
vLLM
High-throughput LLM serving engine
45.0K stars
600 contributors
Since 2023
Production-grade LLM inference engine with PagedAttention for memory-efficient KV-cache management, continuous batching, speculative decoding, and an OpenAI-compatible API.
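Because the API is OpenAI-compatible, existing OpenAI client code can be pointed at a self-hosted vLLM server. A minimal sketch, assuming a server is already running locally on vLLM's default port 8000; the model name here is illustrative:

```python
# Query a local vLLM server through its OpenAI-compatible API.
# Assumes a server started separately, e.g.: vllm serve mistralai/Mistral-7B-Instruct-v0.2
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # point the client at the local vLLM server
    api_key="EMPTY",  # vLLM ignores the key unless the server was started with --api-key
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```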
License
Apache-2.0
Min RAM
8 GB
Min CPUs
4 cores
Scaling
horizontal
Complexity
advanced
Performance
enterprise-grade
Self-hostable
✓
K8s native
✓
Offline
✕
Pricing
fully free
Docs quality
good
Vendor lock-in
none
Use cases
- ✓ Serve LLMs in production with high throughput
- ✓ Multi-model serving for AI gateway
- ✓ Batch inference for document processing (see the sketch after this list)
- ✓ Low-latency chatbot backend
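For the batch-inference use case, vLLM's offline Python API runs many prompts through one continuously batched scheduling loop, no server required. A minimal sketch; the model name, prompts, and sampling values are illustrative:

```python
# Offline batch inference with vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize: The quarterly report shows ...",
    "Extract the invoice total from: ...",
    "Classify the sentiment of: ...",
]
sampling = SamplingParams(temperature=0.0, max_tokens=128)  # deterministic decoding

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # requires a GPU
outputs = llm.generate(prompts, sampling)  # all prompts batched together

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```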
Anti-patterns / when NOT to use
- ✕ Requires a GPU; no CPU-only mode
- ✕ More complex to set up than Ollama
- ✕ Not for single-user local development
Technical specs
Language
Python, C++, CUDA
API type
REST
Protocols
HTTP
Deployment
pip, Docker
SDKs
Python
Community
GitHub stars 45.0K
Contributors 600
Commit frequency daily
Plugin ecosystem none
Backing UC Berkeley / vLLM Team
Funding VC-backed
Release
Latest version —
Last release —
Since 2023
Best fit
Team size
small, medium, enterprise
Industries
SaaS, fintech, healthcare, enterprise