LLMs & AI Infra / LLM Serving · stable

vLLM

High-throughput LLM serving engine

45.0K stars · 600 contributors · Since 2023

Production-grade LLM inference engine with PagedAttention for memory efficiency, continuous batching, speculative decoding, and an OpenAI-compatible API.
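
A minimal sketch of offline batch inference with vLLM's Python API; the model name and sampling settings are illustrative assumptions, not project defaults:

```python
# Offline batch inference with vLLM's Python API.
# Assumes a CUDA GPU and a Hugging Face model that fits in GPU memory;
# the model name below is an illustrative choice.
from vllm import LLM, SamplingParams

prompts = [
    "Summarize the idea behind PagedAttention in one sentence.",
    "List three benefits of continuous batching for LLM serving.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # downloads weights on first run
outputs = llm.generate(prompts, sampling_params)     # batches all prompts in one pass

for out in outputs:
    print(out.prompt, "->", out.outputs[0].text.strip())
```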

License
Apache-2.0
Min RAM
8 GB
Min CPUs
4 cores
Scaling
horizontal
Complexity
advanced
Performance
enterprise grade
Self-hostable
K8s native
Offline
Pricing
fully free
Docs quality
good
Vendor lock-in
none

Use cases

  • Serve LLMs in production with high throughput (see the client sketch after this list)
  • Multi-model serving behind an AI gateway
  • Batch inference for document processing
  • Low-latency chatbot backend
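
These use cases typically run through vLLM's OpenAI-compatible HTTP server. A client-side sketch, assuming a server was started locally beforehand (for example with `vllm serve <model> --port 8000`) and using the `openai` Python SDK:

```python
# Query a locally running vLLM server through its OpenAI-compatible API.
# Assumes the server was started beforehand, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # any string works unless --api-key is set
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Draft a two-sentence product update."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```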

Anti-patterns / when NOT to use

  • GPU-centric; CPU-only serving is limited and not a practical production path
  • More complex to set up and operate than Ollama
  • Overkill for single-user local development

Replaces / alternatives to

  • OpenAI API endpoints
  • Proprietary inference platforms

Technical specs

Language
Python, C++, CUDA
API type
REST
Protocols
HTTP
Deployment
pip, Docker
SDKs
Python

Community

GitHub stars 45.0K
Contributors 600
Commit frequency daily
Plugin ecosystem none
Backing UC Berkeley / vLLM Team
Funding VC-backed

Release

Latest version
Last release
Since 2023

Best fit

Team size
small, medium, enterprise
Industries
SaaS, fintech, healthcare, enterprise

Tags

  • llm-serving
  • high-throughput
  • paged-attention
  • batching
  • production-inference
  • gpu