
Text Generation Inference

Production-grade LLM serving from HuggingFace


A production LLM serving engine from HuggingFace, written in Rust and Python, with tensor parallelism, quantization, Flash Attention, token streaming, and continuous batching.
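To get a feel for the streaming path, here is a minimal Python sketch using the huggingface_hub InferenceClient against an already-running TGI server; the URL, prompt, and parameters are illustrative assumptions, not part of this listing.

```python
# Minimal sketch: stream tokens from a TGI server.
# Assumes a server is already listening on localhost:8080 (placeholder).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# stream=True yields tokens one by one as the server emits them,
# exercising TGI's streaming response path.
for token in client.text_generation(
    "Explain continuous batching in one sentence.",  # illustrative prompt
    max_new_tokens=80,
    stream=True,
):
    print(token, end="", flush=True)
```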

License
Apache-2.0
Min RAM
8 GB
Min CPUs
4 cores
Scaling
horizontal
Complexity
advanced
Performance
enterprise-grade
Self-hostable
yes
K8s native
yes
Offline
yes
Pricing
fully free
Docs quality
good
Vendor lock-in
none

Use cases

  • Production LLM serving with HuggingFace models
  • Multi-GPU inference with tensor parallelism (see the launch sketch after this list)
  • Quantized model serving for cost optimization
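The multi-GPU and quantization use cases above come down to launcher flags on the official Docker image. A minimal sketch, wrapping a typical docker run invocation in Python's subprocess; the model id, shard count, quantization backend, ports, and volume name are placeholder assumptions, though --num-shard and --quantize are documented TGI launcher flags.

```python
# Hedged sketch: launch TGI sharded across 2 GPUs with quantized weights.
# Runs in the foreground until the container is stopped.
import subprocess

subprocess.run([
    "docker", "run", "--gpus", "all", "--shm-size", "1g",
    "-p", "8080:80",                    # expose the server on host port 8080
    "-v", "tgi-data:/data",             # cache model weights between runs
    "ghcr.io/huggingface/text-generation-inference:latest",
    "--model-id", "mistralai/Mistral-7B-Instruct-v0.2",  # placeholder model
    "--num-shard", "2",                 # tensor parallelism across 2 GPUs
    "--quantize", "bitsandbytes",       # quantized serving to cut memory cost
], check=True)
```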

Anti-patterns / when NOT to use

  • Tightly focused on the HuggingFace ecosystem
  • Less flexible than vLLM for serving non-HF models
  • Requires a GPU; not suited to CPU-only deployments

Replaces / alternatives to

  • Proprietary inference endpoints

Technical specs

Language
Rust, Python
API type
REST
Protocols
HTTP
Deployment
docker
SDKs
python
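Because the API is plain REST over HTTP, any HTTP client works without the SDK. A minimal sketch with requests against TGI's /generate endpoint, again assuming a local server on port 8080; the prompt and parameters are illustrative.

```python
# Minimal sketch of the raw REST surface (no SDK required).
import requests

resp = requests.post(
    "http://localhost:8080/generate",   # placeholder server address
    json={
        "inputs": "What is Flash Attention?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
# /generate returns a JSON body containing the generated text.
print(resp.json()["generated_text"])
```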

Community

GitHub stars 10.0K
Contributors 200
Commit frequency daily
Plugin ecosystem none
Backing Hugging Face
Funding VC-backed

Release

Latest version
Last release
Since 2023

Best fit

Team size
small, medium, enterprise
Industries
SaaS, enterprise, research

Tags

  • llm-serving
  • rust
  • tensor-parallelism
  • quantization
  • flash-attention
  • streaming
  • production