vLLM
vLLM is a high-performance inference and serving engine for Large Language Models (LLMs), designed to maximize GPU utilization, reduce latency, and enable efficient multi-user, production-grade GenAI systems.
What is vLLM?
vLLM (Virtual Large Language Model) is an open-source LLM inference engine developed by researchers at UC Berkeley. It focuses on serving, not training.
Core Problem It Solves
Traditional LLM serving wastes GPU memory due to:
Fixed-size KV caches
Fragmentation during long or concurrent requests
Inefficient batching
vLLM solves this with PagedAttention.
Key Concepts in vLLM
1. PagedAttention (Core Innovation)
KV cache is stored in non-contiguous GPU memory pages
Similar to virtual memory paging in operating systems
Eliminates memory fragmentation
Enables higher throughput with more concurrent users
Result:
2–4× higher throughput
Lower tail latency
More requests per GPU
2. Continuous Batching
Requests are dynamically added/removed during inference
No need to wait for batch boundaries
Ideal for chatbots and APIs
3. OpenAI-Compatible API
vLLM can expose:
/v1/chat/completions
/v1/completions
/v1/embeddings
This makes it a drop-in replacement for OpenAI API in self-hosted setups.
4. Multi-Model & Multi-Tenant Serving
Serve multiple models on the same GPU
Efficient scheduling across users
Used in SaaS and internal GenAI platforms
5. Hugging Face & Quantization Support
Supports HF models directly
Works with:
FP16 / BF16
AWQ
GPTQ
Some GGUF via conversion (not native focus)
When Should You Use vLLM?
vLLM is ideal if you are:
Building production LLM APIs
Serving chatbots with many concurrent users
Running RAG systems at scale
Hosting enterprise GenAI platforms
Replacing OpenAI with self-hosted inference
It is not intended for:
Training models
Edge devices
Ultra-low-RAM laptops
vLLM Architecture (High Level)
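At a high level, a request flows roughly like this (a simplified sketch, not the exact internal module names): request queue → continuous-batching scheduler → PagedAttention KV-cache manager → GPU model execution (CUDA) → streamed token output.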
Major Alternatives to vLLM
Below are the practical and widely used alternatives, categorized by use case.
1. TensorRT-LLM (NVIDIA)


Best for: Maximum NVIDIA GPU performance
Pros
Lowest latency on NVIDIA GPUs
Deep CUDA and kernel fusion optimizations
Used in large-scale enterprise deployments
Cons
NVIDIA-only
Complex build process
Less flexible than vLLM
Choose TensorRT-LLM if
You control the GPU fleet (A100/H100)
Latency is mission-critical
2. TGI (Text Generation Inference – Hugging Face)
Best for: Stable, production-ready HF serving
Pros
Official Hugging Face solution
Strong model compatibility
Streaming, batching, token limits
Cons
Lower throughput than vLLM
Less memory-efficient KV handling
Choose TGI if
You live fully in the HF ecosystem
You want simplicity over peak performance
3. LMDeploy (OpenMMLab)


Best for: Lightweight, fast inference
Pros
Efficient memory usage
Fast startup
Supports Turbomind backend
Cons
Smaller ecosystem
Fewer production features than vLLM
4. Ollama


Best for: Local development and laptops
Pros
Extremely easy to use
Excellent for Mac (Metal, MLX)
GGUF-native
Cons
Not built for high concurrency
Not enterprise-scale
Choose Ollama if
You are developing locally
You want fast iteration, not scale
5. llama.cpp


Best for: Edge devices and CPU inference
Pros
Runs anywhere (CPU, Metal, CUDA)
GGUF ecosystem
Minimal dependencies
Cons
Lower throughput
No advanced scheduler like vLLM
6. DeepSpeed-Inference (Microsoft)


Best for: Large multi-GPU clusters
Pros
Tensor parallelism
ZeRO optimizations
Enterprise-grade
Cons
Heavy configuration
Slower iteration than vLLM
Summary Comparison
| Engine | Best for | Throughput | Ease of use | Hardware |
|---|---|---|---|---|
| vLLM | High-concurrency APIs | Very High | Medium | CUDA |
| TensorRT-LLM | Lowest latency | Extreme | Low | NVIDIA |
| TGI | Stable HF serving | High | High | CUDA |
| LMDeploy | Lightweight serving | High | Medium | CUDA |
| Ollama | Local dev | Medium | Very High | CPU/Metal |
| llama.cpp | Edge / CPU | Low–Medium | Medium | Any |
| DeepSpeed-Inference | Clusters | High | Low | CUDA |
Practical Recommendation (for you)
Given your work on:
Self-hosted LLMs
Agent systems
Production GenAI platforms
vLLM should be your default inference layer, with:
Ollama / llama.cpp for local prototyping
TensorRT-LLM if you later optimize for NVIDIA-only latency
What's the pros and cons of vLLM?
Below is a concise but production-oriented evaluation of vLLM, framed the way it is typically assessed in real GenAI platform decisions.
vLLM – Pros and Cons
✅ Pros of vLLM
1. Industry-Leading Throughput (Primary Advantage)
PagedAttention eliminates KV-cache fragmentation
Enables far more concurrent requests per GPU
Sustains high tokens/sec even with long contexts
Impact: You can serve 2–4× more users on the same GPU compared to naive inference.
2. Excellent for Multi-User, Multi-Tenant Systems
Continuous batching
Dynamic request scheduling
Works well for chatbots, copilots, and agent backends
Impact: Tail latency remains stable under load.
3. OpenAI-Compatible API
Drop-in replacement for:
/v1/chat/completions
/v1/completions
/v1/embeddings
Impact: Minimal app-side changes when migrating from OpenAI.
4. Strong Hugging Face Model Support
Native HF model loading
Supports FP16 / BF16
Works with AWQ and GPTQ
Impact: You can deploy most modern open-source LLMs quickly.
5. Active Development & Adoption
Used by:
Open-source GenAI stacks
Research labs
Early enterprise platforms
Rapid feature velocity (LoRA serving, speculative decoding, etc.)
6. Production-Grade Scheduling
Handles:
Long prompts
Mixed prompt lengths
Streaming + non-streaming users
Impact: Better real-world behavior than static batching engines.
❌ Cons of vLLM
1. NVIDIA-Centric (Major Limitation)
CUDA required
No native Apple Metal / MLX backend
Weak fit for Mac-only inference
Implication: Not suitable if your infra is Apple Silicon or CPU-heavy.
2. Not Optimized for GGUF Ecosystem
GGUF is not a first-class citizen
Conversion often required
llama.cpp / Ollama handle GGUF better
Implication: vLLM is not ideal for the “local GGUF model zoo” workflow.
3. Higher Operational Complexity Than Ollama
Requires:
CUDA drivers
GPU memory planning
Proper scheduler tuning
Debugging can be non-trivial at scale
Implication: DevOps maturity is required for stable production use.
4. Memory Spikes Under Extreme Load
PagedAttention is efficient, but:
Very long contexts or too many concurrent users can still cause OOMs if the server is not tuned
Requires:
Context limits
Max tokens per request
Admission control
5. Limited Fine-Grained Control vs TensorRT-LLM
Less kernel-level tuning
Slightly higher latency than TensorRT-LLM on H100/A100
Implication: If latency per token is mission-critical, TensorRT-LLM can outperform vLLM.
6. Not a Training or Fine-Tuning Platform
Inference only
No native training loop
LoRA serving is supported, not LoRA training
vLLM in One Line
vLLM trades hardware specificity and operational simplicity for the best open-source LLM serving throughput available today.
Decision Guidance
vLLM is the right choice if:
You are serving many concurrent users
You want OpenAI API compatibility
You run NVIDIA GPUs
You are building a production GenAI backend
vLLM is not the best choice if:
You are Mac-only or CPU-only
You rely heavily on GGUF models
You want zero-config local inference
You need ultra-low latency on NVIDIA hardware
Comparative Snapshot
| | vLLM | Ollama | llama.cpp | TensorRT-LLM |
|---|---|---|---|---|
| Throughput | Excellent | Medium | Low–Med | Extreme |
| Ease of Setup | Medium | Very High | Medium | Low |
| GGUF Support | Weak | Excellent | Native | Poor |
| Apple Silicon | ❌ | ✅ | ✅ | ❌ |
| Production APIs | ✅ | ⚠️ | ❌ | ✅ |
| Latency Floor | High | Medium | Medium | Lowest |
How can I use vLLM?
Below is a practical, end-to-end guide to using vLLM, organized the way it is typically adopted in real GenAI systems.
1. Prerequisites
Hardware
NVIDIA GPU (required)
A10 / L4 / A100 / H100 preferred
CUDA ≥ 11.8
Sufficient VRAM (24 GB minimum recommended for 7–13B models)
Software
Python 3.9–3.11
Linux (Ubuntu 20.04+ strongly recommended)
PyTorch (installed automatically by vLLM)
2. Installation
Option A: Pip (Most Common)
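A minimal install, assuming a Linux host with a recent NVIDIA driver (the wheel bundles a CUDA-enabled PyTorch):

```bash
pip install vllm
```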
Option B: Nightly (latest features)
Verify:
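A quick sanity check that the package imports and reports its version:

```bash
python -c "import vllm; print(vllm.__version__)"
```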
3. Usage Pattern #1 — Python SDK (Direct Inference)
Basic Text Generation
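A minimal sketch of offline generation with the Python SDK; the model name is only an example, and any HF-format model you have access to works the same way:

```python
from vllm import LLM, SamplingParams

# Load a Hugging Face model into GPU memory (model name is an example).
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="bfloat16")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# generate() accepts a list of prompts and batches them internally.
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)

for out in outputs:
    print(out.outputs[0].text)
```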
When to Use This
Offline inference
Batch jobs
Research or evaluation scripts
4. Usage Pattern #2 — OpenAI-Compatible API Server (Most Popular)
Start the Server
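A typical launch (the model name is an example; the server listens on port 8000 by default):

```bash
# Newer CLI; older releases use: python -m vllm.entrypoints.openai.api_server --model ...
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```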
Call It Like OpenAI
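A sketch using the standard OpenAI Python client pointed at the local server (the api_key is ignored unless you start vLLM with --api-key):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```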
Why This Is Powerful
Works with LangChain, LlamaIndex, CrewAI, etc.
Easy OpenAI migration
Supports streaming
5. Usage Pattern #3 — Docker (Production-Friendly)
Use this when:
Deploying on Kubernetes
Standardizing infra
CI/CD environments
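A typical invocation of the official image (image tag and model are examples; the HF cache mount avoids re-downloading weights):

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90
```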
6. Key Configuration Flags (Important)
| Flag | Purpose | Typical value |
|---|---|---|
| --gpu-memory-utilization | Fraction of VRAM to use | 0.85–0.92 |
| --max-model-len | Max context length | Match model spec |
| --dtype | Precision | bfloat16 |
| --tensor-parallel-size | Multi-GPU sharding | = GPU count |
| --max-num-seqs | Concurrent requests | Tune carefully |
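For illustration, a two-GPU deployment might combine the flags above like this (all values are starting points to tune against your hardware, not recommendations):

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2 \
  --max-num-seqs 128
```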
7. Using vLLM with RAG
vLLM is commonly used as the generation layer.
Works seamlessly with:
LangChain
LlamaIndex
Custom agent frameworks
No special adapters needed.
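A minimal sketch of the generation step in a RAG loop, assuming a vLLM server on localhost and a hypothetical retrieve() helper that stands in for your vector store:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def retrieve(query: str) -> list:
    # Hypothetical retriever: swap in FAISS, Qdrant, pgvector, etc.
    return ["vLLM uses PagedAttention to manage the KV cache in pages."]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        max_tokens=256,
    )
    return resp.choices[0].message.content

print(answer("How does vLLM manage GPU memory?"))
```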
8. LoRA / Adapter Serving
vLLM supports LoRA inference:
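A minimal sketch of serving a single adapter (the adapter name and path are examples):

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules finance=/adapters/finance-lora
```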
Use cases:
Per-customer personalization
Domain-specific behavior
Multi-tenant SaaS
9. Common Pitfalls (Important)
❌ Running on Apple Silicon
vLLM does not support Metal / MLX
❌ Ignoring Context Limits
Long prompts × many users = OOM
Always cap max_tokens per request and --max-model-len on the server
❌ Using GGUF Models
vLLM is HF-centric
Prefer Ollama / llama.cpp for GGUF
10. Recommended Usage Strategy (For You)
Given your work with:
Agent frameworks
Self-hosted LLMs
Production GenAI stacks
Recommended stack:
Local dev: Ollama
Staging / prod: vLLM
Heavy NVIDIA optimization: TensorRT-LLM
11. Mental Model
Think of vLLM as:
“Kubernetes for LLM tokens”: efficient scheduling, paging, and concurrency for generation.
Can I use vLLM in Mac Pro M4 Max?
Short answer: No. You cannot use vLLM on a Mac Pro with M4 Max for GPU acceleration.
Below is the precise technical reasoning, plus what you should use instead on Apple Silicon.
Why vLLM Does NOT Work on Mac M4 Max
1. vLLM Is CUDA-Only
vLLM depends on:
CUDA kernels
NVIDIA-specific GPU memory management
PagedAttention implemented in CUDA
Apple GPUs use Metal, not CUDA
Result: vLLM cannot access the M4 Max GPU.
2. No Metal / MLX Backend
vLLM does not support:
Metal Performance Shaders (MPS)
MLX (Apple’s ML framework)
There is no experimental or community Metal port of vLLM
Result: No GPU inference path exists on macOS.
3. CPU-Only vLLM Is Not Viable
While Python may import vLLM:
Performance is extremely poor
Memory usage is inefficient
Many features implicitly assume GPU KV cache
Practically: vLLM on CPU is unusable for real workloads.
What You SHOULD Use on Mac M4 Max Instead
✅ Best Options (Ranked)
1. Ollama (Recommended)


Why
Native Metal backend
Excellent GGUF support
Zero-configuration
Optimized for Apple Silicon memory bandwidth
Use when
Local development
Agent prototyping
Small to medium models
2. llama.cpp (Advanced Control)


Why
Direct Metal acceleration
Fine-grained control
Best raw efficiency on Mac
Use when
You want maximum performance tuning
You control quantization and memory
3. MLX (Apple’s Native Stack)


Why
Built specifically for Apple Silicon
Best long-term Apple-native path
Excellent for research and custom workflows
Use when
You are building Apple-first GenAI tooling
You want Python-native GPU execution
What NOT to Do
| Approach | Why it fails |
|---|---|
| vLLM on macOS | No CUDA |
| Docker vLLM on Mac | Still no GPU |
| Rosetta | No CUDA emulation |
| CPU-only vLLM | Unusable latency |
Practical Architecture for You
Given your setup and current work:
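A reasonable split, sketched as a flow: develop locally on the Mac with Ollama or MLX → point the same OpenAI-compatible client code at a remote NVIDIA box running vLLM for staging and production.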
This is the standard industry pattern.
If You Still Want vLLM Access from Mac
Option: Remote vLLM Server
Run vLLM on:
AWS (A10 / L4 / A100)
GCP
On-prem NVIDIA box
Access it from Mac via:
OpenAI-compatible API
LangChain / LlamaIndex
From your Mac, it behaves like OpenAI.
Bottom Line
vLLM and Apple Silicon are fundamentally incompatible today.
Use Ollama / llama.cpp / MLX locally, and vLLM remotely when you need production-grade throughput.
can I use vLLM for training a new model?
Short answer: No. vLLM cannot be used to train a new model.
Below is the precise explanation, followed by what you should use instead, depending on your training goal.
Why vLLM Is NOT a Training Framework
vLLM is strictly an inference and serving engine.
1. No Backpropagation
vLLM does not implement:
Gradient computation
Optimizers
Loss functions
It executes the forward pass only
Implication: You cannot update model weights.
2. PagedAttention Is Inference-Only
Designed for KV cache reuse
Assumes frozen model parameters
Backward pass would break its memory model
Implication: Its core innovation is incompatible with training.
3. No Training APIs
No .train() entry point
No dataset loaders
No checkpoint saving logic
vLLM intentionally avoids the training problem space.
What vLLM Can Do (Training-Adjacent)
| Capability | Supported |
|---|---|
| Model evaluation | Yes |
| Batch inference | Yes |
| Serving LoRA adapters | Yes |
| A/B testing models | Yes |
| Reward model inference | Yes |
Important: vLLM can serve LoRA adapters, but it cannot train them.
What to Use Instead (Correct Tools)
Scenario 1: Full Model Training (Pretraining / Continued Pretraining)
Use:
PyTorch + Transformers
DeepSpeed (ZeRO-2/3)
FSDP
Megatron-LM
Example stack:
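As an illustrative sketch only (package names, not pinned versions; train.py and ds_config.json are your own files):

```bash
pip install torch transformers datasets accelerate deepspeed

# Typical multi-GPU launch with a ZeRO config
deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
```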
Scenario 2: Fine-Tuning (Instruction / Domain)
Use:
Hugging Face Transformers
PEFT (LoRA, QLoRA)
TRL (SFT, DPO, PPO)
Typical flow:
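Sketched at a high level: fine-tune with Transformers + PEFT/TRL → save the adapter (or merged weights) in HF format → load the result into vLLM for serving.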
This is the correct vLLM lifecycle.
Scenario 3: Parameter-Efficient Fine-Tuning (LoRA / QLoRA)
Recommended:
PEFT + Accelerate
Unsloth (fast LoRA training)
Axolotl
After training:
Load adapters into vLLM for inference
Scenario 4: RLHF / Preference Optimization
Use:
TRL
DeepSpeed
Accelerate
vLLM can later be used to:
Serve the reward model
Serve the policy model for evaluation
Canonical Workflow (Industry Standard)
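Sketched end to end: train or fine-tune (PyTorch / Transformers / PEFT / TRL) → evaluate (vLLM batch inference works well here) → export HF-format weights or adapters → serve with vLLM behind an OpenAI-compatible API.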
Why vLLM Should NOT Be Extended for Training
Training optimizes compute utilization
vLLM optimizes memory reuse & scheduling
These goals conflict architecturally
Trying to train with vLLM would:
Reduce performance
Break correctness
Add extreme complexity
Bottom Line
vLLM is inference-only by design. Train elsewhere, deploy with vLLM.
Design a LoRA training → vLLM serving pipeline
Below is a production-grade LoRA training → vLLM serving pipeline, structured the way this is actually built in modern GenAI platforms.
I will keep it implementation-focused, not theoretical.
LoRA Training → vLLM Serving Pipeline
1. High-Level Architecture
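Sketched end to end: instruction dataset (JSONL) → LoRA training job (Transformers + PEFT) → adapter registry (HF Hub, S3, or a versioned directory) → vLLM server started with --enable-lora → API gateway → client applications.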
2. Step 1 — Base Model Selection
Choose a vLLM-friendly model
Recommended:
Llama-3 / Llama-3.1
Mistral / Mixtral
Qwen2
Example:
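For example, meta-llama/Meta-Llama-3.1-8B-Instruct (HF format, BF16 weights); any of the families above in HF format works the same way.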
Requirements
Hugging Face format
FP16 / BF16 weights
NOT GGUF
3. Step 2 — Dataset Format
vLLM imposes no dataset format here; the training stack does.
Instruction tuning (JSONL):
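A typical JSONL layout (the field names are a common convention for instruction tuning, not a vLLM requirement):

```json
{"instruction": "Summarize the quarterly report.", "input": "Revenue grew 12%...", "output": "Revenue rose 12% quarter over quarter, driven by..."}
{"instruction": "Classify the ticket priority.", "input": "Production API is down.", "output": "P1"}
```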
4. Step 3 — LoRA Training (PEFT + Transformers)
Environment
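A minimal environment, assuming a CUDA host (exact versions depend on your GPU and driver):

```bash
pip install transformers peft trl datasets accelerate bitsandbytes
```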
Training Script (Minimal but Correct)
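A minimal LoRA SFT sketch with PEFT + TRL. Model name, dataset path, and output directory are examples, and some trl argument names move between releases (e.g. into SFTConfig), so treat this as a starting point rather than a pinned recipe:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

BASE = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # example base model
DATA = "train.jsonl"                              # example dataset path

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA on the attention projections; add MLP modules if quality needs it.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

dataset = load_dataset("json", data_files=DATA, split="train")

def to_text(example):
    # Flatten instruction/input/output into a single training string.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = dataset.map(to_text)

args = TrainingArguments(
    output_dir="out/finance-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",     # newer trl versions take this via SFTConfig
    max_seq_length=2048,
    tokenizer=tokenizer,
)
trainer.train()

# Saves only the LoRA adapter (adapter_model.safetensors + adapter_config.json)
trainer.model.save_pretrained("out/finance-lora")
tokenizer.save_pretrained("out/finance-lora")
```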
Output Artifacts
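Continuing the sketch above, the output directory should contain something like:

```
out/finance-lora/
├── adapter_config.json
├── adapter_model.safetensors
├── tokenizer_config.json
└── tokenizer.json
```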
5. Step 4 — Store Adapters (Critical)
You have three safe options:
Option A: Hugging Face Hub (Recommended)
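Continuing the sketch above; the repo id is a placeholder and assumes you have run huggingface-cli login:

```python
# Pushes the adapter and tokenizer to a (private) Hub repo
trainer.model.push_to_hub("your-org/finance-lora-v1")
tokenizer.push_to_hub("your-org/finance-lora-v1")
```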
Option B: Object Storage
S3 / GCS / MinIO
Option C: Local Registry
Versioned directory
Hash-based naming
6. Step 5 — Serve with vLLM (LoRA Enabled)
Start vLLM with LoRA Support
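A sketch of a LoRA-enabled launch; adapter name, path, and limits are examples to adjust:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 16 \
  --lora-modules finance_v1=/adapters/finance-lora
```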
Multi-LoRA (Multi-Tenant SaaS)
The client (or, in production, the gateway) selects an adapter by passing its registered name as the model field, as shown in the call in the next step.
7. Step 6 — OpenAI-Compatible API Usage
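An illustrative call against the server started above; the adapter name finance_v1 comes from that example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "finance_v1",
        "messages": [{"role": "user", "content": "Summarize Q3 revenue drivers."}],
        "max_tokens": 200
      }'
```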
8. Production Hardening Checklist
Training Side
Freeze base model weights
Validate loss convergence
Version adapters (semantic versioning)
Save tokenizer version
Serving Side
Cap max tokens
Limit concurrent LoRA adapters
Monitor VRAM per adapter
Admission control
9. Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| OOM during serve | Too many LoRAs | Reduce --max-lora-rank |
| Bad output | Wrong target modules | Use q_proj, v_proj |
| Adapter ignored | Wrong name | Match the registered lora_adapter name |
| Slow inference | Large r or many adapters | Use r ≤ 16 |
10. Advanced Enhancements (Optional)
QLoRA for low-VRAM training
Unsloth for 2–5× faster LoRA training
Speculative decoding in vLLM
A/B adapter routing at API gateway
Canary LoRA rollout
Final Mental Model
Train once, swap adapters forever. vLLM turns LoRA into a production feature, not an experiment.
A QLoRA + Unsloth config
Below is a battle-tested QLoRA + Unsloth configuration, aligned for LoRA → vLLM serving and optimized for speed, VRAM efficiency, and stability.
This is not a toy example; it mirrors real production fine-tuning setups.
QLoRA + Unsloth Configuration
1. When to Use This Stack
Use QLoRA + Unsloth if:
You have limited VRAM (24–48 GB)
You want fast fine-tuning (2–5× speedup)
You plan to serve with vLLM later
You want minimal engineering overhead
2. Hardware & Software Assumptions
Hardware
NVIDIA GPU (A10 / L4 / A100 / H100)
≥ 24 GB VRAM (7–8B models)
NVMe storage recommended
Software
Python 3.9–3.11
CUDA 11.8+
PyTorch 2.1+
3. Install Dependencies
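A minimal install sketch; Unsloth ships patched kernels and its extras vary by CUDA/torch version, so check the Unsloth README if this fails:

```bash
pip install unsloth
pip install --upgrade transformers peft trl datasets bitsandbytes
```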
Verify:
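```bash
python -c "import torch, unsloth; print(torch.__version__, torch.cuda.is_available())"
```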
4. Base Model Selection (Critical)
Choose models known to work well with Unsloth and vLLM:
Recommended:
unsloth/Llama-3.1-8B-Instruct
unsloth/Mistral-7B-Instruct
unsloth/Qwen2-7B-Instruct
Avoid:
GGUF
Exotic architectures
Non-HF formats
5. Dataset Format (Instruction Tuning)
Unsloth prefers single-text-column datasets.
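For example, one JSON object per line with a single text field:

```json
{"text": "### Instruction:\nSummarize the contract clause.\n\n### Response:\nThe clause limits liability to direct damages."}
{"text": "### Instruction:\nClassify the ticket priority.\n\n### Response:\nP2"}
```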
6. QLoRA + Unsloth Training Script
Complete, Optimized Script
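A QLoRA + Unsloth sketch. Model name, dataset path, and output directory are examples, and some Unsloth/trl keyword arguments shift between releases, so verify the exact kwargs against the current docs:

```python
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

MAX_SEQ_LEN = 4096

# Load the base model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",  # example from the list above
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
    dtype=None,  # auto-select bf16/fp16
)

# Attach LoRA adapters to attention + MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
    random_state=42,
)

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        output_dir="out/qlora-adapter",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=2,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()

# Save only the LoRA adapter (what vLLM will load with --enable-lora)
model.save_pretrained("out/qlora-adapter")
tokenizer.save_pretrained("out/qlora-adapter")
```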
7. Output Artifacts
These are directly compatible with vLLM.
8. Validation (Before Serving)
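A quick smoke test before serving: load the base model plus adapter with PEFT and generate once (paths match the sketch above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "unsloth/Llama-3.1-8B-Instruct"   # same base used for training
ADAPTER = "out/qlora-adapter"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)

prompt = "### Instruction:\nSummarize the contract clause.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```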
9. Serve with vLLM
10. Production Tuning Guidelines
| Setting | Recommendation |
|---|---|
| LoRA rank (r) | 8–16 |
| Target modules | q/v/o projections + MLP |
| Learning rate | 2e-4 |
| Context length | ≤ 4096 |
| Adapters per GPU | ≤ 6 |
11. Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| Loss explodes | LR too high | Reduce to 1e-4 |
| Slow training | No gradient checkpointing | Enable it |
| Poor output | Missing MLP targets | Include gate/up/down projections |
| OOM | Batch too large | Increase gradient accumulation |
12. Why This Works Well with vLLM
Base model remains untouched
LoRA adapters are lightweight
vLLM hot-swaps adapters efficiently
High throughput with multi-tenant adapters
Final Recommendation
QLoRA + Unsloth for training, vLLM for serving is currently one of the best cost–performance stacks in GenAI.
Adapter routing via API Gateway
Below is a clean, production-ready design for LoRA adapter routing via an API Gateway, aligned with vLLM serving and multi-tenant GenAI systems.
This pattern is widely used in SaaS copilots, internal platforms, and agent backends.
Adapter Routing via API Gateway (vLLM + LoRA)
1. Problem Statement
You have:
One base LLM (e.g., Llama-3.1-8B)
Multiple LoRA adapters (finance, legal, hr, customer-specific, A/B variants)
You want:
Dynamic adapter selection per request
Centralized auth, policy, and routing
Zero client-side complexity
2. High-Level Architecture
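Sketched as a flow: client app → API gateway (auth, tenant lookup, policy, adapter routing) → vLLM server (base model + registered LoRA adapters) → response streamed back through the gateway.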
3. Core Routing Principle
The gateway decides the adapter. vLLM simply executes it.
Clients should never choose adapters directly in production.
4. vLLM Configuration (Static)
Start vLLM with all allowed adapters:
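A sketch of a multi-adapter launch; adapter names and paths are examples:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --lora-modules \
    finance=/adapters/finance \
    legal=/adapters/legal \
    hr=/adapters/hr \
    acme_corp=/adapters/acme_corp
```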
Adapters are now addressable by name.
5. Gateway Routing Strategies
Strategy A — Tenant-Based Routing (Most Common)
Use case
Multi-tenant SaaS
Per-customer fine-tuning
Gateway logic: look up the tenant's adapter, then inject that adapter name into the outgoing vLLM request, as in the sketch below.
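A minimal gateway-side sketch; tenant ids and adapter names are illustrative and match the launch example above:

```python
from typing import Optional

# Tenant → adapter mapping maintained by the platform, not by clients
TENANT_ADAPTERS = {
    "acme_corp": "acme_corp",   # customer-specific LoRA
    "globex":    "finance",     # shared domain LoRA
}

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def resolve_adapter(tenant_id: str) -> Optional[str]:
    # Fall back to the base model when no adapter is registered
    return TENANT_ADAPTERS.get(tenant_id)

def to_vllm_payload(tenant_id: str, messages: list) -> dict:
    adapter = resolve_adapter(tenant_id)
    return {
        # vLLM accepts a registered LoRA name as the "model"
        "model": adapter or BASE_MODEL,
        "messages": messages,
        "max_tokens": 512,
    }
```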
Strategy B — Feature / Product Routing
Use case
One app, multiple GenAI features
Examples
| Feature | Adapter |
|---|---|
| Financial analysis | finance |
| Contract review | legal |
| HR Q&A | hr |
Gateway rule
Strategy C — User Tier / Plan Routing
Use case
Free vs Pro vs Enterprise
| Tier | Adapter |
|---|---|
| Free | base (no LoRA) |
| Pro | domain_lora |
| Enterprise | customer_lora |
Gateway may omit lora_adapter entirely for base users.
Strategy D — A/B Testing / Canary Releases
Use case
Safe rollout of new LoRA
Gateway logic
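A sketch of a deterministic canary split (adapter names and percentage are examples); hashing the user id keeps each user on the same variant:

```python
import hashlib

STABLE_ADAPTER = "finance_v1"
CANARY_ADAPTER = "finance_v2"   # new LoRA under test
CANARY_PERCENT = 10

def pick_adapter(user_id: str) -> str:
    # Stable per-user bucket in [0, 100)
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANARY_ADAPTER if bucket < CANARY_PERCENT else STABLE_ADAPTER
```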
No vLLM restart required.
6. Gateway Request Transformation
Incoming Client Request (Clean API)
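The client-facing API can stay adapter-agnostic, for example:

```json
{
  "feature": "contract_review",
  "messages": [
    {"role": "user", "content": "Review the indemnification clause."}
  ]
}
```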
Outgoing vLLM Request (Internal)
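The gateway rewrites it into an OpenAI-style payload with the resolved adapter as the model (names are illustrative):

```json
{
  "model": "legal",
  "messages": [
    {"role": "user", "content": "Review the indemnification clause."}
  ],
  "max_tokens": 512
}
```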
7. OpenAI-Compatible Call (Gateway → vLLM)
8. Policy Enforcement at Gateway (Critical)
Do this at the gateway, not in vLLM
Enforce:
Max tokens per tier
Max context length
Rate limits
Adapter allow-lists
Prompt sanitization
Example
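A minimal policy sketch (tier limits and adapter allow-list are illustrative):

```python
TIER_MAX_TOKENS = {"free": 256, "pro": 1024, "enterprise": 4096}
ALLOWED_ADAPTERS = {"finance", "legal", "hr"}

def enforce_policy(tier: str, adapter, payload: dict) -> dict:
    if adapter is not None and adapter not in ALLOWED_ADAPTERS:
        raise ValueError("adapter not allowed")
    # Clamp the client-requested max_tokens to the tier ceiling
    payload["max_tokens"] = min(payload.get("max_tokens", 512), TIER_MAX_TOKENS[tier])
    return payload
```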
9. Observability & Cost Attribution
Log per request:
tenant_id
adapter_name
tokens_in / tokens_out
latency
GPU ID
This enables:
Cost per tenant
Adapter ROI analysis
Decommissioning bad LoRAs
10. Failure Handling
| Failure | Mitigation |
|---|---|
| Adapter missing | Fall back to base |
| Adapter OOM | Retry with base |
| vLLM down | Circuit breaker |
| Canary regression | Auto-disable the adapter |
11. Security Model (Important)
Never allow:
Client-supplied lora_adapter values
Arbitrary adapter names
Adapter path injection
Adapters are server-side assets, not API features.
12. Reference Gateway Stack
Common choices:
FastAPI / Flask
Kong / Envoy
AWS API Gateway + Lambda
NGINX + Lua
13. Minimal FastAPI Gateway Example
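A minimal sketch, assuming a vLLM OpenAI-compatible server reachable at the internal URL below; tenant ids, adapter names, and routes are examples:

```python
import httpx
from fastapi import FastAPI, Header, HTTPException

VLLM_URL = "http://vllm:8000/v1/chat/completions"
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
TENANT_ADAPTERS = {"acme_corp": "acme_corp", "globex": "finance"}

app = FastAPI()

@app.post("/v1/chat")
async def chat(body: dict, x_tenant_id: str = Header(...)):
    # Adapter resolution happens server-side; clients never pick adapters.
    adapter = TENANT_ADAPTERS.get(x_tenant_id)
    payload = {
        "model": adapter or BASE_MODEL,
        "messages": body.get("messages", []),
        # Simple policy enforcement at the gateway
        "max_tokens": min(body.get("max_tokens", 512), 1024),
    }
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(VLLM_URL, json=payload)
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="vLLM upstream error")
    return resp.json()
```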
14. Mental Model
Adapters are a routing concern, not a model concern.
vLLM executes. The gateway decides which brain to use.
Final Recommendation
For production:
Centralize adapter routing
Treat adapters like deployable artifacts
Version, monitor, and sunset them
Never expose adapter control to clients