vLLM
What is vLLM?
Core Problem It Solves
Key Concepts in vLLM
1. PagedAttention (Core Innovation)
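The core idea can be illustrated with a small toy sketch (bookkeeping only, not vLLM's real GPU implementation): the KV cache is split into fixed-size blocks, each sequence keeps a block table instead of one contiguous buffer, so memory is claimed on demand and returned block by block when a request finishes.

```python
# Toy sketch of the block-table idea behind PagedAttention (bookkeeping only).
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Pool of fixed-size KV-cache blocks shared by all sequences."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """A request holds a block table, not one contiguous KV region."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:      # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def finish(self) -> None:
        for block in self.block_table:             # blocks go straight back to the pool
            self.allocator.release(block)
        self.block_table.clear()
```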
2. Continuous Batching
3. OpenAI-Compatible API
4. Multi-Model & Multi-Tenant Serving
5. Hugging Face & Quantization Support
When Should You Use vLLM?
vLLM Architecture (High Level)
Major Alternatives to vLLM
1. TensorRT-LLM (NVIDIA)


2. TGI (Text Generation Inference – Hugging Face)
3. LMDeploy (OpenMMLab)


4. Ollama


5. llama.cpp


6. DeepSpeed-Inference (Microsoft)


Summary Comparison
Engine | Best For | Throughput | Ease | Hardware
Practical Recommendation (for you)
vLLM – Pros and Cons
✅ Pros of vLLM
1. Industry-Leading Throughput (Primary Advantage)
2. Excellent for Multi-User, Multi-Tenant Systems
3. OpenAI-Compatible API
4. Strong Hugging Face Model Support
5. Active Development & Adoption
6. Production-Grade Scheduling
❌ Cons of vLLM
1. NVIDIA-Centric (Major Limitation)
2. Not Optimized for GGUF Ecosystem
3. Higher Operational Complexity Than Ollama
4. Memory Spikes Under Extreme Load
5. Limited Fine-Grained Control vs TensorRT-LLM
6. Not a Training or Fine-Tuning Platform
vLLM in One Line
Decision Guidance
vLLM is the right choice if:
vLLM is not the best choice if:
Comparative Snapshot
Area | vLLM | Ollama | llama.cpp | TensorRT-LLM
1. Prerequisites
Hardware
Software
2. Installation
Option A: Pip (Most Common)
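On a Linux machine with an NVIDIA GPU and a recent CUDA driver, installation is typically a single command (a recent Python is assumed):

```bash
pip install vllm
```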
Option B: Nightly (latest features)
3. Usage Pattern #1 — Python SDK (Direct Inference)
Basic Text Generation
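A minimal offline-inference sketch with the Python SDK; the model name is only an example and can be any Hugging Face causal LM you are licensed to download:

```python
from vllm import LLM, SamplingParams

# Example model; any compatible HF causal LM works here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)

for out in outputs:
    print(out.outputs[0].text)
```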
When to Use This
4. Usage Pattern #2 — OpenAI-Compatible API Server (Most Popular)
Start the Server
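A typical launch looks like this (model name and port are examples; older releases expose the same server via `python -m vllm.entrypoints.openai.api_server`):

```bash
# Starts an OpenAI-compatible server at http://localhost:8000/v1
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
```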
Call It Like OpenAI
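Any OpenAI SDK can talk to it by overriding the base URL; the API key below is a placeholder, since vLLM only checks it if you start the server with `--api-key`:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(resp.choices[0].message.content)
```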
Why This Is Powerful
5. Usage Pattern #3 — Docker (Production-Friendly)
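A representative run with the official image (model and cache mount are examples):

```bash
# Flags after the image name are passed straight to the vLLM server.
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct
```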
6. Key Configuration Flags (Important)
Flag | Purpose | Recommendation
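As one illustrative combination on a single 24 GB GPU (values are starting points, not universal recommendations):

```bash
# --max-model-len            hard cap on prompt + generated tokens per request
# --gpu-memory-utilization   fraction of VRAM vLLM may pre-allocate (weights + KV cache)
# --tensor-parallel-size     number of GPUs to shard the model across
# --dtype                    weight/activation precision
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 1 \
  --dtype bfloat16
```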
7. Using vLLM with RAG
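A minimal sketch of the pattern: retrieval happens entirely outside vLLM, and the retrieved text is packed into the prompt sent to the OpenAI-compatible endpoint. The naive keyword retriever below is a stand-in for a real vector store.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stand-in for a vector store: naive keyword overlap over an in-memory corpus.
DOCS = [
    "vLLM uses PagedAttention to manage the KV cache in fixed-size blocks.",
    "Continuous batching admits new requests into a running batch between decode steps.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    words = set(query.lower().split())
    return sorted(DOCS, key=lambda d: -len(words & set(d.lower().split())))[:k]

question = "How does vLLM manage KV-cache memory?"
context = "\n".join(retrieve(question))

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)
print(resp.choices[0].message.content)
```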
8. LoRA / Adapter Serving
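Serving a single adapter looks roughly like this (paths and names are examples); the multi-adapter setup is covered in the LoRA pipeline section further down:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support-bot=/adapters/support-bot
```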
9. Common Pitfalls (Important)
❌ Running on Apple Silicon
❌ Ignoring Context Limits
❌ Using GGUF Models
10. Recommended Usage Strategy (For You)
11. Mental Model
Why vLLM Does NOT Work on Mac M4 Max
1. vLLM Is CUDA-Only
2. No Metal / MLX Backend
3. CPU-Only vLLM Is Not Viable
What You SHOULD Use on Mac M4 Max Instead
✅ Best Options (Ranked)
1. Ollama (Recommended)


2. llama.cpp (Advanced Control)


3. MLX (Apple’s Native Stack)


What NOT to Do
Approach | Why
Practical Architecture for You
If You Still Want vLLM Access from Mac
Option: Remote vLLM Server
Bottom Line
Why vLLM Is NOT a Training Framework
1. No Backpropagation
2. PagedAttention Is Inference-Only
3. No Training APIs
What vLLM Can Do (Training-Adjacent)
Capability | Supported
What to Use Instead (Correct Tools)
Scenario 1: Full Model Training (Pretraining / Continued Pretraining)
Scenario 2: Fine-Tuning (Instruction / Domain)
Scenario 3: Parameter-Efficient Fine-Tuning (LoRA / QLoRA)
Scenario 4: RLHF / Preference Optimization
Canonical Workflow (Industry Standard)
Why vLLM Should NOT Be Extended for Training
Bottom Line
LoRA Training → vLLM Serving Pipeline
1. High-Level Architecture
2. Step 1 — Base Model Selection
3. Step 2 — Dataset Format
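A common instruction-tuning layout is one JSON object per line (JSON Lines); the field names are a convention you choose, as long as the training script and prompt template agree with them:

```json
{"instruction": "Summarize the refund policy.", "input": "", "output": "Refunds are issued within 14 days of purchase..."}
{"instruction": "Classify the ticket priority.", "input": "Checkout page returns a 500 error.", "output": "high"}
```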
4. Step 3 — LoRA Training (PEFT + Transformers)
Environment
Training Script (Minimal but Correct)
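A minimal sketch with PEFT + Transformers, assuming the JSONL format above; the base model, target modules, and hyperparameters are illustrative starting points, not prescriptions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # example base model

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")

# Attach low-rank adapters to the attention projections; base weights stay frozen.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)

# Expects a JSONL file with "instruction"/"output" fields (see dataset step above).
ds = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(example):
    text = f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    return tokenizer(text, truncation=True, max_length=2048)

ds = ds.map(tokenize, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Saves only the adapter (adapter_model.safetensors + adapter_config.json), not the base.
model.save_pretrained("lora-out/adapter")
tokenizer.save_pretrained("lora-out/adapter")
```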
Output Artifacts
5. Step 4 — Store Adapters (Critical)
Option A: Hugging Face Hub (Recommended)
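Uploading the adapter directory as-is is enough for vLLM to pull it later; the repo id below is a placeholder and a write token must already be configured:

```python
from huggingface_hub import HfApi

api = HfApi()
# Placeholder repo id; requires `huggingface-cli login` or an HF_TOKEN env var.
api.create_repo("your-org/support-bot-lora", private=True, exist_ok=True)
api.upload_folder(folder_path="lora-out/adapter", repo_id="your-org/support-bot-lora")
```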
Option B: Object Storage
Option C: Local Registry
6. Step 5 — Serve with vLLM (LoRA Enabled)
Start vLLM with LoRA Support
Multi-LoRA (Multi-Tenant SaaS)
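Several adapters can share one resident copy of the base model; names and paths below are examples, and the caps on concurrent adapters and rank are tunable:

```bash
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --max-lora-rank 32 \
  --lora-modules \
    tenant-a=/adapters/tenant-a \
    tenant-b=/adapters/tenant-b \
    tenant-c=/adapters/tenant-c
```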
7. Step 6 — OpenAI-Compatible API Usage
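Clients select an adapter simply by passing its registered name in the `model` field (here `tenant-a`, matching the serve command above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="tenant-a",  # adapter name registered via --lora-modules
    messages=[{"role": "user", "content": "Draft a welcome email for a new customer."}],
)
print(resp.choices[0].message.content)
```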
8. Production Hardening Checklist
Training Side
Serving Side
9. Common Failure Modes
Problem | Cause | Fix
10. Advanced Enhancements (Optional)
Final Mental Model
QLoRA + Unsloth Configuration
1. When to Use This Stack
2. Hardware & Software Assumptions
Hardware
Software
3. Install Dependencies
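Roughly, the stack is Unsloth plus the usual Hugging Face training libraries; the exact Unsloth install command varies with your CUDA and PyTorch versions, so treat this as an approximation and check its README:

```bash
pip install unsloth
pip install --upgrade transformers datasets peft trl bitsandbytes accelerate
```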
4. Base Model Selection (Critical)
5. Dataset Format (Instruction Tuning)
6. QLoRA + Unsloth Training Script
Complete, Optimized Script
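A sketch of the usual Unsloth + TRL pattern, assuming the JSONL instruction format from earlier; the pre-quantized base repo, LoRA rank, and trainer arguments are illustrative, and newer TRL releases move `dataset_text_field`/`max_seq_length` into `SFTConfig`:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Example pre-quantized 4-bit base; any Unsloth-supported model works.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# QLoRA: LoRA adapters on top of the frozen 4-bit base.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

ds = load_dataset("json", data_files="train.jsonl", split="train")

def to_text(ex):
    return {"text": f"### Instruction:\n{ex['instruction']}\n\n### Response:\n{ex['output']}"}

ds = ds.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=ds,
    dataset_text_field="text",   # newer TRL versions expect this inside SFTConfig
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="qlora-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()

# Save only the LoRA adapter for later serving.
model.save_pretrained("qlora-out/adapter")
tokenizer.save_pretrained("qlora-out/adapter")
```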
7. Output Artifacts
8. Validation (Before Serving)
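A quick smoke test with Transformers + PEFT before handing the adapter to vLLM (paths, base model, and prompt are examples; the base must match the family the adapter was trained against):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # example full-precision base
tokenizer = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")

# Load the freshly trained adapter on top of the base model.
model = PeftModel.from_pretrained(base, "qlora-out/adapter")

prompt = "### Instruction:\nSummarize the refund policy.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```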
9. Serve with vLLM
10. Production Tuning Guidelines
Parameter | Recommended
11. Common Failure Modes
Issue | Cause | Fix
12. Why This Works Well with vLLM
Final Recommendation
Adapter Routing via API Gateway (vLLM + LoRA)
1. Problem Statement
2. High-Level Architecture
3. Core Routing Principle
4. vLLM Configuration (Static)
5. Gateway Routing Strategies
Strategy A — Tenant-Based Routing (Most Common)
Strategy B — Feature / Product Routing
Feature | Adapter
Strategy C — User Tier / Plan Routing
Tier | Adapter
Strategy D — A/B Testing / Canary Releases
6. Gateway Request Transformation
Incoming Client Request (Clean API)
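For example, the client-facing request stays generic: an alias in the `model` field and the tenant's own API key in the auth header (both conventions are illustrative, not part of the OpenAI spec):

```json
{
  "model": "assistant",
  "messages": [{"role": "user", "content": "Draft a welcome email."}]
}
```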
Outgoing vLLM Request (Internal)
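The gateway resolves the caller's tenant and rewrites only the `model` field to the adapter name registered with vLLM, leaving the rest of the payload untouched:

```json
{
  "model": "tenant-a",
  "messages": [{"role": "user", "content": "Draft a welcome email."}]
}
```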
7. OpenAI-Compatible Call (Gateway → vLLM)
8. Policy Enforcement at Gateway (Critical)
Enforce:
Example
9. Observability & Cost Attribution
10. Failure Handling
Scenario | Gateway Action
11. Security Model (Important)
12. Reference Gateway Stack
13. Minimal FastAPI Gateway Example
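A sketch under simple assumptions (header-based API keys, a static tenant-to-adapter map, one vLLM backend at an internal URL); it is a starting point, not a hardened gateway:

```python
import httpx
from fastapi import FastAPI, HTTPException, Request

VLLM_URL = "http://vllm:8000/v1/chat/completions"  # internal vLLM endpoint (example)

# Tenant API key -> LoRA adapter name registered via --lora-modules (example data).
TENANT_ADAPTERS = {
    "key-tenant-a": "tenant-a",
    "key-tenant-b": "tenant-b",
}

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat(request: Request):
    api_key = request.headers.get("authorization", "").removeprefix("Bearer ").strip()
    adapter = TENANT_ADAPTERS.get(api_key)
    if adapter is None:
        raise HTTPException(status_code=401, detail="Unknown tenant")

    payload = await request.json()
    payload["model"] = adapter  # policy: a tenant can only ever hit its own adapter

    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(VLLM_URL, json=payload)

    if upstream.status_code != 200:
        raise HTTPException(status_code=502, detail="vLLM backend error")
    return upstream.json()
```

Run it with `uvicorn gateway:app --port 9000` and point OpenAI clients at the gateway instead of at vLLM directly.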
14. Mental Model
Final Recommendation