vLLM

vLLM is a high-performance inference and serving engine for Large Language Models (LLMs), designed to maximize GPU utilization, reduce latency, and enable efficient multi-user, production-grade GenAI systems.


What is vLLM?

vLLM (Virtual Large Language Model) is an open-source LLM inference engine developed by researchers at UC Berkeley. It focuses on serving, not training.

Core Problem It Solves

Traditional LLM serving wastes GPU memory due to:

  • Fixed-size KV caches

  • Fragmentation during long or concurrent requests

  • Inefficient batching

vLLM solves this with PagedAttention.


Key Concepts in vLLM

1. PagedAttention (Core Innovation)

  • KV cache is stored in non-contiguous GPU memory pages

  • Similar to virtual memory paging in operating systems

  • Eliminates memory fragmentation

  • Enables higher throughput with more concurrent users

Result:

  • 2–4× higher throughput

  • Lower tail latency

  • More requests per GPU


2. Continuous Batching

  • Requests are dynamically added/removed during inference

  • No need to wait for batch boundaries

  • Ideal for chatbots and APIs


3. OpenAI-Compatible API

vLLM can expose:

  • /v1/chat/completions

  • /v1/completions

  • /v1/embeddings

This makes it a drop-in replacement for the OpenAI API in self-hosted setups.


4. Multi-Model & Multi-Tenant Serving

  • Serve multiple models on the same GPU

  • Efficient scheduling across users

  • Used in SaaS and internal GenAI platforms


5. Hugging Face & Quantization Support

  • Supports HF models directly

  • Works with:

    • FP16 / BF16

    • AWQ

    • GPTQ

    • Some GGUF via conversion (not native focus)


When Should You Use vLLM?

vLLM is ideal if you are:

  • Building production LLM APIs

  • Serving chatbots with many concurrent users

  • Running RAG systems at scale

  • Hosting enterprise GenAI platforms

  • Replacing OpenAI with self-hosted inference

It is not intended for:

  • Training models

  • Edge devices

  • Ultra-low-RAM laptops


vLLM Architecture (High Level)

At a high level: requests enter through the OpenAI-compatible API server, the engine's scheduler continuously batches them, the PagedAttention block manager allocates KV-cache pages on the GPU, and model workers run the forward pass and stream tokens back to clients.


Major Alternatives to vLLM

Below are the practical and widely used alternatives, categorized by use case.


1. TensorRT-LLM (NVIDIA)


Best for: Maximum NVIDIA GPU performance

Pros

  • Lowest latency on NVIDIA GPUs

  • Deep CUDA and kernel fusion optimizations

  • Used in large-scale enterprise deployments

Cons

  • NVIDIA-only

  • Complex build process

  • Less flexible than vLLM

Choose TensorRT-LLM if

  • You control the GPU fleet (A100/H100)

  • Latency is mission-critical


2. TGI (Text Generation Inference – Hugging Face)


Best for: Stable, production-ready HF serving

Pros

  • Official Hugging Face solution

  • Strong model compatibility

  • Streaming, batching, token limits

Cons

  • Lower throughput than vLLM

  • Less memory-efficient KV handling

Choose TGI if

  • You live fully in the HF ecosystem

  • You want simplicity over peak performance


3. LMDeploy (OpenMMLab)


Best for: Lightweight, fast inference

Pros

  • Efficient memory usage

  • Fast startup

  • Supports the TurboMind backend

Cons

  • Smaller ecosystem

  • Fewer production features than vLLM


4. Ollama


Best for: Local development and laptops

Pros

  • Extremely easy to use

  • Excellent on Apple Silicon (Metal acceleration)

  • GGUF-native

Cons

  • Not built for high concurrency

  • Not enterprise-scale

Choose Ollama if

  • You are developing locally

  • You want fast iteration, not scale


5. llama.cpp


Best for: Edge devices and CPU inference

Pros

  • Runs anywhere (CPU, Metal, CUDA)

  • GGUF ecosystem

  • Minimal dependencies

Cons

  • Lower throughput

  • No advanced scheduler like vLLM


6. DeepSpeed-Inference (Microsoft)


Best for: Large multi-GPU clusters

Pros

  • Tensor parallelism

  • ZeRO optimizations

  • Enterprise-grade

Cons

  • Heavy configuration

  • Slower iteration than vLLM


Summary Comparison

| Engine | Best For | Throughput | Ease | Hardware |
|---|---|---|---|---|
| vLLM | High-concurrency APIs | Very High | Medium | CUDA |
| TensorRT-LLM | Lowest latency | Extreme | Low | NVIDIA |
| TGI | Stable HF serving | High | High | CUDA |
| LMDeploy | Lightweight serving | High | Medium | CUDA |
| Ollama | Local dev | Medium | Very High | CPU/Metal |
| llama.cpp | Edge / CPU | Low–Medium | Medium | Any |
| DeepSpeed-Inference | Clusters | High | Low | CUDA |


Practical Recommendation (for you)

Given your work on:

  • Self-hosted LLMs

  • Agent systems

  • Production GenAI platforms

vLLM should be your default inference layer, with:

  • Ollama / llama.cpp for local prototyping

  • TensorRT-LLM if you later optimize for NVIDIA-only latency


What are the pros and cons of vLLM?

Below is a concise but production-oriented evaluation of vLLM, framed the way it is typically assessed in real GenAI platform decisions.


vLLM – Pros and Cons


✅ Pros of vLLM

1. Industry-Leading Throughput (Primary Advantage)

  • PagedAttention eliminates KV-cache fragmentation

  • Enables far more concurrent requests per GPU

  • Sustains high tokens/sec even with long contexts

Impact: You can serve 2–4× more users on the same GPU compared to naive inference.


2. Excellent for Multi-User, Multi-Tenant Systems

  • Continuous batching

  • Dynamic request scheduling

  • Works well for chatbots, copilots, and agent backends

Impact: Tail latency remains stable under load.


3. OpenAI-Compatible API

  • Drop-in replacement for:

    • /v1/chat/completions

    • /v1/completions

    • /v1/embeddings

Impact: Minimal app-side changes when migrating from OpenAI.


4. Strong Hugging Face Model Support

  • Native HF model loading

  • Supports FP16 / BF16

  • Works with AWQ and GPTQ

Impact: You can deploy most modern open-source LLMs quickly.


5. Active Development & Adoption

  • Used by:

    • Open-source GenAI stacks

    • Research labs

    • Early enterprise platforms

  • Rapid feature velocity (LoRA serving, speculative decoding, etc.)


6. Production-Grade Scheduling

  • Handles:

    • Long prompts

    • Mixed prompt lengths

    • Streaming + non-streaming users

Impact: Better real-world behavior than static batching engines.


❌ Cons of vLLM


1. NVIDIA-Centric (Major Limitation)

  • CUDA required

  • No native Apple Metal / MLX backend

  • Weak fit for Mac-only inference

Implication: Not suitable if your infra is Apple Silicon or CPU-heavy.


2. Not Optimized for GGUF Ecosystem

  • GGUF is not a first-class citizen

  • Conversion often required

  • llama.cpp / Ollama handle GGUF better

Implication: vLLM is not ideal for the “local GGUF model zoo” workflow.


3. Higher Operational Complexity Than Ollama

  • Requires:

    • CUDA drivers

    • GPU memory planning

    • Proper scheduler tuning

  • Debugging can be non-trivial at scale

Implication: DevOps maturity is required for stable production use.


4. Memory Spikes Under Extreme Load

  • PagedAttention is efficient, but OOMs can still occur without tuning when you combine:

    • Very long contexts

    • Too many concurrent users

Requires:

  • Context limits

  • Max tokens per request

  • Admission control


5. Limited Fine-Grained Control vs TensorRT-LLM

  • Less kernel-level tuning

  • Slightly higher latency than TensorRT-LLM on H100/A100

Implication: If latency per token is mission-critical, TensorRT-LLM can outperform vLLM.


6. Not a Training or Fine-Tuning Platform

  • Inference only

  • No native training loop

  • LoRA serving is supported, not LoRA training


vLLM in One Line

vLLM trades hardware specificity and operational simplicity for the best open-source LLM serving throughput available today.


Decision Guidance

vLLM is the right choice if:

  • You are serving many concurrent users

  • You want OpenAI API compatibility

  • You run NVIDIA GPUs

  • You are building a production GenAI backend

vLLM is not the best choice if:

  • You are Mac-only or CPU-only

  • You rely heavily on GGUF models

  • You want zero-config local inference

  • You need ultra-low latency on NVIDIA hardware


Comparative Snapshot

| Area | vLLM | Ollama | llama.cpp | TensorRT-LLM |
|---|---|---|---|---|
| Throughput | Excellent | Medium | Low–Med | Extreme |
| Ease of Setup | Medium | Very High | Medium | Low |
| GGUF Support | Weak | Excellent | Native | Poor |
| Apple Silicon | ❌ | ✅ | ✅ | ❌ |
| Production APIs | ✅ | ⚠️ | ⚠️ | ✅ |
| Latency Floor | Low | Medium | Medium | Lowest |


How can I use vLLM?

Below is a practical, end-to-end guide to using vLLM, organized the way it is typically adopted in real GenAI systems.


1. Prerequisites

Hardware

  • NVIDIA GPU (required)

    • A10 / L4 / A100 / H100 preferred

  • CUDA ≥ 11.8

  • Sufficient VRAM (24 GB minimum recommended for 7–13B models)

Software

  • Python 3.9–3.11

  • Linux (Ubuntu 20.04+ strongly recommended)

  • PyTorch (installed automatically by vLLM)


2. Installation

Option A: Pip (Most Common)
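A minimal sketch of the standard install, assuming a CUDA-enabled Linux host:

```bash
pip install vllm
```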

Option B: Nightly (latest features)
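If you need features ahead of the latest release, one option is installing from the main branch (exact nightly channels change over time, so treat this as an assumption to verify against the vLLM docs):

```bash
pip install git+https://github.com/vllm-project/vllm.git
```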

Verify:
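A quick sanity check that the package imports and a CUDA device is visible:

```bash
python -c "import vllm, torch; print(vllm.__version__, torch.cuda.is_available())"
```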


3. Usage Pattern #1 — Python SDK (Direct Inference)

Basic Text Generation
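A minimal offline-generation sketch using the vLLM Python API; the model name and sampling values are placeholders to adapt:

```python
from vllm import LLM, SamplingParams

# Load a Hugging Face model into vLLM's engine (downloads weights on first run).
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="bfloat16")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompts = [
    "Explain PagedAttention in two sentences.",
    "List three use cases for continuous batching.",
]

# generate() batches the prompts internally and returns one RequestOutput per prompt.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```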

When to Use This

  • Offline inference

  • Batch jobs

  • Research or evaluation scripts


4. Usage Pattern #2 — OpenAI-Compatible Server

Start the Server
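One way to launch the OpenAI-compatible server (the flags shown are illustrative starting points, not prescriptions):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --port 8000
```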

Call It Like OpenAI
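Because the server speaks the OpenAI protocol, the standard openai Python client works by pointing base_url at vLLM; the key is ignored unless you configure one:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what vLLM does in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```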

Why This Is Powerful

  • Works with LangChain, LlamaIndex, CrewAI, etc.

  • Easy OpenAI migration

  • Supports streaming


5. Usage Pattern #3 — Docker (Production-Friendly)

Use this when:

  • Deploying on Kubernetes

  • Standardizing infra

  • CI/CD environments
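A sketch using the official vLLM OpenAI-compatible image; the image tag, cache mount, and model are assumptions to adapt:

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90
```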


6. Key Configuration Flags (Important)

| Flag | Purpose | Recommendation |
|---|---|---|
| --gpu-memory-utilization | Fraction of VRAM | 0.85–0.92 |
| --max-model-len | Max context length | Match model spec |
| --dtype | Precision | bfloat16 |
| --tensor-parallel-size | Multi-GPU sharding | = GPU count |
| --max-num-seqs | Concurrent requests | Tune carefully |
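Putting the flags together, a representative launch for a 2-GPU node might look like this (values are starting points, not prescriptions):

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --tensor-parallel-size 2 \
  --max-num-seqs 128
```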


7. Using vLLM with RAG

vLLM is commonly used as the generation layer.

Works seamlessly with:

  • LangChain

  • LlamaIndex

  • Custom agent frameworks

No special adapters needed.
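A minimal sketch of vLLM as the generation layer in a RAG loop; retrieve() is a hypothetical placeholder for your vector-store lookup:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def answer(question: str, retrieve) -> str:
    # retrieve() is assumed to return a list of relevant text chunks.
    context = "\n\n".join(retrieve(question))
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content
```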


8. LoRA / Adapter Serving

vLLM supports LoRA inference:
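A sketch of enabling LoRA at serve time; adapter names and paths are placeholders:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules finance=/adapters/finance legal=/adapters/legal \
  --max-lora-rank 16
```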

Use cases:

  • Per-customer personalization

  • Domain-specific behavior

  • Multi-tenant SaaS


9. Common Pitfalls (Important)

❌ Running on Apple Silicon

  • vLLM does not support Metal / MLX

❌ Ignoring Context Limits

  • Long prompts × many users = OOM

  • Always cap max_tokens and max-model-len

❌ Using GGUF Models

  • vLLM is HF-centric

  • Prefer Ollama / llama.cpp for GGUF


10. Recommended Stack

Given your work with:

  • Agent frameworks

  • Self-hosted LLMs

  • Production GenAI stacks

Recommended stack:

  • Local dev: Ollama

  • Staging / prod: vLLM

  • Heavy NVIDIA optimization: TensorRT-LLM


11. Mental Model

Think of vLLM as:

“Kubernetes for LLM tokens”: efficient scheduling, paging, and concurrency for generation.


Can I use vLLM in Mac Pro M4 Max?

Short answer: No. You cannot use vLLM on a Mac Pro with M4 Max for GPU acceleration.

Below is the precise technical reasoning, plus what you should use instead on Apple Silicon.


Why vLLM Does NOT Work on Mac M4 Max

1. vLLM Is CUDA-Only

  • vLLM depends on:

    • CUDA kernels

    • NVIDIA-specific GPU memory management

    • PagedAttention implemented in CUDA

  • Apple GPUs use Metal, not CUDA

Result: vLLM cannot access the M4 Max GPU.


2. No Metal / MLX Backend

  • vLLM does not support:

    • Metal Performance Shaders (MPS)

    • MLX (Apple’s ML framework)

  • There is no experimental or community Metal port of vLLM

Result: No GPU inference path exists on macOS.


3. CPU-Only vLLM Is Not Viable

  • Even if vLLM imports in a CPU-only environment:

    • Performance is extremely poor

    • Memory usage is inefficient

    • Many features implicitly assume GPU KV cache

Practically: vLLM on CPU is unusable for real workloads.


What You SHOULD Use on Mac M4 Max Instead

✅ Best Options (Ranked)

1. Ollama (Recommended Default)

Why

  • Native Metal backend

  • Excellent GGUF support

  • Zero-configuration

  • Optimized for Apple Silicon memory bandwidth

Use when

  • Local development

  • Agent prototyping

  • Small to medium models


2. llama.cpp (Advanced Control)


Why

  • Direct Metal acceleration

  • Fine-grained control

  • Best raw efficiency on Mac

Use when

  • You want maximum performance tuning

  • You control quantization and memory


3. MLX (Apple’s Native Stack)


Why

  • Built specifically for Apple Silicon

  • Best long-term Apple-native path

  • Excellent for research and custom workflows

Use when

  • You are building Apple-first GenAI tooling

  • You want Python-native GPU execution


What NOT to Do

| Approach | Why |
|---|---|
| vLLM on macOS | No CUDA |
| Docker vLLM on Mac | Still no GPU |
| Rosetta | No CUDA emulation |
| CPU-only vLLM | Unusable latency |


Practical Architecture for You

Given your setup and current work: prototype locally on the M4 Max with Ollama, llama.cpp, or MLX, and run vLLM on a remote NVIDIA GPU host for anything production-grade.

This is the standard industry pattern.


If You Still Want vLLM Access from Mac

Option: Remote vLLM Server

  • Run vLLM on:

    • AWS (A10 / L4 / A100)

    • GCP

    • On-prem NVIDIA box

  • Access it from Mac via:

    • OpenAI-compatible API

    • LangChain / LlamaIndex

From your Mac, it behaves like OpenAI.


Bottom Line

vLLM and Apple Silicon are fundamentally incompatible today.

Use Ollama / llama.cpp / MLX locally, and vLLM remotely when you need production-grade throughput.


Can I use vLLM for training a new model?

Short answer: No. vLLM cannot be used to train a new model.

Below is the precise explanation, followed by what you should use instead, depending on your training goal.


Why vLLM Is NOT a Training Framework

vLLM is strictly an inference and serving engine.

1. No Backpropagation

  • vLLM does not implement:

    • Gradient computation

    • Optimizers

    • Loss functions

  • Forward pass only

Implication: You cannot update model weights.


2. PagedAttention Is Inference-Only

  • Designed for KV cache reuse

  • Assumes frozen model parameters

  • Backward pass would break its memory model

Implication: Its core innovation is incompatible with training.


3. No Training APIs

  • No .train()

  • No dataset loaders

  • No checkpoint saving logic

vLLM intentionally avoids the training problem space.


What vLLM Can Do (Training-Adjacent)

| Capability | Supported |
|---|---|
| Model evaluation | Yes |
| Batch inference | Yes |
| Serving LoRA adapters | Yes |
| A/B testing models | Yes |
| Reward model inference | Yes |

Important: vLLM can serve LoRA adapters, but it cannot train them.


What to Use Instead (Correct Tools)

Scenario 1: Full Model Training (Pretraining / Continued Pretraining)

Use:

  • PyTorch + Transformers

  • DeepSpeed (ZeRO-2/3)

  • FSDP

  • Megatron-LM

Example stack:
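One possible environment for this scenario (the exact package set depends on your parallelism strategy and is an assumption):

```bash
pip install torch transformers datasets accelerate deepspeed
```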


Scenario 2: Fine-Tuning (Instruction / Domain)

Use:

  • Hugging Face Transformers

  • PEFT (LoRA, QLoRA)

  • TRL (SFT, DPO, PPO)

Typical flow: fine-tune with Transformers + PEFT/TRL → save the LoRA adapter (or merged weights) → serve the result with vLLM.

This is the correct vLLM lifecycle.


Scenario 3: Parameter-Efficient Fine-Tuning (LoRA / QLoRA)

Recommended:

  • PEFT + Accelerate

  • Unsloth (fast LoRA training)

  • Axolotl

After training:

  • Load adapters into vLLM for inference


Scenario 4: RLHF / Preference Optimization

Use:

  • TRL

  • DeepSpeed

  • Accelerate

vLLM can later be used to:

  • Serve the reward model

  • Serve the policy model for evaluation


Canonical Workflow (Industry Standard)

Fine-tune with Transformers + PEFT/TRL → evaluate → export Hugging Face weights or LoRA adapters → serve with vLLM.


Why vLLM Should NOT Be Extended for Training

  • Training optimizes compute utilization

  • vLLM optimizes memory reuse & scheduling

  • These goals conflict architecturally

Trying to train with vLLM would:

  • Reduce performance

  • Break correctness

  • Add extreme complexity


Bottom Line

vLLM is inference-only by design. Train elsewhere, deploy with vLLM.


Design a LoRA training → vLLM serving pipeline

Below is a production-grade LoRA training → vLLM serving pipeline, structured the way this is actually built in modern GenAI platforms.

I will keep it implementation-focused, not theoretical.


LoRA Training → vLLM Serving Pipeline


1. High-Level Architecture

Dataset → LoRA training (Transformers + PEFT) → adapter artifacts → adapter storage/registry → vLLM server with --enable-lora → OpenAI-compatible API → applications.


2. Step 1 — Base Model Selection

Choose a vLLM-friendly model

Recommended:

  • Llama-3 / Llama-3.1

  • Mistral / Mixtral

  • Qwen2

Example: meta-llama/Llama-3.1-8B-Instruct (standard Hugging Face format).

Requirements

  • Hugging Face format

  • FP16 / BF16 weights

  • NOT GGUF


3. Step 2 — Dataset Format

vLLM imposes no dataset requirements here; the training stack does.

Instruction tuning (JSONL):
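A sketch of one record, assuming a simple instruction/response schema (the field names are up to your training stack):

```json
{"instruction": "Summarize the quarterly revenue report.", "input": "", "response": "Revenue grew 12% quarter over quarter, driven by subscription renewals."}
```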


4. Step 3 — LoRA Training (PEFT + Transformers)

Environment
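A typical package set (versions omitted here; pin them in practice):

```bash
pip install transformers peft datasets accelerate bitsandbytes
```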


Training Script (Minimal but Correct)
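A minimal, runnable sketch of LoRA SFT with PEFT + Transformers; the base model, JSONL field names (instruction/response), and hyperparameters are assumptions to adapt:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B-Instruct"            # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    base, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA on attention projections, matching the guidance in this guide.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

def tokenize(example):
    # Assumed JSONL fields: "instruction" and "response".
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=2048)

dataset = load_dataset("json", data_files="train.jsonl", split="train")
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

model.save_pretrained("lora-out")       # writes the adapter only, not the base weights
tokenizer.save_pretrained("lora-out")
```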


Output Artifacts
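The saved directory holds only lightweight adapter and tokenizer files (the base model is untouched), typically something like:

```bash
ls lora-out/
# adapter_config.json  adapter_model.safetensors  special_tokens_map.json  tokenizer_config.json  tokenizer.json
```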


5. Step 4 — Store Adapters (Critical)

You have a few safe options:

Option B: Object Storage

  • S3 / GCS / MinIO

Option C: Local Registry

  • Versioned directory

  • Hash-based naming


6. Step 5 — Serve with vLLM (LoRA Enabled)

Start vLLM with LoRA Support
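A sketch of a LoRA-enabled launch; adapter names, registry paths, and limits are placeholders:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules customer_a=/registry/customer_a/v1 \
  --max-lora-rank 16 \
  --max-loras 4
```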


Multi-LoRA (Multi-Tenant SaaS)

Client selects adapter:
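With vLLM's OpenAI-compatible server, a registered adapter is selected by sending its name in the model field (in production the gateway, not the end client, should set this):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="customer_a",  # LoRA module name registered via --lora-modules
    messages=[{"role": "user", "content": "Draft a renewal reminder email."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```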


7. Step 6 — OpenAI-Compatible API Usage
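The same call over plain HTTP, as a downstream service would issue it:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "customer_a",
        "messages": [{"role": "user", "content": "Draft a renewal reminder email."}],
        "max_tokens": 256
      }'
```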


8. Production Hardening Checklist

Training Side

  • Freeze base model weights

  • Validate loss convergence

  • Version adapters (semantic versioning)

  • Save tokenizer version

Serving Side

  • Cap max tokens

  • Limit concurrent LoRA adapters

  • Monitor VRAM per adapter

  • Admission control


9. Common Failure Modes

| Problem | Cause | Fix |
|---|---|---|
| OOM during serve | Too many LoRAs | Reduce --max-lora-rank |
| Bad output | Wrong target modules | Use q_proj, v_proj |
| Adapter ignored | Wrong name | Match lora_adapter |
| Slow inference | Large r or many adapters | Use r ≤ 16 |


10. Advanced Enhancements (Optional)

  • QLoRA for low-VRAM training

  • Unsloth for 2–5× faster LoRA training

  • Speculative decoding in vLLM

  • A/B adapter routing at API gateway

  • Canary LoRA rollout


Final Mental Model

Train once, swap adapters forever. vLLM turns LoRA into a production feature, not an experiment.


A QLoRA + Unsloth config

Below is a battle-tested QLoRA + Unsloth configuration, aligned for LoRA → vLLM serving and optimized for speed, VRAM efficiency, and stability.

This is not a toy example; it mirrors real production fine-tuning setups.


QLoRA + Unsloth Configuration


1. When to Use This Stack

Use QLoRA + Unsloth if:

  • You have limited VRAM (24–48 GB)

  • You want fast fine-tuning (2–5× speedup)

  • You plan to serve with vLLM later

  • You want minimal engineering overhead


2. Hardware & Software Assumptions

Hardware

  • NVIDIA GPU (A10 / L4 / A100 / H100)

  • ≥ 24 GB VRAM (7–8B models)

  • NVMe storage recommended

Software

  • Python 3.9–3.11

  • CUDA 11.8+

  • PyTorch 2.1+


3. Install Dependencies
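A typical install; Unsloth's recommended extras change frequently, so check its README for your CUDA/PyTorch combination:

```bash
pip install unsloth trl peft accelerate bitsandbytes
```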

Verify:
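Quick check that Unsloth imports and a CUDA GPU is visible:

```bash
python -c "import unsloth, torch; print(torch.cuda.is_available())"
```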


4. Base Model Selection (Critical)

Choose models known to work well with Unsloth and vLLM:

Recommended:

  • unsloth/Llama-3.1-8B-Instruct

  • unsloth/Mistral-7B-Instruct

  • unsloth/Qwen2-7B-Instruct

Avoid:

  • GGUF

  • Exotic architectures

  • Non-HF formats


5. Dataset Format (Instruction Tuning)

Unsloth prefers single-text-column datasets.
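A sketch of collapsing instruction-style records into the single text column Unsloth expects; the field names are assumptions:

```python
from datasets import load_dataset

def to_text(example):
    example["text"] = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return example

dataset = load_dataset("json", data_files="train.jsonl", split="train").map(to_text)
```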


6. QLoRA + Unsloth Training Script

Complete, Optimized Script
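A sketch of a QLoRA run with Unsloth; the model id, dataset, and hyperparameters are assumptions, and exact SFTTrainer argument names vary across trl versions:

```python
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

max_seq_length = 4096

# QLoRA: base weights loaded in 4-bit, LoRA adapters trained on top.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)

# Expects a dataset with a single "text" column (see the formatting step above).
dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        output_dir="qlora-out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
        optim="adamw_8bit",
    ),
)
trainer.train()

model.save_pretrained("qlora-out")      # LoRA adapter only; base weights untouched
tokenizer.save_pretrained("qlora-out")
```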


7. Output Artifacts
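After training, the adapter directory typically contains only lightweight files, for example:

```bash
ls qlora-out/
# adapter_config.json  adapter_model.safetensors  special_tokens_map.json  tokenizer_config.json  tokenizer.json
```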

These are directly compatible with vLLM.


8. Validation (Before Serving)
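A quick smoke test with Transformers + PEFT before putting the adapter behind vLLM; the prompt format mirrors the training template above:

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "qlora-out", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("qlora-out")

prompt = "### Instruction:\nIntroduce yourself in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```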


9. Serve with vLLM
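A sketch of serving the adapter; the base model must match the one used in training, and the adapter path/name are placeholders:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model unsloth/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules domain=./qlora-out \
  --max-lora-rank 16
```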


10. Production Tuning Guidelines

| Parameter | Recommended |
|---|---|
| LoRA rank (r) | 8–16 |
| Target modules | q, v, o + MLP |
| Learning rate | 2e-4 |
| Context length | ≤ 4096 |
| Adapters per GPU | ≤ 6 |


11. Common Failure Modes

| Issue | Cause | Fix |
|---|---|---|
| Loss explodes | LR too high | Reduce to 1e-4 |
| Slow training | No gradient checkpointing | Enable it |
| Poor output | Missing MLP targets | Include gate/up/down |
| OOM | Batch too large | Increase grad accumulation |


12. Why This Works Well with vLLM

  • Base model remains untouched

  • LoRA adapters are lightweight

  • vLLM hot-swaps adapters efficiently

  • High throughput with multi-tenant adapters


Final Recommendation

QLoRA + Unsloth for training, vLLM for serving is currently one of the best cost–performance stacks in GenAI.


Adapter routing via API Gateway

Below is a clean, production-ready design for LoRA adapter routing via an API Gateway, aligned with vLLM serving and multi-tenant GenAI systems.

This pattern is widely used in SaaS copilots, internal platforms, and agent backends.


Adapter Routing via API Gateway (vLLM + LoRA)


1. Problem Statement

You have:

  • One base LLM (e.g., Llama-3.1-8B)

  • Multiple LoRA adapters (finance, legal, hr, customer-specific, A/B variants)

You want:

  • Dynamic adapter selection per request

  • Centralized auth, policy, and routing

  • Zero client-side complexity


2. High-Level Architecture

Client → API Gateway (auth, tenant resolution, policy, adapter selection) → vLLM (base model + registered LoRA adapters) → response back through the gateway.


3. Core Routing Principle

The gateway decides the adapter. vLLM simply executes it.

Clients should never choose adapters directly in production.


4. vLLM Configuration (Static)

Start vLLM with all allowed adapters:
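A sketch of the static launch; adapter names and paths are placeholders:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules finance=/adapters/finance legal=/adapters/legal hr=/adapters/hr \
  --max-loras 4
```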

Adapters are now addressable by name.


5. Gateway Routing Strategies

Strategy A — Tenant-Based Routing (Most Common)

Use case

  • Multi-tenant SaaS

  • Per-customer fine-tuning

Gateway logic

The gateway resolves the tenant, looks up the tenant → adapter mapping, and injects the selected adapter into the request it forwards to vLLM, as sketched below.
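A minimal routing sketch; the tenant names and adapter names are hypothetical, and the adapter rides in the model field because that is how vLLM's OpenAI-compatible server addresses modules registered via --lora-modules:

```python
from typing import Optional

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
TENANT_ADAPTERS = {"acme": "acme_lora", "globex": "finance"}   # hypothetical mapping

def route(tenant_id: str, payload: dict) -> dict:
    adapter: Optional[str] = TENANT_ADAPTERS.get(tenant_id)
    # Registered LoRA module name selects the adapter; otherwise the base model runs.
    payload["model"] = adapter or BASE_MODEL
    return payload
```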


Strategy B — Feature / Product Routing

Use case

  • One app, multiple GenAI features

Examples

| Feature | Adapter |
|---|---|
| Financial analysis | finance |
| Contract review | legal |
| HR Q&A | hr |

Gateway rule
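A sketch of the corresponding rule; feature keys are assumptions of the gateway's public API:

```python
from typing import Optional

FEATURE_ADAPTERS = {"financial_analysis": "finance", "contract_review": "legal", "hr_qa": "hr"}

def adapter_for_feature(feature: str) -> Optional[str]:
    # Unknown features fall back to the base model (no adapter).
    return FEATURE_ADAPTERS.get(feature)
```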


Strategy C — User Tier / Plan Routing

Use case

  • Free vs Pro vs Enterprise

| Tier | Adapter |
|---|---|
| Free | base (no LoRA) |
| Pro | domain_lora |
| Enterprise | customer_lora |

Gateway may omit lora_adapter entirely for base users.


Strategy D — A/B Testing / Canary Releases

Use case

  • Safe rollout of new LoRA

Gateway logic
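A sketch of percentage-based canary routing at the gateway; the adapter names and the 10% split are assumptions:

```python
import hashlib

def choose_adapter(user_id: str, canary_percent: int = 10) -> str:
    # Deterministic per-user bucketing so a given user always sees the same variant.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "support_lora_v2" if bucket < canary_percent else "support_lora_v1"
```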

No vLLM restart required.


6. Gateway Request Transformation

Incoming Client Request (Clean API)
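A sketch of the public-facing payload; the feature field is an assumption of the gateway's own API and never names an adapter directly:

```json
{
  "feature": "contract_review",
  "messages": [{"role": "user", "content": "Flag risky clauses in this NDA."}],
  "max_tokens": 512
}
```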


Outgoing vLLM Request (Internal)
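The same request after the gateway resolves the adapter and applies policy caps; the adapter name rides in the model field:

```json
{
  "model": "legal",
  "messages": [{"role": "user", "content": "Flag risky clauses in this NDA."}],
  "max_tokens": 512
}
```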


7. OpenAI-Compatible Call (Gateway → vLLM)
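A sketch of the forwarding call from the gateway process; httpx is an arbitrary HTTP client choice, and the upstream URL is a placeholder:

```python
import httpx

VLLM_URL = "http://vllm:8000/v1/chat/completions"

async def forward(payload: dict) -> dict:
    async with httpx.AsyncClient(timeout=120) as client:
        response = await client.post(VLLM_URL, json=payload)
        response.raise_for_status()
        return response.json()
```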


8. Policy Enforcement at Gateway (Critical)

Do this at the gateway, not in vLLM

Enforce:

  • Max tokens per tier

  • Max context length

  • Rate limits

  • Adapter allow-lists

  • Prompt sanitization

Example
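A sketch of tier-based caps applied before forwarding; the limits are assumptions:

```python
TIER_LIMITS = {"free": 256, "pro": 1024, "enterprise": 4096}

def enforce(payload: dict, tier: str) -> dict:
    cap = TIER_LIMITS.get(tier, 256)
    payload["max_tokens"] = min(int(payload.get("max_tokens", cap)), cap)
    payload.pop("lora_adapter", None)   # never trust client-supplied adapter fields
    return payload
```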


9. Observability & Cost Attribution

Log per request:

  • tenant_id

  • adapter_name

  • tokens_in / tokens_out

  • latency

  • GPU ID

This enables:

  • Cost per tenant

  • Adapter ROI analysis

  • Decommissioning bad LoRAs


10. Failure Handling

| Scenario | Gateway Action |
|---|---|
| Adapter missing | Fallback to base |
| Adapter OOM | Retry with base |
| vLLM down | Circuit breaker |
| Canary regression | Auto-disable adapter |


11. Security Model (Important)

Never allow:

  • Client-supplied lora_adapter

  • Arbitrary adapter names

  • Adapter path injection

Adapters are server-side assets, not API features.


12. Reference Gateway Stack

Common choices:

  • FastAPI / Flask

  • Kong / Envoy

  • AWS API Gateway + Lambda

  • NGINX + Lua


13. Minimal FastAPI Gateway Example
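A minimal sketch tying the pieces together; the tenant header, mapping, caps, and upstream URL are all assumptions to adapt:

```python
from typing import Optional

import httpx
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

VLLM_URL = "http://vllm:8000/v1/chat/completions"
BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"
TENANT_ADAPTERS = {"acme": "acme_lora", "globex": "finance"}
MAX_TOKENS_CAP = 1024

@app.post("/v1/chat/completions")
async def chat(payload: dict, x_tenant_id: Optional[str] = Header(None)):
    if x_tenant_id is None:
        raise HTTPException(status_code=401, detail="missing tenant header")

    # Adapter selection is a server-side decision; ignore any client-supplied adapter fields.
    payload.pop("lora_adapter", None)
    payload["model"] = TENANT_ADAPTERS.get(x_tenant_id, BASE_MODEL)

    # Basic policy enforcement at the gateway.
    payload["max_tokens"] = min(int(payload.get("max_tokens", MAX_TOKENS_CAP)), MAX_TOKENS_CAP)

    async with httpx.AsyncClient(timeout=120) as client:
        upstream = await client.post(VLLM_URL, json=payload)

    if upstream.status_code != 200:
        raise HTTPException(status_code=502, detail="vLLM upstream error")
    return upstream.json()
```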


14. Mental Model

Adapters are a routing concern, not a model concern.

vLLM executes. The gateway decides which brain to use.


Final Recommendation

For production:

  • Centralize adapter routing

  • Treat adapters like deployable artifacts

  • Version, monitor, and sunset them

  • Never expose adapter control to clients

