vLLM
vLLM is a high-performance inference and serving engine for Large Language Models (LLMs), designed to maximize GPU utilization, reduce latency, and enable efficient multi-user, production-grade GenAI systems.
What is vLLM?
vLLM (Virtual Large Language Model) is an open-source LLM inference engine developed by researchers at UC Berkeley. It focuses on serving, not training.
Core Problem It Solves
Traditional LLM serving wastes GPU memory due to:
Fixed-size KV caches
Fragmentation during long or concurrent requests
Inefficient batching
vLLM solves this with PagedAttention.
Key Concepts in vLLM
1. PagedAttention (Core Innovation)
KV cache is stored in non-contiguous GPU memory pages
Similar to virtual memory paging in operating systems
Eliminates memory fragmentation
Enables higher throughput with more concurrent users
Result:
2–4× higher throughput
Lower tail latency
More requests per GPU
2. Continuous Batching
Requests are dynamically added/removed during inference
No need to wait for batch boundaries
Ideal for chatbots and APIs
3. OpenAI-Compatible API
vLLM can expose:
/v1/chat/completions
/v1/completions
/v1/embeddings
This makes it a drop-in replacement for OpenAI API in self-hosted setups.
4. Multi-Model & Multi-Tenant Serving
Serve multiple models on the same GPU
Efficient scheduling across users
Used in SaaS and internal GenAI platforms
5. Hugging Face & Quantization Support
Supports HF models directly
Works with:
FP16 / BF16
AWQ
GPTQ
Some GGUF via conversion (not native focus)
When Should You Use vLLM?
vLLM is ideal if you are:
Building production LLM APIs
Serving chatbots with many concurrent users
Running RAG systems at scale
Hosting enterprise GenAI platforms
Replacing OpenAI with self-hosted inference
It is not intended for:
Training models
Edge devices
Ultra-low-RAM laptops
vLLM Architecture (High Level)
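At a high level, a request flows roughly like this (a simplified sketch, not the exact internal module names): request queue → continuous-batching scheduler → PagedAttention KV-cache manager → GPU model execution (CUDA) → streamed token output.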
Major Alternatives to vLLM
Below are the practical and widely used alternatives, categorized by use case.
1. TensorRT-LLM (NVIDIA)


Best for: Maximum NVIDIA GPU performance
Pros
Lowest latency on NVIDIA GPUs
Deep CUDA and kernel fusion optimizations
Used in large-scale enterprise deployments
Cons
NVIDIA-only
Complex build process
Less flexible than vLLM
Choose TensorRT-LLM if
You control the GPU fleet (A100/H100)
Latency is mission-critical
2. TGI (Text Generation Inference – Hugging Face)
Best for: Stable, production-ready HF serving
Pros
Official Hugging Face solution
Strong model compatibility
Streaming, batching, token limits
Cons
Lower throughput than vLLM
Less memory-efficient KV handling
Choose TGI if
You live fully in the HF ecosystem
You want simplicity over peak performance
3. LMDeploy (OpenMMLab)


Best for: Lightweight, fast inference
Pros
Efficient memory usage
Fast startup
Supports Turbomind backend
Cons
Smaller ecosystem
Fewer production features than vLLM
4. Ollama


Best for: Local development and laptops
Pros
Extremely easy to use
Excellent for Mac (Metal, MLX)
GGUF-native
Cons
Not built for high concurrency
Not enterprise-scale
Choose Ollama if
You are developing locally
You want fast iteration, not scale
5. llama.cpp


Best for: Edge devices and CPU inference
Pros
Runs anywhere (CPU, Metal, CUDA)
GGUF ecosystem
Minimal dependencies
Cons
Lower throughput
No advanced scheduler like vLLM
6. DeepSpeed-Inference (Microsoft)


Best for: Large multi-GPU clusters
Pros
Tensor parallelism
ZeRO optimizations
Enterprise-grade
Cons
Heavy configuration
Slower iteration than vLLM
Summary Comparison
| Engine | Best for | Throughput | Ease of use | Hardware |
|---|---|---|---|---|
| vLLM | High-concurrency APIs | Very High | Medium | CUDA |
| TensorRT-LLM | Lowest latency | Extreme | Low | NVIDIA |
| TGI | Stable HF serving | High | High | CUDA |
| LMDeploy | Lightweight serving | High | Medium | CUDA |
| Ollama | Local dev | Medium | Very High | CPU/Metal |
| llama.cpp | Edge / CPU | Low–Medium | Medium | Any |
| DeepSpeed-Inference | Clusters | High | Low | CUDA |
Practical Recommendation (for you)
Given your work on:
Self-hosted LLMs
Agent systems
Production GenAI platforms
vLLM should be your default inference layer, with:
Ollama / llama.cpp for local prototyping
TensorRT-LLM if you later optimize for NVIDIA-only latency
What's the pros and cons of vLLM?
Below is a concise but production-oriented evaluation of vLLM, framed the way it is typically assessed in real GenAI platform decisions.
vLLM – Pros and Cons
✅ Pros of vLLM
1. Industry-Leading Throughput (Primary Advantage)
PagedAttention eliminates KV-cache fragmentation
Enables far more concurrent requests per GPU
Sustains high tokens/sec even with long contexts
Impact: You can serve 2–4× more users on the same GPU compared to naive inference.
2. Excellent for Multi-User, Multi-Tenant Systems
Continuous batching
Dynamic request scheduling
Works well for chatbots, copilots, and agent backends
Impact: Tail latency remains stable under load.
3. OpenAI-Compatible API
Drop-in replacement for:
/v1/chat/completions
/v1/completions
/v1/embeddings
Impact: Minimal app-side changes when migrating from OpenAI.
4. Strong Hugging Face Model Support
Native HF model loading
Supports FP16 / BF16
Works with AWQ and GPTQ
Impact: You can deploy most modern open-source LLMs quickly.
5. Active Development & Adoption
Used by:
Open-source GenAI stacks
Research labs
Early enterprise platforms
Rapid feature velocity (LoRA serving, speculative decoding, etc.)
6. Production-Grade Scheduling
Handles:
Long prompts
Mixed prompt lengths
Streaming + non-streaming users
Impact: Better real-world behavior than static batching engines.
❌ Cons of vLLM
1. NVIDIA-Centric (Major Limitation)
CUDA required
No native Apple Metal / MLX backend
Weak fit for Mac-only inference
Implication: Not suitable if your infra is Apple Silicon or CPU-heavy.
2. Not Optimized for GGUF Ecosystem
GGUF is not a first-class citizen
Conversion often required
llama.cpp / Ollama handle GGUF better
Implication: vLLM is not ideal for the “local GGUF model zoo” workflow.
3. Higher Operational Complexity Than Ollama
Requires:
CUDA drivers
GPU memory planning
Proper scheduler tuning
Debugging can be non-trivial at scale
Implication: DevOps maturity is required for stable production use.
4. Memory Spikes Under Extreme Load
PagedAttention is efficient, but:
Very long contexts or too many concurrent users can still cause OOMs if the server is not tuned
Requires:
Context limits
Max tokens per request
Admission control
5. Limited Fine-Grained Control vs TensorRT-LLM
Less kernel-level tuning
Slightly higher latency than TensorRT-LLM on H100/A100
Implication: If latency per token is mission-critical, TensorRT-LLM can outperform vLLM.
6. Not a Training or Fine-Tuning Platform
Inference only
No native training loop
LoRA serving is supported, not LoRA training
vLLM in One Line
vLLM trades hardware specificity and operational simplicity for the best open-source LLM serving throughput available today.
Decision Guidance
vLLM is the right choice if:
You are serving many concurrent users
You want OpenAI API compatibility
You run NVIDIA GPUs
You are building a production GenAI backend
vLLM is not the best choice if:
You are Mac-only or CPU-only
You rely heavily on GGUF models
You want zero-config local inference
You need ultra-low latency on NVIDIA hardware
Comparative Snapshot
| | vLLM | Ollama | llama.cpp | TensorRT-LLM |
|---|---|---|---|---|
| Throughput | Excellent | Medium | Low–Med | Extreme |
| Ease of Setup | Medium | Very High | Medium | Low |
| GGUF Support | Weak | Excellent | Native | Poor |
| Apple Silicon | ❌ | ✅ | ✅ | ❌ |
| Production APIs | ✅ | ⚠️ | ❌ | ✅ |
| Latency Floor | High | Medium | Medium | Lowest |
How can I use vLLM?
Below is a practical, end-to-end guide to using vLLM, organized the way it is typically adopted in real GenAI systems.
1. Prerequisites
Hardware
NVIDIA GPU (required)
A10 / L4 / A100 / H100 preferred
CUDA ≥ 11.8
Sufficient VRAM (24 GB minimum recommended for 7–13B models)
Software
Python 3.9–3.11
Linux (Ubuntu 20.04+ strongly recommended)
PyTorch (installed automatically by vLLM)
2. Installation
Option A: Pip (Most Common)
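A minimal install, assuming a Linux host with a recent NVIDIA driver (the wheel bundles a CUDA-enabled PyTorch):

```bash
pip install vllm
```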
Option B: Nightly (latest features)
Verify:
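A quick sanity check that the package imports and reports its version:

```bash
python -c "import vllm; print(vllm.__version__)"
```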
3. Usage Pattern #1 — Python SDK (Direct Inference)
Basic Text Generation
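A minimal sketch of offline generation with the Python SDK; the model name is only an example, and any HF-format model you have access to works the same way:

```python
from vllm import LLM, SamplingParams

# Load a Hugging Face model into GPU memory (model name is an example).
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="bfloat16")

params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

# generate() accepts a list of prompts and batches them internally.
outputs = llm.generate(["Explain PagedAttention in two sentences."], params)

for out in outputs:
    print(out.outputs[0].text)
```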
When to Use This
Offline inference
Batch jobs
Research or evaluation scripts
4. Usage Pattern #2 — OpenAI-Compatible API Server (Most Popular)
Start the Server
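A typical launch (the model name is an example; the server listens on port 8000 by default):

```bash
# Newer CLI; older releases use: python -m vllm.entrypoints.openai.api_server --model ...
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```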
Call It Like OpenAI
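A sketch using the standard OpenAI Python client pointed at the local server (the api_key is ignored unless you start vLLM with --api-key):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # must match the served model
    messages=[{"role": "user", "content": "Summarize what vLLM does."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```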
Why This Is Powerful
Works with LangChain, LlamaIndex, CrewAI, etc.
Easy OpenAI migration
Supports streaming
5. Usage Pattern #3 — Docker (Production-Friendly)
Use this when:
Deploying on Kubernetes
Standardizing infra
CI/CD environments
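A typical invocation of the official image (image tag and model are examples; the HF cache mount avoids re-downloading weights):

```bash
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.90
```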
6. Key Configuration Flags (Important)
| Flag | Purpose | Typical value |
|---|---|---|
| --gpu-memory-utilization | Fraction of VRAM to use | 0.85–0.92 |
| --max-model-len | Max context length | Match model spec |
| --dtype | Precision | bfloat16 |
| --tensor-parallel-size | Multi-GPU sharding | = GPU count |
| --max-num-seqs | Concurrent requests | Tune carefully |
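For illustration, a two-GPU deployment might combine the flags above like this (all values are starting points to tune against your hardware, not recommendations):

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --tensor-parallel-size 2 \
  --max-num-seqs 128
```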
7. Using vLLM with RAG
vLLM is commonly used as the generation layer.
Works seamlessly with:
LangChain
LlamaIndex
Custom agent frameworks
No special adapters needed.
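A minimal sketch of the generation step in a RAG loop, assuming a vLLM server on localhost and a hypothetical retrieve() helper that stands in for your vector store:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def retrieve(query: str) -> list:
    # Hypothetical retriever: swap in FAISS, Qdrant, pgvector, etc.
    return ["vLLM uses PagedAttention to manage the KV cache in pages."]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
        max_tokens=256,
    )
    return resp.choices[0].message.content

print(answer("How does vLLM manage GPU memory?"))
```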
8. LoRA / Adapter Serving
vLLM supports LoRA inference:
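A minimal sketch of serving a single adapter (the adapter name and path are examples):

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules finance=/adapters/finance-lora
```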
Use cases:
Per-customer personalization
Domain-specific behavior
Multi-tenant SaaS
9. Common Pitfalls (Important)
❌ Running on Apple Silicon
vLLM does not support Metal / MLX
❌ Ignoring Context Limits
Long prompts × many users = OOM
Always cap max_tokens per request and --max-model-len on the server
❌ Using GGUF Models
vLLM is HF-centric
Prefer Ollama / llama.cpp for GGUF
10. Recommended Usage Strategy (For You)
Given your work with:
Agent frameworks
Self-hosted LLMs
Production GenAI stacks
Recommended stack:
Local dev: Ollama
Staging / prod: vLLM
Heavy NVIDIA optimization: TensorRT-LLM
11. Mental Model
Think of vLLM as:
“Kubernetes for LLM tokens”: efficient scheduling, paging, and concurrency for generation.
Can I use vLLM in Mac Pro M4 Max?
Short answer: No. You cannot use vLLM on a Mac Pro with M4 Max for GPU acceleration.
Below is the precise technical reasoning, plus what you should use instead on Apple Silicon.
Why vLLM Does NOT Work on Mac M4 Max
1. vLLM Is CUDA-Only
vLLM depends on:
CUDA kernels
NVIDIA-specific GPU memory management
PagedAttention implemented in CUDA
Apple GPUs use Metal, not CUDA
Result: vLLM cannot access the M4 Max GPU.
2. No Metal / MLX Backend
vLLM does not support:
Metal Performance Shaders (MPS)
MLX (Apple’s ML framework)
There is no experimental or community Metal port of vLLM
Result: No GPU inference path exists on macOS.
3. CPU-Only vLLM Is Not Viable
While Python may import vLLM:
Performance is extremely poor
Memory usage is inefficient
Many features implicitly assume GPU KV cache
Practically: vLLM on CPU is unusable for real workloads.
What You SHOULD Use on Mac M4 Max Instead
✅ Best Options (Ranked)
1. Ollama (Recommended)


Why
Native Metal backend
Excellent GGUF support
Zero-configuration
Optimized for Apple Silicon memory bandwidth
Use when
Local development
Agent prototyping
Small to medium models
2. llama.cpp (Advanced Control)


Why
Direct Metal acceleration
Fine-grained control
Best raw efficiency on Mac
Use when
You want maximum performance tuning
You control quantization and memory
3. MLX (Apple’s Native Stack)


Why
Built specifically for Apple Silicon
Best long-term Apple-native path
Excellent for research and custom workflows
Use when
You are building Apple-first GenAI tooling
You want Python-native GPU execution
What NOT to Do
| Approach | Why it fails |
|---|---|
| vLLM on macOS | No CUDA |
| Docker vLLM on Mac | Still no GPU |
| Rosetta | No CUDA emulation |
| CPU-only vLLM | Unusable latency |
Practical Architecture for You
Given your setup and current work:
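A reasonable split, sketched as a flow: develop locally on the Mac with Ollama or MLX → point the same OpenAI-compatible client code at a remote NVIDIA box running vLLM for staging and production.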
This is the standard industry pattern.
If You Still Want vLLM Access from Mac
Option: Remote vLLM Server
Run vLLM on:
AWS (A10 / L4 / A100)
GCP
On-prem NVIDIA box
Access it from Mac via:
OpenAI-compatible API
LangChain / LlamaIndex
From your Mac, it behaves like OpenAI.
Bottom Line
vLLM and Apple Silicon are fundamentally incompatible today.
Use Ollama / llama.cpp / MLX locally, and vLLM remotely when you need production-grade throughput.
can I use vLLM for training a new model?
Short answer: No. vLLM cannot be used to train a new model.
Below is the precise explanation, followed by what you should use instead, depending on your training goal.
Why vLLM Is NOT a Training Framework
vLLM is strictly an inference and serving engine.
1. No Backpropagation
vLLM does not implement:
Gradient computation
Optimizers
Loss functions
It executes the forward pass only
Implication: You cannot update model weights.
2. PagedAttention Is Inference-Only
Designed for KV cache reuse
Assumes frozen model parameters
Backward pass would break its memory model
Implication: Its core innovation is incompatible with training.
3. No Training APIs
No .train() entry point
No dataset loaders
No checkpoint saving logic
vLLM intentionally avoids the training problem space.
What vLLM Can Do (Training-Adjacent)
| Capability | Supported |
|---|---|
| Model evaluation | Yes |
| Batch inference | Yes |
| Serving LoRA adapters | Yes |
| A/B testing models | Yes |
| Reward model inference | Yes |
Important: vLLM can serve LoRA adapters, but it cannot train them.
What to Use Instead (Correct Tools)
Scenario 1: Full Model Training (Pretraining / Continued Pretraining)
Use:
PyTorch + Transformers
DeepSpeed (ZeRO-2/3)
FSDP
Megatron-LM
Example stack:
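As an illustrative sketch only (package names, not pinned versions; train.py and ds_config.json are your own files):

```bash
pip install torch transformers datasets accelerate deepspeed

# Typical multi-GPU launch with a ZeRO config
deepspeed --num_gpus 8 train.py --deepspeed ds_config.json
```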
Scenario 2: Fine-Tuning (Instruction / Domain)
Use:
Hugging Face Transformers
PEFT (LoRA, QLoRA)
TRL (SFT, DPO, PPO)
Typical flow:
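Sketched at a high level: fine-tune with Transformers + PEFT/TRL → save the adapter (or merged weights) in HF format → load the result into vLLM for serving.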
This is the correct vLLM lifecycle.
Scenario 3: Parameter-Efficient Fine-Tuning (LoRA / QLoRA)
Recommended:
PEFT + Accelerate
Unsloth (fast LoRA training)
Axolotl
After training:
Load adapters into vLLM for inference
Scenario 4: RLHF / Preference Optimization
Use:
TRL
DeepSpeed
Accelerate
vLLM can later be used to:
Serve the reward model
Serve the policy model for evaluation
Canonical Workflow (Industry Standard)
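Sketched end to end: train or fine-tune (PyTorch / Transformers / PEFT / TRL) → evaluate (vLLM batch inference works well here) → export HF-format weights or adapters → serve with vLLM behind an OpenAI-compatible API.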
Why vLLM Should NOT Be Extended for Training
Training optimizes compute utilization
vLLM optimizes memory reuse & scheduling
These goals conflict architecturally
Trying to train with vLLM would:
Reduce performance
Break correctness
Add extreme complexity
Bottom Line
vLLM is inference-only by design. Train elsewhere, deploy with vLLM.
Design a LoRA training → vLLM serving pipeline
Below is a production-grade LoRA training → vLLM serving pipeline, structured the way this is actually built in modern GenAI platforms.
I will keep it implementation-focused, not theoretical.
LoRA Training → vLLM Serving Pipeline
1. High-Level Architecture
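Sketched end to end: instruction dataset (JSONL) → LoRA training job (Transformers + PEFT) → adapter registry (HF Hub, S3, or a versioned directory) → vLLM server started with --enable-lora → API gateway → client applications.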
2. Step 1 — Base Model Selection
Choose a vLLM-friendly model
Recommended:
Llama-3 / Llama-3.1
Mistral / Mixtral
Qwen2
Example:
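For example, meta-llama/Meta-Llama-3.1-8B-Instruct (HF format, BF16 weights); any of the families above in HF format works the same way.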
Requirements
Hugging Face format
FP16 / BF16 weights
NOT GGUF
3. Step 2 — Dataset Format
vLLM imposes no dataset format here; the training stack does.
Instruction tuning (JSONL):
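A typical JSONL layout (the field names are a common convention for instruction tuning, not a vLLM requirement):

```json
{"instruction": "Summarize the quarterly report.", "input": "Revenue grew 12%...", "output": "Revenue rose 12% quarter over quarter, driven by..."}
{"instruction": "Classify the ticket priority.", "input": "Production API is down.", "output": "P1"}
```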
4. Step 3 — LoRA Training (PEFT + Transformers)
Environment
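A minimal environment, assuming a CUDA host (exact versions depend on your GPU and driver):

```bash
pip install transformers peft trl datasets accelerate bitsandbytes
```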
Training Script (Minimal but Correct)
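A minimal LoRA SFT sketch with PEFT + TRL. Model name, dataset path, and output directory are examples, and some trl argument names move between releases (e.g. into SFTConfig), so treat this as a starting point rather than a pinned recipe:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

BASE = "meta-llama/Meta-Llama-3.1-8B-Instruct"   # example base model
DATA = "train.jsonl"                              # example dataset path

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA on the attention projections; add MLP modules if quality needs it.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

dataset = load_dataset("json", data_files=DATA, split="train")

def to_text(example):
    # Flatten instruction/input/output into a single training string.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = dataset.map(to_text)

args = TrainingArguments(
    output_dir="out/finance-lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    num_train_epochs=2,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    dataset_text_field="text",     # newer trl versions take this via SFTConfig
    max_seq_length=2048,
    tokenizer=tokenizer,
)
trainer.train()

# Saves only the LoRA adapter (adapter_model.safetensors + adapter_config.json)
trainer.model.save_pretrained("out/finance-lora")
tokenizer.save_pretrained("out/finance-lora")
```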
Output Artifacts
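Continuing the sketch above, the output directory should contain something like:

```
out/finance-lora/
├── adapter_config.json
├── adapter_model.safetensors
├── tokenizer_config.json
└── tokenizer.json
```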
5. Step 4 — Store Adapters (Critical)
You have three safe options:
Option A: Hugging Face Hub (Recommended)
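Continuing the sketch above; the repo id is a placeholder and assumes you have run huggingface-cli login:

```python
# Pushes the adapter and tokenizer to a (private) Hub repo
trainer.model.push_to_hub("your-org/finance-lora-v1")
tokenizer.push_to_hub("your-org/finance-lora-v1")
```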
Option B: Object Storage
S3 / GCS / MinIO
Option C: Local Registry
Versioned directory
Hash-based naming
6. Step 5 — Serve with vLLM (LoRA Enabled)
Start vLLM with LoRA Support
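A sketch of a LoRA-enabled launch; adapter name, path, and limits are examples to adjust:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 16 \
  --lora-modules finance_v1=/adapters/finance-lora
```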
Multi-LoRA (Multi-Tenant SaaS)
The client (or, in production, the gateway) selects an adapter by passing its registered name as the model field, as shown in the call in the next step.
7. Step 6 — OpenAI-Compatible API Usage
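An illustrative call against the server started above; the adapter name finance_v1 comes from that example:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "finance_v1",
        "messages": [{"role": "user", "content": "Summarize Q3 revenue drivers."}],
        "max_tokens": 200
      }'
```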
8. Production Hardening Checklist
Training Side
Freeze base model weights
Validate loss convergence
Version adapters (semantic versioning)
Save tokenizer version
Serving Side
Cap max tokens
Limit concurrent LoRA adapters
Monitor VRAM per adapter
Admission control
9. Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| OOM during serve | Too many LoRAs | Reduce --max-lora-rank |
| Bad output | Wrong target modules | Use q_proj, v_proj |
| Adapter ignored | Wrong name | Match the registered lora_adapter name |
| Slow inference | Large r or many adapters | Use r ≤ 16 |
10. Advanced Enhancements (Optional)
QLoRA for low-VRAM training
Unsloth for 2–5× faster LoRA training
Speculative decoding in vLLM
A/B adapter routing at API gateway
Canary LoRA rollout
Final Mental Model
Train once, swap adapters forever. vLLM turns LoRA into a production feature, not an experiment.
A QLoRA + Unsloth config
Below is a battle-tested QLoRA + Unsloth configuration, aligned for LoRA → vLLM serving and optimized for speed, VRAM efficiency, and stability.
This is not a toy example; it mirrors real production fine-tuning setups.
QLoRA + Unsloth Configuration
1. When to Use This Stack
Use QLoRA + Unsloth if:
You have limited VRAM (24–48 GB)
You want fast fine-tuning (2–5× speedup)
You plan to serve with vLLM later
You want minimal engineering overhead
2. Hardware & Software Assumptions
Hardware
NVIDIA GPU (A10 / L4 / A100 / H100)
≥ 24 GB VRAM (7–8B models)
NVMe storage recommended
Software
Python 3.9–3.11
CUDA 11.8+
PyTorch 2.1+
3. Install Dependencies
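A minimal install sketch; Unsloth ships patched kernels and its extras vary by CUDA/torch version, so check the Unsloth README if this fails:

```bash
pip install unsloth
pip install --upgrade transformers peft trl datasets bitsandbytes
```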
Verify:
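```bash
python -c "import torch, unsloth; print(torch.__version__, torch.cuda.is_available())"
```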
4. Base Model Selection (Critical)
Choose models known to work well with Unsloth and vLLM:
Recommended:
unsloth/Llama-3.1-8B-Instruct
unsloth/Mistral-7B-Instruct
unsloth/Qwen2-7B-Instruct
Avoid:
GGUF
Exotic architectures
Non-HF formats
5. Dataset Format (Instruction Tuning)
Unsloth prefers single-text-column datasets.
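For example, one JSON object per line with a single text field:

```json
{"text": "### Instruction:\nSummarize the contract clause.\n\n### Response:\nThe clause limits liability to direct damages."}
{"text": "### Instruction:\nClassify the ticket priority.\n\n### Response:\nP2"}
```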
6. QLoRA + Unsloth Training Script
Complete, Optimized Script
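A QLoRA + Unsloth sketch. Model name, dataset path, and output directory are examples, and some Unsloth/trl keyword arguments shift between releases, so verify the exact kwargs against the current docs:

```python
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

MAX_SEQ_LEN = 4096

# Load the base model in 4-bit (QLoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.1-8B-Instruct",  # example from the list above
    max_seq_length=MAX_SEQ_LEN,
    load_in_4bit=True,
    dtype=None,  # auto-select bf16/fp16
)

# Attach LoRA adapters to attention + MLP projections
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
    random_state=42,
)

dataset = load_dataset("json", data_files="train.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LEN,
    args=TrainingArguments(
        output_dir="out/qlora-adapter",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        num_train_epochs=2,
        bf16=True,
        logging_steps=10,
        save_strategy="epoch",
    ),
)
trainer.train()

# Save only the LoRA adapter (what vLLM will load with --enable-lora)
model.save_pretrained("out/qlora-adapter")
tokenizer.save_pretrained("out/qlora-adapter")
```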
7. Output Artifacts
These are directly compatible with vLLM.
8. Validation (Before Serving)
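A quick smoke test before serving: load the base model plus adapter with PEFT and generate once (paths match the sketch above):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "unsloth/Llama-3.1-8B-Instruct"   # same base used for training
ADAPTER = "out/qlora-adapter"

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)

prompt = "### Instruction:\nSummarize the contract clause.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```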
9. Serve with vLLM
10. Production Tuning Guidelines
| Setting | Recommendation |
|---|---|
| LoRA rank (r) | 8–16 |
| Target modules | q/v/o projections + MLP |
| Learning rate | 2e-4 |
| Context length | ≤ 4096 |
| Adapters per GPU | ≤ 6 |
11. Common Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| Loss explodes | LR too high | Reduce to 1e-4 |
| Slow training | No gradient checkpointing | Enable it |
| Poor output | Missing MLP targets | Include gate/up/down projections |
| OOM | Batch too large | Increase gradient accumulation |
12. Why This Works Well with vLLM
Base model remains untouched
LoRA adapters are lightweight
vLLM hot-swaps adapters efficiently
High throughput with multi-tenant adapters
Final Recommendation
QLoRA + Unsloth for training, vLLM for serving is currently one of the best cost–performance stacks in GenAI.
Adapter routing via API Gateway
Below is a clean, production-ready design for LoRA adapter routing via an API Gateway, aligned with vLLM serving and multi-tenant GenAI systems.
This pattern is widely used in SaaS copilots, internal platforms, and agent backends.
Adapter Routing via API Gateway (vLLM + LoRA)
1. Problem Statement
You have:
One base LLM (e.g., Llama-3.1-8B)
Multiple LoRA adapters (finance, legal, hr, customer-specific, A/B variants)
You want:
Dynamic adapter selection per request
Centralized auth, policy, and routing
Zero client-side complexity
2. High-Level Architecture
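Sketched as a flow: client app → API gateway (auth, tenant lookup, policy, adapter routing) → vLLM server (base model + registered LoRA adapters) → response streamed back through the gateway.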
3. Core Routing Principle
The gateway decides the adapter. vLLM simply executes it.
Clients should never choose adapters directly in production.
4. vLLM Configuration (Static)
Start vLLM with all allowed adapters:
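A sketch of a multi-adapter launch; adapter names and paths are examples:

```bash
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --max-loras 8 \
  --lora-modules \
    finance=/adapters/finance \
    legal=/adapters/legal \
    hr=/adapters/hr \
    acme_corp=/adapters/acme_corp
```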
Adapters are now addressable by name.
5. Gateway Routing Strategies
Strategy A — Tenant-Based Routing (Most Common)
Use case
Multi-tenant SaaS
Per-customer fine-tuning
Gateway logic: look up the tenant's adapter, then inject that adapter name into the outgoing vLLM request, as in the sketch below.
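A minimal gateway-side sketch; tenant ids and adapter names are illustrative and match the launch example above:

```python
from typing import Optional

# Tenant → adapter mapping maintained by the platform, not by clients
TENANT_ADAPTERS = {
    "acme_corp": "acme_corp",   # customer-specific LoRA
    "globex":    "finance",     # shared domain LoRA
}

BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

def resolve_adapter(tenant_id: str) -> Optional[str]:
    # Fall back to the base model when no adapter is registered
    return TENANT_ADAPTERS.get(tenant_id)

def to_vllm_payload(tenant_id: str, messages: list) -> dict:
    adapter = resolve_adapter(tenant_id)
    return {
        # vLLM accepts a registered LoRA name as the "model"
        "model": adapter or BASE_MODEL,
        "messages": messages,
        "max_tokens": 512,
    }
```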
Strategy B — Feature / Product Routing
Use case
One app, multiple GenAI features
Examples
| Feature | Adapter |
|---|---|
| Financial analysis | finance |
| Contract review | legal |
| HR Q&A | hr |
Gateway rule
Strategy C — User Tier / Plan Routing
Use case
Free vs Pro vs Enterprise
| Tier | Adapter |
|---|---|
| Free | base (no LoRA) |
| Pro | domain_lora |
| Enterprise | customer_lora |
Gateway may omit lora_adapter entirely for base users.
Strategy D — A/B Testing / Canary Releases
Use case
Safe rollout of new LoRA
Gateway logic
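A sketch of a deterministic canary split (adapter names and percentage are examples); hashing the user id keeps each user on the same variant:

```python
import hashlib

STABLE_ADAPTER = "finance_v1"
CANARY_ADAPTER = "finance_v2"   # new LoRA under test
CANARY_PERCENT = 10

def pick_adapter(user_id: str) -> str:
    # Stable per-user bucket in [0, 100)
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANARY_ADAPTER if bucket < CANARY_PERCENT else STABLE_ADAPTER
```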
No vLLM restart required.
6. Gateway Request Transformation
Incoming Client Request (Clean API)
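The client-facing API can stay adapter-agnostic, for example:

```json
{
  "feature": "contract_review",
  "messages": [
    {"role": "user", "content": "Review the indemnification clause."}
  ]
}
```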
Outgoing vLLM Request (Internal)
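The gateway rewrites it into an OpenAI-style payload with the resolved adapter as the model (names are illustrative):

```json
{
  "model": "legal",
  "messages": [
    {"role": "user", "content": "Review the indemnification clause."}
  ],
  "max_tokens": 512
}
```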
7. OpenAI-Compatible Call (Gateway → vLLM)
8. Policy Enforcement at Gateway (Critical)
Do this at the gateway, not in vLLM
Enforce:
Max tokens per tier
Max context length
Rate limits
Adapter allow-lists
Prompt sanitization
Example
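A minimal policy sketch (tier limits and adapter allow-list are illustrative):

```python
TIER_MAX_TOKENS = {"free": 256, "pro": 1024, "enterprise": 4096}
ALLOWED_ADAPTERS = {"finance", "legal", "hr"}

def enforce_policy(tier: str, adapter, payload: dict) -> dict:
    if adapter is not None and adapter not in ALLOWED_ADAPTERS:
        raise ValueError("adapter not allowed")
    # Clamp the client-requested max_tokens to the tier ceiling
    payload["max_tokens"] = min(payload.get("max_tokens", 512), TIER_MAX_TOKENS[tier])
    return payload
```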
9. Observability & Cost Attribution
Log per request:
tenant_id
adapter_name
tokens_in / tokens_out
latency
GPU ID
This enables:
Cost per tenant
Adapter ROI analysis
Decommissioning bad LoRAs
10. Failure Handling
| Failure | Mitigation |
|---|---|
| Adapter missing | Fall back to base |
| Adapter OOM | Retry with base |
| vLLM down | Circuit breaker |
| Canary regression | Auto-disable the adapter |
11. Security Model (Important)
Never allow:
Client-supplied lora_adapter values
Arbitrary adapter names
Adapter path injection
Adapters are server-side assets, not API features.
12. Reference Gateway Stack
Common choices:
FastAPI / Flask
Kong / Envoy
AWS API Gateway + Lambda
NGINX + Lua
13. Minimal FastAPI Gateway Example
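A minimal sketch, assuming a vLLM OpenAI-compatible server reachable at the internal URL below; tenant ids, adapter names, and routes are examples:

```python
import httpx
from fastapi import FastAPI, Header, HTTPException

VLLM_URL = "http://vllm:8000/v1/chat/completions"
BASE_MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"
TENANT_ADAPTERS = {"acme_corp": "acme_corp", "globex": "finance"}

app = FastAPI()

@app.post("/v1/chat")
async def chat(body: dict, x_tenant_id: str = Header(...)):
    # Adapter resolution happens server-side; clients never pick adapters.
    adapter = TENANT_ADAPTERS.get(x_tenant_id)
    payload = {
        "model": adapter or BASE_MODEL,
        "messages": body.get("messages", []),
        # Simple policy enforcement at the gateway
        "max_tokens": min(body.get("max_tokens", 512), 1024),
    }
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(VLLM_URL, json=payload)
    if resp.status_code != 200:
        raise HTTPException(status_code=502, detail="vLLM upstream error")
    return resp.json()
```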
14. Mental Model
Adapters are a routing concern, not a model concern.
vLLM executes. The gateway decides which brain to use.
Final Recommendation
For production:
Centralize adapter routing
Treat adapters like deployable artifacts
Version, monitor, and sunset them
Never expose adapter control to clients