Use the HF Inference API, or deploy with Accelerate + FastAPI (or TorchServe) on your own server

Once your custom model is pushed to the Hub, you have two common ways to serve it for real-time use:

✅ Option 1 – Plug & Play: Hugging Face Inference API
✅ Option 2 – Full Control: Host Your Own Inference Server


✅ Option 1: Hugging Face Inference API

Best for: βœ”οΈ Zero setup β€” no servers, no DevOps βœ”οΈ Easy testing, demos, small apps βœ”οΈ Hosted on Hugging Face’s infra


How it works:

  • Public models on the Hub with a supported task get a free widget and an API endpoint.

  • Hugging Face spins up the model on demand and runs inference for you.


Example – Call Your Model

```python
from huggingface_hub import InferenceClient

# Point the client at your model repo on the Hub
client = InferenceClient("YOUR_USERNAME/YOUR_MODEL_NAME")

# Generation runs remotely on Hugging Face's infrastructure
response = client.text_generation(
    "Tell me about Python functions.",
    max_new_tokens=100
)

print(response)
```

👉 Billing: free tier = limited calls, then pay-as-you-go for heavy usage.
👉 Perfect for PoCs, plugins, or when you don't want to manage GPUs.


✅ Option 2: Self-Host with Accelerate + FastAPI (or TorchServe)

Best for: βœ”οΈ Full control β€” run on your own cloud/server βœ”οΈ Customize scaling, auth, logging, limits βœ”οΈ Ideal for private or high-volume apps


πŸ—‚οΈ Basic Setup

1️⃣ Use Accelerate to load big models efficiently (multi-GPU, mixed precision).
2️⃣ Wrap your model behind a FastAPI server to handle incoming HTTP requests.


✅ Example – FastAPI + Accelerate


1️⃣ Install requirements
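
A typical minimal install for this setup (pin versions as needed for your environment):

```bash
pip install fastapi uvicorn transformers accelerate torch
```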


2️⃣ Create server.py
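
A minimal sketch of server.py, assuming a causal language model on the Hub. The repo name, the /generate route, and the GenerateRequest schema are illustrative placeholders; add auth, batching, and error handling for production:

```python
# server.py – minimal FastAPI wrapper around a Hub model (illustrative sketch)
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "YOUR_USERNAME/YOUR_MODEL_NAME"  # your model on the Hub

# Load once at startup; device_map="auto" lets Accelerate place weights on available hardware
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    torch_dtype=torch.float16,  # half precision to save memory (assumes a GPU)
)

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(req: GenerateRequest):
    # Tokenize the prompt and move it to the same device as the model
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"generated_text": text}
```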


3️⃣ Run the server
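
Launch it with Uvicorn (host and port are up to you):

```bash
uvicorn server:app --host 0.0.0.0 --port 8000
```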

✅ Now you have an inference API at http://localhost:8000/generate.


✅ When to Add Accelerate

For larger models:

  • Load with Accelerate to auto-shard over multiple GPUs.

  • Use device_map="auto" to fit your model on available hardware.

  • Switch to FP16 or 8-bit for memory savings.

Example:
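
A minimal sketch of loading a large model this way (the repo name is a placeholder; 8-bit loading would additionally require bitsandbytes):

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="auto" lets Accelerate shard layers across available GPUs (and CPU if needed);
# torch_dtype=torch.float16 loads the weights in half precision to roughly halve memory use.
model = AutoModelForCausalLM.from_pretrained(
    "YOUR_USERNAME/YOUR_MODEL_NAME",
    device_map="auto",
    torch_dtype=torch.float16,
)
# For 8-bit weights, pass quantization_config=BitsAndBytesConfig(load_in_8bit=True) instead.
```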


✅ Alternative: TorchServe

If you prefer production-ready model serving:

  • TorchServe can serve PyTorch models as a REST or gRPC API.

  • Good for scaling with multiple workers and batching.

Basic flow:

  1. Export model to TorchScript if needed.

  2. Write a custom handler.

  3. Deploy using torchserve --start.

👉 Hugging Face has docs for TorchServe integration.
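
As a rough sketch, packaging and launching could look like this (file names, the handler, and the model store path are placeholders):

```bash
mkdir -p model_store

# Package the trained model plus a custom handler into a .mar archive
torch-model-archiver --model-name my_model --version 1.0 \
    --serialized-file model.pt --handler handler.py --export-path model_store

# Start TorchServe and register the archived model
torchserve --start --model-store model_store --models my_model=my_model.mar
```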


✅ Key Benefits

| Option | Pros | Cons |
| --- | --- | --- |
| HF Inference API | Easiest, no infra | Pay per call, less customization |
| Self-Host (Accelerate + FastAPI) | Full control, customize for scale | You manage the infra |
| TorchServe | Good for big teams, multi-model serving | More setup, DevOps experience needed |


πŸ—οΈ Key Takeaway

You can ship your model in minutes with the Inference API, or run your own secure server for full flexibility and control.


➡️ Next: Learn how to secure your endpoints, monitor usage, and scale your assistant for real users!
