I’ve noticed some models on Hugging Face don’t have an attached inference provider. For those using these models in real projects, how are you deploying them today? Thanks for the help.
We are waiting for our inference provider to be accepted here at HF. Unfortunately, we haven’t heard back from the team.
Which models are you interested in having served?
I am mostly looking for a serverless solution. For now, the popular ones.
How teams deploy these models in real projects (the common pattern)
Even if a model has no attached provider, most teams still deploy it the same way (a minimal sketch follows this list):
- Pull the weights from the Hub (with an HF token if the repo is private/gated)
- Serve the model with a standard inference server (for LLMs, vLLM or HF TGI are common)
- Run that server on a platform that matches their cost/latency needs:
  - serverless / scale-to-zero for spiky traffic
  - always-on for consistent low latency
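For concreteness, here is a minimal sketch of steps 1–2, assuming a Python environment with `huggingface_hub` and `vllm` installed; the repo id and token are placeholders, not recommendations.

```python
# Minimal sketch: pull weights from the Hub, then smoke-test them with vLLM's
# offline API before putting them behind a server. Repo id/token are placeholders.
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams

local_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",  # example repo; swap in yours
    token="hf_...",                                # only needed for private/gated repos
)

llm = LLM(model=local_path)  # vLLM also accepts the Hub repo id directly
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

In production you would typically put the same model behind vLLM’s OpenAI-compatible server or TGI rather than calling the offline API directly.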
Two serving stacks you’ll see repeatedly:
- vLLM: widely used for LLM inference/serving, with optimized throughput (GitHub)
- Hugging Face TGI (Text Generation Inference): used in production by Hugging Face to power HuggingChat, the Inference API, and Inference Endpoints (GitHub)
  - TGI supports a Messages API compatible with OpenAI Chat Completions, which simplifies integration and swapping providers (Hugging Face)
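To illustrate that OpenAI-compatible path, here is a hedged sketch using the `openai` Python client against a local TGI instance; the base URL is a placeholder, and the same client works against hosted providers by changing `base_url`/`api_key`.

```python
# Sketch: calling TGI's Messages API with the standard OpenAI client.
# Assumes TGI is already running locally on port 8080; the URL is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your TGI endpoint (or a hosted provider's URL)
    api_key="-",                          # local TGI doesn't check the key
)

resp = client.chat.completions.create(
    model="tgi",  # TGI serves one model, so this is a placeholder; hosted APIs need a real model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```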
Serverless / scale-to-zero options (HF and non-HF)
These are the most common “serverless” ways people deploy Hub models that don’t have attached providers.
A) Hugging Face-managed
1) Inference Endpoints (autoscaling + scale-to-zero)
- HF documents scale-to-zero for Endpoints to reduce cost for intermittent workloads (Hugging Face)
- Expect cold starts when scaled to 0 (first request after idle wakes it).
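If you go this route programmatically, `huggingface_hub` can create an endpoint with scale-to-zero enabled. A hedged sketch; the repo id and the instance/accelerator names below are examples, so check the Endpoints catalog for what your account and region support.

```python
# Sketch: create an HF Inference Endpoint that scales to zero (min_replica=0).
# Repo id, instance size/type, and region are examples, not recommendations.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-llm-endpoint",
    repository="mistralai/Mistral-7B-Instruct-v0.3",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    min_replica=0,   # scale-to-zero: the first request after idle pays a cold start
    max_replica=1,
)
endpoint.wait()  # block until the endpoint reports it is running
```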
2) Spaces (prototype / light production)
- Free CPU tier: 16GB RAM, 2 CPU cores, 50GB ephemeral disk (Hugging Face)
- GPU upgrades are available (paid) (Hugging Face)
- Best for demos, internal tools, and light traffic; less ideal for strict SLAs unless you control warm capacity.
B) “Real serverless” GPU containers (you bring a Docker image)
3) Google Cloud Run + GPUs
- Official docs: GPU services can scale down to zero for cost savings (Google Cloud Documentation)
- Scaling from zero is request-triggered; if you need background work you must design a “wake-up request” or set min instances > 0 (Google Cloud Documentation)
- Google also publishes GPU best practices for inference (memory/KV cache/quantization tuning) (Google Cloud Documentation)
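Those best practices mostly translate into a handful of engine knobs. A hedged vLLM example of the kind of tuning the guide refers to; the model and values are illustrative, not recommendations.

```python
# Sketch: memory / KV-cache / quantization tuning expressed as vLLM engine
# arguments. The quantized repo and the numbers below are examples only.
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ-quantized Hub repo
    quantization="awq",            # must match how the checkpoint was quantized
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim (weights + KV cache)
    max_model_len=4096,            # shorter max context -> smaller KV-cache footprint
)
```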
4) Azure Container Apps “serverless GPUs”
- Microsoft explicitly positions this as automatic scaling + per-second billing + scale down to zero (Microsoft Learn)
- Important nuance: some workload profiles (e.g., “Flexible”) don’t scale to zero; consumption profiles do (Microsoft Learn)
C) Serverless inference platforms (popular for “BYO model weights”)
5) Modal
- Modal docs: functions scale to zero by default when idle (Modal)
- Strong DX if your team likes “code-first” deployments.
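To give a feel for that code-first style, here is a hedged sketch of a Modal function (API names per Modal’s Python SDK; the GPU type and the tiny example model are assumptions, not a recommended setup).

```python
# Sketch: a Modal function that scales to zero when idle. GPU choice and the
# toy model are placeholders; swap in your Hub repo and real serving code.
import modal

app = modal.App("hub-model-demo")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image)
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="distilgpt2")  # example Hub model
    return pipe(prompt, max_new_tokens=32)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(generate.remote("Hello from Modal:"))
```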
6) RunPod Serverless
- Docs: endpoints can auto-scale from zero to hundreds of workers (Runpod Documentation)
- vLLM integration is a common path for serving Hub LLMs serverlessly (Runpod Documentation)
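For RunPod, the serverless entry point is a handler function. A hedged sketch below; the model-loading part is illustrative, not RunPod’s official vLLM template.

```python
# Sketch: a RunPod Serverless handler. Workers scale from zero with demand;
# the model here is a tiny placeholder so the example stays small.
import runpod
from transformers import pipeline

pipe = pipeline("text-generation", model="distilgpt2")  # loaded once per worker

def handler(job):
    prompt = job["input"]["prompt"]
    text = pipe(prompt, max_new_tokens=32)[0]["generated_text"]
    return {"output": text}

runpod.serverless.start({"handler": handler})
```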
7) Baseten
- Docs: set min_replica = 0 to enable scale-to-zero; the first request triggers a cold start, and large models can take minutes (Baseten Docs)
8) Replicate (custom model deployments)
- Replicate supports deploying and scaling custom models (“deploy a custom model”) (Replicate)
Hosted model APIs (fastest integration, but you usually can’t run any arbitrary Hub repo)
If you can choose from a provider’s supported catalog (instead of “any Hub model”), teams often use these:
- Together AI (serverless + also “Dedicated Endpoints”) (Together AI)
- Fireworks (serverless token pricing; also on-demand deployments) (Fireworks AI)
- GroqCloud (hosted inference; pricing published) (Groq)
- DeepInfra (pay-as-you-use pricing; token or execution-time depending on model) (Deep Infra)
These are common in production when:
- you can accept a specific model set, and
- you want low ops overhead and predictable performance.
How to choose (simple decision logic)
If you must run a specific Hub repo
Pick a BYO-weights option:
- HF Inference Endpoints (Hugging Face)
- Cloud Run GPUs (Google Cloud Documentation)
- Azure Container Apps serverless GPUs (Microsoft Learn)
- Modal / RunPod / Baseten / Replicate (Modal)
If you can use “close enough” models
Use a hosted API (Together/Fireworks/Groq/DeepInfra) (Together AI)
If you’re truly serverless (scale-to-zero) and user-facing latency matters
Plan for cold start mitigation:
- keep min replicas = 1 during business hours, or
- do “wake-up requests,” or
- pick smaller/quantized models and an engine like vLLM/TGI.
Cloud Run explicitly notes that scaling from zero requires a request, and suggests min instances or a wake-up request pattern. (Google Cloud Documentation)
Baseten explicitly warns cold starts for large models can take minutes. (Baseten Docs)
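If you go the “wake-up request” route, it can be as simple as a scheduled ping. A sketch, assuming your own endpoint URL and auth (both placeholders here), run from cron/Cloud Scheduler shortly before expected traffic.

```python
# Sketch: a scheduled "wake-up request" so the first real user doesn't pay the
# cold start. URL and auth header are placeholders for your deployment.
import requests

ENDPOINT_URL = "https://your-endpoint.example.com/health"
HEADERS = {"Authorization": "Bearer <token>"}  # only if your endpoint requires auth

def wake_up() -> None:
    try:
        r = requests.get(ENDPOINT_URL, headers=HEADERS, timeout=120)  # generous timeout for cold starts
        print("wake-up status:", r.status_code)
    except requests.RequestException as exc:
        print("wake-up failed:", exc)

if __name__ == "__main__":
    wake_up()
```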
One pattern we’ve seen is that “serverless” often works well for experimentation, but once traffic stabilizes, teams start optimizing for predictability rather than pure scale-to-zero behavior.
Cold starts and GPU spin-up time can become more operationally expensive than the compute itself, especially for user-facing workloads.
A lot of deployments end up hybrid: serverless for spiky jobs, and warm capacity for anything latency-sensitive.
Thank you so much. This is very helpful.