I’ve noticed some models on Hugging Face don’t have an attached inference provider. For those using these models in real projects, how are you deploying them today? Thanks for the help.
We are waiting for our inference provider to be accepted here at HF. Unfortunately, we haven’t heard back from the team.
Which models are you interested in having served?
I am mostly looking for a serverless solution. For now, the popular ones.
How teams deploy these models in real projects (the common pattern)
Even if a model has no attached provider, most teams still deploy it the same way (a minimal sketch follows this list):
- Pull the weights from the Hub (with an HF token if the repo is private/gated)
- Serve the model with a standard inference server (for LLMs, vLLM or HF TGI are common)
- Run that server on a platform that matches their cost/latency needs:
  - serverless / scale-to-zero for spiky traffic
  - always-on for consistent low latency
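For concreteness, here is a minimal sketch of steps 1–2, assuming a Python environment with `huggingface_hub` and `vllm` installed; the repo id and token are placeholders, not recommendations.

```python
# Minimal sketch: pull weights from the Hub, then smoke-test them with vLLM's
# offline API before putting them behind a server. Repo id/token are placeholders.
from huggingface_hub import snapshot_download
from vllm import LLM, SamplingParams

local_path = snapshot_download(
    repo_id="mistralai/Mistral-7B-Instruct-v0.3",  # example repo; swap in yours
    token="hf_...",                                # only needed for private/gated repos
)

llm = LLM(model=local_path)  # vLLM also accepts the Hub repo id directly
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

In production you would typically put the same model behind vLLM’s OpenAI-compatible server or TGI rather than calling the offline API directly.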
Two serving stacks you’ll see repeatedly:
- vLLM: widely used for LLM inference/serving, with optimized throughput (GitHub)
- Hugging Face TGI (Text Generation Inference): used in production by Hugging Face to power HuggingChat, the Inference API, and Inference Endpoints (GitHub)
  - TGI supports a Messages API compatible with OpenAI Chat Completions, which simplifies integration and swapping providers (Hugging Face)
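To illustrate that OpenAI-compatible path, here is a hedged sketch using the `openai` Python client against a local TGI instance; the base URL is a placeholder, and the same client works against hosted providers by changing `base_url`/`api_key`.

```python
# Sketch: calling TGI's Messages API with the standard OpenAI client.
# Assumes TGI is already running locally on port 8080; the URL is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # your TGI endpoint (or a hosted provider's URL)
    api_key="-",                          # local TGI doesn't check the key
)

resp = client.chat.completions.create(
    model="tgi",  # TGI serves one model, so this is a placeholder; hosted APIs need a real model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```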
Serverless / scale-to-zero options (HF and non-HF)
These are the most common “serverless” ways people deploy Hub models that don’t have attached providers.
A) Hugging Face-managed
1) Inference Endpoints (autoscaling + scale-to-zero)
- HF documents scale-to-zero for Endpoints to reduce cost for intermittent workloads (Hugging Face)
- Expect cold starts when scaled to 0 (first request after idle wakes it).
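If you go this route programmatically, `huggingface_hub` can create an endpoint with scale-to-zero enabled. A hedged sketch; the repo id and the instance/accelerator names below are examples, so check the Endpoints catalog for what your account and region support.

```python
# Sketch: create an HF Inference Endpoint that scales to zero (min_replica=0).
# Repo id, instance size/type, and region are examples, not recommendations.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "my-llm-endpoint",
    repository="mistralai/Mistral-7B-Instruct-v0.3",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    min_replica=0,   # scale-to-zero: the first request after idle pays a cold start
    max_replica=1,
)
endpoint.wait()  # block until the endpoint reports it is running
```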
2) Spaces (prototype / light production)
- Free CPU tier: 16GB RAM, 2 CPU cores, 50GB ephemeral disk (Hugging Face)
- GPU upgrades are available (paid) (Hugging Face)
- Best for demos, internal tools, and light traffic; less ideal for strict SLAs unless you control warm capacity.
B) “Real serverless” GPU containers (you bring a Docker image)
3) Google Cloud Run + GPUs
- Official docs: GPU services can scale down to zero for cost savings (Google Cloud Documentation)
- Scaling from zero is request-triggered; if you need background work you must design a “wake-up request” or set min instances > 0 (Google Cloud Documentation)
- Google also publishes GPU best practices for inference (memory/KV cache/quantization tuning) (Google Cloud Documentation)
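Those best practices mostly translate into a handful of engine knobs. A hedged vLLM example of the kind of tuning the guide refers to; the model and values are illustrative, not recommendations.

```python
# Sketch: memory / KV-cache / quantization tuning expressed as vLLM engine
# arguments. The quantized repo and the numbers below are examples only.
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # example AWQ-quantized Hub repo
    quantization="awq",            # must match how the checkpoint was quantized
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may claim (weights + KV cache)
    max_model_len=4096,            # shorter max context -> smaller KV-cache footprint
)
```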
4) Azure Container Apps “serverless GPUs”
- Microsoft explicitly positions this as automatic scaling + per-second billing + scale down to zero (Microsoft Learn)
- Important nuance: some workload profiles (e.g., “Flexible”) don’t scale to zero; consumption profiles do (Microsoft Learn)
C) Serverless inference platforms (popular for “BYO model weights”)
5) Modal
- Modal docs: functions scale to zero by default when idle (Modal)
- Strong DX if your team likes “code-first” deployments.
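To give a feel for that code-first style, here is a hedged sketch of a Modal function (API names per Modal’s Python SDK; the GPU type and the tiny example model are assumptions, not a recommended setup).

```python
# Sketch: a Modal function that scales to zero when idle. GPU choice and the
# toy model are placeholders; swap in your Hub repo and real serving code.
import modal

app = modal.App("hub-model-demo")
image = modal.Image.debian_slim().pip_install("transformers", "torch")

@app.function(gpu="A10G", image=image)
def generate(prompt: str) -> str:
    from transformers import pipeline
    pipe = pipeline("text-generation", model="distilgpt2")  # example Hub model
    return pipe(prompt, max_new_tokens=32)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(generate.remote("Hello from Modal:"))
```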
6) RunPod Serverless
- Docs: endpoints can auto-scale from zero to hundreds of workers (Runpod Documentation)
- vLLM integration is a common path for serving Hub LLMs serverlessly (Runpod Documentation)
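For RunPod, the serverless entry point is a handler function. A hedged sketch below; the model-loading part is illustrative, not RunPod’s official vLLM template.

```python
# Sketch: a RunPod Serverless handler. Workers scale from zero with demand;
# the model here is a tiny placeholder so the example stays small.
import runpod
from transformers import pipeline

pipe = pipeline("text-generation", model="distilgpt2")  # loaded once per worker

def handler(job):
    prompt = job["input"]["prompt"]
    text = pipe(prompt, max_new_tokens=32)[0]["generated_text"]
    return {"output": text}

runpod.serverless.start({"handler": handler})
```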
7) Baseten
- Docs: set min_replica = 0 to enable scale-to-zero; the first request triggers a cold start, and large models can take minutes (Baseten Docs)
8) Replicate (custom model deployments)
- Replicate supports deploying and scaling custom models (“deploy a custom model”) (Replicate)
Hosted model APIs (fastest integration, but you usually can’t run any arbitrary Hub repo)
If you can choose from a provider’s supported catalog (instead of “any Hub model”), teams often use these:
- Together AI (serverless + also “Dedicated Endpoints”) (Together AI)
- Fireworks (serverless token pricing; also on-demand deployments) (Fireworks AI)
- GroqCloud (hosted inference; pricing published) (Groq)
- DeepInfra (pay-as-you-use pricing; token or execution-time depending on model) (Deep Infra)
These are common in production when:
- you can accept a specific model set, and
- you want low ops overhead and predictable performance.
How to choose (simple decision logic)
If you must run a specific Hub repo
Pick a BYO-weights option:
- HF Inference Endpoints (Hugging Face)
- Cloud Run GPUs (Google Cloud Documentation)
- Azure Container Apps serverless GPUs (Microsoft Learn)
- Modal / RunPod / Baseten / Replicate (Modal)
If you can use “close enough” models
Use a hosted API (Together/Fireworks/Groq/DeepInfra) (Together AI)
If you’re truly serverless (scale-to-zero) and user-facing latency matters
Plan for cold start mitigation:
- keep min replicas = 1 during business hours, or
- do “wake-up requests,” or
- pick smaller/quantized models and an engine like vLLM/TGI.
Cloud Run explicitly notes that scaling from zero requires a request, and suggests min instances or a wake-up request pattern. (Google Cloud Documentation)
Baseten explicitly warns cold starts for large models can take minutes. (Baseten Docs)
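If you go the “wake-up request” route, it can be as simple as a scheduled ping. A sketch, assuming your own endpoint URL and auth (both placeholders here), run from cron/Cloud Scheduler shortly before expected traffic.

```python
# Sketch: a scheduled "wake-up request" so the first real user doesn't pay the
# cold start. URL and auth header are placeholders for your deployment.
import requests

ENDPOINT_URL = "https://your-endpoint.example.com/health"
HEADERS = {"Authorization": "Bearer <token>"}  # only if your endpoint requires auth

def wake_up() -> None:
    try:
        r = requests.get(ENDPOINT_URL, headers=HEADERS, timeout=120)  # generous timeout for cold starts
        print("wake-up status:", r.status_code)
    except requests.RequestException as exc:
        print("wake-up failed:", exc)

if __name__ == "__main__":
    wake_up()
```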
One pattern we’ve seen is that “serverless” often works well for experimentation, but once traffic stabilizes, teams start optimizing for predictability rather than pure scale-to-zero behavior.
Cold starts and GPU spin-up time can become more operationally expensive than the compute itself, especially for user-facing workloads.
A lot of deployments end up hybrid: serverless for spiky jobs, and warm capacity for anything latency-sensitive.
Thank you so much. This is very helpful.