How do I run Hugging Face models locally on my laptop?

How can I download and use Hugging Face AI models on my own computer?


There are various ways; it depends on your laptop's specs.


What it means to “run a Hugging Face model locally” (background)

  • Hugging Face models live on the Hugging Face Hub as repos containing weights, a tokenizer, and a config. (Hugging Face)
  • When you use pipeline(...) or from_pretrained(...), the files are downloaded once and stored in a local cache (then reused). (Hugging Face)
  • The cache is typically under ~/.cache/huggingface/hub and can be moved with HF_HOME / HF_HUB_CACHE. (Hugging Face)

Choose the best local setup for your goal

1) “I want to use models in Python code” → transformers

This is the standard way to run text, vision, audio, and multimodal models locally. (Hugging Face)

2) “I want a local ChatGPT-like LLM on a laptop” → GGUF + Ollama or GGUF + llama.cpp

This is often the smoothest laptop experience because GGUF models are commonly quantized (smaller, faster, less memory). (Hugging Face)

3) “I want images (Stable Diffusion / diffusion models)” → diffusers

Diffusers provides DiffusionPipeline.from_pretrained(...) and supports saving/loading locally. (Hugging Face)

4) “No Python; run in browser” → transformers.js

Runs models via ONNX Runtime in the browser. (Hugging Face)


Path A: Run Hugging Face models locally in Python (transformers)

Step 1 — Install

Use a virtual environment + install PyTorch + Transformers. (Hugging Face)

python -m venv .venv
# macOS/Linux
source .venv/bin/activate
# Windows (PowerShell)
# .\.venv\Scripts\Activate.ps1

pip install -U torch transformers

Step 2 — Run a model (auto-downloads once)

Pipelines are the easiest inference API. (Hugging Face)

from transformers import pipeline

clf = pipeline("sentiment-analysis")
print(clf("I can run models locally now."))

Step 3 — If the model is larger: add accelerate and use device_map="auto"

This lets Accelerate place model parts across available devices (GPU first, then CPU, then disk if needed). (Hugging Face)

pip install -U accelerate

from transformers import pipeline

gen = pipeline("text-generation", model="google/gemma-2-2b", device_map="auto")
print(gen("Explain local inference on a laptop:", max_new_tokens=80)[0]["generated_text"])

Path B: Download models to your computer (controlled folders + offline use)

Option 1 — CLI download (hf download)

The hf CLI is the simplest way to download an entire model repo into a local directory. (Hugging Face)

pip install -U "huggingface_hub[cli]"
hf auth login   # only needed for gated/private models
hf download <org-or-user>/<model-repo> --local-dir ./models/<model-repo>

  • For gated models, you may need to request access and then authenticate with a token. (Hugging Face)

Option 2 — Python download (good for scripts)

  • hf_hub_download() for single files; snapshot_download() for full repos. The guide explains versioned caching and warns not to modify cached files. (Hugging Face)
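For example, a minimal sketch in Python (the repo name below is just a small public example):

from huggingface_hub import hf_hub_download, snapshot_download

# Grab a single file from a repo (stored in the shared cache)
config_path = hf_hub_download(repo_id="distilbert-base-uncased", filename="config.json")

# Or mirror the whole repo into a folder you control
local_path = snapshot_download(
    repo_id="distilbert-base-uncased",
    local_dir="./models/distilbert-base-uncased",
)
print(config_path, local_path)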

Cache locations and moving the cache (common laptop need)

  • Default cache is ~/.cache/huggingface/hub; move it via HF_HOME or HF_HUB_CACHE. (Hugging Face)
  • You can also set cache_dir=... when calling from_pretrained(...) (commonly used when disk space is tight). (Stack Overflow)
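A quick sketch of both knobs (the paths are placeholders; adjust for your machine):

import os
# Relocate the whole Hugging Face cache; set this before importing transformers
os.environ["HF_HOME"] = "/mnt/bigdisk/hf-cache"   # placeholder path

from transformers import AutoModel, AutoTokenizer

# Or override per call with cache_dir=...
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased", cache_dir="./hf-cache")
model = AutoModel.from_pretrained("distilbert-base-uncased", cache_dir="./hf-cache")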

Path C: Run LLMs locally on a laptop (recommended for “chat”)

Why GGUF is popular on laptops (background)

LLM weights in standard PyTorch fp16/bf16 format can be large; GGUF is designed for llama.cpp-style executors and is widely distributed in quantized forms that fit laptop RAM/VRAM more easily. (Hugging Face)

Option 1 — Ollama (fastest)

Hugging Face documents running GGUF checkpoints directly from the Hub with a single command. (Hugging Face)

Typical pattern:

ollama run hf.co/<user-or-org>/<gguf-repo>

Option 2 — llama.cpp (more control)

Hugging Face documents running GGUF by specifying the repo path + file; llama.cpp downloads and caches the model and uses LLAMA_CACHE for cache location. (Hugging Face)
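The CLI flags are covered in the linked docs; if you would rather stay in Python, the llama-cpp-python bindings (a separate project, pip install llama-cpp-python) expose the same repo + file pattern and use huggingface_hub for the download. A rough sketch, with placeholder repo and quantization names:

from llama_cpp import Llama

# Downloads the matching GGUF file from the Hub and caches it locally
llm = Llama.from_pretrained(
    repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",  # example GGUF repo
    filename="*Q4_K_M.gguf",                           # pattern matching one quantized file
)
out = llm("Explain local inference on a laptop:", max_tokens=80)
print(out["choices"][0]["text"])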


Path D: Run diffusion/image models locally (diffusers)

Install + run

Diffusers’ loading guide shows DiffusionPipeline.from_pretrained(...) and device placement. (Hugging Face)

pip install -U diffusers torch transformers accelerate safetensors

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

img = pipe("a watercolor sketch of a laptop running local AI").images[0]
img.save("out.png")

Common pitfalls (and how to avoid them)

1) “It still tries to download something”

  • For offline runs, download first, then use HF_HUB_OFFLINE and/or local_files_only=True. (Hugging Face)
  • In Diffusers, users often rely on local_files_only=True for strict offline behavior. (GitHub)
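A minimal offline sketch in Transformers (assumes the model was already downloaded into the local cache):

import os
os.environ["HF_HUB_OFFLINE"] = "1"   # block any Hub network calls

from transformers import AutoModel, AutoTokenizer

name = "distilbert-base-uncased"     # example model, downloaded beforehand
tok = AutoTokenizer.from_pretrained(name, local_files_only=True)
model = AutoModel.from_pretrained(name, local_files_only=True)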

2) Cache/disk usage surprises

  • The cache layout and how to move it are explained in the caching guide. (Hugging Face)
  • If you download to a local directory with symlinks enabled, files may be symlinked from cache into your folder; the docs warn not to manually edit them. (Hugging Face)
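To see what the cache actually holds, huggingface_hub ships a scanner; a short sketch:

from huggingface_hub import scan_cache_dir

info = scan_cache_dir()
print(f"Total cache size: {info.size_on_disk / 1e9:.2f} GB")
# Largest cached repos first
for repo in sorted(info.repos, key=lambda r: r.size_on_disk, reverse=True)[:5]:
    print(repo.repo_id, f"{repo.size_on_disk / 1e9:.2f} GB")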

3) Running out of memory with large models

  • device_map="auto" can offload across devices; Accelerate explains the placement order and tradeoffs. (Hugging Face)
  • Memory can still blow up from generation settings (context length, batch size, KV cache). If you hit OOM, reduce context/generation length and prefer smaller/quantized models (GGUF on laptops). (Hugging Face)
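A sketch of both levers together, assuming a single GPU at index 0; the memory caps are placeholders and the model is the same example as in Path A (gated, so it may need hf auth login):

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "google/gemma-2-2b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",
    max_memory={0: "4GiB", "cpu": "8GiB"},   # cap per-device usage; the rest is offloaded
)

inputs = tok("Explain local inference on a laptop:", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)   # short generations keep the KV cache small
print(tok.decode(out[0], skip_special_tokens=True))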

4) Safer weight files

  • Prefer safetensors when available (safer than pickle-based formats). (Hugging Face)
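In Transformers you can make that preference explicit; a small sketch:

from transformers import AutoModel

# Refuse to fall back to pickle-based .bin weights
model = AutoModel.from_pretrained("distilbert-base-uncased", use_safetensors=True)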

Good guides/tutorials/docs (curated, with “what each is for”)

Core “run locally in Python”

  • Transformers Installation — environment setup, caching, offline pointers. (Hugging Face)
  • Pipeline Tutorial — easiest way to run many tasks; mentions GPUs/Apple Silicon support and practical knobs. (Hugging Face)
  • Pipelines Reference — task list + API details. (Hugging Face)

Downloading + offline + cache control

  • Hub Download Guide — hf_hub_download, versioned cache behavior, “don’t edit cached files.” (Hugging Face)
  • CLI Guide (hf CLI) — practical hf download, auth, common workflows. (Hugging Face)
  • Manage Cache — cache layout + HF_HOME / HF_HUB_CACHE. (Hugging Face)
  • StackOverflow: change cache dir — practical examples with cache_dir and env vars. (Stack Overflow)

“Big models” on limited hardware

  • Accelerate Big Model Inference — how device_map="auto" dispatch/offload works. (Hugging Face)
  • Forum thread (device_map OOM confusion) — common misconceptions and troubleshooting context. (Hugging Face Forums)

Laptop-friendly local LLM runtimes (GGUF)

  • Use Ollama with GGUF from the Hub — single-command local runs. (Hugging Face)
  • GGUF usage with llama.cpp — repo+file loading and cache behavior (LLAMA_CACHE). (Hugging Face)
  • What GGUF is — background on the format and its ecosystem. (Hugging Face)

Diffusion/image local runs

  • Diffusers loading guide — from_pretrained and device placement. (Hugging Face)
  • DiffusionPipeline API — saving/loading, best practices. (Hugging Face)

Browser-local (no Python)

  • Transformers.js docs — run models in-browser using ONNX Runtime. (Hugging Face)

A simple starter plan (works for most laptops)

  1. Start with Transformers pipeline for a small model (quick success). (Hugging Face)
  2. If you want LLM chat locally, switch to GGUF + Ollama (best laptop UX). (Hugging Face)
  3. When you care about offline/reproducible, use hf download ... --local-dir ... and then run with offline flags. (Hugging Face)
  4. If you hit memory limits, use smaller models or quantized GGUF, and keep generation/context modest. (Hugging Face)

I usually download Hugging Face models using Python. I install Python, then the transformers and torch libraries, and after that I can load any model by name and use it locally.
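For example, a minimal version of that workflow (using a small model like distilgpt2):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

inputs = tok("Running models locally is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))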
