My Space was paused and returns a 503 on restart

Just a few minutes after I created my new Space, it was paused, and restarting it returns a 503.

Request ID: Root=1-693c2f0a-648372ed194a28d5316eb464

Same issue for our BenCzechMark space. Any solution?


That information alone isn’t enough to pinpoint the issue, but I recalled that Spaces’ default settings have been acting strangely lately, so I’m passing this along. For some reason, the Python version is set to 3.13, which differs from the documentation…

Adding python_version: "3.10" to README.md should create build conditions similar to those in 2025:

---
title: 🇨🇿 BenCzechMark
emoji: 📊
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 4.42.0
python_version: "3.10"
app_file: app.py
---

Thank you for looking into it. I tried adding an explicit Python interpreter version. Restart/Factory Rebuild still produces

503

Something went wrong when restarting this Space.

Request ID: Root=1-69831bb9-4871e7833794d02c2cb7a365

Perhaps you could determine the specific issue from the log corresponding to the request ID? Is there any more information I could provide?


Only HF staff can tell anything from the request ID…

Your space’s code is quite resilient to server-side changes.

Even if I look for flaws, the only possibilities I can find are things like bokeh not being pinned, the contents of environment variables, or errors related to other datasets or servers it references…

Anyway, if it doesn’t boot for some reason, I think it’ll result in a 503. If that’s logged in your container logs, I think we can pinpoint the cause.


“503 – Something went wrong when restarting this Space”

This is a generic front-end error that usually happens when Hugging Face cannot successfully complete one of these phases:

  1. Build phase fails (dependency install, environment setup).
  2. Container starts but crashes immediately (Python exception, missing env var, import error, etc.).
  3. Container stays up but never becomes “healthy” (less common for Gradio; more common for Docker / wrong port / wrong host binding).

The only reliable way to distinguish (1)/(2)/(3) is the Space’s Open Logs → Build logs / Container logs. HF’s docs explicitly point to these two log streams for debugging. (Hugging Face)


The most likely failure points for your Space (based on the repo)

1) Missing required environment variables (hard crash at import time)

Your code reads several env vars with os.environ["..."]. If any are missing, Python raises KeyError and the container will crash immediately → restart yields 503.

In app.py:

  • HF_SPACE_TOKEN = os.environ["HF_SPACE_TOKEN"]
  • HF_SPACE_ID = os.environ["HF_SPACE_ID"] (Hugging Face)

In server.py:

  • HF_TOKEN = os.environ["HF_TOKEN"]
  • HF_RESULTS_DATASET = os.environ["HF_RESULTS_DATASET"] (Hugging Face)

Why this matters: even if the Space used to work, a rebuild/restart can expose missing secrets (or a renamed variable) instantly.

Workaround (code-hardening):

  • Replace os.environ["X"] with os.environ.get("X") and fail gracefully with a clear UI message (or disable affected features) instead of crashing the whole app.
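
For illustration, a minimal sketch of that hardening (the variable names come from the code above; printing a warning instead of a UI banner is my simplification):

import os
import sys

# Read secrets with .get() so a missing variable degrades gracefully
# instead of raising KeyError at import time and killing the container.
HF_TOKEN = os.environ.get("HF_TOKEN")
HF_RESULTS_DATASET = os.environ.get("HF_RESULTS_DATASET")

missing = [name for name, value in {
    "HF_TOKEN": HF_TOKEN,
    "HF_RESULTS_DATASET": HF_RESULTS_DATASET,
}.items() if not value]

if missing:
    # Log loudly so the container logs show exactly what is wrong,
    # but let the app boot so the Space does not 503 on restart.
    print(f"WARNING: missing env vars: {', '.join(missing)}", file=sys.stderr)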

2) Dataset download/auth failure on startup (token / permissions / typo)

LeaderboardServer.__init__() runs on import in app.py and immediately calls results_dataset_integrity_check() and update_leaderboard(). (Hugging Face)

update_leaderboard() calls _update_models_and_tournament_results(), which calls snapshot_download(... token=HF_TOKEN ...) into local_dir="./". (Hugging Face)

If:

  • HF_RESULTS_DATASET is wrong (typo, renamed dataset),
  • HF_TOKEN is invalid/expired,
  • the dataset became private/restricted to a different org,
  • hub requests fail due to transient infra/network,

then startup can fail → Space can’t restart (503).

Workarounds:

  • Temporarily wrap snapshot_download in try/except and show an “upstream unavailable” message while letting the UI boot.
  • Confirm that HF_TOKEN has permission to read/write the dataset repo referenced by HF_RESULTS_DATASET.
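
A minimal sketch of that wrapping, assuming the call shape described above (safe_snapshot_download is a hypothetical helper name):

from huggingface_hub import snapshot_download

def safe_snapshot_download(repo_id: str, token: str | None):
    # Let the UI boot even if the results dataset is unreachable;
    # surface the error in the logs instead of crashing on import.
    try:
        return snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
            local_dir="./",
        )
    except Exception as err:  # auth failure, 404, or transient network error
        print(f"WARNING: could not fetch {repo_id}: {err}")
        return None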

3) Self-restart loop if an external host is unreachable (can keep the Space perpetually unstable)

Your UI includes logic to restart the Space if an external endpoint is not reachable:

  • check_significance_is_reachable_hook() calls check_significance_is_reachable()
  • If false: it sleeps 10s then calls api.restart_space(repo_id=HF_SPACE_ID, token=HF_SPACE_TOKEN) (Hugging Face)
  • A timer periodically triggers this hook (gr.Timer(...).tick(...)).

Separately, server.py’s check_significance_is_reachable() hits a fixed URL on czechllm.fit.vutbr.cz. (Hugging Face)

If HF egress/DNS intermittently fails to that host (or that host is down), your app may repeatedly restart itself. This can look like “it never stays up” and can manifest as frequent 503s to users.

Workaround (strongly recommended):

  • Do not restart the entire Space when the endpoint is down.
  • Instead: disable only the significance feature, show a warning banner, and retry with exponential backoff.
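
A rough sketch of that pattern (wait_for_endpoint is a hypothetical replacement for the restart branch; reuse whatever URL server.py already checks):

import time
import requests

def wait_for_endpoint(url: str, max_retries: int = 5) -> bool:
    # Retry with exponential backoff instead of calling
    # api.restart_space() the moment the host is unreachable.
    delay = 5
    for _ in range(max_retries):
        try:
            if requests.get(url, timeout=10).ok:
                return True
        except requests.RequestException:
            pass
        time.sleep(delay)
        delay *= 2  # 5s, 10s, 20s, 40s, 80s
    # Still down: disable the significance feature and show a banner,
    # but keep the Space itself running.
    return False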

4) Dependency drift triggered by a rebuild (very plausible given your requirements.txt)

Your requirements.txt is mostly pinned — except the last line: bokeh is unpinned. (Hugging Face)

Your code uses Bokeh to generate HTML/iframes (e.g., bokeh.embed.file_html(...), CDN.render(), etc.). (Hugging Face)

If a recent commit caused a rebuild (even a README edit), pip will install the latest Bokeh at rebuild time. If Bokeh introduced a breaking change relative to what your code previously ran with, you can get a startup crash (or a crash during leaderboard generation) that did not exist before.

Workaround:

  • Pin Bokeh to a known-good version (example: bokeh==<version-you-tested>).
  • Even better: generate requirements.txt via pip-compile / constraints to avoid “one unpinned package breaks everything”.
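
One cheap diagnostic to add alongside the pin (my suggestion, not something already in the repo): print installed versions at startup, so any rebuild drift shows up immediately in the container logs:

from importlib.metadata import version

# Print key dependency versions at startup so the container logs
# show exactly what pip installed after any rebuild.
for pkg in ("gradio", "bokeh", "huggingface_hub"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except Exception:
        print(f"{pkg}: not installed")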

5) YAML/config typos or HF “restrictions”

Your README metadata looks structurally correct for a Gradio Space (sdk: gradio, app_file: app.py, python_version: '3.10', etc.). (Hugging Face)
So a simple YAML typo is less likely here.

Policy restrictions are possible in general, but nothing in this repo obviously screams “blocked category”. In practice, restrictions usually come with an explicit moderation banner or a message from HF; if you didn’t get that, treat it as lower probability.


Can I determine the specific issue from the Request ID?

Not from the outside.

That Request ID: Root=... is an internal tracing identifier. It can help Hugging Face staff locate the failing backend operation, but it’s not enough for external debugging without access to HF’s logs.

What you can provide that will usually pinpoint it in minutes:

  • Build logs (first failure; pip errors; resolution conflicts)
  • Container logs (Python traceback; missing env var; import error; HTTP failures)

HF’s docs and forum guidance consistently point to “Open Logs → Build / Container” as the primary debugging path. (Hugging Face)


Concrete triage plan for your repo (fastest path to root cause)

  1. Open Logs → Build

    • If you see pip resolution/install failure: likely dependency drift (start by pinning bokeh).
  2. Open Logs → Container

    • If you see KeyError: 'HF_TOKEN' (or similar): env var missing. (Your code will crash on missing required vars.) (Hugging Face)
    • If you see HF Hub auth errors around snapshot_download: token/permission/dataset name issue. (Hugging Face)
    • If you see repeated restarts tied to czechllm.fit.vutbr.cz reachability: remove the self-restart logic. (Hugging Face)
  3. Quick workaround to get something running

    • Temporarily disable expensive/fragile startup steps (e.g., don’t build plots on init; lazy-load the leaderboard after the UI is up; see the sketch after this list).
  4. If logs show nothing useful and restart still fails

    • Duplicate the Space into a new repo name and try a clean build. This is a common workaround when a Space gets into a bad runtime state (seen across multiple 503-restart threads). (Hugging Face Forums)
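
For step 3, a minimal lazy-loading sketch (the loader body is a stand-in for the repo’s real snapshot_download and table construction):

import gradio as gr
import pandas as pd

def load_leaderboard() -> pd.DataFrame:
    # Stand-in for the real work: in the actual app this would run
    # snapshot_download and build the leaderboard table.
    return pd.DataFrame({"model": ["example"], "score": [0.0]})

with gr.Blocks() as demo:
    table = gr.Dataframe()
    # demo.load() fires once the UI is served, so a slow or failing
    # download no longer blocks container startup (and the 503).
    demo.load(fn=load_leaderboard, outputs=table)

demo.launch()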

Similar cases & useful references (with context)

HF forum threads matching the same symptom (“paused + 503 restart”)

  • “Can’t restart or rebuild the space” (Oct 2025): same generic 503 restart failure reported; suggests checking logs and considering infra issues. (Hugging Face Forums)
  • “Spaces Docker Build Pauses and 503 Error on Restart” (Dec 2025): illustrates that “503 on restart” often means the container never gets healthy or crashes. (Hugging Face Forums)
  • “HF Space Stopped Working After Pinning” (Dec 2025): rebuild after a change caused breakage; consistent with dependency drift / infra sensitivity. (Hugging Face Forums)
  • “The Space Does not Restart” (Nov 2025): explains the meaning of “Paused + 503 restart” and the usual build/runtime failure interpretation. (Hugging Face Forums)
  • “My space was paused and 503 when I restart” (Dec 2025): shows common root causes (startup/health/port patterns), though more Docker-focused. (Hugging Face Forums)

Official docs you’ll likely use while fixing it

  • Debugging via Build logs and Container logs (HF docs). (Hugging Face)
  • CPU-basic sleep behavior (~48h inactivity). (Hugging Face)
  • Managing Space runtime and state via huggingface_hub (restart/pause/sleep time, etc.). (Hugging Face)
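
For reference, the huggingface_hub runtime calls look roughly like this (a sketch; the repo_id is assumed from the Space name in this thread):

from huggingface_hub import HfApi

api = HfApi(token="hf_...")  # a token with write access to the Space

# Inspect the current runtime stage (RUNNING, PAUSED, BUILD_ERROR, ...)
runtime = api.get_space_runtime("CZLC/BenCzechMark")  # assumed repo_id
print(runtime.stage)

# Restart (or pause) the Space programmatically
api.restart_space("CZLC/BenCzechMark")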

Most probable causes

  1. One of the required env vars is missing or renamed (HF_TOKEN, HF_RESULTS_DATASET, HF_SPACE_TOKEN, HF_SPACE_ID). (Hugging Face)
  2. HF_TOKEN no longer has correct permissions for HF_RESULTS_DATASET, causing snapshot_download to fail at startup. (Hugging Face)
  3. External endpoint reachability triggers a restart loop (your code explicitly restarts the Space when unreachable). (Hugging Face)
  4. Rebuild pulled a new incompatible Bokeh because bokeh is unpinned. (Hugging Face)

Unfortunately, I do not think the issue lies in the repo directly. We had a parallel repo running with exactly the same code: 🇨🇿 BenCzechMark - a Hugging Face Space by CZLC - and it was working fine (except for the Python version issue you pointed out, which we have now fixed in both repos). The problem is that the current repo’s logs are completely empty.

Due to inactivity from Hugging Face support, we decided to simply create a new Space 🇨🇿 BenCzechMark - a Hugging Face Space by CZLC with the same functionality. Unfortunately, we lost the likes and traffic info.


Just in case, we also kept the old Space alive (though it is still not functional :frowning:). Available here:


Hi, thanks for reporting. You’ll need to update Next.js to the latest version (or a fixed version) per https://nextjs.org/blog/CVE-2025-66478. Let us know if you run into further issues - happy to help!


Thanks, michellehbn!