What's the catch?

by encryptedoreo - opened 5 days ago

I don't mean to sound like I hate your work, but IIUC, it sounds like some random lab (no offense!) releases a 1B model, competitive with models that are 2-7x its size, trained with 432x less compute and ~900x fewer tokens. So what's the catch? Because it sounds too good to be true. Training instability? Bad scaling?

amarosnithe

2 days ago • edited 2 days ago

> I don't mean to sound like I hate your work, but IIUC, it sounds like some random lab (no offense!) releases a 1B model, competitive with models that are 2-7x its size, trained with 432x less compute and ~900x fewer tokens. So what's the catch? Because it sounds too good to be true. Training instability? Bad scaling?

Tbh I think it's the results. HRM went viral over a year ago due to its revolutionary architecture, and the fact that their mere 27M model beat all the mainstream models (ChatGPT, Gemini, Claude...) on complex tasks like puzzles, sudoku, etc. (while being trained on merely ~1k examples). Sure, there was some dataset cross-contamination which polluted the model and inflated the results, but even after Sapient fixed it, the benchmarks were still extremely impressive.

Then they went dark for a long time, and now they finally come back with this model (which made me seriously excited, I've been waiting for them), but when you look at the benchmarks, they're not as impactful as their previous demo.

Still, the fact that it was trained on significantly less compute and fewer tokens, and is competitive with bigger models, is pretty impressive in itself. It really shows that HRMs are still a viable choice to replace LLMs (although Samsung's TRMs still have better potential tbh).

AGSCI2023

1 day ago

Re-read the second sentence - it explains it: "HRM is a dual-timescale recurrent architecture: two Transformer modules (H = high-level / slow, L = low-level / fast) iterate over the same input embeddings for H_cycles × (L_cycles + 1) steps, with additive state injection (z_L + z_H)."
It means that the model is theoretically good at tasks that amount to a finite number of transformations of the input (fixing typos, translating languages, fixing code, mazes, chess, sudoku), and likely to be worse (again, theoretically; this doesn't strictly follow) at open-ended tasks (writing code from scratch, maintaining long conversations, etc.).
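To make the quoted sentence a bit more concrete, here is a rough toy sketch of how I read that loop. This is not the actual HRM code: `toy_module` is a made-up stand-in for a Transformer module, the weights and the exact update order are my assumptions, and I'm assuming the "+ 1" in H_cycles × (L_cycles + 1) is the single H update that follows each run of L updates.

```python
import math


def toy_module(state, injected, weight):
    """Hypothetical stand-in for a Transformer module (NOT the real HRM block).

    The injected signal is simply added to the state before a nonlinearity,
    mimicking the "additive state injection" from the model card.
    """
    return [math.tanh(weight * (s + i)) for s, i in zip(state, injected)]


def hrm_forward(x, h_cycles=2, l_cycles=3):
    """Toy dual-timescale loop over the same input embeddings x."""
    z_h = [0.0] * len(x)  # slow, high-level state
    z_l = [0.0] * len(x)  # fast, low-level state
    steps = 0
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            # Fast module: sees the input embeddings plus the slow state.
            injected = [xi + zh for xi, zh in zip(x, z_h)]
            z_l = toy_module(z_l, injected, weight=0.9)
            steps += 1
        # Slow module: updated once per outer cycle, with z_L added in.
        z_h = toy_module(z_h, z_l, weight=0.5)
        steps += 1
    return z_h, steps


z_h, steps = hrm_forward([0.1, -0.2, 0.3], h_cycles=2, l_cycles=3)
print(steps)  # h_cycles * (l_cycles + 1)
```

The point of the sketch is that the number of module applications is fixed by `h_cycles` and `l_cycles`, no matter what the input is. That fixed budget is why I'd expect it to suit bounded "transform this input" tasks better than open-ended generation.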
