CharlesCNorton committed on
Commit ·
597e7c2
Parent(s): 6241818
eval_all: hash-keyed result cache (--cache-dir, --no-cache); README: bit-ordering scope rules; docs/ISA.md: opcode reference and end-to-end tutorial; docs/float-pipeline.md: composition gap notes
- .gitignore +1 -0
- README.md +19 -1
- docs/ISA.md +135 -0
- docs/float-pipeline.md +39 -0
- eval_all.py +77 -3
.gitignore
CHANGED
|
@@ -1,3 +1,4 @@
|
|
| 1 |
__pycache__/
|
| 2 |
*.pyc
|
| 3 |
.pt file
|
|
|
|
|
|
| 1 |
__pycache__/
|
| 2 |
*.pyc
|
| 3 |
.pt file
|
| 4 |
+
.eval_cache/
|
README.md
CHANGED
|
@@ -150,7 +150,7 @@ A self-contained machine. State goes in, state comes out:
|
|
| 150 |
|
| 151 |
### State tensor layout
|
| 152 |
|
| 153 |
-
|
| 154 |
|
| 155 |
```
|
| 156 |
[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
|
|
@@ -158,6 +158,24 @@ All multi-bit fields are MSB-first (index 0 is the most-significant bit).
|
|
| 158 |
|
| 159 |
`N` is the address width (configurable, 0–16). Flags are ordered `Z, N, C, V`. Control bits are ordered `HALT, MEM_WE, MEM_RE, RESERVED`.
|
| 160 |
|
| 161 |
### Instruction encoding (16-bit, MSB-first)
|
| 162 |
|
| 163 |
```
|
|
|
|
| 150 |
|
| 151 |
### State tensor layout
|
| 152 |
|
| 153 |
+
The **state tensor** uses MSB-first bit ordering: index 0 of each multi-bit field is the most-significant bit. So `R0[0]` is bit 7 of the architectural register, `R0[7]` is bit 0.
|
| 154 |
|
| 155 |
```
|
| 156 |
[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
|
|
|
|
| 158 |
|
| 159 |
`N` is the address width (configurable, 0–16). Flags are ordered `Z, N, C, V`. Control bits are ordered `HALT, MEM_WE, MEM_RE, RESERVED`.
|
| 160 |
|
| 161 |
+
#### Bit ordering, one rule per scope
|
| 162 |
+
|
| 163 |
+
The state tensor's MSB-first convention does **not** propagate to subcircuit ports. Each subcircuit names its operand bits in its own scope:
|
| 164 |
+
|
| 165 |
+
| Scope | Convention | Example |
|
| 166 |
+
|---|---|---|
|
| 167 |
+
| State tensor | MSB-first (index 0 = MSB) | `R0[0]` is bit 7 of register R0 |
|
| 168 |
+
| Subcircuit external ports (`$a[i]`, `$b[i]`) | LSB-indexed (index 0 = LSB) | `$a[0]` is bit 0 of operand `a` |
|
| 169 |
+
| Ripple-carry full adders (`fa0..fa7`) | LSB-first (fa0 = LSB) | `fa0` consumes `$a[0]` and `$b[0]` |
|
| 170 |
+
| Instruction word | MSB-first (bit 15 = opcode high) | bit 15 is `opcode[3]` |
|
| 171 |
+
|
| 172 |
+
Worked example for `arithmetic.ripplecarry8bit`:
|
| 173 |
+
|
| 174 |
+
- Inputs: `$a[0]..$a[7]` and `$b[0]..$b[7]` where `$a[0]` is the LSB of `a`. To add `a = 0x05 = 0b00000101` and `b = 0x03`, drive `a[0]=1, a[1]=0, a[2]=1` (rest 0) and `b[0]=1, b[1]=1` (rest 0).
|
| 175 |
+
- Outputs: `fa0.ha2.sum.layer2`..`fa7.ha2.sum.layer2` are sum bits 0..7 (LSB to MSB), and `fa7.carry_or` is the final carry-out. The 8-bit result is `{fa7..fa0}` reading high-to-low.
|
| 176 |
+
|
| 177 |
+
This is also how `safetensors2verilog`'s threshold-logic frontend exposes the ports of any extracted subcircuit. See the project's testbench at `tests/threshold_alu/run.py` for a worked end-to-end example, or use `python -m safetensors2verilog ... --inspect` to print the port contract for any extracted circuit.
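
The two conventions in the table above can be sketched in plain Python. This is a minimal reference model, not the project's code: the helper names (`to_lsb_bits`, `to_msb_bits`, `from_lsb_bits`, `ripple_add`) are hypothetical, and the adder is an ordinary software ripple-carry model standing in for the threshold-logic `fa0..fa7` chain.

```python
def to_lsb_bits(value, width=8):
    """Subcircuit port order: index 0 is the least-significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def to_msb_bits(value, width=8):
    """State-tensor order: index 0 is the most-significant bit."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def from_lsb_bits(bits):
    """Rebuild an integer from an LSB-indexed bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(a_bits, b_bits):
    """Software stand-in for fa0..fa7: returns (sum bits LSB-first, carry-out)."""
    carry, sums = 0, []
    for a, b in zip(a_bits, b_bits):  # stage i plays the role of fa<i>
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry


a_port = to_lsb_bits(0x05)  # a[0]=1, a[1]=0, a[2]=1, rest 0
b_port = to_lsb_bits(0x03)  # b[0]=1, b[1]=1, rest 0
sum_bits, carry_out = ripple_add(a_port, b_port)
print(from_lsb_bits(sum_bits), carry_out)
print(to_msb_bits(0x05))    # the same value as it would sit in a register field
```

Converting between the two scopes is just a list reversal, which is why the state tensor and the subcircuit ports can disagree on indexing without any extra logic.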
|
| 178 |
+
|
| 179 |
### Instruction encoding (16-bit, MSB-first)
|
| 180 |
|
| 181 |
```
|
docs/ISA.md
ADDED
|
@@ -0,0 +1,135 @@
|
|
| 1 |
+
# ISA reference card — 8-bit threshold-logic CPU
|
| 2 |
+
|
| 3 |
+
This is the architecture exposed by the safetensors files. Every instruction below is *implemented entirely as threshold neurons*; the same gate-level circuits run whether you simulate in Python (`eval.py` / `play.py` / `test_cpu.py`) or compile the CPU's threshold network through `safetensors2verilog` to FPGA-synthesizable Verilog.
|
| 4 |
+
|
| 5 |
+
## Architectural state
|
| 6 |
+
|
| 7 |
+
| Field | Width | Notes |
|
| 8 |
+
|---|---|---|
|
| 9 |
+
| PC | N bits | program counter; N = address width (0–16) |
|
| 10 |
+
| IR | 16 bits | instruction register |
|
| 11 |
+
| R0–R3 | 8 bits each | general-purpose registers |
|
| 12 |
+
| FLAGS | 4 bits | Z, N, C, V |
|
| 13 |
+
| SP | N bits | stack pointer (CALL/RET) |
|
| 14 |
+
| CTRL | 4 bits | HALT, MEM_WE, MEM_RE, RESERVED |
|
| 15 |
+
| MEM | 2^N × 8 bits | byte-addressable memory |
|
| 16 |
+
|
| 17 |
+
State tensor layout (MSB-first within each multi-bit field):
|
| 18 |
+
|
| 19 |
+
```
|
| 20 |
+
[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
|
| 21 |
+
```
|
| 22 |
+
|
| 23 |
+
## Instruction encoding
|
| 24 |
+
|
| 25 |
+
```
|
| 26 |
+
15..12 11..10 9..8 7..0
|
| 27 |
+
opcode rd rs imm8
|
| 28 |
+
```
|
| 29 |
+
|
| 30 |
+
| Class | Use of fields |
|
| 31 |
+
|---|---|
|
| 32 |
+
| **R-type** | `rd = rd op rs` — `imm8` ignored |
|
| 33 |
+
| **I-type** | `rd = op rd, imm8` — `rs` ignored |
|
| 34 |
+
| **Address-extended** | next 16-bit word is the absolute address (big-endian); `imm8` reserved. Applies to `LOAD`, `STORE`, `JMP`, `Jcc`, `CALL`. |
|
| 35 |
+
|
| 36 |
+
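Address-extended instructions consume **4 bytes** (instruction word + address word). An untaken conditional jump still skips the address word, so the PC advances by 4 whenever control does not transfer; a taken jump, `JMP`, or `CALL` loads the PC from the address word instead.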
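
As an illustration of the field layout above, here is a small encoder/decoder sketch. The function names are hypothetical and not part of `cpu_programs.py`. One assumption is labeled in the code: the source states that the address word is big-endian, and this sketch assumes the instruction word is stored in the same byte order.

```python
def encode(opcode, rd=0, rs=0, imm8=0):
    """Pack the four fields into a 16-bit word: opcode[15:12] rd[11:10] rs[9:8] imm8[7:0]."""
    for name, value, limit in (("opcode", opcode, 16), ("rd", rd, 4),
                               ("rs", rs, 4), ("imm8", imm8, 256)):
        if not 0 <= value < limit:
            raise ValueError(f"{name}={value} does not fit its field")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word):
    """Split a 16-bit word back into (opcode, rd, rs, imm8)."""
    return (word >> 12) & 0xF, (word >> 10) & 0x3, (word >> 8) & 0x3, word & 0xFF


def address_extended(opcode, addr, rd=0, rs=0, imm8=0):
    """Four bytes: instruction word followed by the absolute address word.

    ASSUMPTION: the instruction word uses the same big-endian byte order
    that the source specifies for the address word.
    """
    word = encode(opcode, rd, rs, imm8)
    return word.to_bytes(2, "big") + addr.to_bytes(2, "big")


print(hex(encode(0x0, rd=1, rs=2)))            # ADD R1, R2
print(decode(0x0600))
print(address_extended(0xB, 0x0010, rs=0).hex())  # STORE R0 -> M[0x0010]
```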
|
| 37 |
+
|
| 38 |
+
## Opcode table
|
| 39 |
+
|
| 40 |
+
| Opcode | Mnemonic | Class | Operation |
|
| 41 |
+
|---|---|---|---|
|
| 42 |
+
| 0x0 | ADD | R | R[rd] = R[rd] + R[rs] |
|
| 43 |
+
| 0x1 | SUB | R | R[rd] = R[rd] - R[rs] |
|
| 44 |
+
| 0x2 | AND | R | R[rd] = R[rd] & R[rs] |
|
| 45 |
+
| 0x3 | OR | R | R[rd] = R[rd] \| R[rs] |
|
| 46 |
+
| 0x4 | XOR | R | R[rd] = R[rd] ^ R[rs] |
|
| 47 |
+
| 0x5 | SHL | R | R[rd] = R[rd] << 1 |
|
| 48 |
+
| 0x6 | SHR | R | R[rd] = R[rd] >> 1 |
|
| 49 |
+
| 0x7 | MUL | R | R[rd] = R[rd] * R[rs] (low 8 bits) |
|
| 50 |
+
| 0x8 | DIV | R | R[rd] = R[rd] / R[rs] |
|
| 51 |
+
| 0x9 | CMP | R | flags = R[rd] - R[rs] (no writeback) |
|
| 52 |
+
| 0xA | LOAD | A | R[rd] = M[addr] |
|
| 53 |
+
| 0xB | STORE | A | M[addr] = R[rs] |
|
| 54 |
+
| 0xC | JMP | A | PC = addr |
|
| 55 |
+
| 0xD | Jcc | A | PC = addr if cond. imm8[2:0] selects condition |
|
| 56 |
+
| 0xE | CALL | A | push PC; PC = addr |
|
| 57 |
+
| 0xF | HALT | – | stop execution |
|
| 58 |
+
|
| 59 |
+
### Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)
|
| 60 |
+
|
| 61 |
+
| imm8[2:0] | Mnemonic | Fires when |
|
| 62 |
+
|---|---|---|
|
| 63 |
+
| 0 | JZ | Z flag set (last result was zero) |
|
| 64 |
+
| 1 | JNZ | Z flag clear |
|
| 65 |
+
| 2 | JC | carry-out set (last add overflowed unsigned) |
|
| 66 |
+
| 3 | JNC | carry-out clear |
|
| 67 |
+
| 4 | JN | result was negative (sign bit set) |
|
| 68 |
+
| 5 | JP | result was non-negative (sign bit clear) |
|
| 69 |
+
| 6 | JV | signed-overflow flag set |
|
| 70 |
+
| 7 | JNV | signed-overflow flag clear |
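
The selector table can be read as "bits 2:1 pick a flag, bit 0 inverts it". The following is a plain-Python sketch of that reading, with hypothetical function names; it models the documented behavior and is not taken from the CPU's gate-level implementation.

```python
def jcc_taken(selector, z, n, c, v):
    """Return True when the conditional jump fires for the given flag values.

    selector is imm8[2:0]: 0=JZ 1=JNZ 2=JC 3=JNC 4=JN 5=JP 6=JV 7=JNV.
    """
    selector &= 0b111
    flag = (z, c, n, v)[selector >> 1]  # selector pairs test Z, C, N, V in that order
    invert = selector & 1               # odd selectors fire when the flag is clear
    return bool(flag) != bool(invert)


def next_pc(pc, addr, taken):
    """Taken jumps load addr; untaken ones step past the 4-byte instruction."""
    return addr if taken else pc + 4


flags = dict(z=1, n=0, c=0, v=0)        # e.g. after a CMP of two equal registers
print(jcc_taken(0, **flags), jcc_taken(1, **flags))
print(next_pc(8, 0x20, jcc_taken(1, **flags)))
```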
|
| 71 |
+
|
| 72 |
+
## Worked example: write your own program
|
| 73 |
+
|
| 74 |
+
The Python assembler in `cpu_programs.py` exposes one-method-per-mnemonic helpers on a tiny `Asm` class. Here's "store the value 7 to address 0x10, then halt":
|
| 75 |
+
|
| 76 |
+
```python
|
| 77 |
+
from cpu_programs import Asm
|
| 78 |
+
|
| 79 |
+
a = Asm(size=64) # 64 bytes of memory
|
| 80 |
+
a.org(0)
|
| 81 |
+
# Set R0 to 7. There is no LDI, so the constant lives in memory and is
|
| 82 |
+
# fetched with LOAD (the XOR below only clears R0 first).
|
| 83 |
+
a.org(32)
|
| 84 |
+
a.label("seven"); a.db(7) # label and constant both at addr 32, so "seven" resolves to the byte holding 7
|
| 85 |
+
|
| 86 |
+
a.org(0)
|
| 87 |
+
a.xor_(0, 0) # R0 = 0
|
| 88 |
+
a.load(0, "seven") # R0 = M[seven] = 7
|
| 89 |
+
a.store(0, "dest") # M[dest] = R0
|
| 90 |
+
a.halt()
|
| 91 |
+
|
| 92 |
+
a.org(0x10); a.label("dest"); a.db(0) # destination cell, pinned at address 0x10
|
| 93 |
+
|
| 94 |
+
bytes_ = a.assemble()
|
| 95 |
+
```
|
| 96 |
+
|
| 97 |
+
Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.
|
| 98 |
+
|
| 99 |
+
## Using the CPU as a threshold-network forward pass
|
| 100 |
+
|
| 101 |
+
The CPU is a single tensor program. State in, state out. The driver:
|
| 102 |
+
|
| 103 |
+
1. Builds an initial state tensor with the program loaded at `MEM[0..]`.
|
| 104 |
+
2. Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
|
| 105 |
+
3. After at most `max_cycles` cycles (or earlier if the HALT control bit fires), reads the final memory contents.
|
| 106 |
+
|
| 107 |
+
Concretely, this is what `test_cpu.py` and `play.py` already do; both serve as runnable tutorials. The minimal driver loop is:
|
| 108 |
+
|
| 109 |
+
```python
|
| 110 |
+
from build import ThresholdComputer
|
| 111 |
+
from safetensors.torch import load_file
|
| 112 |
+
|
| 113 |
+
tensors = load_file("variants/neural_computer8_small.safetensors")
|
| 114 |
+
cpu = ThresholdComputer(tensors, data_bits=8)
|
| 115 |
+
state = cpu.initial_state(memory=bytes_)
|
| 116 |
+
state = cpu.run(state, max_cycles=200)
|
| 117 |
+
result = cpu.read_memory(state, addr=0x10)
|
| 118 |
+
print(result) # 7
|
| 119 |
+
```
|
| 120 |
+
|
| 121 |
+
## Common pitfalls
|
| 122 |
+
|
| 123 |
+
- **No load-immediate.** `LOAD` reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and `LOAD` it.
|
| 124 |
+
- **Address-extended instructions are 4 bytes wide.** Branch targets must point at the start of an instruction word, not into the middle of one.
|
| 125 |
+
- **`MUL` keeps only the low 8 bits.** Detect overflow via `CMP` against expected truncation.
|
| 126 |
+
- **`CMP` writes only flags**, never the destination register. Follow it with a `Jcc` to act on the result.
|
| 127 |
+
- **`SHL` and `SHR` shift by 1.** No variable-amount shifter; chain them or compose with bit operations.
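
Two of the pitfalls above are easy to check against a plain-Python model before writing assembly. This sketch only mirrors the documented 8-bit semantics (single-step shifts, `MUL` keeping the low 8 bits); the helper names are hypothetical and nothing here touches the threshold network.

```python
MASK8 = 0xFF


def shl_n(value, count):
    """Model a chain of `count` single-bit SHL instructions on an 8-bit register."""
    for _ in range(count):
        value = (value << 1) & MASK8
    return value


def mul8(a, b):
    """Model MUL: return (low 8 bits of the product, whether high bits were lost)."""
    product = a * b
    return product & MASK8, product > MASK8


print(hex(shl_n(0x03, 3)))
print(mul8(7, 6))
print(mul8(16, 16))
```

A model like this gives the expected truncated value to compare against when a test program checks a `MUL` result.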
|
| 128 |
+
|
| 129 |
+
## Threshold-network artefacts you'll want next
|
| 130 |
+
|
| 131 |
+
- `python eval_all.py variants/<file>.safetensors` — gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
|
| 132 |
+
- `python eval_all.py --cpu-program variants/<file>.safetensors` — assembled program through the threshold-gated CPU.
|
| 133 |
+
- `python -m safetensors2verilog <file>.safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v` — extract one circuit, dependency-closed, into synthesizable Verilog.
|
| 134 |
+
- `python -m safetensors2verilog ... --inspect` — print the port contract for any extracted circuit (which pins exist, what widths).
|
| 135 |
+
- `python -m safetensors2verilog ... --equiv-check` — automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.
|
docs/float-pipeline.md
ADDED
|
@@ -0,0 +1,39 @@
|
|
| 1 |
+
# IEEE-754 add: composing the stages
|
| 2 |
+
|
| 3 |
+
The threshold network ships separate `float16.unpack`, `float16.add`, `float16.normalize`, and `float16.pack` (and the `float32.*` siblings). Each is an independent threshold-logic subcircuit with its own external ports.
|
| 4 |
+
|
| 5 |
+
## What you actually get from each stage
|
| 6 |
+
|
| 7 |
+
| Stage | Inputs | Outputs |
|
| 8 |
+
|---|---|---|
|
| 9 |
+
| `float16.unpack` | 1 generic input bit | 16 `bit0..bit15` outputs (sign / exp / mantissa fields) |
|
| 10 |
+
| `float16.add` | 5 `exp_a` + 5 `exp_b` + 5 generic input bits = 15 inputs | 499 outputs covering align, exp_diff, mant_add, mant_sub, mant_select, sign_xor stages |
|
| 11 |
+
| `float16.normalize` | 2 generic input bits | 356 outputs covering exp_adj and mantissa-shift stages |
|
| 12 |
+
| `float16.pack` | 1 generic input bit | 16 `bit0..bit15` outputs (assembled IEEE-754 word) |
|
| 13 |
+
|
| 14 |
+
Use `python -m safetensors2verilog ... --circuit float16.<stage> --inspect` to print the live contract for any stage from the variant you're using.
|
| 15 |
+
|
| 16 |
+
## What's missing for a single composed block
|
| 17 |
+
|
| 18 |
+
None of the four stages exposes ports named in a way that says "I produce the operand the next stage consumes." `float16.unpack` outputs `float16.unpack.bit0..15`; `float16.add` consumes `$float16_exp_a[0..4]` and an opaque `$input[0..4]`. The wire-up between `unpack` and `add` is not encoded in the safetensors metadata. It exists only in the original construction code at `build.py` and in the gate-fitness harness at `eval.py`.
|
| 19 |
+
|
| 20 |
+
Two consequences:
|
| 21 |
+
|
| 22 |
+
1. **Subcircuit extraction works**, but each stage compiles to a standalone module whose bit-bag inputs and outputs need a hand-written wrapper to chain.
|
| 23 |
+
2. **A single `float16.add_full` Verilog block** that runs the complete IEEE-754 pipeline for a 16-bit operand pair is not derivable from the published safetensors files alone.
|
| 24 |
+
|
| 25 |
+
## The intended path forward
|
| 26 |
+
|
| 27 |
+
The right fix is in `build.py`: when generating each float stage, register the inter-stage signal IDs so that, e.g., `float16.unpack.bit15` (sign of `a`) and `float16.add.sign_a` are the same signal in the global registry, with metadata that exposes the per-stage contract by *role* (`sign_a`, `exp_a`, `mant_a`) rather than by ad-hoc port name. The schema-versioned metadata from `safetensors2verilog`'s frontend (see `core.SIGNAL_REGISTRY_SCHEMA_VERSION_LATEST`) is the place to land this.
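
To make the role-based contract concrete, here is a sketch of what the `sign` / `exp` / `mant` split of a float16 word looks like in ordinary Python. It uses the standard IEEE-754 binary16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits) and is independent of the threshold circuits; the function names are hypothetical.

```python
import struct


def split_float16(word):
    """Split a 16-bit IEEE-754 binary16 word into its fields by role."""
    return {
        "sign": (word >> 15) & 0x1,   # bit 15
        "exp": (word >> 10) & 0x1F,   # bits 14..10, biased by 15
        "mant": word & 0x3FF,         # bits 9..0, implicit leading 1 for normals
    }


def join_float16(sign, exp, mant):
    """Reassemble the word from its role-named fields."""
    return (sign << 15) | (exp << 10) | mant


def float16_value(word):
    """Decode the word with the standard library, as an independent check."""
    return struct.unpack("<e", struct.pack("<H", word))[0]


fields = split_float16(0x3C00)
print(fields, float16_value(0x3C00))
print(hex(join_float16(1, 16, 0)), float16_value(join_float16(1, 16, 0)))
```

A hand-written wrapper that chains the extracted stages would need exactly this mapping from bit positions to roles, which is the information the metadata currently omits.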
|
| 28 |
+
|
| 29 |
+
Once that is in place, `safetensors2verilog --circuit float16.add_full` would emit a single composed top-level module by walking the now-explicit cross-stage wiring. The dependency-closure extractor already does the heavy lifting; it currently produces correct output for any circuit whose internal wiring is already explicit (e.g. `arithmetic.ripplecarry8bit`).
|
| 30 |
+
|
| 31 |
+
## Workaround today
|
| 32 |
+
|
| 33 |
+
For end-to-end float16-add evaluation, run the existing gate-level fitness suite, which exercises the composed pipeline through the Python evaluator:
|
| 34 |
+
|
| 35 |
+
```bash
|
| 36 |
+
python eval_all.py variants/neural_alu8.safetensors --debug 2>&1 | grep -A 20 "FLOAT16 ADD"
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
+
For Verilog-side experimentation, pick one stage at a time and feed it driven testbench inputs that match the role each port plays in the original pipeline (read off `build.py`'s `infer_*` routines for that stage).
|
eval_all.py
CHANGED
|
@@ -30,7 +30,7 @@ import os
|
|
| 30 |
import sys
|
| 31 |
import time
|
| 32 |
from pathlib import Path
|
| 33 |
-
from typing import Dict, List, Optional, Tuple
|
| 34 |
|
| 35 |
import torch
|
| 36 |
from safetensors import safe_open
|
|
@@ -443,6 +443,50 @@ def builtin_program(addr_bits: int) -> Tuple[List[int], int]:
|
|
| 443 |
# Eval driver
|
| 444 |
# ---------------------------------------------------------------------------
|
| 445 |
|
| 446 |
def list_safetensors(path: Path) -> List[Path]:
|
| 447 |
if path.is_file():
|
| 448 |
return [path]
|
|
@@ -594,6 +638,11 @@ def main() -> int:
|
|
| 594 |
help="Also run a small assembled program through the threshold CPU "
|
| 595 |
"(only applies to 8-bit variants with >= 512 B memory)")
|
| 596 |
parser.add_argument("--json", action="store_true", help="Emit JSON results to stdout instead of a table")
|
| 597 |
args = parser.parse_args()
|
| 598 |
|
| 599 |
files = list_safetensors(Path(args.path))
|
|
@@ -602,12 +651,35 @@ def main() -> int:
|
|
| 602 |
return 2
|
| 603 |
|
| 604 |
print(f"Evaluating {len(files)} file(s) on {args.device}\n")
|
| 605 |
results = []
|
| 606 |
fail_count = 0
|
| 607 |
for f in files:
|
| 608 |
print(f"=== {f.name}")
|
| 609 |
-
|
| 610 |
-
|
| 611 |
results.append(r)
|
| 612 |
print_row(r, show_cpu=args.cpu_program)
|
| 613 |
if r.get("status") != "PASS":
|
|
@@ -634,6 +706,8 @@ def main() -> int:
|
|
| 634 |
print(f"ALL {len(files)} variants PASS")
|
| 635 |
else:
|
| 636 |
print(f"{fail_count}/{len(files)} variants FAIL")
|
| 637 |
return fail_count
|
| 638 |
|
| 639 |
|
|
|
|
| 30 |
import sys
|
| 31 |
import time
|
| 32 |
from pathlib import Path
|
| 33 |
+
from typing import Any, Dict, List, Optional, Tuple
|
| 34 |
|
| 35 |
import torch
|
| 36 |
from safetensors import safe_open
|
|
|
|
| 443 |
# Eval driver
|
| 444 |
# ---------------------------------------------------------------------------
|
| 445 |
|
| 446 |
+
def _file_fingerprint(path: Path) -> str:
|
| 447 |
+
"""Stable cache key for a safetensors file: sha256 of its content.
|
| 448 |
+
|
| 449 |
+
The key is content-addressed, so renaming a file keeps its cache entry;
|
| 450 |
+
an mtime-based key would instead change on every clone of the repo. The sha256 of a
|
| 451 |
+
30 MB safetensors finishes in tens of milliseconds — small compared to
|
| 452 |
+
a 5,900-test fitness run.
|
| 453 |
+
"""
|
| 454 |
+
import hashlib
|
| 455 |
+
h = hashlib.sha256()
|
| 456 |
+
with open(path, "rb") as f:
|
| 457 |
+
for chunk in iter(lambda: f.read(1 << 20), b""):
|
| 458 |
+
h.update(chunk)
|
| 459 |
+
return h.hexdigest()
|
| 460 |
+
|
| 461 |
+
|
| 462 |
+
def _cache_key(path: Path, opts: Dict[str, Any]) -> str:
|
| 463 |
+
"""Cache key combining file content with the relevant evaluation options."""
|
| 464 |
+
fp = _file_fingerprint(path)
|
| 465 |
+
opt_str = json.dumps(opts, sort_keys=True)
|
| 466 |
+
import hashlib
|
| 467 |
+
suffix = hashlib.sha256(opt_str.encode("utf-8")).hexdigest()[:8]
|
| 468 |
+
return f"{fp}_{suffix}"
|
| 469 |
+
|
| 470 |
+
|
| 471 |
+
def _load_cache(cache_dir: Path, key: str) -> Optional[Dict[str, Any]]:
|
| 472 |
+
p = cache_dir / f"{key}.json"
|
| 473 |
+
if not p.exists():
|
| 474 |
+
return None
|
| 475 |
+
try:
|
| 476 |
+
return json.loads(p.read_text(encoding="utf-8"))
|
| 477 |
+
except (json.JSONDecodeError, OSError):
|
| 478 |
+
return None
|
| 479 |
+
|
| 480 |
+
|
| 481 |
+
def _save_cache(cache_dir: Path, key: str, payload: Dict[str, Any]) -> None:
|
| 482 |
+
cache_dir.mkdir(parents=True, exist_ok=True)
|
| 483 |
+
p = cache_dir / f"{key}.json"
|
| 484 |
+
try:
|
| 485 |
+
p.write_text(json.dumps(payload, indent=2, default=str), encoding="utf-8")
|
| 486 |
+
except OSError:
|
| 487 |
+
pass
|
| 488 |
+
|
| 489 |
+
|
| 490 |
def list_safetensors(path: Path) -> List[Path]:
|
| 491 |
if path.is_file():
|
| 492 |
return [path]
|
|
|
|
| 638 |
help="Also run a small assembled program through the threshold CPU "
|
| 639 |
"(only applies to 8-bit variants with >= 512 B memory)")
|
| 640 |
parser.add_argument("--json", action="store_true", help="Emit JSON results to stdout instead of a table")
|
| 641 |
+
parser.add_argument("--cache-dir", default=".eval_cache",
|
| 642 |
+
help="Directory for hash-keyed result cache "
|
| 643 |
+
"(default: ./.eval_cache). Set to '' to disable.")
|
| 644 |
+
parser.add_argument("--no-cache", action="store_true",
|
| 645 |
+
help="Disable the result cache for this run.")
|
| 646 |
args = parser.parse_args()
|
| 647 |
|
| 648 |
files = list_safetensors(Path(args.path))
|
|
|
|
| 651 |
return 2
|
| 652 |
|
| 653 |
print(f"Evaluating {len(files)} file(s) on {args.device}\n")
|
| 654 |
+
cache_enabled = bool(args.cache_dir) and not args.no_cache
|
| 655 |
+
cache_dir = Path(args.cache_dir) if cache_enabled else None
|
| 656 |
+
cache_opts = {
|
| 657 |
+
"device": args.device,
|
| 658 |
+
"pop_size": args.pop_size,
|
| 659 |
+
"cpu_program": bool(args.cpu_program),
|
| 660 |
+
}
|
| 661 |
+
cache_hits = 0
|
| 662 |
results = []
|
| 663 |
fail_count = 0
|
| 664 |
for f in files:
|
| 665 |
print(f"=== {f.name}")
|
| 666 |
+
cached = None
|
| 667 |
+
key = None
|
| 668 |
+
if cache_enabled:
|
| 669 |
+
try:
|
| 670 |
+
key = _cache_key(f, cache_opts)
|
| 671 |
+
cached = _load_cache(cache_dir, key)
|
| 672 |
+
except OSError:
|
| 673 |
+
cached = None
|
| 674 |
+
if cached is not None:
|
| 675 |
+
r = cached
|
| 676 |
+
cache_hits += 1
|
| 677 |
+
print(" (cache hit)")
|
| 678 |
+
else:
|
| 679 |
+
r = evaluate_one(f, device=args.device, pop_size=args.pop_size,
|
| 680 |
+
debug=args.debug, run_cpu_program=args.cpu_program)
|
| 681 |
+
if cache_enabled and key is not None:
|
| 682 |
+
_save_cache(cache_dir, key, r)
|
| 683 |
results.append(r)
|
| 684 |
print_row(r, show_cpu=args.cpu_program)
|
| 685 |
if r.get("status") != "PASS":
|
|
|
|
| 706 |
print(f"ALL {len(files)} variants PASS")
|
| 707 |
else:
|
| 708 |
print(f"{fail_count}/{len(files)} variants FAIL")
|
| 709 |
+
if cache_enabled:
|
| 710 |
+
print(f"(cache: {cache_hits}/{len(files)} hits, dir={cache_dir})")
|
| 711 |
return fail_count
|
| 712 |
|
| 713 |
|
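
The cache-key scheme in this diff can be exercised in isolation. The sketch below re-implements the same idea (sha256 of the file content plus a short hash of the options) under hypothetical names, and adds one variation that is not in the commit: writing the JSON through a temporary file and `os.replace`, so an interrupted run cannot leave a half-written cache entry.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def file_fingerprint(path):
    """sha256 of the file content, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_key(path, opts):
    """Content hash plus an 8-hex-digit digest of the sorted options."""
    opt_str = json.dumps(opts, sort_keys=True)
    suffix = hashlib.sha256(opt_str.encode("utf-8")).hexdigest()[:8]
    return f"{file_fingerprint(path)}_{suffix}"


def save_atomic(cache_dir, key, payload):
    """Variation on the commit's writer: write to a temp file, then rename into place."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, default=str)
    target = cache_dir / f"{key}.json"
    os.replace(tmp_name, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    model = root / "a.safetensors"
    model.write_bytes(b"stand-in weights")
    opts = {"device": "cpu", "pop_size": 1, "cpu_program": False}
    key = cache_key(model, opts)
    renamed = model.rename(root / "b.safetensors")
    print(cache_key(renamed, opts) == key)                            # same content, same key
    print(cache_key(renamed, {**opts, "cpu_program": True}) == key)   # options change the key
    print(save_atomic(root / ".eval_cache", key, {"status": "PASS"}).name == f"{key}.json")
```

Note that a key built this way does not change when the evaluation code itself changes, so `--no-cache` remains the safe choice after editing the test suite.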