eval_all: hash-keyed result cache (--cache-dir, --no-cache); README: bit-ordering scope rules; docs/ISA.md: opcode reference and end-to-end tutorial; docs/float-pipeline.md: composition gap notes

Files changed (5)

.gitignore +1 -0
README.md +19 -1
docs/ISA.md +135 -0
docs/float-pipeline.md +39 -0
eval_all.py +77 -3

.gitignore CHANGED

@@ -1,3 +1,4 @@
 __pycache__/
 *.pyc
 .pt file

 __pycache__/
 *.pyc
 .pt file
+.eval_cache/

README.md CHANGED

@@ -150,7 +150,7 @@ A self-contained machine. State goes in, state comes out:
 ### State tensor layout
-All multi-bit fields are MSB-first (index 0 is the most-significant bit).
 ```
 [ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
@@ -158,6 +158,24 @@ All multi-bit fields are MSB-first (index 0 is the most-significant bit).
 `N` is the address width (configurable, 0–16). Flags are ordered `Z, N, C, V`. Control bits are ordered `HALT, MEM_WE, MEM_RE, RESERVED`.
 ### Instruction encoding (16-bit, MSB-first)
 ```

 ### State tensor layout
+The **state tensor** uses MSB-first bit ordering: index 0 of each multi-bit field is the most-significant bit. So `R0[0]` is bit 7 of the architectural register and `R0[7]` is bit 0.
 ```
 [ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
 `N` is the address width (configurable, 0–16). Flags are ordered `Z, N, C, V`. Control bits are ordered `HALT, MEM_WE, MEM_RE, RESERVED`.
+#### Bit ordering, one rule per scope
+The state tensor's MSB-first convention does **not** propagate to subcircuit ports. Each subcircuit names its operand bits in its own scope:
+| Scope | Convention | Example |
+|---|---|---|
+| State tensor | MSB-first (index 0 = MSB) | `R0[0]` is bit 7 of register R0 |
+| Subcircuit external ports (`$a[i]`, `$b[i]`) | LSB-indexed (index 0 = LSB) | `$a[0]` is bit 0 of operand `a` |
+| Ripple-carry full adders (`fa0..fa7`) | LSB-first (fa0 = LSB) | `fa0` consumes `$a[0]` and `$b[0]` |
+| Instruction word | MSB-first (bit 15 = opcode high) | bit 15 is `opcode[3]` |
+Worked example for `arithmetic.ripplecarry8bit`:
+- Inputs: `$a[0]..$a[7]` and `$b[0]..$b[7]` where `$a[0]` is the LSB of `a`. To add `a = 0x05 = 0b00000101` and `b = 0x03`, drive `a[0]=1, a[1]=0, a[2]=1` (rest 0) and `b[0]=1, b[1]=1` (rest 0).
+- Outputs: `fa0.ha2.sum.layer2`..`fa7.ha2.sum.layer2` are sum bits 0..7 (LSB to MSB), and `fa7.carry_or` is the final carry-out. The 8-bit result is `{fa7..fa0}` reading high-to-low.
+This is also how `safetensors2verilog`'s threshold-logic frontend exposes the ports of any extracted subcircuit. See the project's testbench at `tests/threshold_alu/run.py` for a worked end-to-end example, or use `python -m safetensors2verilog ... --inspect` to print the port contract for any extracted circuit.
 ### Instruction encoding (16-bit, MSB-first)
 ```
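A minimal sketch of the bit-ordering scopes from the README table above, written in plain Python rather than against the threshold circuit. The helper names (`int_to_msb_first`, `int_to_lsb_indexed`, `ripple_add`) are illustrative and are not part of the repository; the adder is an ordinary reference implementation used only to check the `0x05 + 0x03` worked example.

```python
def int_to_msb_first(value, width=8):
    """State-tensor order: index 0 is the most-significant bit."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def int_to_lsb_indexed(value, width=8):
    """Subcircuit-port order: index 0 is the least-significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def lsb_indexed_to_int(bits):
    """Reassemble an integer from LSB-indexed bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(a_bits, b_bits):
    """Reference ripple-carry add over LSB-indexed bits (not the threshold circuit)."""
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        sum_bits.append(total & 1)   # plays the role of one full adder's sum output
        carry = total >> 1           # carry into the next, more significant, stage
    return sum_bits, carry


a_port = int_to_lsb_indexed(0x05)    # a[0]=1, a[1]=0, a[2]=1, rest 0
b_port = int_to_lsb_indexed(0x03)    # b[0]=1, b[1]=1, rest 0
sum_bits, carry_out = ripple_add(a_port, b_port)
result = lsb_indexed_to_int(sum_bits)

# A register field copied out of the state tensor must be reversed
# before it is driven onto LSB-indexed subcircuit ports.
r0_field = int_to_msb_first(0x05)
r0_as_port = list(reversed(r0_field))
```

Under these assumptions, crossing from the state tensor into a subcircuit needs only the reversal shown on the last line.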

docs/ISA.md ADDED

	@@ -0,0 +1,135 @@

+# ISA reference card — 8-bit threshold-logic CPU
+This is the architecture exposed by the safetensors files. Every instruction below is *implemented entirely as threshold neurons*; the same gate-level circuits run whether you simulate in Python (`eval.py` / `play.py` / `test_cpu.py`) or compile the CPU's threshold network through `safetensors2verilog` to FPGA-synthesizable Verilog.
+## Architectural state
+| Field | Width | Notes |
+|---|---|---|
+| PC | N bits | program counter; N = address width (0–16) |
+| IR | 16 bits | instruction register |
+| R0–R3 | 8 bits each | general-purpose registers |
+| FLAGS | 4 bits | Z, N, C, V |
+| SP | N bits | stack pointer (CALL/RET) |
+| CTRL | 4 bits | HALT, MEM_WE, MEM_RE, RESERVED |
+| MEM | 2^N × 8 bits | byte-addressable memory |
+State tensor layout (MSB-first within each multi-bit field):
+```
+[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
+```
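The layout above fixes the order and width of each field, so bit offsets follow by accumulation. The sketch below computes them for a given address width. `state_layout` and `mem_byte_offset` are hypothetical helpers, not functions from `build.py`, and the assumption that memory bytes are stored in ascending address order is mine.

```python
def state_layout(n):
    """Return {field: (start_bit, width)} for address width n, in tensor order."""
    fields = [
        ("PC", n), ("IR", 16),
        ("R0", 8), ("R1", 8), ("R2", 8), ("R3", 8),
        ("FLAGS", 4), ("SP", n), ("CTRL", 4),
        ("MEM", (2 ** n) * 8),
    ]
    layout = {}
    offset = 0
    for name, width in fields:
        layout[name] = (offset, width)
        offset += width
    layout["TOTAL"] = (0, offset)
    return layout


def mem_byte_offset(layout, addr):
    """Start bit of memory byte `addr`, assuming ascending address order (an assumption)."""
    mem_start, _ = layout["MEM"]
    return mem_start + 8 * addr


layout = state_layout(4)   # 4-bit addresses, 16 bytes of memory
```

For `n = 4` the non-memory fields occupy 64 bits and memory adds 128, for a 192-bit state tensor.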
+## Instruction encoding
+```
+15..12   11..10   9..8   7..0
+opcode   rd       rs     imm8
+```
+| Class | Use of fields |
+|---|---|
+| **R-type** | `rd = rd op rs` — `imm8` ignored |
+| **I-type** | `rd = op rd, imm8` — `rs` ignored |
+| **Address-extended** | next 16-bit word is the absolute address (big-endian); `imm8` reserved. Applies to `LOAD`, `STORE`, `JMP`, `Jcc`, `CALL`. |
+Address-extended instructions consume **4 bytes** (instruction word + address word). Untaken conditional jumps still skip the address word, so a fall-through advances the PC by 4; a taken jump, `JMP`, or `CALL` sets the PC to the address instead.
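The field positions in the encoding diagram can be exercised with a small pack/unpack sketch. `encode` and `decode` are illustrative names, not the API of `cpu_programs.py`, and the sketch works on 16-bit words only; it makes no claim about how the assembler orders the instruction word's bytes in memory.

```python
def encode(opcode, rd=0, rs=0, imm8=0):
    """Pack the four fields into one 16-bit instruction word."""
    if not (0 <= opcode <= 0xF and 0 <= rd <= 3 and 0 <= rs <= 3 and 0 <= imm8 <= 0xFF):
        raise ValueError("field out of range")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word):
    """Split a 16-bit instruction word back into its fields."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 10) & 0x3,
        "rs": (word >> 8) & 0x3,
        "imm8": word & 0xFF,
    }


add_r1_r2 = encode(0x0, rd=1, rs=2)      # ADD: R1 = R1 + R2
xor_r0_r0 = encode(0x4, rd=0, rs=0)      # XOR: R0 = R0 ^ R0
jnz_word = encode(0xD, imm8=1)           # Jcc with condition 1 (JNZ)

# An address-extended instruction is the instruction word followed by
# a 16-bit absolute address word; the address word is big-endian.
jnz_to_0x20 = [jnz_word, 0x0020]
address_bytes = (0x0020).to_bytes(2, "big")
```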
+## Opcode table
+| Opcode | Mnemonic | Class | Operation |
+|---|---|---|---|
+| 0x0 | ADD     | R | R[rd] = R[rd] + R[rs] |
+| 0x1 | SUB     | R | R[rd] = R[rd] - R[rs] |
+| 0x2 | AND     | R | R[rd] = R[rd] & R[rs] |
+| 0x3 | OR      | R | R[rd] = R[rd] \| R[rs] |
+| 0x4 | XOR     | R | R[rd] = R[rd] ^ R[rs] |
+| 0x5 | SHL     | R | R[rd] = R[rd] << 1 |
+| 0x6 | SHR     | R | R[rd] = R[rd] >> 1 |
+| 0x7 | MUL     | R | R[rd] = R[rd] * R[rs]   (low 8 bits) |
+| 0x8 | DIV     | R | R[rd] = R[rd] / R[rs] |
+| 0x9 | CMP     | R | flags = R[rd] - R[rs]   (no writeback) |
+| 0xA | LOAD    | A | R[rd] = M[addr] |
+| 0xB | STORE   | A | M[addr] = R[rs] |
+| 0xC | JMP     | A | PC = addr |
+| 0xD | Jcc     | A | PC = addr if cond.  imm8[2:0] selects condition |
+| 0xE | CALL    | A | push PC; PC = addr |
+| 0xF | HALT    | – | stop execution |
+### Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)
+| imm8[2:0] | Mnemonic | Fires when |
+|---|---|---|
+| 0 | JZ | Z flag set (last result was zero) |
+| 1 | JNZ | Z flag clear |
+| 2 | JC | carry-out set (last add overflowed unsigned) |
+| 3 | JNC | carry-out clear |
+| 4 | JN | result was negative (sign bit set) |
+| 5 | JP | result was non-negative (sign bit clear) |
+| 6 | JV | signed-overflow flag set |
+| 7 | JNV | signed-overflow flag clear |
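The condition table pairs each flag with its negation, which the following sketch uses to decide whether a `Jcc` is taken and what the next PC is. The functions `jcc_taken` and `next_pc` and the dict-of-flags representation are assumptions for illustration; the real CPU evaluates this in threshold gates.

```python
# Flag tested by each pair of selector values: (0,1)->Z, (2,3)->C, (4,5)->N, (6,7)->V.
FLAG_FOR_PAIR = ["Z", "C", "N", "V"]


def jcc_taken(imm8, flags):
    """Return True if the condition selected by imm8[2:0] holds for `flags`."""
    selector = imm8 & 0b111
    flag_value = flags[FLAG_FOR_PAIR[selector >> 1]]
    want_set = (selector & 1) == 0       # even selectors fire on a set flag
    return bool(flag_value) == want_set


def next_pc(pc, imm8, flags, addr):
    """PC after a Jcc: the target if taken, otherwise past both words (+4)."""
    return addr if jcc_taken(imm8, flags) else pc + 4


flags_after_zero_result = {"Z": 1, "N": 0, "C": 0, "V": 0}
pc_jz = next_pc(8, 0, flags_after_zero_result, 0x20)    # JZ, taken
pc_jnz = next_pc(8, 1, flags_after_zero_result, 0x20)   # JNZ, falls through
```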
+## Worked example: write your own program
+The Python assembler in `cpu_programs.py` exposes one-method-per-mnemonic helpers on a tiny `Asm` class. Here's "store the value 7 to address 0x10, then halt":
+```python
+from cpu_programs import Asm
+a = Asm(size=64)        # 64 bytes of memory
+a.org(0)
+# There is no LDI: keep the constant in memory and LOAD it.
+a.xor_(0, 0)              # R0 = 0 (not required; LOAD overwrites R0)
+a.load(0, "seven")        # R0 = M[seven] = 7
+a.store(0, "dest")        # M[dest] = R0
+a.halt()
+a.org(0x10)
+a.label("dest"); a.db(0)  # destination cell at address 0x10
+a.org(32)
+a.label("seven"); a.db(7) # constant 7 at address 32
+bytes_ = a.assemble()
+```
+Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.
+## Using the CPU as a threshold-network forward pass
+The CPU is a single tensor program. State in, state out. The driver:
+1. Builds an initial state tensor with the program loaded at `MEM[0..]`.
+2. Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
+3. After at most `max_cycles` cycles (or earlier if the HALT control bit fires), reads the final memory contents.
+Concretely, this is what `test_cpu.py` and `play.py` already do; both serve as runnable tutorials. The minimal driver loop is:
+```python
+from build import ThresholdComputer
+from safetensors.torch import load_file
+tensors = load_file("variants/neural_computer8_small.safetensors")
+cpu = ThresholdComputer(tensors, data_bits=8)
+state = cpu.initial_state(memory=bytes_)
+state = cpu.run(state, max_cycles=200)
+result = cpu.read_memory(state, addr=0x10)
+print(result)   # 7
+```
+## Common pitfalls
+- **No load-immediate.** `LOAD` reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and `LOAD` it.
+- **Address-extended instructions are 4 bytes wide.** Branch targets must point at the start of an instruction word, not into the middle of one.
+- **`MUL` keeps only the low 8 bits.** Detect overflow via `CMP` against expected truncation.
+- **`CMP` writes only flags**, never the destination register. Follow it with a `Jcc` to act on the result.
+- **`SHL` and `SHR` shift by 1.** No variable-amount shifter; chain them or compose with bit operations.
+## Threshold-network artefacts you'll want next
+- `python eval_all.py variants/<file>.safetensors` — gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
+- `python eval_all.py --cpu-program variants/<file>.safetensors` — assembled program through the threshold-gated CPU.
+- `python -m safetensors2verilog <file>.safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v` — extract one circuit, dependency-closed, into synthesizable Verilog.
+- `python -m safetensors2verilog ... --inspect` — print the port contract for any extracted circuit (which pins exist, what widths).
+- `python -m safetensors2verilog ... --equiv-check` — automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.

docs/float-pipeline.md ADDED

	@@ -0,0 +1,39 @@

+# IEEE-754 add: composing the stages
+The threshold network ships separate `float16.unpack`, `float16.add`, `float16.normalize`, and `float16.pack` (and the `float32.*` siblings). Each is an independent threshold-logic subcircuit with its own external ports.
+## What you actually get from each stage
+| Stage | Inputs | Outputs |
+|---|---|---|
+| `float16.unpack` | 1 generic input bit | 16 `bit0..bit15` outputs (sign / exp / mantissa fields) |
+| `float16.add` | 5 `exp_a` + 5 `exp_b` + 5 generic input bits = 15 inputs | 499 outputs covering align, exp_diff, mant_add, mant_sub, mant_select, sign_xor stages |
+| `float16.normalize` | 2 generic input bits | 356 outputs covering exp_adj and mantissa-shift stages |
+| `float16.pack` | 1 generic input bit | 16 `bit0..bit15` outputs (assembled IEEE-754 word) |
+Use `python -m safetensors2verilog ... --circuit float16.<stage> --inspect` to print the live contract for any stage from the variant you're using.
+## What's missing for a single composed block
+None of the four stages exposes ports named in a way that says "I produce the operand the next stage consumes." `float16.unpack` outputs `float16.unpack.bit0..15`; `float16.add` consumes `$float16_exp_a[0..4]` and an opaque `$input[0..4]`. The wire-up between `unpack` and `add` is not encoded in the safetensors metadata. It exists only in the original construction code at `build.py` and in the gate-fitness harness at `eval.py`.
+Two consequences:
+1. **Subcircuit extraction works**, but each stage compiles to a standalone module whose bit-bag inputs and outputs need a hand-written wrapper to chain.
+2. **A single `float16.add_full` Verilog block** that runs the complete IEEE-754 pipeline for a 16-bit operand pair is not derivable from the published safetensors files alone.
+## The intended path forward
+The right fix is in `build.py`: when generating each float stage, register the inter-stage signal IDs so that, e.g., `float16.unpack.bit15` (sign of `a`) and `float16.add.sign_a` are the same signal in the global registry, with metadata that exposes the per-stage contract by *role* (`sign_a`, `exp_a`, `mant_a`) rather than by ad-hoc port name. The schema-versioned metadata from `safetensors2verilog`'s frontend (see `core.SIGNAL_REGISTRY_SCHEMA_VERSION_LATEST`) is the place to land this.
+Once that is in place, `safetensors2verilog --circuit float16.add_full` would emit a single composed top-level module by walking the now-explicit cross-stage wiring. The dependency-closure extractor already does the heavy lifting; it currently produces correct output for any circuit whose internal wiring is already explicit (e.g. `arithmetic.ripplecarry8bit`).
+## Workaround today
+For end-to-end float16-add evaluation, run the existing gate-level fitness suite, which exercises the composed pipeline through Python eval:
+```bash
+python eval_all.py variants/neural_alu8.safetensors --debug 2>&1 | grep -A 20 "FLOAT16 ADD"
+```
+For Verilog-side experimentation, pick one stage at a time and feed it driven testbench inputs that match the role each port plays in the original pipeline (read off `build.py`'s `infer_*` routines for that stage).
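When preparing testbench inputs by role, it helps to have the IEEE-754 half-precision fields of each operand at hand. The sketch below splits a value into `sign`, `exp`, and `mant` using Python's `struct` module. `float16_fields` is a hypothetical helper, and it deliberately does not map the fields onto `bit0..bit15` port names, since that mapping has to be read from `build.py`.

```python
import struct


def float16_fields(x):
    """Split a value into IEEE-754 binary16 fields: 1 sign, 5 exponent, 10 mantissa bits."""
    (word,) = struct.unpack("<H", struct.pack("<e", x))   # 'e' is the half-precision format
    return {
        "word": word,
        "sign": word >> 15,
        "exp": (word >> 10) & 0x1F,    # biased exponent (bias 15)
        "mant": word & 0x3FF,          # fraction bits without the implicit leading 1
    }


a_fields = float16_fields(1.5)     # 1.5 = 1.1b * 2^0
b_fields = float16_fields(-2.0)    # -2.0 = -1.0b * 2^1
```

The per-role values can then be driven onto whichever ports `--inspect` reports for the stage under test.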

eval_all.py CHANGED

@@ -30,7 +30,7 @@ import os
 import sys
 import time
 from pathlib import Path
-from typing import Dict, List, Optional, Tuple
 import torch
 from safetensors import safe_open
@@ -443,6 +443,50 @@ def builtin_program(addr_bits: int) -> Tuple[List[int], int]:
 # Eval driver
 # ---------------------------------------------------------------------------
 def list_safetensors(path: Path) -> List[Path]:
     if path.is_file():
         return [path]
@@ -594,6 +638,11 @@ def main() -> int:
                         help="Also run a small assembled program through the threshold CPU "
                              "(only applies to 8-bit variants with >= 512 B memory)")
     parser.add_argument("--json", action="store_true", help="Emit JSON results to stdout instead of a table")
     args = parser.parse_args()
     files = list_safetensors(Path(args.path))
@@ -602,12 +651,35 @@ def main() -> int:
         return 2
     print(f"Evaluating {len(files)} file(s) on {args.device}\n")
     results = []
     fail_count = 0
     for f in files:
         print(f"=== {f.name}")
-        r = evaluate_one(f, device=args.device, pop_size=args.pop_size,
-                         debug=args.debug, run_cpu_program=args.cpu_program)
         results.append(r)
         print_row(r, show_cpu=args.cpu_program)
         if r.get("status") != "PASS":
@@ -634,6 +706,8 @@ def main() -> int:
         print(f"ALL {len(files)} variants PASS")
     else:
         print(f"{fail_count}/{len(files)} variants FAIL")
     return fail_count

 import sys
 import time
 from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
 import torch
 from safetensors import safe_open
 # Eval driver
 # ---------------------------------------------------------------------------
+def _file_fingerprint(path: Path) -> str:
+    """Stable cache key for a safetensors file: sha256 of its content.
+    The key is content-addressed, so renaming a file does not invalidate
+    the cache. An mtime-based key would instead change on every fresh
+    clone of the repo. The sha256 of a 30 MB safetensors finishes in tens
+    of milliseconds, small compared to a 5,900-test fitness run.
+    """
+    import hashlib
+    h = hashlib.sha256()
+    with open(path, "rb") as f:
+        for chunk in iter(lambda: f.read(1 << 20), b""):
+            h.update(chunk)
+    return h.hexdigest()
+def _cache_key(path: Path, opts: Dict[str, Any]) -> str:
+    """Cache key combining file content with the relevant evaluation options."""
+    fp = _file_fingerprint(path)
+    opt_str = json.dumps(opts, sort_keys=True)
+    import hashlib
+    suffix = hashlib.sha256(opt_str.encode("utf-8")).hexdigest()[:8]
+    return f"{fp}_{suffix}"
+def _load_cache(cache_dir: Path, key: str) -> Optional[Dict[str, Any]]:
+    p = cache_dir / f"{key}.json"
+    if not p.exists():
+        return None
+    try:
+        return json.loads(p.read_text(encoding="utf-8"))
+    except (json.JSONDecodeError, OSError):
+        return None
+def _save_cache(cache_dir: Path, key: str, payload: Dict[str, Any]) -> None:
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    p = cache_dir / f"{key}.json"
+    try:
+        p.write_text(json.dumps(payload, indent=2, default=str), encoding="utf-8")
+    except OSError:
+        pass
 def list_safetensors(path: Path) -> List[Path]:
     if path.is_file():
         return [path]
                         help="Also run a small assembled program through the threshold CPU "
                              "(only applies to 8-bit variants with >= 512 B memory)")
     parser.add_argument("--json", action="store_true", help="Emit JSON results to stdout instead of a table")
+    parser.add_argument("--cache-dir", default=".eval_cache",
+                        help="Directory for hash-keyed result cache "
+                             "(default: ./.eval_cache). Set to '' to disable.")
+    parser.add_argument("--no-cache", action="store_true",
+                        help="Disable the result cache for this run.")
     args = parser.parse_args()
     files = list_safetensors(Path(args.path))
         return 2
     print(f"Evaluating {len(files)} file(s) on {args.device}\n")
+    cache_enabled = bool(args.cache_dir) and not args.no_cache
+    cache_dir = Path(args.cache_dir) if cache_enabled else None
+    cache_opts = {
+        "device": args.device,
+        "pop_size": args.pop_size,
+        "cpu_program": bool(args.cpu_program),
+    }
+    cache_hits = 0
     results = []
     fail_count = 0
     for f in files:
         print(f"=== {f.name}")
+        cached = None
+        key = None
+        if cache_enabled:
+            try:
+                key = _cache_key(f, cache_opts)
+                cached = _load_cache(cache_dir, key)
+            except OSError:
+                cached = None
+        if cached is not None:
+            r = cached
+            cache_hits += 1
+            print("   (cache hit)")
+        else:
+            r = evaluate_one(f, device=args.device, pop_size=args.pop_size,
+                             debug=args.debug, run_cpu_program=args.cpu_program)
+            if cache_enabled and key is not None:
+                _save_cache(cache_dir, key, r)
         results.append(r)
         print_row(r, show_cpu=args.cpu_program)
         if r.get("status") != "PASS":
         print(f"ALL {len(files)} variants PASS")
     else:
         print(f"{fail_count}/{len(files)} variants FAIL")
+    if cache_enabled:
+        print(f"(cache: {cache_hits}/{len(files)} hits, dir={cache_dir})")
     return fail_count
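The cache-key behaviour described in the diff can be checked in isolation. The sketch below is a standalone copy of the `_file_fingerprint` / `_cache_key` logic under local names, with small temporary files standing in for safetensors files; it is meant as a sanity check of the keying scheme, not as a replacement for the code in `eval_all.py`.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_fingerprint(path):
    """sha256 of the file content, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cache_key(path, opts):
    """Content hash plus an 8-hex-digit digest of the evaluation options."""
    opt_str = json.dumps(opts, sort_keys=True)
    suffix = hashlib.sha256(opt_str.encode("utf-8")).hexdigest()[:8]
    return f"{file_fingerprint(path)}_{suffix}"


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "variant.safetensors"
    renamed = Path(tmp) / "renamed.safetensors"
    original.write_bytes(b"stand-in tensor bytes")
    renamed.write_bytes(b"stand-in tensor bytes")

    opts = {"device": "cpu", "pop_size": 1, "cpu_program": False}
    key_original = cache_key(original, opts)
    key_renamed = cache_key(renamed, opts)                               # same content, new name
    key_other_opts = cache_key(original, {**opts, "cpu_program": True})  # same file, new options
```

Identical content yields the same key regardless of file name, while changing any option changes only the suffix after the underscore.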