CharlesCNorton committed on
Commit
597e7c2
·
1 Parent(s): 6241818

eval_all: hash-keyed result cache (--cache-dir, --no-cache); README: bit-ordering scope rules; docs/ISA.md: opcode reference and end-to-end tutorial; docs/float-pipeline.md: composition gap notes

Files changed (5)
  1. .gitignore +1 -0
  2. README.md +19 -1
  3. docs/ISA.md +135 -0
  4. docs/float-pipeline.md +39 -0
  5. eval_all.py +77 -3
.gitignore CHANGED
@@ -1,3 +1,4 @@
1
  __pycache__/
2
  *.pyc
3
  .pt file
 
 
1
  __pycache__/
2
  *.pyc
3
  .pt file
4
+ .eval_cache/
README.md CHANGED
@@ -150,7 +150,7 @@ A self-contained machine. State goes in, state comes out:
150
 
151
  ### State tensor layout
152
 
153
- All multi-bit fields are MSB-first (index 0 is the most-significant bit).
154
 
155
  ```
156
  [ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
@@ -158,6 +158,24 @@ All multi-bit fields are MSB-first (index 0 is the most-significant bit).
158
 
159
  `N` is the address width (configurable, 0–16). Flags are ordered `Z, N, C, V`. Control bits are ordered `HALT, MEM_WE, MEM_RE, RESERVED`.
160
 
161
  ### Instruction encoding (16-bit, MSB-first)
162
 
163
  ```
 
150
 
151
  ### State tensor layout
152
 
153
+ The **state tensor** uses MSB-first bit ordering: index 0 of each multi-bit field is the most-significant bit. So `R0[0]` is bit 7 of the architectural register, `R0[7]` is bit 0.
154
 
155
  ```
156
  [ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
 
158
 
159
  `N` is the address width (configurable, 0–16). Flags are ordered `Z, N, C, V`. Control bits are ordered `HALT, MEM_WE, MEM_RE, RESERVED`.
160
 
161
+ #### Bit ordering, one rule per scope
162
+
163
+ The state tensor's MSB-first convention does **not** propagate to subcircuit ports. Each subcircuit names its operand bits in its own scope:
164
+
165
+ | Scope | Convention | Example |
166
+ |---|---|---|
167
+ | State tensor | MSB-first (index 0 = MSB) | `R0[0]` is bit 7 of register R0 |
168
+ | Subcircuit external ports (`$a[i]`, `$b[i]`) | LSB-indexed (index 0 = LSB) | `$a[0]` is bit 0 of operand `a` |
169
+ | Ripple-carry full adders (`fa0..fa7`) | LSB-first (fa0 = LSB) | `fa0` consumes `$a[0]` and `$b[0]` |
170
+ | Instruction word | MSB-first (bit 15 = opcode high) | bit 15 is `opcode[3]` |
171
+
172
+ Worked example for `arithmetic.ripplecarry8bit`:
173
+
174
+ - Inputs: `$a[0]..$a[7]` and `$b[0]..$b[7]`, where `$a[0]` is the LSB of `a`. To add `a = 0x05 = 0b00000101` and `b = 0x03 = 0b00000011`, drive `$a[0]=1, $a[1]=0, $a[2]=1` (rest 0) and `$b[0]=1, $b[1]=1` (rest 0).
175
+ - Outputs: `fa0.ha2.sum.layer2`..`fa7.ha2.sum.layer2` are sum bits 0..7 (LSB to MSB), and `fa7.carry_or` is the final carry-out. The 8-bit result is `{fa7..fa0}` reading high-to-low.
176
+
177
+ This is also how `safetensors2verilog`'s threshold-logic frontend exposes the ports of any extracted subcircuit. See the project's testbench at `tests/threshold_alu/run.py` for a worked end-to-end example, or use `python -m safetensors2verilog ... --inspect` to print the port contract for any extracted circuit.
178
+
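The scope rules and the `0x05 + 0x03` worked example above can be checked with a few lines of plain Python. This is a minimal sketch, not code from the repository: `to_lsb_bits`, `to_msb_bits`, and `ripple_add8` are hypothetical helper names, and the adder uses ordinary integer arithmetic per stage rather than the threshold neurons stored in the safetensors files.

```python
from typing import List, Tuple


def to_lsb_bits(value: int, width: int = 8) -> List[int]:
    """Subcircuit-port order: index 0 is the least-significant bit."""
    return [(value >> i) & 1 for i in range(width)]


def to_msb_bits(value: int, width: int = 8) -> List[int]:
    """State-tensor order: index 0 is the most-significant bit."""
    return to_lsb_bits(value, width)[::-1]


def ripple_add8(a_bits: List[int], b_bits: List[int]) -> Tuple[List[int], int]:
    """Add two LSB-indexed operands; loop stage i plays the role of fa<i>."""
    carry = 0
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry
        sum_bits.append(total & 1)   # sum output of this full adder
        carry = total >> 1           # carry into the next stage
    return sum_bits, carry


a_bits = to_lsb_bits(0x05)  # [1, 0, 1, 0, 0, 0, 0, 0]
b_bits = to_lsb_bits(0x03)  # [1, 1, 0, 0, 0, 0, 0, 0]
sum_bits, carry_out = ripple_add8(a_bits, b_bits)

# Reassemble the LSB-indexed sum bits into an integer.
result = sum(bit << i for i, bit in enumerate(sum_bits))
print(hex(result), carry_out)  # 0x8 0
```

The same value written into a state-tensor register slot would use `to_msb_bits`, which is simply the reversed list.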
179
  ### Instruction encoding (16-bit, MSB-first)
180
 
181
  ```
docs/ISA.md ADDED
@@ -0,0 +1,135 @@
1
+ # ISA reference card — 8-bit threshold-logic CPU
2
+
3
+ This is the architecture exposed by the safetensors files. Every instruction below is *implemented entirely as threshold neurons*; the same gate-level circuits run whether you simulate in Python (`eval.py` / `play.py` / `test_cpu.py`) or compile the CPU's threshold network through `safetensors2verilog` to FPGA-synthesizable Verilog.
4
+
5
+ ## Architectural state
6
+
7
+ | Field | Width | Notes |
8
+ |---|---|---|
9
+ | PC | N bits | program counter; N = address width (0–16) |
10
+ | IR | 16 bits | instruction register |
11
+ | R0–R3 | 8 bits each | general-purpose registers |
12
+ | FLAGS | 4 bits | Z, N, C, V |
13
+ | SP | N bits | stack pointer (CALL/RET) |
14
+ | CTRL | 4 bits | HALT, MEM_WE, MEM_RE, RESERVED |
15
+ | MEM | 2^N × 8 bits | byte-addressable memory |
16
+
17
+ State tensor layout (MSB-first within each multi-bit field):
18
+
19
+ ```
20
+ [ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
21
+ ```
22
+
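To make the layout line above concrete, the sketch below computes the bit offset of each field for a given address width. It assumes the fields are packed contiguously, in exactly the order shown, with no padding; `state_layout` is a hypothetical helper, not a function from `build.py`.

```python
from typing import Dict, Tuple


def state_layout(addr_bits: int) -> Dict[str, Tuple[int, int]]:
    """Return {field: (offset, width)} for the flat state tensor.

    Assumes contiguous packing in the documented field order.
    """
    fields = [
        ("PC", addr_bits),
        ("IR", 16),
        ("R0", 8), ("R1", 8), ("R2", 8), ("R3", 8),
        ("FLAGS", 4),
        ("SP", addr_bits),
        ("CTRL", 4),
        ("MEM", (1 << addr_bits) * 8),  # 2^N bytes, 8 bits each
    ]
    layout: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name, width in fields:
        layout[name] = (offset, width)
        offset += width
    return layout


layout = state_layout(addr_bits=4)
total_bits = sum(width for _, width in layout.values())
print(layout["R0"], layout["MEM"], total_bits)  # (20, 8) (64, 128) 192
```

Within each field the bits are MSB-first, so under these assumptions `R0[0]` (bit 7 of R0) would sit at offset 20 for `N = 4`.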
23
+ ## Instruction encoding
24
+
25
+ ```
26
+ 15..12 11..10 9..8 7..0
27
+ opcode rd rs imm8
28
+ ```
29
+
30
+ | Class | Use of fields |
31
+ |---|---|
32
+ | **R-type** | `rd = rd op rs` — `imm8` ignored |
33
+ | **I-type** | `rd = op rd, imm8` — `rs` ignored |
34
+ | **Address-extended** | next 16-bit word is the absolute address (big-endian); `imm8` reserved. Applies to `LOAD`, `STORE`, `JMP`, `Jcc`, `CALL`. |
35
+
36
+ Address-extended instructions occupy **4 bytes** (instruction word + address word). A conditional jump that is not taken still skips the address word, so the PC advances by 4 whenever execution falls through.
37
+
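The field split above (`opcode` in bits 15..12, `rd` in 11..10, `rs` in 9..8, `imm8` in 7..0) maps directly onto shifts and masks. The sketch below is illustrative only; `encode` and `decode` are hypothetical names and are not the `Asm` class from `cpu_programs.py`.

```python
from typing import Dict


def encode(opcode: int, rd: int = 0, rs: int = 0, imm8: int = 0) -> int:
    """Pack the four fields into one 16-bit instruction word."""
    if not (0 <= opcode < 16 and 0 <= rd < 4 and 0 <= rs < 4 and 0 <= imm8 < 256):
        raise ValueError("field out of range")
    return (opcode << 12) | (rd << 10) | (rs << 8) | imm8


def decode(word: int) -> Dict[str, int]:
    """Split a 16-bit instruction word back into its fields."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 10) & 0x3,
        "rs": (word >> 8) & 0x3,
        "imm8": word & 0xFF,
    }


add_word = encode(0x0, rd=1, rs=2)   # ADD R1, R2
jnz_word = encode(0xD, imm8=1)       # Jcc with condition 1 (JNZ)
print(f"{add_word:#06x} {jnz_word:#06x}")  # 0x0600 0xd001
print(decode(add_word))  # {'opcode': 0, 'rd': 1, 'rs': 2, 'imm8': 0}
```

For an address-extended instruction, the word produced here would be followed in memory by the separate 16-bit address word.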
38
+ ## Opcode table
39
+
40
+ | Opcode | Mnemonic | Class | Operation |
41
+ |---|---|---|---|
42
+ | 0x0 | ADD | R | R[rd] = R[rd] + R[rs] |
43
+ | 0x1 | SUB | R | R[rd] = R[rd] - R[rs] |
44
+ | 0x2 | AND | R | R[rd] = R[rd] & R[rs] |
45
+ | 0x3 | OR | R | R[rd] = R[rd] \| R[rs] |
46
+ | 0x4 | XOR | R | R[rd] = R[rd] ^ R[rs] |
47
+ | 0x5 | SHL | R | R[rd] = R[rd] << 1 |
48
+ | 0x6 | SHR | R | R[rd] = R[rd] >> 1 |
49
+ | 0x7 | MUL | R | R[rd] = R[rd] * R[rs] (low 8 bits) |
50
+ | 0x8 | DIV | R | R[rd] = R[rd] / R[rs] |
51
+ | 0x9 | CMP | R | flags = R[rd] - R[rs] (no writeback) |
52
+ | 0xA | LOAD | A | R[rd] = M[addr] |
53
+ | 0xB | STORE | A | M[addr] = R[rs] |
54
+ | 0xC | JMP | A | PC = addr |
55
+ | 0xD | Jcc | A | PC = addr if cond. imm8[2:0] selects condition |
56
+ | 0xE | CALL | A | push PC; PC = addr |
57
+ | 0xF | HALT | – | stop execution |
58
+
59
+ ### Conditional-jump conditions (encoded in imm8[2:0] of the Jcc opcode)
60
+
61
+ | imm8[2:0] | Mnemonic | Fires when |
62
+ |---|---|---|
63
+ | 0 | JZ | Z flag set (last result was zero) |
64
+ | 1 | JNZ | Z flag clear |
65
+ | 2 | JC | carry-out set (last add overflowed unsigned) |
66
+ | 3 | JNC | carry-out clear |
67
+ | 4 | JN | result was negative (sign bit set) |
68
+ | 5 | JP | result was non-negative (sign bit clear) |
69
+ | 6 | JV | signed-overflow flag set |
70
+ | 7 | JNV | signed-overflow flag clear |
71
+
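The condition table has a regular structure: bits 2..1 of the selector pick a flag (Z, C, N, V) and bit 0 chooses between "set" and "clear". The sketch below models that decision and the resulting PC update in ordinary Python. It is an assumption-laden illustration of the table, not the threshold circuit itself, and `jcc_taken` and `next_pc` are hypothetical names.

```python
def jcc_taken(cond: int, z: int, n: int, c: int, v: int) -> bool:
    """Return True if the Jcc with selector imm8[2:0] == cond fires."""
    # Selector pairs in table order: (JZ, JNZ), (JC, JNC), (JN, JP), (JV, JNV).
    flag = (z, c, n, v)[(cond >> 1) & 0x3]
    want_set = (cond & 1) == 0   # even selectors fire when the flag is set
    return bool(flag) == want_set


def next_pc(pc: int, addr: int, taken: bool) -> int:
    """A taken jump loads addr; a fall-through skips the 4-byte instruction."""
    return addr if taken else pc + 4


# After a result of zero: Z=1, N=0, C=0, V=0.
print(jcc_taken(0, z=1, n=0, c=0, v=0))  # True  (JZ)
print(jcc_taken(1, z=1, n=0, c=0, v=0))  # False (JNZ)
print(next_pc(8, 0x20, taken=False))     # 12
```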
72
+ ## Worked example: write your own program
73
+
74
+ The Python assembler in `cpu_programs.py` exposes one-method-per-mnemonic helpers on a tiny `Asm` class. Here's "store the value 7 to address 0x10, then halt":
75
+
76
+ ```python
77
+ from cpu_programs import Asm
78
+
79
+ a = Asm(size=64) # 64 bytes of memory
80
+ a.org(0)
81
+ # Set R0 to 7. There is no LDI, so the constant lives in memory and is
82
+ # fetched with LOAD (the XOR below only zeroes R0 first).
83
+ a.org(32)
84
+ a.label("seven"); a.db(7) # memory byte at addr 32 holds the constant 7
85
+
86
+ a.org(0)
87
+ a.xor_(0, 0) # R0 = 0
88
+ a.load(0, "seven") # R0 = M[seven] = 7
89
+ a.store(0, "dest") # M[dest] = R0
90
+ a.halt()
91
+
92
+ a.org(0x10); a.label("dest"); a.db(0) # destination cell at 0x10
93
+
94
+ bytes_ = a.assemble()
95
+ ```
96
+
97
+ Then drop the assembled bytes into the CPU's initial memory and let the threshold-network forward pass run.
98
+
99
+ ## Using the CPU as a threshold-network forward pass
100
+
101
+ The CPU is a single tensor program. State in, state out. The driver:
102
+
103
+ 1. Builds an initial state tensor with the program loaded at `MEM[0..]`.
104
+ 2. Calls the safetensors-derived threshold network, which internally loops one fetch–decode–execute cycle and re-feeds the state.
105
+ 3. After at most `max_cycles` cycles (or earlier if the HALT control bit fires), reads the final memory contents.
106
+
107
+ Concretely, this is what `test_cpu.py` and `play.py` already do; both serve as runnable tutorials. The minimal driver loop is:
108
+
109
+ ```python
110
+ from build import ThresholdComputer
111
+ from safetensors.torch import load_file
112
+
113
+ tensors = load_file("variants/neural_computer8_small.safetensors")
114
+ cpu = ThresholdComputer(tensors, data_bits=8)
115
+ state = cpu.initial_state(memory=bytes_)
116
+ state = cpu.run(state, max_cycles=200)
117
+ result = cpu.read_memory(state, addr=0x10)
118
+ print(result) # 7
119
+ ```
120
+
121
+ ## Common pitfalls
122
+
123
+ - **No load-immediate.** `LOAD` reads from memory; there is no LDI / MOV-imm instruction. To put a constant in a register, place it in memory and `LOAD` it.
124
+ - **Address-extended instructions are 4 bytes wide.** Branch targets must point at the start of an instruction word, not into the middle of one.
125
+ - **`MUL` keeps only the low 8 bits.** Detect overflow via `CMP` against expected truncation.
126
+ - **`CMP` writes only flags**, never the destination register. It is normally followed by a `Jcc` that consumes those flags.
127
+ - **`SHL` and `SHR` shift by 1.** No variable-amount shifter; chain them or compose with bit operations.
128
+
129
+ ## Threshold-network artefacts you'll want next
130
+
131
+ - `python eval_all.py variants/<file>.safetensors` — gate-level fitness suite (5,900–7,800 tests per variant covering Boolean, arithmetic, ALU, control, modular, error-detection, threshold, and IEEE 754 float circuits).
132
+ - `python eval_all.py --cpu-program variants/<file>.safetensors` — assembled program through the threshold-gated CPU.
133
+ - `python -m safetensors2verilog <file>.safetensors --frontend threshold_logic --circuit arithmetic.ripplecarry8bit -o rc8.v` — extract one circuit, dependency-closed, into synthesizable Verilog.
134
+ - `python -m safetensors2verilog ... --inspect` — print the port contract for any extracted circuit (which pins exist, what widths).
135
+ - `python -m safetensors2verilog ... --equiv-check` — automatically build a Python-vs-iverilog cross-check testbench for the extracted circuit.
docs/float-pipeline.md ADDED
@@ -0,0 +1,39 @@
1
+ # IEEE-754 add: composing the stages
2
+
3
+ The threshold network ships separate `float16.unpack`, `float16.add`, `float16.normalize`, and `float16.pack` (and the `float32.*` siblings). Each is an independent threshold-logic subcircuit with its own external ports.
4
+
5
+ ## What you actually get from each stage
6
+
7
+ | Stage | Inputs | Outputs |
8
+ |---|---|---|
9
+ | `float16.unpack` | 1 generic input bit | 16 `bit0..bit15` outputs (sign / exp / mantissa fields) |
10
+ | `float16.add` | 5 `exp_a` + 5 `exp_b` + 5 generic input bits = 15 inputs | 499 outputs covering align, exp_diff, mant_add, mant_sub, mant_select, sign_xor stages |
11
+ | `float16.normalize` | 2 generic input bits | 356 outputs covering exp_adj and mantissa-shift stages |
12
+ | `float16.pack` | 1 generic input bit | 16 `bit0..bit15` outputs (assembled IEEE-754 word) |
13
+
14
+ Use `python -m safetensors2verilog ... --circuit float16.<stage> --inspect` to print the live contract for any stage from the variant you're using.
15
+
16
+ ## What's missing for a single composed block
17
+
18
+ None of the four stages exposes ports named in a way that says "I produce the operand the next stage consumes." `float16.unpack` outputs `float16.unpack.bit0..15`; `float16.add` consumes `$float16_exp_a[0..4]` and an opaque `$input[0..4]`. The wire-up between `unpack` and `add` is not encoded in the safetensors metadata. It exists only in the original construction code at `build.py` and in the gate-fitness harness at `eval.py`.
19
+
20
+ Two consequences:
21
+
22
+ 1. **Subcircuit extraction works**, but each stage compiles to a standalone module whose bit-bag inputs and outputs need a hand-written wrapper to chain.
23
+ 2. **A single `float16.add_full` Verilog block** that runs the complete IEEE-754 pipeline for a 16-bit operand pair is not derivable from the published safetensors files alone.
24
+
25
+ ## The intended path forward
26
+
27
+ The right fix is in `build.py`: when generating each float stage, register the inter-stage signal IDs so that, e.g., `float16.unpack.bit15` (sign of `a`) and `float16.add.sign_a` are the same signal in the global registry, with metadata that exposes the per-stage contract by *role* (`sign_a`, `exp_a`, `mant_a`) rather than by ad-hoc port name. The schema-versioned metadata from `safetensors2verilog`'s frontend (see `core.SIGNAL_REGISTRY_SCHEMA_VERSION_LATEST`) is the place to land this.
28
+
29
+ Once that is in place, `safetensors2verilog --circuit float16.add_full` would emit a single composed top-level module by walking the now-explicit cross-stage wiring. The dependency-closure extractor already does the heavy lifting; it currently produces correct output for any circuit whose internal wiring is already explicit (e.g. `arithmetic.ripplecarry8bit`).
30
+
31
+ ## Workaround today
32
+
33
+ For end-to-end float16-add evaluation, run the existing gate-level fitness suite, which exercises the composed pipeline through the Python evaluator:
34
+
35
+ ```bash
36
+ python eval_all.py variants/neural_alu8.safetensors --debug 2>&1 | grep -A 20 "FLOAT16 ADD"
37
+ ```
38
+
39
+ For Verilog-side experimentation, pick one stage at a time and feed it driven testbench inputs that match the role each port plays in the original pipeline (read off `build.py`'s `infer_*` routines for that stage).
eval_all.py CHANGED
@@ -30,7 +30,7 @@ import os
30
  import sys
31
  import time
32
  from pathlib import Path
33
- from typing import Dict, List, Optional, Tuple
34
 
35
  import torch
36
  from safetensors import safe_open
@@ -443,6 +443,50 @@ def builtin_program(addr_bits: int) -> Tuple[List[int], int]:
443
  # Eval driver
444
  # ---------------------------------------------------------------------------
445
 
446
  def list_safetensors(path: Path) -> List[Path]:
447
  if path.is_file():
448
  return [path]
@@ -594,6 +638,11 @@ def main() -> int:
594
  help="Also run a small assembled program through the threshold CPU "
595
  "(only applies to 8-bit variants with >= 512 B memory)")
596
  parser.add_argument("--json", action="store_true", help="Emit JSON results to stdout instead of a table")
597
  args = parser.parse_args()
598
 
599
  files = list_safetensors(Path(args.path))
@@ -602,12 +651,35 @@ def main() -> int:
602
  return 2
603
 
604
  print(f"Evaluating {len(files)} file(s) on {args.device}\n")
605
  results = []
606
  fail_count = 0
607
  for f in files:
608
  print(f"=== {f.name}")
609
- r = evaluate_one(f, device=args.device, pop_size=args.pop_size,
610
- debug=args.debug, run_cpu_program=args.cpu_program)
611
  results.append(r)
612
  print_row(r, show_cpu=args.cpu_program)
613
  if r.get("status") != "PASS":
@@ -634,6 +706,8 @@ def main() -> int:
634
  print(f"ALL {len(files)} variants PASS")
635
  else:
636
  print(f"{fail_count}/{len(files)} variants FAIL")
637
  return fail_count
638
 
639
 
 
30
  import sys
31
  import time
32
  from pathlib import Path
33
+ from typing import Any, Dict, List, Optional, Tuple
34
 
35
  import torch
36
  from safetensors import safe_open
 
443
  # Eval driver
444
  # ---------------------------------------------------------------------------
445
 
446
+ def _file_fingerprint(path: Path) -> str:
447
+ """Stable cache key for a safetensors file: sha256 of its content.
448
+
449
+ Keys are content-addressed, so renaming a file doesn't invalidate the cache;
450
+ an mtime-based key would instead change on every fresh clone. The sha256 of a
451
+ 30 MB safetensors finishes in tens of milliseconds — small compared to
452
+ a 5,900-test fitness run.
453
+ """
454
+ import hashlib
455
+ h = hashlib.sha256()
456
+ with open(path, "rb") as f:
457
+ for chunk in iter(lambda: f.read(1 << 20), b""):
458
+ h.update(chunk)
459
+ return h.hexdigest()
460
+
461
+
462
+ def _cache_key(path: Path, opts: Dict[str, Any]) -> str:
463
+ """Cache key combining file content with the relevant evaluation options."""
464
+ fp = _file_fingerprint(path)
465
+ opt_str = json.dumps(opts, sort_keys=True)
466
+ import hashlib
467
+ suffix = hashlib.sha256(opt_str.encode("utf-8")).hexdigest()[:8]
468
+ return f"{fp}_{suffix}"
469
+
470
+
471
+ def _load_cache(cache_dir: Path, key: str) -> Optional[Dict[str, Any]]:
472
+ p = cache_dir / f"{key}.json"
473
+ if not p.exists():
474
+ return None
475
+ try:
476
+ return json.loads(p.read_text(encoding="utf-8"))
477
+ except (json.JSONDecodeError, OSError):
478
+ return None
479
+
480
+
481
+ def _save_cache(cache_dir: Path, key: str, payload: Dict[str, Any]) -> None:
482
+ cache_dir.mkdir(parents=True, exist_ok=True)
483
+ p = cache_dir / f"{key}.json"
484
+ try:
485
+ p.write_text(json.dumps(payload, indent=2, default=str), encoding="utf-8")
486
+ except OSError:
487
+ pass
488
+
489
+
490
  def list_safetensors(path: Path) -> List[Path]:
491
  if path.is_file():
492
  return [path]
 
638
  help="Also run a small assembled program through the threshold CPU "
639
  "(only applies to 8-bit variants with >= 512 B memory)")
640
  parser.add_argument("--json", action="store_true", help="Emit JSON results to stdout instead of a table")
641
+ parser.add_argument("--cache-dir", default=".eval_cache",
642
+ help="Directory for hash-keyed result cache "
643
+ "(default: ./.eval_cache). Set to '' to disable.")
644
+ parser.add_argument("--no-cache", action="store_true",
645
+ help="Disable the result cache for this run.")
646
  args = parser.parse_args()
647
 
648
  files = list_safetensors(Path(args.path))
 
651
  return 2
652
 
653
  print(f"Evaluating {len(files)} file(s) on {args.device}\n")
654
+ cache_enabled = bool(args.cache_dir) and not args.no_cache
655
+ cache_dir = Path(args.cache_dir) if cache_enabled else None
656
+ cache_opts = {
657
+ "device": args.device,
658
+ "pop_size": args.pop_size,
659
+ "cpu_program": bool(args.cpu_program),
660
+ }
661
+ cache_hits = 0
662
  results = []
663
  fail_count = 0
664
  for f in files:
665
  print(f"=== {f.name}")
666
+ cached = None
667
+ key = None
668
+ if cache_enabled:
669
+ try:
670
+ key = _cache_key(f, cache_opts)
671
+ cached = _load_cache(cache_dir, key)
672
+ except OSError:
673
+ cached = None
674
+ if cached is not None:
675
+ r = cached
676
+ cache_hits += 1
677
+ print(" (cache hit)")
678
+ else:
679
+ r = evaluate_one(f, device=args.device, pop_size=args.pop_size,
680
+ debug=args.debug, run_cpu_program=args.cpu_program)
681
+ if cache_enabled and key is not None:
682
+ _save_cache(cache_dir, key, r)
683
  results.append(r)
684
  print_row(r, show_cpu=args.cpu_program)
685
  if r.get("status") != "PASS":
 
706
  print(f"ALL {len(files)} variants PASS")
707
  else:
708
  print(f"{fail_count}/{len(files)} variants FAIL")
709
+ if cache_enabled:
710
+ print(f"(cache: {cache_hits}/{len(files)} hits, dir={cache_dir})")
711
  return fail_count
712
 
713
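The cache key built by `_cache_key` in this commit combines a sha256 of the file content with a short hash of the evaluation options. The standalone sketch below reproduces that idea to show the two properties it is meant to give: a rename keeps the key, and a changed option produces a new one. `content_key` is a hypothetical stand-in written for this illustration, and the temporary file holds placeholder bytes rather than real tensor data.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def content_key(path: Path, opts: Dict[str, Any]) -> str:
    """sha256 of the file content plus an 8-hex-digit hash of the options."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    opt_str = json.dumps(opts, sort_keys=True)  # stable ordering of options
    suffix = hashlib.sha256(opt_str.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{suffix}"


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "model_a.safetensors"
    original.write_bytes(b"stand-in for real tensor data")
    opts = {"device": "cpu", "pop_size": 1, "cpu_program": False}
    key_before = content_key(original, opts)

    # Renaming the file leaves its content, and therefore its key, unchanged.
    renamed = Path(tmp) / "model_b.safetensors"
    original.rename(renamed)
    key_after_rename = content_key(renamed, opts)

    # Changing an evaluation option changes only the suffix.
    key_other_opts = content_key(renamed, {**opts, "cpu_program": True})

print(key_before == key_after_rename)  # True
print(key_before == key_other_opts)    # False
```

Unlike the sketch, `_file_fingerprint` reads the file in 1 MiB chunks, which avoids holding a large safetensors file in memory while hashing.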