# Pi Spigot Benchmark Baseline
Date: 2026-04-03
Command:
`./gradlew :lynglib:jvmTest -Pbenchmarks=true --tests 'PiSpigotBenchmarkTest' --rerun-tasks`
Results for `n=200`:
- legacy-real-division: 1108 ms (3 iters, avg 369.33 ms)
- optimized-int-division-rval-off: 756 ms (3 iters, avg 252.00 ms)
- optimized-int-division-rval-on: 674 ms (3 iters, avg 224.67 ms)
Derived speedups:
- intDivSpeedup (legacy avg / rval-off avg): 1.47x
- rvalSpeedup (rval-off avg / rval-on avg): 1.12x
- total (legacy avg / rval-on avg): 1.64x
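
The derived ratios are plain quotients of the per-iteration averages. A small Python sketch (the helper name `speedups` is illustrative, not part of the test suite) reproduces the numbers above:

```python
def speedups(legacy_avg_ms, rval_off_avg_ms, rval_on_avg_ms):
    """Derive the three speedup ratios used throughout these notes."""
    return {
        # int division vs. legacy real division, fast path off
        "intDivSpeedup": legacy_avg_ms / rval_off_avg_ms,
        # additional gain from enabling RVAL_FASTPATH
        "rvalSpeedup": rval_off_avg_ms / rval_on_avg_ms,
        # combined effect vs. the legacy baseline
        "total": legacy_avg_ms / rval_on_avg_ms,
    }

# Baseline run for n=200 (averages in ms):
s = speedups(369.33, 252.00, 224.67)
print({k: round(v, 2) for k, v in s.items()})
# → {'intDivSpeedup': 1.47, 'rvalSpeedup': 1.12, 'total': 1.64}
```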
Notes:
- Bytecode still shows generic range iteration (`MAKE_RANGE`, `CALL_MEMBER_SLOT`, `ITER_PUSH`) for loop constructs in the legacy benchmark case.
- This baseline is captured before enabling counted-loop lowering for dynamic inline int ranges.
Optimization #1 follow-up:
- Attempt: broaden compiler loop lowering for dynamic int ranges and validate with `PiSpigotBenchmarkTest` bytecode dumps.
- Final result: success after switching loop-bound coercion to a runtime-checked int path for stable slots with missing metadata.
- Latest measured run after the working compiler change:
- legacy-real-division: 783 ms (3 iters, avg 261.00 ms)
- optimized-int-division-rval-off: 729 ms (3 iters, avg 243.00 ms)
- optimized-int-division-rval-on: 593 ms (3 iters, avg 197.67 ms)
- Hot-op counts for optimized bytecode now show the generic range iterator path is gone from the main loops:
- `makeRange=0`
- `callMemberSlot=2`
- `iterPush=0`
- `getIndex=4`
- `setIndex=4`
- The remaining member calls are non-loop overhead; the main improvement came from lowering `for` ranges to counted int loops.
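
The Lyng bytecode itself is not reproduced here; as a language-neutral illustration (Python, with hypothetical function names), the difference between the generic range-iterator path and a counted int loop is roughly:

```python
def sum_via_iterator_protocol(n):
    # Generic path: materialize a range object and drive it through the
    # iterator protocol -- one next() call per element, analogous to
    # MAKE_RANGE / ITER_PUSH plus a member call every iteration.
    it = iter(range(n))          # MAKE_RANGE + ITER_PUSH analogue
    total = 0
    while True:
        try:
            i = next(it)         # per-iteration member-call analogue
        except StopIteration:
            break
        total += i
    return total

def sum_via_counted_loop(n):
    # Lowered path: a plain counted int loop with no iterator object
    # and no per-iteration calls, analogous to the counted-loop ops.
    total = 0
    i = 0
    while i < n:
        total += i
        i += 1
    return total

assert sum_via_iterator_protocol(1000) == sum_via_counted_loop(1000) == 499500
```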
Optimization #2 follow-up:
- Attempt: coerce stable integer operands into `INT` arithmetic during binary-op lowering so hot expressions stop falling back to `OBJ` math.
- Latest measured run after the arithmetic change:
- legacy-real-division: 593 ms (3 iters, avg 197.67 ms)
- optimized-int-division-rval-off: 542 ms (3 iters, avg 180.67 ms)
- optimized-int-division-rval-on: 516 ms (3 iters, avg 172.00 ms)
- Compiled-code impact in the optimized case:
- `boxes = n * 10 / 3` is now `UNBOX_INT_OBJ` + `MUL_INT` + `DIV_INT`
- `j = boxes - k` is now `SUB_INT`
- `denom = j * 2 + 1` is now `MUL_INT` + `ADD_INT`
- `carriedOver = quotient * j` is now `MUL_INT`
- Remaining hot object arithmetic is centered on list-backed reminder values and derived sums:
- `reminders[j] * 10`
- `reminders[j] + carriedOver`
- `sum / denom`, `sum % denom`, `sum / 10`
- Conclusion: loop lowering is fixed; the next likely win is preserving `List<Int>` element typing for `reminders` so indexed loads stay in int space.
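
The benchmark script itself is not reproduced in these notes; the quoted expressions come from a Rabinowitz–Wagon spigot. A minimal Python sketch using the same variable names (`boxes`, `reminders`, `denom`, `carriedOver`) shows where each hot expression sits:

```python
def pi_spigot(n):
    """Rabinowitz-Wagon spigot for n decimal digits of pi (sketch)."""
    boxes = n * 10 // 3 + 1          # hot: `boxes = n * 10 / 3`
    reminders = [2] * (boxes + 1)    # index 0 is the units slot
    pi, predigit, nines = [], 0, 0
    for _ in range(n):
        carriedOver = 0
        for k in range(boxes):
            j = boxes - k            # hot: `j = boxes - k`
            denom = j * 2 + 1        # hot: `denom = j * 2 + 1`
            reminders[j] *= 10       # hot: compound index assignment
            sum_ = reminders[j] + carriedOver
            reminders[j] = sum_ % denom
            quotient = sum_ // denom    # legacy case: floor(sum / (denom * 1.0))
            carriedOver = quotient * j  # hot: `carriedOver = quotient * j`
        sum_ = reminders[0] * 10 + carriedOver
        reminders[0] = sum_ % 10
        q = sum_ // 10               # hot: `q = sum / 10`
        if q == 9:                   # hold runs of 9s until resolved
            nines += 1
        elif q == 10:                # carry ripples into held digits
            pi.append(predigit + 1)
            pi.extend([0] * nines)
            predigit, nines = 0, 0
        else:
            pi.append(predigit)
            pi.extend([9] * nines)
            predigit, nines = q, 0
    pi.append(predigit)
    return pi[1:]                    # drop the leading placeholder 0
```

For example, `pi_spigot(10)` starts `[3, 1, 4, 1, 5, 9, ...]`.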
Optimization #3 follow-up:
- Attempt: teach numeric-kind inference that `IndexRef` can be `INT`/`REAL` when the receiver list has a known element class.
- Compiler change:
- `inferNumericKind()` now handles `IndexRef` and resolves the receiver slot or receiver-declared list element class before choosing `INT`/`REAL`.
- Latest measured run after the indexed-load inference change:
- legacy-real-division: 656 ms (3 iters, avg 218.67 ms)
- optimized-int-division-rval-off: 509 ms (3 iters, avg 169.67 ms)
- optimized-int-division-rval-on: 403 ms (3 iters, avg 134.33 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 1.29x
- rvalSpeedup: 1.26x
- total: 1.63x
- Compiled-code impact in the optimized case:
- `carriedOver = quotient * j` stays in `INT` space (`ASSERT_IS` + `UNBOX_INT_OBJ` + `MUL_INT`) instead of plain object multiply.
- Counted int loops remain intact (`MAKE_RANGE=0`, `ITER_PUSH=0`).
- Remaining bottlenecks in the optimized bytecode:
- `GET_INDEX reminders[j]` still feeds `MUL_OBJ` / `ADD_OBJ`
- `sum / denom`, `sum % denom`, and `sum / 10` still compile to object arithmetic
- `suffix += pi[i]` remains `ADD_OBJ`, which is expected because it is string/object concatenation
- Conclusion:
- The new inference produced a real VM-speed gain, especially with `RVAL_FASTPATH` enabled.
- The next compiler win is stronger propagation from `List<Int>` indexed loads into the produced temporary slot so `sum` can stay typed as `INT` across the inner loop.
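
`inferNumericKind()` itself is not shown in these notes; a toy Python model (all names illustrative, not the real compiler's) of the propagation described here — an indexed load inheriting `INT` from a known `List<Int>` element class — might look like:

```python
# Toy model of numeric-kind inference over a tiny expression tree.
INT, REAL, OBJ = "INT", "REAL", "OBJ"

class IntLit:
    def kind(self, env): return INT

class LocalRef:
    def __init__(self, name): self.name = name
    def kind(self, env): return env.get(self.name, OBJ)

class IndexRef:
    # reminders[j]: kind comes from the receiver's declared element class
    def __init__(self, receiver, element_kinds):
        self.receiver, self.element_kinds = receiver, element_kinds
    def kind(self, env):
        return self.element_kinds.get(self.receiver, OBJ)

class BinOp:
    def __init__(self, left, right): self.left, self.right = left, right
    def kind(self, env):
        lk, rk = self.left.kind(env), self.right.kind(env)
        if lk == rk == INT:
            return INT          # both sides proven INT -> stay in int space
        if lk in (INT, REAL) and rk in (INT, REAL):
            return REAL
        return OBJ              # anything unproven falls back to object math

env = {"carriedOver": INT, "j": INT}
element_kinds = {"reminders": INT}   # reminders is List<Int>

# sum = reminders[j] + carriedOver now infers INT instead of OBJ:
print(BinOp(IndexRef("reminders", element_kinds),
            LocalRef("carriedOver")).kind(env))   # → INT
# denom = j * 2 + 1 flavor also stays INT:
print(BinOp(LocalRef("j"), IntLit()).kind(env))   # → INT
```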
Optimization #4 follow-up:
- Attempt: preserve boxed-argument metadata through `compileCallArgs()` so `list.add(x)` retains `ObjInt` / `ObjReal` element typing.
- Compiler/runtime fixes:
- `compileCallArgs()` now routes arguments through `ensureObjSlot()` + `emitMove()` instead of raw `BOX_OBJ`, preserving `slotObjClass` and `stableObjSlots`.
- `CmdSetIndex` now reads `valueSlot` via `slotToObj()` so `SET_INDEX` can safely accept primitive slots.
- Fast local unbox ops (`CmdUnboxIntObjLocal`, `CmdUnboxRealObjLocal`) now handle already-primitive source slots directly instead of assuming a raw object payload.
- Plain assignment now coerces object-int RHS back into `INT` when the destination slot is currently compiled as `INT`, keeping loop-carried locals type-consistent.
- Latest measured run after the propagation + VM fixes:
- legacy-real-division: 438 ms (3 iters, avg 146.00 ms)
- optimized-int-division-rval-off: 238 ms (3 iters, avg 79.33 ms)
- optimized-int-division-rval-on: 201 ms (3 iters, avg 67.00 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 1.84x
- rvalSpeedup: 1.18x
- total: 2.18x
- Compiled-code impact in the optimized case:
- `sum = reminders[j] + carriedOver` is now `GET_INDEX` + `UNBOX_INT_OBJ` + `ADD_INT`
- `reminders[j] = sum % denom` is now `MOD_INT` + `SET_INDEX`
- `q = sum / 10` is now `DIV_INT`
- `carriedOver = quotient * j` is now `MUL_INT`
- Remaining hot object arithmetic in the optimized case:
- `reminders[j] *= 10` still compiles as `GET_INDEX` + `MUL_OBJ` + `SET_INDEX`
- `suffix += pi[i]` remains `ADD_OBJ`, which is expected string/object concatenation
- Conclusion:
- The main remaining arithmetic bottleneck is the compound index assignment path for `reminders[j] *= 10`.
- The next direct win is to specialize `AssignOpRef` on typed list elements so indexed compound assignment can lower to `UNBOX_INT_OBJ` + `MUL_INT` + boxed `SET_INDEX`.
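
Conceptually, the specialization proposed above desugars the compound assignment so the element is loaded once, operated on as a primitive int, and stored back. In Python terms (illustrative only; the op-name comments map to the lowering described above):

```python
def compound_mul_index(lst, j, factor):
    # reminders[j] *= factor, spelled out the way the lowering would emit it:
    tmp = lst[j]          # GET_INDEX: single element load
    tmp = int(tmp)        # UNBOX_INT_OBJ: element proven List<Int>
    tmp = tmp * factor    # MUL_INT: primitive int multiply
    lst[j] = tmp          # SET_INDEX: store the result back once
    return tmp

reminders = [2, 2, 2, 2]
compound_mul_index(reminders, 1, 10)
print(reminders)  # → [2, 20, 2, 2]
```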
Optimization #5 follow-up:
- Attempt: specialize typed `IndexRef` compound assignment so `List<Int>` element updates avoid object arithmetic.
- Compiler change:
- `compileAssignOp()` now detects non-optional typed `List<Int>` index targets and lowers arithmetic assign-ops through `UNBOX_INT_OBJ` + `*_INT` + `SET_INDEX`.
- Latest measured run after the indexed compound-assignment change:
- legacy-real-division: 394 ms (3 iters, avg 131.33 ms)
- optimized-int-division-rval-off: 216 ms (3 iters, avg 72.00 ms)
- optimized-int-division-rval-on: 184 ms (3 iters, avg 61.33 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 1.82x
- rvalSpeedup: 1.17x
- total: 2.14x
- Compiled-code impact in the optimized case:
- `reminders[j] *= 10` is now:
- `GET_INDEX`
- `UNBOX_INT_OBJ`
- `MUL_INT`
- `SET_INDEX`
- The optimized inner loop no longer contains object arithmetic for the `reminders` state update path.
- Remaining hot object work in the optimized case:
- `suffix += pi[i]` remains `ADD_OBJ` and is expected string/object concatenation
- The legacy benchmark case still carries real/object work because it intentionally keeps the original `floor(sum / (denom * 1.0))` path
- Conclusion:
- The inner arithmetic hot loop is now effectively int-lowered end-to-end in the optimized benchmark path.
- Further wins will likely require reducing list access overhead itself (`GET_INDEX` / `SET_INDEX`) or changing the source algorithm/data layout, not more basic arithmetic lowering.
Optimization #6 follow-up:
- Attempt: move the direct `ObjList` index fast path out from behind `RVAL_FASTPATH` so the common plain-list case is fast by default.
- Runtime change:
- `CmdGetIndex` and `CmdSetIndex` now always use direct `target.list[index]` / `target.list[index] = value` for exact `ObjList` receivers with `ObjInt` indices.
- Subclasses such as `ObjObservableList` still use their overridden `getAt` / `putAt` logic, so semantics stay intact.
- Latest measured run after the default plain-list path:
- legacy-real-division: 397 ms (3 iters, avg 132.33 ms)
- optimized-int-division-rval-off: 138 ms (3 iters, avg 46.00 ms)
- optimized-int-division-rval-on: 164 ms (3 iters, avg 54.67 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 2.88x
- rvalSpeedup: 0.84x
- total: 2.42x
- Interpretation:
- The stable fast baseline is now the `rval-off` case, because the direct plain-`ObjList` path no longer depends on `RVAL_FASTPATH`.
- `RVAL_FASTPATH` no longer improves this benchmark; the rval-on/rval-off gap in this run (0.84x) reflects remaining run-to-run variance rather than a real fast-path effect.
- Conclusion:
- For `piSpigot`, the main VM list-access bottleneck is addressed in the default runtime path.
- Further work on this benchmark should target algorithm/data-layout changes or string-result construction, not the old `RVAL_FASTPATH` gate.
Remaining optimization candidates:
- `suffix += pi[i]` still compiles as repeated `ADD_OBJ` string/object concatenation.
- Best next option: build the suffix through a dedicated buffer/list-join path instead of per-iteration concatenation.
- The benchmark still performs many `GET_INDEX` / `SET_INDEX` operations even after the direct plain-`ObjList` fast path.
- Best next option: reduce indexed access count at the source level or introduce a more specialized typed-list storage layout if this benchmark matters enough.
- The legacy benchmark variant intentionally keeps the real-number `floor(sum / (denom * 1.0))` path.
- No release optimization needed there; it remains only as a regression/control case.
- `RVAL_FASTPATH` is no longer a useful tuning knob for this workload after the plain-list VM fast path.
- Best next option: profile other workloads before changing or removing it globally.
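
For the first candidate above, the usual fix is to accumulate digit chunks and join once. A Python illustration of why per-iteration `+=` loses:

```python
def build_via_concat(digits):
    # Per-iteration concatenation: in general each += may copy the whole
    # suffix built so far, so total work grows quadratically.
    suffix = ""
    for d in digits:
        suffix += str(d)
    return suffix

def build_via_join(digits):
    # Buffer-and-join: collect the pieces, copy once at the end (linear).
    parts = [str(d) for d in digits]
    return "".join(parts)

digits = [3, 1, 4, 1, 5, 9]
assert build_via_concat(digits) == build_via_join(digits) == "314159"
```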
Release stabilization note:
- The broad assignment-side `INT` coercion and subclass-bypassing list fast path were rolled back/narrowed to restore correctness across numeric-mix, decimal, list, observable-list, and wasm tests.
- Full release gates now pass:
- `./gradlew test`
- `./gradlew :lynglib:wasmJsNodeTest`
- Current release-safe benchmark on the stabilized tree:
- legacy-real-division: 732 ms (3 iters, avg 244.00 ms)
- optimized-int-division-rval-off: 545 ms (3 iters, avg 181.67 ms)
- optimized-int-division-rval-on: 697 ms (3 iters, avg 232.33 ms)
- Interpretation:
- The release baseline is now `optimized-int-division-rval-off` at 545 ms for the current correct/stable tree.
- The removed coercion had been masking a real compiler typing gap; reintroducing it broadly is not release-safe.
- Highest-value remaining compiler optimization after release:
- Recover typed int lowering for `j = boxes - k`, `denom = j * 2 + 1`, `sum = reminders[j] + carriedOver`, and `carriedOver = quotient * j` using a narrower proof than the removed generic arithmetic coercion.