# Pi Spigot Benchmark Baseline

Date: 2026-04-03

Command:

`./gradlew :lynglib:jvmTest -Pbenchmarks=true --tests 'PiSpigotBenchmarkTest' --rerun-tasks`

Results for `n=200`:

- legacy-real-division: 1108 ms (3 iters, avg 369.33 ms)
- optimized-int-division-rval-off: 756 ms (3 iters, avg 252.00 ms)
- optimized-int-division-rval-on: 674 ms (3 iters, avg 224.67 ms)

Derived speedups:

- intDivSpeedup: 1.47x
- rvalSpeedup: 1.12x
- total: 1.64x
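
The derived figures are plain ratios of the per-iteration averages; a minimal sketch of the arithmetic (variable names here are illustrative, not from the benchmark harness):

```java
public class Speedups {
    public static void main(String[] args) {
        // Average iteration times from the n=200 baseline run, in ms.
        double legacy = 369.33;   // legacy-real-division
        double rvalOff = 252.00;  // optimized-int-division-rval-off
        double rvalOn = 224.67;   // optimized-int-division-rval-on

        // Each speedup is a ratio of averages; "total" compares the
        // legacy case against the fully optimized case.
        System.out.printf("intDivSpeedup: %.2fx%n", legacy / rvalOff); // 1.47x
        System.out.printf("rvalSpeedup: %.2fx%n", rvalOff / rvalOn);   // 1.12x
        System.out.printf("total: %.2fx%n", legacy / rvalOn);          // 1.64x
    }
}
```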

Notes:

- Bytecode still shows generic range iteration (`MAKE_RANGE`, `CALL_MEMBER_SLOT`, `ITER_PUSH`) for loop constructs in the legacy benchmark case.
- This baseline is captured before enabling counted-loop lowering for dynamic inline int ranges.

Optimization #1 follow-up:

- Attempt: broaden compiler loop lowering for dynamic int ranges and validate with `PiSpigotBenchmarkTest` bytecode dumps.
- Final result: success after switching loop-bound coercion to a runtime-checked int path for stable slots with missing metadata.
- Latest measured run after the working compiler change:
  - legacy-real-division: 783 ms (3 iters, avg 261.00 ms)
  - optimized-int-division-rval-off: 729 ms (3 iters, avg 243.00 ms)
  - optimized-int-division-rval-on: 593 ms (3 iters, avg 197.67 ms)
- Hot-op counts for the optimized bytecode now show the generic range-iterator path is gone from the main loops:
  - `makeRange=0`
  - `callMemberSlot=2`
  - `iterPush=0`
  - `getIndex=4`
  - `setIndex=4`
- The remaining member calls are non-loop overhead; the main improvement came from lowering `for` ranges to counted int loops.
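
The shape of this lowering can be illustrated outside the VM: the generic path drives a boxed iterator protocol (analogous to `MAKE_RANGE` + `ITER_PUSH` + a per-step `CALL_MEMBER_SLOT`), while the counted path is just an int counter with a bounds check. A sketch with invented names, not the compiler's actual output:

```java
import java.util.Iterator;

public class LoopLowering {
    // Generic path: a range object driven through an iterator protocol,
    // boxing every element and making a method call per step.
    static long sumGeneric(int n) {
        Iterable<Integer> range = () -> new Iterator<>() {
            int next = 0;
            public boolean hasNext() { return next < n; }
            public Integer next() { return next++; } // boxes each element
        };
        long sum = 0;
        for (int i : range) sum += i;
        return sum;
    }

    // Counted path: plain int counter and bounds check, no allocation,
    // no per-step member call.
    static long sumCounted(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumGeneric(10));  // 45
        System.out.println(sumCounted(10)); // 45
    }
}
```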

Optimization #2 follow-up:

- Attempt: coerce stable integer operands into `INT` arithmetic during binary-op lowering so hot expressions stop falling back to `OBJ` math.
- Latest measured run after the arithmetic change:
  - legacy-real-division: 593 ms (3 iters, avg 197.67 ms)
  - optimized-int-division-rval-off: 542 ms (3 iters, avg 180.67 ms)
  - optimized-int-division-rval-on: 516 ms (3 iters, avg 172.00 ms)
- Compiled-code impact in the optimized case:
  - `boxes = n * 10 / 3` is now `UNBOX_INT_OBJ` + `MUL_INT` + `DIV_INT`
  - `j = boxes - k` is now `SUB_INT`
  - `denom = j * 2 + 1` is now `MUL_INT` + `ADD_INT`
  - `carriedOver = quotient * j` is now `MUL_INT`
- Remaining hot object arithmetic is centered on list-backed `reminders` values and derived sums:
  - `reminders[j] * 10`
  - `reminders[j] + carriedOver`
  - `sum / denom`, `sum % denom`, `sum / 10`
- Conclusion: loop lowering is fixed; the next likely win is preserving `List<Int>` element typing for `reminders` so indexed loads stay in int space.
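
The `OBJ`-versus-`INT` distinction mirrors boxed versus primitive arithmetic on the JVM. A minimal stand-in (not the VM's actual opcode implementations) showing why the int path is cheaper:

```java
public class ObjVsIntMath {
    // OBJ-style math: operands arrive boxed, each op type-checks,
    // unboxes, computes, and reboxes the result (dispatch + allocation
    // on every operation).
    static Object mulObj(Object a, Object b) {
        if (a instanceof Integer x && b instanceof Integer y) return x * y;
        throw new IllegalArgumentException("unsupported operand types");
    }

    // INT-style math: operands already live in primitive int slots,
    // so the op is a single machine multiply with no dispatch or boxing.
    static int mulInt(int a, int b) {
        return a * b;
    }

    public static void main(String[] args) {
        // boxes = n * 10 / 3 for n = 200, as in the benchmark expression.
        int n = 200;
        System.out.println(mulObj(n, 10));     // boxed path
        System.out.println(mulInt(n, 10) / 3); // int path: 666
    }
}
```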

Optimization #3 follow-up:

- Attempt: teach numeric-kind inference that `IndexRef` can be `INT`/`REAL` when the receiver list has a known element class.
- Compiler change:
  - `inferNumericKind()` now handles `IndexRef` and resolves the receiver slot or receiver-declared list element class before choosing `INT`/`REAL`.
- Latest measured run after the indexed-load inference change:
  - legacy-real-division: 656 ms (3 iters, avg 218.67 ms)
  - optimized-int-division-rval-off: 509 ms (3 iters, avg 169.67 ms)
  - optimized-int-division-rval-on: 403 ms (3 iters, avg 134.33 ms)
- Derived speedups vs legacy in this run:
  - intDivSpeedup: 1.29x
  - rvalSpeedup: 1.26x
  - total: 1.63x
- Compiled-code impact in the optimized case:
  - `carriedOver = quotient * j` stays in `INT` space (`ASSERT_IS` + `UNBOX_INT_OBJ` + `MUL_INT`) instead of a plain object multiply.
  - Counted int loops remain intact (`MAKE_RANGE=0`, `ITER_PUSH=0`).
- Remaining bottlenecks in the optimized bytecode:
  - `GET_INDEX reminders[j]` still feeds `MUL_OBJ` / `ADD_OBJ`
  - `sum / denom`, `sum % denom`, and `sum / 10` still compile to object arithmetic
  - `suffix += pi[i]` remains `ADD_OBJ`, which is expected because it is string/object concatenation
- Conclusion:
  - The new inference produced a real VM-speed gain, especially with `RVAL_FASTPATH` enabled.
  - The next compiler win is stronger propagation from `List<Int>` indexed loads into the produced temporary slot so `sum` can stay typed as `INT` across the inner loop.
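
The inference rule itself is small; an abstract sketch (the enum and function names here are hypothetical stand-ins, not the compiler's real `inferNumericKind()` signature): an indexed load's numeric kind follows the declared element class of the receiver list when that class is known, and stays on the generic object path otherwise.

```java
import java.util.Optional;

public class IndexKindInference {
    enum NumericKind { INT, REAL, OBJ }

    // Hypothetical stand-in for numeric-kind inference on an IndexRef:
    // pick INT/REAL only when the receiver's declared element class is
    // known; otherwise fall back to generic object arithmetic.
    static NumericKind inferIndexKind(Optional<Class<?>> declaredElementClass) {
        return declaredElementClass.map(c -> {
            if (c == Integer.class || c == Long.class) return NumericKind.INT;
            if (c == Double.class) return NumericKind.REAL;
            return NumericKind.OBJ;
        }).orElse(NumericKind.OBJ);
    }

    public static void main(String[] args) {
        // reminders: List<Int> -> indexed loads may stay in int space.
        System.out.println(inferIndexKind(Optional.of(Integer.class))); // INT
        // untyped list -> object arithmetic remains.
        System.out.println(inferIndexKind(Optional.empty()));           // OBJ
    }
}
```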

Optimization #4 follow-up:

- Attempt: preserve boxed-argument metadata through `compileCallArgs()` so `list.add(x)` retains `ObjInt` / `ObjReal` element typing.
- Compiler/runtime fixes:
  - `compileCallArgs()` now routes arguments through `ensureObjSlot()` + `emitMove()` instead of raw `BOX_OBJ`, preserving `slotObjClass` and `stableObjSlots`.
  - `CmdSetIndex` now reads `valueSlot` via `slotToObj()` so `SET_INDEX` can safely accept primitive slots.
  - Fast local unbox ops (`CmdUnboxIntObjLocal`, `CmdUnboxRealObjLocal`) now handle already-primitive source slots directly instead of assuming a raw object payload.
  - Plain assignment now coerces an object-int RHS back into `INT` when the destination slot is currently compiled as `INT`, keeping loop-carried locals type-consistent.
- Latest measured run after the propagation + VM fixes:
  - legacy-real-division: 438 ms (3 iters, avg 146.00 ms)
  - optimized-int-division-rval-off: 238 ms (3 iters, avg 79.33 ms)
  - optimized-int-division-rval-on: 201 ms (3 iters, avg 67.00 ms)
- Derived speedups vs legacy in this run:
  - intDivSpeedup: 1.84x
  - rvalSpeedup: 1.18x
  - total: 2.18x
- Compiled-code impact in the optimized case:
  - `sum = reminders[j] + carriedOver` is now `GET_INDEX` + `UNBOX_INT_OBJ` + `ADD_INT`
  - `reminders[j] = sum % denom` is now `MOD_INT` + `SET_INDEX`
  - `q = sum / 10` is now `DIV_INT`
  - `carriedOver = quotient * j` is now `MUL_INT`
- Remaining hot object arithmetic in the optimized case:
  - `reminders[j] *= 10` still compiles as `GET_INDEX` + `MUL_OBJ` + `SET_INDEX`
  - `suffix += pi[i]` remains `ADD_OBJ`, which is expected string/object concatenation
- Conclusion:
  - The main remaining arithmetic bottleneck is the compound index-assignment path for `reminders[j] *= 10`.
  - The next direct win is to specialize `AssignOpRef` on typed list elements so indexed compound assignment can lower to `UNBOX_INT_OBJ` + `MUL_INT` + boxed `SET_INDEX`.

Optimization #5 follow-up:

- Attempt: specialize typed `IndexRef` compound assignment so `List<Int>` element updates avoid object arithmetic.
- Compiler change:
  - `compileAssignOp()` now detects non-optional typed `List<Int>` index targets and lowers arithmetic assign-ops through `UNBOX_INT_OBJ` + `*_INT` + `SET_INDEX`.
- Latest measured run after the indexed compound-assignment change:
  - legacy-real-division: 394 ms (3 iters, avg 131.33 ms)
  - optimized-int-division-rval-off: 216 ms (3 iters, avg 72.00 ms)
  - optimized-int-division-rval-on: 184 ms (3 iters, avg 61.33 ms)
- Derived speedups vs legacy in this run:
  - intDivSpeedup: 1.82x
  - rvalSpeedup: 1.17x
  - total: 2.14x
- Compiled-code impact in the optimized case:
  - `reminders[j] *= 10` is now `GET_INDEX` + `UNBOX_INT_OBJ` + `MUL_INT` + `SET_INDEX`
  - The optimized inner loop no longer contains object arithmetic for the `reminders` state-update path.
- Remaining hot object work in the optimized case:
  - `suffix += pi[i]` remains `ADD_OBJ` and is expected string/object concatenation
  - The legacy benchmark case still carries real/object work because it intentionally keeps the original `floor(sum / (denom * 1.0))` path
- Conclusion:
  - The inner arithmetic hot loop is now effectively int-lowered end-to-end in the optimized benchmark path.
  - Further wins will likely require reducing list-access overhead itself (`GET_INDEX` / `SET_INDEX`) or changing the source algorithm/data layout, not more basic arithmetic lowering.
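
The lowered sequence for `reminders[j] *= 10` corresponds to a read-modify-write against a typed list; a stand-alone sketch of the two shapes (the helper names are illustrative, not the compiler's):

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundIndexLowering {
    // Object path: read, boxed multiply with runtime type dispatch, write.
    static void mulAtObj(List<Object> list, int j, int factor) {
        Object v = list.get(j);            // GET_INDEX
        list.set(j, (Integer) v * factor); // MUL_OBJ + SET_INDEX
    }

    // Lowered path: read, unbox once, primitive multiply, write boxed.
    static void mulAtInt(List<Integer> list, int j, int factor) {
        int v = list.get(j);               // GET_INDEX + UNBOX_INT_OBJ
        list.set(j, v * factor);           // MUL_INT + boxed SET_INDEX
    }

    public static void main(String[] args) {
        List<Integer> reminders = new ArrayList<>(List.of(2, 3, 5));
        mulAtInt(reminders, 1, 10);        // reminders[1] *= 10
        System.out.println(reminders);     // [2, 30, 5]
    }
}
```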

Optimization #6 follow-up:

- Attempt: move the direct `ObjList` index fast path out from behind `RVAL_FASTPATH` so the common plain-list case is fast by default.
- Runtime change:
  - `CmdGetIndex` and `CmdSetIndex` now always use direct `target.list[index]` / `target.list[index] = value` access for exact `ObjList` receivers with `ObjInt` indices.
  - Subclasses such as `ObjObservableList` still use their overridden `getAt` / `putAt` logic, so semantics stay intact.
- Latest measured run after the default plain-list path:
  - legacy-real-division: 397 ms (3 iters, avg 132.33 ms)
  - optimized-int-division-rval-off: 138 ms (3 iters, avg 46.00 ms)
  - optimized-int-division-rval-on: 164 ms (3 iters, avg 54.67 ms)
- Derived speedups vs legacy in this run:
  - intDivSpeedup: 2.88x
  - rvalSpeedup: 0.84x
  - total: 2.42x
- Interpretation:
  - The stable fast baseline is now the `rval-off` case, because the direct plain-`ObjList` path no longer depends on `RVAL_FASTPATH`.
  - `RVAL_FASTPATH` no longer improves this benchmark; the 0.84x ratio only reflects remaining unrelated runtime variance.
- Conclusion:
  - For `piSpigot`, the main VM list-access bottleneck is addressed in the default runtime path.
  - Further work on this benchmark should target algorithm/data-layout changes or string-result construction, not the old `RVAL_FASTPATH` gate.
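
The "exact receiver class" guard is what keeps the subclass semantics intact. A simplified sketch of the idea (these `ObjList` / `ObjObservableList` stand-ins and the `getIndex` helper are invented for illustration; the real runtime classes are more involved):

```java
import java.util.ArrayList;
import java.util.List;

public class ListFastPath {
    // Simplified stand-ins for the runtime's list objects.
    static class ObjList {
        final List<Object> list = new ArrayList<>();
        Object getAt(int index) { return list.get(index); }
    }

    static class ObjObservableList extends ObjList {
        int reads = 0;
        @Override Object getAt(int index) { reads++; return super.getAt(index); }
    }

    // Sketch of the CmdGetIndex idea: take the direct backing-list path
    // only for the exact ObjList class, so subclasses keep their
    // overridden getAt semantics.
    static Object getIndex(ObjList target, int index) {
        if (target.getClass() == ObjList.class) {
            return target.list.get(index); // direct fast path
        }
        return target.getAt(index);        // virtual path for subclasses
    }

    public static void main(String[] args) {
        ObjObservableList observed = new ObjObservableList();
        observed.list.add(42);
        getIndex(observed, 0);              // routed through getAt
        System.out.println(observed.reads); // 1: the override still fires
    }
}
```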

Remaining optimization candidates:

- `suffix += pi[i]` still compiles as repeated `ADD_OBJ` string/object concatenation.
  - Best next option: build the suffix through a dedicated buffer/list-join path instead of per-iteration concatenation.
- The benchmark still performs many `GET_INDEX` / `SET_INDEX` operations even after the direct plain-`ObjList` fast path.
  - Best next option: reduce the indexed-access count at the source level, or introduce a more specialized typed-list storage layout if this benchmark matters enough.
- The legacy benchmark variant intentionally keeps the real-number `floor(sum / (denom * 1.0))` path.
  - No release optimization is needed there; it remains only as a regression/control case.
- `RVAL_FASTPATH` is no longer a useful tuning knob for this workload after the plain-list VM fast path.
  - Best next option: profile other workloads before changing or removing it globally.
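
The first candidate is worth a concrete picture: per-iteration `+=` copies the whole accumulated string on each `ADD_OBJ`-style concatenation, so building n digits costs O(n²) character copies, while the proposed buffer/join path is O(n). A sketch of the two shapes (method names are illustrative):

```java
import java.util.List;

public class SuffixBuilding {
    // Current shape: each += copies the accumulated string, so building
    // n digits costs O(n^2) character copies overall.
    static String concatPerIteration(List<Integer> digits) {
        String suffix = "";
        for (int d : digits) suffix += d; // one full copy per step
        return suffix;
    }

    // Proposed shape: accumulate into a buffer, materialize once: O(n).
    static String bufferJoin(List<Integer> digits) {
        StringBuilder suffix = new StringBuilder();
        for (int d : digits) suffix.append(d);
        return suffix.toString();
    }

    public static void main(String[] args) {
        List<Integer> digits = List.of(3, 1, 4, 1, 5);
        System.out.println(concatPerIteration(digits)); // 31415
        System.out.println(bufferJoin(digits));         // 31415
    }
}
```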

Release stabilization note:

- The broad assignment-side `INT` coercion and the subclass-bypassing list fast path were rolled back/narrowed to restore correctness across numeric-mix, decimal, list, observable-list, and wasm tests.
- Full release gates now pass:
  - `./gradlew test`
  - `./gradlew :lynglib:wasmJsNodeTest`
- Current release-safe benchmark on the stabilized tree:
  - legacy-real-division: 732 ms (3 iters, avg 244.00 ms)
  - optimized-int-division-rval-off: 545 ms (3 iters, avg 181.67 ms)
  - optimized-int-division-rval-on: 697 ms (3 iters, avg 232.33 ms)
- Interpretation:
  - The release baseline is now `optimized-int-division-rval-off` at 545 ms on the current correct/stable tree.
  - The removed coercion had been masking a real compiler typing gap; reintroducing it broadly is not release-safe.
- Highest-value remaining compiler optimization after release:
  - Recover typed int lowering for `j = boxes - k`, `denom = j * 2 + 1`, `sum = reminders[j] + carriedOver`, and `carriedOver = quotient * j` using a narrower proof than the removed generic arithmetic coercion.