lyng/notes/pi_spigot_benchmark_baseline_2026-04-03.md

Pi Spigot Benchmark Baseline

Date: 2026-04-03
Command: ./gradlew :lynglib:jvmTest -Pbenchmarks=true --tests 'PiSpigotBenchmarkTest' --rerun-tasks

Results for n=200:

  • legacy-real-division: 1108 ms (3 iters, avg 369.33 ms)
  • optimized-int-division-rval-off: 756 ms (3 iters, avg 252.00 ms)
  • optimized-int-division-rval-on: 674 ms (3 iters, avg 224.67 ms)

Derived speedups:

  • intDivSpeedup: 1.47x
  • rvalSpeedup: 1.12x
  • total: 1.64x
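
The three ratios follow directly from the totals above (intDivSpeedup = legacy / rval-off, rvalSpeedup = rval-off / rval-on, total = legacy / rval-on). A small sketch of that derivation; the mapping is my reading of the numbers, not code from the benchmark harness:

```python
def derived_speedups(legacy_ms, rval_off_ms, rval_on_ms):
    # Ratios of total wall-clock times, rounded the same way as the notes:
    # intDivSpeedup = legacy / rval-off, rvalSpeedup = rval-off / rval-on,
    # total = legacy / rval-on.
    return {
        "intDivSpeedup": round(legacy_ms / rval_off_ms, 2),
        "rvalSpeedup": round(rval_off_ms / rval_on_ms, 2),
        "total": round(legacy_ms / rval_on_ms, 2),
    }
```

derived_speedups(1108, 756, 674) reproduces the 1.47x / 1.12x / 1.64x figures above.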

Notes:

  • Bytecode still shows generic range iteration (MAKE_RANGE, CALL_MEMBER_SLOT, ITER_PUSH) for loop constructs in the legacy benchmark case.
  • This baseline is captured before enabling counted-loop lowering for dynamic inline int ranges.
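
For context, the kernel this benchmark exercises looks roughly like the following Python sketch. It assumes the classic Rabinowitz-Wagon spigot; the variable names (boxes, reminders, denom, carriedOver) mirror the expressions quoted in the follow-ups below, while the predigit bookkeeping is the textbook version, not necessarily the benchmark source:

```python
def pi_spigot(n):
    # Rabinowitz-Wagon spigot: reminders holds the mixed-radix state,
    # boxes = n * 10 / 3 terms give enough precision for n digits.
    boxes = n * 10 // 3
    reminders = [2] * boxes
    digits, predigit, nines = [], 0, 0
    for _ in range(n):
        carriedOver = 0
        for j in range(boxes - 1, 0, -1):
            denom = j * 2 + 1
            s = reminders[j] * 10 + carriedOver   # reminders[j] *= 10, plus carry
            reminders[j] = s % denom
            carriedOver = (s // denom) * j        # carriedOver = quotient * j
        s = reminders[0] * 10 + carriedOver
        reminders[0] = s % 10
        q = s // 10                               # candidate digit
        if q == 9:                                # hold 9s until resolved
            nines += 1
        elif q == 10:                             # carry into the held predigit
            digits.append(predigit + 1)
            digits.extend([0] * nines)
            predigit, nines = 0, 0
        else:
            digits.append(predigit)
            digits.extend([9] * nines)
            predigit, nines = q, 0
    digits.append(predigit)
    return digits[1:]  # drop the bootstrap 0 emitted on the first pass
```

pi_spigot(30) begins 3, 1, 4, 1, 5, 9, matching the expected digits of pi.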

Optimization #1 follow-up:

  • Attempt: broaden compiler loop lowering for dynamic int ranges and validate with PiSpigotBenchmarkTest bytecode dumps.
  • Final result: success after switching loop-bound coercion to a runtime-checked int path for stable slots with missing metadata.
  • Latest measured run after the working compiler change:
    • legacy-real-division: 783 ms (3 iters, avg 261.00 ms)
    • optimized-int-division-rval-off: 729 ms (3 iters, avg 243.00 ms)
    • optimized-int-division-rval-on: 593 ms (3 iters, avg 197.67 ms)
  • Hot-op counts for optimized bytecode now show the generic range iterator path is gone from the main loops:
    • makeRange=0
    • callMemberSlot=2
    • iterPush=0
    • getIndex=4
    • setIndex=4
  • The remaining member calls are non-loop overhead; the main improvement came from lowering for-range loops to counted int loops.
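
The hot-op counts above can be regenerated from a textual bytecode dump with a trivial counter. This is a hypothetical helper (the real dump format may differ); it assumes one instruction per line with the mnemonic first:

```python
from collections import Counter

# Mnemonics watched in these notes.
HOT_OPS = ("MAKE_RANGE", "CALL_MEMBER_SLOT", "ITER_PUSH", "GET_INDEX", "SET_INDEX")

def hot_op_counts(dump):
    # Tally the watched mnemonics in a bytecode dump, one instruction per line.
    seen = Counter(line.split()[0] for line in dump.splitlines() if line.strip())
    return {op: seen.get(op, 0) for op in HOT_OPS}
```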

Optimization #2 follow-up:

  • Attempt: coerce stable integer operands into INT arithmetic during binary-op lowering so hot expressions stop falling back to OBJ math.
  • Latest measured run after the arithmetic change:
    • legacy-real-division: 593 ms (3 iters, avg 197.67 ms)
    • optimized-int-division-rval-off: 542 ms (3 iters, avg 180.67 ms)
    • optimized-int-division-rval-on: 516 ms (3 iters, avg 172.00 ms)
  • Compiled-code impact in the optimized case:
    • boxes = n * 10 / 3 is now UNBOX_INT_OBJ + MUL_INT + DIV_INT
    • j = boxes - k is now SUB_INT
    • denom = j * 2 + 1 is now MUL_INT + ADD_INT
    • carriedOver = quotient * j is now MUL_INT
  • Remaining hot object arithmetic is centered on the list-backed reminders elements and the sums derived from them:
    • reminders[j] * 10
    • reminders[j] + carriedOver
    • sum / denom, sum % denom, sum / 10
  • Conclusion: loop lowering is fixed; the next likely win is preserving List<Int> element typing for reminders so indexed loads stay in int space.
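
The decision this change makes can be modeled as a kind-join over the two operand kinds. A minimal sketch; the opcode and kind names follow the notes, but the actual compiler API is certainly richer:

```python
def lower_binary_op(op, lhs_kind, rhs_kind):
    # Both operands provably INT: emit the typed int opcode.
    if lhs_kind == rhs_kind == "INT":
        return op + "_INT"
    # Numeric mix with no unproven operand: stay in REAL.
    if "OBJ" not in (lhs_kind, rhs_kind):
        return op + "_REAL"
    # Anything unproven falls back to generic object math.
    return op + "_OBJ"
```

Under this model, boxes = n * 10 / 3 lowers to MUL_INT + DIV_INT exactly when both operands are proven INT, which is what the bullet list above reports.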

Optimization #3 follow-up:

  • Attempt: teach numeric-kind inference that IndexRef can be INT/REAL when the receiver list has a known element class.
  • Compiler change:
    • inferNumericKind() now handles IndexRef and resolves the element class from the receiver slot or the receiver's declared list type before choosing INT/REAL.
  • Latest measured run after the indexed-load inference change:
    • legacy-real-division: 656 ms (3 iters, avg 218.67 ms)
    • optimized-int-division-rval-off: 509 ms (3 iters, avg 169.67 ms)
    • optimized-int-division-rval-on: 403 ms (3 iters, avg 134.33 ms)
  • Derived speedups vs legacy in this run:
    • intDivSpeedup: 1.29x
    • rvalSpeedup: 1.26x
    • total: 1.63x
  • Compiled-code impact in the optimized case:
    • carriedOver = quotient * j stays in INT space (ASSERT_IS + UNBOX_INT_OBJ + MUL_INT) instead of plain object multiply.
    • Counted int loops remain intact (MAKE_RANGE=0, ITER_PUSH=0).
  • Remaining bottlenecks in the optimized bytecode:
    • GET_INDEX reminders[j] still feeds MUL_OBJ / ADD_OBJ
    • sum / denom, sum % denom, and sum / 10 still compile to object arithmetic
    • suffix += pi[i] remains ADD_OBJ, which is expected because it is string/object concatenation
  • Conclusion:
    • The new inference produced a real VM-speed gain, especially with RVAL_FASTPATH enabled.
    • The next compiler win is stronger propagation from List<Int> indexed loads into the produced temporary slot so sum can stay typed as INT across the inner loop.
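
The IndexRef case added here amounts to a lookup from the receiver's declared element class to a numeric kind, with everything unknown staying OBJ. A sketch; names are illustrative, not the real inferNumericKind() signature:

```python
def infer_index_ref_kind(receiver, elem_class_by_slot):
    # Resolve the receiver slot's declared List element class, then map it
    # to a numeric kind the arithmetic lowering can use; unknown stays OBJ.
    elem = elem_class_by_slot.get(receiver)
    return {"Int": "INT", "Real": "REAL"}.get(elem, "OBJ")
```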

Optimization #4 follow-up:

  • Attempt: preserve boxed-argument metadata through compileCallArgs() so list.add(x) retains ObjInt / ObjReal element typing.
  • Compiler/runtime fixes:
    • compileCallArgs() now routes arguments through ensureObjSlot() + emitMove() instead of raw BOX_OBJ, preserving slotObjClass and stableObjSlots.
    • CmdSetIndex now reads valueSlot via slotToObj() so SET_INDEX can safely accept primitive slots.
    • Fast local unbox ops (CmdUnboxIntObjLocal, CmdUnboxRealObjLocal) now handle already-primitive source slots directly instead of assuming a raw object payload.
    • Plain assignment now coerces object-int RHS back into INT when the destination slot is currently compiled as INT, keeping loop-carried locals type-consistent.
  • Latest measured run after the propagation + VM fixes:
    • legacy-real-division: 438 ms (3 iters, avg 146.00 ms)
    • optimized-int-division-rval-off: 238 ms (3 iters, avg 79.33 ms)
    • optimized-int-division-rval-on: 201 ms (3 iters, avg 67.00 ms)
  • Derived speedups vs legacy in this run:
    • intDivSpeedup: 1.84x
    • rvalSpeedup: 1.18x
    • total: 2.18x
  • Compiled-code impact in the optimized case:
    • sum = reminders[j] + carriedOver is now GET_INDEX + UNBOX_INT_OBJ + ADD_INT
    • reminders[j] = sum % denom is now MOD_INT + SET_INDEX
    • q = sum / 10 is now DIV_INT
    • carriedOver = quotient * j is now MUL_INT
  • Remaining hot object arithmetic in the optimized case:
    • reminders[j] *= 10 still compiles as GET_INDEX + MUL_OBJ + SET_INDEX
    • suffix += pi[i] remains ADD_OBJ, which is expected string/object concatenation
  • Conclusion:
    • The main remaining arithmetic bottleneck is the compound index assignment path for reminders[j] *= 10.
    • The next direct win is to specialize AssignOpRef on typed list elements so indexed compound assignment can lower to UNBOX_INT_OBJ + MUL_INT + boxed SET_INDEX.
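
The unbox fix in this batch is easy to model: the fast local unbox must tolerate a slot that already holds a primitive. A sketch under assumed payload shapes (ObjInt here is a stand-in, not the real runtime class):

```python
class ObjInt:
    # Stand-in for a boxed runtime integer; only the payload field matters here.
    def __init__(self, value):
        self.value = value

def unbox_int_obj_local(slot):
    # CmdUnboxIntObjLocal-style fix: accept an already-primitive slot directly
    # instead of assuming a boxed object payload.
    if isinstance(slot, int):
        return slot
    return slot.value
```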

Optimization #5 follow-up:

  • Attempt: specialize typed IndexRef compound assignment so List<Int> element updates avoid object arithmetic.
  • Compiler change:
    • compileAssignOp() now detects non-optional typed List<Int> index targets and lowers arithmetic assign-ops through UNBOX_INT_OBJ + *_INT + SET_INDEX.
  • Latest measured run after the indexed compound-assignment change:
    • legacy-real-division: 394 ms (3 iters, avg 131.33 ms)
    • optimized-int-division-rval-off: 216 ms (3 iters, avg 72.00 ms)
    • optimized-int-division-rval-on: 184 ms (3 iters, avg 61.33 ms)
  • Derived speedups vs legacy in this run:
    • intDivSpeedup: 1.82x
    • rvalSpeedup: 1.17x
    • total: 2.14x
  • Compiled-code impact in the optimized case:
    • reminders[j] *= 10 is now:
      • GET_INDEX
      • UNBOX_INT_OBJ
      • MUL_INT
      • SET_INDEX
    • The optimized inner loop no longer contains object arithmetic for the reminders state update path.
  • Remaining hot object work in the optimized case:
    • suffix += pi[i] remains ADD_OBJ and is expected string/object concatenation
    • The legacy benchmark case still carries real/object work because it intentionally keeps the original floor(sum / (denom * 1.0)) path
  • Conclusion:
    • The inner arithmetic hot loop is now effectively int-lowered end-to-end in the optimized benchmark path.
    • Further wins will likely require reducing list access overhead itself (GET_INDEX / SET_INDEX) or changing the source algorithm/data layout, not more basic arithmetic lowering.
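
The lowering this change introduces amounts to choosing one of two instruction shapes for list[i] op= rhs. A sketch; the sequence names come from the notes, and the dispatch condition is simplified to a single flag:

```python
def lower_index_assign_op(op, typed_int_list_target):
    # Typed non-optional List<Int> target: unbox the element, do int math,
    # store back. Otherwise keep the generic object arithmetic path.
    if typed_int_list_target:
        return ["GET_INDEX", "UNBOX_INT_OBJ", op + "_INT", "SET_INDEX"]
    return ["GET_INDEX", op + "_OBJ", "SET_INDEX"]
```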

Optimization #6 follow-up:

  • Attempt: move the direct ObjList index fast path out from behind RVAL_FASTPATH so the common plain-list case is fast by default.
  • Runtime change:
    • CmdGetIndex and CmdSetIndex now always use direct target.list[index] reads and target.list[index] = value writes for exact ObjList receivers with ObjInt indices.
    • Subclasses such as ObjObservableList still use their overridden getAt / putAt logic, so semantics stay intact.
  • Latest measured run after the default plain-list path:
    • legacy-real-division: 397 ms (3 iters, avg 132.33 ms)
    • optimized-int-division-rval-off: 138 ms (3 iters, avg 46.00 ms)
    • optimized-int-division-rval-on: 164 ms (3 iters, avg 54.67 ms)
  • Derived speedups vs legacy in this run:
    • intDivSpeedup: 2.88x
    • rvalSpeedup: 0.84x
    • total: 2.42x
  • Interpretation:
    • The stable fast baseline is now the rval-off case, because the direct plain-ObjList path no longer depends on RVAL_FASTPATH.
    • RVAL_FASTPATH no longer improves this benchmark; the remaining rval-on/rval-off gap only reflects unrelated runtime variance.
  • Conclusion:
    • For piSpigot, the main VM list-access bottleneck is addressed in the default runtime path.
    • Further work on this benchmark should target algorithm/data-layout changes or string-result construction, not the old RVAL_FASTPATH gate.
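
The exact-receiver check is the load-bearing detail here: the direct path must fire only for plain ObjList, never for subclasses with overridden accessors. A sketch of that dispatch, with the class shapes assumed:

```python
class ObjList:
    def __init__(self, items):
        self.list = items
    def get_at(self, i):
        return self.list[i]

class ObjObservableList(ObjList):
    # Subclass with observable semantics; must keep its overridden accessor.
    def __init__(self, items):
        super().__init__(items)
        self.reads = 0
    def get_at(self, i):
        self.reads += 1
        return super().get_at(i)

def get_index(target, i):
    # Exact-class check, not isinstance: subclasses fall through to the
    # virtual get_at so their semantics stay intact.
    if type(target) is ObjList:
        return target.list[i]
    return target.get_at(i)
```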

Remaining optimization candidates:

  • suffix += pi[i] still compiles as repeated ADD_OBJ string/object concatenation.
    • Best next option: build the suffix through a dedicated buffer/list-join path instead of per-iteration concatenation.
  • The benchmark still performs many GET_INDEX / SET_INDEX operations even after the direct plain-ObjList fast path.
    • Best next option: reduce indexed access count at the source level or introduce a more specialized typed-list storage layout if this benchmark matters enough.
  • The legacy benchmark variant intentionally keeps the real-number floor(sum / (denom * 1.0)) path.
    • No release optimization needed there; it remains only as a regression/control case.
  • RVAL_FASTPATH is no longer a useful tuning knob for this workload after the plain-list VM fast path.
    • Best next option: profile other workloads before changing or removing it globally.
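
The suffix concatenation candidate is the usual accumulate-then-join transformation. A sketch of the two shapes; the Lyng-side API for a buffer/join path is an open question, so this only illustrates the cost model:

```python
def suffix_via_concat(parts):
    # Per-iteration `suffix += pi[i]` style: conceptually each += copies the
    # whole accumulated string, making the loop quadratic in output length.
    suffix = ""
    for p in parts:
        suffix += str(p)
    return suffix

def suffix_via_join(parts):
    # Buffer the pieces and join once: linear in the output length.
    return "".join(str(p) for p in parts)
```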

Release stabilization note:

  • The broad assignment-side INT coercion and the subclass-bypassing list fast path were rolled back or narrowed to restore correctness across numeric-mix, decimal, list, observable-list, and wasm tests.
  • Full release gates now pass:
    • ./gradlew test
    • ./gradlew :lynglib:wasmJsNodeTest
  • Current release-safe benchmark on the stabilized tree:
    • legacy-real-division: 732 ms (3 iters, avg 244.00 ms)
    • optimized-int-division-rval-off: 545 ms (3 iters, avg 181.67 ms)
    • optimized-int-division-rval-on: 697 ms (3 iters, avg 232.33 ms)
  • Interpretation:
    • The release baseline is now optimized-int-division-rval-off at 545 ms for the current correct/stable tree.
    • The removed coercion had been masking a real compiler typing gap; reintroducing it broadly is not release-safe.
  • Highest-value remaining compiler optimization after release:
    • Recover typed int lowering for j = boxes - k, denom = j * 2 + 1, sum = reminders[j] + carriedOver, and carriedOver = quotient * j using a narrower proof than the removed generic arithmetic coercion.