Pi Spigot Benchmark Baseline
Date: 2026-04-03
Command:
./gradlew :lynglib:jvmTest -Pbenchmarks=true --tests 'PiSpigotBenchmarkTest' --rerun-tasks
Results for n=200:
- legacy-real-division: 1108 ms (3 iters, avg 369.33 ms)
- optimized-int-division-rval-off: 756 ms (3 iters, avg 252.00 ms)
- optimized-int-division-rval-on: 674 ms (3 iters, avg 224.67 ms)
Derived speedups:
- intDivSpeedup: 1.47x
- rvalSpeedup: 1.12x
- total: 1.64x
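The derived ratios above can be reproduced from the per-iteration averages, assuming `intDivSpeedup = legacy / rval-off`, `rvalSpeedup = rval-off / rval-on`, and `total = legacy / rval-on` (this helper is illustrative, not part of the benchmark harness):

```java
public class Speedups {
    // Per-iteration averages from the baseline run, in milliseconds.
    static final double LEGACY = 369.33;
    static final double RVAL_OFF = 252.00;
    static final double RVAL_ON = 224.67;

    // Ratio of a slower average to a faster one, rounded to two decimals.
    static double ratio(double slow, double fast) {
        return Math.round(slow / fast * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        System.out.println("intDivSpeedup: " + ratio(LEGACY, RVAL_OFF) + "x"); // 1.47x
        System.out.println("rvalSpeedup:   " + ratio(RVAL_OFF, RVAL_ON) + "x"); // 1.12x
        System.out.println("total:         " + ratio(LEGACY, RVAL_ON) + "x");   // 1.64x
    }
}
```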
Notes:
- Bytecode still shows the generic range-iteration path (`MAKE_RANGE`, `CALL_MEMBER_SLOT`, `ITER_PUSH`) for loop constructs in the legacy benchmark case.
- This baseline is captured before enabling counted-loop lowering for dynamic inline int ranges.
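For context on the variable names that recur in the bytecode notes below (`boxes`, `reminders`, `denom`, `carriedOver`, `quotient`), here is a minimal Java sketch of a Rabinowitz-Wagon style pi spigot of the kind this benchmark exercises; the actual Lyng benchmark source may differ in details:

```java
import java.util.Arrays;

public class PiSpigot {
    // Produce the first n decimal digits of pi with the classic spigot algorithm.
    static String piDigits(int n) {
        int boxes = n * 10 / 3 + 1;           // mixed-radix "boxes"
        int[] reminders = new int[boxes];
        Arrays.fill(reminders, 2);            // pi = 2 + 1/3*(2 + 2/5*(2 + ...))
        StringBuilder out = new StringBuilder();
        int predigit = 0, nines = 0;
        for (int k = 0; k < n; k++) {
            int carriedOver = 0;
            for (int j = boxes - 1; j > 0; j--) {
                reminders[j] *= 10;
                int sum = reminders[j] + carriedOver;
                int denom = j * 2 + 1;
                int quotient = sum / denom;
                reminders[j] = sum % denom;
                carriedOver = quotient * j;   // numerator of box j carries left
            }
            int sum = reminders[0] * 10 + carriedOver;
            reminders[0] = sum % 10;
            int q = sum / 10;                 // provisional next digit
            if (q == 9) {
                nines++;                      // hold 9s until the carry settles
            } else if (q == 10) {
                out.append(predigit + 1);     // carry ripples into held digits
                while (nines > 0) { out.append(0); nines--; }
                predigit = 0;
            } else {
                out.append(predigit);
                while (nines > 0) { out.append(9); nines--; }
                predigit = q;
            }
        }
        out.append(predigit);
        return out.substring(1);              // drop the initial dummy predigit
    }

    public static void main(String[] args) {
        System.out.println(piDigits(50));
    }
}
```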
Optimization #1 follow-up:
- Attempt: broaden compiler loop lowering for dynamic int ranges and validate with `PiSpigotBenchmarkTest` bytecode dumps.
- Final result: success after switching loop-bound coercion to a runtime-checked int path for stable slots with missing metadata.
- Latest measured run after the working compiler change:
- legacy-real-division: 783 ms (3 iters, avg 261.00 ms)
- optimized-int-division-rval-off: 729 ms (3 iters, avg 243.00 ms)
- optimized-int-division-rval-on: 593 ms (3 iters, avg 197.67 ms)
- Hot-op counts for optimized bytecode now show the generic range iterator path is gone from the main loops:
- `makeRange=0`, `callMemberSlot=2`, `iterPush=0`, `getIndex=4`, `setIndex=4`
- The remaining member calls are non-loop overhead; the main improvement came from lowering `for` ranges to counted int loops.
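The effect of this lowering can be illustrated outside the VM. A hypothetical Java analogue of the two shapes: the generic path walks a materialized boxed range through the iterator protocol (the moral equivalent of `MAKE_RANGE` + `ITER_PUSH`), while the lowered path is a plain counted int loop:

```java
import java.util.stream.IntStream;

public class CountedLoop {
    // Generic path: a boxed range object plus iterator protocol,
    // roughly what MAKE_RANGE + ITER_PUSH cost per element.
    static long sumViaBoxedRange(int n) {
        long total = 0;
        for (Integer i : IntStream.range(0, n).boxed().toList()) {
            total += i;   // each element is an Integer, unboxed here
        }
        return total;
    }

    // Lowered path: a counted int loop with no range object at all.
    static long sumViaCountedLoop(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumViaBoxedRange(1000));  // 499500
        System.out.println(sumViaCountedLoop(1000)); // 499500
    }
}
```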
Optimization #2 follow-up:
- Attempt: coerce stable integer operands into `INT` arithmetic during binary-op lowering so hot expressions stop falling back to `OBJ` math.
- Latest measured run after the arithmetic change:
- legacy-real-division: 593 ms (3 iters, avg 197.67 ms)
- optimized-int-division-rval-off: 542 ms (3 iters, avg 180.67 ms)
- optimized-int-division-rval-on: 516 ms (3 iters, avg 172.00 ms)
- Compiled-code impact in the optimized case:
- `boxes = n * 10 / 3` is now `UNBOX_INT_OBJ` + `MUL_INT` + `DIV_INT`
- `j = boxes - k` is now `SUB_INT`
- `denom = j * 2 + 1` is now `MUL_INT` + `ADD_INT`
- `carriedOver = quotient * j` is now `MUL_INT`
- Remaining hot object arithmetic is centered on list-backed reminder values and derived sums:
- `reminders[j] * 10`
- `reminders[j] + carriedOver`
- `sum / denom`, `sum % denom`, `sum / 10`
- Conclusion: loop lowering is fixed; the next likely win is preserving `List<Int>` element typing for `reminders` so indexed loads stay in int space.
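What the `INT` lowering buys can be mirrored in plain Java: the four hot expressions listed above, written over primitives so each maps to a single unboxed operation (the `MUL_INT`/`ADD_INT`/`SUB_INT`/`DIV_INT` analogue). The function names here are illustrative only:

```java
public class IntLowering {
    // The four expressions the compiler now keeps in INT space,
    // expressed over primitive ints with no object arithmetic.
    static int boxes(int n)              { return n * 10 / 3; }  // MUL_INT + DIV_INT
    static int j(int boxes, int k)       { return boxes - k; }   // SUB_INT
    static int denom(int j)              { return j * 2 + 1; }   // MUL_INT + ADD_INT
    static int carriedOver(int q, int j) { return q * j; }       // MUL_INT

    public static void main(String[] args) {
        int b = boxes(200);                  // 666 boxes for n=200
        System.out.println(b);               // 666
        System.out.println(denom(j(b, 1)));  // 1331
    }
}
```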
Optimization #3 follow-up:
- Attempt: teach numeric-kind inference that `IndexRef` can be `INT`/`REAL` when the receiver list has a known element class.
- Compiler change: `inferNumericKind()` now handles `IndexRef` and resolves the receiver slot or receiver-declared list element class before choosing `INT`/`REAL`.
- Latest measured run after the indexed-load inference change:
- legacy-real-division: 656 ms (3 iters, avg 218.67 ms)
- optimized-int-division-rval-off: 509 ms (3 iters, avg 169.67 ms)
- optimized-int-division-rval-on: 403 ms (3 iters, avg 134.33 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 1.29x
- rvalSpeedup: 1.26x
- total: 1.63x
- Compiled-code impact in the optimized case:
- `carriedOver = quotient * j` stays in `INT` space (`ASSERT_IS` + `UNBOX_INT_OBJ` + `MUL_INT`) instead of a plain object multiply.
- Counted int loops remain intact (`MAKE_RANGE=0`, `ITER_PUSH=0`).
- Remaining bottlenecks in the optimized bytecode:
- `GET_INDEX reminders[j]` still feeds `MUL_OBJ`/`ADD_OBJ`
- `sum / denom`, `sum % denom`, and `sum / 10` still compile to object arithmetic
- `suffix += pi[i]` remains `ADD_OBJ`, which is expected because it is string/object concatenation
- Conclusion:
- The new inference produced a real VM-speed gain, especially with `RVAL_FASTPATH` enabled.
- The next compiler win is stronger propagation from `List<Int>` indexed loads into the produced temporary slot so `sum` can stay typed as `INT` across the inner loop.
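A sketch of the shape of that inference, in Java rather than the actual compiler (the names and structure here are hypothetical): an `IndexRef` resolves to `INT` only when the receiver's declared element kind is known; with no proof, it falls back to the object path.

```java
public class KindInference {
    enum Kind { INT, REAL, OBJ }

    interface Expr {}
    record IntLit(long v) implements Expr {}
    record RealLit(double v) implements Expr {}
    // An indexed load whose receiver may carry a declared element kind
    // (e.g. List<Int>), or null when the element type is unknown.
    record IndexRef(Kind elementKind) implements Expr {}

    // Hypothetical analogue of inferNumericKind(): choose INT/REAL only
    // when the expression provably produces that kind, else OBJ.
    static Kind inferNumericKind(Expr e) {
        if (e instanceof IntLit) return Kind.INT;
        if (e instanceof RealLit) return Kind.REAL;
        if (e instanceof IndexRef ref && ref.elementKind() != null) {
            return ref.elementKind();   // receiver's element class is known
        }
        return Kind.OBJ;                // no proof: stay on the object path
    }

    public static void main(String[] args) {
        System.out.println(inferNumericKind(new IndexRef(Kind.INT)));  // INT
        System.out.println(inferNumericKind(new IndexRef(null)));      // OBJ
    }
}
```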
Optimization #4 follow-up:
- Attempt: preserve boxed-argument metadata through `compileCallArgs()` so `list.add(x)` retains `ObjInt`/`ObjReal` element typing.
- Compiler/runtime fixes:
- `compileCallArgs()` now routes arguments through `ensureObjSlot()` + `emitMove()` instead of raw `BOX_OBJ`, preserving `slotObjClass` and `stableObjSlots`.
- `CmdSetIndex` now reads `valueSlot` via `slotToObj()` so `SET_INDEX` can safely accept primitive slots.
- Fast local unbox ops (`CmdUnboxIntObjLocal`, `CmdUnboxRealObjLocal`) now handle already-primitive source slots directly instead of assuming a raw object payload.
- Plain assignment now coerces an object-int RHS back into `INT` when the destination slot is currently compiled as `INT`, keeping loop-carried locals type-consistent.
- Latest measured run after the propagation + VM fixes:
- legacy-real-division: 438 ms (3 iters, avg 146.00 ms)
- optimized-int-division-rval-off: 238 ms (3 iters, avg 79.33 ms)
- optimized-int-division-rval-on: 201 ms (3 iters, avg 67.00 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 1.84x
- rvalSpeedup: 1.18x
- total: 2.18x
- Compiled-code impact in the optimized case:
- `sum = reminders[j] + carriedOver` is now `GET_INDEX` + `UNBOX_INT_OBJ` + `ADD_INT`
- `reminders[j] = sum % denom` is now `MOD_INT` + `SET_INDEX`
- `q = sum / 10` is now `DIV_INT`
- `carriedOver = quotient * j` is now `MUL_INT`
- Remaining hot object arithmetic in the optimized case:
- `reminders[j] *= 10` still compiles as `GET_INDEX` + `MUL_OBJ` + `SET_INDEX`
- `suffix += pi[i]` remains `ADD_OBJ`, which is expected string/object concatenation
- Conclusion:
- The main remaining arithmetic bottleneck is the compound index assignment path for `reminders[j] *= 10`.
- The next direct win is to specialize `AssignOpRef` on typed list elements so indexed compound assignment can lower to `UNBOX_INT_OBJ` + `MUL_INT` + a boxed `SET_INDEX`.
Optimization #5 follow-up:
- Attempt: specialize typed `IndexRef` compound assignment so `List<Int>` element updates avoid object arithmetic.
- Compiler change: `compileAssignOp()` now detects non-optional typed `List<Int>` index targets and lowers arithmetic assign-ops through `UNBOX_INT_OBJ` + `*_INT` + `SET_INDEX`.
- Latest measured run after the indexed compound-assignment change:
- legacy-real-division: 394 ms (3 iters, avg 131.33 ms)
- optimized-int-division-rval-off: 216 ms (3 iters, avg 72.00 ms)
- optimized-int-division-rval-on: 184 ms (3 iters, avg 61.33 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 1.82x
- rvalSpeedup: 1.17x
- total: 2.14x
- Compiled-code impact in the optimized case:
- `reminders[j] *= 10` is now: `GET_INDEX` + `UNBOX_INT_OBJ` + `MUL_INT` + `SET_INDEX`
- The optimized inner loop no longer contains object arithmetic for the `reminders` state-update path.
- Remaining hot object work in the optimized case:
- `suffix += pi[i]` remains `ADD_OBJ` and is expected string/object concatenation
- The legacy benchmark case still carries real/object work because it intentionally keeps the original `floor(sum / (denom * 1.0))` path
- Conclusion:
- The inner arithmetic hot loop is now effectively int-lowered end-to-end in the optimized benchmark path.
- Further wins will likely require reducing list-access overhead itself (`GET_INDEX`/`SET_INDEX`) or changing the source algorithm/data layout, not more basic arithmetic lowering.
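The lowered sequence for `reminders[j] *= 10` can be written out step by step in Java over a boxed list; each statement below is annotated with the op it corresponds to in the notes above (the mapping is illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundIndexAssign {
    // The lowered shape of `reminders[j] *= 10` on a typed List<Int>:
    // GET_INDEX, UNBOX_INT_OBJ, MUL_INT, SET_INDEX.
    static void mulInPlace(List<Integer> reminders, int j, int factor) {
        Integer boxed = reminders.get(j); // GET_INDEX
        int value = boxed;                // UNBOX_INT_OBJ
        value = value * factor;           // MUL_INT
        reminders.set(j, value);          // SET_INDEX (re-boxes the result)
    }

    public static void main(String[] args) {
        List<Integer> reminders = new ArrayList<>(List.of(2, 2, 2));
        mulInPlace(reminders, 1, 10);
        System.out.println(reminders);    // [2, 20, 2]
    }
}
```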
Optimization #6 follow-up:
- Attempt: move the direct `ObjList` index fast path out from behind `RVAL_FASTPATH` so the common plain-list case is fast by default.
- Runtime change: `CmdGetIndex` and `CmdSetIndex` now always use direct `target.list[index]` / `target.list[index] = value` for exact `ObjList` receivers with `ObjInt` indices.
- Subclasses such as `ObjObservableList` still use their overridden `getAt`/`putAt` logic, so semantics stay intact.
- Latest measured run after the default plain-list path:
- legacy-real-division: 397 ms (3 iters, avg 132.33 ms)
- optimized-int-division-rval-off: 138 ms (3 iters, avg 46.00 ms)
- optimized-int-division-rval-on: 164 ms (3 iters, avg 54.67 ms)
- Derived speedups vs legacy in this run:
- intDivSpeedup: 2.88x
- rvalSpeedup: 0.84x
- total: 2.42x
- Interpretation:
- The stable fast baseline is now the `rval-off` case, because the direct plain-`ObjList` path no longer depends on `RVAL_FASTPATH`.
- `RVAL_FASTPATH` no longer improves this benchmark; the remaining gap only reflects unrelated runtime variance.
- Conclusion:
- For `piSpigot`, the main VM list-access bottleneck is addressed in the default runtime path.
- Further work on this benchmark should target algorithm/data-layout changes or string-result construction, not the old `RVAL_FASTPATH` gate.
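The exact-class dispatch described above can be sketched in Java (class and method names mirror the notes but the code is a hypothetical analogue, not the actual runtime): the fast path fires only when the receiver is exactly the plain list class, so a subclass's overridden accessor still runs.

```java
import java.util.ArrayList;
import java.util.List;

public class ListFastPath {
    static class ObjList {
        final List<Object> list = new ArrayList<>();
        Object getAt(int i) { return list.get(i); }   // virtual slow path
    }

    // Subclass with observable semantics that must not be bypassed.
    static class ObjObservableList extends ObjList {
        int reads = 0;
        @Override Object getAt(int i) { reads++; return super.getAt(i); }
    }

    // Analogue of CmdGetIndex: direct backing-list access only for the
    // exact ObjList class; subclasses keep their overridden behavior.
    static Object getIndex(ObjList target, int index) {
        if (target.getClass() == ObjList.class) {
            return target.list.get(index);   // direct fast path
        }
        return target.getAt(index);          // preserves subclass semantics
    }

    public static void main(String[] args) {
        ObjList plain = new ObjList();
        plain.list.add("a");
        ObjObservableList obs = new ObjObservableList();
        obs.list.add("b");
        System.out.println(getIndex(plain, 0)); // a  (fast path)
        System.out.println(getIndex(obs, 0));   // b  (virtual path)
        System.out.println(obs.reads);          // 1  (override still ran)
    }
}
```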
Remaining optimization candidates:
- `suffix += pi[i]` still compiles as repeated `ADD_OBJ` string/object concatenation.
- Best next option: build the suffix through a dedicated buffer/list-join path instead of per-iteration concatenation.
- The benchmark still performs many `GET_INDEX`/`SET_INDEX` operations even after the direct plain-`ObjList` fast path.
- Best next option: reduce indexed-access count at the source level or introduce a more specialized typed-list storage layout if this benchmark matters enough.
- The legacy benchmark variant intentionally keeps the real-number `floor(sum / (denom * 1.0))` path.
- No release optimization needed there; it remains only as a regression/control case.
- `RVAL_FASTPATH` is no longer a useful tuning knob for this workload after the plain-list VM fast path.
- Best next option: profile other workloads before changing or removing it globally.
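The proposed buffer-based suffix construction can be sketched in Java: per-iteration `+=` allocates a fresh string each time (the `ADD_OBJ`-per-digit shape flagged above), while a single growable buffer produces one final string.

```java
import java.util.List;

public class SuffixBuild {
    // Per-iteration concatenation: each += allocates a new string.
    static String concatPerDigit(List<Integer> pi) {
        String suffix = "";
        for (int d : pi) suffix += d;
        return suffix;
    }

    // Buffer-based construction: one growable buffer, one final string.
    static String buildViaBuffer(List<Integer> pi) {
        StringBuilder sb = new StringBuilder(pi.size());
        for (int d : pi) sb.append(d);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Integer> digits = List.of(3, 1, 4, 1, 5);
        System.out.println(concatPerDigit(digits));  // 31415
        System.out.println(buildViaBuffer(digits));  // 31415
    }
}
```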
Release stabilization note:
- The broad assignment-side `INT` coercion and the subclass-bypassing list fast path were rolled back/narrowed to restore correctness across numeric-mix, decimal, list, observable-list, and wasm tests.
- Full release gates now pass: `./gradlew test` and `./gradlew :lynglib:wasmJsNodeTest`
- Current release-safe benchmark on the stabilized tree:
- legacy-real-division: 732 ms (3 iters, avg 244.00 ms)
- optimized-int-division-rval-off: 545 ms (3 iters, avg 181.67 ms)
- optimized-int-division-rval-on: 697 ms (3 iters, avg 232.33 ms)
- Interpretation:
- The release baseline is now `optimized-int-division-rval-off` at 545 ms for the current correct/stable tree.
- The removed coercion had been masking a real compiler typing gap; reintroducing it broadly is not release-safe.
- Highest-value remaining compiler optimization after release:
- Recover typed int lowering for `j = boxes - k`, `denom = j * 2 + 1`, `sum = reminders[j] + carriedOver`, and `carriedOver = quotient * j` using a narrower proof than the removed generic arithmetic coercion.