This document explains how to enable and measure the performance optimizations added to the Lyng interpreter. The focus is JVM‑first with safe, flag‑guarded rollouts and quick A/B testing. Other targets (JS/Wasm/Native) keep conservative defaults until validated.

[//]: # (excludeFromIndex)

## Overview

Optimizations are controlled by runtime‑mutable flags in `net.sergeych.lyng.PerfFlags`, initialized from platform‑specific static defaults in `net.sergeych.lyng.PerfDefaults` (KMP `expect/actual`).

- JVM/Android defaults are aggressive (e.g. `RVAL_FASTPATH=true`).
- Non‑JVM defaults are conservative (e.g. `RVAL_FASTPATH=false`).

All flags are `var` and can be flipped at runtime (e.g., from tests or host apps) for A/B comparisons.

### Workload presets (JVM‑first)

To simplify switching between recommended flag sets for different workloads, use `net.sergeych.lyng.PerfProfiles`:

```kotlin
val snap = PerfProfiles.apply(PerfProfiles.Preset.BENCH) // or BASELINE / BOOKS
// ... run workload ...
PerfProfiles.restore(snap) // restore previous flags
```

- BASELINE: restores platform defaults from `PerfDefaults` (good rollback point).
- BENCH: expression‑heavy micro‑bench focus (aggressive R‑value and PIC optimizations on JVM).
- BOOKS: documentation workloads (prefers simpler paths; disables some PIC/arg builder features shown neutral/negative for this load in A/B).

## Key flags

- `LOCAL_SLOT_PIC`: Runtime cache in `LocalVarRef` to avoid repeated name→slot lookups per frame (ON by default on JVM).
- `EMIT_FAST_LOCAL_REFS`: Compiler emits `FastLocalVarRef` for identifiers known to be locals/params (ON by default on JVM).
- `ARG_BUILDER`: Efficient argument building: small‑arity no‑alloc and pooled builder on JVM (ON by default on JVM).
- `ARG_SMALL_ARITY_12`: Extends small‑arity no‑alloc call paths from 0–8 to 0–12 arguments (JVM‑first exploration; OFF by default). Use for codebases with many 9–12 arg calls; A/B before enabling.
- `SKIP_ARGS_ON_NULL_RECEIVER`: Early return on optional‑null receivers before building args (semantics‑compatible). A/B only.
- `SCOPE_POOL`: Scope frame pooling for calls (JVM, per‑thread ThreadLocal pool). ON by default on JVM; togglable at runtime.
- `FIELD_PIC`: 2‑entry polymorphic inline cache for field reads/writes keyed by `(classId, layoutVersion)` (ON by default on JVM).
- `METHOD_PIC`: 2‑entry PIC for instance method calls keyed by `(classId, layoutVersion)` (ON by default on JVM).
- `FIELD_PIC_SIZE_4`: Increases Field PIC size from 2 to 4 entries (JVM-first tuning; OFF by default). Use for sites with >2 receiver shapes.
- `METHOD_PIC_SIZE_4`: Increases Method PIC size from 2 to 4 entries (JVM-first tuning; OFF by default).
- `PIC_ADAPTIVE_2_TO_4`: Adaptive growth of Field/Method PICs from 2→4 entries per-site when miss rate >20% over ≥256 accesses (JVM-first; OFF by default).
- `INDEX_PIC`: Enables polymorphic inline cache for indexing (e.g., `a[i]`) and related fast paths. Its initial value follows `FIELD_PIC`; it can be toggled independently afterwards.
- `INDEX_PIC_SIZE_4`: Increases Index PIC size from 2 to 4 entries (JVM-first tuning). ON by default on JVM; OFF elsewhere.
- `PIC_DEBUG_COUNTERS`: Enable lightweight hit/miss counters via `PerfStats` (OFF by default).
- `PRIMITIVE_FASTOPS`: Fast paths for `(ObjInt, ObjInt)` arithmetic/comparisons and `(ObjBool, ObjBool)` logic (ON by default on JVM).
- `RVAL_FASTPATH`: Bypass `ObjRecord` in pure expression evaluation via `ObjRef.evalValue` (ON by default on JVM, OFF elsewhere).

See `src/commonMain/kotlin/net/sergeych/lyng/PerfFlags.kt` and `PerfDefaults.*.kt` for details and platform defaults.

## Where optimizations apply

- Locals: `FastLocalVarRef`, `LocalVarRef` per‑frame cache (PIC).
- Calls: small‑arity zero‑alloc paths (0–8 args; optionally 0–12 with `ARG_SMALL_ARITY_12`), pooled builder (JVM), and child frame pooling (optional).
- Properties/methods: Field/Method PICs with receiver shape `(classId, layoutVersion)` and handle‑aware caches; configurable 2→4 entries under flags.
- Expressions: R‑value fast paths in hot nodes (`UnaryOpRef`, `BinaryOpRef`, `ElvisRef`, logical ops, `RangeRef`, `IndexRef` read, `FieldRef` receiver eval, `ListLiteralRef` elements, `CallRef` callee, `MethodCallRef` receiver, assignment RHS).
- Primitives: Direct boolean/int ops where safe.

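
To make the inline-cache idea concrete, here is a minimal sketch of a 2‑entry cache keyed by `(classId, layoutVersion)` with move‑to‑front on a second‑slot hit. The names `ShapeKey`, `TwoEntryPic` and `resolveSlow` are hypothetical and are not taken from the Lyng sources; the actual `FIELD_PIC`/`METHOD_PIC` code may be organized differently.

```kotlin
// Hypothetical receiver shape: a class identity plus a version that changes
// whenever the class layout changes (so stale entries stop matching).
data class ShapeKey(val classId: Long, val layoutVersion: Int)

// Illustrative 2-entry polymorphic inline cache for a single access site.
class TwoEntryPic<T : Any>(private val resolveSlow: (ShapeKey) -> T) {
    private var key0: ShapeKey? = null
    private var val0: T? = null
    private var key1: ShapeKey? = null
    private var val1: T? = null

    var hits = 0
        private set
    var misses = 0
        private set

    fun lookup(shape: ShapeKey): T {
        if (shape == key0) { hits++; return val0!! }
        if (shape == key1) {
            hits++
            // Move-to-front: swap entries so the hot shape is checked first.
            val k = key0; val v = val0
            key0 = key1; val0 = val1
            key1 = k; val1 = v
            return val0!!
        }
        misses++
        val resolved = resolveSlow(shape)
        // Demote the previous front entry; the old second entry is evicted.
        key1 = key0; val1 = val0
        key0 = shape; val0 = resolved
        return resolved
    }
}

fun main() {
    var slowLookups = 0
    val pic = TwoEntryPic<Int> { shape -> slowLookups++; (shape.classId * 10).toInt() }
    val a = ShapeKey(1, 0)
    val b = ShapeKey(2, 0)
    repeat(100) { pic.lookup(a); pic.lookup(b) }
    // A layout change yields a new key, so the stale entry is never matched.
    pic.lookup(ShapeKey(1, 1))
    println("hits=${pic.hits} misses=${pic.misses} slow=$slowLookups") // hits=198 misses=3 slow=3
}
```

With two alternating shapes the site settles after two misses; a third shape would start evicting entries, which is the situation the 4‑entry and adaptive flags target.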
### Compiler constant folding (conservative)

- The compiler folds a safe subset of literal‑only expressions at compile time to reduce runtime work:
  - Integer arithmetic: `+ - * / %` (division/modulo only when divisor ≠ 0).
  - Bitwise integer ops: `& ^ | << >>`.
  - Comparisons and equality for ints/strings/chars: `== != < <= > >=`.
  - Boolean logic for literal booleans: `|| &&` and unary `!`.
  - String concatenation of literal strings: `"a" + "b"`.
- Non‑literal operands or side‑effecting constructs are not folded.
- Semantics remain unchanged; tests verify parity.

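
The folding rules above can be illustrated with a small sketch over a toy AST. The node types (`IntLit`, `Var`, `BinOp`) and the `fold` function are assumptions made for this example; they do not mirror the Lyng compiler's real classes, and only the integer‑arithmetic subset is shown.

```kotlin
// Hypothetical mini-AST; the Lyng compiler's real node types are not shown here.
sealed interface Expr
data class IntLit(val value: Long) : Expr
data class Var(val name: String) : Expr
data class BinOp(val op: String, val left: Expr, val right: Expr) : Expr

// Folds literal-only integer arithmetic bottom-up; everything else is kept.
fun fold(e: Expr): Expr {
    if (e !is BinOp) return e
    val l = fold(e.left)
    val r = fold(e.right)
    if (l is IntLit && r is IntLit) {
        val folded: Long? = when (e.op) {
            "+" -> l.value + r.value
            "-" -> l.value - r.value
            "*" -> l.value * r.value
            // Division/modulo fold only for a non-zero divisor, so a
            // division-by-zero error is still raised at run time.
            "/" -> if (r.value != 0L) l.value / r.value else null
            "%" -> if (r.value != 0L) l.value % r.value else null
            else -> null
        }
        if (folded != null) return IntLit(folded)
    }
    return BinOp(e.op, l, r)
}

fun main() {
    // 2 + 3 * 4 folds completely.
    println(fold(BinOp("+", IntLit(2), BinOp("*", IntLit(3), IntLit(4))))) // IntLit(value=14)
    // 1 / 0 is left alone.
    println(fold(BinOp("/", IntLit(1), IntLit(0))))
    // x + (5 - 2): only the literal-only subtree is folded.
    println(fold(BinOp("+", Var("x"), BinOp("-", IntLit(5), IntLit(2)))))
}
```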
## Running JVM micro‑benchmarks

Each benchmark prints timings with `[DEBUG_LOG]` and includes correctness assertions to prevent dead‑code elimination.

Run individual tests to avoid multiplatform matrices:

```
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallSplatBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest
./gradlew :lynglib:jvmTest --tests PicAdaptiveABTest
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest
./gradlew :lynglib:jvmTest --tests IndexPicABTest
./gradlew :lynglib:jvmTest --tests IndexWritePathABTest
./gradlew :lynglib:jvmTest --tests CallPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MethodPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MixedBenchmarkTest
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest
```

Typical output (example):

```
[DEBUG_LOG] [BENCH] mixed-arity x200000 [ARG_BUILDER=ON]: 85.7 ms
```

Lower time is better. Run the same bench with a flag OFF vs ON to compare.

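
As an illustration of the OFF vs ON pattern, the following self‑contained sketch times one workload under both settings of a flag and asserts that both paths return the same result. `DemoFlags.FAST_SUM` and `sumTo` are stand‑ins invented for this example; a real Lyng benchmark would toggle a `PerfFlags` field and evaluate a script instead.

```kotlin
import kotlin.system.measureNanoTime

// Stand-in for a runtime-mutable flag (hypothetical, not a real PerfFlags field).
object DemoFlags { var FAST_SUM = false }

// Workload with two code paths selected by the flag; both must agree.
fun sumTo(n: Int): Long {
    if (DemoFlags.FAST_SUM) return n.toLong() * (n + 1) / 2
    var s = 0L
    for (i in 1..n) s += i
    return s
}

// Runs the workload and returns elapsed milliseconds plus an accumulated result.
fun timedMs(iterations: Int, n: Int): Pair<Double, Long> {
    var acc = 0L
    val nanos = measureNanoTime { repeat(iterations) { acc += sumTo(n) } }
    return Pair(nanos / 1e6, acc)
}

fun main() {
    val saved = DemoFlags.FAST_SUM

    DemoFlags.FAST_SUM = false
    timedMs(200, 10_000) // warm-up, result discarded
    val (offMs, offResult) = timedMs(2_000, 10_000)

    DemoFlags.FAST_SUM = true
    timedMs(200, 10_000) // warm-up, result discarded
    val (onMs, onResult) = timedMs(2_000, 10_000)

    DemoFlags.FAST_SUM = saved // do not leak the flag into later code

    // Consuming the result keeps the JIT from dropping the loop and checks parity.
    check(offResult == onResult) { "flag changed semantics" }
    println("[BENCH] sum x2000 [FAST_SUM=OFF]: $offMs ms")
    println("[BENCH] sum x2000 [FAST_SUM=ON]: $onMs ms")
}
```

Single runs are noisy, so treat one OFF/ON pair as a smoke check and repeat the run before drawing conclusions.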
### Optional JFR allocation profiling (JVM)

When running end‑to‑end “book” workloads or heavier benches, you can enable JFR to capture allocation and GC details:

```
./gradlew :lynglib:jvmTest --tests BookAllocationProfileTest -Dlyng.jfr=true \
  -Dlyng.profile.warmup=1 -Dlyng.profile.repeats=3 -Dlyng.profile.shuffle=true
```

- Dumps are saved to `lynglib/build/jfr_*.jfr` if the JVM supports Flight Recorder.
- The test also records GC counts/time and median time/heap deltas to `lynglib/build/book_alloc_profile.txt`.

## Toggling flags in tests

Flags are mutable at runtime, e.g.:

```kotlin
PerfFlags.ARG_BUILDER = false
val r1 = (Scope().eval(script) as ObjInt).value
PerfFlags.ARG_BUILDER = true
val r2 = (Scope().eval(script) as ObjInt).value
```

Reset flags at the end of a test to avoid impacting other tests.

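
One way to make the reset automatic is a small helper that restores the previous value in a `finally` block. This is a sketch, not part of Lyng: `withFlag` is a hypothetical helper and `DemoFlags` stands in for `PerfFlags` so the example runs on its own.

```kotlin
import kotlin.reflect.KMutableProperty0

// Stand-in for net.sergeych.lyng.PerfFlags so the sketch is self-contained.
object DemoFlags {
    var ARG_BUILDER = true
}

// Sets the flag for the duration of `block` and restores the previous value
// even if the block throws (for example on a failed assertion).
inline fun <T> withFlag(flag: KMutableProperty0<Boolean>, value: Boolean, block: () -> T): T {
    val previous = flag.get()
    flag.set(value)
    try {
        return block()
    } finally {
        flag.set(previous)
    }
}

fun main() {
    val off = withFlag(DemoFlags::ARG_BUILDER, false) { "ARG_BUILDER=${DemoFlags.ARG_BUILDER}" }
    val on = withFlag(DemoFlags::ARG_BUILDER, true) { "ARG_BUILDER=${DemoFlags.ARG_BUILDER}" }
    println(off)                   // ARG_BUILDER=false
    println(on)                    // ARG_BUILDER=true
    println(DemoFlags.ARG_BUILDER) // true: the original value is back
}
```

For switching several flags at once, the `PerfProfiles.apply(...)` / `PerfProfiles.restore(...)` pair described earlier serves the same purpose.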
## PIC diagnostics (optional)

Enable counters:

```kotlin
PerfFlags.PIC_DEBUG_COUNTERS = true
PerfStats.resetAll()
```

Available counters in `PerfStats`:

- Field PIC: `fieldPicHit`, `fieldPicMiss`, `fieldPicSetHit`, `fieldPicSetMiss`
- Method PIC: `methodPicHit`, `methodPicMiss`
- Index PIC: `indexPicHit`, `indexPicMiss`
- Locals: `localVarPicHit`, `localVarPicMiss`, `fastLocalHit`, `fastLocalMiss`
- Primitive ops: `primitiveFastOpsHit`

Print a summary at the end of a bench/test as needed. Remember to turn counters OFF after the test.

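
A summary line usually boils down to a hit rate per cache. The sketch below computes it from a hit/miss pair; `PicCounters` and `report` are hypothetical helpers, and the numeric type of the real `PerfStats` counters is an assumption here.

```kotlin
import java.util.Locale

// Hypothetical snapshot of one hit/miss counter pair read after a run.
data class PicCounters(val hit: Long, val miss: Long) {
    // Hit rate in [0, 1]; null when the site was never exercised.
    val hitRate: Double?
        get() = if (hit + miss == 0L) null else hit.toDouble() / (hit + miss)
}

// Formats one summary line, e.g. for printing at the end of a benchmark.
fun report(name: String, c: PicCounters): String {
    val rate = c.hitRate?.let { "%.1f%%".format(Locale.ROOT, it * 100) } ?: "n/a"
    return "$name: hit=${c.hit} miss=${c.miss} rate=$rate"
}

fun main() {
    println(report("fieldPic", PicCounters(hit = 950, miss = 50))) // fieldPic: hit=950 miss=50 rate=95.0%
    println(report("indexPic", PicCounters(hit = 0, miss = 0)))    // indexPic: hit=0 miss=0 rate=n/a
}
```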
## A/B scenarios and guidance (JVM)

### Adaptive PIC (fields/methods)
- Flags: `FIELD_PIC=true`, `METHOD_PIC=true`, `FIELD_PIC_SIZE_4=false`, `METHOD_PIC_SIZE_4=false`, toggle `PIC_ADAPTIVE_2_TO_4` OFF vs ON.
- Benchmarks: `PicBenchmarkTest`, `MixedBenchmarkTest`, `PicAdaptiveABTest`.
- Expect wins at sites with >2 receiver shapes; counters should show fewer misses with adaptivity ON.

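
The growth rule quoted for `PIC_ADAPTIVE_2_TO_4` (miss rate above 20% over at least 256 accesses) can be sketched as a per‑site policy object. This is an assumption‑laden illustration: `AdaptivePicPolicy` is a hypothetical name, and whether the real implementation counts cumulatively or over a sliding window is not stated in this document; the sketch counts cumulatively.

```kotlin
// Illustrative per-site policy: grow from 2 to 4 entries once the site has seen
// at least `minAccesses` lookups and more than `missRateThreshold` of them missed.
class AdaptivePicPolicy(
    private val minAccesses: Int = 256,
    private val missRateThreshold: Double = 0.20,
) {
    var capacity = 2
        private set
    private var accesses = 0
    private var misses = 0

    fun record(hit: Boolean) {
        accesses++
        if (!hit) misses++
        if (capacity == 2 && accesses >= minAccesses &&
            misses.toDouble() / accesses > missRateThreshold
        ) {
            capacity = 4
        }
    }
}

fun main() {
    val mono = AdaptivePicPolicy()
    repeat(1_000) { mono.record(hit = true) }

    val poly = AdaptivePicPolicy()
    repeat(1_000) { i -> poly.record(hit = i % 3 != 0) } // roughly one miss in three

    println("monomorphic site capacity=${mono.capacity}") // 2
    println("polymorphic site capacity=${poly.capacity}") // 4
}
```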
### Index PIC and size
- Flags: toggle `INDEX_PIC` OFF vs ON; then `INDEX_PIC_SIZE_4` OFF vs ON.
- Benchmarks: `ExpressionBenchmarkTest` (list indexing) and `IndexPicABTest` (string/map indexing).
- Expect wins when the same index shape recurs; counters should show higher `indexPicHit`.

### Index WRITE paths (Map and List)
- Flags: toggle `INDEX_PIC` OFF vs ON; then `INDEX_PIC_SIZE_4` OFF vs ON.
- Benchmark: `IndexWritePathABTest` (Map[String] put, List[Int] set); writes results to `lynglib/build/index_write_ab_results.txt`.
- Direct fast‑paths are used on R‑value paths where safe and semantics‑preserving (e.g., optional‑chaining no‑ops on null receivers; bounds exceptions unchanged).

## Guidance per flag (JVM)

- Keep `RVAL_FASTPATH = true` unless debugging a suspected expression‑semantics issue.
- `SCOPE_POOL` is ON by default on JVM (per‑thread ThreadLocal pool). Switch it OFF to get a baseline or to isolate a suspected pooling issue, and re-run the deep stress tests after any pooling change.
- `FIELD_PIC` and `METHOD_PIC` should remain ON; they are validated with invalidation tests.
- Consider enabling `FIELD_PIC_SIZE_4`/`METHOD_PIC_SIZE_4` for sites with 3–4 receiver shapes; measure first.
- `PIC_ADAPTIVE_2_TO_4` is useful on polymorphic sites and may outperform fixed size 2 on mixed-shape workloads. Validate with `PicAdaptiveABTest`.
- `INDEX_PIC` is generally beneficial on JVM; leave ON when measuring index‑heavy workloads.
- `INDEX_PIC_SIZE_4` is ON by default on JVM as A/B showed consistent wins on String[Int] and Map[String] workloads. You can disable it by setting `PerfFlags.INDEX_PIC_SIZE_4 = false` if needed.
- `ARG_BUILDER` should remain ON; switch OFF only to get a baseline.
- `ARG_SMALL_ARITY_12` is experimental and OFF by default. Enable it only if your workload frequently calls functions with 9–12 arguments and A/B shows consistent wins.

### Workload‑specific recommendations (JVM)

- “Books”/documentation loads (BookTest): prefer simpler paths; in A/B these often benefit from the BOOKS preset (e.g., `ARG_BUILDER=false`, `SCOPE_POOL=false`, `INDEX_PIC=false`). Use `PerfProfiles.apply(PerfProfiles.Preset.BOOKS)` before the run and `restore(...)` after.
- Expression‑heavy benches: use the BENCH preset (PICs and R‑value fast‑paths enabled, `INDEX_PIC_SIZE_4=true`).
- Always verify with local A/B on your environment; rollback is a flag flip or applying BASELINE.

## Notes on correctness & safety

- Optional chaining semantics are preserved across fast paths.
- Visibility/mutability checks are enforced even on PIC fast‑paths.
- `frameId` is regenerated on each pooled frame borrow; stress tests verify no leakage under deep nesting/recursion.

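
The `frameId` note can be pictured with a minimal per‑thread pool. This is a sketch under stated assumptions: `Frame` and `FramePool` are hypothetical stand‑ins, and the real Lyng `Scope` carries far more state than a slot list.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical pooled frame; a real scope object has more fields than this.
class Frame {
    var frameId: Long = 0
    val slots = ArrayList<Any?>()
}

object FramePool {
    private val nextId = AtomicLong(1)

    // One pool per thread, so borrow/release need no locking.
    private val pool = ThreadLocal.withInitial { ArrayDeque<Frame>() }

    fun borrow(): Frame {
        val frame = pool.get().removeLastOrNull() ?: Frame()
        // Fresh id on every borrow: anything cached against the id from a
        // previous use of this object cannot match the new activation.
        frame.frameId = nextId.getAndIncrement()
        return frame
    }

    fun release(frame: Frame) {
        frame.slots.clear() // drop references so values do not leak between calls
        pool.get().addLast(frame)
    }
}

fun main() {
    val first = FramePool.borrow()
    val firstId = first.frameId
    first.slots.add("local value")
    FramePool.release(first)

    val second = FramePool.borrow()
    println(second === first)          // true: the object was reused
    println(second.frameId != firstId) // true: but it has a new identity
    println(second.slots.isEmpty())    // true: no state carried over
}
```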
## Cross‑platform

- Non‑JVM defaults keep `RVAL_FASTPATH=false` for now; other low‑risk flags may be ON.
- Once JVM path is fully validated and measured, add lightweight benches for JS/Wasm/Native and enable flags incrementally.

## Range fast iteration (experimental)

- Flag: `RANGE_FAST_ITER` (default OFF).
- When enabled and applicable, simple ascending integer ranges (`0..n`, `0..<n`) use a specialized non‑allocating iterator (`ObjFastIntRangeIterator`).
- Benchmark: `RangeIterationBenchmarkTest` records OFF/ON timings for inclusive, exclusive, reversed, negative, and empty ranges. Semantics are preserved; non‑int or complex ranges fall back to the generic iterator.

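
To show what a specialized iterator for simple ascending integer ranges might look like, here is a rough sketch. It is not `ObjFastIntRangeIterator`: `IntRangeCursor`, `fastCursorOrNull` and `sumRange` are hypothetical names, and the fallback uses Kotlin's own ranges in place of Lyng's generic iterator.

```kotlin
// Illustrative cursor over an ascending Int range: two primitive fields and a
// primitive-returning nextInt(), so stepping creates no per-element objects.
class IntRangeCursor(start: Int, private val endExclusive: Int) {
    private var current = start
    fun hasNext(): Boolean = current < endExclusive
    fun nextInt(): Int = current++
}

// Returns the fast cursor when it is applicable, or null to request a fallback.
fun fastCursorOrNull(start: Int, end: Int, inclusive: Boolean): IntRangeCursor? {
    if (inclusive && end == Int.MAX_VALUE) return null // end + 1 would overflow
    return IntRangeCursor(start, if (inclusive) end + 1 else end)
}

fun sumRange(start: Int, end: Int, inclusive: Boolean): Long {
    val cursor = fastCursorOrNull(start, end, inclusive)
        // Generic fallback path (here simply Kotlin's IntRange).
        ?: return (if (inclusive) start..end else start until end).sumOf { it.toLong() }
    var total = 0L
    while (cursor.hasNext()) total += cursor.nextInt()
    return total
}

fun main() {
    println(sumRange(0, 10, inclusive = true))  // 55
    println(sumRange(0, 10, inclusive = false)) // 45
    println(sumRange(5, 1, inclusive = true))   // 0: nothing to iterate
}
```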
## Troubleshooting

- If a benchmark shows regressions, flip related flags OFF to isolate the source (e.g., `ARG_BUILDER`, `RVAL_FASTPATH`, `FIELD_PIC`, `METHOD_PIC`).
- Use `PIC_DEBUG_COUNTERS` to observe inline cache effectiveness.
- Ensure tests do not accidentally keep flags ON for subsequent tests; reset after each test.

## JVM micro-benchmark results (3× medians; OFF → ON)

Date: 2025-11-10 23:04 (local)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|--------------------|----------------------------------------------|-----------------:|----------------:|:-------:|-------|
| ARG_BUILDER | CallMixedArityBenchmarkTest | 788.02 | 668.79 | 1.18× | Clear win on mixed arity |
| ARG_BUILDER | CallBenchmarkTest (simple calls) | 423.87 | 425.47 | 1.00× | Neutral on repeated simple calls |
| FIELD_PIC | PicBenchmarkTest::benchmarkFieldGetSetPic | 113.575 | 106.017 | 1.07× | Small but consistent win |
| METHOD_PIC | PicBenchmarkTest::benchmarkMethodPic | 251.068 | 149.439 | 1.68× | Large consistent win |
| RVAL_FASTPATH | ExpressionBenchmarkTest | 514.491 | 426.800 | 1.21× | Consistent win in expression chains |
| PRIMITIVE_FASTOPS | ArithmeticBenchmarkTest (int-sum) | 243.420 | 128.146 | 1.90× | Big win for integer addition |
| PRIMITIVE_FASTOPS | ArithmeticBenchmarkTest (int-cmp) | 210.385 | 168.534 | 1.25× | Moderate win for comparisons |
| SCOPE_POOL | CallPoolingBenchmarkTest | 505.778 | 366.737 | 1.38× | Single-threaded bench; per-thread ThreadLocal pool; default ON on JVM |

Notes:
- All results obtained from `[DEBUG_LOG] [BENCH]` outputs with three repeated Gradle test invocations per configuration; medians reported.
- JVM defaults (current): `ARG_BUILDER=true`, `PRIMITIVE_FASTOPS=true`, `RVAL_FASTPATH=true`, `FIELD_PIC=true`, `METHOD_PIC=true`, `SCOPE_POOL=true` (per‑thread ThreadLocal pool), `REGEX_CACHE=true`.

## Concurrency (multi‑core) pooling results (3× medians; OFF → ON)

Date: 2025-11-10 22:56 (local)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|------------|--------------------------------------|-----------------:|----------------:|:-------:|-------|
| SCOPE_POOL | ConcurrencyCallBenchmarkTest (JVM) | 521.102 | 201.374 | 2.59× | Multithreaded workload on `Dispatchers.Default` with per‑thread ThreadLocal pool; workers=8, iters=15000/worker. |

Methodology:
- The test toggles `PerfFlags.SCOPE_POOL` within a single run and executes the same script across N worker coroutines scheduled on `Dispatchers.Default`.
- We executed the test three times via Gradle and computed medians from the printed `[DEBUG_LOG]` timings:
  - OFF runs (ms): 532.442 | 521.102 | 474.386 → median 521.102
  - ON runs (ms): 218.683 | 201.374 | 198.737 → median 201.374
- Speedup = OFF/ON.

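
The median and speedup arithmetic above is easy to reproduce. The snippet below is a small helper sketch (the `median` function is written for this example, not taken from the test sources) applied to the three OFF and ON timings listed in the methodology.

```kotlin
import java.util.Locale

// Median of an odd-sized sample: sort and take the middle element.
fun median(samples: List<Double>): Double {
    require(samples.size % 2 == 1) { "expects an odd number of runs" }
    return samples.sorted()[samples.size / 2]
}

fun main() {
    val off = listOf(532.442, 521.102, 474.386)
    val on = listOf(218.683, 201.374, 198.737)

    val speedup = median(off) / median(on)
    val rounded = "%.2f".format(Locale.ROOT, speedup)

    println("OFF median = ${median(off)} ms") // 521.102
    println("ON median = ${median(on)} ms")   // 201.374
    println("speedup = ${rounded}x")          // 2.59x
}
```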
Reproduce:
```
./gradlew :lynglib:jvmTest --tests "ConcurrencyCallBenchmarkTest" --rerun-tasks
```

## Next optimization steps (JVM)

Date: 2025-11-10 23:04 (local)

- PICs
  - Widen METHOD_PIC to 3–4 entries with tiny LRU; keep invalidation on layout change; re-run `PicInvalidationJvmTest`.
  - Micro fast-path for FIELD_PIC read-then-write pairs (`x = x + 1`) to reuse the resolved slot within one step.
- Locals and slots
  - Pre-size `Scope` slot structures when compiler knows local/param counts; audit `EMIT_FAST_LOCAL_REFS` coverage.
  - Re-run `LocalVarBenchmarkTest` to quantify gains.
- RVAL_FASTPATH coverage
  - Cover primitive `ObjList` index reads, pure receivers in `FieldRef`, and assignment RHS where safe; add micro-benches to `ExpressionBenchmarkTest`.
- Collections and ranges
  - Specialize `(Int..Int)` loops into tight counted loops (no intermediary objects).
  - Add primitive-specialized `ObjList` ops (`map`, `filter`, `sum`, `contains`) under `PRIMITIVE_FASTOPS`.
- Regex and strings
  - Cache compiled regex for string literals at compile time; add a tiny LRU for dynamic patterns behind `REGEX_CACHE`.
  - Add `RegexBenchmarkTest` for repeated matches.
- JIT friendliness (Kotlin/JVM)
  - Inline tiny helpers in hot paths, prefer arrays for internal buffers, finalize hot data structures where safe.

Validation matrix
- Always re-run: `CallBenchmarkTest`, `CallMixedArityBenchmarkTest`, `PicBenchmarkTest`, `ExpressionBenchmarkTest`, `ArithmeticBenchmarkTest`, `CallPoolingBenchmarkTest`, `DeepPoolingStressJvmTest`, `ConcurrencyCallBenchmarkTest` (3× medians when comparing).
- Keep full `:lynglib:jvmTest` green after each change.

## PIC update (4‑way METHOD_PIC): JVM (3× medians; OFF → ON)

Date: 2025-11-11 00:16 (local)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|-----------|-----------------------------------------------|-----------------:|----------------:|:-------:|-------|
| FIELD_PIC | PicBenchmarkTest::benchmarkFieldGetSetPic | 207.578 | 106.481 | 1.95× | Read→write loop; micro fast‑path groundwork present |
| METHOD_PIC | PicBenchmarkTest::benchmarkMethodPic | 273.478 | 182.226 | 1.50× | 4‑way PIC with move‑to‑front (was 2‑way before) |

Medians computed from three Gradle runs in this session; see `[DEBUG_LOG] [BENCH]` lines in test output.

## Locals/slots capacity (pre‑sizing hints): JVM (3× medians; OFF → ON)

Date: 2025-11-11 13:19 (local)

| Optimization | Benchmark/Test | OFF config | ON config | OFF median (ms) | ON median (ms) | Speedup | Notes |
|-------------------------|-----------------------------|------------------------------------|------------------------------------|-----------------:|----------------:|:-------:|-------|
| Locals pre‑sizing + PIC | LocalVarBenchmarkTest | LOCAL_SLOT_PIC=OFF, FAST_LOCAL=OFF | LOCAL_SLOT_PIC=ON, FAST_LOCAL=ON | 472.129 | 370.871 | 1.27× | Compiler hint `params+4`; slot pre‑size; semantics unchanged |

Methodology:
- Each configuration executed three times via `:lynglib:jvmTest --tests "…" --rerun-tasks`; medians reported.
- Locals improvement stacks with per‑thread `SCOPE_POOL` and ARG fast paths.

## RVAL fast paths update: JVM (IndexRef and FieldRef) [3× medians; OFF → ON]

Date: 2025-11-11 13:19 (local)

New micro-benchmarks have been added to quantify the latest `RVAL_FASTPATH` extensions:
- Primitive `ObjList` index-read fast path in `IndexRef`.
- Conservative “pure receiver” evaluation in `FieldRef` (monomorphic, immutable receiver), preserving visibility/mutability checks and optional chaining semantics.

Benchmarks to run (each 3× OFF → ON):
- `ExpressionBenchmarkTest::benchmarkListIndexReads`
- `ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver`

Reproduce (3× each; collect `[DEBUG_LOG] [BENCH]` lines and compute medians):
```
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
```

Collected medians and speedups:

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------|---------------------------------------------------|-----------------:|----------------:|:-------:|-------|
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkListIndexReads | 305.243 | 230.942 | 1.32× | Fast path in `IndexRef` for `ObjList` + `ObjInt` index |
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 266.222 | 190.720 | 1.40× | Pure-receiver evaluation in `FieldRef` (monomorphic, immutable) |

Notes:
- Both benches toggle `PerfFlags.RVAL_FASTPATH` within a single run to produce OFF and ON timings under identical conditions.
- Correctness assertions ensure the loops are not optimized away.
- All semantics (visibility/mutability checks, optional chaining) remain intact; fast paths only skip interim `ObjRecord` traffic when safe.

## ARG_BUILDER: splat fast‑path (3× medians; OFF → ON)

Date: 2025-11-11 13:12 (local)

Environment: Gradle 8.7; JVM (JDK as configured by toolchain); single‑threaded test execution; stdout enabled.

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|-------------|-----------------------------------|-----------------:|----------------:|:-------:|-------|
| ARG_BUILDER | CallSplatBenchmarkTest (splat) | 613.689 | 463.593 | 1.32× | Single‑splat fast‑path returns underlying list directly; avoids intermediate copies |

Inputs (3×):
- OFF runs (ms): 613.689 | 629.604 | 612.361 → median 613.689
- ON runs (ms): 453.752 | 463.593 | 468.844 → median 463.593

Reproduce (3×):
```
./gradlew :lynglib:jvmTest --tests "CallSplatBenchmarkTest" --rerun-tasks
```

## Phase A consolidation (JVM): 3× medians updated

Date: 2025-11-11 13:48 (local)

Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

### ARG_BUILDER

| Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|----------------------------------|-----------------:|----------------:|:-------:|-------|
| CallMixedArityBenchmarkTest | 866.681 | 717.439 | 1.21× | Small-arity 0–8 fast path + builder; correctness preserved |
| CallSplatBenchmarkTest (splat) | 600.880 | 459.706 | 1.31× | Single-splat fast path returns underlying list; avoids copies |

Inputs (3×):
- Mixed arity OFF: 874.088291 | 866.680959 | 858.577125 → median 866.680959
- Mixed arity ON: 731.308625 | 706.440125 | 717.438542 → median 717.438542
- Splat OFF: 600.268625 | 607.849416 | 600.879666 → median 600.879666
- Splat ON: 459.706375 | 449.950166 | 461.815167 → median 459.706375

### RVAL_FASTPATH (new coverage)

| Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|--------------------------------------------------|-----------------:|----------------:|:-------:|-------|
| ExpressionBenchmarkTest::benchmarkListIndexReads | 299.366 | 218.812 | 1.37× | IndexRef fast path for ObjList + ObjInt |
| ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 268.315 | 186.032 | 1.44× | Pure-receiver evaluation in FieldRef (monomorphic, immutable) |

Inputs (3×):
- ListIndex OFF: 291.344 | 310.717167 | 299.365709 → median 299.365709
- ListIndex ON: 217.795375 | 221.504166 | 218.812042 → median 218.812042
- FieldRead OFF: 267.2775 | 274.355208 | 268.315125 → median 268.315125
- FieldRead ON: 189.599333 | 186.031791 | 182.069167 → median 186.031791

### Locals/slots capacity (precise hints)

| Benchmark/Test | OFF config | ON config | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------------------|------------------------------------|------------------------------------|-----------------:|----------------:|:-------:|-------|
| LocalVarBenchmarkTest | LOCAL_SLOT_PIC=OFF, FAST_LOCAL=OFF | LOCAL_SLOT_PIC=ON, FAST_LOCAL=ON | 446.018 | 347.964 | 1.28× | Precise capacity hints + fast-locals coverage |

Inputs (3×):
- Locals OFF: 470.575041 | 441.89625 | 446.017833 → median 446.017833
- Locals ON: 370.664208 | 345.615541 | 347.964291 → median 347.964291

Methodology:
- Each test executed three times via Gradle with stdout enabled; medians computed from `[DEBUG_LOG] [BENCH]` lines.
- Full JVM tests and stress benches remain green in this cycle.

## Phase B: List ops specialization (PRIMITIVE_FASTOPS), 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)

Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

| Optimization | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------------|------------------------------------------|-----------------:|----------------:|:-------:|-------|
| PRIMITIVE_FASTOPS | ListOpsBenchmarkTest::benchmarkSumInts | 324.805 | 144.908 | 2.24× | ObjList.sum fast path for int lists; generic fallback preserved |
| PRIMITIVE_FASTOPS | ListOpsBenchmarkTest::benchmarkContainsInts | 440.414 | 415.476 | 1.06× | ObjList.contains fast path when searching ObjInt in int list |

Inputs (3×):
- list-sum OFF: 332.863417 | 323.491625 | 324.804083 → median 324.804083
- list-sum ON: 144.907833 | 148.870792 | 126.418542 → median 144.907833
- list-contains OFF: 440.413709 | 440.368333 | 441.4365 → median 440.413709
- list-contains ON: 416.465292 | 412.283291 | 415.475833 → median 415.475833

Methodology:
- Each test executed three times via Gradle; medians computed from `[DEBUG_LOG] [BENCH]` lines.
- Changes are fully guarded by `PerfFlags.PRIMITIVE_FASTOPS`; semantics preserved (null on empty sum; generic fallback on mixed types).

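
The "null on empty sum; generic fallback on mixed types" contract can be sketched as follows. The value types `IntVal` and `RealVal` and both sum functions are hypothetical stand‑ins chosen for this example; they are not the Lyng `ObjList` implementation.

```kotlin
// Hypothetical, simplified value types standing in for interpreter objects.
sealed interface Value
data class IntVal(val value: Long) : Value
data class RealVal(val value: Double) : Value

fun asDouble(v: Value): Double = when (v) {
    is IntVal -> v.value.toDouble()
    is RealVal -> v.value
}

// Generic addition: stays integral for two ints, otherwise widens to real.
fun add(a: Value, b: Value): Value =
    if (a is IntVal && b is IntVal) IntVal(a.value + b.value)
    else RealVal(asDouble(a) + asDouble(b))

// Generic path: one dispatch and one allocation per element; null when empty.
fun sumGeneric(items: List<Value>): Value? {
    if (items.isEmpty()) return null
    var acc = items[0]
    for (i in 1 until items.size) acc = add(acc, items[i])
    return acc
}

// Fast path: accumulate in a primitive Long while every element is an int;
// on the first non-int element, hand the whole list to the generic path.
fun sumFast(items: List<Value>): Value? {
    if (items.isEmpty()) return null
    var acc = 0L
    for (item in items) {
        if (item !is IntVal) return sumGeneric(items)
        acc += item.value
    }
    return IntVal(acc)
}

fun main() {
    println(sumFast(listOf(IntVal(1), IntVal(2), IntVal(3)))) // IntVal(value=6)
    println(sumFast(emptyList()))                             // null
    println(sumFast(listOf(IntVal(1), RealVal(2.5))))         // RealVal(value=3.5)
}
```

Restarting the generic path from the first element keeps the fallback trivially equivalent to the unoptimized behaviour, at the cost of re-reading the already scanned prefix.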
## Phase B: Ranges for-in lowering (PRIMITIVE_FASTOPS), 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)

Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

| Optimization | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------------|------------------------------------------|-----------------:|----------------:|:-------:|-------|
| PRIMITIVE_FASTOPS | RangeBenchmarkTest::benchmarkIntRangeForIn | 1705.299 | 788.974 | 2.16× | Tight counted loop for (Int..Int) for-in; preserves semantics |

Inputs (3×):
- range-for-in OFF: 1705.298958 | 1684.357708 | 1735.880917 → median 1705.298958
- range-for-in ON: 794.178458 | 778.741834 | 788.973625 → median 788.973625

Methodology:
- Each configuration executed three times via Gradle; medians computed from `[DEBUG_LOG] [BENCH]` lines.
- Lowering is guarded by `PerfFlags.PRIMITIVE_FASTOPS` and applies only when the source is an `ObjRange` with int bounds; otherwise falls back to generic iteration.

## Phase B: Regex caching (REGEX_CACHE), single‑run timings (OFF → ON)

Date: 2025-11-11 13:48 (local)

Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

| Flag | Benchmark/Test | OFF (ms) | ON (ms) | Speedup | Notes |
|--------------|---------------------------------------------------|-----------------:|----------------:|:-------:|-------|
| REGEX_CACHE | RegexBenchmarkTest::benchmarkLiteralPatternMatches | 378.246 | 275.890 | 1.37× | Caches compiled regex for identical literal pattern per iteration |
| REGEX_CACHE | RegexBenchmarkTest::benchmarkDynamicPatternMatches | 514.944 | 229.006 | 2.25× | Two dynamic patterns alternate; cache size sufficient to retain both |

Inputs (1× here; can extend to 3× on request):
- regex-literal OFF: 378.245916; ON: 275.889541
- regex-dynamic OFF: 514.944167; ON: 229.005834

Methodology:
- Each benchmark toggles `PerfFlags.REGEX_CACHE` inside a single test and prints `[DEBUG_LOG]` timings for OFF and ON runs under identical conditions. We recorded one set of OFF/ON timings here; we can extend to 3× medians if required for publication.
- The cache is a tiny size-bounded map (64 entries) activated only when `PerfFlags.REGEX_CACHE` is true. Defaults remain OFF.

## JIT tweaks (Round 1): quick gains snapshot (locals, ranges, list ops)

Date: 2025-11-11 21:05 (local)

Scope: fast confirmation of overall gain using current configuration; focused on locals, ranges, and list ops. Each test prints OFF → ON timings in a single run. We executed the benches via Gradle with stdout enabled and single test fork.

Environment:
- Gradle: 8.7 (stdout enabled, maxParallelForks=1)
- JVM: as configured by toolchain for this project
- OS/Arch: per developer machine (unchanged from prior sections)

Reproduce:
```
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
```

Results (representative runs; OFF → ON):
- Local variables: LOCAL_SLOT_PIC + EMIT_FAST_LOCAL_REFS
  - Run 1: 468.407 ms → 367.277 ms (≈ 1.28×)
  - Run 2: 447.031 ms → 346.126 ms (≈ 1.29×)
- Ranges for‑in: PRIMITIVE_FASTOPS (tight counted loop for (Int..Int))
  - 1731.780 ms → 799.023 ms (≈ 2.17×)
- List ops: PRIMITIVE_FASTOPS
  - sum(int list): 318.943 ms → 148.571 ms (≈ 2.15×)
  - contains(int in int list): 440.013 ms → 412.450 ms (≈ 1.07×)

Summary: All three areas improved with optimizations ON; no regressions observed in these runs. For publication‑grade stability, run each test 3× and report medians (see the earlier sections for methodology and previous median tables).

## Additional tweaks: verification snapshot (Index write fast‑path, List literal pre‑size, Regex LRU)

Date: 2025-11-11 21:31 (local)

Scope: Implemented three semantics‑neutral optimizations and verified they are green across targeted and broader JVM benches.

What changed (guarded by flags where applicable):
- RVAL_FASTPATH: Index write fast‑path
  - `IndexRef.setAt`: direct path for `ObjList` + `ObjInt` (`list[i] = value`) mirrors the read fast‑path. Optional chaining semantics preserved; bounds exceptions propagate unchanged.
- RVAL_FASTPATH: List literal pre‑sizing
  - `ListLiteralRef.get`: pre‑counts element entries and uses `ArrayList` with capacity hint; for spreads of `ObjList`, uses `ensureCapacity` before bulk add. Evaluation order unchanged.
- REGEX_CACHE: LRU‑like behavior
  - `RegexCache`: emulates access‑order LRU within a tiny bounded map (`MAX=64`) by moving accessed entries to the tail; improves alternating‑pattern scenarios. Only active when `PerfFlags.REGEX_CACHE` is true.

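
The "move accessed entries to the tail" idea can be shown with a short sketch built on an insertion‑ordered map. `TinyRegexCache` is a hypothetical class written for this example and is not the Lyng `RegexCache`; only the bound of 64 entries is taken from the description above.

```kotlin
// Tiny bounded cache emulating access-order LRU on an insertion-ordered map:
// a hit removes the entry and re-inserts it at the tail; eviction drops the head.
class TinyRegexCache(private val max: Int = 64) {
    private val map = LinkedHashMap<String, Regex>()

    val size: Int get() = map.size

    fun get(pattern: String): Regex {
        val cached = map.remove(pattern)
        val regex = cached ?: Regex(pattern)
        if (cached == null && map.size >= max) {
            map.remove(map.keys.first()) // evict the least recently used entry
        }
        map[pattern] = regex // (re)insert at the tail = most recently used
        return regex
    }
}

fun main() {
    val cache = TinyRegexCache(max = 2)
    val a1 = cache.get("a+")
    cache.get("b+")
    val a2 = cache.get("a+")        // hit: "a+" moves to the tail
    cache.get("c+")                 // evicts "b+", the least recently used
    println(a1 === a2)              // true: the compiled instance was reused
    println(cache.get("a+") === a1) // true: "a+" survived the eviction
    println(cache.size)             // 2
}
```

Remove-and-reinsert works on any insertion-ordered map, which may matter for multiplatform code where an access-ordered `LinkedHashMap` constructor is not available on every target.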
Reproduce quick verification (1× runs):
```
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RegexBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ConcurrencyCallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests MultiThreadPoolingStressJvmTest --rerun-tasks
```

Observation: All listed tests green in this cycle; no behavioral regressions observed. For the new paths (index write, list literal), performance was neutral‑to‑positive in smoke runs; Regex benches remained positive or neutral with the LRU behavior. For publication‑grade medians, extend to 3× per test as in earlier sections.

## Sanity matrix (JVM) — quick OFF→ON runs
|
|
|
|
Date: 2025-11-11 21:59 (local)
|
|
|
|
Scope: Final Round 1 sanity sweep across JVM micro‑benches and stress tests to confirm that optimizations ON do not regress performance vs OFF in representative scenarios. Each benchmark prints `[DEBUG_LOG] [BENCH]` timings for OFF → ON within a single run. This section records a quick pass confirmation (not 3× medians) and reproduction commands.

Environment:

- Gradle: 8.7 (stdout enabled, maxParallelForks=1)
- JVM: as configured by the project toolchain
- OS/Arch: macOS 14.x (aarch64)
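
For readers reproducing this environment, the two Gradle settings named above could be expressed in a Kotlin DSL build script roughly as follows. This is a sketch of one possible configuration, not a copy of the project's build file; where and how the project actually sets these options is an assumption.

```kotlin
// build.gradle.kts (fragment): applies to every Test task, which is
// expected to include the JVM test task of a multiplatform module.
tasks.withType<Test>().configureEach {
    // Run test JVMs one at a time so benchmarks do not compete for cores.
    maxParallelForks = 1
    testLogging {
        // Forward stdout/stderr of the tests (the benchmark printouts) to the Gradle console.
        showStandardStreams = true
    }
}
```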

Benches covered (all green; no regressions observed in these runs):

- Calls/Args: `CallBenchmarkTest`, `CallMixedArityBenchmarkTest` (ARG_BUILDER)
- PICs: `PicBenchmarkTest` (field/method); `PicInvalidationJvmTest` correctness reconfirmed
- Expressions/Arithmetic: `ExpressionBenchmarkTest`, `ArithmeticBenchmarkTest` (RVAL_FASTPATH, PRIMITIVE_FASTOPS)
- Ranges: `RangeBenchmarkTest` (PRIMITIVE_FASTOPS counted loop)
- List ops: `ListOpsBenchmarkTest` (PRIMITIVE_FASTOPS specializations)
- Regex: `RegexBenchmarkTest` (REGEX_CACHE with LRU behavior)
- Locals: `LocalVarBenchmarkTest` (LOCAL_SLOT_PIC + EMIT_FAST_LOCAL_REFS)
- Concurrency/Pooling: `ConcurrencyCallBenchmarkTest`, `DeepPoolingStressJvmTest`, `MultiThreadPoolingStressJvmTest` (SCOPE_POOL per‑thread)

Reproduce (examples):

```
./gradlew :lynglib:jvmTest --tests CallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RegexBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ConcurrencyCallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests MultiThreadPoolingStressJvmTest --rerun-tasks
```

Summary:

- All listed tests passed in this sanity sweep.
- For each benchmark’s OFF → ON printouts examined during this pass, ON was equal to or faster than OFF; no case where ON was slower than OFF was observed.
- For publication‑grade numbers, use the 3× medians methodology outlined earlier in this document. The existing median tables in previous sections remain representative, and the additional tweaks (index write, list literal pre‑size, regex LRU, Field PIC 4‑way + read→write reuse, mixed Int/Real fast‑ops) remained neutral‑to‑positive.

## Quick snapshot — IndexRef PIC + negative miss cache (JVM) — 3× medians (OFF → ON)

Date: 2025-11-11 22:32 (local)

Scope

- Confirm that the latest changes (IndexRef read/write PIC, stacked on RVAL_FASTPATH, and safe catch‑and‑cache negative entries for Field/Method PICs) do not regress performance. We collected 3× medians for the two expression sub‑benches that are most sensitive to RVAL paths and cross‑checked PICs and ranges with 1× quick runs.

Environment

- Gradle: 8.7 (stdout enabled, maxParallelForks=1)
- JVM: project toolchain default
- OS/Arch: macOS 14.x (aarch64)

Results (3× medians)

| Area | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|------|-----------------|-----------------:|----------------:|:-------:|-------|
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkListIndexReads | 304.282 | 229.168 | 1.33× | IndexRef direct fast‑path for ObjList+ObjInt; 4‑way Index PIC handles polymorphic cases |
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 275.122 | 194.876 | 1.41× | Monomorphic, immutable receiver path; preserves visibility/optional semantics |
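
The Speedup column is the OFF median divided by the ON median, rounded to two decimals. A small sketch of that arithmetic follows; `median`, `speedup` and `round2` are illustrative helper names rather than anything from the Lyng code base, and the per‑run samples needed by `median` are not recorded in this document, so only the published medians are plugged in.

```kotlin
// Median of a non-empty list of timings (ms); for 3 runs this is the middle value.
fun median(samples: List<Double>): Double {
    require(samples.isNotEmpty()) { "need at least one sample" }
    val sorted = samples.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// Speedup as used in the table: values above 1.0 mean ON is faster than OFF.
fun speedup(offMs: Double, onMs: Double): Double = offMs / onMs

// Round to two decimals for display.
fun round2(x: Double): Double = kotlin.math.round(x * 100) / 100.0

// Medians taken from the table above.
val listIndexReads = round2(speedup(304.282, 229.168))       // 1.33
val fieldReadPureReceiver = round2(speedup(275.122, 194.876)) // 1.41
```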

Cross‑checks (from the same session, 1× quick)

- PicBenchmarkTest::benchmarkFieldGetSetPic: OFF 203.701 ms → ON 117.129 ms (≈1.74×)
- PicBenchmarkTest::benchmarkMethodPic: OFF 280.806 ms → ON 202.613 ms (≈1.39×)
- RangeBenchmarkTest::benchmarkIntRangeForIn: OFF 1762.425 ms → ON 806.898 ms (≈2.18×)

Reproduce

```
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks

./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
```

Notes

- Negative caches are installed only after a real miss throws (cache‑after‑miss). This preserves error semantics, and negative entries are still invalidated when `layoutVersion` changes.
- IndexRef PIC augments the existing direct path and uses move‑to‑front promotion; it is keyed on `(classId, layoutVersion)` like other PICs.
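
To make the move‑to‑front promotion and the `(classId, layoutVersion)` key concrete, here is a minimal sketch of a small polymorphic inline cache. It is not the IndexRef implementation: `TinyPic`, its `ways` parameter and the generic `handler` payload are hypothetical, and a real PIC would likely use fixed fields rather than a list.

```kotlin
// Hypothetical N-way inline cache keyed on (classId, layoutVersion).
// The most recently hit entry is kept at index 0.
class TinyPic<V>(private val ways: Int = 4) {
    private class Entry<V>(val classId: Long, val layoutVersion: Int, val handler: V)

    private val entries = ArrayList<Entry<V>>(ways)

    fun lookup(classId: Long, layoutVersion: Int): V? {
        for (i in entries.indices) {
            val e = entries[i]
            if (e.classId == classId && e.layoutVersion == layoutVersion) {
                if (i > 0) { // move-to-front promotion of the entry that just hit
                    entries.removeAt(i)
                    entries.add(0, e)
                }
                return e.handler
            }
        }
        return null // miss: the caller takes the slow path and may install()
    }

    fun install(classId: Long, layoutVersion: Int, handler: V) {
        // When full, drop the entry at the back, which was hit longest ago.
        if (entries.size == ways) entries.removeAt(entries.size - 1)
        entries.add(0, Entry(classId, layoutVersion, handler))
    }

    // Exposed for inspection in examples and tests only.
    fun keysInOrder(): List<Pair<Long, Int>> = entries.map { it.classId to it.layoutVersion }
}
```

Because `layoutVersion` is part of the key, a class whose layout changed simply stops matching its old entry; no explicit flush is needed in this sketch, and the stale entry ages out at the back.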
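
The cache‑after‑miss rule for negative entries can be sketched as follows. All names here (`Shape`, `FieldSite`, `NoSuchMember`, `slowLookups`) are hypothetical and the receiver model is deliberately simplified; the point is only that the negative entry is recorded after the real lookup has failed, that the cached path raises the same kind of error, and that a `layoutVersion` change makes the entry stop matching.

```kotlin
class NoSuchMember(name: String) : RuntimeException("no such member: $name")

// Hypothetical receiver: the owner is assumed to bump layoutVersion
// whenever the set of members changes.
class Shape(val classId: Long, var layoutVersion: Int, val members: MutableMap<String, Any?>)

// One call site reading a named field, with a single negative cache entry.
class FieldSite(private val name: String) {
    private var negValid = false
    private var negClassId = 0L
    private var negLayoutVersion = 0

    var slowLookups = 0 // counts full lookups, for illustration only
        private set

    fun read(target: Shape): Any? {
        // Negative hit: this (classId, layoutVersion) is already known to lack
        // the member, so fail the same way without repeating the full lookup.
        if (negValid && target.classId == negClassId && target.layoutVersion == negLayoutVersion) {
            throw NoSuchMember(name)
        }
        slowLookups++
        if (name !in target.members) {
            // Install the negative entry only after a real miss (cache-after-miss).
            negValid = true
            negClassId = target.classId
            negLayoutVersion = target.layoutVersion
            throw NoSuchMember(name)
        }
        return target.members[name]
    }
}
```

In this sketch a repeated failing read skips the full lookup, while adding the member and bumping `layoutVersion` makes the next read go through the normal path again.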