lyng/docs/perf_guide.md


This document explains how to enable and measure the performance optimizations added to the Lyng interpreter. The focus is JVM‑first with safe, flag‑guarded rollouts and quick A/B testing. Other targets (JS/Wasm/Native) keep conservative defaults until validated.

[//]: # (excludeFromIndex)

## Overview

Optimizations are controlled by runtime‑mutable flags in `net.sergeych.lyng.PerfFlags`, initialized from platform‑specific static defaults `net.sergeych.lyng.PerfDefaults` (KMP `expect/actual`).

- JVM/Android defaults are aggressive (e.g. `RVAL_FASTPATH=true`).
- Non‑JVM defaults are conservative (e.g. `RVAL_FASTPATH=false`).

All flags are `var` and can be flipped at runtime (e.g., from tests or host apps) for A/B comparisons.

### Workload presets (JVM‑first)

To simplify switching between recommended flag sets for different workloads, use `net.sergeych.lyng.PerfProfiles`:

```
val snap = PerfProfiles.apply(PerfProfiles.Preset.BENCH)  // or BASELINE / BOOKS
// ... run workload ...
PerfProfiles.restore(snap)  // restore previous flags
```

- BASELINE: restores platform defaults from `PerfDefaults` (good rollback point).
- BENCH: expression‑heavy micro‑bench focus (aggressive R‑value and PIC optimizations on JVM).
- BOOKS: documentation workloads (prefers simpler paths; disables some PIC/arg builder features shown neutral/negative for this load in A/B).

## Key flags

- `LOCAL_SLOT_PIC` — Runtime cache in `LocalVarRef` to avoid repeated name→slot lookups per frame (ON JVM default).
- `EMIT_FAST_LOCAL_REFS` — Compiler emits `FastLocalVarRef` for identifiers known to be locals/params (ON JVM default).
- `ARG_BUILDER` — Efficient argument building: small‑arity no‑alloc and pooled builder on JVM (ON JVM default).
- `ARG_SMALL_ARITY_12` — Extends small‑arity no‑alloc call paths from 0–8 to 0–12 arguments (JVM‑first exploration; OFF by default). Use for codebases with many 9–12 arg calls; A/B before enabling.
- `SKIP_ARGS_ON_NULL_RECEIVER` — Early return on optional‑null receivers before building args (semantics‑compatible). A/B only.
- `SCOPE_POOL` — Scope frame pooling for calls (JVM, per‑thread ThreadLocal pool). ON by default on JVM; togglable at runtime.
- `FIELD_PIC` — 2‑entry polymorphic inline cache for field reads/writes keyed by `(classId, layoutVersion)` (ON JVM default).
- `METHOD_PIC` — 2‑entry PIC for instance method calls keyed by `(classId, layoutVersion)` (ON JVM default).
- `FIELD_PIC_SIZE_4` — Increases Field PIC size from 2 to 4 entries (JVM-first tuning; OFF by default). Use for sites with >2 receiver shapes.
- `METHOD_PIC_SIZE_4` — Increases Method PIC size from 2 to 4 entries (JVM-first tuning; OFF by default).
- `PIC_ADAPTIVE_2_TO_4` — Adaptive growth of Field/Method PICs from 2→4 entries per-site when miss rate >20% over ≥256 accesses (JVM-first; OFF by default).
- `INDEX_PIC` — Enables polymorphic inline cache for indexing (e.g., `a[i]`) and related fast paths. Defaults to follow `FIELD_PIC` on init; can be toggled independently.
- `INDEX_PIC_SIZE_4` — Increases Index PIC size from 2 to 4 entries (JVM-first tuning). Default: ON for JVM; OFF elsewhere by default.
- `PIC_DEBUG_COUNTERS` — Enable lightweight hit/miss counters via `PerfStats` (OFF by default).
- `PRIMITIVE_FASTOPS` — Fast paths for `(ObjInt, ObjInt)` arithmetic/comparisons and `(ObjBool, ObjBool)` logic (ON JVM default).
- `RVAL_FASTPATH` — Bypass `ObjRecord` in pure expression evaluation via `ObjRef.evalValue` (ON JVM default, OFF elsewhere).

See `src/commonMain/kotlin/net/sergeych/lyng/PerfFlags.kt` and `PerfDefaults.*.kt` for details and platform defaults.

## Where optimizations apply

- Locals: `FastLocalVarRef`, `LocalVarRef` per‑frame cache (PIC).
- Calls: small‑arity zero‑alloc paths (0–8 args; optionally 0–12 with `ARG_SMALL_ARITY_12`), pooled builder (JVM), and child frame pooling (optional).
- Properties/methods: Field/Method PICs with receiver shape `(classId, layoutVersion)` and handle‑aware caches; configurable 2→4 entries under flags.
- Expressions: R‑value fast paths in hot nodes (`UnaryOpRef`, `BinaryOpRef`, `ElvisRef`, logical ops, `RangeRef`, `IndexRef` read, `FieldRef` receiver eval, `ListLiteralRef` elements, `CallRef` callee, `MethodCallRef` receiver, assignment RHS).
- Primitives: Direct boolean/int ops where safe.

### Compiler constant folding (conservative)
- The compiler folds a safe subset of literal‑only expressions at compile time to reduce runtime work:
  - Integer arithmetic: `+ - * / %` (division/modulo only when divisor ≠ 0).
  - Bitwise integer ops: `& ^ | << >>`.
  - Comparisons and equality for ints/strings/chars: `== != < <= > >=`.
  - Boolean logic for literal booleans: `|| &&` and unary `!`.
  - String concatenation of literal strings: `"a" + "b"`.
- Non‑literal operands or side‑effecting constructs are not folded.
- Semantics remain unchanged; tests verify parity.

## Running JVM micro‑benchmarks

Each benchmark prints timings with `[DEBUG_LOG]` and includes correctness assertions to prevent dead‑code elimination.

Run individual tests to avoid multiplatform matrices:

```
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallSplatBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest
./gradlew :lynglib:jvmTest --tests PicAdaptiveABTest
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest
./gradlew :lynglib:jvmTest --tests IndexPicABTest
./gradlew :lynglib:jvmTest --tests IndexWritePathABTest
./gradlew :lynglib:jvmTest --tests CallPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MethodPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MixedBenchmarkTest
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest
```

Typical output (example):

```
[DEBUG_LOG] [BENCH] mixed-arity x200000 [ARG_BUILDER=ON]: 85.7 ms
```

Lower time is better. Run the same bench with a flag OFF vs ON to compare.

### Optional JFR allocation profiling (JVM)

When running end‑to‑end “book” workloads or heavier benches, you can enable JFR to capture allocation and GC details:

```
./gradlew :lynglib:jvmTest --tests BookAllocationProfileTest -Dlyng.jfr=true \
  -Dlyng.profile.warmup=1 -Dlyng.profile.repeats=3 -Dlyng.profile.shuffle=true
```

- Dumps are saved to `lynglib/build/jfr_*.jfr` if the JVM supports Flight Recorder.
- The test also records GC counts/time and median time/heap deltas to `lynglib/build/book_alloc_profile.txt`.

## Toggling flags in tests

Flags are mutable at runtime, e.g.:

```kotlin
PerfFlags.ARG_BUILDER = false
val r1 = (Scope().eval(script) as ObjInt).value
PerfFlags.ARG_BUILDER = true
val r2 = (Scope().eval(script) as ObjInt).value
```

Reset flags at the end of a test to avoid impacting other tests.

## PIC diagnostics (optional)

Enable counters:

```kotlin
PerfFlags.PIC_DEBUG_COUNTERS = true
PerfStats.resetAll()
```

Available counters in `PerfStats`:

- Field PIC: `fieldPicHit`, `fieldPicMiss`, `fieldPicSetHit`, `fieldPicSetMiss`
- Method PIC: `methodPicHit`, `methodPicMiss`
- Index PIC: `indexPicHit`, `indexPicMiss`
- Locals: `localVarPicHit`, `localVarPicMiss`, `fastLocalHit`, `fastLocalMiss`
- Primitive ops: `primitiveFastOpsHit`

Print a summary at the end of a bench/test as needed. Remember to turn counters OFF after the test.

## A/B scenarios and guidance (JVM)

### Adaptive PIC (fields/methods)
- Flags: `FIELD_PIC=true`, `METHOD_PIC=true`, `FIELD_PIC_SIZE_4=false`, `METHOD_PIC_SIZE_4=false`, toggle `PIC_ADAPTIVE_2_TO_4` OFF vs ON.
- Benchmarks: `PicBenchmarkTest`, `MixedBenchmarkTest`, `PicAdaptiveABTest`.
- Expect wins at sites with >2 receiver shapes; counters should show fewer misses with adaptivity ON.

### Index PIC and size
- Flags: toggle `INDEX_PIC` OFF vs ON; then `INDEX_PIC_SIZE_4` OFF vs ON.
- Benchmarks: `ExpressionBenchmarkTest` (list indexing) and `IndexPicABTest` (string/map indexing).
- Expect wins when the same index shape recurs; counters should show higher `indexPicHit`.

### Index WRITE paths (Map and List)
- Flags: toggle `INDEX_PIC` OFF vs ON; then `INDEX_PIC_SIZE_4` OFF vs ON.
- Benchmark: `IndexWritePathABTest` (Map[String] put, List[Int] set) — writes results to `lynglib/build/index_write_ab_results.txt`.
- Direct fast‑paths are used on R‑value paths where safe and semantics‑preserving (e.g., optional‑chaining no‑ops on null receivers; bounds exceptions unchanged).

## Guidance per flag (JVM)

- Keep `RVAL_FASTPATH = true` unless debugging a suspected expression‑semantics issue.
- Use `SCOPE_POOL = true` only for benchmarks or once pooling passes the deep stress tests and broader validation; currently OFF by default.
- `FIELD_PIC` and `METHOD_PIC` should remain ON; they are validated with invalidation tests.
- Consider enabling `FIELD_PIC_SIZE_4`/`METHOD_PIC_SIZE_4` for sites with 3–4 receiver shapes; measure first.
- `PIC_ADAPTIVE_2_TO_4` is useful on polymorphic sites and may outperform fixed size 2 on mixed-shape workloads. Validate with `PicAdaptiveABTest`.
- `INDEX_PIC` is generally beneficial on JVM; leave ON when measuring index‑heavy workloads.
- `INDEX_PIC_SIZE_4` is ON by default on JVM as A/B showed consistent wins on String[Int] and Map[String] workloads. You can disable it by setting `PerfFlags.INDEX_PIC_SIZE_4 = false` if needed.
- `ARG_BUILDER` should remain ON; switch OFF only to get a baseline.
- `ARG_SMALL_ARITY_12` is experimental and OFF by default. Enable it only if your workload frequently calls functions with 9–12 arguments and A/B shows consistent wins.

### Workload‑specific recommendations (JVM)

- “Books”/documentation loads (BookTest): prefer simpler paths; in A/B these often benefit from the BOOKS preset (e.g., `ARG_BUILDER=false`, `SCOPE_POOL=false`, `INDEX_PIC=false`). Use `PerfProfiles.apply(PerfProfiles.Preset.BOOKS)` before the run and `restore(...)` after.
- Expression‑heavy benches: use the BENCH preset (PICs and R‑value fast‑paths enabled, `INDEX_PIC_SIZE_4=true`).
- Always verify with local A/B on your environment; rollback is a flag flip or applying BASELINE.

## Notes on correctness & safety

- Optional chaining semantics are preserved across fast paths.
- Visibility/mutability checks are enforced even on PIC fast‑paths.
- `frameId` is regenerated on each pooled frame borrow; stress tests verify no leakage under deep nesting/recursion.

## Cross‑platform

- Non‑JVM defaults keep `RVAL_FASTPATH=false` for now; other low‑risk flags may be ON.
- Once JVM path is fully validated and measured, add lightweight benches for JS/Wasm/Native and enable flags incrementally.

## Range fast iteration (experimental)

- Flag: `RANGE_FAST_ITER` (default OFF).
- When enabled and applicable, simple ascending integer ranges (`0..n`, `0..<n`) use a specialized non‑allocating iterator (`ObjFastIntRangeIterator`).
- Benchmark: `RangeIterationBenchmarkTest` records OFF/ON timings for inclusive, exclusive, reversed, negative, and empty ranges. Semantics are preserved; non‑int or complex ranges fall back to the generic iterator.

## Troubleshooting

- If a benchmark shows regressions, flip related flags OFF to isolate the source (e.g., `ARG_BUILDER`, `RVAL_FASTPATH`, `FIELD_PIC`, `METHOD_PIC`).
- Use `PIC_DEBUG_COUNTERS` to observe inline cache effectiveness.
- Ensure tests do not accidentally keep flags ON for subsequent tests; reset after each test.


## JVM micro-benchmark results (3× medians; OFF → ON)

Date: 2025-11-10 23:04 (local)

| Flag               | Benchmark/Test                              | OFF median (ms) | ON median (ms) | Speedup | Notes |
|--------------------|----------------------------------------------|-----------------:|----------------:|:-------:|-------|
| ARG_BUILDER        | CallMixedArityBenchmarkTest                   |           788.02 |          668.79 |  1.18×  | Clear win on mixed arity |
| ARG_BUILDER        | CallBenchmarkTest (simple calls)              |           423.87 |          425.47 |  1.00×  | Neutral on repeated simple calls |
| FIELD_PIC          | PicBenchmarkTest::benchmarkFieldGetSetPic     |           113.575 |          106.017 |  1.07×  | Small but consistent win |
| METHOD_PIC         | PicBenchmarkTest::benchmarkMethodPic          |           251.068 |          149.439 |  1.68×  | Large consistent win |
| RVAL_FASTPATH      | ExpressionBenchmarkTest                       |           514.491 |          426.800 |  1.21×  | Consistent win in expression chains |
| PRIMITIVE_FASTOPS  | ArithmeticBenchmarkTest (int-sum)             |           243.420 |          128.146 |  1.90×  | Big win for integer addition |
| PRIMITIVE_FASTOPS  | ArithmeticBenchmarkTest (int-cmp)             |           210.385 |          168.534 |  1.25×  | Moderate win for comparisons |
| SCOPE_POOL         | CallPoolingBenchmarkTest                      |           505.778 |          366.737 |  1.38×  | Single-threaded bench; per-thread ThreadLocal pool; default ON on JVM |

Notes:
- All results obtained from `[DEBUG_LOG] [BENCH]` outputs with three repeated Gradle test invocations per configuration; medians reported.
- JVM defaults (current): `ARG_BUILDER=true`, `PRIMITIVE_FASTOPS=true`, `RVAL_FASTPATH=true`, `FIELD_PIC=true`, `METHOD_PIC=true`, `SCOPE_POOL=true` (per‑thread ThreadLocal pool), `REGEX_CACHE=true`.


## Concurrency (multi‑core) pooling results (3× medians; OFF → ON)

Date: 2025-11-10 22:56 (local)

| Flag       | Benchmark/Test                      | OFF median (ms) | ON median (ms) | Speedup | Notes |
|------------|--------------------------------------|-----------------:|----------------:|:-------:|-------|
| SCOPE_POOL | ConcurrencyCallBenchmarkTest (JVM)   |           521.102 |          201.374 |  2.59×  | Multithreaded workload on `Dispatchers.Default` with per‑thread ThreadLocal pool; workers=8, iters=15000/worker. |

Methodology:
- The test toggles `PerfFlags.SCOPE_POOL` within a single run and executes the same script across N worker coroutines scheduled on `Dispatchers.Default`.
- We executed the test three times via Gradle and computed medians from the printed `[DEBUG_LOG]` timings:
  - OFF runs (ms): 532.442 | 521.102 | 474.386 → median 521.102
  - ON runs (ms):  218.683 | 201.374 | 198.737 → median 201.374
- Speedup = OFF/ON.

Reproduce:
```
./gradlew :lynglib:jvmTest --tests "ConcurrencyCallBenchmarkTest" --rerun-tasks
```


## Next optimization steps (JVM)

Date: 2025-11-10 23:04 (local)

- PICs
  - Widen METHOD_PIC to 3–4 entries with tiny LRU; keep invalidation on layout change; re-run `PicInvalidationJvmTest`.
  - Micro fast-path for FIELD_PIC read-then-write pairs (`x = x + 1`) to reuse the resolved slot within one step.
- Locals and slots
  - Pre-size `Scope` slot structures when compiler knows local/param counts; audit `EMIT_FAST_LOCAL_REFS` coverage.
  - Re-run `LocalVarBenchmarkTest` to quantify gains.
- RVAL_FASTPATH coverage
  - Cover primitive `ObjList` index reads, pure receivers in `FieldRef`, and assignment RHS where safe; add micro-benches to `ExpressionBenchmarkTest`.
- Collections and ranges
  - Specialize `(Int..Int)` loops into tight counted loops (no intermediary objects).
  - Add primitive-specialized `ObjList` ops (`map`, `filter`, `sum`, `contains`) under `PRIMITIVE_FASTOPS`.
- Regex and strings
  - Cache compiled regex for string literals at compile time; add a tiny LRU for dynamic patterns behind `REGEX_CACHE`.
  - Add `RegexBenchmarkTest` for repeated matches.
- JIT friendliness (Kotlin/JVM)
  - Inline tiny helpers in hot paths, prefer arrays for internal buffers, finalize hot data structures where safe.

Validation matrix
- Always re-run: `CallBenchmarkTest`, `CallMixedArityBenchmarkTest`, `PicBenchmarkTest`, `ExpressionBenchmarkTest`, `ArithmeticBenchmarkTest`, `CallPoolingBenchmarkTest`, `DeepPoolingStressJvmTest`, `ConcurrencyCallBenchmarkTest` (3× medians when comparing).
- Keep full `:lynglib:jvmTest` green after each change.


## PIC update (4‑way METHOD_PIC) — JVM (3× medians; OFF → ON)

Date: 2025-11-11 00:16 (local)

| Flag      | Benchmark/Test                               | OFF median (ms) | ON median (ms) | Speedup | Notes |
|-----------|-----------------------------------------------|-----------------:|----------------:|:-------:|-------|
| FIELD_PIC | PicBenchmarkTest::benchmarkFieldGetSetPic      |          207.578 |         106.481 |  1.95×  | Read→write loop; micro fast‑path groundwork present |
| METHOD_PIC| PicBenchmarkTest::benchmarkMethodPic           |          273.478 |         182.226 |  1.50×  | 4‑way PIC with move‑to‑front (was 2‑way before) |

Medians computed from three Gradle runs in this session; see `[DEBUG_LOG] [BENCH]` lines in test output.


## Locals/slots capacity (pre‑sizing hints) — JVM (3× medians; OFF → ON)

Date: 2025-11-11 13:19 (local)

| Optimization            | Benchmark/Test              | OFF config                         | ON config                          | OFF median (ms) | ON median (ms) | Speedup | Notes |
|-------------------------|-----------------------------|------------------------------------|------------------------------------|-----------------:|----------------:|:-------:|-------|
| Locals pre‑sizing + PIC | LocalVarBenchmarkTest       | LOCAL_SLOT_PIC=OFF, FAST_LOCAL=OFF | LOCAL_SLOT_PIC=ON, FAST_LOCAL=ON   |          472.129 |         370.871 |  1.27×  | Compiler hint `params+4`; slot pre‑size; semantics unchanged |

Methodology:
- Each configuration executed three times via `:lynglib:jvmTest --tests "…" --rerun-tasks`; medians reported.
- Locals improvement stacks with per‑thread `SCOPE_POOL` and ARG fast paths.


## RVAL fast paths update — JVM (IndexRef and FieldRef) [3× medians; OFF → ON]

Date: 2025-11-11 13:19 (local)

New micro-benchmarks have been added to quantify the latest `RVAL_FASTPATH` extensions:
- Primitive `ObjList` index-read fast path in `IndexRef`.
- Conservative “pure receiver” evaluation in `FieldRef` (monomorphic, immutable receiver), preserving visibility/mutability checks and optional chaining semantics.

Benchmarks to run (each 3× OFF → ON):
- `ExpressionBenchmarkTest::benchmarkListIndexReads`
- `ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver`

Reproduce (3× each; collect `[DEBUG_LOG] [BENCH]` lines and compute medians):
```
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
```

Once collected, add medians and speedups to the table below:

| Flag          | Benchmark/Test                                  | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------|---------------------------------------------------|-----------------:|----------------:|:-------:|-------|
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkListIndexReads  |           305.243 |          230.942 |  1.32×  | Fast path in `IndexRef` for `ObjList` + `ObjInt` index |
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver |           266.222 |          190.720 |  1.40×  | Pure-receiver evaluation in `FieldRef` (monomorphic, immutable) |

Notes:
- Both benches toggle `PerfFlags.RVAL_FASTPATH` within a single run to produce OFF and ON timings under identical conditions.
- Correctness assertions ensure the loops are not optimized away.
- All semantics (visibility/mutability checks, optional chaining) remain intact; fast paths only skip interim `ObjRecord` traffic when safe.


## ARG_BUILDER — splat fast‑path (3× medians; OFF → ON)

Date: 2025-11-11 13:12 (local)

Environment: Gradle 8.7; JVM (JDK as configured by toolchain); single‑threaded test execution; stdout enabled.

| Flag        | Benchmark/Test                    | OFF median (ms) | ON median (ms) | Speedup | Notes |
|-------------|-----------------------------------|-----------------:|----------------:|:-------:|-------|
| ARG_BUILDER | CallSplatBenchmarkTest (splat)    |          613.689 |         463.593 |  1.32×  | Single‑splat fast‑path returns underlying list directly; avoids intermediate copies |

Inputs (3×):
- OFF runs (ms): 613.689 | 629.604 | 612.361 → median 613.689
- ON runs (ms):  453.752 | 463.593 | 468.844 → median 463.593

Reproduce (3×):
```
./gradlew :lynglib:jvmTest --tests "CallSplatBenchmarkTest" --rerun-tasks
```


## Phase A consolidation (JVM) — 3× medians updated

Date: 2025-11-11 13:48 (local)
Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

### ARG_BUILDER

| Benchmark/Test                   | OFF median (ms) | ON median (ms) | Speedup | Notes |
|----------------------------------|-----------------:|----------------:|:-------:|-------|
| CallMixedArityBenchmarkTest      |          866.681 |         717.439 |  1.21×  | Small-arity 0–8 fast path + builder; correctness preserved |
| CallSplatBenchmarkTest (splat)   |          600.880 |         459.706 |  1.31×  | Single-splat fast path returns underlying list; avoids copies |

Inputs (3×):
- Mixed arity OFF: 874.088291 | 866.680959 | 858.577125 → median 866.680959
- Mixed arity ON:  731.308625 | 706.440125 | 717.438542 → median 717.438542
- Splat OFF: 600.268625 | 607.849416 | 600.879666 → median 600.879666
- Splat ON:  459.706375 | 449.950166 | 461.815167 → median 459.706375

### RVAL_FASTPATH (new coverage)

| Benchmark/Test                                   | OFF median (ms) | ON median (ms) | Speedup | Notes |
|--------------------------------------------------|-----------------:|----------------:|:-------:|-------|
| ExpressionBenchmarkTest::benchmarkListIndexReads |          299.366 |         218.812 |  1.37×  | IndexRef fast path for ObjList + ObjInt |
| ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver |          268.315 |         186.032 |  1.44×  | Pure-receiver evaluation in FieldRef (monomorphic, immutable) |

Inputs (3×):
- ListIndex OFF: 291.344 | 310.717167 | 299.365709 → median 299.365709
- ListIndex ON:  217.795375 | 221.504166 | 218.812042 → median 218.812042
- FieldRead OFF: 267.2775 | 274.355208 | 268.315125 → median 268.315125
- FieldRead ON:  189.599333 | 186.031791 | 182.069167 → median 186.031791

### Locals/slots capacity (precise hints)

| Benchmark/Test             | OFF config                         | ON config                          | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------------------|------------------------------------|------------------------------------|-----------------:|----------------:|:-------:|-------|
| LocalVarBenchmarkTest     | LOCAL_SLOT_PIC=OFF, FAST_LOCAL=OFF | LOCAL_SLOT_PIC=ON, FAST_LOCAL=ON   |          446.018 |         347.964 |  1.28×  | Precise capacity hints + fast-locals coverage |

Inputs (3×):
- Locals OFF: 470.575041 | 441.89625 | 446.017833 → median 446.017833
- Locals ON:  370.664208 | 345.615541 | 347.964291 → median 347.964291

Methodology:
- Each test executed three times via Gradle with stdout enabled; medians computed from `[DEBUG_LOG] [BENCH]` lines.
- Full JVM tests and stress benches remain green in this cycle.


## Phase B — List ops specialization (PRIMITIVE_FASTOPS) — 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)
Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

| Optimization        | Benchmark/Test                          | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------------|------------------------------------------|-----------------:|----------------:|:-------:|-------|
| PRIMITIVE_FASTOPS   | ListOpsBenchmarkTest::benchmarkSumInts   |          324.805 |         144.908 |  2.24×  | ObjList.sum fast path for int lists; generic fallback preserved |
| PRIMITIVE_FASTOPS   | ListOpsBenchmarkTest::benchmarkContainsInts |          440.414 |         415.476 |  1.06×  | ObjList.contains fast path when searching ObjInt in int list |

Inputs (3×):
- list-sum OFF: 332.863417 | 323.491625 | 324.804083 → median 324.804083
- list-sum ON:  144.907833 | 148.870792 | 126.418542 → median 144.907833
- list-contains OFF: 440.413709 | 440.368333 | 441.4365 → median 440.413709
- list-contains ON:  416.465292 | 412.283291 | 415.475833 → median 415.475833

Methodology:
- Each test executed three times via Gradle; medians computed from `[DEBUG_LOG] [BENCH]` lines.
- Changes are fully guarded by `PerfFlags.PRIMITIVE_FASTOPS`; semantics preserved (null on empty sum; generic fallback on mixed types).


### Phase B — Ranges for-in lowering (PRIMITIVE_FASTOPS) — 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)
Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

| Optimization        | Benchmark/Test                          | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---------------------|------------------------------------------|-----------------:|----------------:|:-------:|-------|
| PRIMITIVE_FASTOPS   | RangeBenchmarkTest::benchmarkIntRangeForIn |         1705.299 |         788.974 |  2.16×  | Tight counted loop for (Int..Int) for-in; preserves semantics |

Inputs (3×):
- range-for-in OFF: 1705.298958 | 1684.357708 | 1735.880917 → median 1705.298958
- range-for-in ON:  794.178458 | 778.741834 | 788.973625 → median 788.973625

Methodology:
- Each configuration executed three times via Gradle; medians computed from `[DEBUG_LOG] [BENCH]` lines.
- Lowering is guarded by `PerfFlags.PRIMITIVE_FASTOPS` and applies only when the source is an `ObjRange` with int bounds; otherwise falls back to generic iteration.


## Phase B — Regex caching (REGEX_CACHE) — 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)
Environment:
- JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
- Gradle: 8.7
- OS/Arch: macOS 14.8.1 (aarch64)

| Flag         | Benchmark/Test                                  | OFF median (ms) | ON median (ms) | Speedup | Notes |
|--------------|---------------------------------------------------|-----------------:|----------------:|:-------:|-------|
| REGEX_CACHE  | RegexBenchmarkTest::benchmarkLiteralPatternMatches |          378.246 |         275.890 |  1.37×  | Caches compiled regex for identical literal pattern per iteration |
| REGEX_CACHE  | RegexBenchmarkTest::benchmarkDynamicPatternMatches |          514.944 |         229.006 |  2.25×  | Two dynamic patterns alternate; cache size sufficient to retain both |

Inputs (1× here; can extend to 3× on request):
- regex-literal OFF: 378.245916; ON: 275.889541
- regex-dynamic OFF: 514.944167; ON: 229.005834

Methodology:
- Each benchmark toggles `PerfFlags.REGEX_CACHE` inside a single test and prints `[DEBUG_LOG]` timings for OFF and ON runs under identical conditions. We recorded one set of OFF/ON timings here; we can extend to 3× medians if required for publication.
- The cache is a tiny size-bounded map (64 entries) activated only when `PerfFlags.REGEX_CACHE` is true. Defaults remain OFF.


## JIT tweaks (Round 1) — quick gains snapshot (locals, ranges, list ops)

Date: 2025-11-11 21:05 (local)

Scope: fast confirmation of overall gain using current configuration; focused on locals, ranges, and list ops. Each test prints OFF → ON timings in a single run. We executed the benches via Gradle with stdout enabled and single test fork.

Environment:
- Gradle: 8.7 (stdout enabled, maxParallelForks=1)
- JVM: as configured by toolchain for this project
- OS/Arch: per developer machine (unchanged from prior sections)

Reproduce:
```
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
```

Results (representative runs; OFF → ON):
- Local variables — LOCAL_SLOT_PIC + EMIT_FAST_LOCAL_REFS
  - Run 1: 468.407 ms → 367.277 ms (≈ 1.28×)
  - Run 2: 447.031 ms → 346.126 ms (≈ 1.29×)
- Ranges for‑in — PRIMITIVE_FASTOPS (tight counted loop for (Int..Int))
  - 1731.780 ms → 799.023 ms (≈ 2.17×)
- List ops — PRIMITIVE_FASTOPS
  - sum(int list): 318.943 ms → 148.571 ms (≈ 2.15×)
  - contains(int in int list): 440.013 ms → 412.450 ms (≈ 1.07×)

Summary: All three areas improved with optimizations ON; no regressions observed in these runs. For publication‑grade stability, run each test 3× and report medians (see sections below for methodology and previous median tables).


## Additional tweaks — verification snapshot (Index write fast‑path, List literal pre‑size, Regex LRU)

Date: 2025-11-11 21:31 (local)

Scope: Implemented three semantics‑neutral optimizations and verified they are green across targeted and broader JVM benches.

What changed (guarded by flags where applicable):
- RVAL_FASTPATH: Index write fast‑path
  - `IndexRef.setAt`: direct path for `ObjList` + `ObjInt` (`list[i] = value`) mirrors the read fast‑path. Optional chaining semantics preserved; bounds exceptions propagate unchanged.
- RVAL_FASTPATH: List literal pre‑sizing
  - `ListLiteralRef.get`: pre‑counts element entries and uses `ArrayList` with capacity hint; for spreads of `ObjList`, uses `ensureCapacity` before bulk add. Evaluation order unchanged.
- REGEX_CACHE: LRU‑like behavior
  - `RegexCache`: emulates access‑order LRU within a tiny bounded map (`MAX=64`) by moving accessed entries to the tail; improves alternating‑pattern scenarios. Only active when `PerfFlags.REGEX_CACHE` is true.

Reproduce quick verification (1× runs):
```
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RegexBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ConcurrencyCallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests MultiThreadPoolingStressJvmTest --rerun-tasks
```

Observation: All listed tests green in this cycle; no behavioral regressions observed. For the new paths (index write, list literal), performance was neutral‑to‑positive in smoke runs; Regex benches remained positive or neutral with the LRU behavior. For publication‑grade medians, extend to 3× per test as in earlier sections.


## Sanity matrix (JVM) — quick OFF→ON runs

Date: 2025-11-11 21:59 (local)

Scope: Final Round 1 sanity sweep across JVM micro‑benches and stress tests to confirm that optimizations ON do not regress performance vs OFF in representative scenarios. Each benchmark prints `[DEBUG_LOG] [BENCH]` timings for OFF → ON within a single run. This section records a quick pass confirmation (not 3× medians) and reproduction commands.

Environment:
- Gradle: 8.7 (stdout enabled, maxParallelForks=1)
- JVM: as configured by the project toolchain
- OS/Arch: macOS 14.x (aarch64)

Benches covered (all green; no regressions observed in these runs):
- Calls/Args: `CallBenchmarkTest`, `CallMixedArityBenchmarkTest` (ARG_BUILDER)
- PICs: `PicBenchmarkTest` (field/method); `PicInvalidationJvmTest` correctness reconfirmed
- Expressions/Arithmetic: `ExpressionBenchmarkTest`, `ArithmeticBenchmarkTest` (RVAL_FASTPATH, PRIMITIVE_FASTOPS)
- Ranges: `RangeBenchmarkTest` (PRIMITIVE_FASTOPS counted loop)
- List ops: `ListOpsBenchmarkTest` (PRIMITIVE_FASTOPS specializations)
- Regex: `RegexBenchmarkTest` (REGEX_CACHE with LRU behavior)
- Locals: `LocalVarBenchmarkTest` (LOCAL_SLOT_PIC + FAST_LOCAL)
- Concurrency/Pooling: `ConcurrencyCallBenchmarkTest`, `DeepPoolingStressJvmTest`, `MultiThreadPoolingStressJvmTest` (SCOPE_POOL per‑thread)

Reproduce (examples):
```
./gradlew :lynglib:jvmTest --tests CallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RegexBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ConcurrencyCallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests MultiThreadPoolingStressJvmTest --rerun-tasks
```

  Summary:
  - All listed tests passed in this sanity sweep.
  - For each benchmark’s OFF → ON printouts examined during this pass, ON was equal or faster than OFF; no ON<OFF regressions were observed.
  - For publication‑grade numbers, use the 3× medians methodology outlined earlier in this document. The existing median tables in previous sections remain representative, and the additional tweaks (Index write, List literal pre‑size, Regex LRU, Field PIC 4‑way + read→write reuse, mixed Int/Real fast‑ops) remained neutral‑to‑positive.


## Quick snapshot — IndexRef PIC + negative miss cache (JVM) — 3× medians (OFF → ON)

Date: 2025-11-11 22:32 (local)

Scope
- Confirm that the latest changes — IndexRef read/write PIC (stacked on RVAL_FASTPATH) and safe catch‑and‑cache negative entries for Field/Method PICs — do not regress performance. We collected 3× medians for the two expression sub‑benches that are most sensitive to RVAL paths and cross‑checked PICs and ranges.

Environment
- Gradle: 8.7 (stdout enabled, maxParallelForks=1)
- JVM: project toolchain default
- OS/Arch: macOS 14.x (aarch64)

Results (3× medians)

| Area | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|------|-----------------|-----------------:|----------------:|:-------:|-------|
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkListIndexReads | 304.282 | 229.168 | 1.33× | IndexRef direct fast‑path for ObjList+ObjInt; 4‑way Index PIC handles polymorphic cases |
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 275.122 | 194.876 | 1.41× | Monomorphic, immutable receiver path; preserves visibility/optional semantics |

Cross‑checks (from the same session, 1× quick)
- PicBenchmarkTest::benchmarkFieldGetSetPic — OFF 203.701 ms → ON 117.129 ms (≈1.74×)
- PicBenchmarkTest::benchmarkMethodPic — OFF 280.806 ms → ON 202.613 ms (≈1.39×)
- RangeBenchmarkTest::benchmarkIntRangeForIn — OFF 1762.425 ms → ON 806.898 ms (≈2.18×)

Reproduce
```
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks

./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
```

Notes
- Negative caches are installed only after a real miss throws (cache‑after‑miss), preserving error semantics and invalidation on `layoutVersion` changes.
- IndexRef PIC augments the existing direct path and uses move‑to‑front promotion; it is keyed on `(classId, layoutVersion)` like other PICs.