# Lyng Performance Guide (JVM‑first)

This document explains how to enable and measure the performance optimizations added to the Lyng interpreter. The focus is JVM‑first with safe, flag‑guarded rollouts and quick A/B testing. Other targets (JS/Wasm/Native) keep conservative defaults until validated.

## Overview

Optimizations are controlled by runtime‑mutable flags in `net.sergeych.lyng.PerfFlags`, initialized from platform‑specific static defaults `net.sergeych.lyng.PerfDefaults` (KMP `expect/actual`).

- JVM/Android defaults are aggressive (e.g. `RVAL_FASTPATH=true`).
- Non‑JVM defaults are conservative (e.g. `RVAL_FASTPATH=false`).

All flags are `var` and can be flipped at runtime (e.g., from tests or host apps) for A/B comparisons.

## Key flags

- `LOCAL_SLOT_PIC`: Runtime cache in `LocalVarRef` to avoid repeated name→slot lookups per frame (ON JVM default).
- `EMIT_FAST_LOCAL_REFS`: Compiler emits `FastLocalVarRef` for identifiers known to be locals/params (ON JVM default).
- `ARG_BUILDER`: Efficient argument building: small‑arity no‑alloc and pooled builder on JVM (ON JVM default).
- `SKIP_ARGS_ON_NULL_RECEIVER`: Early return on optional‑null receivers before building args (semantics‑compatible). A/B only.
- `SCOPE_POOL`: Scope frame pooling for calls (JVM, per‑thread ThreadLocal pool). ON by default on JVM; togglable at runtime.
- `FIELD_PIC`: 2‑entry polymorphic inline cache for field reads/writes keyed by `(classId, layoutVersion)` (ON JVM default).
- `METHOD_PIC`: 2‑entry PIC for instance method calls keyed by `(classId, layoutVersion)` (ON JVM default).
- `PIC_DEBUG_COUNTERS`: Enable lightweight hit/miss counters via `PerfStats` (OFF by default).
- `PRIMITIVE_FASTOPS`: Fast paths for `(ObjInt, ObjInt)` arithmetic/comparisons and `(ObjBool, ObjBool)` logic (ON JVM default).
- `RVAL_FASTPATH`: Bypass `ObjRecord` in pure expression evaluation via `ObjRef.evalValue` (ON JVM default, OFF elsewhere).

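The `(classId, layoutVersion)` keying described for `FIELD_PIC` and `METHOD_PIC` can be pictured with a small standalone sketch. This is not Lyng's implementation: the class `TwoEntryPic`, its `getOrResolve` method, and the move‑to‑front replacement policy are all assumptions made for illustration. It only shows why a layout change (a new `layoutVersion`) forces a miss and a fresh slow lookup.

```kotlin
// Hypothetical sketch: none of these names come from the Lyng sources.
// A 2-entry inline cache keyed by (classId, layoutVersion).
class TwoEntryPic<T : Any> {
    // Entry 1 holds the most recently used shape, entry 2 the previous one.
    private var classId1 = -1L
    private var version1 = -1
    private var value1: T? = null
    private var classId2 = -1L
    private var version2 = -1
    private var value2: T? = null

    var hits = 0
        private set
    var misses = 0
        private set

    fun getOrResolve(classId: Long, layoutVersion: Int, resolve: () -> T): T {
        val v1 = value1
        if (v1 != null && classId1 == classId && version1 == layoutVersion) {
            hits++
            return v1
        }
        val v2 = value2
        if (v2 != null && classId2 == classId && version2 == layoutVersion) {
            hits++
            // Promote entry 2 to the front by swapping the two entries.
            classId2 = classId1; version2 = version1; value2 = v1
            classId1 = classId; version1 = layoutVersion; value1 = v2
            return v2
        }
        // Miss: run the slow lookup, demote entry 1, install the new shape in front.
        misses++
        val resolved = resolve()
        classId2 = classId1; version2 = version1; value2 = v1
        classId1 = classId; version1 = layoutVersion; value1 = resolved
        return resolved
    }
}

fun main() {
    val pic = TwoEntryPic<Int>()
    var slowLookups = 0
    val resolveSlot = { slowLookups++; 7 } // stand-in for a name -> slot lookup

    pic.getOrResolve(1L, 0, resolveSlot) // first sight of this shape: miss
    pic.getOrResolve(1L, 0, resolveSlot) // same class and layout: hit
    pic.getOrResolve(1L, 1, resolveSlot) // layout version changed: miss again
    println("hits=${pic.hits} misses=${pic.misses} slowLookups=$slowLookups")
}
```

Because the version is part of the key, an entry cached under an old layout is never matched after the class layout changes; no explicit flush is needed in this sketch.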
See `src/commonMain/kotlin/net/sergeych/lyng/PerfFlags.kt` and `PerfDefaults.*.kt` for details and platform defaults.

## Where optimizations apply

- Locals: `FastLocalVarRef`, `LocalVarRef` per‑frame cache (PIC).
- Calls: small‑arity zero‑alloc paths (0–8 args), pooled builder (JVM), and child frame pooling (optional).
- Properties/methods: Field/Method PICs with receiver shape `(classId, layoutVersion)` and handle‑aware caches.
- Expressions: R‑value fast paths in hot nodes (`UnaryOpRef`, `BinaryOpRef`, `ElvisRef`, logical ops, `RangeRef`, `IndexRef` read, `FieldRef` receiver eval, `ListLiteralRef` elements, `CallRef` callee, `MethodCallRef` receiver, assignment RHS).
- Primitives: Direct boolean/int ops where safe.

## Running JVM micro‑benchmarks

Each benchmark prints timings with `[DEBUG_LOG]` and includes correctness assertions to prevent dead‑code elimination.

Run individual tests to avoid multiplatform matrices:

```
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallSplatBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MethodPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MixedBenchmarkTest
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest
```

Typical output (example):

```
[DEBUG_LOG] [BENCH] mixed-arity x200000 [ARG_BUILDER=ON]: 85.7 ms
```

Lower time is better. Run the same bench with a flag OFF vs ON to compare.

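An OFF vs ON comparison has two requirements worth making explicit: both runs must produce the same result, and the flag must be restored afterwards. The sketch below is a minimal, self-contained illustration of that pattern. `DemoFlags`, `timeWithFlag`, and `sumTo` are hypothetical stand-ins invented for this example, not Lyng APIs; in a real test the flag would be one of the `PerfFlags` fields and the workload a Lyng script.

```kotlin
import kotlin.system.measureNanoTime

// Hypothetical stand-in for a runtime-mutable flag holder such as PerfFlags.
object DemoFlags {
    var FAST_PATH: Boolean = true
}

// Runs `workload` once with the flag forced to `enabled` and returns
// (result, elapsed milliseconds). The previous flag value is always restored.
fun timeWithFlag(enabled: Boolean, workload: () -> Long): Pair<Long, Double> {
    val previous = DemoFlags.FAST_PATH
    DemoFlags.FAST_PATH = enabled
    try {
        var result = 0L
        val nanos = measureNanoTime { result = workload() }
        return result to nanos / 1_000_000.0
    } finally {
        DemoFlags.FAST_PATH = previous
    }
}

// Toy workload: closed-form sum when the flag is ON, plain loop when OFF.
fun sumTo(n: Int): Long {
    if (DemoFlags.FAST_PATH) return n.toLong() * (n + 1) / 2
    var acc = 0L
    for (i in 1..n) acc += i
    return acc
}

fun main() {
    val (offResult, offMs) = timeWithFlag(false) { sumTo(2_000_000) }
    val (onResult, onMs) = timeWithFlag(true) { sumTo(2_000_000) }
    // Correctness check: the optimization must not change the answer.
    check(offResult == onResult) { "flag changed the result" }
    println("sumTo OFF: $offMs ms, ON: $onMs ms")
}
```

A single timed run like this is noisy; repeating each configuration and comparing medians gives more stable numbers.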
## Toggling flags in tests

Flags are mutable at runtime, e.g.:

```kotlin
// Baseline: evaluate the script with the optimization disabled
PerfFlags.ARG_BUILDER = false
val r1 = (Scope().eval(script) as ObjInt).value
// Same script with the optimization enabled; r1 and r2 should be equal
PerfFlags.ARG_BUILDER = true
val r2 = (Scope().eval(script) as ObjInt).value
```

Reset flags at the end of a test to avoid impacting other tests.

## PIC diagnostics (optional)

Enable counters:

```kotlin
PerfFlags.PIC_DEBUG_COUNTERS = true
PerfStats.resetAll()
```

Available counters in `PerfStats`:

- Field PIC: `fieldPicHit`, `fieldPicMiss`, `fieldPicSetHit`, `fieldPicSetMiss`
- Method PIC: `methodPicHit`, `methodPicMiss`
- Locals: `localVarPicHit`, `localVarPicMiss`, `fastLocalHit`, `fastLocalMiss`
- Primitive ops: `primitiveFastOpsHit`

Print a summary at the end of a bench/test as needed. Remember to turn counters OFF after the test.

## Guidance per flag (JVM)

- Keep `RVAL_FASTPATH = true` unless debugging a suspected expression‑semantics issue.
- Keep `SCOPE_POOL = true` (the current JVM default, per‑thread ThreadLocal pool); switch it OFF only to get a baseline or to isolate a suspected frame‑pooling issue, and re‑run the deep stress tests after any pooling change.
- `FIELD_PIC` and `METHOD_PIC` should remain ON; they are validated with invalidation tests.
- `ARG_BUILDER` should remain ON; switch OFF only to get a baseline.

## Notes on correctness & safety

- Optional chaining semantics are preserved across fast paths.
- Visibility/mutability checks are enforced even on PIC fast‑paths.
- `frameId` is regenerated on each pooled frame borrow; stress tests verify no leakage under deep nesting/recursion.

## Cross‑platform

- Non‑JVM defaults keep `RVAL_FASTPATH=false` for now; other low‑risk flags may be ON.
- Once the JVM path is fully validated and measured, add lightweight benches for JS/Wasm/Native and enable flags incrementally.

## Troubleshooting

- If a benchmark shows regressions, flip related flags OFF to isolate the source (e.g., `ARG_BUILDER`, `RVAL_FASTPATH`, `FIELD_PIC`, `METHOD_PIC`).
- Use `PIC_DEBUG_COUNTERS` to observe inline cache effectiveness.
- Ensure tests do not accidentally keep flags ON for subsequent tests; reset after each test.

## JVM micro-benchmark results (3× medians; OFF → ON)

Date: 2025-11-10 23:04 (local)

| Flag               | Benchmark/Test                               | OFF median (ms) | ON median (ms) | Speedup | Notes |
|--------------------|----------------------------------------------|----------------:|---------------:|:-------:|-------|
| ARG_BUILDER        | CallMixedArityBenchmarkTest                  | 788.02          | 668.79         | 1.18×   | Clear win on mixed arity |
| ARG_BUILDER        | CallBenchmarkTest (simple calls)             | 423.87          | 425.47         | 1.00×   | Neutral on repeated simple calls |
| FIELD_PIC          | PicBenchmarkTest::benchmarkFieldGetSetPic    | 113.575         | 106.017        | 1.07×   | Small but consistent win |
| METHOD_PIC         | PicBenchmarkTest::benchmarkMethodPic         | 251.068         | 149.439        | 1.68×   | Large consistent win |
| RVAL_FASTPATH      | ExpressionBenchmarkTest                      | 514.491         | 426.800        | 1.21×   | Consistent win in expression chains |
| PRIMITIVE_FASTOPS  | ArithmeticBenchmarkTest (int-sum)            | 243.420         | 128.146        | 1.90×   | Big win for integer addition |
| PRIMITIVE_FASTOPS  | ArithmeticBenchmarkTest (int-cmp)            | 210.385         | 168.534        | 1.25×   | Moderate win for comparisons |
| SCOPE_POOL         | CallPoolingBenchmarkTest                     | 505.778         | 366.737        | 1.38×   | Single-threaded bench; per-thread ThreadLocal pool; default ON on JVM |

Notes:

- All results obtained from `[DEBUG_LOG] [BENCH]` outputs with three repeated Gradle test invocations per configuration; medians reported.
- JVM defaults (current): `ARG_BUILDER=true`, `PRIMITIVE_FASTOPS=true`, `RVAL_FASTPATH=true`, `FIELD_PIC=true`, `METHOD_PIC=true`, `SCOPE_POOL=true` (per‑thread ThreadLocal pool).

## Concurrency (multi‑core) pooling results (3× medians; OFF → ON)

Date: 2025-11-10 22:56 (local)

| Flag       | Benchmark/Test                       | OFF median (ms) | ON median (ms) | Speedup | Notes |
|------------|--------------------------------------|----------------:|---------------:|:-------:|-------|
| SCOPE_POOL | ConcurrencyCallBenchmarkTest (JVM)   | 521.102         | 201.374        | 2.59×   | Multithreaded workload on `Dispatchers.Default` with per‑thread ThreadLocal pool; workers=8, iters=15000/worker. |

Methodology:

- The test toggles `PerfFlags.SCOPE_POOL` within a single run and executes the same script across N worker coroutines scheduled on `Dispatchers.Default`.
- We executed the test three times via Gradle and computed medians from the printed `[DEBUG_LOG]` timings:
  - OFF runs (ms): 532.442 | 521.102 | 474.386 → median 521.102
  - ON runs (ms): 218.683 | 201.374 | 198.737 → median 201.374
- Speedup = OFF/ON.

Reproduce:

```
./gradlew :lynglib:jvmTest --tests "ConcurrencyCallBenchmarkTest" --rerun-tasks
```

## Next optimization steps (JVM)

Date: 2025-11-10 23:04 (local)

- PICs
  - Widen METHOD_PIC to 3–4 entries with tiny LRU; keep invalidation on layout change; re-run `PicInvalidationJvmTest`.
  - Micro fast-path for FIELD_PIC read-then-write pairs (`x = x + 1`) to reuse the resolved slot within one step.
- Locals and slots
  - Pre-size `Scope` slot structures when compiler knows local/param counts; audit `EMIT_FAST_LOCAL_REFS` coverage.
  - Re-run `LocalVarBenchmarkTest` to quantify gains.
- RVAL_FASTPATH coverage
  - Cover primitive `ObjList` index reads, pure receivers in `FieldRef`, and assignment RHS where safe; add micro-benches to `ExpressionBenchmarkTest`.
- Collections and ranges
  - Specialize `(Int..Int)` loops into tight counted loops (no intermediary objects).
  - Add primitive-specialized `ObjList` ops (`map`, `filter`, `sum`, `contains`) under `PRIMITIVE_FASTOPS`.
- Regex and strings
  - Cache compiled regex for string literals at compile time; add a tiny LRU for dynamic patterns behind `REGEX_CACHE`.
  - Add `RegexBenchmarkTest` for repeated matches.
- JIT friendliness (Kotlin/JVM)
  - Inline tiny helpers in hot paths, prefer arrays for internal buffers, finalize hot data structures where safe.

Validation matrix

- Always re-run: `CallBenchmarkTest`, `CallMixedArityBenchmarkTest`, `PicBenchmarkTest`, `ExpressionBenchmarkTest`, `ArithmeticBenchmarkTest`, `CallPoolingBenchmarkTest`, `DeepPoolingStressJvmTest`, `ConcurrencyCallBenchmarkTest` (3× medians when comparing).
- Keep full `:lynglib:jvmTest` green after each change.
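
The "3× medians" comparison used in the result tables can be reproduced with a few lines of Kotlin. The sketch below reuses the OFF/ON timings listed in the concurrency methodology; the helper names `median` and `speedup` are illustrative and not part of Lyng.

```kotlin
import java.util.Locale

// Median of a non-empty list of timings (milliseconds).
fun median(samples: List<Double>): Double {
    require(samples.isNotEmpty()) { "need at least one sample" }
    val sorted = samples.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// Speedup as defined in this guide: OFF time divided by ON time.
fun speedup(offMs: Double, onMs: Double): Double = offMs / onMs

fun main() {
    // Timings copied from the ConcurrencyCallBenchmarkTest runs above.
    val offRuns = listOf(532.442, 521.102, 474.386)
    val onRuns = listOf(218.683, 201.374, 198.737)

    val offMedian = median(offRuns) // 521.102
    val onMedian = median(onRuns)   // 201.374
    val ratio = speedup(offMedian, onMedian)
    // Locale.ROOT keeps the decimal separator a dot on every machine.
    println(String.format(Locale.ROOT, "SCOPE_POOL speedup: %.2fx", ratio))
}
```

Using the median rather than the mean keeps a single slow run (for example one hit by JIT warm‑up or GC) from skewing the comparison.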