lyng/perf_guide.md at 4e37d0be26f7a9c811b04ab00ba6a0548bb719b1

sergeych 1fadc42414 added DocsPage, improved navbar with dynamic height handling, MathJax integration, new TOC features, and extensive markdown processing

2025-11-19 21:52:58 +01:00


This document explains how to enable and measure the performance optimizations added to the Lyng interpreter. The focus is JVM‑first with safe, flag‑guarded rollouts and quick A/B testing. Other targets (JS/Wasm/Native) keep conservative defaults until validated.

Overview

Optimizations are controlled by runtime‑mutable flags in net.sergeych.lyng.PerfFlags, initialized from platform‑specific static defaults net.sergeych.lyng.PerfDefaults (KMP expect/actual).

JVM/Android defaults are aggressive (e.g. RVAL_FASTPATH=true).
Non‑JVM defaults are conservative (e.g. RVAL_FASTPATH=false).

All flags are var and can be flipped at runtime (e.g., from tests or host apps) for A/B comparisons.

Workload presets (JVM‑first)

To simplify switching between recommended flag sets for different workloads, use net.sergeych.lyng.PerfProfiles:

val snap = PerfProfiles.apply(PerfProfiles.Preset.BENCH)  // or BASELINE / BOOKS
// ... run workload ...
PerfProfiles.restore(snap)  // restore previous flags

BASELINE: restores platform defaults from PerfDefaults (good rollback point).
BENCH: expression‑heavy micro‑bench focus (aggressive R‑value and PIC optimizations on JVM).
BOOKS: documentation workloads (prefers simpler paths; disables some PIC/arg builder features shown neutral/negative for this load in A/B).

Key flags

LOCAL_SLOT_PIC — Runtime cache in LocalVarRef to avoid repeated name→slot lookups per frame (ON JVM default).
EMIT_FAST_LOCAL_REFS — Compiler emits FastLocalVarRef for identifiers known to be locals/params (ON JVM default).
ARG_BUILDER — Efficient argument building: small‑arity no‑alloc and pooled builder on JVM (ON JVM default).
ARG_SMALL_ARITY_12 — Extends small‑arity no‑alloc call paths from 0–8 to 0–12 arguments (JVM‑first exploration; OFF by default). Use for codebases with many 9–12 arg calls; A/B before enabling.
SKIP_ARGS_ON_NULL_RECEIVER — Early return on optional‑null receivers before building args (semantics‑compatible). A/B only.
SCOPE_POOL — Scope frame pooling for calls (JVM, per‑thread ThreadLocal pool). ON by default on JVM; togglable at runtime.
FIELD_PIC — 2‑entry polymorphic inline cache for field reads/writes keyed by (classId, layoutVersion) (ON JVM default).
METHOD_PIC — 2‑entry PIC for instance method calls keyed by (classId, layoutVersion) (ON JVM default).
FIELD_PIC_SIZE_4 — Increases Field PIC size from 2 to 4 entries (JVM-first tuning; OFF by default). Use for sites with >2 receiver shapes.
METHOD_PIC_SIZE_4 — Increases Method PIC size from 2 to 4 entries (JVM-first tuning; OFF by default).
PIC_ADAPTIVE_2_TO_4 — Adaptive growth of Field/Method PICs from 2→4 entries per-site when miss rate >20% over ≥256 accesses (JVM-first; OFF by default).
INDEX_PIC — Enables polymorphic inline cache for indexing (e.g., a[i]) and related fast paths. Defaults to follow FIELD_PIC on init; can be toggled independently.
INDEX_PIC_SIZE_4 — Increases Index PIC size from 2 to 4 entries (JVM-first tuning). Default: ON for JVM; OFF elsewhere by default.
PIC_DEBUG_COUNTERS — Enable lightweight hit/miss counters via PerfStats (OFF by default).
PRIMITIVE_FASTOPS — Fast paths for (ObjInt, ObjInt) arithmetic/comparisons and (ObjBool, ObjBool) logic (ON JVM default).
RVAL_FASTPATH — Bypass ObjRecord in pure expression evaluation via ObjRef.evalValue (ON JVM default, OFF elsewhere).

See src/commonMain/kotlin/net/sergeych/lyng/PerfFlags.kt and PerfDefaults.*.kt for details and platform defaults.
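
The growth rule described for PIC_ADAPTIVE_2_TO_4 (grow a site from 2 to 4 entries when the miss rate exceeds 20% over at least 256 accesses) can be illustrated with a small sketch. This is not the interpreter's code: the class `AdaptiveSiteStats` and its members are hypothetical names, and only the two thresholds are taken from the flag description above.

```kotlin
// Hypothetical per-site bookkeeping; names are illustrative, not Lyng APIs.
class AdaptiveSiteStats(
    private val minAccesses: Int = 256,          // threshold from the flag description
    private val missRateThreshold: Double = 0.20 // threshold from the flag description
) {
    var capacity = 2
        private set
    private var hits = 0
    private var misses = 0

    fun record(hit: Boolean) {
        if (hit) hits++ else misses++
        val total = hits + misses
        // Grow once, and only after enough samples have been observed.
        if (capacity == 2 && total >= minAccesses &&
            misses.toDouble() / total > missRateThreshold
        ) {
            capacity = 4
        }
    }
}

fun main() {
    val site = AdaptiveSiteStats()
    repeat(200) { site.record(hit = true) }
    repeat(56) { site.record(hit = false) }  // 56 of 256 accesses miss (about 21.9%)
    println(site.capacity)
}
```

A site that stays mostly monomorphic never reaches the miss threshold and keeps the cheaper 2-entry cache.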

Where optimizations apply

Locals: FastLocalVarRef, LocalVarRef per‑frame cache (PIC).
Calls: small‑arity zero‑alloc paths (0–8 args; optionally 0–12 with ARG_SMALL_ARITY_12), pooled builder (JVM), and child frame pooling (optional).
Properties/methods: Field/Method PICs with receiver shape (classId, layoutVersion) and handle‑aware caches; configurable 2→4 entries under flags.
Expressions: R‑value fast paths in hot nodes (UnaryOpRef, BinaryOpRef, ElvisRef, logical ops, RangeRef, IndexRef read, FieldRef receiver eval, ListLiteralRef elements, CallRef callee, MethodCallRef receiver, assignment RHS).
Primitives: Direct boolean/int ops where safe.

Compiler constant folding (conservative)

The compiler folds a safe subset of literal‑only expressions at compile time to reduce runtime work:
- Integer arithmetic: + - * / % (division/modulo only when divisor ≠ 0).
- Bitwise integer ops: & ^ | << >>.
- Comparisons and equality for ints/strings/chars: == != < <= > >=.
- Boolean logic for literal booleans: || && and unary !.
- String concatenation of literal strings: "a" + "b".
Non‑literal operands or side‑effecting constructs are not folded.
Semantics remain unchanged; tests verify parity.
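
A conservative folder of this kind can be sketched as follows. The expression types (`Expr`, `IntLit`, `StrLit`, `Var`, `Binary`) and the `fold` function are hypothetical stand-ins, not the Lyng compiler's AST, and the sketch assumes Kotlin `Long` arithmetic matches the interpreter's integer semantics. It covers only integer arithmetic and string concatenation from the list above.

```kotlin
// Hypothetical mini-AST for illustration only.
sealed interface Expr
data class IntLit(val value: Long) : Expr
data class StrLit(val value: String) : Expr
data class Var(val name: String) : Expr
data class Binary(val op: String, val left: Expr, val right: Expr) : Expr

fun fold(e: Expr): Expr {
    if (e !is Binary) return e
    val l = fold(e.left)
    val r = fold(e.right)
    if (l is IntLit && r is IntLit) {
        when (e.op) {
            "+" -> return IntLit(l.value + r.value)
            "-" -> return IntLit(l.value - r.value)
            "*" -> return IntLit(l.value * r.value)
            // Division and modulo fold only for a non-zero divisor, so a
            // division by zero is still reported at runtime.
            "/" -> if (r.value != 0L) return IntLit(l.value / r.value)
            "%" -> if (r.value != 0L) return IntLit(l.value % r.value)
        }
    }
    if (e.op == "+" && l is StrLit && r is StrLit) return StrLit(l.value + r.value)
    // Anything involving a non-literal operand is rebuilt, not folded.
    return Binary(e.op, l, r)
}

fun main() {
    println(fold(Binary("+", IntLit(2L), Binary("*", IntLit(3L), IntLit(4L)))))
    println(fold(Binary("/", IntLit(1L), IntLit(0L))))
    println(fold(Binary("+", Var("x"), IntLit(1L))))
}
```

Only the first expression collapses to a literal; the division by zero and the expression containing a variable are returned unfolded.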

Running JVM micro‑benchmarks

Each benchmark prints timings with [DEBUG_LOG] and includes correctness assertions to prevent dead‑code elimination.

Run individual tests to avoid multiplatform matrices:

./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest
./gradlew :lynglib:jvmTest --tests CallSplatBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest
./gradlew :lynglib:jvmTest --tests PicAdaptiveABTest
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest
./gradlew :lynglib:jvmTest --tests IndexPicABTest
./gradlew :lynglib:jvmTest --tests IndexWritePathABTest
./gradlew :lynglib:jvmTest --tests CallPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MethodPoolingBenchmarkTest
./gradlew :lynglib:jvmTest --tests MixedBenchmarkTest
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest

Typical output (example):

[DEBUG_LOG] [BENCH] mixed-arity x200000 [ARG_BUILDER=ON]: 85.7 ms

Lower time is better. Run the same bench with a flag OFF vs ON to compare.

Optional JFR allocation profiling (JVM)

When running end‑to‑end “book” workloads or heavier benches, you can enable JFR to capture allocation and GC details:

./gradlew :lynglib:jvmTest --tests BookAllocationProfileTest -Dlyng.jfr=true \
  -Dlyng.profile.warmup=1 -Dlyng.profile.repeats=3 -Dlyng.profile.shuffle=true

Dumps are saved to lynglib/build/jfr_*.jfr if the JVM supports Flight Recorder.
The test also records GC counts/time and median time/heap deltas to lynglib/build/book_alloc_profile.txt.

Toggling flags in tests

Flags are mutable at runtime, e.g.:

PerfFlags.ARG_BUILDER = false
val r1 = (Scope().eval(script) as ObjInt).value
PerfFlags.ARG_BUILDER = true
val r2 = (Scope().eval(script) as ObjInt).value

Reset flags at the end of a test to avoid impacting other tests.
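
One way to make the reset automatic is to wrap the toggle in a helper that restores the previous value in a finally block. The sketch below is self-contained and therefore uses a stand-in object `DemoFlags`; both `DemoFlags` and `withFlag` are hypothetical names. It assumes the real flags can be referenced the same way (for example `PerfFlags::ARG_BUILDER`), which should be checked against PerfFlags.kt.

```kotlin
import kotlin.reflect.KMutableProperty0

// Stand-in for PerfFlags so the sketch compiles on its own (hypothetical).
object DemoFlags {
    var ARG_BUILDER: Boolean = true
}

// Sets the flag for the duration of block and always restores the old value,
// even when block throws.
fun <T> withFlag(flag: KMutableProperty0<Boolean>, value: Boolean, block: () -> T): T {
    val saved = flag.get()
    flag.set(value)
    try {
        return block()
    } finally {
        flag.set(saved)
    }
}

fun main() {
    val seen = withFlag(DemoFlags::ARG_BUILDER, false) { DemoFlags.ARG_BUILDER }
    println("inside=$seen after=${DemoFlags.ARG_BUILDER}")
}
```

Because the restore runs in `finally`, a failing assertion inside the block does not leave the flag flipped for later tests.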

PIC diagnostics (optional)

Enable counters:

PerfFlags.PIC_DEBUG_COUNTERS = true
PerfStats.resetAll()

Available counters in PerfStats:

Field PIC: fieldPicHit, fieldPicMiss, fieldPicSetHit, fieldPicSetMiss
Method PIC: methodPicHit, methodPicMiss
Index PIC: indexPicHit, indexPicMiss
Locals: localVarPicHit, localVarPicMiss, fastLocalHit, fastLocalMiss
Primitive ops: primitiveFastOpsHit

Print a summary at the end of a bench/test as needed. Remember to turn counters OFF after the test.
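
For the summary, a hit rate is usually easier to compare across runs than raw counts. The helper below is a minimal sketch: `hitRatePercent` is a hypothetical name, the counters are assumed to be readable as `Long` values (for example `fieldPicHit` and `fieldPicMiss`), and the numbers in `main` are placeholders, not measurements.

```kotlin
// Returns the hit rate in percent, or null when nothing was recorded.
fun hitRatePercent(hits: Long, misses: Long): Double? {
    val total = hits + misses
    return if (total == 0L) null else 100.0 * hits / total
}

fun main() {
    // Placeholder values; substitute the counters read from PerfStats.
    println(hitRatePercent(hits = 900L, misses = 100L))
    println(hitRatePercent(hits = 0L, misses = 0L))
}
```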

A/B scenarios and guidance (JVM)

Adaptive PIC (fields/methods)

Flags: FIELD_PIC=true, METHOD_PIC=true, FIELD_PIC_SIZE_4=false, METHOD_PIC_SIZE_4=false, toggle PIC_ADAPTIVE_2_TO_4 OFF vs ON.
Benchmarks: PicBenchmarkTest, MixedBenchmarkTest, PicAdaptiveABTest.
Expect wins at sites with >2 receiver shapes; counters should show fewer misses with adaptivity ON.

Index PIC and size

Flags: toggle INDEX_PIC OFF vs ON; then INDEX_PIC_SIZE_4 OFF vs ON.
Benchmarks: ExpressionBenchmarkTest (list indexing) and IndexPicABTest (string/map indexing).
Expect wins when the same index shape recurs; counters should show higher indexPicHit.

Index WRITE paths (Map and List)

Flags: toggle INDEX_PIC OFF vs ON; then INDEX_PIC_SIZE_4 OFF vs ON.
Benchmark: IndexWritePathABTest (Map[String] put, List[Int] set) — writes results to lynglib/build/index_write_ab_results.txt.
Direct fast‑paths are used on R‑value paths where safe and semantics‑preserving (e.g., optional‑chaining no‑ops on null receivers; bounds exceptions unchanged).

Guidance per flag (JVM)

Keep RVAL_FASTPATH = true unless debugging a suspected expression‑semantics issue.
SCOPE_POOL is ON by default on JVM (per-thread ThreadLocal pool). Set SCOPE_POOL = false to obtain a baseline or while investigating a suspected pooling issue, and re-run the deep stress tests after any change to pooling.
FIELD_PIC and METHOD_PIC should remain ON; they are validated with invalidation tests.
Consider enabling FIELD_PIC_SIZE_4/METHOD_PIC_SIZE_4 for sites with 3–4 receiver shapes; measure first.
PIC_ADAPTIVE_2_TO_4 is useful on polymorphic sites and may outperform fixed size 2 on mixed-shape workloads. Validate with PicAdaptiveABTest.
INDEX_PIC is generally beneficial on JVM; leave ON when measuring index‑heavy workloads.
INDEX_PIC_SIZE_4 is ON by default on JVM as A/B showed consistent wins on String[Int] and Map[String] workloads. You can disable it by setting PerfFlags.INDEX_PIC_SIZE_4 = false if needed.
ARG_BUILDER should remain ON; switch OFF only to get a baseline.
ARG_SMALL_ARITY_12 is experimental and OFF by default. Enable it only if your workload frequently calls functions with 9–12 arguments and A/B shows consistent wins.

Workload‑specific recommendations (JVM)

“Books”/documentation loads (BookTest): prefer simpler paths; in A/B these often benefit from the BOOKS preset (e.g., ARG_BUILDER=false, SCOPE_POOL=false, INDEX_PIC=false). Use PerfProfiles.apply(PerfProfiles.Preset.BOOKS) before the run and restore(...) after.
Expression‑heavy benches: use the BENCH preset (PICs and R‑value fast‑paths enabled, INDEX_PIC_SIZE_4=true).
Always verify with local A/B on your environment; rollback is a flag flip or applying BASELINE.

Notes on correctness & safety

Optional chaining semantics are preserved across fast paths.
Visibility/mutability checks are enforced even on PIC fast‑paths.
frameId is regenerated on each pooled frame borrow; stress tests verify no leakage under deep nesting/recursion.
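
The pooling note can be illustrated with a minimal per-thread pool that assigns a fresh id on every borrow. `Frame`, `FramePool`, `borrow` and `release` are hypothetical names chosen for this sketch; the interpreter's actual Scope pooling is likely more involved.

```kotlin
import java.util.concurrent.atomic.AtomicLong

// Hypothetical frame type: an id plus reusable slot storage.
class Frame {
    var frameId: Long = 0
    val slots = ArrayList<Any?>()
}

object FramePool {
    private val nextId = AtomicLong(1)
    // One deque per thread, so borrow/release need no locking.
    private val pool = ThreadLocal.withInitial { ArrayDeque<Frame>() }

    fun borrow(): Frame {
        val frame = pool.get().removeLastOrNull() ?: Frame()
        // A fresh id on every borrow: anything keyed on the old id
        // cannot match the recycled frame.
        frame.frameId = nextId.getAndIncrement()
        return frame
    }

    fun release(frame: Frame) {
        frame.slots.clear()   // drop references before the frame is reused
        pool.get().addLast(frame)
    }
}

fun main() {
    val first = FramePool.borrow()
    val firstId = first.frameId
    FramePool.release(first)
    val second = FramePool.borrow()
    println("reused=${first === second} idChanged=${second.frameId != firstId}")
}
```

Reusing the object saves the allocation, while the new id keeps per-frame caches from treating the recycled frame as the old one.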

Cross‑platform

Non‑JVM defaults keep RVAL_FASTPATH=false for now; other low‑risk flags may be ON.
Once JVM path is fully validated and measured, add lightweight benches for JS/Wasm/Native and enable flags incrementally.

Range fast iteration (experimental)

Flag: RANGE_FAST_ITER (default OFF).
When enabled and applicable, simple ascending integer ranges (0..n, 0..<n) use a specialized non‑allocating iterator (ObjFastIntRangeIterator).
Benchmark: RangeIterationBenchmarkTest records OFF/ON timings for inclusive, exclusive, reversed, negative, and empty ranges. Semantics are preserved; non‑int or complex ranges fall back to the generic iterator.
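
The idea behind a specialized iterator for ascending integer ranges can be sketched as below. `FastIntRangeIterator` here is a hypothetical class written for this guide, not the source of ObjFastIntRangeIterator; an exclusive range `0..<n` is assumed to be passed as `endInclusive = n - 1`.

```kotlin
// Iterates start..endInclusive with primitive Long state and no per-step objects.
class FastIntRangeIterator(start: Long, private val endInclusive: Long) {
    private var current = start
    private var done = start > endInclusive   // empty range: nothing to yield

    fun hasNext(): Boolean = !done

    fun next(): Long {
        check(!done) { "iterator exhausted" }
        val value = current
        // Stop on the last element instead of incrementing past it,
        // which also avoids overflow at Long.MAX_VALUE.
        if (value == endInclusive) done = true else current++
        return value
    }
}

fun main() {
    var sum = 0L
    val it = FastIntRangeIterator(0L, 4L)
    while (it.hasNext()) sum += it.next()
    println(sum)
}
```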

Troubleshooting

If a benchmark shows regressions, flip related flags OFF to isolate the source (e.g., ARG_BUILDER, RVAL_FASTPATH, FIELD_PIC, METHOD_PIC).
Use PIC_DEBUG_COUNTERS to observe inline cache effectiveness.
Ensure tests do not accidentally keep flags ON for subsequent tests; reset after each test.

JVM micro-benchmark results (3× medians; OFF → ON)

Date: 2025-11-10 23:04 (local)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| ARG_BUILDER | CallMixedArityBenchmarkTest | 788.02 | 668.79 | 1.18× | Clear win on mixed arity |
| ARG_BUILDER | CallBenchmarkTest (simple calls) | 423.87 | 425.47 | 1.00× | Neutral on repeated simple calls |
| FIELD_PIC | PicBenchmarkTest::benchmarkFieldGetSetPic | 113.575 | 106.017 | 1.07× | Small but consistent win |
| METHOD_PIC | PicBenchmarkTest::benchmarkMethodPic | 251.068 | 149.439 | 1.68× | Large consistent win |
| RVAL_FASTPATH | ExpressionBenchmarkTest | 514.491 | 426.800 | 1.21× | Consistent win in expression chains |
| PRIMITIVE_FASTOPS | ArithmeticBenchmarkTest (int-sum) | 243.420 | 128.146 | 1.90× | Big win for integer addition |
| PRIMITIVE_FASTOPS | ArithmeticBenchmarkTest (int-cmp) | 210.385 | 168.534 | 1.25× | Moderate win for comparisons |
| SCOPE_POOL | CallPoolingBenchmarkTest | 505.778 | 366.737 | 1.38× | Single-threaded bench; per-thread ThreadLocal pool; default ON on JVM |

Notes:

All results were obtained from [DEBUG_LOG] [BENCH] outputs, with three repeated Gradle test invocations per configuration; medians are reported.
JVM defaults (current): ARG_BUILDER=true, PRIMITIVE_FASTOPS=true, RVAL_FASTPATH=true, FIELD_PIC=true, METHOD_PIC=true, SCOPE_POOL=true (per‑thread ThreadLocal pool), REGEX_CACHE=true.

Concurrency (multi‑core) pooling results (3× medians; OFF → ON)

Date: 2025-11-10 22:56 (local)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| SCOPE_POOL | ConcurrencyCallBenchmarkTest (JVM) | 521.102 | 201.374 | 2.59× | Multithreaded workload on `Dispatchers.Default` with per‑thread ThreadLocal pool; workers=8, iters=15000/worker. |

Methodology:

The test toggles PerfFlags.SCOPE_POOL within a single run and executes the same script across N worker coroutines scheduled on Dispatchers.Default.
We executed the test three times via Gradle and computed medians from the printed [DEBUG_LOG] timings:
- OFF runs (ms): 532.442 | 521.102 | 474.386 → median 521.102
- ON runs (ms): 218.683 | 201.374 | 198.737 → median 201.374
Speedup = OFF/ON.
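
The median and speedup arithmetic can be reproduced with a few lines of Kotlin. The `median` helper is a hypothetical name written for this guide; the input values are the OFF and ON runs listed above.

```kotlin
import java.util.Locale

// Median of a non-empty list; averages the two middle values for even sizes.
fun median(values: List<Double>): Double {
    require(values.isNotEmpty()) { "need at least one timing" }
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid] else (sorted[mid - 1] + sorted[mid]) / 2.0
}

fun main() {
    val off = listOf(532.442, 521.102, 474.386)
    val on = listOf(218.683, 201.374, 198.737)
    val speedup = median(off) / median(on)   // Speedup = OFF/ON
    // Locale.ROOT keeps the decimal point independent of the machine locale.
    println("OFF=${median(off)} ON=${median(on)} speedup=${"%.2f".format(Locale.ROOT, speedup)}")
}
```

With three runs per configuration the median is simply the middle value, which makes it robust against a single slow outlier such as a cold first run.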

Reproduce:

./gradlew :lynglib:jvmTest --tests "ConcurrencyCallBenchmarkTest" --rerun-tasks

Next optimization steps (JVM)

Date: 2025-11-10 23:04 (local)

PICs
- Widen METHOD_PIC to 3–4 entries with tiny LRU; keep invalidation on layout change; re-run PicInvalidationJvmTest.
- Micro fast-path for FIELD_PIC read-then-write pairs (x = x + 1) to reuse the resolved slot within one step.
Locals and slots
- Pre-size Scope slot structures when compiler knows local/param counts; audit EMIT_FAST_LOCAL_REFS coverage.
- Re-run LocalVarBenchmarkTest to quantify gains.
RVAL_FASTPATH coverage
- Cover primitive ObjList index reads, pure receivers in FieldRef, and assignment RHS where safe; add micro-benches to ExpressionBenchmarkTest.
Collections and ranges
- Specialize (Int..Int) loops into tight counted loops (no intermediary objects).
- Add primitive-specialized ObjList ops (map, filter, sum, contains) under PRIMITIVE_FASTOPS.
Regex and strings
- Cache compiled regex for string literals at compile time; add a tiny LRU for dynamic patterns behind REGEX_CACHE.
- Add RegexBenchmarkTest for repeated matches.
JIT friendliness (Kotlin/JVM)
- Inline tiny helpers in hot paths, prefer arrays for internal buffers, finalize hot data structures where safe.

Validation matrix

Always re-run: CallBenchmarkTest, CallMixedArityBenchmarkTest, PicBenchmarkTest, ExpressionBenchmarkTest, ArithmeticBenchmarkTest, CallPoolingBenchmarkTest, DeepPoolingStressJvmTest, ConcurrencyCallBenchmarkTest (3× medians when comparing).
Keep full :lynglib:jvmTest green after each change.

PIC update (4‑way METHOD_PIC) — JVM (3× medians; OFF → ON)

Date: 2025-11-11 00:16 (local)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| FIELD_PIC | PicBenchmarkTest::benchmarkFieldGetSetPic | 207.578 | 106.481 | 1.95× | Read→write loop; micro fast‑path groundwork present |
| METHOD_PIC | PicBenchmarkTest::benchmarkMethodPic | 273.478 | 182.226 | 1.50× | 4‑way PIC with move‑to‑front (was 2‑way before) |

Medians computed from three Gradle runs in this session; see [DEBUG_LOG] [BENCH] lines in test output.

Locals/slots capacity (pre‑sizing hints) — JVM (3× medians; OFF → ON)

Date: 2025-11-11 13:19 (local)

| Optimization | Benchmark/Test | OFF config | ON config | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|---|---|
| Locals pre‑sizing + PIC | LocalVarBenchmarkTest | LOCAL_SLOT_PIC=OFF, FAST_LOCAL=OFF | LOCAL_SLOT_PIC=ON, FAST_LOCAL=ON | 472.129 | 370.871 | 1.27× | Compiler hint `params+4`; slot pre‑size; semantics unchanged |

Methodology:

Each configuration executed three times via :lynglib:jvmTest --tests "…" --rerun-tasks; medians reported.
Locals improvement stacks with per‑thread SCOPE_POOL and ARG fast paths.

RVAL fast paths update — JVM (IndexRef and FieldRef) [3× medians; OFF → ON]

Date: 2025-11-11 13:19 (local)

New micro-benchmarks have been added to quantify the latest RVAL_FASTPATH extensions:

Primitive ObjList index-read fast path in IndexRef.
Conservative “pure receiver” evaluation in FieldRef (monomorphic, immutable receiver), preserving visibility/mutability checks and optional chaining semantics.

Benchmarks to run (each 3× OFF → ON):

ExpressionBenchmarkTest::benchmarkListIndexReads
ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver

Reproduce (3× each; collect [DEBUG_LOG] [BENCH] lines and compute medians):

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks

Collected medians and speedups:

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkListIndexReads | 305.243 | 230.942 | 1.32× | Fast path in `IndexRef` for `ObjList` + `ObjInt` index |
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 266.222 | 190.720 | 1.40× | Pure-receiver evaluation in `FieldRef` (monomorphic, immutable) |

Notes:

Both benches toggle PerfFlags.RVAL_FASTPATH within a single run to produce OFF and ON timings under identical conditions.
Correctness assertions ensure the loops are not optimized away.
All semantics (visibility/mutability checks, optional chaining) remain intact; fast paths only skip interim ObjRecord traffic when safe.

ARG_BUILDER — splat fast‑path (3× medians; OFF → ON)

Date: 2025-11-11 13:12 (local)

Environment: Gradle 8.7; JVM (JDK as configured by toolchain); single‑threaded test execution; stdout enabled.

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| ARG_BUILDER | CallSplatBenchmarkTest (splat) | 613.689 | 463.593 | 1.32× | Single‑splat fast‑path returns underlying list directly; avoids intermediate copies |

Inputs (3×):

OFF runs (ms): 613.689 | 629.604 | 612.361 → median 613.689
ON runs (ms): 453.752 | 463.593 | 468.844 → median 463.593

Reproduce (3×):

./gradlew :lynglib:jvmTest --tests "CallSplatBenchmarkTest" --rerun-tasks

Phase A consolidation (JVM) — 3× medians updated

Date: 2025-11-11 13:48 (local)

Environment:

JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
Gradle: 8.7
OS/Arch: macOS 14.8.1 (aarch64)

ARG_BUILDER

| Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|
| CallMixedArityBenchmarkTest | 866.681 | 717.439 | 1.21× | Small-arity 0–8 fast path + builder; correctness preserved |
| CallSplatBenchmarkTest (splat) | 600.880 | 459.706 | 1.31× | Single-splat fast path returns underlying list; avoids copies |

Inputs (3×):

Mixed arity OFF: 874.088291 | 866.680959 | 858.577125 → median 866.680959
Mixed arity ON: 731.308625 | 706.440125 | 717.438542 → median 717.438542
Splat OFF: 600.268625 | 607.849416 | 600.879666 → median 600.879666
Splat ON: 459.706375 | 449.950166 | 461.815167 → median 459.706375

RVAL_FASTPATH (new coverage)

| Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|
| ExpressionBenchmarkTest::benchmarkListIndexReads | 299.366 | 218.812 | 1.37× | IndexRef fast path for ObjList + ObjInt |
| ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 268.315 | 186.032 | 1.44× | Pure-receiver evaluation in FieldRef (monomorphic, immutable) |

Inputs (3×):

ListIndex OFF: 291.344 | 310.717167 | 299.365709 → median 299.365709
ListIndex ON: 217.795375 | 221.504166 | 218.812042 → median 218.812042
FieldRead OFF: 267.2775 | 274.355208 | 268.315125 → median 268.315125
FieldRead ON: 189.599333 | 186.031791 | 182.069167 → median 186.031791

Locals/slots capacity (precise hints)

| Benchmark/Test | OFF config | ON config | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|---|
| LocalVarBenchmarkTest | LOCAL_SLOT_PIC=OFF, FAST_LOCAL=OFF | LOCAL_SLOT_PIC=ON, FAST_LOCAL=ON | 446.018 | 347.964 | 1.28× | Precise capacity hints + fast-locals coverage |

Inputs (3×):

Locals OFF: 470.575041 | 441.89625 | 446.017833 → median 446.017833
Locals ON: 370.664208 | 345.615541 | 347.964291 → median 347.964291

Methodology:

Each test executed three times via Gradle with stdout enabled; medians computed from [DEBUG_LOG] [BENCH] lines.
Full JVM tests and stress benches remain green in this cycle.

Phase B — List ops specialization (PRIMITIVE_FASTOPS) — 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)

Environment:

JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
Gradle: 8.7
OS/Arch: macOS 14.8.1 (aarch64)

| Optimization | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| PRIMITIVE_FASTOPS | ListOpsBenchmarkTest::benchmarkSumInts | 324.804 | 144.908 | 2.24× | ObjList.sum fast path for int lists; generic fallback preserved |
| PRIMITIVE_FASTOPS | ListOpsBenchmarkTest::benchmarkContainsInts | 440.414 | 415.476 | 1.06× | ObjList.contains fast path when searching ObjInt in int list |

Inputs (3×):

list-sum OFF: 332.863417 | 323.491625 | 324.804083 → median 324.804083
list-sum ON: 144.907833 | 148.870792 | 126.418542 → median 144.907833
list-contains OFF: 440.413709 | 440.368333 | 441.4365 → median 440.413709
list-contains ON: 416.465292 | 412.283291 | 415.475833 → median 415.475833

Methodology:

Each test executed three times via Gradle; medians computed from [DEBUG_LOG] [BENCH] lines.
Changes are fully guarded by PerfFlags.PRIMITIVE_FASTOPS; semantics preserved (null on empty sum; generic fallback on mixed types).

Phase B — Ranges for-in lowering (PRIMITIVE_FASTOPS) — 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)

Environment:

JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
Gradle: 8.7
OS/Arch: macOS 14.8.1 (aarch64)

| Optimization | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| PRIMITIVE_FASTOPS | RangeBenchmarkTest::benchmarkIntRangeForIn | 1705.299 | 788.974 | 2.16× | Tight counted loop for (Int..Int) for-in; preserves semantics |

Inputs (3×):

range-for-in OFF: 1705.298958 | 1684.357708 | 1735.880917 → median 1705.298958
range-for-in ON: 794.178458 | 778.741834 | 788.973625 → median 788.973625

Methodology:

Each configuration executed three times via Gradle; medians computed from [DEBUG_LOG] [BENCH] lines.
Lowering is guarded by PerfFlags.PRIMITIVE_FASTOPS and applies only when the source is an ObjRange with int bounds; otherwise falls back to generic iteration.

Phase B — Regex caching (REGEX_CACHE) — 3× medians (OFF → ON)

Date: 2025-11-11 13:48 (local)

Environment:

JDK: OpenJDK 20.0.2.1 (Amazon Corretto 20.0.2.1+10-FR)
Gradle: 8.7
OS/Arch: macOS 14.8.1 (aarch64)

| Flag | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| REGEX_CACHE | RegexBenchmarkTest::benchmarkLiteralPatternMatches | 378.246 | 275.890 | 1.37× | Caches compiled regex for identical literal pattern per iteration |
| REGEX_CACHE | RegexBenchmarkTest::benchmarkDynamicPatternMatches | 514.944 | 229.006 | 2.25× | Two dynamic patterns alternate; cache size sufficient to retain both |

Inputs (1× run each):

regex-literal OFF: 378.245916; ON: 275.889541
regex-dynamic OFF: 514.944167; ON: 229.005834

Methodology:

Each benchmark toggles PerfFlags.REGEX_CACHE inside a single test and prints [DEBUG_LOG] timings for OFF and ON runs under identical conditions. Only one set of OFF/ON timings is recorded here, so the values in the table are single runs rather than 3× medians.
The cache is a tiny size-bounded map (64 entries) activated only when PerfFlags.REGEX_CACHE is true. Defaults remain OFF.

JIT tweaks (Round 1) — quick gains snapshot (locals, ranges, list ops)

Date: 2025-11-11 21:05 (local)

Scope: fast confirmation of overall gain using current configuration; focused on locals, ranges, and list ops. Each test prints OFF → ON timings in a single run. We executed the benches via Gradle with stdout enabled and single test fork.

Environment:

Gradle: 8.7 (stdout enabled, maxParallelForks=1)
JVM: as configured by toolchain for this project
OS/Arch: per developer machine (unchanged from prior sections)

Reproduce:

./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks

Results (representative runs; OFF → ON):

Local variables — LOCAL_SLOT_PIC + EMIT_FAST_LOCAL_REFS
- Run 1: 468.407 ms → 367.277 ms (≈ 1.28×)
- Run 2: 447.031 ms → 346.126 ms (≈ 1.29×)
Ranges for‑in — PRIMITIVE_FASTOPS (tight counted loop for (Int..Int))
- 1731.780 ms → 799.023 ms (≈ 2.17×)
List ops — PRIMITIVE_FASTOPS
- sum(int list): 318.943 ms → 148.571 ms (≈ 2.15×)
- contains(int in int list): 440.013 ms → 412.450 ms (≈ 1.07×)

Summary: All three areas improved with optimizations ON; no regressions observed in these runs. For publication‑grade stability, run each test 3× and report medians (see the earlier sections for the methodology and the previous median tables).

Additional tweaks — verification snapshot (Index write fast‑path, List literal pre‑size, Regex LRU)

Date: 2025-11-11 21:31 (local)

Scope: Implemented three semantics‑neutral optimizations and verified they are green across targeted and broader JVM benches.

What changed (guarded by flags where applicable):

RVAL_FASTPATH: Index write fast‑path
- IndexRef.setAt: direct path for ObjList + ObjInt (list[i] = value) mirrors the read fast‑path. Optional chaining semantics preserved; bounds exceptions propagate unchanged.
RVAL_FASTPATH: List literal pre‑sizing
- ListLiteralRef.get: pre‑counts element entries and uses ArrayList with capacity hint; for spreads of ObjList, uses ensureCapacity before bulk add. Evaluation order unchanged.
REGEX_CACHE: LRU‑like behavior
- RegexCache: emulates access‑order LRU within a tiny bounded map (MAX=64) by moving accessed entries to the tail; improves alternating‑pattern scenarios. Only active when PerfFlags.REGEX_CACHE is true.
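
The move-to-tail emulation of LRU described for RegexCache can be sketched as follows. `TinyRegexCache` is a hypothetical class written for this guide, not the actual RegexCache source; only the bound of 64 entries and the idea of moving accessed entries to the tail come from the description above. The sketch is not thread-safe.

```kotlin
// Hypothetical bounded cache that emulates access-order LRU on top of an
// insertion-ordered map.
class TinyRegexCache(private val maxSize: Int = 64) {
    // LinkedHashMap iterates in insertion order, so the first key is the
    // least recently used one as long as every access re-inserts its entry.
    private val map = LinkedHashMap<String, Regex>()

    val size: Int get() = map.size

    fun get(pattern: String): Regex {
        val cached = map.remove(pattern)
        val regex = cached ?: Regex(pattern)
        if (cached == null && map.size >= maxSize) {
            map.remove(map.keys.first())   // evict the least recently used entry
        }
        map[pattern] = regex               // (re)insert at the tail: most recently used
        return regex
    }
}

fun main() {
    val cache = TinyRegexCache(maxSize = 2)
    val a1 = cache.get("a+")
    cache.get("b+")
    val a2 = cache.get("a+")   // hit: same instance, moved to the tail
    cache.get("c+")            // evicts "b+", the least recently used
    println(a1 === a2)
    println(cache.size)
}
```

Removing and re-inserting on every hit is what keeps two alternating patterns resident: each access pushes the pattern back to the tail, away from the eviction point.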

Reproduce quick verification (1× runs):

./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RegexBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ConcurrencyCallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests MultiThreadPoolingStressJvmTest --rerun-tasks

Observation: All listed tests green in this cycle; no behavioral regressions observed. For the new paths (index write, list literal), performance was neutral‑to‑positive in smoke runs; Regex benches remained positive or neutral with the LRU behavior. For publication‑grade medians, extend to 3× per test as in earlier sections.

Sanity matrix (JVM) — quick OFF→ON runs

Date: 2025-11-11 21:59 (local)

Scope: Final Round 1 sanity sweep across JVM micro‑benches and stress tests to confirm that optimizations ON do not regress performance vs OFF in representative scenarios. Each benchmark prints [DEBUG_LOG] [BENCH] timings for OFF → ON within a single run. This section records a quick pass confirmation (not 3× medians) and reproduction commands.

Environment:

Gradle: 8.7 (stdout enabled, maxParallelForks=1)
JVM: as configured by the project toolchain
OS/Arch: macOS 14.x (aarch64)

Benches covered (all green; no regressions observed in these runs):

Calls/Args: CallBenchmarkTest, CallMixedArityBenchmarkTest (ARG_BUILDER)
PICs: PicBenchmarkTest (field/method); PicInvalidationJvmTest correctness reconfirmed
Expressions/Arithmetic: ExpressionBenchmarkTest, ArithmeticBenchmarkTest (RVAL_FASTPATH, PRIMITIVE_FASTOPS)
Ranges: RangeBenchmarkTest (PRIMITIVE_FASTOPS counted loop)
List ops: ListOpsBenchmarkTest (PRIMITIVE_FASTOPS specializations)
Regex: RegexBenchmarkTest (REGEX_CACHE with LRU behavior)
Locals: LocalVarBenchmarkTest (LOCAL_SLOT_PIC + FAST_LOCAL)
Concurrency/Pooling: ConcurrencyCallBenchmarkTest, DeepPoolingStressJvmTest, MultiThreadPoolingStressJvmTest (SCOPE_POOL per‑thread)

Reproduce (examples):

./gradlew :lynglib:jvmTest --tests CallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests CallMixedArityBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests PicInvalidationJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ExpressionBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ArithmeticBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ListOpsBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RegexBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests LocalVarBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests ConcurrencyCallBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests DeepPoolingStressJvmTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests MultiThreadPoolingStressJvmTest --rerun-tasks

Summary:

All listed tests passed in this sanity sweep.
For each benchmark's OFF → ON printouts examined during this pass, ON was equal to or faster than OFF; no case where ON was slower than OFF was observed.
For publication‑grade numbers, use the 3× medians methodology outlined earlier in this document. The existing median tables in previous sections remain representative, and the additional tweaks (Index write, List literal pre‑size, Regex LRU, Field PIC 4‑way + read→write reuse, mixed Int/Real fast‑ops) remained neutral‑to‑positive.

Quick snapshot — IndexRef PIC + negative miss cache (JVM) — 3× medians (OFF → ON)

Date: 2025-11-11 22:32 (local)

Scope

Confirm that the latest changes — IndexRef read/write PIC (stacked on RVAL_FASTPATH) and safe catch‑and‑cache negative entries for Field/Method PICs — do not regress performance. We collected 3× medians for the two expression sub‑benches that are most sensitive to RVAL paths and cross‑checked PICs and ranges.

Environment

Gradle: 8.7 (stdout enabled, maxParallelForks=1)
JVM: project toolchain default
OS/Arch: macOS 14.x (aarch64)

Results (3× medians)

| Area | Benchmark/Test | OFF median (ms) | ON median (ms) | Speedup | Notes |
|---|---|---|---|---|---|
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkListIndexReads | 304.282 | 229.168 | 1.33× | IndexRef direct fast‑path for ObjList+ObjInt; 4‑way Index PIC handles polymorphic cases |
| RVAL_FASTPATH | ExpressionBenchmarkTest::benchmarkFieldReadPureReceiver | 275.122 | 194.876 | 1.41× | Monomorphic, immutable receiver path; preserves visibility/optional semantics |

Cross‑checks (from the same session, 1× quick)

PicBenchmarkTest::benchmarkFieldGetSetPic — OFF 203.701 ms → ON 117.129 ms (≈1.74×)
PicBenchmarkTest::benchmarkMethodPic — OFF 280.806 ms → ON 202.613 ms (≈1.39×)
RangeBenchmarkTest::benchmarkIntRangeForIn — OFF 1762.425 ms → ON 806.898 ms (≈2.18×)

Reproduce

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkListIndexReads" --rerun-tasks

./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks
./gradlew :lynglib:jvmTest --tests "ExpressionBenchmarkTest.benchmarkFieldReadPureReceiver" --rerun-tasks

./gradlew :lynglib:jvmTest --tests PicBenchmarkTest --rerun-tasks
./gradlew :lynglib:jvmTest --tests RangeBenchmarkTest --rerun-tasks

Notes

Negative caches are installed only after a real miss throws (cache‑after‑miss), preserving error semantics and invalidation on layoutVersion changes.
IndexRef PIC augments the existing direct path and uses move‑to‑front promotion; it is keyed on (classId, layoutVersion) like other PICs.
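
A small inline cache keyed on (classId, layoutVersion) with move‑to‑front promotion can be sketched as below. `Shape` and `SmallPic` are hypothetical names, and the list-based storage is chosen for readability; the interpreter's PICs are presumably laid out more compactly.

```kotlin
// Hypothetical receiver shape: the key used by the PICs described in this guide.
data class Shape(val classId: Int, val layoutVersion: Int)

class SmallPic<T>(private val capacity: Int = 4) {
    private val entries = ArrayList<Pair<Shape, T>>(capacity)
    var hits = 0
        private set
    var misses = 0
        private set

    fun lookup(shape: Shape, resolve: (Shape) -> T): T {
        val index = entries.indexOfFirst { it.first == shape }
        if (index >= 0) {
            hits++
            val entry = entries.removeAt(index)
            entries.add(0, entry)            // move-to-front promotion
            return entry.second
        }
        misses++
        val value = resolve(shape)           // slow path: full lookup
        if (entries.size == capacity) entries.removeAt(entries.size - 1)  // drop the coldest entry
        entries.add(0, shape to value)
        return value
    }
}

fun main() {
    val pic = SmallPic<String>(capacity = 2)
    val a = Shape(classId = 1, layoutVersion = 0)
    val b = Shape(classId = 2, layoutVersion = 0)
    val c = Shape(classId = 3, layoutVersion = 0)
    for (shape in listOf(a, b, a, c, b)) pic.lookup(shape) { "slot-for-${it.classId}" }
    println("hits=${pic.hits} misses=${pic.misses}")
}
```

Because layoutVersion is part of the key, a layout change produces a different key: stale entries stop matching and age out of the cache without an explicit flush.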
