test262 Harness Contract

How GocciaScript runs the official TC39 test262 conformance suite, the contract the orchestrator guarantees, and the boundary between conformance failures, wrapper-infrastructure failures, and runner-level errors.

What test262 is, and why we run it

test262 is the TC39-maintained conformance suite for ECMAScript. ~50K tests cover every observable behavior in the language and built-ins, plus Intl, staging proposals, and the harness itself. We run it as an indicator metric and regression signal to track which spec corners GocciaScript implements, where the engine diverges, and how each PR moves those numbers. The generated reports are the source of truth for ECMAScript compatibility status.

For the architectural rationale behind the current LoaderBare-plus-stock-harness setup, see ADR 0042.

Executive summary

Test262 runs via GocciaScriptLoaderBare, never via GocciaTestRunner, so wrapper bodies execute inside a neutral engine without expect, describe, test, lifecycle hooks, mocks, or runTests.
Stock tc39/test262 harness files are loaded from the pinned test262 checkout's harness/ directory. The only bundled harness file is scripts/test262_harness/$262.js, the host-provided hook object.
The orchestrator passes GocciaScriptLoaderBare --test262-host so those host hooks are available only during conformance runs.
Test feature metadata may add explicit engine options when test262 splits proposal layers more narrowly than the base engine flag set.
The orchestrator judges each test by process exit code plus stdout markers, the same convention test262-harness, eshost, and test262.fyi use.
Wrapper-infrastructure failures are classified separately from conformance failures and gated to zero in CI.
CI uploads test262-results.json on every PR and main run. Main runs also publish the report to Vercel Blob when BLOB_READ_WRITE_TOKEN is configured, and the website compatibility dashboard reads those durable reports at request time with CDN caching.
Main CI also publishes full-corpus test262 profile reports for performance review. These profiles are retained separately from compatibility JSON and are reviewed from the aggregate first, with detailed profiles reserved for investigation.

Website dashboard

The public compatibility dashboard lives at /compatibility. It reads the latest available main-branch test262 report for each UTC day, renders pass-rate and runtime timelines, shows the top-level test262 category split, and ranks the five least-covered path groups from the latest report. The "JSON result" link on the dashboard points back to the exact report used for the latest view.

Dashboard data is built from Vercel Blob at request time and served with CDN caching. Main CI publishes future dashboard points directly to Blob. To seed historical data, run the one-off backfill command: it first copies any still-retained main-branch GitHub artifact reports to Blob, then reruns historical main commits for days whose artifacts have expired.

cd website
BLOB_READ_WRITE_TOKEN=<vercel-blob-token> \
GITHUB_ARTIFACT_TOKEN="$(gh auth token)" \
bun run backfill-test262

The backfill defaults to --since=2026-05-01 and today's UTC date; pass --since=YYYY-MM-DD or --until=YYYY-MM-DD only when intentionally narrowing the range. It stores immutable run reports under test262/runs/ and writes one daily pointer under test262/daily/YYYY-MM-DD.json for the latest published main run on each UTC day. Website builds do not download GitHub Actions artifact ZIPs, read Blob, or bake test262 data into the deployment output. The /compatibility page and /api/test262/* routes read the daily pointers and per-run reports from Blob on request, using short CDN/server caching because new results arrive only a few times per day.

Configure the Vercel project so both Preview and Production deployments have access to the Blob store at runtime. CI and one-off backfills still need the GitHub repository secret BLOB_READ_WRITE_TOKEN because they publish new reports from GitHub Actions rather than from inside the Vercel project.

Profile report contract

Main CI publishes a full-corpus bytecode profile for the existing test262 job without changing PR CI. The profile is a performance-review artifact, not the compatibility dashboard input and not a conformance gate.

The GitHub Actions artifact is named test262-profile. It contains:

test262-profile-aggregate.json: the review entry point, with run provenance, corpus settings, summary counts, runtime totals, path-group rollups, top opcode histograms, hot opcode pairs, scalar fast-path rates, function self-time/allocation summaries, and links or identifiers for the detailed profiles that explain each aggregate row.
test262-profile-aggregate.md: a human-readable review summary with the same provenance header and ranked tables.
test262-profile-details/: detailed profile JSON files used only after the aggregate points to a hotspot or regression worth investigating.

Main runs also publish the same profile data to Vercel Blob under the separate test262-profiles/ namespace. That namespace is intentionally distinct from the compatibility dashboard's test262/ namespace: test262/ stores conformance reports and daily pointers, while test262-profiles/ stores profile aggregates, Markdown summaries, compressed detail archives, and profile-specific daily pointers for retained main-run review. The default paths are test262-profiles/runs/<artifactId>/aggregate.json.gz, test262-profiles/runs/<artifactId>/summary.md, test262-profiles/runs/<artifactId>/details.tar.gz, and test262-profiles/daily/<YYYY-MM-DD>.json.
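As a rough illustration of the default path layout listed above, the helper below assembles the four Blob keys for one run. The function name, its return shape, and the example artifact ID are hypothetical; only the path patterns come from this document.

```typescript
// Hypothetical helper: builds the default test262-profiles/ keys for one main run.
interface ProfileBlobPaths {
  aggregate: string;
  summary: string;
  details: string;
  dailyPointer: string;
}

function profileBlobPaths(artifactId: string, runDate: Date): ProfileBlobPaths {
  // toISOString() is always UTC, so the first 10 characters are the UTC day.
  const utcDay = runDate.toISOString().slice(0, 10);
  const runPrefix = `test262-profiles/runs/${artifactId}`;
  return {
    aggregate: `${runPrefix}/aggregate.json.gz`,
    summary: `${runPrefix}/summary.md`,
    details: `${runPrefix}/details.tar.gz`,
    dailyPointer: `test262-profiles/daily/${utcDay}.json`,
  };
}

// "example-artifact" is a placeholder, not a real artifact ID.
const examplePaths = profileBlobPaths("example-artifact", new Date(Date.UTC(2026, 4, 1)));
console.log(examplePaths.dailyPointer);
```

Deriving the day from toISOString keeps the pointer on the UTC calendar regardless of the machine's local time zone.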

Reviewers should compare the latest aggregate with the previous weekly profile and nearby main-run profiles before opening detail files. Detailed profiles are for explaining a ranked finding, such as a newly hot opcode pair, unexpectedly low scalar fast-path hit rate, high allocation path group, or call-frame-heavy feature area. Findings should turn the observed corpus behavior into concrete compiler, bytecode, AST, parser, value-boxing, property-access, or call-frame recommendations.

Architecture

scripts/run_test262_suite.ts
  → discover tests under suite/test/{built-ins,harness,intl402,language,staging}
  → for each test (parallel pool, --jobs=N):
      → read frontmatter, classify by phase (parse / runtime / positive)
      → build source = (stock harness includes) + body, with a tiny
        marker-emitting wrapper for most negative-runtime tests
      → spawn ./build/GocciaScriptLoaderBare --test262-host with stdin = source
      → capture (exitCode, stdout, stderr)
      → classify into PASS / FAIL / WRAPPER_INFRA / TIMEOUT
  → aggregate per top-level category
  → emit JSON, console summary, GitHub Step Summary table
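
The "parallel pool, --jobs=N" step in the flow above could be implemented in several ways. The sketch below is one minimal bounded-concurrency pool, not the orchestrator's actual code; runPool and its parameters are hypothetical names.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `jobs` calls in flight,
// returning results in input order.
async function runPool<T, R>(
  items: readonly T[],
  jobs: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unclaimed index. The claim is synchronous,
  // so two lanes can never pick the same item.
  const lane = async (): Promise<void> => {
    while (true) {
      const index = next++;
      if (index >= items.length) return;
      results[index] = await worker(items[index], index);
    }
  };

  const laneCount = Math.max(1, Math.min(jobs, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}
```

In a real runner the worker would spawn one engine subprocess per test; this sketch rejects on the first worker error, so a production pool would likely catch per-test failures inside the worker instead.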

Wire protocol

Test kind	Pass signal	Fail signal
Sync positive	exit 0	exit non-zero (stderr is the diagnostic)
Async positive	stdout contains `Test262:AsyncTestComplete`	stdout contains `Test262:AsyncTestFailure:<name>: <msg>`, OR no marker before timeout, OR engine exits before $DONE
Negative runtime	stdout contains `Test262:NegativeTestError:<expected-type>`	`Test262:NegativeTestNoError`, OR `Test262:NegativeTestError:<other>`
Top-level negative runtime	exit non-zero and stderr starts with the expected error type	exit 0, OR stderr starts with another error type
Negative parse	exit non-zero (parse failed as expected)	exit 0 (parse succeeded)

Async markers are emitted by stock test262 doneprintHandle.js via $DONE; the runner only scans stdout for those strings. The negative-runtime markers (Test262:NegativeTestError:... / Test262:NegativeTestNoError) are the only Goccia-specific marker addition; see "Wrapper templates" below. Runtime-negative Script tests that depend on global declaration instantiation run as top-level source instead, because wrapping them in try { ... } would introduce a block scope and change the declaration semantics under test.
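
The pass column of the wire-protocol table can be read as a small decision function. The following is a sketch of that reading only; the type names and passedWireProtocol are hypothetical, and timeout or wrapper-infrastructure handling is deliberately left out.

```typescript
type TestKind =
  | "sync-positive"
  | "async-positive"
  | "negative-runtime"
  | "top-level-negative-runtime"
  | "negative-parse";

interface RunOutput {
  exitCode: number | null; // null when the process was killed by a signal
  stdout: string;
  stderr: string;
}

// Returns true only when the pass signal for `kind` is present.
// `expectedError` is the error type from the test's negative frontmatter.
function passedWireProtocol(kind: TestKind, out: RunOutput, expectedError?: string): boolean {
  const exitedNonZero = out.exitCode !== null && out.exitCode !== 0;
  switch (kind) {
    case "sync-positive":
      return out.exitCode === 0;
    case "async-positive":
      return out.stdout.includes("Test262:AsyncTestComplete");
    case "negative-runtime": {
      const marker = /^Test262:NegativeTestError:(.*)$/m.exec(out.stdout);
      return marker !== null && expectedError !== undefined && marker[1].trim() === expectedError;
    }
    case "top-level-negative-runtime":
      return exitedNonZero && expectedError !== undefined && out.stderr.trimStart().startsWith(expectedError);
    case "negative-parse":
      return exitedNonZero;
  }
}
```

Matching the negative-runtime marker on a whole line, rather than with a substring search, keeps an expected TypeError from being satisfied by a longer name that merely starts with it.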

Wrapper templates

All four template kinds are produced by buildTestSource in scripts/run_test262_suite.ts. Bodies for positive sync, positive async, empty, and script-scope tests are all harness + body — identical shape, no special wrapping. Tests flagged onlyStrict receive a "use strict" directive prefix before the harness so the runtime uses strict directive semantics while the runner can still enable compatibility-gated parser support for Script tests.

Positive (sync, async, empty, script-scope)

{harness_source}
{body}

Negative runtime

{harness_source}
try {
{body}
  print("Test262:NegativeTestNoError");
} catch (__gocciaT262_e) {
  var __gocciaT262_n = "unknown";
  if (__gocciaT262_e && typeof __gocciaT262_e === "object") {
    if (__gocciaT262_e.constructor && __gocciaT262_e.constructor.name) {
      __gocciaT262_n = __gocciaT262_e.constructor.name;
    }
  }
  print("Test262:NegativeTestError:" + __gocciaT262_n);
}

Top-level global-code runtime negatives that expect SyntaxError use the positive template shape instead:

{harness_source}
{body}

The runner then checks the subprocess exit and stderr error type directly.

The error-class identification uses e.constructor.name, matching the spec-visible constructor/prototype path. If this cannot identify the thrown error class, the test is treated as an engine failure rather than papered over by the harness.

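The bindings __gocciaT262_e (catch parameter) and __gocciaT262_n (var declared inside the catch block) are the only Goccia-specific identifiers in the generated source. The catch parameter is scoped to the catch block. Under standard ECMAScript hoisting the var declaration belongs to the enclosing script scope rather than to the block, so it is the __gocciaT262_ prefix, not block scoping, that keeps it from colliding with body-level vars.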

Negative parse

{body}

Body alone. The parser runs, fails (or doesn't), and the orchestrator reads the exit code.
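
Taken together, the templates above amount to a small string-assembly function. The sketch below shows one way that could look; buildTestSourceSketch and TemplateKind are hypothetical names, the onlyStrict prefix is omitted, and the real buildTestSource may differ in detail.

```typescript
type TemplateKind = "positive" | "negative-runtime" | "top-level-negative-runtime" | "negative-parse";

// Tail of the negative-runtime wrapper, copied from the template shown above.
const NEGATIVE_RUNTIME_TAIL = [
  '  print("Test262:NegativeTestNoError");',
  "} catch (__gocciaT262_e) {",
  '  var __gocciaT262_n = "unknown";',
  '  if (__gocciaT262_e && typeof __gocciaT262_e === "object") {',
  "    if (__gocciaT262_e.constructor && __gocciaT262_e.constructor.name) {",
  "      __gocciaT262_n = __gocciaT262_e.constructor.name;",
  "    }",
  "  }",
  '  print("Test262:NegativeTestError:" + __gocciaT262_n);',
  "}",
].join("\n");

function buildTestSourceSketch(kind: TemplateKind, harnessSource: string, body: string): string {
  switch (kind) {
    case "positive":
    case "top-level-negative-runtime":
      // Harness followed by the unmodified body; no wrapper, so no extra block scope.
      return `${harnessSource}\n${body}`;
    case "negative-runtime":
      return `${harnessSource}\ntry {\n${body}\n${NEGATIVE_RUNTIME_TAIL}`;
    case "negative-parse":
      // Body alone: the harness is not needed to observe a parse failure.
      return body;
  }
}
```

Keeping the wrapper tail as data makes it easy to confirm that the generated source introduces only the two __gocciaT262_ identifiers.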

Failure classification

Source	Surface	Counts as
Body assertion fails (`Test262Error` thrown)	exit 1, stderr has formatted error	conformance fail
Body throws other Error	exit 1, stderr captures the error	conformance fail
Body throws non-Error (`undefined`, `null`, etc.)	exit 1, stderr carries the formatted value	conformance fail
Async body never calls `$DONE`	no `Test262:Async*` marker, exit 0	conformance fail
Engine killed by signal (SIGSEGV, OOM)	`signalCode != null` or exit > 1	wrapper infra
Pascal-side error (`EAccessViolation`, `ESocket`, …)	stderr starts with Pascal class name	wrapper infra
Engine cooperative timeout	`GocciaScriptLoaderBare --test262-host` exits 124 and emits `GocciaTest262:Timeout:<ms>`	timeout
Per-test wall-clock timeout	`setTimeout(() => ac.abort(), wallClockMs)` fired (signalCode SIGTERM/SIGKILL)	timeout
Negative-runtime catch path itself crashes	no marker emitted at all	wrapper infra

The classifier (classifyRunResult in run_test262_suite.ts) reads exit code, signal, stdout, and stderr to pick the bucket. The PASCAL_INFRA_RE regex deliberately excludes the bare Error: prefix because Goccia's bytecode mode uses it for legitimate JS errors — the prefix alone is not a wrapper-infra signal.

wrapper_infra_failures is gated to zero in CI. Any non-zero count fails the run because the conformance numbers are not trustworthy when the wrapper itself is broken.
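
The table leaves the order of checks implicit, and the order matters: exit 124 is also greater than 1, and a wall-clock kill also sets a signal code. The sketch below picks one plausible precedence (timeouts first). The names, the check order, and the regex are all assumptions rather than the real classifyRunResult or PASCAL_INFRA_RE.

```typescript
type InfraBucket = "WRAPPER_INFRA" | "TIMEOUT";

interface RunResult {
  exitCode: number | null;
  signalCode: string | null;
  stdout: string;
  stderr: string;
  wallClockTimedOut: boolean; // set by the runner when its own abort timer fired
}

// Assumed pattern: Pascal exception classes look like E + UpperCamelCase (EAccessViolation).
// "Error:" and JS names such as "EvalError" do not match because their second letter is lowercase.
const PASCAL_INFRA_SKETCH_RE = /^E[A-Z][A-Za-z0-9]*\b/;

// Returns a bucket for timeouts and wrapper-infra failures, or null when the result
// should go on to ordinary pass/fail evaluation.
function classifyInfraSketch(result: RunResult, expectsNegativeMarker: boolean): InfraBucket | null {
  if (result.wallClockTimedOut) return "TIMEOUT";
  const output = result.stdout + result.stderr;
  if (result.exitCode === 124 && output.includes("GocciaTest262:Timeout:")) return "TIMEOUT";
  if (result.signalCode !== null) return "WRAPPER_INFRA";
  if (result.exitCode !== null && result.exitCode > 1) return "WRAPPER_INFRA";
  if (PASCAL_INFRA_SKETCH_RE.test(result.stderr.trimStart())) return "WRAPPER_INFRA";
  if (expectsNegativeMarker && !result.stdout.includes("Test262:NegativeTest")) return "WRAPPER_INFRA";
  return null;
}
```

Returning null instead of a PASS or FAIL verdict keeps infrastructure detection separate from the wire-protocol check, so a broken wrapper is never counted as a conformance result.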

Visibility invariants

Bodies see only the identifiers stock test262 expects:

Test262Error (from sta.js)
assert and its methods (from assert.js)
$DONE, $DONOTEVALUATE (from sta.js / doneprintHandle.js when included)
print (Goccia engine global; stock doneprintHandle.js uses it for async markers)
$262 helpers implemented by Goccia's bundled host object: detachArrayBuffer, evalScript, gc, global, createRealm, AbstractModuleSource, and IsHTMLDDA
$262.agent helpers used by Atomics tests: start, broadcast, receiveBroadcast, report, getReport, sleep, monotonicNow, and leaving
test262Host, a private marker on the Goccia namespace exposed only by GocciaScriptLoaderBare --test262-host so $262.js can reject accidental use outside the conformance host.
Anything declared in test-included harness files (e.g. compareArray, propertyHelper)

Bodies do NOT see:

expect, describe, test, it, beforeAll, beforeEach, afterEach, afterAll, onTestFinished, runTests, mock, spyOn — none of these exist on the Bare engine.
console, fetch, URL, performance — Bare doesn't register Goccia's runtime extension.
Optional $262 hooks outside the bundled implementation. Tests that depend on unavailable host hooks fail honestly.

Bundled harness adaptations

The orchestrator loads stock tc39/test262 harness files from the pinned checkout's harness/ directory. BUNDLED_INCLUDES contains only $262.js, because $262 is not a stock harness helper: it is the host-provided object test262 expects engines to supply.

Bundling rule: do not add compatibility copies of stock harness files. If a stock helper fails, fix the language/runtime behavior or classify the test as a genuine conformance failure. $262.js may grow only by adding test262 host hooks or harness-local $262 behavior.

$262.evalScript() and $262.createRealm() delegate to flagged test262 host hooks exposed by the bare loader. evalScript parses and executes its source as script code in the current realm. createRealm creates a fresh Goccia engine/realm, returns a host record with the child realm's globalThis, its own evalScript, and createRealm, and keeps the child engine alive for the duration of the test run so cross-realm intrinsics remain valid. The hooks are only exposed when GocciaScriptLoaderBare --test262-host is enabled.

eval is also host-gated. GocciaScriptLoaderBare --test262-host installs the official test262 host eval only for conformance runs; default Bare execution does not expose it. Bytecode direct calls to that host eval preserve the caller realm and caller lexical bindings, while shadowed or indirect eval calls use ordinary function-call semantics.

$262.agent is host-gated in the same way. Agent start runs the supplied source in a test262-host-enabled bare-loader thread with the bundled $262 object installed; broadcast/receiveBroadcast share the provided value with agent threads, and report/getReport provide the report queue used by Atomics wait/notify tests.

Strict mode

GocciaScript recognizes strict directive prologues at execution time for the compatibility behaviors that otherwise depend on Script non-strict mode. Independently, the engine's curated default semantics enforce most strict-mode behaviors statically:

Implicit globals throw ReferenceError (sloppy would create a global)
delete <identifier> and non-configurable property deletion throw by default

The orchestrator enables --compat-non-strict-mode per test, not globally: Script tests receive it, while module tests stay strict. onlyStrict Script tests also receive the flag, but the injected directive keeps with, non-strict assignment failures, legacy delete return values, and regular-function nullish this coercion on the strict path. Remaining noStrict tests rely on sloppy-only behaviors that GocciaScript still does not provide and fail naturally as ordinary conformance failures, not as wrapper-infra failures.

The runner passes syntax compatibility flags such as --compat-traditional-for-loop, --compat-for-in-loop, --compat-while-loops, --compat-label, and the semantic --compat-arguments-object flag unconditionally because test262 uses those forms across both harness helpers and test bodies. The test's source type and strictness still decide strict-mode semantics; --compat-arguments-object only enables the implicit arguments binding. Strictness and parameter-list shape decide whether that binding is unmapped or mapped.
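
A sketch of how the per-test flag selection described above might be expressed. The function and interface names are hypothetical, the unconditional list contains only the flags named in this section (the real set may be longer), and the exact directive string is an assumption. Module tests are identified by the module entry in the test262 frontmatter flags.

```typescript
interface FrontmatterFlags {
  flags: readonly string[]; // test262 frontmatter flags, e.g. "module", "onlyStrict", "noStrict"
}

// Only the flags this section names; passed for every test.
const UNCONDITIONAL_COMPAT_FLAGS: readonly string[] = [
  "--compat-traditional-for-loop",
  "--compat-for-in-loop",
  "--compat-while-loops",
  "--compat-label",
  "--compat-arguments-object",
];

// Script tests additionally get --compat-non-strict-mode; module tests stay strict.
function compatFlagsFor(meta: FrontmatterFlags): string[] {
  const isModule = meta.flags.includes("module");
  return isModule
    ? [...UNCONDITIONAL_COMPAT_FLAGS]
    : [...UNCONDITIONAL_COMPAT_FLAGS, "--compat-non-strict-mode"];
}

// onlyStrict tests get a directive placed ahead of the harness source.
function strictPrefixFor(meta: FrontmatterFlags): string {
  return meta.flags.includes("onlyStrict") ? '"use strict";\n' : "";
}
```

Note that an onlyStrict Script test receives both the flag and the directive, which matches the description above: the flag unlocks parser support while the directive keeps runtime semantics strict.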

Source-phase import feature flags

The runner reads the test262 features frontmatter when a feature maps to a Goccia engine option. source-phase-imports tests run with the base test262 flag set only. Tests that also declare source-phase-imports-module-source receive --experimental-js-module-source, which enables JavaScript ModuleSource objects for the separate ESM Phase Imports proposal. This is not an eligibility filter: every discovered test still runs, and feature metadata only changes the host options used for that test.
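
One way to picture this mapping is a lookup table from feature name to extra engine options. The sketch below assumes such a table; FEATURE_ENGINE_OPTIONS and engineOptionsForFeatures are hypothetical names, and only the single mapping described above is included.

```typescript
// Feature names that add engine options. Features absent from the map add nothing,
// so "source-phase-imports" alone runs with the base flag set.
const FEATURE_ENGINE_OPTIONS = new Map<string, readonly string[]>([
  ["source-phase-imports-module-source", ["--experimental-js-module-source"]],
]);

// Collects the extra options for a test's declared features, without duplicates.
// It never filters tests out: an unknown feature simply contributes no options.
function engineOptionsForFeatures(features: readonly string[]): string[] {
  const options: string[] = [];
  for (const feature of features) {
    for (const option of FEATURE_ENGINE_OPTIONS.get(feature) ?? []) {
      if (!options.includes(option)) options.push(option);
    }
  }
  return options;
}
```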

Path normalization

Test IDs are stored as POSIX-style relative paths under suite/test/. On Windows the filesystem returns backslashes; the orchestrator normalizes to forward slashes via normalizeTestId(id) at every site that uses an ID (glob-match, reporting, baseline lookup) so the same test produces the same ID on both platforms.
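
For illustration, a normalizer with this behavior can be a single replace. The body below is a guess at the simplest implementation and may not match the real normalizeTestId.

```typescript
// Converts Windows-style separators to POSIX-style so IDs compare equal across platforms.
function normalizeTestId(id: string): string {
  return id.replace(/\\/g, "/");
}

// The same test discovered on Windows and on Linux yields one ID.
const fromWindows = normalizeTestId("built-ins\\Array\\from\\length.js");
const fromPosix = normalizeTestId("built-ins/Array/from/length.js");
console.log(fromWindows === fromPosix);
```

The function is idempotent, so calling it at every site that touches an ID is safe even when the ID was already normalized.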

Test corpus

Default categories: built-ins, harness, intl402, language, staging (everything except annexB).

annexB (legacy/deprecated browser-only behavior) is excluded from the default run and is not a pre-1.0 release target. Some Annex B surfaces are implemented anyway (e.g. String.prototype.substr, __proto__, __defineGetter__) because they already support current compatibility work or layer cleanly as shims. Treat Annex B results as informational unless a future Web API/browser-compatibility profile deliberately re-opens the policy. See ADR 0085.
intl402 covers Intl APIs (ECMA-402); tests exercise Intl.getCanonicalLocales, constructors, and formatting operations.
staging covers forward-looking proposals and acts as an engine-readiness signal.
harness verifies test262's own harness functions work under our engine.

There is no eligibility filter. Every discovered test runs. Tests that depend on missing features fail with a real diagnostic, not an invisible skip. Per-test subprocess + --timeout + --max-memory bound the blast radius of any individual hang or OOM.

Known engine crashes

A small KNOWN_ENGINE_CRASHES set in scripts/run_test262_suite.ts skips tests that are known to crash the engine at the native level (SIGSEGV / SIGBUS). Such crashes are not catchable by the per-test timeout, are not representative of conformance failures, and would otherwise inflate wrapper_infra_failures indefinitely. Each entry is paired with a GitHub issue tracking the underlying engine bug; remove the entry once the bug is fixed.

This list is the only allowed form of test-skipping in the harness. Do not rebuild a generic eligibility filter (the structural blast-radius control is per-test subprocess + --timeout + --max-memory, not pre-execution exclusion).

Currently empty — there are no native-level crashes being skipped.

Updating the contract

Changes to buildTestSource in scripts/run_test262_suite.ts are verified by the full conformance run itself (no separate regression suite). After any wrapper-template change:

Run locally:

./build.pas loaderbare
bun scripts/run_test262_suite.ts --suite-dir <checkout> \
  --output local-results.json

Confirm wrapper_infra_failures: 0 in the summary.
Diff local-results.json against the prior baseline; investigate any shift in pass/fail counts, in either direction, before opening the PR.
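
The baseline comparison in the last step can be scripted. The sketch below assumes a results shape with per-category passed and failed counts; that shape, and every name here except wrapper_infra_failures, is hypothetical and should be adapted to the actual JSON the runner emits.

```typescript
// Assumed (not confirmed) shape of a results file.
interface CategoryCounts { passed: number; failed: number }
interface ResultsSketch {
  wrapper_infra_failures: number;
  categories: Record<string, CategoryCounts>;
}
interface CategoryDelta { category: string; passedDelta: number; failedDelta: number }

// Lists every category whose pass or fail count moved between baseline and current.
function diffResults(baseline: ResultsSketch, current: ResultsSketch): CategoryDelta[] {
  const zero: CategoryCounts = { passed: 0, failed: 0 };
  const names = Object.keys(baseline.categories)
    .concat(Object.keys(current.categories))
    .filter((name, index, all) => all.indexOf(name) === index)
    .sort();
  return names
    .map((category) => {
      const before = baseline.categories[category] ?? zero;
      const after = current.categories[category] ?? zero;
      return {
        category,
        passedDelta: after.passed - before.passed,
        failedDelta: after.failed - before.failed,
      };
    })
    .filter((delta) => delta.passedDelta !== 0 || delta.failedDelta !== 0);
}
```

An empty result together with wrapper_infra_failures equal to 0 would indicate the template change left the corpus outcome untouched under this assumed shape.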

Updating the SHA pin

The test262 SHA is pinned in scripts/test262-suite-sha.txt. Both .github/workflows/ci.yml and .github/workflows/pr.yml read that file before checking out tc39/test262, so the cached main baseline and a PR run measure the same upstream corpus without weekly bump PRs needing to modify workflow files. The weekly cron at .github/workflows/test262-bump.yml opens a PR every Monday with the latest tc39/test262 main SHA; the PR's standard CI run posts the per-category delta vs. the previous main baseline. Merge once the delta is acceptable.

Manual bump: bun scripts/test262-bump-pin.ts <40-hex-sha>.