test262 Harness Contract
How GocciaScript runs the official TC39 test262 conformance suite, the contract the orchestrator guarantees, and the boundary between conformance failures, wrapper-infrastructure failures, and runner-level errors.
What test262 is, and why we run it#
test262 is the TC39-maintained conformance suite for ECMAScript. ~50K tests cover every observable behavior in the language and built-ins, plus Intl, staging proposals, and the harness itself. We run it as an indicator metric and regression signal to track which spec corners GocciaScript implements, where the engine diverges, and how each PR moves those numbers. The generated reports are the source of truth for ECMAScript compatibility status.
For the architectural rationale behind the current LoaderBare-plus-stock-harness setup, see the Decision log entry dated 2026-05-04.
Executive summary#
- Test262 runs via GocciaScriptLoaderBare, never via GocciaTestRunner, so
wrapper bodies execute inside a neutral engine without expect, describe, test, lifecycle hooks, mocks, or runTests.
- Stock tc39/test262 harness files are loaded from the pinned test262
checkout's harness/ directory. The only bundled harness file is scripts/test262_harness/$262.js, the host-provided hook object.
- The orchestrator passes GocciaScriptLoaderBare --test262-host so those
host hooks are available only during conformance runs.
- Test feature metadata may add explicit engine options when test262 splits
proposal layers more narrowly than the base engine flag set.
- The orchestrator drives via process exit code + stdout markers, the
same convention test262-harness/eshost/test262.fyi use.
- Wrapper-infrastructure failures are classified separately from
conformance failures and gated to zero in CI.
- CI uploads test262-results.json on every PR and main run. Main runs also
publish the report to Vercel Blob when BLOB_READ_WRITE_TOKEN is configured, and the website compatibility dashboard reads those durable reports at request time with CDN caching.
Website dashboard#
The public compatibility dashboard lives at /compatibility. It reads the latest available main-branch test262 report for each UTC day, renders pass-rate and runtime timelines, shows the top-level test262 category split, and ranks the five least-covered path groups from the latest report. The "JSON result" link on the dashboard points back to the exact report used for the latest view.
Dashboard data is built from Vercel Blob at request time and served with CDN caching. Main CI publishes future dashboard points directly to Blob. To seed historical data, run the one-off backfill command: it first copies any still-retained main-branch GitHub artifact reports to Blob, then reruns historical main commits for days whose artifacts have expired.
cd website
BLOB_READ_WRITE_TOKEN=<vercel-blob-token> \
GITHUB_ARTIFACT_TOKEN="$(gh auth token)" \
bun run backfill-test262

The backfill defaults to --since=2026-05-01 and today's UTC date; pass --since=YYYY-MM-DD or --until=YYYY-MM-DD only when intentionally narrowing the range. It stores immutable run reports under test262/runs/ and writes one daily pointer under test262/daily/YYYY-MM-DD.json for the latest published main run on each UTC day. Website builds do not download GitHub Actions artifact ZIPs, read Blob, or bake test262 data into the deployment output. The /compatibility page and /api/test262/* routes read the daily pointers and per-run reports from Blob on request, using short CDN/server caching because new results arrive only a few times per day.
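The daily-pointer layout can be sketched as follows. This is a minimal illustration, not the backfill's actual code: dailyPointerPath is a hypothetical helper name, and only the test262/daily/YYYY-MM-DD.json key shape comes from the description above.

```typescript
// Hypothetical helper: derives the daily-pointer key for a run timestamp.
// Only the "test262/daily/YYYY-MM-DD.json" layout is taken from the text.
function dailyPointerPath(runFinishedAt: Date): string {
  // toISOString() always reports UTC, so the first 10 characters are the
  // UTC calendar day regardless of the machine's local time zone.
  const utcDay = runFinishedAt.toISOString().slice(0, 10);
  return `test262/daily/${utcDay}.json`;
}

// Two runs on the same UTC day map to the same pointer key.
const morning = dailyPointerPath(new Date("2026-05-04T01:15:00Z"));
const evening = dailyPointerPath(new Date("2026-05-04T23:59:59Z"));
console.log(morning, morning === evening);
// test262/daily/2026-05-04.json true
```

Keying by UTC day rather than local day means a CI runner and a developer laptop in different time zones resolve the same run to the same pointer.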
Configure the Vercel project so both Preview and Production deployments have access to the Blob store at runtime. CI and one-off backfills still need the GitHub repository secret BLOB_READ_WRITE_TOKEN because they publish new reports from GitHub Actions rather than from inside the Vercel project.
Architecture#
scripts/run_test262_suite.ts
  → discover tests under suite/test/{built-ins,harness,intl402,language,staging}
  → for each test (parallel pool, --jobs=N):
      → read frontmatter, classify by phase (parse / runtime / positive)
      → build source = (stock harness includes) + body, with a tiny
        marker-emitting wrapper for negative-runtime
      → spawn ./build/GocciaScriptLoaderBare --test262-host with stdin = source
      → capture (exitCode, stdout, stderr)
      → classify into PASS / FAIL / WRAPPER_INFRA / TIMEOUT
  → aggregate per top-level category
  → emit JSON, console summary, GitHub Step Summary table

Wire protocol#
| Test kind | Pass signal | Fail signal |
|---|---|---|
| Sync positive | exit 0 | exit non-zero (stderr is the diagnostic) |
| Async positive | stdout contains Test262:AsyncTestComplete | stdout contains Test262:AsyncTestFailure:<name>: <msg>, OR no marker before timeout, OR engine exits before $DONE |
| Negative runtime | stdout contains Test262:NegativeTestError:<expected-type> | Test262:NegativeTestNoError, OR Test262:NegativeTestError:<other> |
| Negative parse | exit non-zero (parse failed as expected) | exit 0 (parse succeeded) |
Async markers are emitted by stock test262 doneprintHandle.js via $DONE; the runner only scans stdout for those strings. The negative-runtime markers (Test262:NegativeTestError:... / Test262:NegativeTestNoError) are the only Goccia-specific marker addition; see "Wrapper templates" below.
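The pass column of the table can be read as a small predicate over the exit code and stdout. The sketch below is illustrative only: passesWireProtocol, TestKind, and RunOutput are hypothetical names, and the real runner's logic lives in classifyRunResult, which is not reproduced here.

```typescript
type TestKind = "sync-positive" | "async-positive" | "negative-runtime" | "negative-parse";

interface RunOutput {
  exitCode: number;
  stdout: string;
}

// Hypothetical helper: true when the pass signal from the table is present.
// expectedErrorType is only consulted for negative-runtime tests.
function passesWireProtocol(kind: TestKind, out: RunOutput, expectedErrorType = ""): boolean {
  switch (kind) {
    case "sync-positive":
      return out.exitCode === 0;
    case "async-positive":
      return out.stdout.includes("Test262:AsyncTestComplete");
    case "negative-runtime": {
      // Whole-line comparison (a choice made for this sketch) so that an
      // expected "TypeError" does not match a longer class name.
      const marker = `Test262:NegativeTestError:${expectedErrorType}`;
      return out.stdout.split("\n").some((line) => line.trim() === marker);
    }
    case "negative-parse":
      // The parse is expected to fail, so a non-zero exit is the pass signal.
      return out.exitCode !== 0;
  }
}

console.log(
  passesWireProtocol("negative-runtime", { exitCode: 0, stdout: "Test262:NegativeTestError:TypeError\n" }, "TypeError"),
);
// true
```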
Wrapper templates#
All four template kinds are produced by buildTestSource in scripts/run_test262_suite.ts. Bodies for positive sync, positive async, empty, and script-scope tests are all harness + body — identical shape, no special wrapping. Tests flagged onlyStrict receive a "use strict" directive prefix before the harness so the runtime uses strict directive semantics while the runner can still enable compatibility-gated parser support for Script tests.
Positive (sync, async, empty, script-scope)#
{harness_source}
{body}

Negative runtime#
{harness_source}
try {
  {body}
  print("Test262:NegativeTestNoError");
} catch (__gocciaT262_e) {
  var __gocciaT262_n = "unknown";
  if (__gocciaT262_e && typeof __gocciaT262_e === "object") {
    if (__gocciaT262_e.constructor && __gocciaT262_e.constructor.name) {
      __gocciaT262_n = __gocciaT262_e.constructor.name;
    }
  }
  print("Test262:NegativeTestError:" + __gocciaT262_n);
}

The error-class identification uses e.constructor.name, matching the spec-visible constructor/prototype path. If this cannot identify the thrown error class, the test is treated as an engine failure rather than papered over by the harness.
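The bindings __gocciaT262_e (catch parameter) and __gocciaT262_n (var declared inside the catch block) are the only Goccia-specific identifiers in the generated source. The catch parameter is scoped to the catch block. Under standard ECMAScript semantics the var declaration hoists to the enclosing scope, so it is the __gocciaT262_ prefix, not block scoping, that keeps it from colliding with body-level vars.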
Negative parse#
{body}

Body alone. The parser runs, fails (or doesn't), and the orchestrator reads the exit code.
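The three template shapes above can be sketched as one string-building function. buildSourceSketch and TemplateKind are hypothetical names for illustration; the real buildTestSource signature is not shown in this document, and the sketch leaves out the onlyStrict directive prefix.

```typescript
type TemplateKind = "positive" | "negative-runtime" | "negative-parse";

// Hypothetical stand-in for buildTestSource, covering only the template shapes.
function buildSourceSketch(kind: TemplateKind, harnessSource: string, body: string): string {
  if (kind === "negative-parse") {
    return body; // body alone; the orchestrator only reads the exit code
  }
  if (kind === "positive") {
    return `${harnessSource}\n${body}`; // harness + body, no wrapping
  }
  // negative-runtime: wrap the body so the thrown class name reaches stdout.
  return [
    harnessSource,
    "try {",
    body,
    '  print("Test262:NegativeTestNoError");',
    "} catch (__gocciaT262_e) {",
    '  var __gocciaT262_n = "unknown";',
    '  if (__gocciaT262_e && typeof __gocciaT262_e === "object") {',
    "    if (__gocciaT262_e.constructor && __gocciaT262_e.constructor.name) {",
    "      __gocciaT262_n = __gocciaT262_e.constructor.name;",
    "    }",
    "  }",
    '  print("Test262:NegativeTestError:" + __gocciaT262_n);',
    "}",
  ].join("\n");
}

console.log(buildSourceSketch("positive", "/* harness */", "/* body */"));
```

Because the negative-runtime wrapper is plain script text, it can be exercised outside the engine by supplying a stub print function, which is a cheap way to check a template change before a full conformance run.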
Failure classification#
| Source | Surface | Counts as |
|---|---|---|
| Body assertion fails (Test262Error thrown) | exit 1, stderr has formatted error | conformance fail |
| Body throws other Error | exit 1, stderr captures the error | conformance fail |
| Body throws non-Error (undefined, null, etc.) | exit 1, stderr carries the formatted value | conformance fail |
| Async body never calls $DONE | no Test262:Async* marker, exit 0 | conformance fail |
| Engine killed by signal (SIGSEGV, OOM) | signalCode != null or exit > 1 | wrapper infra |
| Pascal-side error (EAccessViolation, ESocket, …) | stderr starts with Pascal class name | wrapper infra |
| Per-test wall-clock timeout | setTimeout(() => ac.abort(), wallClockMs) fired (signalCode SIGTERM/SIGKILL) | timeout |
| Negative-runtime catch path itself crashes | no marker emitted at all | wrapper infra |
The classifier (classifyRunResult in run_test262_suite.ts) reads exit code, signal, stdout, and stderr to pick the bucket. The PASCAL_INFRA_RE regex deliberately excludes the bare Error: prefix because Goccia's bytecode mode uses it for legitimate JS errors — the prefix alone is not a wrapper-infra signal.
wrapper_infra_failures is gated to zero in CI. Any non-zero count fails the run because the conformance numbers are not trustworthy when the wrapper itself is broken.
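The bucket selection for positive tests can be sketched as an ordered series of checks. Everything named here is hypothetical: classifySketch, RunResult, and PASCAL_INFRA_SKETCH are illustrations, and the regex is not the real PASCAL_INFRA_RE, whose pattern is not shown in this document. The sketch also omits the negative-test rows.

```typescript
type Bucket = "PASS" | "FAIL" | "WRAPPER_INFRA" | "TIMEOUT";

interface RunResult {
  exitCode: number | null;
  signalCode: string | null;
  timedOut: boolean; // the per-test wall-clock timer fired
  stderr: string;
  passSignalSeen: boolean; // outcome of the wire-protocol check
}

// Illustrative pattern only. It matches the two Pascal class names the table
// mentions and, like the real regex, does not match a bare "Error:" prefix.
const PASCAL_INFRA_SKETCH = /^E(AccessViolation|Socket)\b/;

function classifySketch(r: RunResult): Bucket {
  // Timeout is checked first: a timed-out process also reports a signal
  // (SIGTERM/SIGKILL) and would otherwise be misread as a crash.
  if (r.timedOut) return "TIMEOUT";
  if (r.signalCode !== null || (r.exitCode !== null && r.exitCode > 1)) return "WRAPPER_INFRA";
  if (PASCAL_INFRA_SKETCH.test(r.stderr)) return "WRAPPER_INFRA";
  return r.passSignalSeen ? "PASS" : "FAIL";
}

const buckets = [
  classifySketch({ exitCode: 1, signalCode: null, timedOut: false, stderr: "Error: boom", passSignalSeen: false }),
  classifySketch({ exitCode: 1, signalCode: null, timedOut: false, stderr: "EAccessViolation: bad read", passSignalSeen: false }),
];
console.log(buckets.join(" "));
// FAIL WRAPPER_INFRA
```

The CI gate then reduces to counting WRAPPER_INFRA results and failing the run when the count is non-zero.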
Visibility invariants#
Bodies see only the identifiers stock test262 expects:
- Test262Error (from sta.js)
- assert and its methods (from assert.js)
- $DONE, $DONOTEVALUATE (from sta.js / doneprintHandle.js when included)
- print (Goccia engine global; stock doneprintHandle.js uses it for async markers)
- $262 helpers implemented by Goccia's bundled host object: detachArrayBuffer, evalScript, gc, global, and createRealm
- test262Host, a private marker on the Goccia namespace exposed only by GocciaScriptLoaderBare --test262-host so $262.js can reject accidental use outside the conformance host.
- Anything declared in test-included harness files (e.g. compareArray, propertyHelper)
Bodies do NOT see:
- expect, describe, test, it, beforeAll, beforeEach, afterEach, afterAll, onTestFinished, runTests, mock, spyOn: none of these exist on the Bare engine.
- console, fetch, URL, performance: Bare doesn't register Goccia's runtime extension.
- Optional $262 hooks outside the bundled implementation. Tests that depend on unavailable host hooks fail honestly.
Bundled harness adaptations#
The orchestrator loads stock tc39/test262 harness files from the pinned checkout's harness/ directory. BUNDLED_INCLUDES contains only $262.js, because $262 is not a stock harness helper: it is the host-provided object test262 expects engines to supply.
Bundling rule: do not add compatibility copies of stock harness files. If a stock helper fails, fix the language/runtime behavior or classify the test as a genuine conformance failure. $262.js may grow only by adding test262 host hooks or harness-local $262 behavior.
$262.evalScript() and $262.createRealm() delegate to flagged test262 host hooks exposed by the bare loader. evalScript parses and executes its source as script code in the current realm. createRealm creates a fresh Goccia engine/realm, returns a host record with the child realm's globalThis, its own evalScript, and createRealm, and keeps the child engine alive for the duration of the test run so cross-realm intrinsics remain valid. The hooks are only exposed when GocciaScriptLoaderBare --test262-host is enabled.
eval is also host-gated. GocciaScriptLoaderBare --test262-host installs the official test262 host eval only for conformance runs; default Bare execution does not expose it. Bytecode direct calls to that host eval preserve the caller realm and caller lexical bindings, while shadowed or indirect eval calls use ordinary function-call semantics.
Strict mode#
GocciaScript recognizes strict directive prologues at execution time for the compatibility behaviors that otherwise depend on Script non-strict mode. Independently, the engine's curated default semantics enforce most strict-mode behaviors statically:
- Implicit globals throw ReferenceError (sloppy would create a global)
- delete <identifier> and non-configurable property deletion throw by default
The orchestrator enables --compat-non-strict-mode per test, not globally: Script tests receive it, while module tests stay strict. onlyStrict Script tests also receive the flag, but the injected directive keeps with, non-strict assignment failures, legacy delete return values, and regular-function nullish this coercion on the strict path. Remaining noStrict tests rely on sloppy-only behaviors that GocciaScript still does not provide and fail naturally as ordinary conformance failures, not as wrapper-infra failures.
The runner passes syntax compatibility flags such as --compat-traditional-for-loop, --compat-for-in-loop, --compat-while-loops, --compat-label, and the semantic --compat-arguments-object flag unconditionally because test262 uses those forms across both harness helpers and test bodies. The test's source type and strictness still decide strict-mode semantics; --compat-arguments-object only enables the implicit arguments binding. Strictness and parameter-list shape decide whether that binding is unmapped or mapped.
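The per-test flag selection described in this section might look like the sketch below. compatFlagsSketch and TestMeta are hypothetical names, and the unconditional list contains only the flags named above; the text says "such as", so the real list may be longer.

```typescript
interface TestMeta {
  isModule: boolean; // module tests stay strict; Script tests get the compat flag
}

// Only the flags this section names; the runner's actual list may differ.
const UNCONDITIONAL_FLAGS_SKETCH: string[] = [
  "--compat-traditional-for-loop",
  "--compat-for-in-loop",
  "--compat-while-loops",
  "--compat-label",
  "--compat-arguments-object",
];

function compatFlagsSketch(meta: TestMeta): string[] {
  const flags = ["--test262-host", ...UNCONDITIONAL_FLAGS_SKETCH];
  if (!meta.isModule) {
    // Script tests, including onlyStrict ones, receive the flag; for
    // onlyStrict tests the injected "use strict" directive keeps the
    // runtime on the strict path.
    flags.push("--compat-non-strict-mode");
  }
  return flags;
}

console.log(compatFlagsSketch({ isModule: true }).includes("--compat-non-strict-mode"));
// false
```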
Source-phase import feature flags#
The runner reads the test262 features frontmatter when a feature maps to a Goccia engine option. source-phase-imports tests run with the base test262 flag set only. Tests that also declare source-phase-imports-module-source receive --experimental-js-module-source, which enables JavaScript ModuleSource objects for the separate ESM Phase Imports proposal. This is not an eligibility filter: every discovered test still runs, and feature metadata only changes the host options used for that test.
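A feature-to-option lookup of this kind can be sketched as a plain table. FEATURE_OPTIONS_SKETCH and optionsForFeatures are hypothetical names; the single mapping shown is the one this section describes, and any other entries the runner may have are not listed here.

```typescript
// Maps a test262 `features` entry to extra engine options.
// Features absent from the table add nothing; they never exclude a test.
const FEATURE_OPTIONS_SKETCH: Record<string, string[]> = {
  "source-phase-imports-module-source": ["--experimental-js-module-source"],
};

function optionsForFeatures(features: string[]): string[] {
  const options = new Set<string>();
  features.forEach((feature) => {
    (FEATURE_OPTIONS_SKETCH[feature] ?? []).forEach((option) => options.add(option));
  });
  return Array.from(options);
}

console.log(optionsForFeatures(["source-phase-imports"]));
// []
console.log(optionsForFeatures(["source-phase-imports", "source-phase-imports-module-source"]));
// [ '--experimental-js-module-source' ]
```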
Path normalization#
Test IDs are stored as POSIX-style relative paths under suite/test/. On Windows the filesystem returns backslashes; the orchestrator normalizes to forward slashes via normalizeTestId(id) at every site that uses an ID (glob-match, reporting, baseline lookup) so the same test produces the same ID on both platforms.
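A minimal sketch of the normalization, assuming it only needs to rewrite separators (the real normalizeTestId may do more; the function name below is suffixed to mark it as an illustration, and the example path is made up):

```typescript
// Converts Windows-style separators to POSIX-style so IDs compare equal
// across platforms. Already-normalized IDs pass through unchanged.
function normalizeTestIdSketch(id: string): string {
  return id.replace(/\\/g, "/");
}

console.log(normalizeTestIdSketch("built-ins\\Array\\example.js"));
// built-ins/Array/example.js
```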
Test corpus#
Default categories: built-ins, harness, intl402, language, staging (everything except annexB).
- annexB is legacy/deprecated browser-only behavior we don't intend to support.
- intl402 covers Intl APIs (ECMA-402); tests exercise Intl.getCanonicalLocales, constructors, and formatting operations.
- staging is forward-looking proposals (an engine-readiness signal).
- harness verifies test262's own harness functions work under our engine.
There is no eligibility filter. Every discovered test runs. Tests that depend on missing features fail with a real diagnostic, not an invisible skip. Per-test subprocess + --timeout + --max-memory bound the blast radius of any individual hang or OOM.
Known engine crashes#
A small KNOWN_ENGINE_CRASHES set in scripts/run_test262_suite.ts skips tests that are known to crash the engine at the native level (SIGSEGV / SIGBUS). These crashes are not catchable by the per-test timeout, are not representative of conformance failures, and would otherwise inflate wrapper_infra_failures indefinitely. Each entry is paired with a GitHub issue tracking the underlying engine bug; remove the entry once the bug is fixed.
This list is the only allowed form of test-skipping in the harness. Do not rebuild a generic eligibility filter (the structural blast-radius control is per-test subprocess + --timeout + --max-memory, not pre-execution exclusion).
Current entries:
- built-ins/Iterator/concat/throws-typeerror-when-generator-is-running-next.js (#514)
- staging/sm/RegExp/test-trailing.js (#515)
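One way to keep each skip paired with its tracking issue is to store both together. The source describes KNOWN_ENGINE_CRASHES as a set; the Map below is a variation chosen for this sketch, and the names KNOWN_ENGINE_CRASHES_SKETCH and trackingIssueFor are hypothetical.

```typescript
// Test ID -> GitHub issue number, using the two entries listed above.
const KNOWN_ENGINE_CRASHES_SKETCH = new Map<string, number>([
  ["built-ins/Iterator/concat/throws-typeerror-when-generator-is-running-next.js", 514],
  ["staging/sm/RegExp/test-trailing.js", 515],
]);

// Returns the tracking issue when the test must be skipped, else undefined.
function trackingIssueFor(testId: string): number | undefined {
  return KNOWN_ENGINE_CRASHES_SKETCH.get(testId);
}

console.log(trackingIssueFor("staging/sm/RegExp/test-trailing.js"));
// 515
```

Storing the issue number next to the ID makes an entry without a tracking issue impossible to add, which matches the rule that every skip has an open engine bug behind it.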
Updating the contract#
Changes to buildTestSource in scripts/run_test262_suite.ts are verified by the full conformance run itself (no separate regression suite). After any wrapper-template change:
1. Run locally:

   ```bash
   ./build.pas loaderbare
   bun scripts/run_test262_suite.ts --suite-dir <checkout> \
     --output local-results.json
   ```

2. Confirm wrapper_infra_failures: 0 in the summary.
3. Diff local-results.json against the prior baseline; investigate any sign change in pass/fail counts before opening the PR.
Updating the SHA pin#
The test262 SHA is pinned in scripts/test262-suite-sha.txt. Both .github/workflows/ci.yml and .github/workflows/pr.yml read that file before checking out tc39/test262, so the cached main baseline and a PR run measure the same upstream corpus without weekly bump PRs needing to modify workflow files. The weekly cron at .github/workflows/test262-bump.yml opens a PR every Monday with the latest tc39/test262 main SHA; the PR's standard CI run posts the per-category delta vs. the previous main baseline. Merge once the delta is acceptable.
Manual bump: bun scripts/test262-bump-pin.ts <40-hex-sha>.
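Before running a manual bump it can help to confirm that the argument really is a full SHA rather than an abbreviated one. The check below is a sketch under the assumption that the pin is a lowercase 40-character hex string; whether scripts/test262-bump-pin.ts performs this validation itself, or accepts uppercase, is not stated here.

```typescript
// Hypothetical pre-check: true only for a full-length lowercase hex SHA.
function isFullSha(value: string): boolean {
  return /^[0-9a-f]{40}$/.test(value);
}

console.log(isFullSha("abc123"));
// false
```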