Determinism is optional. Use it when you need stable traces for evals, regression tests, or debugging a specific episode. You can ignore it for casual experimentation.

Quick setup: seed-based determinism

The easiest way to get repeatable behavior is to fix a seed on your session:
import noesis as ns
from noesis.runtime.session import SessionBuilder

session = (
    SessionBuilder.from_env()
    .with_determinism(seed=42)
    .build()
)

ep = session.run("Summarize incident INC-1234")
With a deterministic model/tooling stack, runs with the same seed will produce the same episode trajectory and metrics. The seed is recorded in summary.json.
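
As a quick sanity check, you can run the same prompt twice with the same seed and compare the recorded metrics. A minimal sketch, assuming the model/tooling stack is deterministic as noted above, and using the ns.summary.read call shown later on this page:
import noesis as ns
from noesis.runtime.session import SessionBuilder


def run_once(seed: int) -> str:
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=seed)
        .build()
    )
    return session.run("Summarize incident INC-1234")


# Two runs with the same seed should report identical metrics
ep1, ep2 = run_once(42), run_once(42)
assert ns.summary.read(ep1)["metrics"] == ns.summary.read(ep2)["metrics"]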

Stricter reproducibility (clock + RNG)

If you need fully stable timings/IDs (e.g., evals in CI), add a deterministic clock and RNG:
import noesis as ns
from noesis.runtime.session import SessionBuilder
from noesis.runtime.determinism import DeterministicClock, DeterministicRNG

clock = DeterministicClock.from_start("2024-01-15T10:30:00Z", tick_ms=10)
rng = DeterministicRNG(seed=42)

session = (
    SessionBuilder.from_env()
    .with_determinism(clock=clock, rng=rng, episode_timestamp_ms=1705314600000)
    .build()
)

ep = session.run("Draft release notes")
This pins wall-clock timestamps, random values, and the episode timestamp; the configuration is recorded in summary.json.
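
Note that episode_timestamp_ms here matches the clock's start time expressed as Unix epoch milliseconds; if you prefer to derive it rather than hard-code it:
from datetime import datetime, timezone

start = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(int(start.timestamp() * 1000))  # 1705314600000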

Replay and comparison

Compare two episodes to check for drift. Ignore timing fields to focus on behavior:
import noesis as ns


def compare_episodes(ep_a: str, ep_b: str) -> dict:
    events_a = list(ns.events.read(ep_a))
    events_b = list(ns.events.read(ep_b))

    diffs = {
        "event_count_match": len(events_a) == len(events_b),
        "phase_sequence_match": True,
        "payload_diffs": [],
    }

    for e1, e2 in zip(events_a, events_b):
        if e1["phase"] != e2["phase"]:
            diffs["phase_sequence_match"] = False
            break

        # Ignore timing differences; compare payloads to catch behavioral drift
        p1 = dict(e1.get("payload", {}))
        p2 = dict(e2.get("payload", {}))

        if p1 != p2:
            diffs["payload_diffs"].append(
                {"phase": e1["phase"], "diff": {"expected": p1, "actual": p2}}
            )

    return diffs
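
A typical check, assuming ep_a and ep_b come from two runs configured with the same seed:
result = compare_episodes(ep_a, ep_b)
assert result["event_count_match"] and result["phase_sequence_match"]
assert not result["payload_diffs"], result["payload_diffs"]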

Golden tests (pytest)

Use deterministic sessions in tests:
import json
from pathlib import Path
import noesis as ns
from noesis.runtime.session import SessionBuilder


def test_golden_episode():
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=42)
        .build()
    )
    episode_id = session.run("Generate test data")

    summary = ns.summary.read(episode_id)
    events = list(ns.events.read(episode_id))

    golden = json.loads(Path("tests/golden/generate_test_data.json").read_text())

    assert summary["metrics"]["success"] == golden["metrics"]["success"]
    assert summary["metrics"]["act_count"] == golden["metrics"]["act_count"]
    assert [e["phase"] for e in events] == [e["phase"] for e in golden["events"]]
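
When behavior changes intentionally, the golden file needs to be refreshed. A minimal regeneration sketch, assuming the golden file only needs the metrics and phase sequence asserted above:
import json
from pathlib import Path
import noesis as ns
from noesis.runtime.session import SessionBuilder


def regenerate_golden() -> None:
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=42)
        .build()
    )
    episode_id = session.run("Generate test data")

    golden = {
        "metrics": ns.summary.read(episode_id)["metrics"],
        "events": [{"phase": e["phase"]} for e in ns.events.read(episode_id)],
    }
    Path("tests/golden/generate_test_data.json").write_text(json.dumps(golden, indent=2))


if __name__ == "__main__":
    regenerate_golden()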

Episode ID format

Episode IDs are human-readable and sortable:
ep_<YYYYMMDD>_<HHMMSS>_<hash>_<entropy>_s<seed>
  • Prefix ep_
  • Date + time for sortability
  • Content hash + entropy for uniqueness
  • Seed suffix (s0 if unset) for reproducibility
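
Because the format is fixed, the seed can be recovered from the ID itself. A small hypothetical helper (not part of the noesis API; the example ID below is made up to match the format):
def seed_from_episode_id(episode_id: str) -> int:
    # ep_<YYYYMMDD>_<HHMMSS>_<hash>_<entropy>_s<seed> -> trailing "s<seed>" component
    suffix = episode_id.rsplit("_", 1)[-1]
    assert suffix.startswith("s"), f"unexpected episode id: {episode_id}"
    return int(suffix[1:])


assert seed_from_episode_id("ep_20240115_103000_ab12cd34_9f3e_s42") == 42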

Deterministic components (overview)

Component            Purpose                       How it works
Deterministic clock  Consistent timestamps         Fixed tick intervals instead of wall clock
Deterministic RNG    Reproducible random values    Seeded random number generator
Deterministic IDs    Stable identifiers            UUIDv5 based on namespace + content
Canonical JSON       Byte-identical output         Sorted keys, consistent formatting
Direction/governance events use deterministic UUIDv5 IDs so the same inputs produce the same identifiers, making replay comparisons reliable.
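
UUIDv5 is content-addressed: hashing the same namespace and name always yields the same UUID. A minimal standard-library illustration (the namespace and name strings here are illustrative, not the ones noesis uses internally):
import uuid

namespace = uuid.uuid5(uuid.NAMESPACE_URL, "noesis/governance")  # illustrative namespace
event_id_a = uuid.uuid5(namespace, "direction:INC-1234:approve")
event_id_b = uuid.uuid5(namespace, "direction:INC-1234:approve")

assert event_id_a == event_id_b  # same inputs, same identifier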

Advanced: canonical JSON

Artifacts use canonical JSON so the same data yields the same bytes:
from noesis.runtime.determinism import canonical_dumps

payload_a = {"b": 2, "a": 1, "c": [3, 1, 2]}
payload_b = {"a": 1, "c": [3, 1, 2], "b": 2}  # same data, different key order

assert canonical_dumps(payload_a) == canonical_dumps(payload_b)
This keeps manifest hashes and diffs stable.
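
If you need the same property outside the noesis runtime, the usual recipe is sorted keys plus fixed separators. A rough standard-library equivalent (noesis's exact canonicalization rules may differ):
import json


def rough_canonical_dumps(obj) -> str:
    # Sorted keys and fixed separators give byte-stable output for the same data
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


assert rough_canonical_dumps({"b": 2, "a": 1}) == rough_canonical_dumps({"a": 1, "b": 2})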

When to use determinism

Use deterministic mode for:
  • Evals and golden tests
  • Debugging specific runs
  • Compliance scenarios that require reproducibility
Avoid deterministic mode for:
  • Production workloads that need real timestamps/IDs
  • Performance benchmarks
  • Security-sensitive operations where predictable IDs are a risk

CI validation

Add deterministic tests and replay checks to CI:
# .github/workflows/determinism.yml
name: Determinism Validation

on: [push, pull_request]

jobs:
  golden-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4

      - name: Run golden tests
        run: uv run pytest tests/golden/ -v

      - name: Check replay stability
        run: uv run python scripts/validate_replay.py
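
scripts/validate_replay.py is project-specific; a minimal sketch of what such a script might do, reusing the comparison approach from above:
# scripts/validate_replay.py (sketch)
import sys
import noesis as ns
from noesis.runtime.session import SessionBuilder


def run_once() -> str:
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=42)
        .build()
    )
    return session.run("Summarize incident INC-1234")


def main() -> int:
    ep_a, ep_b = run_once(), run_once()
    phases_a = [e["phase"] for e in ns.events.read(ep_a)]
    phases_b = [e["phase"] for e in ns.events.read(ep_b)]
    if phases_a != phases_b:
        print("Replay drift detected", file=sys.stderr)
        return 1
    print("Replay stable")
    return 0


if __name__ == "__main__":
    sys.exit(main())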