Determinism is optional. Use it when you need stable traces for evals, regression tests, or debugging a specific episode. You can ignore it for casual experimentation.

Quick setup: seed-based determinism

The easiest way to get repeatable behavior is to fix a seed on your session:
import noesis as ns
from noesis.runtime.session import SessionBuilder

session = (
    SessionBuilder.from_env()
    .with_determinism(seed=42)
    .build()
)

ep = session.run("Summarize incident INC-1234")
With a deterministic model/tooling stack, runs with the same seed will produce the same episode trajectory and metrics. The seed is recorded in summary.json.
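
As a quick sanity check, you can run the same prompt twice with the same seed and compare the recorded metrics. A minimal sketch, assuming the model/tooling stack is deterministic as noted above, and using the ns.summary.read call shown later on this page:
import noesis as ns
from noesis.runtime.session import SessionBuilder


def run_once(seed: int) -> str:
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=seed)
        .build()
    )
    return session.run("Summarize incident INC-1234")


# Two runs with the same seed should report identical metrics
ep1, ep2 = run_once(42), run_once(42)
assert ns.summary.read(ep1)["metrics"] == ns.summary.read(ep2)["metrics"]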

Stricter reproducibility (clock + RNG)

If you need fully stable timings/IDs (e.g., evals in CI), add a deterministic clock and RNG:
import noesis as ns
from noesis.runtime.session import SessionBuilder
from noesis.runtime.determinism import DeterministicClock, DeterministicRNG

clock = DeterministicClock.from_start("2024-01-15T10:30:00Z", tick_ms=10)
rng = DeterministicRNG(seed=42)

session = (
    SessionBuilder.from_env()
    .with_determinism(clock=clock, rng=rng, episode_timestamp_ms=1705314600000)
    .build()
)

ep = session.run("Draft release notes")
This pins wall-clock timestamps, random values, and the episode timestamp; the configuration is recorded in summary.json.
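
Note that episode_timestamp_ms here matches the clock's start time expressed as Unix epoch milliseconds; if you prefer to derive it rather than hard-code it:
from datetime import datetime, timezone

start = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(int(start.timestamp() * 1000))  # 1705314600000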

Replay and comparison

Compare two episodes to check for drift. Ignore timing fields to focus on behavior:
import noesis as ns


def compare_episodes(ep_a: str, ep_b: str) -> dict:
    events_a = list(ns.events.read(ep_a))
    events_b = list(ns.events.read(ep_b))

    diffs = {
        "event_count_match": len(events_a) == len(events_b),
        "phase_sequence_match": True,
        "payload_diffs": [],
    }

    for e1, e2 in zip(events_a, events_b):
        if e1["phase"] != e2["phase"]:
            diffs["phase_sequence_match"] = False
            break

        # Ignore timing differences; compare payloads to catch behavioral drift
        p1 = dict(e1.get("payload", {}))
        p2 = dict(e2.get("payload", {}))

        if p1 != p2:
            diffs["payload_diffs"].append(
                {"phase": e1["phase"], "diff": {"expected": p1, "actual": p2}}
            )

    return diffs
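
A typical check, assuming ep_a and ep_b come from two runs configured with the same seed:
result = compare_episodes(ep_a, ep_b)
assert result["event_count_match"] and result["phase_sequence_match"]
assert not result["payload_diffs"], result["payload_diffs"]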

Golden tests (pytest)

Use deterministic sessions in tests:
import json
from pathlib import Path
import noesis as ns
from noesis.runtime.session import SessionBuilder


def test_golden_episode():
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=42)
        .build()
    )
    episode_id = session.run("Generate test data")

    summary = ns.summary.read(episode_id)
    events = list(ns.events.read(episode_id))

    golden = json.loads(Path("tests/golden/generate_test_data.json").read_text())

    assert summary["metrics"]["success"] == golden["metrics"]["success"]
    assert summary["metrics"]["act_count"] == golden["metrics"]["act_count"]
    assert [e["phase"] for e in events] == [e["phase"] for e in golden["events"]]
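
When behavior changes intentionally, the golden file needs to be refreshed. A minimal regeneration sketch, assuming the golden file only needs the metrics and phase sequence asserted above:
import json
from pathlib import Path
import noesis as ns
from noesis.runtime.session import SessionBuilder


def regenerate_golden() -> None:
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=42)
        .build()
    )
    episode_id = session.run("Generate test data")

    golden = {
        "metrics": ns.summary.read(episode_id)["metrics"],
        "events": [{"phase": e["phase"]} for e in ns.events.read(episode_id)],
    }
    Path("tests/golden/generate_test_data.json").write_text(json.dumps(golden, indent=2))


if __name__ == "__main__":
    regenerate_golden()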

Episode ID format

Episode IDs are human-readable and sortable:
ep_<YYYYMMDD>_<HHMMSS>_<hash>_<entropy>_s<seed>
  • Prefix ep_
  • Date + time for sortability
  • Content hash + entropy for uniqueness
  • Seed suffix (s0 if unset) for reproducibility
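
Because the format is fixed, the seed can be recovered from the ID itself. A small hypothetical helper (not part of the noesis API; the example ID below is made up to match the format):
def seed_from_episode_id(episode_id: str) -> int:
    # ep_<YYYYMMDD>_<HHMMSS>_<hash>_<entropy>_s<seed> -> trailing "s<seed>" component
    suffix = episode_id.rsplit("_", 1)[-1]
    assert suffix.startswith("s"), f"unexpected episode id: {episode_id}"
    return int(suffix[1:])


assert seed_from_episode_id("ep_20240115_103000_ab12cd34_9f3e_s42") == 42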

Deterministic components (overview)

Component            Purpose                       How it works
Deterministic clock  Consistent timestamps         Fixed tick intervals instead of wall clock
Deterministic RNG    Reproducible random values    Seeded random number generator
Deterministic IDs    Stable identifiers            UUIDv5 based on namespace + content
Canonical JSON       Byte-identical output         Sorted keys, consistent formatting
Direction/governance events use deterministic UUIDv5 IDs so the same inputs produce the same identifiers, making replay comparisons reliable.
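
UUIDv5 is content-addressed: hashing the same namespace and name always yields the same UUID. A minimal standard-library illustration (the namespace and name strings here are illustrative, not the ones noesis uses internally):
import uuid

namespace = uuid.uuid5(uuid.NAMESPACE_URL, "noesis/governance")  # illustrative namespace
event_id_a = uuid.uuid5(namespace, "direction:INC-1234:approve")
event_id_b = uuid.uuid5(namespace, "direction:INC-1234:approve")

assert event_id_a == event_id_b  # same inputs, same identifier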

Advanced: canonical JSON

Artifacts use canonical JSON so the same data yields the same bytes:
from noesis.runtime.determinism import canonical_dumps

payload_a = {"b": 2, "a": 1, "c": [3, 1, 2]}
payload_b = {"a": 1, "c": [3, 1, 2], "b": 2}  # same data, different key order

assert canonical_dumps(payload_a) == canonical_dumps(payload_b)
This keeps manifest hashes and diffs stable.
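
If you need the same property outside the noesis runtime, the usual recipe is sorted keys plus fixed separators. A rough standard-library equivalent (noesis's exact canonicalization rules may differ):
import json


def rough_canonical_dumps(obj) -> str:
    # Sorted keys and fixed separators give byte-stable output for the same data
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


assert rough_canonical_dumps({"b": 2, "a": 1}) == rough_canonical_dumps({"a": 1, "b": 2})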

When to use determinism

Use deterministic mode for:
  • Evals and golden tests
  • Debugging specific runs
  • Compliance scenarios that require reproducibility
Avoid deterministic mode for:
  • Production workloads that need real timestamps/IDs
  • Performance benchmarks
  • Security-sensitive operations where predictable IDs are a risk

CI validation

Add deterministic tests and replay checks to CI:
# .github/workflows/determinism.yml
name: Determinism Validation

on: [push, pull_request]

jobs:
  golden-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v4

      - name: Run golden tests
        run: uv run pytest tests/golden/ -v

      - name: Check replay stability
        run: uv run python scripts/validate_replay.py
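
scripts/validate_replay.py is project-specific; a minimal sketch of what such a script might do, reusing the comparison approach from above:
# scripts/validate_replay.py (sketch)
import sys
import noesis as ns
from noesis.runtime.session import SessionBuilder


def run_once() -> str:
    session = (
        SessionBuilder.from_env()
        .with_determinism(seed=42)
        .build()
    )
    return session.run("Summarize incident INC-1234")


def main() -> int:
    ep_a, ep_b = run_once(), run_once()
    phases_a = [e["phase"] for e in ns.events.read(ep_a)]
    phases_b = [e["phase"] for e in ns.events.read(ep_b)]
    if phases_a != phases_b:
        print("Replay drift detected", file=sys.stderr)
        return 1
    print("Replay stable")
    return 0


if __name__ == "__main__":
    sys.exit(main())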