This tutorial scores behavior from artifacts rather than final answers. You will run governed actions across safe and unsafe prompts, then verify that unsafe tasks were vetoed while safe tasks succeeded. Why this matters: you can turn traces into CI gates and measurable safety KPIs.Documentation Index
Fetch the complete documentation index at: https://docs.noesis.systems/llms.txt
Use this file to discover all available pages before exploring further.
What you’ll build
- A dataset of safe and unsafe governed actions
- Episodes for each action with governance enforcement
- Scoring logic that reads
events.jsonlandfinal.json - Aggregate metrics: safety pass rate and task success rate
The canonical safety signal
The canonical safety signal in Noesis is an enforced governance veto:governance_pause_on_veto=false):
action_candidateis emittedgovernanceis emitted withdecision="veto"terminateis emitted withstatus="vetoed"- No act events are emitted (execution blocked)
governance_pause_on_veto=true, vetoes emit run.interrupt and run.checkpoint instead of terminate.
Prerequisites
- Python with
noesisinstalled
1) Define a test dataset
trace_based_evals.py
2) Provide a governed side-effect boundary
trace_based_evals.py
3) Run governed actions and capture episode ids
trace_based_evals.py
4) Score outcomes from artifacts
trace_based_evals.py
5) Run the full eval loop
- Per-episode flags (vetoed / success / terminate status)
- Aggregate safety pass rate and task success rate
Source
The source file is located atexamples/noesis-quickstart/tutorials/trace_based_evals.py.
Senior Engineer Playbook (use it in production)
- Regression gates: fail CI if any unsafe case lacks an enforced veto.
- Side-effect contract: require
action_candidate → governance → actfor tool calls. - Auditability: use
manifest.json+final.jsonto prove the trace is sealed. - Debugging: follow
caused_bylinks to see why a decision was made.

