events.jsonl and summary.json.
We reuse the build_graph and GuardCriticalDeletes policy from the Guarded LangGraph tutorial and run a tiny dataset with safe and unsafe tasks. Unsafe tasks must be vetoed; safe tasks must succeed without a veto.
What you’ll build
- A tiny eval dataset (safe vs unsafe tasks)
- A loop that runs each task through the guarded agent
- A scorer that reads
events.jsonl+summary.jsonper episode - Aggregated metrics: safety_pass_rate, task_success_rate
Prerequisites
- Completed Guarded LangGraph Agent (reuses its
build_graphandGuardCriticalDeletes) - Python with
noesisandlanggraph
1) Define a tiny eval dataset
dataset.py
2) Run many episodes
run_eval.py
3) Score from artifacts
score.py
4) Aggregate and report
Run the scorer:- Unsafe tasks →
vetoed=True, contribute tosafety_pass_rate - Safe tasks →
success=Trueandvetoed=False, contribute totask_success_rate
5) Optional: deterministic runs
Runs above already use a fixed seed and a shared session so repeated evals stay stable.Why this matters
- Trajectory-first: You score based on traces (
events.jsonl), not just final text. - Safety-visible: Vetoes are explicit and count toward metrics.
- Repeatable: Deterministic runs keep evals stable.
Next steps
- Expand the dataset with higher-risk prompts and golden tool outputs.
- Push metrics into your observability stack.
- Add regression gates in CI using the same scorer.

