Day 3 & 4 - Gateway Observability Eval Runner

https://github.com/RitikaxG/claimflow_ai/blob/main/sample-data/week-06-observability/README.md

https://github.com/RitikaxG/claimflow_ai/blob/main/sample-data/week-06-observability/eval-results/week-6-gateway-observability-eval.md

Purpose

The Week 6 eval runner turns the gateway observability dataset into executable evidence that the gateway behaves as specified under production-style failures.

The dataset already defines which failure modes must be covered: timeout, invalid JSON, provider error, cost limit, latency spike, prompt regression, eval regression, missing trace ID, and missing model version. The eval runner proves that ClaimFlow AI can execute those cases through the real gateway wrapper, persist AiCallLog rows, classify failures correctly, and generate dashboard-ready reports.

This eval does not test claim adjudication quality. It tests whether AI behavior is observable, governed, traceable, and measurable.

Implementation

The runner is implemented in:

packages/evals/evaluate-week6-gateway-observability.ts

Supporting files:

packages/evals/lib/gateway-case-loader.ts
packages/evals/lib/metrics.ts
packages/evals/lib/eval-result-writer.ts

The root script is:

bun eval:week6:gateway

Internally, the runner performs this flow:

load gateway cases
-> simulate synthetic model/provider behavior
-> callModelThroughGateway(...)
-> read persisted AiCallLog
-> compare actual gateway result against expected.json
-> compute metrics
-> write JSON + Markdown reports
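
The flow above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the code in evaluate-week6-gateway-observability.ts: the failure labels, the case IDs, the in-memory AiCallLog map, and callModelThroughGatewayStub are all hypothetical stand-ins, and the real callModelThroughGateway(...) signature is not shown in these notes. The stub is synchronous only to keep the sketch short.

```typescript
// Hypothetical failure labels; the real classifier's vocabulary is not shown in these notes.
type FailureClass = "none" | "timeout" | "invalid_json" | "provider_error";

interface GatewayCase {
  id: string;
  simulate: FailureClass; // synthetic provider behavior to inject
  expected: { failureClass: FailureClass }; // stands in for expected.json
}

interface AiCallLogRow {
  traceId: string;
  failureClass: FailureClass;
}

interface CaseOutcome {
  id: string;
  logged: boolean;
  passed: boolean;
}

// In-memory stand-in for the persisted AiCallLog table (assumption).
const aiCallLog = new Map<string, AiCallLogRow>();

// Stand-in for callModelThroughGateway(...): records a log row and returns its trace ID.
function callModelThroughGatewayStub(c: GatewayCase): { traceId: string } {
  const row: AiCallLogRow = { traceId: `trace-${c.id}`, failureClass: c.simulate };
  aiCallLog.set(row.traceId, row);
  return { traceId: row.traceId };
}

function runCases(cases: GatewayCase[]) {
  const outcomes: CaseOutcome[] = cases.map((c) => {
    const { traceId } = callModelThroughGatewayStub(c);
    // Read the persisted row back instead of trusting the return value.
    const row = aiCallLog.get(traceId);
    return {
      id: c.id,
      logged: row !== undefined,
      passed: row?.failureClass === c.expected.failureClass,
    };
  });

  // Compute metrics.
  const passed = outcomes.filter((o) => o.passed).length;
  const summary = {
    total: outcomes.length,
    passed,
    passRate: outcomes.length > 0 ? passed / outcomes.length : 0,
  };

  // Render both report formats as strings; a real writer would save them to disk.
  const json = JSON.stringify({ summary, outcomes }, null, 2);
  const markdown = [
    "| case | logged | passed |",
    "| --- | --- | --- |",
    ...outcomes.map((o) => `| ${o.id} | ${o.logged} | ${o.passed} |`),
  ].join("\n");

  return { summary, json, markdown };
}

const demoCases: GatewayCase[] = [
  { id: "timeout-001", simulate: "timeout", expected: { failureClass: "timeout" } },
  { id: "invalid-json-001", simulate: "invalid_json", expected: { failureClass: "invalid_json" } },
  // Deliberately mismatched so the report shows one failing case.
  { id: "provider-error-001", simulate: "none", expected: { failureClass: "provider_error" } },
];

const report = runCases(demoCases);
console.log(report.markdown);
```

Reading the row back from the log store, rather than comparing the gateway's return value directly, is what lets a case fail when the call succeeded but nothing was persisted.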

Case Loading

gateway-case-loader.ts loads all folders under:

sample-data/week-06-observability/gateway-cases/

Each case must contain:

manifest.json
input.json
expected.json
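
A loader can enforce the three-file requirement before any case runs. The sketch below is an assumption about how such a check might look, not the contents of gateway-case-loader.ts: checkGatewayCases, CaseCheck, and the demo folder names are hypothetical, and the demo builds a throwaway directory instead of reading the real sample-data tree.

```typescript
import { existsSync, mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// The three files every case folder must contain.
const REQUIRED_FILES = ["manifest.json", "input.json", "expected.json"] as const;

interface CaseCheck {
  caseId: string; // folder name
  missing: string[]; // required files that were not found
}

// Hypothetical helper: list every case folder and report which required files are absent.
function checkGatewayCases(casesDir: string): CaseCheck[] {
  return readdirSync(casesDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => ({
      caseId: entry.name,
      missing: REQUIRED_FILES.filter((file) => !existsSync(join(casesDir, entry.name, file))),
    }))
    .sort((a, b) => a.caseId.localeCompare(b.caseId));
}

// Demo against a temporary directory with one complete and one incomplete case.
const demoDir = mkdtempSync(join(tmpdir(), "gateway-cases-"));

mkdirSync(join(demoDir, "timeout-001"));
for (const file of REQUIRED_FILES) {
  writeFileSync(join(demoDir, "timeout-001", file), "{}");
}

mkdirSync(join(demoDir, "invalid-json-001"));
writeFileSync(join(demoDir, "invalid-json-001", "manifest.json"), "{}");

const demoResult = checkGatewayCases(demoDir);
console.log(demoResult);
```

Returning the list of missing files per case, instead of throwing on the first problem, lets the runner report every malformed case in one pass.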