Day 2
1. Add env flags.
2. Add apps/web/lib/testing/mock-extraction.ts.
3. Validate gold extraction JSON with ClaimExtractionSchema.
4. Add mock branch to extract route.
5. Test w2-001 happy mock review path.
6. Test w2-012 mock extraction failure.
7. Test USE_MOCK_EXTRACTION=false still calls Gemini.
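
A minimal sketch of how the Day 2 mock branch could be shaped. Only the `USE_MOCK_EXTRACTION` flag name comes from these notes. The `ClaimExtraction` shape, `parseGoldExtraction`, and `mockExtractionFor` are hypothetical, and the hand-written check stands in for the real `ClaimExtractionSchema`, which is not shown here. Returning `null` means the mock is disabled and the route should continue to the real Gemini call.

```typescript
// Sketch only: every name except USE_MOCK_EXTRACTION is hypothetical.
type Env = Record<string, string | undefined>;

// Assumed shape; the real ClaimExtractionSchema is not shown in these notes.
interface ClaimExtraction {
  claimId: string;
  fields: Record<string, string>;
}

type MockResult =
  | { ok: true; extraction: ClaimExtraction }
  | { ok: false; error: string };

// Only the literal string "true" enables the mock, so an unset or
// misspelled flag falls through to the real extractor.
function isMockExtractionEnabled(env: Env): boolean {
  return env["USE_MOCK_EXTRACTION"] === "true";
}

// Stand-in for validating gold JSON against ClaimExtractionSchema.
function parseGoldExtraction(raw: unknown): ClaimExtraction | null {
  if (typeof raw !== "object" || raw === null) return null;
  const candidate = raw as Record<string, unknown>;
  const claimId = candidate["claimId"];
  const fields = candidate["fields"];
  if (typeof claimId !== "string" || claimId.length === 0) return null;
  if (typeof fields !== "object" || fields === null || Array.isArray(fields)) {
    return null;
  }
  const fieldMap = fields as Record<string, unknown>;
  const checked: Record<string, string> = {};
  for (const key of Object.keys(fieldMap)) {
    const value = fieldMap[key];
    if (typeof value !== "string") return null;
    checked[key] = value;
  }
  return { claimId, fields: checked };
}

// Returns null when the mock is disabled (caller proceeds to Gemini),
// otherwise a validated gold extraction or an explicit failure.
function mockExtractionFor(
  packetId: string,
  env: Env,
  goldByPacket: Record<string, unknown>,
): MockResult | null {
  if (!isMockExtractionEnabled(env)) return null;
  const parsed = parseGoldExtraction(goldByPacket[packetId]);
  if (parsed === null) {
    return { ok: false, error: `No valid gold extraction for packet ${packetId}` };
  }
  return { ok: true, extraction: parsed };
}
```

Keeping the decision in one pure function would let tests 5 to 7 run without touching the network: pass a fake env and a small gold map, then check which of the three outcomes comes back.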

Day 3
1. Add packages/evals/evaluate-week2-review-workflow.ts.
2. Add eval package/root scripts.
3. Add eval-safe cleanup for synthetic packet data.
4. Implement packet discovery + manifest parsing.
5. Implement upload/extract/validate API calls.
6. Assert validationJson and workflow.expected.json.
7. Add review action tests for w2-013, w2-014, w2-015.
8. Print console summary.
9. Write eval JSON + Markdown report.
10. Fix only real gaps found by the eval.
Target flow: synthetic messy packets → mock extraction → real validation → real review task → real human decision APIs → eval report
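
A rough sketch of the assertion and reporting steps from Day 3 (items 6, 8, and 9), assuming each packet's `workflow.expected.json` is a flat object of expected values. `PacketResult`, `diffExpected`, `evaluatePacket`, and `renderMarkdownReport` are hypothetical names, and the real runner may compare differently.

```typescript
// Sketch only: all names and the report layout are assumptions.
interface PacketResult {
  packetId: string;
  passed: boolean;
  failures: string[];
}

// Checks only the keys listed in `expected`, so extra fields in the
// actual API response do not fail a packet. Values are compared via
// JSON.stringify, which is sensitive to key order in nested objects.
function diffExpected(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): string[] {
  const failures: string[] = [];
  for (const key of Object.keys(expected)) {
    const want = JSON.stringify(expected[key]);
    const got = JSON.stringify(actual[key]);
    if (want !== got) {
      failures.push(`${key}: expected ${want}, got ${got}`);
    }
  }
  return failures;
}

function evaluatePacket(
  packetId: string,
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): PacketResult {
  const failures = diffExpected(expected, actual);
  return { packetId, passed: failures.length === 0, failures };
}

// Builds the Markdown report body; the JSON report could simply be
// JSON.stringify(results, null, 2).
function renderMarkdownReport(results: PacketResult[]): string {
  const passedCount = results.filter((r) => r.passed).length;
  const lines: string[] = [
    "# Week 2 review workflow eval",
    "",
    `Passed ${passedCount} of ${results.length} packets.`,
    "",
    "| Packet | Result | Failures |",
    "| --- | --- | --- |",
  ];
  for (const r of results) {
    const status = r.passed ? "PASS" : "FAIL";
    const detail = r.failures.length > 0 ? r.failures.join("; ") : "-";
    lines.push(`| ${r.packetId} | ${status} | ${detail} |`);
  }
  return lines.join("\n");
}
```

Comparing only the expected keys keeps the eval tolerant of new response fields, which fits item 10: a failure should point at a real gap, not at an incidental change in the payload.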

1. Prepare Week-2 Review Failure Dataset.

2. Run every Week 2 messy packet through the full existing API workflow and produce a pass/fail report.

  1. Week 2 eval runner created

packages/evals/evaluate-week2-review-workflow.ts exists and covers the full workflow: