Day 4 - Review Failure Dataset Preparation and Testing

Day 2
1. Add env flags.
2. Add apps/web/lib/testing/mock-extraction.ts.
3. Validate gold extraction JSON with ClaimExtractionSchema.
4. Add mock branch to extract route.
5. Test w2-001 happy mock review path.
6. Test w2-012 mock extraction failure.
7. Test USE_MOCK_EXTRACTION=false still calls Gemini.
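
The Day 2 mock branch could look roughly like the sketch below. It is an illustration under assumptions, not the actual route code: the helper names (isMockExtractionEnabled, getMockExtraction, extractForPacket), the Extraction shape, and the idea that a missing gold entry stands in for the w2-012 failure are all hypothetical. The real extractor is passed in as a callback so the sketch makes no claims about the Gemini client.

```typescript
// Hypothetical shapes: the real ClaimExtractionSchema and route handler
// are not shown in these notes.
type Extraction = Record<string, unknown>;

type ExtractionResult =
  | { ok: true; extraction: Extraction }
  | { ok: false; error: string };

type Env = Record<string, string | undefined>;

// Only the literal string "true" enables the mock, so an unset or
// misspelled flag falls through to the real extractor.
function isMockExtractionEnabled(env: Env): boolean {
  return env.USE_MOCK_EXTRACTION === "true";
}

// Gold extractions keyed by packet id. Assumption: a missing entry is how
// a mock extraction failure is modelled here.
function getMockExtraction(
  packetId: string,
  gold: Record<string, Extraction | undefined>,
): ExtractionResult {
  const extraction = gold[packetId];
  if (extraction === undefined) {
    return { ok: false, error: `no gold extraction for ${packetId}` };
  }
  return { ok: true, extraction };
}

// Branch point: mock when the flag is on, otherwise call the real extractor.
async function extractForPacket(
  packetId: string,
  env: Env,
  gold: Record<string, Extraction | undefined>,
  callRealExtractor: (packetId: string) => Promise<Extraction>,
): Promise<ExtractionResult> {
  if (isMockExtractionEnabled(env)) {
    return getMockExtraction(packetId, gold);
  }
  return { ok: true, extraction: await callRealExtractor(packetId) };
}
```

In the real route, the gold JSON would presumably be validated with ClaimExtractionSchema before being returned (step 3); that call is left out because the schema is not shown here.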

Day 3
1. Add packages/evals/evaluate-week2-review-workflow.ts.
2. Add eval package/root scripts.
3. Add eval-safe cleanup for synthetic packet data.
4. Implement packet discovery + manifest parsing.
5. Implement upload/extract/validate API calls.
6. Assert validationJson and workflow.expected.json.
7. Add review action tests for w2-013, w2-014, w2-015.
8. Print console summary.
9. Write eval JSON + Markdown report.
10. Fix only real gaps found by the eval.
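
Steps 4 and 6 of Day 3 (manifest parsing and asserting against workflow.expected.json) might be sketched as follows. The manifest fields packetId and documents are hypothetical, since the notes do not list the real manifest shape, and the comparison is assumed to be a subset match: only keys present in the expected file are checked.

```typescript
// Hypothetical manifest shape; the real manifest fields are not given in the notes.
interface PacketManifest {
  packetId: string;
  documents: string[];
}

// Parse and minimally validate a manifest so a malformed packet fails early
// with a readable message instead of deep inside the API calls.
function parseManifest(raw: string): PacketManifest {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("manifest must be a JSON object");
  }
  const { packetId, documents } = data as Record<string, unknown>;
  if (typeof packetId !== "string" || packetId.length === 0) {
    throw new Error("manifest.packetId must be a non-empty string");
  }
  if (!Array.isArray(documents) || !documents.every((d) => typeof d === "string")) {
    throw new Error(`manifest.documents must be a string array (${packetId})`);
  }
  return { packetId, documents: documents as string[] };
}

// Structural equality for JSON-like values (objects, arrays, primitives).
function isDeepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const aRec = a as Record<string, unknown>;
  const bRec = b as Record<string, unknown>;
  const aKeys = Object.keys(aRec);
  if (aKeys.length !== Object.keys(bRec).length) return false;
  return aKeys.every(
    (k) => Object.prototype.hasOwnProperty.call(bRec, k) && isDeepEqual(aRec[k], bRec[k]),
  );
}

// Compare only the keys the expected file names; extra keys in the actual
// output are ignored. Returns one message per mismatching key.
function findMismatches(
  expected: Record<string, unknown>,
  actual: Record<string, unknown>,
): string[] {
  return Object.keys(expected)
    .filter((key) => !isDeepEqual(expected[key], actual[key]))
    .map(
      (key) =>
        `${key}: expected ${JSON.stringify(expected[key])}, got ${JSON.stringify(actual[key])}`,
    );
}
```

Returning a list of mismatch messages rather than throwing on the first one keeps a single run useful for step 10, since every gap for a packet shows up in the same report.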

Synthetic messy packets → mock extraction → real validation → real review task → real human decision APIs → eval report

1. Prepare the Week 2 review failure dataset.

2. Run every Week 2 messy packet through the full existing API workflow and produce a pass/fail report.

Week 2 eval runner created

packages/evals/evaluate-week2-review-workflow.ts exists and covers the full workflow:

1. Reads packet manifests.
2. Resets synthetic eval data.
3. Uploads packet documents through the real API.
4. Tests duplicate upload.
5. Calls the real extract API.
6. Calls the real validate API.
7. Compares validation output.
8. Checks review task creation.
9. Checks review events.
10. Executes review actions for edit/reject/request-more-info packets.
11. Writes JSON + Markdown eval reports.
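
The final reporting step could be sketched as below. This is only an illustration: the PacketResult shape, the report title, and the table columns are assumptions, not the runner's actual output format. The functions return strings so that writing to disk and printing to the console stay with the caller.

```typescript
// Hypothetical result shape; the real report fields are not specified in the notes.
interface PacketResult {
  packetId: string;
  failures: string[]; // empty means the packet passed
}

// One-line pass/fail count, suitable for the console summary.
function summarize(results: PacketResult[]): string {
  const passed = results.filter((r) => r.failures.length === 0).length;
  return `${passed}/${results.length} packets passed`;
}

// Markdown report: a summary line followed by one table row per packet.
function renderMarkdownReport(results: PacketResult[]): string {
  const rows = results.map((r) => {
    const status = r.failures.length === 0 ? "PASS" : "FAIL";
    // Escape pipes so a failure message cannot break the table row.
    const notes = r.failures.join("; ").replace(/\|/g, "\\|");
    return `| ${r.packetId} | ${status} | ${notes} |`;
  });
  return [
    "# Week 2 review workflow eval",
    "",
    summarize(results),
    "",
    "| Packet | Status | Failures |",
    "| --- | --- | --- |",
    ...rows,
    "",
  ].join("\n");
}

// JSON report: the same data in machine-readable form.
function renderJsonReport(results: PacketResult[]): string {
  return JSON.stringify({ summary: summarize(results), results }, null, 2);
}
```

Deriving pass/fail from the failures list, instead of storing a separate flag, avoids the two ever disagreeing in the report.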