Below is the updated failure dataset preparation plan, aligned to your actual 8-week ClaimFlow AI roadmap.
The main correction from the earlier plan is this:
Dataset/eval work is not a separate project.
It is a weekly eval lane attached to the core feature of that week.
Also keep the source strategy unchanged: mostly controlled synthetic packets, with public documents used mainly as anchors for policy wording, exclusions, field inspiration, and RAG sources, not as full real claim packets.
Every week should produce:
1. Product feature
2. Dataset for that feature
3. Gold expected behavior
4. Eval script
5. Markdown + JSON eval result
6. Docs updated with eval evidence
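As a rough illustration of items 4 and 5, here is a minimal sketch of an eval script that writes one week's result as both JSON and Markdown. The function name, file names, and the case fields (`case_id`, `expected`, `actual`) are assumptions made for this example, not part of the roadmap.

```python
import json
from pathlib import Path


def write_eval_result(week: int, feature: str, cases: list[dict], out_dir: str) -> dict:
    """Compare expected vs. actual per case and write JSON + Markdown reports.

    Each case is a dict with hypothetical keys: case_id, expected, actual.
    """
    passed = sum(1 for c in cases if c["expected"] == c["actual"])
    summary = {
        "week": week,
        "feature": feature,
        "total": len(cases),
        "passed": passed,
        "failed": len(cases) - passed,
        "cases": cases,
    }

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Machine-readable result, useful for tracking results week over week.
    json_path = out / f"week{week:02d}_eval.json"
    json_path.write_text(json.dumps(summary, indent=2), encoding="utf-8")

    # Human-readable result, meant to be linked from the docs as eval evidence.
    lines = [
        f"# Week {week} eval: {feature}",
        "",
        f"Passed {passed} of {len(cases)} cases.",
        "",
        "| case_id | expected | actual | result |",
        "| --- | --- | --- | --- |",
    ]
    for c in cases:
        result = "pass" if c["expected"] == c["actual"] else "FAIL"
        lines.append(f"| {c['case_id']} | {c['expected']} | {c['actual']} | {result} |")
    md_path = out / f"week{week:02d}_eval.md"
    md_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    return summary
```

The comparison here is plain equality on one value per case. Weeks that evaluate extraction fields or RAG answers would likely need a richer comparison, but the two-file output shape can stay the same.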
Dataset work should be small, targeted, and tied to the workflow being built that week.
Do not collect random PDFs.
Build controlled claim packets:
claim packet
→ known source documents
→ known extraction truth
→ known validation result
→ known workflow status
→ known review / RAG / memory / agent expectation
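One way to make the chain above concrete is a small record type where every packet carries its own gold truth. This is only a sketch: the class name, field names, and the completeness check are illustrative assumptions, not a schema taken from the plan.

```python
from dataclasses import dataclass, field


@dataclass
class ClaimPacket:
    """A controlled claim packet with its known (gold) outcomes.

    All field names are hypothetical and chosen to mirror the chain above.
    """
    packet_id: str
    source_documents: list[str]          # known source documents
    extraction_truth: dict[str, str]     # known extraction truth (field -> value)
    validation_result: str               # known validation result
    workflow_status: str                 # known workflow status
    # Known review / RAG / memory / agent expectation; optional because
    # it only applies in the weeks where those features are built.
    expectations: dict[str, str] = field(default_factory=dict)


def missing_gold_fields(packet: ClaimPacket) -> list[str]:
    """Return the names of required gold fields that are empty."""
    missing = []
    if not packet.source_documents:
        missing.append("source_documents")
    if not packet.extraction_truth:
        missing.append("extraction_truth")
    if not packet.validation_result:
        missing.append("validation_result")
    if not packet.workflow_status:
        missing.append("workflow_status")
    return missing
```

Running a check like `missing_gold_fields` over the dataset before each eval helps keep packets without known truth from slipping in, which is the difference between a testbench and a folder of PDFs.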
Your attached plan correctly says the dataset should become a workflow testbench, not a folder of random insurance PDFs.
Use:
80% synthetic claim packets
20% public anchor documents
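To turn the 80/20 ratio into concrete counts for a given week, a small helper like the one below could be used. The function name and the example totals are assumptions for illustration only; the plan does not fix a dataset size here.

```python
def split_counts(total: int, synthetic_share: float = 0.8) -> tuple[int, int]:
    """Split a target item count into (synthetic packets, public anchor documents)."""
    if total < 0:
        raise ValueError("total must be non-negative")
    if not 0.0 <= synthetic_share <= 1.0:
        raise ValueError("synthetic_share must be between 0 and 1")
    synthetic = round(total * synthetic_share)
    # The remainder goes to public anchors so the two counts always sum to total.
    return synthetic, total - synthetic


# Hypothetical example: a week that targets 50 items in total.
synthetic, anchors = split_counts(50)
print(synthetic, anchors)  # 40 10
```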