Below is the updated failure dataset preparation plan, aligned to your actual 8-week ClaimFlow AI roadmap.

The main correction from the earlier plan is this:

Dataset/eval work is not a separate project.
It is a weekly eval lane attached to the core feature of that week.

Also keep the source strategy unchanged: mostly controlled synthetic packets, with public documents used mainly as anchors for policy wording, exclusions, field inspiration, and RAG sources — not as full real claim packets.


ClaimFlow AI — Updated Failure Dataset Preparation Plan

Core rule

Every week should produce:

1. Product feature
2. Dataset for that feature
3. Gold expected behavior
4. Eval script
5. Markdown + JSON eval result
6. Docs updated with eval evidence

Dataset work should be small, targeted, and tied to the workflow being built that week.
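
As a rough illustration of items 3 to 5 above, here is a minimal sketch of an eval script that compares actual behavior against gold expected behavior and emits both a JSON and a Markdown result. Everything in it is an assumption made for illustration: the case fields (`id`, `input`, `gold`), the `evaluate` and `to_markdown` helpers, and the toy `toy_predict` rule are hypothetical and not part of ClaimFlow AI.

```python
import json


def evaluate(cases, predict):
    """Run predict on each case input and compare against the gold label."""
    rows = []
    for case in cases:
        actual = predict(case["input"])
        rows.append({
            "id": case["id"],
            "expected": case["gold"],
            "actual": actual,
            "passed": actual == case["gold"],
        })
    passed = sum(1 for row in rows if row["passed"])
    return {"total": len(rows), "passed": passed, "cases": rows}


def to_markdown(result):
    """Render the eval result as a short Markdown report."""
    lines = [
        f"Eval result: {result['passed']}/{result['total']} passed",
        "",
        "| id | expected | actual | passed |",
        "| --- | --- | --- | --- |",
    ]
    for row in result["cases"]:
        flag = "yes" if row["passed"] else "no"
        lines.append(f"| {row['id']} | {row['expected']} | {row['actual']} | {flag} |")
    return "\n".join(lines)


# Hypothetical gold cases for one week's feature.
cases = [
    {"id": "c1", "input": {"amount": 100}, "gold": "approve"},
    {"id": "c2", "input": {"amount": 99999}, "gold": "manual_review"},
]


def toy_predict(claim):
    # Placeholder rule standing in for the real feature under test.
    return "manual_review" if claim["amount"] > 5000 else "approve"


result = evaluate(cases, toy_predict)
result_json = json.dumps(result, indent=2)  # JSON eval result
result_md = to_markdown(result)             # Markdown eval result
```

Writing `result_json` and `result_md` to files each week would give the docs something concrete to cite as eval evidence.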


Global dataset philosophy

Do not collect random PDFs.

Build controlled claim packets:

claim packet
→ known source documents
→ known extraction truth
→ known validation result
→ known workflow status
→ known review / RAG / memory / agent expectation

Your attached plan correctly says the dataset should become a workflow testbench, not a folder of random insurance PDFs.
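
One possible way to make the claim packet chain concrete is a single record that carries every "known" item alongside the packet. The sketch below is only an assumption about shape: the `ClaimPacket` class, its field names, and the sample values are hypothetical and would need to match whatever schema the project actually adopts.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ClaimPacket:
    """One controlled claim packet with its gold expectations (hypothetical schema)."""
    packet_id: str
    source_documents: list[str]       # known source documents
    extraction_truth: dict[str, str]  # known extraction truth
    validation_result: str            # known validation result
    workflow_status: str              # known workflow status
    # known review / RAG / memory / agent expectation
    expectations: dict[str, str] = field(default_factory=dict)


# Illustrative values only; none of these come from a real claim.
packet = ClaimPacket(
    packet_id="packet-001",
    source_documents=["claim_form.pdf", "invoice.pdf"],
    extraction_truth={"claimant_name": "Jane Doe", "claim_amount": "1200.00"},
    validation_result="pass",
    workflow_status="ready_for_review",
    expectations={"review": "no_flags"},
)

# Serialize so the packet can sit next to its documents as a gold file.
packet_json = json.dumps(asdict(packet), indent=2)
```

Keeping the gold values in the same record as the document list means each packet can be tested end to end without looking anything up elsewhere.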

Source mix

Use:

80% synthetic claim packets
20% public anchor documents
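
To turn the 80/20 mix into concrete counts, a small helper like the one below could be used. The `split_source_mix` name and the total of 50 items are assumptions chosen for illustration; the plan itself does not fix a dataset size.

```python
def split_source_mix(total_items, synthetic_percent=80):
    """Split a dataset size into synthetic and public-anchor counts."""
    # Integer arithmetic avoids floating-point rounding surprises.
    synthetic = total_items * synthetic_percent // 100
    public_anchor = total_items - synthetic
    return synthetic, public_anchor


# Hypothetical dataset size of 50 items.
synthetic_count, anchor_count = split_source_mix(50)
```

With 50 items this yields 40 synthetic claim packets and 10 public anchor documents.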

Synthetic data for