Per-Tool Precision Table
Last evaluated: 2026-04-25 (snapshot of evals/gold.yaml against data/jpintel.db).
Source: 79-query gold suite at evals/gold.yaml. Every expected_ids list was generated by calling the live tool against data/jpintel.db. CI re-runs the suite on every PR via evals/run.py. Drift surfaces as a non-zero exit code.
Precision policy
- `precision@10` = fraction of the top-10 returned rows that are in `expected_ids`.
- `recall@10` = fraction of `expected_ids` that appear in the top-10 (only meaningful when `total_hits > 10`; for queries where `total_hits <= 10` we record N/A and rely on precision).
- `min_precision_at_10` is the gate, not the claim. Actual precision floats above the gate; the gate is set to the level a future regression must respect.
- One-shot tools (`smb_starter_pack`, `deadline_calendar`, `subsidy_combo_finder`, `similar_cases`, `subsidy_roadmap_3yr`, `regulatory_prep_pack`) and `dd_profile_am` are synthesized-output tools: they don't have rankable `expected_ids`. They are gated on response shape (non-error, profile echo, sections present) instead of precision/recall. They are listed below for completeness with `precision@10 = N/A`.
- Edge cases (`edge_001_unknown_prefecture`, `edge_002_empty_q`) deliberately have empty `expected_ids` and `min_precision_at_10 = 0.0`: they test the `input_warnings` envelope, not ranking.
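The two metrics above can be sketched in a few lines. This is a minimal illustration, not the code in `evals/run.py`: the function names are hypothetical, and the fixed denominator of 10 for precision is an assumption inferred from gates such as a query with 3 total hits being gated at 0.3.

```python
from typing import Optional, Sequence


def precision_at_k(returned_ids: Sequence[str],
                   expected_ids: Sequence[str],
                   k: int = 10) -> float:
    """Share of the k top slots filled by an expected row.

    Assumption: the denominator is k itself, so a query that can only
    return 3 rows tops out at 3 / 10 = 0.3.
    """
    expected = set(expected_ids)
    hits = sum(1 for row_id in list(returned_ids)[:k] if row_id in expected)
    return hits / k


def recall_at_k(returned_ids: Sequence[str],
                expected_ids: Sequence[str],
                total_hits: int,
                k: int = 10) -> Optional[float]:
    """Share of expected_ids found in the top k, or None for N/A.

    Mirrors the policy: recall is only recorded when total_hits > k.
    """
    if total_hits <= k or not expected_ids:
        return None
    top_k = set(list(returned_ids)[:k])
    found = sum(1 for row_id in expected_ids if row_id in top_k)
    return found / len(expected_ids)
```

Returning `None` rather than `0.0` for the N/A case keeps "not measured" distinct from "measured and zero" when results are aggregated.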
Aggregate by tool
| Tool | Gold queries | min p@10 (worst-case gate) | min p@10 (median gate) | Notes |
|---|---|---|---|---|
| `search_programs` | 33 | 0.0 (edge cases) | 0.7 | Core FTS5 trigram search. Precision gate ranges 0.0 (edge) → 0.8 (`mfg_001`, ものづくり). |
| `search_tax_rules` | 4 | 0.1 | 0.35 | Small index (35 rows). Precision floor is shallow because total hits is often ≤ 5. |
| `search_loan_programs` | 3 | 0.3 | 0.3 | 108 loan rows. Three-axis breakdown (collateral / personal guarantor / third-party guarantor) filter combinations. |
| `search_case_studies` | 3 | 0.7 | 0.7 | mirasapo adopted-case studies (採択事例), 2,286 rows. Precision stable at 0.7 (industry × prefecture). |
| `search_enforcement_cases` | 1 | 0.7 | 0.7 | jbaudit FY Reiwa 5 (令和5年度), 1,185 rows. Precision is publication-date-stable. |
| `search_invoice_registrants` | 1 | 0.1 | 0.1 | National Tax Agency (国税庁) PDL v1.0, 13,801-row delta. One query verifies normalization of a 13-digit corporate number (法人番号) to T + 13 digits. |
| `prescreen_programs` | 3 | 0.6 | 0.7 | Profile-based fit ranking. Two layers: Tier S prefecture-level plus nationwide fallback. |
| `smb_starter_pack` | 5 | N/A | shape gate | One-shot synthesizer (`top_subsidies` + `top_loans` + `tax_hints` + `same_industry_enforcement_count`). precision@10 is not meaningful; passes when `profile.prefecture` is echoed and sections are non-empty. |
| `deadline_calendar` | 5 | N/A | shape gate | `by_month` bucket + `urgent_next_7_days` int + `empty_months` hint. |
| `subsidy_combo_finder` | 5 | N/A | shape gate | `combos` contains subsidy + loan + tax triples; conflict detection. |
| `similar_cases` | 5 | N/A | shape gate | Jaccard similarity + `program_used` boost; seed exclusion. |
| `subsidy_roadmap_3yr` | 5 | N/A | shape gate | JST fiscal-year quarter buckets; `opens_at` ASC; horizon 24-36 months. |
| `regulatory_prep_pack` | 5 | N/A | shape gate | Four sections: `laws` + `certifications` + `tax_rulesets` + `recent_enforcement`. |
| `dd_profile_am` | 1 | N/A | shape gate | Single-company entity dossier. Checks that the adoptions list is non-empty, plus invoice and enforcement. |
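For the shape-gated rows, a check along the following lines would capture the three conditions named in the policy (non-error, profile echo, sections present). It is only a sketch: the response layout (an `error` key, a `profile` mapping, list-valued sections) and the function name are assumptions for illustration, not the actual tool contract.

```python
from typing import Any, Mapping, Sequence


def passes_shape_gate(response: Mapping[str, Any],
                      expected_prefecture: str,
                      required_sections: Sequence[str]) -> bool:
    """Return True when a one-shot response has the expected shape.

    Assumed layout (hypothetical): an optional "error" key, a "profile"
    mapping that echoes the requested prefecture, and one list per section.
    """
    # 1. Non-error: any truthy "error" value fails the gate.
    if response.get("error"):
        return False
    # 2. Profile echo: the prefecture we sent must come back unchanged.
    profile = response.get("profile") or {}
    if profile.get("prefecture") != expected_prefecture:
        return False
    # 3. Sections present: every required section exists and is non-empty.
    return all(response.get(name) for name in required_sections)
```

Integer fields such as `same_industry_enforcement_count` can legitimately be zero, so they are best left out of `required_sections` and checked for presence separately.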
Per-query detail (search_programs subset, top 33)
Detailed precision gates per query are visible in evals/gold.yaml with each row carrying its own min_precision_at_10 and a note: explaining the regression we're guarding against. Selected highlights:
| Query | Total hits | min p@10 | What it tests |
|---|---|---|---|
| `agri_001_nintei_shinki_shunou` | 617 | 0.7 | FTS5 trigram stability for the agri-pivot core `target_type`. |
| `mfg_001_monozukuri` | 9 | 0.8 | The best-known SME subsidy (中小企業補助金); tier ranking. |
| `it_001_donyu_subsidy` | 7 + tier=X excluded | 0.7 | Exclusion test: UNI-158e0bb965 (tier=X) must NEVER appear given tier=[S,A,B]. |
| `pref_004_okinawa` | 12 | 0.8 | Top-10 are the full set: 沖縄一括交付金 (Okinawa lump-sum grant) + 県農業近代化資金 (prefectural agricultural modernization fund). |
| `pref_005_ishikawa` | 6 | 0.6 | Recall limit for a thinly covered prefecture (national data skew documented 2026-04-15). |
| `tax_001_invoice` | 10+ | 0.8 | Checks inclusion of the 2026-09 cliff date (end of the 2割特例, the 20% special measure). |
| `loan_001_no_collateral` | 3 | 0.3 | Three-axis filter (collateral + third-party guarantor together): checks whether all total=3 rows are retrieved. |
| `enforcement_001_tokyo` | 116 | 0.7 | Board of Audit (会計検査院) FY Reiwa 5 report; `publication_date` DESC stability. |
| `cross_001_invoice_lookup` | 1 | 0.1 | Normalization path from a 13-digit corporate number (法人番号) to T + 13 digits. |
| `edge_001_unknown_prefecture` | 0 (warning) | 0.0 | `input_warnings: [unknown_prefecture]` envelope shape. |
| `edge_002_empty_q` | 3,688 | 0.0 | `q=''` returns all rows; only checks that total > 0. |
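The normalization exercised by `cross_001_invoice_lookup` (13-digit corporate number to T + 13 digits) could look roughly like this. The function name and the decision to accept an already-prefixed value are assumptions, the sample number is a made-up placeholder rather than a real registrant, and no check-digit validation is attempted.

```python
import re

# Exactly 13 ASCII digits; full-width digits are deliberately rejected.
_THIRTEEN_DIGITS = re.compile(r"[0-9]{13}")


def normalize_invoice_registration_number(raw: str) -> str:
    """Turn a 13-digit corporate number into the "T" + 13 digits form.

    Hypothetical helper: also accepts input that already carries the
    "T" prefix (in either case) and returns it in canonical upper case.
    """
    cleaned = raw.strip().upper()
    if cleaned.startswith("T"):
        cleaned = cleaned[1:]
    if not _THIRTEEN_DIGITS.fullmatch(cleaned):
        raise ValueError(f"expected 13 digits, got {raw!r}")
    return "T" + cleaned


# Placeholder value for illustration only.
print(normalize_invoice_registration_number("1234567890123"))
```

Raising on malformed input, instead of returning the string unchanged, keeps a bad lookup key from silently matching nothing.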
How to re-evaluate
.venv/bin/python evals/run.py
# Exit 0 = all 79 queries pass their `min_precision_at_10` gate
# Exit 1 = at least one regression — fail loud, see diff in stderr
To update gates after a real schema change (e.g. new tier policy, new tax_rulesets row), regenerate expected_ids against the current DB, manually inspect that the new top-10 makes sense, then commit gold.yaml + a CHANGELOG entry. Never raise a gate without inspecting the new ground truth.
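The exit-code contract described above (0 when every query clears its gate, 1 on any regression, details on stderr) can be sketched as follows. This is an illustrative reduction, not the contents of `evals/run.py`; `check_gates` and its arguments are hypothetical names.

```python
import sys
from typing import Mapping, TextIO


def check_gates(observed: Mapping[str, float],
                gates: Mapping[str, float],
                err: TextIO = sys.stderr) -> int:
    """Compare observed precision@10 per query against its gate.

    Returns the process exit code: 0 if all gates hold, 1 otherwise.
    A query missing from `observed` is treated as precision 0.0.
    """
    failures = []
    for query_id, gate in gates.items():
        precision = observed.get(query_id, 0.0)
        # Equality passes: the gate is a floor, not a strict bound.
        if precision < gate:
            failures.append((query_id, precision, gate))

    for query_id, precision, gate in failures:
        print(f"REGRESSION {query_id}: precision@10={precision:.2f} "
              f"< gate {gate:.2f}", file=err)
    return 1 if failures else 0
```

A runner would end with `sys.exit(check_gates(observed, gates))`, so CI fails the PR whenever the function returns 1.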
Coverage gaps (open work)
- autonomath.db tools (16) are NOT yet in gold.yaml: V1 launch ships shape-only checks. Adding gold queries for `search_tax_incentives` / `search_certifications` / `intent_of` / `reason_answer` is post-launch work (P5-P6).
- Cross-dataset glue (`trace_program_to_law`, `find_cases_by_law`, `combined_compliance_check`): only one cross_* query exists. Expanding to 5+ is P5.
- answer() aggregator: not in the 55-tool count yet, so the gold suite excludes it.
Honesty constraints
- This table reflects the current snapshot. We commit gold.yaml regenerated runs, not optimistic numbers.
- Tier=X is excluded from every search path. `it_001_donyu_subsidy` has a `forbidden_ids: [UNI-158e0bb965]` to enforce this; silent inclusion of quarantined rows = trust kill.
- `source_fetched_at` semantics: rendered as 「出典取得」 (when we last fetched), never as 「最終更新」 (which would imply we verified currency). Documented in `CLAUDE.md` under "Common gotchas".
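A `forbidden_ids` check is separate from the precision gate: one quarantined row is a failure no matter how good the ranking is. A minimal sketch, assuming the runner has the returned IDs as a list (the function name and the non-forbidden sample IDs are hypothetical):

```python
from typing import Iterable, List


def forbidden_hits(returned_ids: Iterable[str],
                   forbidden_ids: Iterable[str]) -> List[str]:
    """Return every returned ID that appears in forbidden_ids.

    An empty list means the exclusion held; anything else should fail
    the query regardless of its precision@10.
    """
    forbidden = set(forbidden_ids)
    return [row_id for row_id in returned_ids if row_id in forbidden]


# "UNI-aaa" and "UNI-bbb" are placeholder IDs for illustration.
leaked = forbidden_hits(["UNI-aaa", "UNI-158e0bb965", "UNI-bbb"],
                        ["UNI-158e0bb965"])
print(leaked)
```

Checking the full result list rather than only the top 10 matches the "must NEVER appear" wording.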
Auto-generated discussion of the 79-query gold suite, distilled from evals/gold.yaml (last regenerated 2026-04-25). For per-query expected IDs and notes, read the YAML directly.