Per-Tool Precision Table

Last evaluated: 2026-04-25 (snapshot of evals/gold.yaml against data/jpintel.db).

Source: 79-query gold suite at evals/gold.yaml. Every expected_ids list was generated by calling the live tool against data/jpintel.db. CI re-runs the suite on every PR via evals/run.py. Drift surfaces as a non-zero exit code.
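The drift check can be pictured as a loop that collects failing queries and turns them into a process exit code. This is a minimal sketch, not the contents of evals/run.py: `QueryResult`, `exit_code_for`, and the sample measurements are hypothetical names and values introduced for illustration.

```python
import sys
from dataclasses import dataclass


@dataclass
class QueryResult:
    """Outcome of one gold query (hypothetical structure, not from run.py)."""
    query_id: str
    precision_at_10: float
    min_precision_at_10: float

    @property
    def passed(self) -> bool:
        # A query passes when measured precision meets or exceeds its gate.
        return self.precision_at_10 >= self.min_precision_at_10


def exit_code_for(results: list[QueryResult]) -> int:
    """Return 0 when every query meets its gate, 1 otherwise."""
    failures = [r for r in results if not r.passed]
    for r in failures:
        # Report each regression on stderr so the CI log shows what drifted.
        print(
            f"FAIL {r.query_id}: p@10={r.precision_at_10:.2f} "
            f"< gate {r.min_precision_at_10:.2f}",
            file=sys.stderr,
        )
    return 1 if failures else 0


# Made-up measurements for two hypothetical queries.
sample = [
    QueryResult("example_pass", precision_at_10=0.9, min_precision_at_10=0.7),
    QueryResult("example_drift", precision_at_10=0.4, min_precision_at_10=0.6),
]
status = exit_code_for(sample)  # a real runner would pass this to sys.exit()
```

Keeping the exit-code decision in a pure function makes the pass/fail rule testable without spawning a process.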

Precision policy

precision@10 = fraction of the top-10 returned rows that are in expected_ids.
recall@10 = fraction of expected_ids that appear in the top-10 (only meaningful when total_hits > 10; for queries where total_hits <= 10 we record N/A and rely on precision).
min_precision_at_10 is the gate, not the claim. Actual precision floats above the gate; the gate is set to the level a future regression must respect.
One-shot tools (smb_starter_pack, deadline_calendar, subsidy_combo_finder, similar_cases, subsidy_roadmap_3yr, regulatory_prep_pack) and dd_profile_am are synthesized output tools — they don't have rankable expected_ids. They are gated on response shape (non-error, profile echo, sections present) instead of precision/recall. They are listed below for completeness with precision@10 = N/A.
Edge cases (edge_001_unknown_prefecture, edge_002_empty_q) deliberately have empty expected_ids and min_precision_at_10 = 0.0 — they test the input_warnings envelope, not ranking.
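The two metrics and the gate defined above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in evals/run.py: the function names are hypothetical, the precision denominator is taken to be the number of rows actually returned in the top 10, and an empty result is scored 0.0 (which still passes a 0.0 gate).

```python
from typing import Optional, Sequence

K = 10


def precision_at_k(returned_ids: Sequence[str], expected_ids: Sequence[str], k: int = K) -> float:
    """Fraction of the top-k returned rows that are in expected_ids."""
    top = list(returned_ids[:k])
    if not top:
        return 0.0  # assumption: an empty result scores 0.0
    expected = set(expected_ids)
    return sum(1 for rid in top if rid in expected) / len(top)


def recall_at_k(
    returned_ids: Sequence[str],
    expected_ids: Sequence[str],
    total_hits: int,
    k: int = K,
) -> Optional[float]:
    """Fraction of expected_ids found in the top-k; None stands for N/A."""
    if total_hits <= k or not expected_ids:
        return None  # recall is only recorded when total_hits > k
    top = set(returned_ids[:k])
    return sum(1 for eid in expected_ids if eid in top) / len(expected_ids)


def passes_gate(returned_ids: Sequence[str], expected_ids: Sequence[str], min_precision_at_10: float) -> bool:
    """The gate is a floor: measured precision may float above it."""
    return precision_at_k(returned_ids, expected_ids) >= min_precision_at_10


# Hypothetical IDs: 7 of the 10 returned rows are expected.
returned = [f"ID-{i}" for i in range(10)]
expected = [f"ID-{i}" for i in range(7)] + ["ID-x", "ID-y", "ID-z"]
p = precision_at_k(returned, expected)
r = recall_at_k(returned, expected, total_hits=50)
gate_ok = passes_gate(returned, expected, min_precision_at_10=0.7)
```

Returning `None` for recall keeps the N/A case distinct from a genuine recall of 0.0.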

Aggregate by tool

Tool	Gold queries	min p@10 (worst-case gate)	min p@10 (median gate)	Notes
`search_programs`	33	0.0 (edge cases)	0.7	Core FTS5 trigram search. Precision gate ranges from 0.0 (edge cases) to 0.8 (mfg_001, ものづくり).
`search_tax_rules`	4	0.1	0.35	Small index (35 rows). The precision floor is low because total hits are often ≤ 5.
`search_loan_programs`	3	0.3	0.3	108 loan rows. Filter combinations over the three-axis decomposition (collateral / personal guarantor / third-party guarantor).
`search_case_studies`	3	0.7	0.7	mirasapo adopted-case studies (採択事例), 2,286 rows. Precision stable at 0.7 (industry × prefecture).
`search_enforcement_cases`	1	0.7	0.7	jbaudit FY Reiwa 5 (令和5年度), 1,185 rows. Precision is stable under publication_date ordering.
`search_invoice_registrants`	1	0.1	0.1	National Tax Agency (国税庁) PDL v1.0, 13,801-row delta. One query confirms normalization of a 13-digit corporate number (法人番号) to T + 13 digits.
`prescreen_programs`	3	0.6	0.7	Profile-based fit ranking. Two layers: Tier S prefecture results plus a nationwide fallback.
`smb_starter_pack`	5	N/A	shape gate	One-shot synthesizer (top_subsidies + top_loans + tax_hints + same_industry_enforcement_count). precision@10 is not meaningful; passes when profile.prefecture is echoed and sections are non-empty.
`deadline_calendar`	5	N/A	shape gate	by_month buckets + urgent_next_7_days int + empty_months hint.
`subsidy_combo_finder`	5	N/A	shape gate	combos holds subsidy + loan + tax triples; conflict detection.
`similar_cases`	5	N/A	shape gate	Jaccard similarity + program_used boost; seed exclusion.
`subsidy_roadmap_3yr`	5	N/A	shape gate	Quarter buckets by JST fiscal year, ordered by opens_at ASC, horizon 24-36 months.
`regulatory_prep_pack`	5	N/A	shape gate	Four sections: laws + certifications + tax_rulesets + recent_enforcement.
`dd_profile_am`	1	N/A	shape gate	Single-company entity dossier. Checks for a non-empty adoptions list plus invoice and enforcement data.
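A shape gate for `smb_starter_pack` might look like the following. This is a hedged sketch: the section names come from the Notes column, but the flat dict layout of the response, the `error` key, and the function name are assumptions made for illustration, not the tool's documented envelope.

```python
# Sections expected to be non-empty lists (names taken from the table above).
LIST_SECTIONS = ("top_subsidies", "top_loans", "tax_hints")


def smb_starter_pack_shape_ok(response: dict, requested_prefecture: str) -> bool:
    """Return True when the response passes the shape gate."""
    # Non-error: assume failures are signalled by a truthy "error" key.
    if response.get("error"):
        return False
    # Profile echo: the requested prefecture must come back unchanged.
    profile = response.get("profile") or {}
    if profile.get("prefecture") != requested_prefecture:
        return False
    # Sections present and non-empty.
    for name in LIST_SECTIONS:
        section = response.get(name)
        if not isinstance(section, list) or not section:
            return False
    # The enforcement count is a number, so only its presence and type are checked.
    return isinstance(response.get("same_industry_enforcement_count"), int)


# Hypothetical response used only to exercise the check.
example = {
    "profile": {"prefecture": "東京都"},
    "top_subsidies": [{"id": "example-1"}],
    "top_loans": [{"id": "example-2"}],
    "tax_hints": ["example hint"],
    "same_industry_enforcement_count": 0,
}
shape_ok = smb_starter_pack_shape_ok(example, "東京都")
```

A count of 0 still passes here, on the assumption that zero enforcement cases is a valid answer rather than a missing section.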

Per-query detail (search_programs subset, top 33)

Per-query precision gates live in evals/gold.yaml; each entry carries its own min_precision_at_10 and a `note:` field explaining the regression it guards against. Selected highlights, including a few queries for other tools:

Query	Total hits	min p@10	What it tests
`agri_001_nintei_shinki_shunou`	617	0.7	FTS5 trigram stability for the agri-pivot core target_type.
`mfg_001_monozukuri`	9	0.8	Best-known SME subsidy (中小企業補助金); tier ranking.
`it_001_donyu_subsidy`	7 + tier=X excluded	0.7	Exclusion test: `UNI-158e0bb965` (tier=X) must NEVER appear given `tier=[S,A,B]`.
`pref_004_okinawa`	12	0.8	Top-10 are the full set: Okinawa lump-sum grant (沖縄一括交付金) + prefectural agricultural modernization fund (県農業近代化資金).
`pref_005_ishikawa`	6	0.6	Thin-prefecture recall limit (national data skew documented 2026-04-15).
`tax_001_invoice`	10+	0.8	Inclusion check for the 2026-09 cliff date (end of the 20% special provision, 2割特例).
`loan_001_no_collateral`	3	0.3	Three-axis filter (collateral and third-party guarantor applied together): checks that all 3 hits (total=3) are retrieved.
`enforcement_001_tokyo`	116	0.7	Board of Audit (会計検査院) FY Reiwa 5 report; publication_date DESC stability.
`cross_001_invoice_lookup`	1	0.1	Normalization path from a 13-digit corporate number (法人番号) to T + 13 digits.
`edge_001_unknown_prefecture`	0 (warning)	0.0	`input_warnings: [unknown_prefecture]` envelope shape.
`edge_002_empty_q`	3,688	0.0	`q=''` returns all rows; only total > 0 is checked.
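The normalization path exercised by `cross_001_invoice_lookup` (13-digit corporate number to T + 13 digits) can be sketched as below. The function name is hypothetical, accepting an already-prefixed value is an assumption, and the sketch checks only the digit count, not the corporate number's check digit.

```python
import re

# Exactly 13 ASCII digits; [0-9] is used because \d would also match
# full-width digits in Python 3.
_THIRTEEN_DIGITS = re.compile(r"[0-9]{13}")


def normalize_invoice_number(raw: str) -> str:
    """Normalize a 13-digit corporate number to the 'T' + 13 digits form."""
    candidate = raw.strip().upper()
    # Assumption: input that already carries the T prefix is accepted as-is.
    if candidate.startswith("T"):
        candidate = candidate[1:]
    if not _THIRTEEN_DIGITS.fullmatch(candidate):
        raise ValueError(f"expected 13 digits, got {raw!r}")
    return "T" + candidate


# "1234567890123" is a made-up number, not a real registrant.
normalized = normalize_invoice_number("1234567890123")
```

Raising on malformed input, rather than passing it through, keeps lookups from silently matching nothing.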

How to re-evaluate

```shell
.venv/bin/python evals/run.py
# Exit 0 = all 79 queries pass their `min_precision_at_10` gate
# Exit 1 = at least one regression; fails loudly, see the diff in stderr
```

To update gates after a real schema change (e.g. new tier policy, new tax_rulesets row), regenerate expected_ids against the current DB, manually inspect that the new top-10 makes sense, then commit gold.yaml + a CHANGELOG entry. Never raise a gate without inspecting the new ground truth.
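The manual inspection step is easier with an explicit diff between the old and regenerated expected_ids. The helpers below are a hypothetical sketch (neither function is claimed to exist in evals/); they list what changed and flag a raised gate so it gets a second look.

```python
from typing import Dict, List, Sequence


def diff_expected_ids(old_ids: Sequence[str], new_ids: Sequence[str]) -> Dict[str, List[str]]:
    """Summarize how a regenerated expected_ids list differs from the old one."""
    old_set, new_set = set(old_ids), set(new_ids)
    return {
        # Order is preserved so the reviewer sees IDs in ranking order.
        "added": [i for i in new_ids if i not in old_set],
        "removed": [i for i in old_ids if i not in new_set],
        "kept": [i for i in new_ids if i in old_set],
    }


def gate_raised(old_gate: float, new_gate: float) -> bool:
    """True when min_precision_at_10 went up and needs explicit review."""
    return new_gate > old_gate


# Made-up IDs for illustration.
changes = diff_expected_ids(["a", "b", "c"], ["b", "c", "d"])
needs_review = gate_raised(0.6, 0.7) or bool(changes["added"] or changes["removed"])
```

Any non-empty `added` or `removed` list is a prompt to open the rows and confirm the new top-10 makes sense before committing.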

Coverage gaps (open work)

autonomath.db tools (16) are NOT yet in gold.yaml — V1 launch ships shape-only checks. Adding gold queries for search_tax_incentives / search_certifications / intent_of / reason_answer is post-launch work (P5-P6).
Cross-dataset glue (trace_program_to_law, find_cases_by_law, combined_compliance_check) — only one cross_* query exists. Expanding to 5+ is P5.
The answer() aggregator is not yet in the 55-tool count, and the gold suite excludes it.

Honesty constraints

This table reflects the current snapshot. We commit regenerated gold.yaml runs, not optimistic numbers.
Tier=X is excluded from every search path. it_001_donyu_subsidy carries forbidden_ids: [UNI-158e0bb965] to enforce this, because silently including quarantined rows would destroy user trust.
source_fetched_at semantics: rendered as 「出典取得」 (when we last fetched), never as 「最終更新」 (which would imply we verified currency). Documented in CLAUDE.md under "Common gotchas".
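The tier=X exclusion enforced through forbidden_ids can be sketched as a check over the full result list. This is an illustrative assumption about how such a check could be written, not the logic in evals/run.py; `forbidden_hits` is a hypothetical name and every ID other than `UNI-158e0bb965` is made up.

```python
from typing import List, Sequence


def forbidden_hits(returned_ids: Sequence[str], forbidden_ids: Sequence[str]) -> List[str]:
    """Return every forbidden ID that leaked into the results, in result order."""
    forbidden = set(forbidden_ids)
    # The whole result list is scanned, not just the top 10: a quarantined
    # row must never appear at any rank.
    return [rid for rid in returned_ids if rid in forbidden]


clean = forbidden_hits(["UNI-example-1", "UNI-example-2"], ["UNI-158e0bb965"])
leaked = forbidden_hits(["UNI-example-1", "UNI-158e0bb965"], ["UNI-158e0bb965"])
# A query with forbidden_ids would fail whenever the returned list is non-empty.
```

Treating any leak as a hard failure, independent of the precision gate, matches the intent that quarantined rows never surface.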

Auto-generated discussion of the 79-query gold suite, distilled from evals/gold.yaml (last regenerated 2026-04-25). For per-query expected IDs and notes, read the YAML directly.