Per-Tool Precision Table

Last evaluated: 2026-04-25 (snapshot of evals/gold.yaml against data/jpintel.db).

Source: 79-query gold suite at evals/gold.yaml. Every expected_ids list was generated by calling the live tool against data/jpintel.db. CI re-runs the suite on every PR via evals/run.py. Drift surfaces as a non-zero exit code.

Precision policy

  • precision@10 = fraction of the top-10 returned rows that are in expected_ids.
  • recall@10 = fraction of expected_ids that appear in the top-10 (only meaningful when total_hits > 10; for queries where total_hits <= 10 we record N/A and rely on precision).
  • min_precision_at_10 is the gate, not the claim. Actual precision floats above the gate; the gate is set to the level a future regression must respect.
  • One-shot tools (smb_starter_pack, deadline_calendar, subsidy_combo_finder, similar_cases, subsidy_roadmap_3yr, regulatory_prep_pack) and dd_profile_am are synthesized-output tools — they have no rankable expected_ids. They are gated on response shape (non-error, profile echo, sections present) instead of precision/recall. They are listed below for completeness with precision@10 = N/A.
  • Edge cases (edge_001_unknown_prefecture, edge_002_empty_q) deliberately have empty expected_ids and min_precision_at_10 = 0.0 — they test the input_warnings envelope, not ranking.
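The two metrics above can be sketched in a few lines of Python. This is a hypothetical helper, not the actual evals/run.py code; it assumes the precision denominator is the number of returned rows when fewer than 10 come back, which keeps the edge-case gates (0.0) consistent.

```python
def precision_at_10(returned_ids, expected_ids):
    """Fraction of the top-10 returned rows that are in expected_ids."""
    top10 = returned_ids[:10]
    if not top10:
        return 0.0  # empty result set, e.g. edge_001_unknown_prefecture
    expected = set(expected_ids)
    return sum(1 for rid in top10 if rid in expected) / len(top10)

def recall_at_10(returned_ids, expected_ids, total_hits):
    """Fraction of expected_ids appearing in the top-10.

    Only meaningful when total_hits > 10; otherwise recorded as N/A
    (None) and the gate relies on precision alone.
    """
    if total_hits <= 10:
        return None
    top10 = set(returned_ids[:10])
    return sum(1 for eid in expected_ids if eid in top10) / len(expected_ids)
```

Either function takes IDs in ranked order, so the same lists can be fed straight from a tool response.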

Aggregate by tool

| Tool | Gold queries | min p@10 (worst-case gate) | min p@10 (median gate) | Notes |
|---|---|---|---|---|
| search_programs | 33 | 0.0 (edge cases) | 0.7 | Core FTS5 trigram search. Precision gates range from 0.0 (edge) to 0.8 (mfg_001 ものづくり). |
| search_tax_rules | 4 | 0.1 | 0.35 | Small index (35 rows). The precision floor is shallow because total hits is often ≤ 5. |
| search_loan_programs | 3 | 0.3 | 0.3 | 108 loan rows. Three-axis (collateral / personal guarantor / third-party guarantor) filter combinations. |
| search_case_studies | 3 | 0.7 | 0.7 | mirasapo adopted-case records, 2,286 rows. Precision stable at 0.7 (industry × prefecture). |
| search_enforcement_cases | 1 | 0.7 | 0.7 | jbaudit FY2023 (令和5年度), 1,185 rows. Precision is publication-date-stable. |
| search_invoice_registrants | 1 | 0.1 | 0.1 | NTA (国税庁) PDL v1.0, 13,801-row delta. Verifies 13-digit corporate number → T+13-digit normalization on a single record. |
| prescreen_programs | 3 | 0.6 | 0.7 | Profile-based fit ranking. Two layers: Tier S prefecture match plus nationwide fallback. |
| smb_starter_pack | 5 | N/A | shape gate | One-shot synthesizer (top_subsidies + top_loans + tax_hints + same_industry_enforcement_count). precision@10 is meaningless here; passes on profile.prefecture echo plus non-empty sections. |
| deadline_calendar | 5 | N/A | shape gate | by_month buckets + urgent_next_7_days int + empty_months hint. |
| subsidy_combo_finder | 5 | N/A | shape gate | Subsidy + loan + tax triple in combos, plus conflict detection. |
| similar_cases | 5 | N/A | shape gate | Jaccard similarity + program_used boost, seed exclusion. |
| subsidy_roadmap_3yr | 5 | N/A | shape gate | JST fiscal-year quarter buckets, opens_at ASC, 24-36 month horizon. |
| regulatory_prep_pack | 5 | N/A | shape gate | Four sections: laws + certifications + tax_rulesets + recent_enforcement. |
| dd_profile_am | 1 | N/A | shape gate | Single-company entity dossier. Checks adoptions list non-empty, plus invoice and enforcement records. |
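A shape gate for the one-shot tools might look like the following sketch. The function is hypothetical; only the checks it encodes (non-error, profile.prefecture echo, non-empty sections) come from the gating policy described above, and the response field names are assumptions about the tool envelope.

```python
def passes_shape_gate(response, expected_prefecture, required_sections):
    """Hypothetical shape gate for one-shot synthesizer tools.

    Passes only if the response is non-error, echoes the prefecture
    it was asked about, and every required section is non-empty.
    """
    if response.get("error"):
        return False
    # Profile echo: the tool must reflect the profile it was given.
    if response.get("profile", {}).get("prefecture") != expected_prefecture:
        return False
    # Every required section must exist and be non-empty.
    return all(response.get(section) for section in required_sections)
```

For smb_starter_pack, required_sections would be the four section keys listed in the table (top_subsidies, top_loans, tax_hints, same_industry_enforcement_count).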

Per-query detail (selected highlights across tools)

Detailed precision gates per query are visible in evals/gold.yaml; each row carries its own min_precision_at_10 and a note: field explaining the regression it guards against. Selected highlights:

| Query | Total hits | min p@10 | What it tests |
|---|---|---|---|
| agri_001_nintei_shinki_shunou | 617 | 0.7 | FTS5 trigram stability for the agri-pivot core target_type. |
| mfg_001_monozukuri | 9 | 0.8 | The best-known SME subsidy (ものづくり補助金); tier ranking. |
| it_001_donyu_subsidy | 7 (+ tier=X excluded) | 0.7 | Exclusion test: UNI-158e0bb965 (tier=X) must NEVER appear given tier=[S,A,B]. |
| pref_004_okinawa | 12 | 0.8 | Top-10 is the full set: 沖縄一括交付金 + 県農業近代化資金. |
| pref_005_ishikawa | 6 | 0.6 | Recall limit for a thinly covered prefecture (national data skew documented 2026-04-15). |
| tax_001_invoice | 10+ | 0.8 | Inclusion check for the 2026-09 cliff date (end of the 2割特例 transitional measure). |
| loan_001_no_collateral | 3 | 0.3 | Three-axis filter (collateral and third-party guarantor applied simultaneously) — verifies that all total=3 hits are recovered. |
| enforcement_001_tokyo | 116 | 0.7 | Board of Audit (会計検査院) FY2023 (令和5年度) report; publication_date DESC stability. |
| cross_001_invoice_lookup | 1 | 0.1 | 13-digit corporate number → T+13-digit normalization path. |
| edge_001_unknown_prefecture | 0 (warning) | 0.0 | input_warnings: [unknown_prefecture] envelope shape. |
| edge_002_empty_q | 3,688 | 0.0 | q='' returns all rows; only total > 0 is checked. |
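For orientation, a gold.yaml entry could look roughly like this. The fragment is hypothetical: only min_precision_at_10, forbidden_ids, expected_ids, and note: are field names cited in this document; the remaining keys and values are illustrative.

```yaml
# Hypothetical gold.yaml entry (field layout is illustrative).
- id: it_001_donyu_subsidy
  tool: search_programs
  args:
    tier: [S, A, B]
  expected_ids: []             # regenerated from the live tool against data/jpintel.db
  forbidden_ids: [UNI-158e0bb965]  # tier=X quarantine must never leak in
  min_precision_at_10: 0.7
  note: "Exclusion test: quarantined row must not appear given tier=[S,A,B]"
```

Reading the YAML directly remains the source of truth for per-query expected IDs.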

How to re-evaluate

```shell
.venv/bin/python evals/run.py
# Exit 0 = all 79 queries pass their min_precision_at_10 gate
# Exit 1 = at least one regression — fail loud, see diff in stderr
```
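The exit-code contract can be sketched as a gate loop. This is an illustrative reimplementation, not the actual evals/run.py; run_tool stands in for whatever dispatches a gold query to the live tool.

```python
import sys

def run_suite(queries, run_tool):
    """Minimal sketch of the gate loop: returns 0 if every query
    clears its min_precision_at_10, else prints regressions to
    stderr and returns 1."""
    failures = []
    for q in queries:
        returned = run_tool(q["tool"], q["args"])[:10]
        expected = set(q["expected_ids"])
        p10 = (sum(r in expected for r in returned) / len(returned)) if returned else 0.0
        if p10 < q["min_precision_at_10"]:
            failures.append((q["id"], p10))
    for qid, p10 in failures:
        print(f"REGRESSION {qid}: p@10={p10:.2f}", file=sys.stderr)
    return 1 if failures else 0
```

The return value maps directly onto the process exit code, so CI can gate on it without parsing output.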

To update gates after a real schema change (e.g. new tier policy, new tax_rulesets row), regenerate expected_ids against the current DB, manually inspect that the new top-10 makes sense, then commit gold.yaml + a CHANGELOG entry. Never raise a gate without inspecting the new ground truth.

Coverage gaps (open work)

  • autonomath.db tools (16) are NOT yet in gold.yaml — V1 launch ships shape-only checks. Adding gold queries for search_tax_incentives / search_certifications / intent_of / reason_answer is post-launch work (P5-P6).
  • Cross-dataset glue (trace_program_to_law, find_cases_by_law, combined_compliance_check) — only one cross_* query exists. Expanding to 5+ is P5.
  • answer() aggregator — not in the 55-tool count yet; the gold suite excludes it.

Honesty constraints

  • This table reflects the current snapshot. We commit gold.yaml from regenerated runs, not optimistic numbers.
  • Tier=X is excluded from every search path. it_001_donyu_subsidy has a forbidden_ids: [UNI-158e0bb965] to enforce this — silent inclusion of quarantined rows = trust kill.
  • source_fetched_at semantics: rendered as 「出典取得」 ("source fetched", i.e. when we last fetched), never as 「最終更新」 ("last updated", which would imply we verified currency). Documented in CLAUDE.md under "Common gotchas".
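The forbidden_ids constraint above is a hard gate independent of precision: any quarantined row in the results fails the query outright. A minimal sketch:

```python
def violates_forbidden(returned_ids, forbidden_ids):
    """Quarantine check: True if any forbidden (tier=X) row appears
    in the results, regardless of how good precision@10 looks."""
    return bool(set(returned_ids) & set(forbidden_ids))
```

A suite runner would apply this before the precision gate, so a quarantine leak fails loudly even on an otherwise passing query.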

Auto-generated discussion of the 79-query gold suite, distilled from evals/gold.yaml (last regenerated 2026-04-25). For per-query expected IDs and notes, read the YAML directly.