AutonoMath: 10 Self-Improvement Loops

Owner: 梅田茂利, [email protected] (Bookyou株式会社)
Status: scaffolding (T+30d for real ML wiring); dry-run only at launch.
Constraint: zero LLM. Local e5-small + DBSCAN + SQL only. See operator memory feedback_autonomath_no_api_use: calling Anthropic / OpenAI / Gemini from these loops is a regression that breaks ¥3/req economics.

Why ten loops, not one big rewrite

Each loop is a single-purpose, small-blast-radius feedback pipeline. All ten share the same shape:

read signal table -> cluster / score -> propose candidates -> operator approves -> production write

This makes the system safe-by-default (dry_run=True is the orchestrator default, and operator review is mandatory before any candidate is promoted). It also matches feedback_completion_gate_minimal — none of these loops are launch blockers; they ship dark and turn on individually as data accumulates.
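
The shared shape can be sketched as one small driver function. This is an illustrative sketch, not the orchestrator's actual code: `LoopResult`, `run_loop`, and the three callables are hypothetical names, and only the counter fields mirror the run report shown later in this document.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class LoopResult:
    loop: str
    scanned: int = 0
    actions_proposed: int = 0
    actions_executed: int = 0


def run_loop(
    name: str,
    read_signals: Callable[[], Iterable[dict]],
    propose: Callable[[list[dict]], list[dict]],
    write_candidates: Callable[[list[dict]], int],
    dry_run: bool = True,
) -> LoopResult:
    """Read signals, propose candidates, and write them only when not in dry-run."""
    signals = list(read_signals())
    candidates = propose(signals)
    result = LoopResult(loop=name, scanned=len(signals), actions_proposed=len(candidates))
    if not dry_run:
        # Writes target a *_candidates table; promotion stays a manual operator step.
        result.actions_executed = write_candidates(candidates)
    return result


# Hypothetical toy run: two signals, one of which yields a candidate.
written: list[dict] = []
demo = run_loop(
    "loop_demo",
    read_signals=lambda: [{"q": "a", "conf": 0.2}, {"q": "b", "conf": 0.9}],
    propose=lambda rows: [r for r in rows if r["conf"] < 0.5],
    write_candidates=lambda c: written.extend(c) or len(c),
)
```

Because `dry_run` defaults to `True`, the toy run counts one proposal but writes nothing.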

Cadence overview

Loop	Name	Cadence	Trigger	Cost ceiling
A	hallucination_guard expansion	weekly	Mon 09:30 JST	5 CPU min
B	testimonial -> SEO/GEO	monthly	1st 10:00 JST	2 CPU min
C	personalized cache	weekly	Sun 03:00 JST	10 CPU min
D	forecast accuracy	monthly	15th 09:00 JST	3 CPU min
E	multi-language alias expansion	weekly	Wed 04:00 JST	5 CPU min
F	channel ROI	weekly	Fri 16:00 JST	1 CPU min
G	invariant expansion	monthly	5th 11:00 JST	3 CPU min
H	cache warming (Zipf)	daily	03:30 JST	10 CPU min
I	doc freshness re-fetch priority	weekly	Tue 02:00 JST	2 CPU min
J	gold.yaml expansion	monthly	20th 09:00 JST	5 CPU min

Total: roughly 7 CPU hours per month if every run hits its ceiling, about 5 of them from the daily loop H. This fits inside the Fly.io shared compute budget without any auto-scale.
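
As a sanity check on the monthly total, the per-run ceilings from the table can be summed directly. The sketch assumes a 30-day month and that every run uses its full ceiling, so it is an upper bound rather than a measurement.

```python
# (runs per 30-day month, CPU-minute ceiling per run), taken from the cadence table.
WEEKS_PER_MONTH = 30 / 7
ceilings = {
    "A": (WEEKS_PER_MONTH, 5),
    "B": (1, 2),
    "C": (WEEKS_PER_MONTH, 10),
    "D": (1, 3),
    "E": (WEEKS_PER_MONTH, 5),
    "F": (WEEKS_PER_MONTH, 1),
    "G": (1, 3),
    "H": (30, 10),
    "I": (WEEKS_PER_MONTH, 2),
    "J": (1, 5),
}
total_minutes = sum(runs * minutes for runs, minutes in ceilings.values())
total_hours = total_minutes / 60
print(f"{total_minutes:.0f} CPU min/month = {total_hours:.1f} h")
# prints "412 CPU min/month = 6.9 h"
```

The daily loop H contributes 300 of those minutes, so it is the first place to look if the budget ever tightens.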

Loops in detail

Loop A: hallucination_guard expansion

Inputs: customer_feedback, query_log (zero-result + low-conf), hallucination_guard (504 rows at launch).
Output: hallucination_guard_candidates rows with status='pending_review'. Operator promotes -> hallucination_guard (target 5,000+ within 6 months).
Method: e5-small embed feedback text, DBSCAN cluster, take cluster medoid as candidate pattern.
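
The medoid step can be sketched in pure Python. The sketch assumes embeddings and DBSCAN labels were already computed upstream (label -1 is the noise label in scikit-learn's DBSCAN); `cluster_medoids` and the use of cosine distance are illustrative choices, not confirmed details of the loop.

```python
import math
from collections import defaultdict


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def cluster_medoids(vectors: list[list[float]], labels: list[int]) -> dict[int, int]:
    """Return {cluster_label: index of the medoid}, skipping DBSCAN noise (-1)."""
    members: dict[int, list[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != -1:
            members[label].append(idx)
    medoids = {}
    for label, idxs in members.items():
        # The medoid minimises the summed distance to every member of its cluster.
        medoids[label] = min(
            idxs,
            key=lambda i: sum(cosine_distance(vectors[i], vectors[j]) for j in idxs),
        )
    return medoids


# Toy 2-D stand-ins for embeddings; the last point is DBSCAN noise.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [-1.0, -1.0]]
labels = [0, 0, 0, 1, 1, -1]
medoids = cluster_medoids(vectors, labels)
```

In cluster 0 the middle vector (index 1) is chosen, and its feedback text would become the candidate pattern.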

Loop B: customer success -> testimonial SEO/GEO

Inputs: customer_success_events (consent_flag=1), programs, prefectures.
Output: seo_testimonial_candidates rows + JSON-LD Review schema. Operator pastes into site/blog/case-*.md.
Method: Pure SQL aggregation + Jinja template rendering. No LLM rewrite.
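
A rough sketch of the rendering step follows. It uses `str.format` and `json` as stand-ins for the Jinja template, and the template wording and the mapping to schema.org `Review` properties are assumptions made for illustration.

```python
import json

# Hypothetical template; the real one would live in a Jinja file.
BODY_TEMPLATE = "A business in {prefecture} was accepted into {program}."


def review_jsonld(program: str, prefecture: str) -> str:
    """Render a schema.org Review as a JSON-LD string from aggregated fields."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Review",
        "itemReviewed": {"@type": "SoftwareApplication", "name": "AutonoMath"},
        "reviewBody": BODY_TEMPLATE.format(prefecture=prefecture, program=program),
        "author": {"@type": "Organization", "name": "Anonymised customer"},
    }
    # ensure_ascii=False keeps Japanese program and prefecture names readable.
    return json.dumps(doc, ensure_ascii=False, indent=2)


snippet = review_jsonld("ものづくり補助金", "東京都")
```

Every field comes from SQL results or a fixed template, so no generated prose enters the page.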

Loop C: personalized cache

Inputs: query_log (paid keys, 30 days), api_keys.
Output: personalized_cache(customer_id, query_hash, payload_json, computed_at) upserts. TTL 7 days.
Method: Top-K=20 query histogram per customer with ≥50 prior requests. Pre-execute FTS5; cache.
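
The histogram step can be sketched as below. It assumes the 50-request threshold is counted over the same 30-day window as the histogram; `top_queries_per_customer` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

MIN_REQUESTS = 50  # customers below this are skipped
TOP_K = 20


def top_queries_per_customer(
    rows: list[tuple[str, str]],
    min_requests: int = MIN_REQUESTS,
    top_k: int = TOP_K,
) -> dict[str, list[str]]:
    """rows are (customer_id, query_hash) pairs from the paid-key query log."""
    histograms: dict[str, Counter] = defaultdict(Counter)
    for customer_id, query_hash in rows:
        histograms[customer_id][query_hash] += 1
    return {
        customer_id: [q for q, _ in hist.most_common(top_k)]
        for customer_id, hist in histograms.items()
        if sum(hist.values()) >= min_requests
    }


# Toy log: c1 has 60 requests; c2 has only 3 and is skipped.
rows = [("c1", "q1")] * 40 + [("c1", "q2")] * 15 + [("c1", "q3")] * 5 + [("c2", "q1")] * 3
plan = top_queries_per_customer(rows)
```

Each query in `plan` would then be pre-executed and upserted into personalized_cache.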

Loop D: forecast accuracy

Inputs: am_application_round (1,256 rows), forecast_predictions.
Output: forecast_calibration(program_id, coef_json, computed_at). Consumed by subsidy_roadmap_3yr MCP tool.
Method: Closed-form Bayesian shrinkage estimator (numpy). No new predictions written; calibration only.
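
The exact estimator is not specified here, so the following is one possible closed-form shrinkage: the posterior mean of a per-program forecast bias under a normal-normal model, written with the standard library instead of numpy. `prior_strength` and the error values are hypothetical, and every program is assumed to have at least one observation.

```python
from statistics import fmean


def shrunk_bias(
    errors_by_program: dict[str, list[float]], prior_strength: float = 5.0
) -> dict[str, float]:
    """Shrink each program's mean forecast error toward the global mean.

    prior_strength plays the role of sigma^2 / tau^2: the number of
    pseudo-observations the global mean is worth.
    """
    all_errors = [e for errs in errors_by_program.values() for e in errs]
    global_mean = fmean(all_errors)
    shrunk = {}
    for program_id, errs in errors_by_program.items():
        n = len(errs)
        weight = n / (n + prior_strength)
        shrunk[program_id] = weight * fmean(errs) + (1 - weight) * global_mean
    return shrunk


# Forecast errors in arbitrary units (actual minus predicted), hypothetical values.
errors = {"p_many": [10.0] * 20, "p_few": [-20.0]}
calibration = shrunk_bias(errors)
```

A program with 20 observations keeps most of its own mean, while the single-observation program is pulled most of the way toward the global mean.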

Loop E: multi-language alias expansion (V4+)

Inputs: query_log (14 days, all tiers), am_alias (335,605 rows).
Output: alias_candidates(entity_id, surface, lang, source, score, status). Operator promotes to am_alias monthly.
Method: Mine queries with conf<0.5 hitting a single program; e5-small cluster surface forms; lang detect via script heuristic.
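
A script heuristic of this kind might look like the sketch below, which inspects Unicode character names. The returned tags, including the ambiguous `zh_or_ja` bucket and treating Latin script as English, are assumptions; the loop's real rules may differ.

```python
import unicodedata


def detect_lang_by_script(surface: str) -> str:
    """Guess a language tag from the Unicode scripts present in a surface form."""
    names = [unicodedata.name(ch, "") for ch in surface if ch.isalpha()]
    if any("HIRAGANA" in n or "KATAKANA" in n for n in names):
        return "ja"
    if any("HANGUL" in n for n in names):
        return "ko"
    if any(n.startswith("CJK UNIFIED") for n in names):
        # Han-only strings are ambiguous between Japanese and Chinese.
        return "zh_or_ja"
    if any("LATIN" in n for n in names):
        # Assumption: Latin-script queries are treated as English.
        return "en"
    return "und"


samples = {s: detect_lang_by_script(s) for s in ["ものづくり補助金", "보조금", "补贴", "subsidy", "123"]}
```

Kana is checked first because Japanese text usually mixes kana with Han characters, whereas a Han-only string cannot be resolved by script alone.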

Loop F: channel ROI

Inputs: subscribers.utm_source, query_log, billing_events.
Output: channel_roi(channel, signups_28d, paid_28d, revenue_28d_jpy, computed_at).
Method: Pure SQL window aggregations. Informational only — feedback_organic_only_no_ads means we never optimize a paid channel.
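
To illustrate the aggregation over a trailing 28-day window, here is a sketch against an in-memory SQLite database. Only `subscribers.utm_source` and the output column names come from this document; the other columns, the channel names, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE subscribers (id INTEGER PRIMARY KEY, utm_source TEXT, created_at TEXT);
    CREATE TABLE billing_events (subscriber_id INTEGER, amount_jpy INTEGER, created_at TEXT);
    INSERT INTO subscribers VALUES
        (1, 'hn', '2026-04-10'), (2, 'hn', '2026-04-12'),
        (3, 'zenn', '2026-04-15'), (4, 'zenn', '2026-02-01');
    INSERT INTO billing_events VALUES
        (1, 300, '2026-04-11'), (1, 600, '2026-04-20'), (3, 900, '2026-04-16');
    """
)

ROI_SQL = """
SELECT s.utm_source                    AS channel,
       COUNT(DISTINCT s.id)            AS signups_28d,
       COUNT(DISTINCT b.subscriber_id) AS paid_28d,
       COALESCE(SUM(b.amount_jpy), 0)  AS revenue_28d_jpy
FROM subscribers s
LEFT JOIN billing_events b
       ON b.subscriber_id = s.id AND b.created_at >= :since
WHERE s.created_at >= :since
GROUP BY s.utm_source
ORDER BY channel
"""

# 28 days before a hypothetical run date of 2026-04-25.
roi_rows = conn.execute(ROI_SQL, {"since": "2026-03-28"}).fetchall()
```

The LEFT JOIN keeps channels that produced signups but no revenue, which matters for a report that is informational rather than an optimization target.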

Loop G: invariant expansion

Inputs: error_log, customer_feedback (wrong_answer flag), existing tests/properties/.
Output: tests/properties/_candidates/<failure_mode>.py Hypothesis stubs. Operator polishes + moves to tests/properties/.
Method: Cluster errors by stack-trace top-5 frames; template a Hypothesis property per cluster.

Loop H: cache warming (Zipf)

Inputs: query_log (24h, all tiers).
Output: query_cache(query_hash, payload_json, ttl_until) for top-200 global queries.
Method: Take the daily Zipf head, pre-execute each query against live FTS5, and upsert with a 24h TTL. Expected to drop P95 from ~30 ms to ~2 ms on warm queries.
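
Selecting the Zipf head and stamping the TTL could look like the sketch below. Query normalisation (strip and lowercase) and SHA-256 as the hash are assumptions, and `payload_json` is omitted because it would come from actually executing each query.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta, timezone

JST = timezone(timedelta(hours=9))


def zipf_head(queries: list[str], top_n: int = 200) -> list[str]:
    """Most frequent normalised queries in the window, most frequent first."""
    counts = Counter(q.strip().lower() for q in queries)
    return [q for q, _ in counts.most_common(top_n)]


def cache_rows(head: list[str], now: datetime, ttl_hours: int = 24) -> list[dict]:
    """Build query_cache rows (without payload) that expire ttl_hours after now."""
    ttl_until = (now + timedelta(hours=ttl_hours)).isoformat()
    return [
        {"query_hash": hashlib.sha256(q.encode("utf-8")).hexdigest(), "ttl_until": ttl_until}
        for q in head
    ]


head = zipf_head(["補助金", "補助金 ", "IT導入", "補助金"], top_n=1)
warm = cache_rows(head, datetime(2026, 4, 25, 3, 30, tzinfo=JST))
```

With a 24h TTL and a daily 03:30 JST run, each entry expires just as the next run replaces it.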

Loop I: doc freshness re-fetch priority

Inputs: programs.source_url / source_fetched_at, source_change_signals (HTTP 304 / Last-Modified from nightly liveness scan).
Output: source_refetch_queue(source_url, priority, reason, queued_at).
Method: Score = 0.6 * staleness_days + 0.3 * tier_weight + 0.1 * change_signal. Top 500 written. Honest semantics — does NOT touch source_fetched_at (per CLAUDE.md gotchas) until an actual fetch runs.
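
The scoring formula translates directly into code. The weights and the top-500 cut come from the method above; the input record shape, the helper names, and the example URLs are hypothetical.

```python
import heapq


def refetch_priority(staleness_days: float, tier_weight: float, change_signal: float) -> float:
    return 0.6 * staleness_days + 0.3 * tier_weight + 0.1 * change_signal


def build_queue(sources: list[dict], limit: int = 500) -> list[dict]:
    """Score every source and keep the `limit` highest priorities, highest first."""
    scored = [
        {
            "source_url": s["source_url"],
            "priority": refetch_priority(s["staleness_days"], s["tier_weight"], s["change_signal"]),
        }
        for s in sources
    ]
    return heapq.nlargest(limit, scored, key=lambda row: row["priority"])


sources = [
    {"source_url": "https://example.com/a", "staleness_days": 30, "tier_weight": 1.0, "change_signal": 0},
    {"source_url": "https://example.com/b", "staleness_days": 2, "tier_weight": 3.0, "change_signal": 1},
    {"source_url": "https://example.com/c", "staleness_days": 90, "tier_weight": 0.5, "change_signal": 1},
]
queue = build_queue(sources)
```

With raw day counts as input, staleness dominates the score. Whether the three inputs are normalised to a common scale first is not stated here, so treat the ordering in this toy example as illustrative.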

Loop J: gold.yaml expansion

Inputs: query_log (30 days), evals/gold.yaml, programs.
Output: evals/_candidates/gold_proposed_<YYYY-MM>.yaml. Operator hand-curates accept/reject.
Method: High-confidence (≥0.85), stable (≥5 sessions, same top-1) queries become candidates. Rows already in gold are skipped. Operator review is mandatory because gold rows are regression tests forever.
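
The candidate filter can be sketched as follows. It reads "high-confidence" as every logged hit for the query scoring at least 0.85, which is an interpretation; the row fields and helper names are hypothetical.

```python
from collections import defaultdict


def gold_candidates(log_rows, existing_gold, min_conf=0.85, min_sessions=5):
    """log_rows: dicts with query, session_id, top1_id, conf."""
    by_query = defaultdict(list)
    for row in log_rows:
        by_query[row["query"]].append(row)
    candidates = []
    for query, rows in sorted(by_query.items()):
        if query in existing_gold:
            continue  # already a regression test
        sessions = {r["session_id"] for r in rows}
        top1_ids = {r["top1_id"] for r in rows}
        if (
            len(sessions) >= min_sessions
            and len(top1_ids) == 1
            and all(r["conf"] >= min_conf for r in rows)
        ):
            candidates.append({"query": query, "expected_top1": top1_ids.pop()})
    return candidates


def _rows(query, top1_ids, conf=0.9):
    """Build one toy log row per session for a query."""
    return [
        {"query": query, "session_id": f"{query}-s{i}", "top1_id": t, "conf": conf}
        for i, t in enumerate(top1_ids)
    ]


log_rows = (
    _rows("q_stable", ["prog_1"] * 5)
    + _rows("q_flaky", ["prog_1", "prog_2", "prog_1", "prog_1", "prog_1"])
    + _rows("q_rare", ["prog_3"] * 2)
    + _rows("q_gold", ["prog_4"] * 5)
)
proposed = gold_candidates(log_rows, existing_gold={"q_gold"})
```

Only `q_stable` survives: the others are unstable, too rare, or already in gold.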

Orchestrator

# Default: dry-run all 10 (safe)
.venv/bin/python scripts/self_improve_orchestrator.py

# Operator-approved real run
.venv/bin/python scripts/self_improve_orchestrator.py --execute

# Single loop
.venv/bin/python scripts/self_improve_orchestrator.py --only loop_h_cache_warming
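
The script's argument handling is not shown here, so the following is only a guess at how the two documented flags might map onto the dry-run default. `build_parser` and `resolve` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the self-improvement loops.")
    parser.add_argument(
        "--execute",
        action="store_true",
        help="allow production writes; when omitted the run is a dry-run",
    )
    parser.add_argument(
        "--only",
        metavar="LOOP",
        default=None,
        help="run a single loop, e.g. loop_h_cache_warming",
    )
    return parser


def resolve(argv: list[str]) -> dict:
    """Translate CLI flags into orchestrator settings."""
    args = build_parser().parse_args(argv)
    return {"dry_run": not args.execute, "only": args.only}


default_settings = resolve([])
single_loop = resolve(["--only", "loop_h_cache_warming"])
```

Deriving `dry_run` from the absence of `--execute` means a forgotten flag can only ever make a run safer.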

Output goes to analysis_wave18/self_improve_runs/<YYYY-MM-DD>.json with this shape:

{
  "ts": "2026-04-25T12:34:56+09:00",
  "dry_run": true,
  "loops_total": 10,
  "loops_succeeded": 10,
  "loops_failed": 0,
  "totals": {"scanned": 0, "actions_proposed": 0, "actions_executed": 0},
  "results": [
    {"loop": "loop_a_hallucination_guard", "scanned": 0, "actions_proposed": 0, "actions_executed": 0},
    ...
  ]
}

Monitoring

Add a single line to the existing weekly digest (scripts/weekly_digest.py):

"Self-improvement: 10/10 loops succeeded last 7 days, X candidates proposed, Y promoted."

If loops_failed > 0 for two consecutive runs, the digest highlights it as a PC2 data-gap signal under the existing docs/improvement_loop.md priority framework.
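
A sketch of how the digest line and the two-consecutive-failures rule could be derived from the run reports follows. It takes the succeeded count from the latest run and sums proposals across the week; the `promoted` count is assumed to come from the operator's promotion records, and the `[PC2 data-gap]` marker is a made-up way to highlight the line.

```python
def digest_line(runs: list[dict], promoted: int) -> str:
    """runs: parsed run reports from the last 7 days, oldest first."""
    latest = runs[-1]
    proposed = sum(r["totals"]["actions_proposed"] for r in runs)
    line = (
        f"Self-improvement: {latest['loops_succeeded']}/{latest['loops_total']} loops "
        f"succeeded last 7 days, {proposed} candidates proposed, {promoted} promoted."
    )
    # Highlight only when the two most recent runs both had failures.
    failing_twice = len(runs) >= 2 and all(r["loops_failed"] > 0 for r in runs[-2:])
    return line + (" [PC2 data-gap]" if failing_twice else "")


healthy = [
    {"loops_total": 10, "loops_succeeded": 10, "loops_failed": 0, "totals": {"actions_proposed": 4}},
    {"loops_total": 10, "loops_succeeded": 10, "loops_failed": 0, "totals": {"actions_proposed": 3}},
]
failing = [
    {"loops_total": 10, "loops_succeeded": 9, "loops_failed": 1, "totals": {"actions_proposed": 4}},
    {"loops_total": 10, "loops_succeeded": 8, "loops_failed": 2, "totals": {"actions_proposed": 3}},
]
healthy_line = digest_line(healthy, promoted=2)
failing_line = digest_line(failing, promoted=0)
```

A single failed run leaves the line unmarked, which matches the intent of avoiding noise in a zero-touch setup.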

No paging, no external alerting — solo + zero-touch ops.

Hard rules (do not break)

No LLM calls. Ever. Local e5-small + scikit-learn DBSCAN + SQL only. (feedback_autonomath_no_api_use)
dry_run=True is the default. Production writes need --execute, which is currently a no-op because the loops are still scaffolding.
Operator review for every promotion. Candidates go to *_candidates tables / _candidates/ directories — never directly into the production tables.
Honest semantics. Loop I never silently rewrites source_fetched_at (matches CLAUDE.md "What NOT to do").
Schedule is suggestive, not enforced. Pre-launch, the orchestrator runs all 10 loops on each invocation. Per-loop cadence checks land at T+30d alongside the real implementations.