mts1b-llm
LLM router + 35 personas + semantic cache + cost ledger + scorers + evals + governor.
Repo: github.com/MTS1B/mts1b-llm
Layer: 4
Wave: 2 (months 4-7)
Depends on: foundation, platform, httpx, anthropic, openai, google-generativeai
Audience: any repo that wants AI assistance (research, riskengine CRO veto, githubbot, discordbot)
What it is
A provider-agnostic LLM service with:
- Router — pick model per task (capability, cost, latency, regional preference)
- 35 personas — domain experts (CRO, equities analyst, options strategist, crypto specialist, ...)
- Semantic cache — embed prompts; return cached completion if cosine similarity > threshold
- Cost ledger — every call logged with provider, model, tokens, USD cost
- Scorers — per-persona output quality metrics
- Evals — offline regression tests across all personas
- Governor — daily/weekly USD budget per persona; throttles when budget runs low
Why a separate service
Without this:
- Each repo calls Anthropic/OpenAI directly; key sprawl
- No cost visibility — burn $1000/day before noticing
- Same prompt sent 100x because there's no cache
- Persona prompts duplicated across repos
- No way to A/B compare model providers
With this:
- Single API endpoint; provider abstracted
- Daily cost breakdown by repo, persona, model
- 60-80% cache hit rate on repetitive questions
- Personas are namespaced configs; one place to edit
Module layout
mts1b_llm/
├── router/
│ ├── policy.py # model selection per task class
│ └── fallback.py
├── personas/
│ ├── registry.py # 35 personas (yaml-defined)
│ ├── cro.py # custom code for personas needing tool use
│ └── prompts/ # one .md per persona
├── providers/
│ ├── anthropic.py
│ ├── openai.py
│ ├── google.py
│ ├── openrouter.py # for OSS models
│ └── local.py # vLLM / TGI local serving
├── cache/
│ ├── embedder.py
│ └── store.py # Redis-backed
├── ledger/
│ ├── tracker.py
│ └── reports.py
├── scorers/
│ └── ... # per-persona quality scoring
├── evals/
│ └── ... # nightly regression suite
└── governor/
  ├── budget.py
  └── circuit_breaker.py
API
Direct invocation
from mts1b_llm import LLM
llm = LLM()
response = await llm.complete(
    persona="equities_analyst",
    prompt="Summarize today's NVDA earnings.",
    context={"earnings_release": "...", "consensus": {...}},
    max_tokens=500,
    temperature=0.2,
)
# Response(text=..., model="claude-sonnet-4-5", cached=False, cost_usd=0.012, latency_ms=1843)
Persona-scoped
from mts1b_llm.personas import persona
cro = persona("CRO")
veto = await cro.veto_order(order=..., context=..., timeout=5.0)
# VetoDecision(veto=False, confidence=0.65, reasoning="...")
Structured output
from pydantic import BaseModel

class TradeIdea(BaseModel):
    symbol: str
    side: str
    rationale: str
    confidence: float

ideas = await llm.complete(
    persona="quant_screener",
    prompt="Suggest 3 long ideas from the Russell 1000 today.",
    output_schema=TradeIdea,
    n=3,
)
# [TradeIdea(symbol="...", side="long", rationale="...", confidence=0.7), ...]
Schema-validated output via pydantic. Auto-repair on malformed JSON (re-asks the model with the parse error).
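The auto-repair loop described above can be sketched as follows. This is a minimal illustration, not the service's implementation: it uses a stdlib dataclass as a stand-in for the pydantic model so the example has no dependencies, and `parse_trade_idea`, `complete_with_repair`, `ask`, and `max_repairs` are hypothetical names.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class TradeIdea:
    symbol: str
    side: str
    rationale: str
    confidence: float


def parse_trade_idea(raw: str) -> TradeIdea:
    """Parse and validate one JSON object; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    try:
        return TradeIdea(
            symbol=str(data["symbol"]),
            side=str(data["side"]),
            rationale=str(data["rationale"]),
            confidence=float(data["confidence"]),
        )
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError(f"schema mismatch: {exc!r}") from exc


async def complete_with_repair(
    ask: Callable[[str], Awaitable[str]],  # hypothetical: sends a prompt, returns raw text
    prompt: str,
    max_repairs: int = 2,  # assumed limit; the source does not state one
) -> TradeIdea:
    """Call the model; on a parse failure, re-ask with the parse error appended."""
    current_prompt = prompt
    for attempt in range(max_repairs + 1):
        raw = await ask(current_prompt)
        try:
            return parse_trade_idea(raw)
        except ValueError as exc:
            if attempt == max_repairs:
                raise  # give up and surface the last parse error
            current_prompt = (
                f"{prompt}\n\nYour previous reply could not be parsed: {exc}\n"
                "Reply again with a single valid JSON object."
            )
    raise RuntimeError("unreachable")
```

Capping the number of repair attempts keeps a persistently malformed reply from consuming the persona's budget.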
Persona registry
Personas are YAML-defined:
name: CRO
description: Chief Risk Officer — vetoes orders that fail edge-case checks
default_model: claude-sonnet-4-5
fallback_models: [gpt-4-turbo, claude-opus-4-7]
temperature: 0.1
max_tokens: 800
system_prompt: |
  You are the Chief Risk Officer of a multi-strategy quant firm.
  ...
tools:
  - get_recent_drawdown
  - get_open_positions
  - get_news_sentiment
budget_usd_per_day: 5.0
scorers:
  - veto_consistency_with_envelope
  - latency_within_5s
35 personas at v1 launch. Add a new one by dropping a YAML file in personas/.
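As a rough sketch of how a parsed persona file might be validated before it enters the registry, the snippet below takes the mapping a YAML loader would produce and builds a typed record. `PersonaSpec`, `persona_from_mapping`, the choice of required keys, and the default values are assumptions for illustration; the real registry code is not shown in this document.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Tuple

# Assumption: these three keys are mandatory; the source does not say which are.
REQUIRED_KEYS = ("name", "default_model", "system_prompt")


@dataclass(frozen=True)
class PersonaSpec:
    name: str
    default_model: str
    system_prompt: str
    fallback_models: Tuple[str, ...] = ()
    temperature: float = 0.0  # assumed default
    max_tokens: int = 1024  # assumed default
    tools: Tuple[str, ...] = ()
    budget_usd_per_day: Optional[float] = None
    scorers: Tuple[str, ...] = ()


def persona_from_mapping(raw: Mapping[str, Any]) -> PersonaSpec:
    """Validate a parsed persona mapping and convert it to a PersonaSpec."""
    missing = [key for key in REQUIRED_KEYS if key not in raw]
    if missing:
        raise ValueError(f"persona is missing required keys: {missing}")
    budget = raw.get("budget_usd_per_day")
    if budget is not None and float(budget) <= 0:
        raise ValueError("budget_usd_per_day must be positive")
    return PersonaSpec(
        name=str(raw["name"]),
        default_model=str(raw["default_model"]),
        system_prompt=str(raw["system_prompt"]),
        fallback_models=tuple(raw.get("fallback_models", ())),
        temperature=float(raw.get("temperature", 0.0)),
        max_tokens=int(raw.get("max_tokens", 1024)),
        tools=tuple(raw.get("tools", ())),
        budget_usd_per_day=None if budget is None else float(budget),
        scorers=tuple(raw.get("scorers", ())),
    )
```

Failing fast on a missing key at load time means a malformed persona file is caught at startup rather than on the first request.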
Router policy
def pick_model(persona: str, task_class: str, urgency: str) -> str:
    if urgency == "real_time" and task_class == "veto":
        return "claude-haiku-3-5"  # fast + cheap
    if task_class == "deep_research":
        return "claude-opus-4-7"  # high quality
    if persona == "equities_analyst":
        return "claude-sonnet-4-5"
    # ...
    return DEFAULT_MODEL
Override via mts1b.config per service.
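One way per-service overrides could sit in front of the built-in rules is sketched below. The override format is an assumption: this document does not describe the shape of the mts1b.config entries, so the `(persona, task_class)` keys, the wildcard convention, the function name, and the `DEFAULT_MODEL` value are all hypothetical.

```python
from typing import Mapping, Optional, Tuple

DEFAULT_MODEL = "claude-sonnet-4-5"  # assumption: the real default is not stated

# Hypothetical override key: (persona, task_class); None acts as a wildcard.
OverrideKey = Tuple[Optional[str], Optional[str]]


def pick_model_with_overrides(
    persona: str,
    task_class: str,
    urgency: str,
    overrides: Optional[Mapping[OverrideKey, str]] = None,
) -> str:
    overrides = overrides or {}
    # Most specific override wins: exact pair, then persona-wide, then task-wide.
    for key in ((persona, task_class), (persona, None), (None, task_class)):
        if key in overrides:
            return overrides[key]
    # Built-in rules, mirroring the policy shown above.
    if urgency == "real_time" and task_class == "veto":
        return "claude-haiku-3-5"
    if task_class == "deep_research":
        return "claude-opus-4-7"
    if persona == "equities_analyst":
        return "claude-sonnet-4-5"
    return DEFAULT_MODEL
```

For example, a service that passes `{("CRO", None): "gpt-4-turbo"}` would pin every CRO call to that model regardless of task class or urgency.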
Semantic cache
# Embed every prompt + persona + context
embedding = await embedder.embed(persona, prompt, context)
# Cosine-similarity lookup in Redis vector store
hit = await cache.lookup(embedding, threshold=0.95)
if hit:
    return hit.response  # near-instant vs. 1-5s from a provider
# Miss: hit the provider, then cache
response = await provider.complete(...)
await cache.store(embedding, response, ttl=86400 * 7)
Typical cache hit rate: 60-80% for repetitive tasks (drift summaries, daily reports), 10-30% for exploratory (research questions).
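The lookup logic can be illustrated with a small in-memory stand-in for the Redis vector store. This is a sketch under assumptions: `InMemorySemanticCache` and its linear scan are hypothetical, and a production store would use an indexed vector search rather than comparing against every entry.

```python
import math
import time
from dataclasses import dataclass
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


@dataclass
class CacheEntry:
    embedding: Sequence[float]
    response: str
    expires_at: float


class InMemorySemanticCache:
    """Hypothetical stand-in for the Redis-backed store, using a linear scan."""

    def __init__(self) -> None:
        self._entries: List[CacheEntry] = []

    def store(self, embedding: Sequence[float], response: str, ttl: float,
              now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._entries.append(CacheEntry(embedding, response, now + ttl))

    def lookup(self, embedding: Sequence[float], threshold: float = 0.95,
               now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        best_score, best_response = threshold, None
        for entry in self._entries:
            if entry.expires_at <= now:
                continue  # expired entries never match
            score = cosine_similarity(embedding, entry.embedding)
            if score > best_score:  # strictly above the threshold, best match wins
                best_score, best_response = score, entry.response
        return best_response
```

A threshold near 1.0 trades hit rate for safety: lowering it returns cached answers for prompts that are only loosely similar, which may be wrong for the new question.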
Cost ledger
Every call writes a row to data/llm/ledger.duckdb. A 7-day breakdown by persona and model:
SELECT
    persona,
    provider || ':' || model AS model,
    COUNT(*) AS n_calls,
    SUM(input_tokens) AS in_tokens,
    SUM(output_tokens) AS out_tokens,
    SUM(cost_usd) AS cost_usd,
    AVG(latency_ms) AS avg_latency_ms,
    SUM(CASE WHEN cached THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS cache_rate
FROM ledger
WHERE ts >= today() - 7
GROUP BY 1, 2
ORDER BY cost_usd DESC;
Daily cost summary published to Telegram via mts1b-platform/messaging.
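How a ledger row's cost_usd might be derived from token counts is sketched below. The price table holds placeholder numbers, not real provider pricing, and `LedgerRow`, `PRICES_PER_MTOK`, and `cost_by_persona` are hypothetical names; the sketch also assumes cache hits are recorded at zero cost.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

# Placeholder prices in USD per million tokens: (input, output).
# These are NOT real provider prices.
PRICES_PER_MTOK: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("anthropic", "claude-sonnet-4-5"): (2.0, 10.0),
    ("anthropic", "claude-haiku-3-5"): (0.5, 2.5),
}


@dataclass(frozen=True)
class LedgerRow:
    persona: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cached: bool
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        if self.cached:
            return 0.0  # assumption: cache hits cost nothing
        in_price, out_price = PRICES_PER_MTOK[(self.provider, self.model)]
        return (self.input_tokens * in_price + self.output_tokens * out_price) / 1_000_000


def cost_by_persona(rows: Iterable[LedgerRow]) -> Dict[str, float]:
    """Sum cost per persona, the same grouping the daily summary reports."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row.persona] += row.cost_usd
    return dict(totals)
```

Input and output tokens are priced separately because providers typically charge different rates for each.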
Governor
Daily/weekly USD budgets per persona. When 80% spent, the governor downshifts to cheaper models:
budget: $5/day for CRO persona
spent: $3.95 (79%)
→ governor allows current model (claude-sonnet-4-5)
budget: $5/day for CRO persona
spent: $4.50 (90%)
→ governor downshifts to claude-haiku-3-5
budget: $5/day for CRO persona
spent: $5.00 (100%)
→ governor returns canned "budget exhausted" response, alerts ops
Daily reset at UTC midnight; weekly reset at Mon 00:00 UTC.
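The thresholds and reset windows above can be expressed as two small functions. This is a sketch rather than the governor's actual code: `Action`, `decide`, and `window_start` are hypothetical names, and treating exactly 80% as already downshifted is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Action(Enum):
    ALLOW = "allow"          # keep the persona's current model
    DOWNSHIFT = "downshift"  # switch to a cheaper model
    BLOCK = "block"          # return the canned "budget exhausted" response


def decide(spent_usd: float, budget_usd: float, downshift_at: float = 0.80) -> Action:
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    fraction = spent_usd / budget_usd
    if fraction >= 1.0:
        return Action.BLOCK
    if fraction >= downshift_at:  # assumption: the 80% boundary itself downshifts
        return Action.DOWNSHIFT
    return Action.ALLOW


def window_start(now: datetime, period: str) -> datetime:
    """Return the start of the current budget window in UTC."""
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    if period == "daily":
        return midnight
    if period == "weekly":
        # weekday() is 0 on Monday, so this rewinds to Monday 00:00 UTC.
        return midnight - timedelta(days=midnight.weekday())
    raise ValueError(f"unknown period: {period!r}")
```

Spend recorded at or after `window_start(now, period)` counts toward the current window; anything earlier belongs to a window that has already reset.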
Evals
mts1b-llm evals run --persona CRO --suite veto-cases
# 50 fixture cases, expected veto = yes/no
# PASS: 47/50 (94%)
# Regressions: case-12 (was: veto=yes, now: veto=no)
Run nightly via Prefect; regression triggers Telegram alert.
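A regression, as the sample output suggests, is a case that matched its expected verdict in the previous run but no longer does. A minimal sketch of that comparison follows; `summarize_run` and the dict-of-bools representation are assumptions, not the eval harness's real data model.

```python
from typing import Dict, List, Tuple


def summarize_run(
    expected: Dict[str, bool],
    previous: Dict[str, bool],
    current: Dict[str, bool],
) -> Tuple[int, int, List[str]]:
    """Return (passed, total, regressions) for one eval suite.

    Each dict maps a case id to a veto verdict (True = veto).
    """
    passed = sum(1 for case, want in expected.items() if current.get(case) == want)
    regressions = sorted(
        case
        for case, want in expected.items()
        # Passed before, fails now: this is what should trigger the alert.
        if previous.get(case) == want and current.get(case) != want
    )
    return passed, len(expected), regressions
```

Separating regressions from plain failures keeps the nightly alert focused on behavior that changed, rather than on cases that have never passed.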
Provider failover
If primary model fails (rate limit, 5xx), try fallback in order:
for provider in (anthropic, openai, google):
    try:
        return await provider.complete(...)
    except (RateLimitError, APIError):
        continue  # try the next provider in order
# every provider failed: fall through to the canned fallback
A latency budget is enforced across the whole chain: if every provider fails, or the budget runs out first, return a canned fallback and alert.
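One way to combine ordered failover with a shared latency budget is sketched below using `asyncio.wait_for`. The names `ProviderError`, `FailoverResult`, `complete_with_failover`, and `budget_s` are hypothetical, and stopping the chain on a timeout (rather than trying the next provider) is an assumed design choice.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional, Sequence, Tuple


class ProviderError(Exception):
    """Stand-in for provider rate-limit and 5xx errors."""


@dataclass
class FailoverResult:
    text: str
    provider: Optional[str]  # None means the canned fallback was used


Provider = Tuple[str, Callable[[str], Awaitable[str]]]


async def complete_with_failover(
    providers: Sequence[Provider],
    prompt: str,
    budget_s: float,
    canned: str = "LLM temporarily unavailable.",
) -> FailoverResult:
    deadline = time.monotonic() + budget_s
    for name, call in providers:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted before this provider could be tried
        try:
            text = await asyncio.wait_for(call(prompt), timeout=remaining)
            return FailoverResult(text=text, provider=name)
        except asyncio.TimeoutError:
            break  # the budget ran out mid-call
        except ProviderError:
            continue  # try the next provider in order
    # A real service would also raise an alert here.
    return FailoverResult(text=canned, provider=None)
```

Because the deadline is computed once, time spent on a failed provider is deducted from what the next one is allowed.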
Local model serving (optional)
For privacy-sensitive workflows, mts1b-llm can route to a self-hosted vLLM/TGI:
providers:
  local:
    base_url: http://gpu1.local:8000
    models:
      - llama-3-70b-instruct
      - mixtral-8x22b
Build + test
pip install -e ".[dev]"
pytest -m unit
pytest -m live --provider=anthropic # requires API key
Roadmap
| Version | Items |
|---|---|
| 0.1 (Wave 2) | Router + 35 personas + cache + ledger + governor + 4 providers |
| 0.2 (Wave 2) | Tool use (function calling) for personas |
| 0.3 (Wave 3) | Vision personas (chart analysis) |
| 0.4 (Wave 3) | Multi-turn personas with persistent memory |
| 1.0 (LTS) | Stable persona spec |
See also
- mts1b-githubbot, mts1b-discordbot: consumers
- mts1b-riskengine: CRO veto integration
- mts1b-research: uses LLM for narrative summarization + tagging