
mts1b-cloudburst

Cloud-burst worker for GPU bursting: Vast.ai, Runpod, Thunder Compute, SSH runner, budget enforcer.

Repo: github.com/MTS1B/mts1b-cloudburst
Layer: 4
Wave: 2 (months 4-7)
Depends on: foundation, platform, httpx, paramiko
Audience: mts1b-research (ladder sweeps), mts1b-GPUbacktester (heavy backtests)

What it is

A small worker that spins up rented GPU instances, runs a batch job, persists the results, and tears the instances down. It is used to:

  • Run ladder sweeps across millions of parameter combinations in hours instead of weeks
  • Backtest the full Russell 3000 on daily data, 2000-2026 (uneconomical on a workstation)
  • Train factor-extraction models

Self-hosted by default; uses Vast.ai / Runpod / Thunder Compute as a spot market for GPUs.

Supported providers

Provider | API auth | Hourly cost (approx.)    | Spot reliability
vast_ai  | API key  | $0.20-2.00 per RTX 4090  | medium
runpod   | API key  | $0.30-1.00 per RTX 4090  | high
thunder  | API key  | $0.40-1.50 per H100      | high (focus on premium GPUs)
ssh      | SSH key  | free (your own boxes)    | high

Module layout

mts1b_cloudburst/
├── providers/
│   ├── vast_ai.py
│   ├── runpod.py
│   ├── thunder.py
│   └── ssh.py
├── budget/
│   ├── enforcer.py      # USD cap per job + per day
│   └── ledger.py        # cost tracking
├── runner/
│   ├── job.py           # JobSpec + lifecycle
│   ├── image.py         # build/push container image
│   └── result_sync.py   # rsync results to local datalake
└── cli/
    └── burst.py

API

Submit a job

import asyncio

from mts1b_cloudburst import burst, JobSpec


async def main():
    job = await burst(
        JobSpec(
            name="ladder-sweep-momentum",
            image="ghcr.io/mts1b/mts1b-gpubacktester:0.1.0",
            gpus=1,
            gpu_model="RTX_4090",  # or "H100" or "ANY"
            memory_gb=24,
            max_duration_minutes=120,
            max_cost_usd=10.0,
            command=[
                "mts1b-backtest", "batch",
                "--config", "/data/sweep.yaml",
                "--output", "/data/results/",
            ],
            mount_local=[
                ("./configs", "/data/configs:ro"),
                ("./results", "/data/results:rw"),
            ],
            provider="auto",  # picks cheapest available
        )
    )

    # Monitor
    async for event in job.stream_events():
        print(event)
        # JobEvent(type="provisioned", at=..., cost_per_hour_usd=0.34)
        # JobEvent(type="container_started", at=...)
        # JobEvent(type="progress", at=..., percent=12.5)
        # JobEvent(type="completed", at=..., total_cost_usd=0.68)

    # Sync results
    await job.sync_results()


# burst() and the job methods are coroutines, so run them inside an event loop
asyncio.run(main())
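
The sample events above show a rate of $0.34/hr and a total of $0.68, which is what two hours of billing at that rate would cost. As a rough sketch of that arithmetic, the helper below bounds a job's spend by whichever limit bites first, the time limit or the cost cap. The function name worst_case_cost_usd is hypothetical and not part of mts1b_cloudburst, and linear per-hour billing is an assumption; providers may bill in coarser increments.

```python
def worst_case_cost_usd(cost_per_hour_usd, max_duration_minutes, max_cost_usd):
    """Upper bound on spend, assuming linear per-hour billing (an assumption)."""
    # Cost if the job runs for its full allowed duration
    time_bound = cost_per_hour_usd * max_duration_minutes / 60
    # The per-job cap applies if it is lower than the time-based bound
    return min(time_bound, max_cost_usd)


# Values taken from the JobSpec and sample events above
bound = worst_case_cost_usd(0.34, 120, 10.0)
print(round(bound, 2))
```

With the example spec, the 120-minute limit is reached long before the $10 cap, so the cap mainly protects against instances that turn out more expensive than expected.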

Provider auto-selection

provider="auto" queries each provider's spot market and picks the cheapest instance that matches the spec. Ties are broken by provider reliability score (Runpod > Thunder > Vast.ai, based on historical job-completion rate).
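
The selection rule can be sketched as a sort on (price, reliability). This is a minimal illustration, not the library's implementation: the Offer type, the pick_offer function, and the numeric reliability ranks are all hypothetical stand-ins, chosen only to reproduce the Runpod > Thunder > Vast.ai ordering.

```python
from dataclasses import dataclass

# Hypothetical ranks (higher is better); only the ordering comes from the docs
RELIABILITY = {"runpod": 3, "thunder": 2, "vast_ai": 1}


@dataclass(frozen=True)
class Offer:
    provider: str
    gpu_model: str
    usd_per_hour: float


def pick_offer(offers, gpu_model):
    """Return the cheapest matching offer; break price ties by reliability."""
    matching = [o for o in offers if gpu_model == "ANY" or o.gpu_model == gpu_model]
    if not matching:
        raise LookupError(f"no offers for {gpu_model}")
    # Lowest price first; among equal prices, the highest reliability rank wins
    return min(matching, key=lambda o: (o.usd_per_hour, -RELIABILITY.get(o.provider, 0)))


offers = [
    Offer("vast_ai", "RTX_4090", 0.34),
    Offer("runpod", "RTX_4090", 0.34),
    Offer("thunder", "H100", 1.40),
]
best = pick_offer(offers, "RTX_4090")
print(best.provider)  # runpod: same price as vast_ai, higher reliability rank
```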

Budget enforcement

# Per-job
JobSpec(..., max_cost_usd=10.0) # hard kill at $10

# Per-day (across all jobs)
await burst.set_daily_budget(usd=50.0)
# Subsequent burst() calls fail with BudgetExceededError if they would push spend past the cap

mts1b-platform/messaging sends an alert when daily spend exceeds 80% of the daily budget.
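
One way to picture the daily check is an admission test made before each job starts, plus a threshold test as spend is recorded. The sketch below is an assumption about how such an enforcer could work, not the contents of budget/enforcer.py: the DailyBudget class and its methods are hypothetical, BudgetExceededError is redefined locally as a stand-in, and admitting jobs against their max_cost_usd (rather than an estimate) is a guess.

```python
class BudgetExceededError(RuntimeError):
    """Local stand-in for the error named above; the real class may differ."""


class DailyBudget:
    def __init__(self, limit_usd, alert_fraction=0.8):
        self.limit_usd = limit_usd
        self.alert_fraction = alert_fraction  # 80% alert threshold from the docs
        self.spent_usd = 0.0

    def admit(self, max_cost_usd):
        """Reject a job whose worst-case cost would push past the daily limit."""
        if self.spent_usd + max_cost_usd > self.limit_usd:
            raise BudgetExceededError(
                f"{self.spent_usd:.2f} + {max_cost_usd:.2f} exceeds {self.limit_usd:.2f}"
            )

    def record(self, cost_usd):
        """Add actual spend; return True once spend is past the alert threshold."""
        self.spent_usd += cost_usd
        return self.spent_usd > self.alert_fraction * self.limit_usd


budget = DailyBudget(limit_usd=50.0)
budget.admit(10.0)                   # fits: 0 + 10 <= 50
first_alert = budget.record(12.50)   # 12.50 is 25% of 50, below the threshold
second_alert = budget.record(30.00)  # 42.50 is 85% of 50, above the threshold
try:
    budget.admit(10.0)               # 42.50 + 10 > 50
    rejected = False
except BudgetExceededError:
    rejected = True
```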

CLI

mts1b-burst submit \
    --image ghcr.io/mts1b/mts1b-gpubacktester:0.1.0 \
    --gpus 1 --gpu-model RTX_4090 \
    --max-cost-usd 10 \
    --command "mts1b-backtest batch --config /data/sweep.yaml"

mts1b-burst list # active jobs
mts1b-burst show <job_id> # status + logs
mts1b-burst kill <job_id> # tear down early

mts1b-burst budget show
# Today: $12.50 / $50.00 (25%)
# This week: $87.30 / $200.00 (44%)

mts1b-burst providers status
# vast.ai: available, lowest RTX_4090 = $0.22/hr (3 offers)
# runpod: available, lowest RTX_4090 = $0.34/hr (10 offers)
# thunder: available, lowest H100 = $1.40/hr (2 offers)

Result sync

Each job has a result mount. After completion, mts1b-cloudburst rsyncs to data/cloudburst/results/<job_id>/. From there, mts1b-research ingests into the lake.

For large results (>1 GB), use S3 / R2 instead of rsync:

JobSpec(
    ...,
    result_sink="s3://my-bucket/cloudburst/{job_id}/",
)
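
The {job_id} placeholder in result_sink suggests a per-job template. As a minimal sketch, assuming the placeholder is filled with str.format-style substitution (the docs do not say which mechanism is used), the resolved URL can be split into bucket and key prefix with the standard library. The function resolve_sink and the job ID "abc123" are hypothetical.

```python
from urllib.parse import urlparse


def resolve_sink(template, job_id):
    """Fill the {job_id} placeholder and split the URL into its parts."""
    url = template.format(job_id=job_id)
    parts = urlparse(url)
    # For s3://bucket/key/ the bucket is the netloc and the key prefix is the path
    return url, parts.scheme, parts.netloc, parts.path.lstrip("/")


url, scheme, bucket, prefix = resolve_sink("s3://my-bucket/cloudburst/{job_id}/", "abc123")
print(url)
print(scheme, bucket, prefix)
```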

Image building

Reference Dockerfile at mts1b-cloudburst/images/gpubacktester.Dockerfile. Build + push:

mts1b-burst image build --image gpubacktester --tag 0.1.0
mts1b-burst image push --image gpubacktester --tag 0.1.0
# Pushes to ghcr.io/mts1b/mts1b-gpubacktester:0.1.0

Workers automatically pull the right image at startup.

SSH provider (your own boxes)

For pre-existing GPU servers:

JobSpec(
    ...,
    provider="ssh",
    ssh_hosts=["gpu1.local", "gpu2.local"],  # round-robin
)

This skips spot-market provisioning and runs the job directly on your hardware. Budget enforcement still applies: cost is tracked as GPU-hours × your declared $/hr rate.
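
To make the round-robin and cost-tracking ideas concrete, here is a small sketch. It is illustrative only: the ssh_job_cost_usd function, the $0.25/hr rate, and the use of itertools.cycle are assumptions, not the provider's actual code.

```python
from itertools import cycle

# Round-robin over the configured hosts: each new job takes the next host in turn
hosts = cycle(["gpu1.local", "gpu2.local"])
assigned = [next(hosts) for _ in range(3)]
print(assigned)  # ['gpu1.local', 'gpu2.local', 'gpu1.local']


def ssh_job_cost_usd(gpus, hours, declared_usd_per_gpu_hour):
    """Notional cost of a job on your own hardware: GPU-hours times declared rate."""
    return gpus * hours * declared_usd_per_gpu_hour


# One GPU for two hours at a hypothetical declared rate of $0.25 per GPU-hour
cost = ssh_job_cost_usd(gpus=1, hours=2.0, declared_usd_per_gpu_hour=0.25)
print(cost)  # 0.5
```

Charging a notional rate keeps jobs on owned hardware comparable with rented instances under the same per-job and per-day caps.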

Build + test

pip install -e ".[dev]"
pytest -m unit # mock providers
pytest -m live --provider=vast # spins up actual instance (costs ~$0.20)

Roadmap

Version      | Items
0.1 (Wave 2) | Vast.ai + Runpod + Thunder + SSH, budget enforcer, result sync
0.2 (Wave 2) | AWS spot, GCP preemptible, Azure spot
0.3 (Wave 3) | Multi-job orchestration (DAGs)
1.0 (LTS)    | Stable JobSpec

See also