mts1b-cloudburst

Cloud-burst worker for GPU bursting: Vast.ai, Runpod, Thunder Compute, SSH runner, budget enforcer.

Repo: github.com/MTS1B/mts1b-cloudburst
Layer: 4
Wave: 2 (months 4-7)
Depends on: foundation, platform, httpx, paramiko
Audience: mts1b-research (ladder sweeps), mts1b-GPUbacktester (heavy backtests)

What it is

A small worker that spins up rented GPU instances, runs a batch job, persists the results, and tears the instances down. Used to:

Run ladder sweeps across millions of param combos in hours instead of weeks
Backtest the full Russell 3000 at daily frequency, 2000-2026 (uneconomical on a workstation)
Train factor-extraction models

Self-hosted by default; uses Vast.ai / Runpod / Thunder Compute as a spot-GPU market.

Supported providers

Provider	API auth	Hourly cost (approx)	Spot reliability
`vast_ai`	API key	$0.20-2.00 per RTX 4090	medium
`runpod`	API key	$0.30-1.00 per RTX 4090	high
`thunder`	API key	$0.40-1.50 per H100	high (focus on premium GPUs)
`ssh`	SSH key	(free, your own boxes)	high

Module layout

mts1b_cloudburst/
├── providers/
│   ├── vast_ai.py
│   ├── runpod.py
│   ├── thunder.py
│   └── ssh.py
├── budget/
│   ├── enforcer.py           # USD cap per job + per day
│   └── ledger.py              # cost tracking
├── runner/
│   ├── job.py                 # JobSpec + lifecycle
│   ├── image.py               # build/push container image
│   └── result_sync.py         # rsync results to local datalake
└── cli/
    └── burst.py

API

Submit a job

import asyncio

from mts1b_cloudburst import burst, JobSpec


async def main() -> None:
    # Provision an instance and start the job
    job = await burst(
        JobSpec(
            name="ladder-sweep-momentum",
            image="ghcr.io/mts1b/mts1b-gpubacktester:0.1.0",
            gpus=1,
            gpu_model="RTX_4090",       # or "H100" or "ANY"
            memory_gb=24,
            max_duration_minutes=120,
            max_cost_usd=10.0,
            command=[
                "mts1b-backtest", "batch",
                "--config", "/data/sweep.yaml",
                "--output", "/data/results/",
            ],
            mount_local=[
                ("./configs", "/data/configs:ro"),
                ("./results", "/data/results:rw"),
            ],
            provider="auto",            # picks cheapest available
        )
    )

    # Monitor
    async for event in job.stream_events():
        print(event)
    # JobEvent(type="provisioned", at=..., cost_per_hour_usd=0.34)
    # JobEvent(type="container_started", at=...)
    # JobEvent(type="progress", at=..., percent=12.5)
    # JobEvent(type="completed", at=..., total_cost_usd=0.68)

    # Sync results
    await job.sync_results()


# `await` is only valid inside a coroutine, so drive main() with an event loop
asyncio.run(main())
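
The numbers in the event trace above are consistent with a job that ran for its full 120 minutes: $0.34/hr × 2 h = $0.68, well under the $10 cap. A minimal sketch of that arithmetic follows; the helper names are hypothetical and are not part of mts1b_cloudburst.

```python
def worst_case_cost_usd(cost_per_hour_usd: float, max_duration_minutes: int) -> float:
    """Upper bound on spend if the job runs for its full allowed duration."""
    return round(cost_per_hour_usd * max_duration_minutes / 60.0, 2)


def fits_job_cap(cost_per_hour_usd: float, max_duration_minutes: int, max_cost_usd: float) -> bool:
    """True if even a full-length run stays within the per-job cap."""
    return worst_case_cost_usd(cost_per_hour_usd, max_duration_minutes) <= max_cost_usd


print(worst_case_cost_usd(0.34, 120))   # 0.68
print(fits_job_cap(0.34, 120, 10.0))    # True
```

Running this check before submission shows whether max_duration_minutes or max_cost_usd is likely to be the binding limit for a given hourly rate.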

Provider auto-selection

provider="auto" queries each provider's spot market and picks the cheapest instance matching the spec. Ties are broken by provider reliability score (Runpod > Thunder > Vast.ai, based on historical job-completion rate).
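
The selection rule can be sketched as a sort on (price, reliability rank). This is an illustration only: the Offer type, the pick_offer function, and the numeric ranks are assumptions made for the example, not the library's internals.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Tie-breaker order taken from the text: Runpod > Thunder > Vast.ai.
# Lower rank means more reliable; the numeric values are illustrative.
RELIABILITY_RANK = {"runpod": 0, "thunder": 1, "vast_ai": 2}


@dataclass(frozen=True)
class Offer:  # hypothetical type, not the library's
    provider: str
    gpu_model: str
    cost_per_hour_usd: float


def pick_offer(offers: Iterable[Offer], gpu_model: str = "ANY") -> Optional[Offer]:
    """Return the cheapest matching offer; on a price tie, prefer the more reliable provider."""
    matching = [o for o in offers if gpu_model in ("ANY", o.gpu_model)]
    if not matching:
        return None
    unknown = len(RELIABILITY_RANK)  # unknown providers sort last on ties
    return min(
        matching,
        key=lambda o: (o.cost_per_hour_usd, RELIABILITY_RANK.get(o.provider, unknown)),
    )


offers = [
    Offer("vast_ai", "RTX_4090", 0.34),
    Offer("runpod", "RTX_4090", 0.34),
    Offer("thunder", "H100", 1.40),
]
print(pick_offer(offers, "RTX_4090").provider)  # runpod (wins the tie at $0.34/hr)
```

Using a tuple as the sort key keeps price strictly dominant, so reliability only matters when two offers cost exactly the same.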

Budget enforcement

# Per-job
JobSpec(..., max_cost_usd=10.0)            # hard kill at $10

# Per-day (across all jobs)
await burst.set_daily_budget(usd=50.0)
# Subsequent burst() calls fail with BudgetExceededError if they would push daily spend past the cap

mts1b-platform/messaging sends an alert when daily spend exceeds 80% of the daily budget.
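
A rough sketch of how a daily cap and the 80% alert could interact. It assumes the admission check reserves each job's full max_cost_usd up front, which the text does not confirm; the DailyBudget class and its methods are hypothetical, and BudgetExceededError here is a local stand-in for the library's error.

```python
class BudgetExceededError(Exception):
    """Local stand-in for the error named above."""


class DailyBudget:  # hypothetical helper, not the library's enforcer
    def __init__(self, cap_usd: float, alert_fraction: float = 0.8) -> None:
        self.cap_usd = cap_usd
        self.alert_fraction = alert_fraction
        self.spent_usd = 0.0

    def admit(self, max_cost_usd: float) -> None:
        """Refuse a job whose worst-case cost would push the day past the cap."""
        if self.spent_usd + max_cost_usd > self.cap_usd:
            raise BudgetExceededError(
                f"${self.spent_usd:.2f} spent + ${max_cost_usd:.2f} job cap "
                f"exceeds daily budget ${self.cap_usd:.2f}"
            )

    def record(self, cost_usd: float) -> bool:
        """Add actual spend; return True once the alert threshold is crossed."""
        self.spent_usd += cost_usd
        return self.spent_usd > self.alert_fraction * self.cap_usd


budget = DailyBudget(cap_usd=50.0)
budget.admit(10.0)             # fine: 0 + 10 <= 50
print(budget.record(12.50))    # False: 12.50 is below the $40 alert line
budget.admit(10.0)             # fine: 12.50 + 10 <= 50
print(budget.record(30.00))    # True: 42.50 exceeds 80% of $50
```

With $42.50 already spent, a further admit(10.0) would raise BudgetExceededError, because the worst case ($52.50) exceeds the $50 cap.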

CLI

mts1b-burst submit \
  --image ghcr.io/mts1b/mts1b-gpubacktester:0.1.0 \
  --gpus 1 --gpu-model RTX_4090 \
  --max-cost-usd 10 \
  --command "mts1b-backtest batch --config /data/sweep.yaml"

mts1b-burst list                         # active jobs
mts1b-burst show <job_id>                # status + logs
mts1b-burst kill <job_id>                # tear down early

mts1b-burst budget show
# Today:        $12.50 / $50.00  (25%)
# This week:    $87.30 / $200.00 (44%)

mts1b-burst providers status
# vast.ai:    available, lowest RTX_4090 = $0.22/hr (3 offers)
# runpod:     available, lowest RTX_4090 = $0.34/hr (10 offers)
# thunder:    available, lowest H100      = $1.40/hr (2 offers)

Result sync

Each job has a result mount. After completion, mts1b-cloudburst rsyncs it to data/cloudburst/results/<job_id>/. From there, mts1b-research ingests the results into the lake.

For large results (>1 GB), use S3 / R2 instead of rsync:

JobSpec(
    ...,
    result_sink="s3://my-bucket/cloudburst/{job_id}/",
)
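
To make the two destinations concrete, here is a small sketch of how a job's result location might be resolved. It assumes the {job_id} placeholder in result_sink is expanded with Python's str.format, which the text does not state; result_destination is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional

# Default local target described above: data/cloudburst/results/<job_id>/
LOCAL_RESULTS_ROOT = PurePosixPath("data/cloudburst/results")


def result_destination(job_id: str, result_sink: Optional[str] = None) -> str:
    """Return the object-store URI if a sink is configured, else the local rsync target."""
    if result_sink is None:
        return f"{LOCAL_RESULTS_ROOT / job_id}/"
    return result_sink.format(job_id=job_id)  # assumed placeholder expansion


print(result_destination("job-123"))
# data/cloudburst/results/job-123/
print(result_destination("job-123", "s3://my-bucket/cloudburst/{job_id}/"))
# s3://my-bucket/cloudburst/job-123/
```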

Image building

A reference Dockerfile lives at mts1b-cloudburst/images/gpubacktester.Dockerfile. Build and push:

mts1b-burst image build --image gpubacktester --tag 0.1.0
mts1b-burst image push --image gpubacktester --tag 0.1.0
# Pushes to ghcr.io/mts1b/mts1b-gpubacktester:0.1.0

Workers automatically pull the right image at startup.

SSH provider (your own boxes)

For pre-existing GPU servers:

JobSpec(
    ...,
    provider="ssh",
    ssh_hosts=["gpu1.local", "gpu2.local"],   # round-robin
)

This skips spot-market provisioning and runs directly on your hardware. Budget enforcement still applies: cost is tracked as GPU-hours × your declared $/hr rate.
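
A minimal sketch of the two ideas in this section: round-robin host assignment and cost accounting at a declared rate. It assumes round-robin means successive jobs rotate through ssh_hosts in order, which the text does not spell out; both function names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def assign_hosts(job_names: List[str], ssh_hosts: List[str]) -> Dict[str, str]:
    """Hand out hosts to jobs in rotation, wrapping around when the list is exhausted."""
    hosts = cycle(ssh_hosts)
    return {name: next(hosts) for name in job_names}


def ssh_job_cost_usd(gpus: int, hours: float, declared_usd_per_gpu_hour: float) -> float:
    """Cost charged against the budget: GPU-hours times the declared hourly rate."""
    return gpus * hours * declared_usd_per_gpu_hour


print(assign_hosts(["sweep-a", "sweep-b", "sweep-c"], ["gpu1.local", "gpu2.local"]))
# {'sweep-a': 'gpu1.local', 'sweep-b': 'gpu2.local', 'sweep-c': 'gpu1.local'}
print(ssh_job_cost_usd(gpus=1, hours=2.0, declared_usd_per_gpu_hour=0.25))
# 0.5
```

Declaring a non-zero rate for your own boxes lets the same daily cap cover rented and owned hardware alike.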

Build + test

pip install -e ".[dev]"
pytest -m unit                  # mock providers
pytest -m live --provider=vast  # spins up actual instance (costs ~$0.20)

Roadmap

Version	Items
0.1 (Wave 2)	Vast.ai + Runpod + Thunder + SSH, budget enforcer, result sync
0.2 (Wave 2)	AWS spot, GCP preemptible, Azure spot
0.3 (Wave 3)	Multi-job orchestration (DAGs)
1.0 (LTS)	Stable JobSpec
