GPU compute for GPU-bound projects (261/262 class): viable free paths (Kaggle, Modal) — contingency, not scheduled

## Summary

Research into **free GPU compute** to unblock GPU-bound projects that cannot run on our free CPU-only GitHub Actions runners: the PROJ-261 / PROJ-262 class (PROJ-261: 8-bit `bitsandbytes` inference of a ~350M model; PROJ-262: small SchNet GNN training on a QM9 subset).

**Bottom line:** genuinely-free, ToS-permitted, CI-automatable GPU paths **do exist** (Kaggle Notebooks; Modal's recurring free credit). But GPU compute is a **contingency, not a near-term need** — the primary strategic direction remains re-scoping generated science to be CPU-tractable (#27, already in the specifier/planner prompts). **This issue documents findings and a proposed design; it is NOT scheduled for implementation.** (All facts verified against live sources 2026-06-25; verification gaps flagged inline.)

## Why this is a contingency, not urgent

- The end-to-end goal candidate (PROJ-492, statistical p-value validity) is **pure CPU** SciPy stats — it needs no GPU.
- #27 made the design-stage prompts forbid GPU-mandating methods, so **future** projects are generated CPU-tractable; 261/262 predate that guard.
- PROJ-262's training code is a **fabricated stub** (hardcoded loss curve + "dummy checkpoint"), so GPU would not make it valid; PROJ-261 is a single narrow project mandating 8-bit `bitsandbytes`.
- A Kaggle/Modal CI lane is a real execution-layer + CI feature (auth, packaging, push/poll/retrieve, quota + error handling), i.e. a deliberate scope expansion.

## Findings — CI-automatable options (verified 2026-06-25)

| Option | Free? | Headless CI-invokable? | ToS on automation | Verdict |
|-|-|-|-|-|
| **Kaggle Notebooks** | **Yes** (~30 GPU-h/wk, T4/P100; 12h/session) | **Yes** — `kaggle kernels push/status/output` (first-party API; existing GH Actions do this) | Anti-scraping clause targets website spidering, **not** the official API | **VIABLE (best "truly free")** |
| **Modal** | **Recurring** $30/mo free credit (~37–50 T4-hr) | **Yes** — `Function.from_name(...).remote()`/`.spawn()`, "ideal for CI" | Bans only mining/DoS/file-hosting | **VIABLE (best CI ergonomics)** |
| Beam Cloud | Recurring $30/mo | Yes (SDK/CLI/REST) | none found | VIABLE-WITH-CAVEATS (backup) |
| Lightning AI | 15 credits/mo (~80 *interruptible* GPU-h) | Partial (Studio-oriented; preemptible) | none found | BACKUP (interruptible = bad for unattended) |
| GitHub-hosted GPU larger runners | **No** — $0.052/min, Team/Enterprise, excluded from free minutes | native | (paid) | NOT VIABLE (not free) |
| Google Colab free (self-hosted runner) | GPU exists | technically | **Prohibited** — FAQ bans "bypassing the notebook UI", "remote control/SSH", "distributed computing workers"; CAPTCHA-blocks headless; 12h + idle caps | NOT VIABLE (ToS-violating) |
| RunPod / Baseten / Cerebrium | **Trial credits only** (one-time) | yes | n/a | NOT VIABLE (not recurring) |
| Replicate / HF Inference Endpoints / HF Jobs | pay-per-use (HF Jobs = PRO+) | yes | (paid) | NOT VIABLE (not free) |
| Cloudflare Workers AI (10k neurons/day) / HF Spaces ZeroGPU (5 min/day, Gradio-only) | recurring-free | partial | — | NOT VIABLE — **fixed-catalog inference only**, can't run custom `bitsandbytes`/GNN code |
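
As a sanity check on the budget figures in the table, the short sketch below derives the hourly T4 price implied by Modal's quoted credit and hour range, and the number of maximum-length Kaggle sessions that fit in a week. It uses only the numbers quoted above; the function names are illustrative, and the actual provider prices should be confirmed against live pricing pages.

```python
# Free-tier budget arithmetic, using only figures quoted in the findings table.
MODAL_CREDIT_USD = 30.0         # recurring monthly free credit
MODAL_T4_HOURS = (37.0, 50.0)   # quoted range of T4-hours per month
KAGGLE_WEEKLY_GPU_H = 30.0      # approximate weekly GPU quota
KAGGLE_SESSION_CAP_H = 12.0     # per-session limit


def implied_rate_usd_per_hour(credit_usd: float, hours: float) -> float:
    """Hourly price that would exhaust `credit_usd` in exactly `hours`."""
    return credit_usd / hours


def full_sessions_per_week(weekly_h: float, session_cap_h: float) -> int:
    """Number of maximum-length sessions that fit in the weekly quota."""
    return int(weekly_h // session_cap_h)


low = implied_rate_usd_per_hour(MODAL_CREDIT_USD, MODAL_T4_HOURS[1])
high = implied_rate_usd_per_hour(MODAL_CREDIT_USD, MODAL_T4_HOURS[0])
sessions = full_sessions_per_week(KAGGLE_WEEKLY_GPU_H, KAGGLE_SESSION_CAP_H)

print(f"Implied Modal T4 rate: ${low:.2f}-${high:.2f}/h")
print(f"Full {KAGGLE_SESSION_CAP_H:.0f} h Kaggle sessions per week: {sessions}")
```

Under these figures the implied Modal T4 rate is roughly $0.60 to $0.81 per hour, and the Kaggle quota covers two full 12 h sessions per week with 6 h to spare.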

### Verification gaps to confirm before relying on a path
- **Kaggle AUP**: `kaggle.com/terms` and `kaggle.com/aup` are client-rendered (returned a JS shell). The "anti-scraping ≠ official-API" reading is from the ToS;DR catalog of Kaggle's terms, **not** a direct primary-source fetch. Manually confirm `kaggle.com/aup` permits unattended first-party-API kernel execution before building on it.
- **Modal**: the free "No-Fee Use Limitations" are referenced in the ToS but not enumerated on the fetched pages — confirm no hidden unattended/free-tier automation restriction.

## Scope-expanding paths (future only — do NOT implement)

**Dartmouth Discovery HPC (CI → SSH → Slurm).** Free `gpuq` tier (A100 MIG 40 GB, which is ample). **Blocked for unattended CI**: off-campus GitHub runners can only reach the login node via the **Dartmouth VPN, which is behind Duo 2FA** (an interactive approval cannot be scripted). Academic-HPC AUPs very likely **prohibit** unattended/automated submission (e.g. Utah CHPC bans auto-login scripts + passwordless SSH outright), so this path would require a **self-hosted in-network runner** + **explicit written Dartmouth RC approval**. Effort M–L, with real institutional and credential-blast-radius risk. *Unverified for Dartmouth specifically:* no Dartmouth-published AUP clause was found either way, and whether SSH login adds its own Duo prompt is unconfirmed.

**Cloud Kubernetes GPU (GKE/EKS/AKS).** Technically straightforward but **over-engineered** for two tiny intermittent jobs (cluster + NVIDIA device-plugin + autoscaler upkeep). Not free (violates Constitution IV / Free-First). If a cloud GPU is ever needed, **serverless per-second GPU** (HF Jobs, already in this repo's tooling; or RunPod ~$0.40/hr T4 / Vast.ai) is the right shape: cents-per-run, scale-to-zero, no standing infra. Managed Kubernetes is listed here only as the rejected heavyweight option.

## Proposed design (if/when greenlit) — Kaggle-first

1. New `gpu-offload` lane or an `execute_and_gate` branch: when a project's analysis is flagged GPU-required (detected by the existing `_COMPUTE_INFRA_RE` signals: `bitsandbytes`/`load_in_8bit`/`device_map=cuda`), package `code/` + the data subset into a Kaggle kernel instead of failing the run.
2. `kaggle kernels push` (with `kernel-metadata.json` `enable_gpu:true`, `enable_internet:true`) → poll `kernels status` async (don't hold a runner) → `kernels output` to pull metrics/checkpoints back → commit via the normal path.
3. Secrets: `KAGGLE_USERNAME`/`KAGGLE_KEY` (single account — do NOT multi-account; both Kaggle and Colab ban quota evasion). Backup lane: Modal token.
4. Guardrails: respect the 30 GPU-h/week + 12h/session caps (checkpoint long jobs); keep CPU-first re-scoping as the default (this lane only for projects that genuinely cannot be CPU-tractable and clear scientific-soundness review).
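
The steps above can be sketched as a handful of pure helpers. This is a minimal illustration under stated assumptions, not the lane's implementation: `GPU_SIGNAL_RE` is a hypothetical stand-in for the existing `_COMPUTE_INFRA_RE` (whose real pattern may differ), all function names are invented for this example, and the kernel slug and file names are placeholders. The CLI argument vectors use the `kaggle kernels push/status/output` subcommands named in step 2.

```python
import json
import re
from pathlib import Path

# Hypothetical stand-in for the repo's `_COMPUTE_INFRA_RE`; the real pattern may differ.
GPU_SIGNAL_RE = re.compile(r"bitsandbytes|load_in_8bit|device_map\s*=\s*\W?cuda")

WEEKLY_GPU_HOURS = 30.0   # Kaggle weekly quota quoted in the findings
SESSION_CAP_HOURS = 12.0  # Kaggle per-session cap quoted in the findings


def needs_gpu(source: str) -> bool:
    """Step 1: True if the analysis source contains a GPU-mandating signal."""
    return GPU_SIGNAL_RE.search(source) is not None


def kernel_metadata(username: str, slug: str, code_file: str) -> dict:
    """Step 2: payload for kernel-metadata.json (private GPU script kernel)."""
    return {
        "id": f"{username}/{slug}",
        "title": slug,
        "code_file": code_file,
        "language": "python",
        "kernel_type": "script",
        "is_private": True,
        "enable_gpu": True,
        "enable_internet": True,
        "dataset_sources": [],
        "competition_sources": [],
        "kernel_sources": [],
    }


def write_kernel_dir(kernel_dir: Path, metadata: dict) -> Path:
    """Write kernel-metadata.json next to the packaged code; return its path."""
    kernel_dir.mkdir(parents=True, exist_ok=True)
    path = kernel_dir / "kernel-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


def cli_commands(kernel_dir: Path, kernel_id: str, out_dir: Path) -> dict:
    """Step 2: argument vectors for the push, status, and output CLI calls."""
    return {
        "push": ["kaggle", "kernels", "push", "-p", str(kernel_dir)],
        "status": ["kaggle", "kernels", "status", kernel_id],
        "output": ["kaggle", "kernels", "output", kernel_id, "-p", str(out_dir)],
    }


def within_quota(used_hours_this_week: float, requested_hours: float) -> bool:
    """Step 4: reject jobs that would exceed the session or weekly cap."""
    return (
        requested_hours <= SESSION_CAP_HOURS
        and used_hours_this_week + requested_hours <= WEEKLY_GPU_HOURS
    )
```

Each argument vector could be passed to `subprocess.run(cmd, check=True)` in a job that has `KAGGLE_USERNAME`/`KAGGLE_KEY` in its environment. Keeping the `status` call in a separate, short scheduled job rather than a sleep loop is one way to avoid holding a runner while the kernel executes.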

## Recommendation

Keep **CPU re-scoping (#27) as the primary direction**. Treat GPU as a contingency: if a *high-value* project is ever genuinely un-CPU-able **and** passes scientific-soundness review, implement the **Kaggle-first** lane (Modal as backup) behind a small spec — after confirming the two ToS gaps above. Do **not** pursue Discovery (AUP/Duo wall) or managed k8s now.

---
*Findings from a parallel research sweep (3 agents, ~30 live-source fetches). Source URLs are listed in the linked research notes; key ones: GitHub Actions runner pricing; Google Colab FAQ; Kaggle `kaggle-api` kernels docs; Modal pricing/terms; Dartmouth RC Discovery docs; Utah CHPC security policy.*

