GPU compute for GPU-bound projects (261/262 class): viable free paths (Kaggle, Modal) — contingency, not scheduled #367

Description

@jeremymanning

Summary

Research into free GPU compute to unblock GPU-bound projects (the PROJ-261 / PROJ-262 class — 8-bit bitsandbytes inference of a ~350M model; small SchNet GNN training on a QM9 subset) that cannot run on our free CPU-only GitHub Actions runners.

Bottom line: genuinely-free, ToS-permitted, CI-automatable GPU paths do exist (Kaggle Notebooks; Modal's recurring free credit). But GPU compute is a contingency, not a near-term need — the primary strategic direction remains re-scoping generated science to be CPU-tractable (#27, already in the specifier/planner prompts). This issue documents findings and a proposed design; it is NOT scheduled for implementation. (All facts verified against live sources 2026-06-25; verification gaps flagged inline.)

Why this is a contingency, not urgent

  • The end-to-end goal candidate (PROJ-492, statistical p-value validity) is pure CPU SciPy stats — it needs no GPU.
  • #27 made the design-stage prompts forbid GPU-mandating methods, so future projects are generated CPU-tractable; 261/262 predate that guard.
  • PROJ-262's training code is a fabricated stub (hardcoded loss curve + "dummy checkpoint"), so GPU would not make it valid; PROJ-261 is a single narrow project mandating 8-bit bitsandbytes.
  • A Kaggle/Modal CI lane is a real execution-layer + CI feature (auth, packaging, push/poll/retrieve, quota + error handling), i.e. a deliberate scope expansion.

Findings — CI-automatable options (verified 2026-06-25)

| Option | Free? | Headless CI-invokable? | ToS on automation | Verdict |
| --- | --- | --- | --- | --- |
| Kaggle Notebooks | Yes (~30 GPU-h/wk, T4/P100; 12h/session) | Yes: kaggle kernels push/status/output (first-party API; existing GH Actions do this) | Anti-scraping clause targets website spidering, not the official API | VIABLE (best "truly free") |
| Modal | Recurring $30/mo free credit (~37–50 T4-hr) | Yes: Function.from_name(...).remote()/.spawn(), "ideal for CI" | Bans only mining/DoS/file-hosting | VIABLE (best CI ergonomics) |
| Beam Cloud | Recurring $30/mo | Yes (SDK/CLI/REST) | none found | VIABLE-WITH-CAVEATS (backup) |
| Lightning AI | 15 credits/mo (~80 interruptible GPU-h) | Partial (Studio-oriented; preemptible) | none found | BACKUP (interruptible = bad for unattended) |
| GitHub-hosted GPU larger runners | No: $0.052/min, Team/Enterprise, excluded from free minutes | native (paid) | | NOT VIABLE (not free) |
| Google Colab free (self-hosted runner) | GPU exists | technically | Prohibited: FAQ bans "bypassing the notebook UI", "remote control/SSH", "distributed computing workers"; CAPTCHA-blocks headless; 12h + idle caps | NOT VIABLE (ToS-violating) |
| RunPod / Baseten / Cerebrium | Trial credits only (one-time) | yes | n/a | NOT VIABLE (not recurring) |
| Replicate / HF Inference Endpoints / HF Jobs | pay-per-use (HF Jobs = PRO+) | yes (paid) | | NOT VIABLE (not free) |
| Cloudflare Workers AI (10k neurons/day) / HF Spaces ZeroGPU (5 min/day, Gradio-only) | recurring-free | partial | | NOT VIABLE: fixed-catalog inference only, can't run custom bitsandbytes/GNN code |
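
As a sanity check on the quota figures in the table, the short sketch below derives the hourly T4 rate implied by Modal's "$30/mo ≈ 37–50 T4-hr" and shows how Kaggle's 30 GPU-h/week splits under the 12h/session cap. It uses only numbers quoted in the table; the function and constant names are illustrative, and the implied rate is an inference from those numbers, not a quoted Modal price.

```python
def implied_hourly_rate(monthly_credit_usd: float, gpu_hours: float) -> float:
    """Hourly price implied by spending the whole credit on `gpu_hours` of GPU time."""
    return monthly_credit_usd / gpu_hours


MODAL_CREDIT_USD = 30.0  # recurring free credit, from the table

low = implied_hourly_rate(MODAL_CREDIT_USD, 50)   # upper end of the hours range
high = implied_hourly_rate(MODAL_CREDIT_USD, 37)  # lower end of the hours range
print(f"implied T4 rate: ${low:.2f} to ${high:.2f} per hour")

KAGGLE_WEEKLY_GPU_H = 30   # weekly GPU quota, from the table
KAGGLE_SESSION_CAP_H = 12  # per-session cap, from the table

# How many maximum-length sessions fit in one week, and what is left over.
full_sessions, remainder_h = divmod(KAGGLE_WEEKLY_GPU_H, KAGGLE_SESSION_CAP_H)
print(f"{full_sessions} full sessions plus {remainder_h} h per week")
```

In other words, a single job longer than 12 hours has to checkpoint and resume, and no more than two full-length sessions fit in a week.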

Verification gaps to confirm before relying on a path

  • Kaggle AUP: kaggle.com/terms and kaggle.com/aup are client-rendered (returned a JS shell). The "anti-scraping ≠ official-API" reading is from the ToS;DR catalog of Kaggle's terms, not a direct primary-source fetch. Manually confirm kaggle.com/aup permits unattended first-party-API kernel execution before building on it.
  • Modal: the free "No-Fee Use Limitations" are referenced in the ToS but not enumerated on the fetched pages — confirm no hidden unattended/free-tier automation restriction.

Scope-expanding paths (future only — do NOT implement)

Dartmouth Discovery HPC (CI → SSH → Slurm). Free gpuq tier (A100 MIG 40 GB, ample for these jobs). Blocked for unattended CI: off-campus GitHub runners can only reach the login node via the Dartmouth VPN, which is behind Duo 2FA (an interactive approval cannot be scripted). Academic-HPC AUPs very likely prohibit unattended/automated submission (e.g. Utah CHPC bans auto-login scripts and passwordless SSH outright), so this path would require a self-hosted in-network runner plus explicit written Dartmouth RC approval. Effort M–L, with real institutional and credential-blast-radius risk. Unverified for Dartmouth specifically: no Dartmouth-published AUP clause was found either way, and whether SSH login adds its own Duo prompt is unconfirmed.

Cloud Kubernetes GPU (GKE/EKS/AKS). Technically straightforward but over-engineered for two tiny intermittent jobs (cluster + NVIDIA device-plugin + autoscaler upkeep). Not free (violates Constitution IV / Free-First). If a cloud GPU is ever needed, serverless per-second GPU (HF Jobs — already in this repo's tooling; or RunPod ~$0.40/hr T4 / Vast.ai) is the right shape: cents-per-run, scale-to-zero, no standing infra. Mention only as the rejected heavyweight option.
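
To make "cents-per-run" concrete, here is a rough calculation at the ~$0.40/hr T4 rate quoted above. The job runtimes are assumptions chosen for illustration; no measured runtimes for 261/262 are given in this issue.

```python
def cost_per_run_usd(hourly_rate_usd: float, runtime_minutes: float) -> float:
    """Cost of one job billed pro rata at an hourly rate."""
    return hourly_rate_usd * runtime_minutes / 60.0


RUNPOD_T4_USD_PER_HOUR = 0.40  # approximate rate quoted in the paragraph above

# Hypothetical runtimes (minutes), for illustration only.
for minutes in (5, 15, 60):
    cost = cost_per_run_usd(RUNPOD_T4_USD_PER_HOUR, minutes)
    print(f"{minutes:>3} min -> ${cost:.3f}")
```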

Proposed design (if/when greenlit) — Kaggle-first

  1. New gpu-offload lane or an execute_and_gate branch: when a project's analysis is flagged GPU-required (detected by the existing _COMPUTE_INFRA_RE signals — bitsandbytes/load_in_8bit/device_map=cuda), instead of failing → package code/ + the data subset into a Kaggle kernel.
  2. kaggle kernels push (with kernel-metadata.json enable_gpu:true, enable_internet:true) → poll kernels status async (don't hold a runner) → kernels output to pull metrics/checkpoints back → commit via the normal path.
  3. Secrets: KAGGLE_USERNAME/KAGGLE_KEY (single account — do NOT multi-account; both Kaggle and Colab ban quota evasion). Backup lane: Modal token.
  4. Guardrails: respect the 30 GPU-h/week + 12h/session caps (checkpoint long jobs); keep CPU-first re-scoping as the default (this lane only for projects that genuinely cannot be CPU-tractable and clear scientific-soundness review).
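
The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned implementation: GPU_SIGNAL_RE is a hypothetical stand-in for the repo's _COMPUTE_INFRA_RE (whose actual pattern is not shown here), the helper names are invented, and the status parser assumes the CLI prints a line of the form `<kernel> has status "<status>"`, which should be checked against the installed kaggle version. The CLI calls themselves (kernels push -p, kernels status, kernels output -p) are the first-party commands named in step 2.

```python
import json
import re
import subprocess
from pathlib import Path

# Hypothetical stand-in for _COMPUTE_INFRA_RE; the real pattern may differ.
GPU_SIGNAL_RE = re.compile(r"bitsandbytes|load_in_8bit|device_map\s*=\s*\W?cuda")


def needs_gpu(source: str) -> bool:
    """True if the analysis source contains one of the GPU-required signals."""
    return GPU_SIGNAL_RE.search(source) is not None


def build_kernel_metadata(username: str, slug: str, code_file: str) -> dict:
    """Contents of kernel-metadata.json for a private GPU script kernel."""
    return {
        "id": f"{username}/{slug}",
        "title": slug,  # assumption: reuse the slug as the title
        "code_file": code_file,
        "language": "python",
        "kernel_type": "script",
        "is_private": True,
        "enable_gpu": True,
        "enable_internet": True,
        "dataset_sources": [],
        "competition_sources": [],
        "kernel_sources": [],
    }


def classify_status(cli_output: str) -> str:
    """Map `kaggle kernels status` output to complete/failed/pending/unknown."""
    match = re.search(r'has status "([^"]+)"', cli_output)
    if match is None:
        return "unknown"
    status = match.group(1).lower()
    if "complete" in status:
        return "complete"
    if "error" in status or "cancel" in status:
        return "failed"
    return "pending"


def kaggle(*args: str) -> str:
    """Run the kaggle CLI; expects KAGGLE_USERNAME/KAGGLE_KEY in the environment."""
    result = subprocess.run(
        ["kaggle", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def push(folder: Path, metadata: dict) -> None:
    """Write kernel-metadata.json into the folder and push it as a kernel."""
    (folder / "kernel-metadata.json").write_text(json.dumps(metadata, indent=2))
    kaggle("kernels", "push", "-p", str(folder))


def poll_once(kernel_id: str) -> str:
    """Single status check, so a scheduled job can poll without holding a runner."""
    return classify_status(kaggle("kernels", "status", kernel_id))


def fetch_output(kernel_id: str, dest: Path) -> None:
    """Download the kernel's output files (metrics, checkpoints) into dest."""
    kaggle("kernels", "output", kernel_id, "-p", str(dest))
```

Keeping poll_once as a single check, rather than a blocking loop, matches the "don't hold a runner" note in step 2: a scheduled workflow could call it periodically and only call fetch_output once the result is "complete".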

Recommendation

Keep CPU re-scoping (#27) as the primary direction. Treat GPU as a contingency: if a high-value project is ever genuinely un-CPU-able and passes scientific-soundness review, implement the Kaggle-first lane (Modal as backup) behind a small spec — after confirming the two ToS gaps above. Do not pursue Discovery (AUP/Duo wall) or managed k8s now.


Findings from a parallel research sweep (3 agents, ~30 live-source fetches). Source URLs are listed in the linked research notes; key ones: GitHub Actions runner pricing; Google Colab FAQ; Kaggle kaggle-api kernels docs; Modal pricing/terms; Dartmouth RC Discovery docs; Utah CHPC security policy.
