Research into free GPU compute to unblock GPU-bound projects (the PROJ-261 / PROJ-262 class — 8-bit bitsandbytes inference of a ~350M model; small SchNet GNN training on a QM9 subset) that cannot run on our free CPU-only GitHub Actions runners.
Bottom line: genuinely-free, ToS-permitted, CI-automatable GPU paths do exist (Kaggle Notebooks; Modal's recurring free credit). But GPU compute is a contingency, not a near-term need — the primary strategic direction remains re-scoping generated science to be CPU-tractable (#27, already in the specifier/planner prompts). This issue documents findings and a proposed design; it is NOT scheduled for implementation. (All facts verified against live sources 2026-06-25; verification gaps flagged inline.)
Why this is a contingency, not urgent
The end-to-end goal candidate (PROJ-492, statistical p-value validity) is pure CPU SciPy stats — it needs no GPU.
PROJ-262's training code is a fabricated stub (hardcoded loss curve + "dummy checkpoint"), so GPU would not make it valid; PROJ-261 is a single narrow project mandating 8-bit bitsandbytes.
A Kaggle/Modal CI lane is a real execution-layer + CI feature (auth, packaging, push/poll/retrieve, quota + error handling), i.e. a deliberate scope expansion.
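Findings: CI-automatable options (verified 2026-06-25)
Kaggle Notebooks: automated through kaggle kernels push/status/output (first-party API; existing GH Actions do this).
Modal: invoked through Function.from_name(...).remote()/.spawn(), described as "ideal for CI".
NOT VIABLE: fixed-catalog inference only, cannot run custom bitsandbytes/GNN code.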
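For the Modal path, the CI-side call could be sketched roughly as follows. This is a minimal sketch, not a tested integration: the app name `gpu-offload` and function name `run_job` are hypothetical, and it assumes a Modal function has already been deployed under those names. The lookup is kept separate from the dispatch so the dispatch logic can be exercised without Modal installed.

```python
from typing import Any, Protocol


class RemoteFunction(Protocol):
    """The two call styles this sketch relies on (mirrors modal.Function)."""

    def remote(self, *args: Any, **kwargs: Any) -> Any: ...
    def spawn(self, *args: Any, **kwargs: Any) -> Any: ...


def lookup(app_name: str = "gpu-offload", function_name: str = "run_job") -> RemoteFunction:
    """Resolve a deployed Modal function by name (both names are hypothetical)."""
    import modal  # imported lazily so CPU-only jobs do not need the package

    return modal.Function.from_name(app_name, function_name)


def dispatch(fn: RemoteFunction, payload: dict, wait: bool) -> Any:
    """Run the job and either block for the result or return a call handle id.

    wait=True  -> fn.remote(...) blocks until the GPU job returns its result.
    wait=False -> fn.spawn(...) returns immediately; the returned id can be
                  stored and checked by a later CI run instead of holding a runner.
    """
    if wait:
        return fn.remote(payload)
    call = fn.spawn(payload)
    return call.object_id
```

In CI this would be used as `dispatch(lookup(), {"project": "PROJ-261"}, wait=False)`, with the Modal token supplied through repository secrets.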
Verification gaps to confirm before relying on a path
Kaggle AUP: kaggle.com/terms and kaggle.com/aup are client-rendered (returned a JS shell). The "anti-scraping ≠ official-API" reading is from the ToS;DR catalog of Kaggle's terms, not a direct primary-source fetch. Manually confirm kaggle.com/aup permits unattended first-party-API kernel execution before building on it.
Modal: the free "No-Fee Use Limitations" are referenced in the ToS but not enumerated on the fetched pages — confirm no hidden unattended/free-tier automation restriction.
Scope-expanding paths (future only — do NOT implement)
Dartmouth Discovery HPC (CI → SSH → Slurm). Free gpuq tier (A100 MIG 40 GB, ample). Blocked for unattended CI: off-campus GitHub runners can reach the login node only via the Dartmouth VPN, which sits behind Duo 2FA (an interactive approval cannot be scripted). Academic-HPC AUPs very likely prohibit unattended/automated submission (e.g. Utah CHPC bans auto-login scripts and passwordless SSH outright), so this path would require a self-hosted in-network runner plus explicit written Dartmouth RC approval. Effort M–L, with real institutional and credential-blast-radius risk. Unverified for Dartmouth specifically: no Dartmouth-published AUP clause was found either way, and whether SSH login adds its own Duo prompt is unconfirmed.
Cloud Kubernetes GPU (GKE/EKS/AKS). Technically straightforward but over-engineered for two tiny intermittent jobs (cluster + NVIDIA device-plugin + autoscaler upkeep). Not free (violates Constitution IV / Free-First). If a cloud GPU is ever needed, serverless per-second GPU (HF Jobs — already in this repo's tooling; or RunPod ~$0.40/hr T4 / Vast.ai) is the right shape: cents-per-run, scale-to-zero, no standing infra. Mention only as the rejected heavyweight option.
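To make "cents-per-run" concrete, per-second billing reduces to a one-line calculation. The sketch below uses the ~$0.40/hr T4 figure quoted above; the 10-minute runtime is an assumption for illustration, not a measured job duration.

```python
def cost_per_run_usd(rate_per_hour: float, runtime_seconds: float) -> float:
    """Cost of one job under per-second billing (no standing infrastructure)."""
    return rate_per_hour * runtime_seconds / 3600.0


# Assumed 10-minute job at the quoted ~$0.40/hr T4 rate.
print(f"${cost_per_run_usd(0.40, 600):.3f}")  # $0.067
```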
Proposed design (if/when greenlit) — Kaggle-first
New gpu-offload lane or an execute_and_gate branch: when a project's analysis is flagged GPU-required (detected by the existing _COMPUTE_INFRA_RE signals — bitsandbytes/load_in_8bit/device_map=cuda), instead of failing → package code/ + the data subset into a Kaggle kernel.
kaggle kernels push (with kernel-metadata.json setting enable_gpu: true and enable_internet: true) → poll kernels status async (don't hold a runner) → kernels output to pull metrics/checkpoints back → commit via the normal path.
Secrets: KAGGLE_USERNAME/KAGGLE_KEY (single account — do NOT multi-account; both Kaggle and Colab ban quota evasion). Backup lane: Modal token.
Guardrails: respect the 30 GPU-h/week + 12h/session caps (checkpoint long jobs); keep CPU-first re-scoping as the default (this lane only for projects that genuinely cannot be CPU-tractable and clear scientific-soundness review).
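The steps above could be sketched as a handful of small helpers. This is a sketch under stated assumptions, not the implementation: `GPU_SIGNALS_RE` is a hypothetical stand-in for the repo's `_COMPUTE_INFRA_RE` (whose real pattern is not shown here), the metadata values other than `enable_gpu`/`enable_internet` are illustrative, and the status parser assumes the CLI prints a line of the form `<kernel> has status "<state>"`, which should be checked against the installed kaggle version. Polling is a single-shot check meant to be called from a scheduled job, so no runner is held while the kernel runs.

```python
import json
import re
import subprocess
from pathlib import Path
from typing import Callable

# Hypothetical stand-in for _COMPUTE_INFRA_RE; the real pattern may differ.
GPU_SIGNALS_RE = re.compile(r"bitsandbytes|load_in_8bit|device_map\s*=\s*['\"]?cuda")

WEEKLY_CAP_H = 30.0   # GPU-hours per week (cap quoted in the guardrails)
SESSION_CAP_H = 12.0  # hours per session (cap quoted in the guardrails)


def needs_gpu(source: str) -> bool:
    """True if the analysis source shows any of the GPU-required signals."""
    return GPU_SIGNALS_RE.search(source) is not None


def within_caps(used_this_week_h: float, estimated_h: float) -> bool:
    """True if one run fits the session cap and the remaining weekly quota."""
    return estimated_h <= SESSION_CAP_H and used_this_week_h + estimated_h <= WEEKLY_CAP_H


def build_kernel_metadata(username: str, slug: str, code_file: str) -> dict:
    """Contents of kernel-metadata.json for a GPU script kernel (values illustrative)."""
    return {
        "id": f"{username}/{slug}",
        "title": slug,
        "code_file": code_file,
        "language": "python",
        "kernel_type": "script",
        "is_private": True,
        "enable_gpu": True,
        "enable_internet": True,
    }


def write_kernel_dir(kernel_dir: Path, metadata: dict) -> Path:
    """Write kernel-metadata.json next to the packaged code and return its path."""
    kernel_dir.mkdir(parents=True, exist_ok=True)
    path = kernel_dir / "kernel-metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path


def kaggle(*args: str) -> str:
    """Run the kaggle CLI (reads KAGGLE_USERNAME/KAGGLE_KEY) and return stdout."""
    result = subprocess.run(["kaggle", *args], check=True, capture_output=True, text=True)
    return result.stdout


def parse_status(cli_output: str) -> str:
    """Map CLI status output to complete / failed / pending / unknown."""
    match = re.search(r'has status "([^"]+)"', cli_output)
    if match is None:
        return "unknown"
    raw = match.group(1).lower()
    if "complete" in raw:
        return "complete"
    if "error" in raw or "cancel" in raw:
        return "failed"
    return "pending"


def poll_once(kernel_id: str, out_dir: Path, run: Callable[..., str] = kaggle) -> str:
    """Check status once; download outputs only when the kernel has completed."""
    state = parse_status(run("kernels", "status", kernel_id))
    if state == "complete":
        run("kernels", "output", kernel_id, "-p", str(out_dir))
    return state
```

A submitting job would call `write_kernel_dir(...)` and then `kaggle("kernels", "push", "-p", str(kernel_dir))`; a later scheduled job would call `poll_once(...)` and commit the retrieved outputs via the normal path once it returns `"complete"`.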
Recommendation
Keep CPU re-scoping (#27) as the primary direction. Treat GPU as a contingency: if a high-value project is ever genuinely un-CPU-able and passes scientific-soundness review, implement the Kaggle-first lane (Modal as backup) behind a small spec — after confirming the two ToS gaps above. Do not pursue Discovery (AUP/Duo wall) or managed k8s now.
Findings from a parallel research sweep (3 agents, ~30 live-source fetches). Source URLs are listed in the linked research notes; key ones: GitHub Actions runner pricing; Google Colab FAQ; Kaggle kaggle-api kernels docs; Modal pricing/terms; Dartmouth RC Discovery docs; Utah CHPC security policy.