Fix scheduler stepping and label dtype handling in train.py and eval.py (#152)#221

Open
egekaya1 wants to merge 1 commit into ML4SCI:main from egekaya1:fix/scheduler-label-dtype-152
Conversation

@egekaya1
  • scheduler.step(loss) -> scheduler.step(): CosineAnnealingWarmRestarts does not accept a metric argument; passing the loss caused it to be treated as an epoch override, breaking the cosine LR cycle entirely.

  • labels.type(torch.LongTensor).to(device) -> labels.to(device, dtype=torch.long) in train.py: torch.LongTensor is CPU-specific, so the old form allocated an intermediate CPU tensor on every batch before transferring to the device.

  • batch_y.type(torch.LongTensor) -> batch_y.to(dtype=torch.long) in eval.py: the same dtype issue, plus the original had no .to(device) call at all, so labels never moved to the GPU; it only worked by accident because the logits were pulled back to the CPU before metric computation.

While reviewing eval.py, I noticed that micro_auroc is initialised as an empty
list and never populated, so np.mean(micro_auroc) always returns NaN. This
is already tracked in #164 and is out of scope here, but I'm flagging it for
whoever picks that up.
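A two-line repro of that symptom (variable name taken from the report; the surrounding eval code is omitted):

```python
import numpy as np

micro_auroc = []                # initialised but never appended to
result = np.mean(micro_auroc)   # emits a RuntimeWarning and yields nan
```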

Fixes #152


eval.py was missed by all prior fix attempts (ML4SCI#156, ML4SCI#168, ML4SCI#173).


Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Incorrect scheduler stepping and label dtype handling in training loop

1 participant