fix(scout): erase SAS/SCSI drives by capability, not always crypto erase #2319

Closed
adnandnv wants to merge 2 commits into NVIDIA:main from adnandnv:fix/scout-erase-by-capability
Conversation

@adnandnv

@adnandnv adnandnv commented Jun 9, 2026


Description

Scout's deprovision issued SCSI SANITIZE crypto erase unconditionally on SAS/SCSI
drives. Drives that don't support crypto erase reject it ("Illegal request,
Invalid opcode"), failing cleanup and leaving the machine stuck retrying. Scout
now probes the drive's reported capability (the SUPPORT field from REPORT
SUPPORTED OPERATION CODES, via sg_opcodes --hex) and picks the erase method.

Decision tree (SAS/SCSI), keyed on probed crypto-erase support:

  • Supported → crypto erase
  • NotSupported → block erase if supported, else ATA Secure Erase, else fail closed
  • Unknown → attempt crypto erase; on failure, block erase if supported, else
    ATA Secure Erase, else fail closed

"Unknown" = the drive didn't report capability; attempting crypto keeps
already-working drives working, and the fallback recovers drives that can't crypto
erase. "Fail closed" holds the machine out of the pool rather than returning it
unwiped. SATA drives use ATA Secure Erase; the NVMe path is unchanged.
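The decision tree above can be sketched as a small pure function. This is an illustrative sketch only, not Scout's actual code: the names `Support`, `Method`, `classify_support`, and `erase_plan` are hypothetical, and the mapping from raw SUPPORT field values to the three states (3 or 5 as supported, 1 as not supported, anything else as unknown) is an assumption about how the probe result is classified. The sketch also assumes that "fail closed" means every method in the returned plan was tried and failed.

```python
from enum import Enum


class Support(Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    UNKNOWN = "unknown"


class Method(Enum):
    CRYPTO_ERASE = "crypto_erase"
    BLOCK_ERASE = "block_erase"
    ATA_SECURE_ERASE = "ata_secure_erase"


def classify_support(field):
    """Classify a raw SUPPORT field value; None means the probe returned nothing.

    Assumed mapping: 3 or 5 -> supported, 1 -> not supported, else unknown.
    """
    if field in (3, 5):
        return Support.SUPPORTED
    if field == 1:
        return Support.NOT_SUPPORTED
    return Support.UNKNOWN


def erase_plan(crypto, block_supported):
    """Return the erase methods to attempt, in order, for a SAS/SCSI drive.

    If every method in the plan fails, the caller fails closed and keeps the
    machine out of the pool.
    """
    # Non-crypto fallback: block erase if the drive supports it, else ATA Secure Erase.
    fallback = Method.BLOCK_ERASE if block_supported else Method.ATA_SECURE_ERASE
    if crypto is Support.SUPPORTED:
        return [Method.CRYPTO_ERASE]
    if crypto is Support.NOT_SUPPORTED:
        return [fallback]
    # Unknown: attempt crypto erase first, then the fallback on failure.
    return [Method.CRYPTO_ERASE, fallback]
```

Returning an ordered plan rather than running commands directly keeps the policy separate from execution, so the policy can be checked without hardware.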

Type of Change

  • Fix - Bug fixes

Related Issues (Optional)

#429

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

Unit tests cover the probe parsing, capability classification, and the decision
logic. The erase path can't run in CI (there's no SCSI hardware, and unit tests
don't execute the real erase/probe commands), so the following needs validation
on real hardware before rollout:

  • Affected drive: probe + fallback erases it (verify LBA 0 / sample sectors
    wiped) and the machine returns to the pool.
  • Known-good SAS SED: still erases via crypto erase.
  • Block-erase fallback: a drive supporting block but not crypto erases and is
    verified wiped.
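Although the real erase commands cannot run in CI, the fallback ordering itself could be exercised with an injected fake in place of the command runner. The sketch below is one possible shape for such a test and is not taken from the PR: `run_plan` and `make_fake_attempt` are hypothetical helpers, and erase methods are represented as plain strings for brevity.

```python
def run_plan(plan, attempt):
    """Try each erase method in order.

    Returns the first method for which attempt(method) is true, or None when
    every method fails (the caller would then fail closed).
    """
    for method in plan:
        if attempt(method):
            return method
    return None


def make_fake_attempt(working):
    """Build a fake runner that succeeds only for methods in `working`.

    Also returns the list it appends each attempted method to, so a test can
    assert on the order of attempts.
    """
    calls = []

    def attempt(method):
        calls.append(method)
        return method in working

    return attempt, calls


# A drive of unknown capability that rejects crypto erase but accepts ATA Secure Erase.
attempt, calls = make_fake_attempt({"ata_secure_erase"})
result = run_plan(["crypto_erase", "ata_secure_erase"], attempt)
```

Here `result` is the method that succeeded and `calls` records that crypto erase was attempted first. This covers command selection and fallback order only; it says nothing about whether a real drive is actually wiped, which still needs the hardware checks listed above.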

adnandnv and others added 2 commits June 8, 2026 18:49
Scout's deprovision issued SCSI SANITIZE crypto erase unconditionally on
SAS/SCSI drives. Drives that don't support crypto erase reject it ("Illegal
request, Invalid opcode"), failing cleanup and leaving the machine stuck
retrying. Scout now probes the drive's reported capability (the SUPPORT field
from REPORT SUPPORTED OPERATION CODES, via `sg_opcodes --hex`) and picks the
erase method accordingly.

Decision tree (SAS/SCSI), keyed on probed crypto-erase support:
  Supported     -> crypto erase
  NotSupported  -> block erase if supported, else ATA Secure Erase, else fail
  Unknown       -> attempt crypto erase; on failure, block erase if supported,
                   else ATA Secure Erase, else fail

"Unknown" = the drive didn't report capability; attempting crypto keeps
already-working drives working, and the fallback recovers drives that can't
crypto erase. "Fail" holds the machine out of the pool rather than returning it
unwiped. SATA drives use ATA Secure Erase; the NVMe path is unchanged.

Manual testing required (cannot run in CI — there's no SCSI hardware, and unit
tests don't execute the real erase/probe commands):
  - Affected drive: probe + fallback erases it (verify LBA 0 / sample sectors
    wiped) and the machine returns to the pool.
  - Known-good SAS SED: still erases via crypto erase.
  - Block-erase fallback: a drive supporting block but not crypto erases and is
    verified wiped.

Related: NVIDIA#429

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: adnandnv <258082943+adnandnv@users.noreply.github.com>
…pability

Signed-off-by: adnandnv <258082943+adnandnv@users.noreply.github.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3595b807-56f9-4f2b-80fb-040138dab92e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@williampnvidia

Contributor

This PR does not fix the bug that was actually encountered. The affected H100 systems report a Virtual SD0 USB-media drive. With this PR, Scout still enumerates every /dev/sd* from /sys/block and sends it into clean_this_block_device; there is no hidden/removable/USB skip. The reported device (TRAN=usb, TYPE=disk, SIZE=1G) would still enter the SAS/SCSI path, probe or attempt sanitize, fall through to ATA Secure Erase, and fail closed. Cleanup would still fail.

This PR implements a broader policy change, not a targeted hotfix. It changes SAS/SCSI cleanup from “supported fleet SAS SEDs use crypto erase” to “probe capabilities, try block erase, maybe try ATA Secure Erase.” That might be valuable later for real SAS/SCSI drives that lack crypto erase, but it needs hardware validation because it can return non-crypto-erased drives to the pool via different erase methods.

The tests mostly cover parsing/classification helpers. They do not actually exercise command selection/fallback behavior, and they do not cover the USB/removable/hidden device case. The PR body says the erase path needs manual hardware validation, which I agree with.

PR #2317 is narrower and directly addresses the reported failure mode: BMC virtual USB media should not be treated as a cleanup target. PR #2319 is a reasonable separate follow-up for the "real SAS/SCSI drive rejects crypto sanitize" case, which could actually happen, but it does not fix the bug it is targeting.
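The missing skip described in this comment could look roughly like the following filter. This is a hedged sketch and not the change made in #2317: `is_cleanup_target` is a hypothetical name, and the input is assumed to be a dictionary of lsblk-style fields (`TRAN`, `TYPE`, `RM`) for one device.

```python
def is_cleanup_target(dev):
    """Decide whether a block device should be sent to the erase path.

    `dev` is assumed to hold lsblk-style fields: TYPE (for example "disk"),
    TRAN (transport, for example "sas", "sata", "usb"), and RM (removable flag).
    """
    if dev.get("TYPE") != "disk":
        return False
    # Skip USB-attached devices such as BMC virtual media.
    if dev.get("TRAN") == "usb":
        return False
    # Skip removable media; accept the flag as a bool, int, or string.
    if dev.get("RM") in (True, 1, "1"):
        return False
    return True


# The device from the report: a 1G USB-transport disk exposed by the BMC.
virtual_media = {"NAME": "sda", "TRAN": "usb", "TYPE": "disk", "SIZE": "1G", "RM": "1"}
sas_drive = {"NAME": "sdb", "TRAN": "sas", "TYPE": "disk", "RM": "0"}
```

With a filter of this kind applied before enumeration reaches the erase path, the virtual media device would never be probed or sanitized, while a real SAS drive would still be cleaned.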

@adnandnv adnandnv closed this Jun 9, 2026