fix(scout): erase SAS/SCSI drives by capability, not always crypto erase #2319
adnandnv wants to merge 2 commits into
Scout's deprovision issued SCSI SANITIZE crypto erase unconditionally on
SAS/SCSI drives. Drives that don't support crypto erase reject it ("Illegal
request, Invalid opcode"), failing cleanup and leaving the machine stuck
retrying. Scout now probes the drive's reported capability (the SUPPORT field
from REPORT SUPPORTED OPERATION CODES, via `sg_opcodes --hex`) and picks the
erase method accordingly.
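As a rough illustration of the capability classification described above, the sketch below maps the SUPPORT field to three states. It assumes the one-command parameter data layout from SPC, where SUPPORT is the low three bits of byte 1 (001b = not supported, 011b = supported per the standard, 101b = supported in a vendor-specific manner). The names `CryptoSupport` and `classify_support` are hypothetical, and the exact mapping Scout uses may differ.

```python
from enum import Enum


class CryptoSupport(Enum):
    """Hypothetical three-state result of the capability probe."""
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    UNKNOWN = "unknown"


def classify_support(one_command_data: bytes) -> CryptoSupport:
    """Classify the SUPPORT field of REPORT SUPPORTED OPERATION CODES data.

    Assumes `one_command_data` is the raw one-command parameter data
    (already decoded from the hex dump); SUPPORT is byte 1, bits 2..0.
    """
    if len(one_command_data) < 2:
        # Too short to contain a SUPPORT field: the drive told us nothing.
        return CryptoSupport.UNKNOWN
    support = one_command_data[1] & 0x07
    if support in (0b011, 0b101):
        # Supported per the standard, or in a vendor-specific manner.
        return CryptoSupport.SUPPORTED
    if support == 0b001:
        return CryptoSupport.NOT_SUPPORTED
    # 000b (data not currently available) and reserved values.
    return CryptoSupport.UNKNOWN
```

Treating reserved values and short responses as unknown, rather than as not supported, matches the intent stated in this message: a drive that reports nothing should still get a crypto-erase attempt.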
Decision tree (SAS/SCSI), keyed on probed crypto-erase support:
- Supported -> crypto erase
- NotSupported -> block erase if supported, else ATA Secure Erase, else fail
- Unknown -> attempt crypto erase; on failure, block erase if supported,
  else ATA Secure Erase, else fail
"Unknown" = the drive didn't report capability; attempting crypto keeps
already-working drives working, and the fallback recovers drives that can't
crypto erase. "Fail" holds the machine out of the pool rather than returning it
unwiped. SATA drives use ATA Secure Erase; the NVMe path is unchanged.
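The decision tree can be sketched as a small planner plus a runner. This is a minimal illustration, not Scout's implementation: every name here (`plan_erase`, `erase_drive`, `EraseFailed`, the `run` callback) is hypothetical, and it assumes the fallback picks exactly one of block erase or ATA Secure Erase based on probed block-erase support.

```python
from enum import Enum
from typing import Callable, List


class CryptoSupport(Enum):
    SUPPORTED = "supported"
    NOT_SUPPORTED = "not_supported"
    UNKNOWN = "unknown"


class EraseMethod(Enum):
    CRYPTO_ERASE = "crypto_erase"
    BLOCK_ERASE = "block_erase"
    ATA_SECURE_ERASE = "ata_secure_erase"


class EraseFailed(Exception):
    """Every planned method failed; keep the machine out of the pool."""


def plan_erase(crypto: CryptoSupport, block_supported: bool) -> List[EraseMethod]:
    """Return the erase methods to attempt, in order."""
    # Assumption: block erase when the probe reports it, otherwise ATA Secure Erase.
    fallback = (EraseMethod.BLOCK_ERASE if block_supported
                else EraseMethod.ATA_SECURE_ERASE)
    if crypto is CryptoSupport.SUPPORTED:
        return [EraseMethod.CRYPTO_ERASE]
    if crypto is CryptoSupport.NOT_SUPPORTED:
        return [fallback]
    # Unknown: try crypto erase first, then the fallback.
    return [EraseMethod.CRYPTO_ERASE, fallback]


def erase_drive(crypto: CryptoSupport, block_supported: bool,
                run: Callable[[EraseMethod], bool]) -> EraseMethod:
    """Attempt each planned method; `run` returns True on success."""
    for method in plan_erase(crypto, block_supported):
        if run(method):
            return method
    raise EraseFailed("no erase method succeeded; drive was not wiped")
```

Separating the plan from the execution would let the command selection and fallback order be unit-tested with a fake `run`, without touching real hardware.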
Manual testing required (cannot run in CI — there's no SCSI hardware, and unit
tests don't execute the real erase/probe commands):
- Affected drive: probe + fallback erases it (verify LBA 0 / sample sectors
wiped) and the machine returns to the pool.
- Known-good SAS SED: still erases via crypto erase.
- Block-erase fallback: a drive supporting block but not crypto erases and is
verified wiped.
Related: NVIDIA#429
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: adnandnv <258082943+adnandnv@users.noreply.github.com>
This PR does not fix the actual bug that was encountered. The affected H100 systems report a Virtual SD0 USB-media drive. With this PR, the code still enumerates every /dev/sd* from /sys/block and sends it into clean_this_block_device; there is no hidden/removable/USB skip. The reported device (TRAN=usb, TYPE=disk, SIZE=1G) would still enter the SAS/SCSI path, probe/attempt sanitize, then fall through to ATA Secure Erase and fail closed. That still leaves cleanup failed.
This PR implements a broader policy change, not a targeted hotfix. It changes SAS/SCSI cleanup from "supported fleet SAS SEDs use crypto erase" to "probe capabilities, try block erase, maybe try ATA Secure Erase." That might be valuable later for real SAS/SCSI drives that lack crypto erase, but it needs hardware validation because it can return non-crypto-erased drives to the pool via different erase methods.
The tests mostly cover parsing/classification helpers. They do not actually exercise command selection/fallback behavior, and they do not cover the USB/removable/hidden device case. The PR body says the erase path needs manual hardware validation, which I agree with.
PR #2317 is narrower and directly addresses the reported failure mode: BMC virtual USB media should not be treated as a cleanup target. PR #2319 is a reasonable separate follow-up for the "real SAS/SCSI drive rejects crypto sanitize" case, which could actually happen, but it does not fix the bug it is targeting.
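To make the missing skip concrete, here is a minimal sketch of a device filter of the kind the review describes. It is not taken from PR #2317 or from Scout; `is_cleanup_target` and `cleanup_targets` are hypothetical names, and the record keys (`name`, `tran`, `type`, `rm`) assume the column names that `lsblk --json` emits.

```python
from typing import Dict, Iterable, List


def is_cleanup_target(dev: Dict[str, object]) -> bool:
    """Decide whether a block device should be wiped during deprovision."""
    if dev.get("type") != "disk":
        return False
    # BMC virtual media shows up with a USB transport.
    if dev.get("tran") == "usb":
        return False
    # Removable flag may be a bool or a "0"/"1" string depending on lsblk version.
    if str(dev.get("rm", "0")).lower() in ("1", "true"):
        return False
    return True


def cleanup_targets(devices: Iterable[Dict[str, object]]) -> List[str]:
    """Return the names of devices that should enter the erase path."""
    return [str(d["name"]) for d in devices if is_cleanup_target(d)]


devices = [
    {"name": "sda", "tran": "sas", "type": "disk", "rm": False},
    {"name": "sdb", "tran": "usb", "type": "disk", "rm": True},  # virtual media
]
print(cleanup_targets(devices))
```

With a filter like this in front of the erase path, the virtual USB drive would never reach the sanitize probe, so the fallback chain would not be exercised for it at all.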
Description
Scout's deprovision issued SCSI SANITIZE crypto erase unconditionally on SAS/SCSI
drives. Drives that don't support crypto erase reject it ("Illegal request,
Invalid opcode"), failing cleanup and leaving the machine stuck retrying. Scout
now probes the drive's reported capability (the SUPPORT field from REPORT
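SUPPORTED OPERATION CODES, via `sg_opcodes --hex`) and picks the erase method.
Decision tree (SAS/SCSI), keyed on probed crypto-erase support:
- Supported -> crypto erase
- NotSupported -> block erase if supported, else ATA Secure Erase, else fail closed
- Unknown -> attempt crypto erase; on failure, block erase if supported, else
  ATA Secure Erase, else fail closed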
"Unknown" = the drive didn't report capability; attempting crypto keeps
already-working drives working, and the fallback recovers drives that can't crypto
erase. "Fail closed" holds the machine out of the pool rather than returning it
unwiped. SATA drives use ATA Secure Erase; the NVMe path is unchanged.
Type of Change
Related Issues (Optional)
#429
Breaking Changes
Testing
Additional Notes
Unit tests cover the probe parsing, capability classification, and the decision
logic. The erase path can't run in CI (there's no SCSI hardware, and unit tests
don't execute the real erase/probe commands), so the following needs validation
on real hardware before rollout:
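- Affected drive: probe + fallback erases it (verify LBA 0 / sample sectors
  wiped) and the machine returns to the pool.
- Known-good SAS SED: still erases via crypto erase.
- Block-erase fallback: a drive supporting block but not crypto erases and is
  verified wiped.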