Silent SEGV in host proxy when ENABLE_GPU_IPC=false with MPI backend; no fail-fast when PCIe atomics unavailable #20

@SHREYASINGH29

Summary

On a 2× Intel Data Center GPU Max 1100 host (PCIe-only, no XeLink), ishmem's distributed PUT/GET path fails in two ways depending on ENABLE_GPU_IPC:

  1. ENABLE_GPU_IPC=true (default): GPU-side AtomicAccessViolation during cross-PE synchronization. This is expected on platforms without PCIe AtomicOps, but ishmem's README doesn't warn about it.
  2. ENABLE_GPU_IPC=false: the host proxy thread silently SEGVs (SEGV_MAPERR) at a USM device address. No error message, just a crash.

Collectives backed by MPI (reduce_*, barrier, sync) work in both configurations. Point-to-point device ops (put, get, alltoall, broadcast, collect, AMOs) fail.


Environment

  • Hardware: 2× Intel Data Center GPU Max 1100 (PVC, intel_gpu_pvc), PCIe-attached, single node
  • Kernel: Linux 5.15.0 (i915 driver; xe not available on this kernel)
  • Intel oneAPI: 2025.3 (DPC++/C++ Compiler 2025.3.2, MPI 2021.17 Build 20251215)
  • Intel SHMEM: 1.5.0 (built from source on the main branch; also tested with the prebuilt intel-shmem-1.5.0.224_offline.sh, same behavior)
  • NEO compute runtime: 25.18.33578.51
  • ishmem build: cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DENABLE_MPI=ON -DMPI_DIR=/opt/intel/oneapi/mpi/2021.17

Test results (2 PEs, one on each GPU)

Test                                    ENABLE_GPU_IPC=true         ENABLE_GPU_IPC=false
1_helloworld                            ✅ pass                     ✅ pass
align, init_attr, timestamp             ✅ pass                     ✅ pass
barrier, sync                           ✅ pass                     ✅ pass
reduce_sum/min/max/prod/and/or/xor      ✅ pass                     ✅ pass
put, get, put_nbi, get_nbi, ibput       ❌ AtomicAccessViolation    ❌ Host SEGV
alltoall, broadcast, collect, fcollect  ❌ AtomicAccessViolation    ❌ Host SEGV
amo_* (all variants)                    ❌ AtomicAccessViolation    ❌ Host SEGV

Same pattern observed in test/performance/*_bw benchmarks.


Bug 1: Silent SEGV with ENABLE_GPU_IPC=false

Reproduction

source /opt/intel/oneapi/setvars.sh
export EnableImplicitScaling=0 NEOReadDebugKeys=1
export ISHMEM_ENABLE_GPU_IPC=false
export ISHMEM_DEBUG=1
cd $BUILD/test/unit
mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./put'

Output

[0] Testing device with device memory
[1] Testing device with device memory
BAD TERMINATION ... RANK 1 ... KILLED BY SIGNAL: 11 (Segmentation fault)

strace shows the SEGV is in the host proxy thread (spawned in proxy_init), not the main thread, with si_code=SEGV_MAPERR and si_addr=0xff000000006096c0, an Intel GPU USM virtual address that is not mapped in the CPU process.
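To confirm that a faulting address such as the si_addr above is absent from the CPU address space, one option is to compare it against the process's memory map. The following is a minimal sketch, assuming the Linux /proc/<pid>/maps text format (each line begins with a hexadecimal "start-end" range). The function name address_is_mapped is hypothetical and not part of ishmem.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if `addr` lies inside any [start, end)
// range listed in /proc/<pid>/maps-formatted text read from `maps`.
bool address_is_mapped(std::istream &maps, std::uint64_t addr)
{
    std::string line;
    while (std::getline(maps, line)) {
        // A maps line starts with "start-end " in hex,
        // e.g. "7f1c2a000000-7f1c2a021000 rw-p 00000000 00:00 0".
        const std::size_t dash = line.find('-');
        const std::size_t space = line.find(' ');
        if (dash == std::string::npos || space == std::string::npos || dash > space) {
            continue;  // not a range line
        }
        try {
            const std::uint64_t start = std::stoull(line.substr(0, dash), nullptr, 16);
            const std::uint64_t end =
                std::stoull(line.substr(dash + 1, space - dash - 1), nullptr, 16);
            if (addr >= start && addr < end) {
                return true;
            }
        } catch (const std::exception &) {
            continue;  // skip lines whose range does not parse as hex
        }
    }
    return false;
}
```

On a live process, the stream would typically be a std::ifstream opened on /proc/self/maps (or /proc/<pid>/maps for the crashing rank). The function takes a std::istream so that it can also be run against a saved copy of the map.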

Possible root cause

With ENABLE_GPU_IPC=false:

  • ipc_init() is skipped (ishmem.cpp:326), so local_pes[remote_pe] = 0.
  • Device ishmem_put sees local_index == 0, falls through to ishmemi_proxy_blocking_request (rma_impl.h:39).
  • Host proxy thread dequeues the request and calls ishmemi_upcall_funcs[PUT][UINT8] → ishmem_uint8_put on the host (proxy_func.cpp:17-22).
  • The host-side ishmem_internal_put calls ishmemi_ipc_put (rma_impl.h:40), which hits get_ipc_buffer(pe, dst) → returns nullptr because local_pes[pe] == 0 (runtime_ipc.h:28).
  • ishmemi_ipc_put returns non-zero, so the code falls through to ishmemi_runtime->proxy_funcs[PUT][UINT8], i.e. the MPI backend (rma_impl.h:41).
  • MPI backend calls MPI_Put(src, ..., dest_disp, ..., win) where win was created with ishmemi_heap_base (a GPU USM pointer) as the base (runtime_mpi.cpp:1384).
  • Intel MPI 2021.17 (without I_MPI_OFFLOAD=2) does not handle GPU USM as a window base; it tries to CPU-dereference the address in its RMA backend → SEGV at the USM address.
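The chain above suggests where a fail-fast check could sit: at the point where a request has no IPC mapping and is about to be handed to a host backend that may not accept device pointers. The sketch below is a standalone model of that decision, not ishmem code. Every name in it (Route, PutContext, route_put) is hypothetical, and how a real backend would report device-pointer support is an assumption left open here.

```cpp
#include <string>

// Hypothetical model of the routing decision; none of these names exist in ishmem.
enum class Route { DeviceIpc, HostBackend, Rejected };

struct PutContext {
    bool peer_is_ipc_mapped;          // models local_pes[pe] != 0
    bool backend_accepts_device_ptr;  // models a backend configured for GPU RMA (assumption)
};

struct Decision {
    Route route;
    std::string reason;
};

// Chooses a route for a put, refusing explicitly instead of letting the host
// backend dereference a device address it cannot handle.
Decision route_put(const PutContext &ctx)
{
    if (ctx.peer_is_ipc_mapped) {
        return {Route::DeviceIpc, "peer heap is mapped through IPC"};
    }
    if (ctx.backend_accepts_device_ptr) {
        return {Route::HostBackend, "proxy forwards the request to the host backend"};
    }
    return {Route::Rejected,
            "no IPC mapping and the host backend cannot take device pointers; "
            "report an error instead of crashing in the proxy thread"};
}
```

In this model the configuration from Bug 1 (no IPC mapping, backend without device-pointer support) ends in Route::Rejected with a readable reason, rather than in a SEGV.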

Partial workaround

Setting I_MPI_OFFLOAD=2 I_MPI_OFFLOAD_RMA=1 (Intel MPI 2021.17) changes the behavior:

  • No SEGV.
  • But the data is corrupted: received bytes are stamped with 0x8080808080808080 / 0x8181818181818181 patterns instead of the actual payload. This is probably a separate Intel MPI / libfabric bug with GPU RMA, but it surfaces here because ishmem has no way to validate the backend's GPU RMA support.
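One way a backend's GPU RMA support could be validated is a small self-test that transfers a known payload and compares what arrives. The host-side comparison could look like the sketch below. Both helper names are hypothetical, and the transfer itself is out of scope; the code only checks buffers already copied back to host memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: index of the first element where `received` differs
// from `expected`, or -1 if the buffers are identical. A length difference
// counts as a mismatch at the shorter length.
std::ptrdiff_t first_mismatch(const std::vector<std::uint64_t> &expected,
                              const std::vector<std::uint64_t> &received)
{
    const std::size_t n = expected.size() < received.size() ? expected.size() : received.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != received[i]) {
            return static_cast<std::ptrdiff_t>(i);
        }
    }
    if (expected.size() != received.size()) {
        return static_cast<std::ptrdiff_t>(n);
    }
    return -1;
}

// Hypothetical helper: true when all eight bytes of `word` are equal, as in
// 0x8080808080808080. Note that zero also satisfies this.
bool is_repeated_byte_fill(std::uint64_t word)
{
    const std::uint64_t low_byte = word & 0xffULL;
    return word == low_byte * 0x0101010101010101ULL;
}
```

A mismatching word that is also a repeated-byte fill points to a stamped pattern like the one reported above, rather than to a partial or misaligned copy.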

Bug 2: No guidance when PCIe AtomicOps unavailable

Reproduction

Defaults (ENABLE_GPU_IPC=true), same hardware:

mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./alltoall-device'

Output

Segmentation fault from GPU at 0xff00000020210000, ctx_id: 1 (CCS)
  type: 2 (AtomicAccessViolation), level: 0 (PTE), access: 2 (Atomic),
  banned: 1, aborting.
Abort was called at 288 line in file: ./shared/source/os_interface/linux/drm_neo.cpp

Root cause

ishmemi_team_sync (collectives/sync_impl.h:56) does atomic_psync += 1L on the remote PE's psync counter, translating the address through ISHMEMI_FAST_ADJUST. This is a cross-device atomic fetch-add over PCIe. On PCIe-attached Max GPUs where PCIe AtomicOps are not available (they need a BIOS setting plus a modern xe driver), the kernel rejects the atomic at the PTE level.
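A fail-fast check at init time could turn this GPU abort into a readable diagnostic. The sketch below models such a check on a peer-to-peer capability bitmask. The two bit constants are stand-ins that, to my understanding, mirror Level Zero's ZE_DEVICE_P2P_PROPERTY_FLAG_ACCESS and ZE_DEVICE_P2P_PROPERTY_FLAG_ATOMICS as returned by zeDeviceGetP2PProperties; treat that mapping as an assumption. Whether the driver reports the atomics flag accurately on the affected platform is not established here, and check_p2p_for_ipc is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Stand-in capability bits (assumed to mirror the Level Zero P2P property flags).
constexpr std::uint32_t kP2PAccess = 1u << 0;   // peer memory is accessible
constexpr std::uint32_t kP2PAtomics = 1u << 1;  // peer atomics are supported

// Hypothetical init-time check: returns an empty string when the IPC path
// looks usable, otherwise a diagnostic suitable for printing before exit.
std::string check_p2p_for_ipc(std::uint32_t p2p_flags, bool enable_gpu_ipc)
{
    if (!enable_gpu_ipc) {
        return "";  // the IPC path is not used, so there is nothing to check here
    }
    if ((p2p_flags & kP2PAccess) == 0) {
        return "peer device memory is not accessible; GPU IPC cannot be used";
    }
    if ((p2p_flags & kP2PAtomics) == 0) {
        return "peer atomics are not supported (PCIe AtomicOps unavailable?); "
               "cross-PE synchronization would fault with AtomicAccessViolation";
    }
    return "";
}
```

With ENABLE_GPU_IPC=true and access reported without atomics, this model returns the second diagnostic, which matches the failure described in this section.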

Things that do NOT fix it (tested)

  • ISHMEM_ENABLE_GPU_IPC_PIDFD=false (forces socket-based IPC)
  • EnableConcurrentSharedCrossP2PDeviceAccess=1 (NEO debug key)
  • DisableScratchPages=0 EnableRecoverablePageFaults=1 (suppresses the abort, but the run then fails with UR_RESULT_ERROR_DEVICE_LOST)

Also affecting

Same-GPU placement (ZE_AFFINITY_MASK=0 on both PEs) hangs indefinitely on barrier/reduce_sum/put; presumably the atomic-spin synchronization deadlocks when two contexts share a device. This probably won't be prioritized, but I'm noting it.


Things that work fine

  • Single PE (mpirun -n 1): all tests pass
  • Collectives backed by MPI (reductions, barriers, sync): work in both ENABLE_GPU_IPC={true,false} modes
  • ishmem build, runtime, and local-device code paths are all healthy
