Silent SEGV in host proxy when `ENABLE_GPU_IPC=false` with MPI backend; no fail-fast when PCIe atomics unavailable


## Summary

On a 2× Intel Data Center GPU Max 1100 host (PCIe-only, no XeLink), ishmem's distributed PUT/GET path fails in two ways depending on `ENABLE_GPU_IPC`:

1. **`ENABLE_GPU_IPC=true` (default)**: GPU-side `AtomicAccessViolation` during cross-PE synchronization. This is expected on platforms without PCIe AtomicOps, but ishmem's README doesn't warn about it.
2. **`ENABLE_GPU_IPC=false`**: Host proxy thread silently SEGVs (`SEGV_MAPERR`) at a USM device address. No error message, just crash.

Collectives backed by MPI (`reduce_*`, `barrier`, `sync`) work in both configurations. Device-initiated data movement (`put`, `get`, `alltoall`, `broadcast`, `collect`, AMOs) fails in both.

---

## Environment

- Hardware: 2× Intel Data Center GPU Max 1100 (PVC, `intel_gpu_pvc`), PCIe-attached, single node
- Kernel: Linux 5.15.0 (i915 driver; `xe` not available on this kernel)
- Intel oneAPI: 2025.3 (DPC++/C++ Compiler 2025.3.2, MPI 2021.17 Build 20251215)
- Intel SHMEM: 1.5.0 (built from source on the `main` branch; also tested with the prebuilt `intel-shmem-1.5.0.224_offline.sh`, same behavior)
- NEO compute runtime: 25.18.33578.51
- ishmem build: `cmake .. -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DENABLE_MPI=ON -DMPI_DIR=/opt/intel/oneapi/mpi/2021.17`

---

## Test results (2 PEs, one on each GPU)

| Test | `ENABLE_GPU_IPC=true` | `ENABLE_GPU_IPC=false` |
|---|---|---|
| `1_helloworld` | ✅ pass | ✅ pass |
| `align`, `init_attr`, `timestamp` | ✅ pass | ✅ pass |
| `barrier`, `sync` | ✅ pass | ✅ pass |
| `reduce_sum/min/max/prod/and/or/xor` | ✅ pass | ✅ pass |
| `put`, `get`, `put_nbi`, `get_nbi`, `ibput` | ❌ AtomicAccessViolation | ❌ Host SEGV |
| `alltoall`, `broadcast`, `collect`, `fcollect` | ❌ AtomicAccessViolation | ❌ Host SEGV |
| `amo_*` (all variants) | ❌ AtomicAccessViolation | ❌ Host SEGV |

Same pattern observed in `test/performance/*_bw` benchmarks.

---

## Bug 1: Silent SEGV with `ENABLE_GPU_IPC=false`

### Reproduction

```bash
source /opt/intel/oneapi/setvars.sh
export EnableImplicitScaling=0 NEOReadDebugKeys=1
export ISHMEM_ENABLE_GPU_IPC=false
export ISHMEM_DEBUG=1
cd $BUILD/test/unit
mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./put'
```

### Output

```
[0] Testing device with device memory
[1] Testing device with device memory
BAD TERMINATION ... RANK 1 ... KILLED BY SIGNAL: 11 (Segmentation fault)
```

`strace` shows the SEGV is in the host proxy thread (spawned in `proxy_init`), not the main thread, with `si_code=SEGV_MAPERR, si_addr=0xff000000006096c0`. That address is an Intel GPU USM virtual address that is not mapped in the CPU process.

### Possible root cause

With `ENABLE_GPU_IPC=false`:

- `ipc_init()` is skipped (`ishmem.cpp:326`), so `local_pes[remote_pe] = 0`.
- Device `ishmem_put` sees `local_index == 0`, falls through to `ishmemi_proxy_blocking_request` (`rma_impl.h:39`).
- Host proxy thread dequeues the request and calls `ishmemi_upcall_funcs[PUT][UINT8]` → `ishmem_uint8_put` on the host (`proxy_func.cpp:17-22`).
- The host-side `ishmem_internal_put` calls `ishmemi_ipc_put` (`rma_impl.h:40`), which hits `get_ipc_buffer(pe, dst)` → returns `nullptr` because `local_pes[pe] == 0` (`runtime_ipc.h:28`).
- `ishmemi_ipc_put` returns non-zero, so the code falls through to `ishmemi_runtime->proxy_funcs[PUT][UINT8]`, i.e. the MPI backend (`rma_impl.h:41`).
- MPI backend calls `MPI_Put(src, ..., dest_disp, ..., win)` where `win` was created with `ishmemi_heap_base` (a GPU USM pointer) as the base (`runtime_mpi.cpp:1384`).
- Intel MPI 2021.17 (without `I_MPI_OFFLOAD=2`) does not handle GPU USM as a window base. It tries to CPU-dereference the address in its RMA backend → SEGV at the USM address.

### Partial workaround

Setting `I_MPI_OFFLOAD=2 I_MPI_OFFLOAD_RMA=1` (Intel MPI 2021.17) changes the behavior:

- No SEGV.
- But the data is corrupted: received bytes are stamped with `0x8080808080808080` / `0x8181818181818181` patterns instead of the actual payload. This is probably a separate Intel MPI / libfabric bug with GPU RMA, but it surfaces here because ishmem has no way to validate the backend's GPU RMA support.
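
Until ishmem checks this itself, a launch wrapper could at least turn the silent SEGV into an explicit message. The following is a minimal sketch of such a user-side preflight, not anything ishmem provides: the function name `ishmem_preflight` is hypothetical, and treating any `I_MPI_OFFLOAD` value other than `2` as "no GPU RMA" is an assumption based only on the behavior observed above.

```bash
# Hypothetical preflight (not part of ishmem): refuse to launch when GPU IPC is
# disabled and Intel MPI GPU offload has not been requested.
# Assumption: only I_MPI_OFFLOAD=2 (the value tested in this report) avoids the SEGV.
ishmem_preflight() {
    local ipc="${ISHMEM_ENABLE_GPU_IPC:-true}"
    local offload="${I_MPI_OFFLOAD:-0}"
    if [ "$ipc" = "false" ] && [ "$offload" != "2" ]; then
        echo "preflight: ISHMEM_ENABLE_GPU_IPC=false sends put/get through MPI_Put on a GPU USM window;" >&2
        echo "preflight: I_MPI_OFFLOAD=2 is not set, so a host proxy SEGV is likely" >&2
        return 1
    fi
    return 0
}

# Example use in a launch script (command taken from the reproduction above):
#   ishmem_preflight && mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./put'
```

Passing this check only avoids the SEGV path. As noted above, the payload can still arrive corrupted with the offload variables set, so it is not a correctness guarantee.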

---

## Bug 2: No guidance when PCIe AtomicOps unavailable

### Reproduction

Defaults (`ENABLE_GPU_IPC=true`), same hardware:

```bash
mpirun -n 2 bash -c 'export ZE_AFFINITY_MASK=$MPI_LOCALRANKID; exec ./alltoall-device'
```

### Output

```
Segmentation fault from GPU at 0xff00000020210000, ctx_id: 1 (CCS)
  type: 2 (AtomicAccessViolation), level: 0 (PTE), access: 2 (Atomic),
  banned: 1, aborting.
Abort was called at 288 line in file: ./shared/source/os_interface/linux/drm_neo.cpp
```

### Root cause

`ishmemi_team_sync` (`collectives/sync_impl.h:56`) does `atomic_psync += 1L` on the remote PE's psync counter, translating the address through `ISHMEMI_FAST_ADJUST`. This is a cross-device atomic fetch-add over PCIe. On PCIe-attached Max GPUs where PCIe AtomicOps are not enabled (enabling them appears to require a BIOS setting plus a modern `xe` driver), the kernel rejects the atomic at the PTE level.
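
For anyone trying to confirm the same condition, the PCIe configuration space can be inspected with `lspci -vvv`, which reports `AtomicOpsCap` under `DevCap2` and `AtomicOpsCtl` under `DevCtl2`. The helper below is a small sketch that filters those fields out of `lspci` text; the function name `pcie_atomics_summary` is made up for illustration, and it only reports what the devices advertise, not whether the driver stack actually enables atomics.

```bash
# Print the AtomicOpsCap / AtomicOpsCtl fields found in `lspci -vvv` text on stdin.
# Returns 1 (with a note on stderr) if neither field appears in the input.
pcie_atomics_summary() {
    local found
    # Keep only the part of each line starting at the AtomicOps field name.
    found=$(sed -nE 's/^.*(AtomicOps(Cap|Ctl):.*)$/\1/p')
    if [ -z "$found" ]; then
        echo "no AtomicOps fields in input (full capability dump may need root)" >&2
        return 1
    fi
    printf '%s\n' "$found"
}

# Example use, where <bdf> is the bus:device.function of a GPU or its root port:
#   sudo lspci -vvv -s <bdf> | pcie_atomics_summary
```

Checking both the GPU endpoints and the root ports above them is probably needed, since the request has to be routed through the whole path between the two devices.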

### Things that do NOT fix it (tested)

- `ISHMEM_ENABLE_GPU_IPC_PIDFD=false` (forces socket-based IPC)
- `EnableConcurrentSharedCrossP2PDeviceAccess=1` (NEO debug key)
- `DisableScratchPages=0 EnableRecoverablePageFaults=1` (suppresses abort, but then `UR_RESULT_ERROR_DEVICE_LOST`)

---

## Also affecting

Same-GPU placement (`ZE_AFFINITY_MASK=0` on both PEs) hangs indefinitely on `barrier`/`reduce_sum`/`put`; presumably the atomic-spin synchronization deadlocks when two contexts share a device. Probably won't be prioritized, but noting it.

---

## Things that work fine

- Single PE (`mpirun -n 1`): all tests pass
- Collectives backed by MPI (reductions, barriers, sync): work in both `ENABLE_GPU_IPC={true,false}` modes
- ishmem build, runtime, and local-device code paths are all healthy
