ISHMEM on Aurora: Unit test wait_until_all-on_queue-2 hanging #15

@colleeneb

Running the unit tests on Aurora with oneapi/public/2025.3.0 results in one test hanging until it times out.

Build info:

    git clone --recurse-submodules https://github.com/Sandia-OpenSHMEM/SOS.git SOS
    cd SOS
    ./autogen.sh
    CC=icx CXX=icpx ./configure --prefix=$PWD/install_sos --with-ofi=/opt/cray/libfabric/1.22.0 --enable-pmi-simple --enable-ofi-mr=basic --disable-ofi-inject --enable-ofi-hmem --disable-bounce-buffers --enable-ofi-manual-progress --enable-mr-endpoint --disable-nonfetch-amo --enable-manual-progress 2>&1 | tee SOS_config.log
    make -j 2>&1 | tee SOS_build.log
    make install
    cd ../

    export LD_LIBRARY_PATH=$PWD/SOS/install_sos/lib/:$LD_LIBRARY_PATH
    export LIBRARY_PATH=$PWD/SOS/install_sos/lib/:$LIBRARY_PATH
    export PATH=$PWD/SOS/install_sos/bin/:$PATH
    export CPATH=$PWD/SOS/install_sos/include:$CPATH

    git clone https://github.com/oneapi-src/ishmem.git
    cd ishmem
    mkdir -p build_sos
    cd build_sos
    CC=icx CXX=icpx cmake .. -DENABLE_OPENSHMEM=ON -DSHMEM_DIR=$PWD/../../SOS/install_sos -DCMAKE_INSTALL_PREFIX=$PWD/../../SOS/install_sos_ishmem -DBUILD_UNIT_TESTS=ON -DBUILD_PERF_TESTS=ON -DBUILD_APPS=ON -DCTEST_LAUNCHER=mpi 2>&1 | tee ISHMEM_config_sos.log
    make -j 2>&1 | tee ISHMEM_build.log
    ctest --test-dir ./test/unit --verbose --timeout 300 --no-tests=error |& tee -a cmake_tests.log

Using these envs:

    export FI_CXI_OPTIMIZED_MRS=0
    export ISHMEM_RUNTIME=OPENSHMEM
    export SHMEM_OFI_PROVIDER="cxi"
    export EnableImplicitScaling=0
    export NEOReadDebugKeys=1

Error from ctest:

The following tests FAILED:
        161 - wait_until_all-on_queue-2 (Timeout)
Errors while running CTest

From a backtrace it looks like it hangs here:

Thread 1.1 (Thread 0x148098dc4f80 (LWP 85264) "wait_until_all"):
#0  0x0000148096903cae in ?? () from /usr/lib64/libze_intel_gpu.so.1
#1  0x0000148096936609 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#2  0x0000148096934af0 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#3  0x00001480969440e2 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#4  0x0000148096945d27 in ?? () from /usr/lib64/libze_intel_gpu.so.1
#5  0x000014809655e4de in ?? () from /usr/lib64/libze_intel_gpu.so.1
#6  0x000000000048197e in ishmemi_usm_free(void*) ()
#7  0x000000000048c991 in ishmemi_proxy_fini() ()
#8  0x0000000000439b56 in ishmem_finalize() ()
#9  0x0000000000423a6d in ishmem_tester::~ishmem_tester() ()
#10 0x000000000042366d in main ()

Note that the test does not hang with the 2025.2 SDK, but it does with the 2025.3 SDK. I see the behavior with 1146.31 and 1146.12, although it's not consistent: it hangs in maybe 50% of runs.
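
Since the hang is intermittent, a small loop that reruns only the failing test and counts the failures may help pin down the rate more precisely. The following is a minimal sketch, not something from the ishmem repository: the helper name `count_failures` is hypothetical, and it assumes that ctest returns a non-zero exit code when the test times out and that `-R` selects the test by its name.

```shell
# count_failures N CMD [ARGS...]
# Hypothetical helper: runs CMD N times and prints how many runs
# exited with a non-zero status (for ctest, a timeout counts as a failure).
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Example usage from the build_sos directory (assumed invocation):
#   count_failures 10 ctest --test-dir ./test/unit -R 'wait_until_all-on_queue-2' --timeout 300
```

Running the same loop under both SDK versions with the same environment variables would give a direct comparison of the hang rate.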

Thanks!