MPI Jobs Hanging

Description: Normal MPI jobs are confirmed to hang on both BlueRidge (BR) and HoneyBadger (HB). The correct number of processes start, but some of them go into the Sleep state.

Diagnosis (10/21/13): The MIC installation adds an extra IB interface - the non-MIC nodes have only mlx4_0 while the MIC nodes also have scif0. MPI was defaulting to the latter.

Solution: The cause of the MPI hangs has been identified. The MPI installations now need to be updated to force MPI to use mlx4_0 by default.
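To check which InfiniBand devices a given node exposes, something like the following sketch may help. It assumes the devices appear under the usual Linux sysfs path /sys/class/infiniband; the function names list_hcas and has_extra_hca are made up for illustration, and the optional directory argument exists only so the functions can be tried against a scratch directory.

```shell
# List the InfiniBand devices a node exposes, one per line.
# Assumption: devices show up under /sys/class/infiniband (typical on Linux).
list_hcas() {
    dir="${1:-/sys/class/infiniband}"
    [ -d "$dir" ] || return 0
    ls -1 "$dir"
}

# Succeed if the node has any device other than mlx4_0 (for example scif0).
has_extra_hca() {
    list_hcas "$1" | grep -qv '^mlx4_0$'
}
```

On a MIC node, list_hcas would be expected to print both mlx4_0 and scif0, and has_extra_hca would succeed. Where the verbs utilities are installed, ibv_devices gives similar information.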

Testing

Original Issue:

A basic hello world program fails in a two-node job, although it runs successfully with two processes on a single node. There seems to be an issue with the fabric that prevents MPI_Init from completing. Here is the mvapich2 output:

 [jkrometi@br002 mpi_norm]$ cat hf
 br002
 br002
 [jkrometi@br002 mpi_norm]$ mpirun -np 2 -hostfile ./hf ./mpihw
 Hello from task 0 on br002!
 MASTER: Number of MPI tasks is: 2
 Hello from task 1 on br002!
 [mpiexec@br002] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
 [mpiexec@br002] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
 [mpiexec@br002] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
 [mpiexec@br002] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion
 [jkrometi@br002 mpi_norm]$ vi hf
 [jkrometi@br002 mpi_norm]$ cat hf
 br002
 br003
 [jkrometi@br002 mpi_norm]$ mpirun -np 2 -hostfile ./hf ./mpihw
 Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
 Max MV2_SRQ_SIZE is 0, set to 4096
 Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
 Max MV2_SRQ_SIZE is 0, set to 4096
 [cli_0]: aborting job:
 Fatal error in MPI_Init:
 Other MPI error, error stack:
 MPIR_Init_thread(436)....:
 MPID_Init(371)...........: channel initialization failed
 MPIDI_CH3_Init(292)......:
 MPIDI_CH3I_RDMA_init(368):
 rdma_iba_hca_init(871)...: Attributes failed sanity check
 [cli_1]: aborting job:
 Fatal error in MPI_Init:
 Other MPI error, error stack:
 MPIR_Init_thread(436)....:
 MPID_Init(371)...........: channel initialization failed
 MPIDI_CH3_Init(292)......:
 MPIDI_CH3I_RDMA_init(368):
 rdma_iba_hca_init(871)...: Attributes failed sanity check

The OpenMPI error messages were quite lengthy, but looked something like this:

 [hb001:65181] Signal: Segmentation fault (11)
 [hb001:65181] Signal code: Address not mapped (1)
 [hb001:65181] Failing at address: 0x58

Full error messages can be found in MM_OpenMPI.o40 (for Intel) and MM_OpenMPI.o42 (for GCC) in /home/jkrometi/mic/mpi_norm/

Update (10/21/13):

Adding flags or environment variables to force MPI to use mlx4_0 fixes the errors. For mvapich2 set the MV2_IBA_HCA environment variable (or use the -env MV2_IBA_HCA=mlx4_0 flag):

 [jkrometi@br007 mpi_norm]$ export MV2_IBA_HCA=mlx4_0
 [jkrometi@br007 mpi_norm]$ mpiexec -np 2 -ppn 1 ./mpihw
 Hello from task 0 on br007!
 MASTER: Number of MPI tasks is: 2
 Hello from task 1 on br008!
 [mpiexec@br007] HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
 [mpiexec@br007] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
 [mpiexec@br007] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
 [mpiexec@br007] main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion

For OpenMPI set the OMPI_MCA_btl_openib_if_include environment variable (or use the --mca btl_openib_if_include mlx4_0:1 flag):

 [jkrometi@br007 mpi_norm]$ module swap mvapich2 openmpi
 [jkrometi@br007 mpi_norm]$ mpiexec -np 2 -hostfile hf --mca btl_openib_if_include mlx4_0:1 ./mpihw.ompi
 Hello from task 0 on br007!
 MASTER: Number of MPI tasks is: 2
 Hello from task 1 on br008!
 [jkrometi@br007 mpi_norm]$ export OMPI_MCA_btl_openib_if_include=mlx4_0:1
 [jkrometi@br007 mpi_norm]$ mpiexec -np 2 -hostfile hf ./mpihw.ompi
 Hello from task 0 on br007!
 MASTER: Number of MPI tasks is: 2
 Hello from task 1 on br008!

Both the mvapich2 and openmpi solutions have been tested on both the MIC and non-MIC nodes and via qsub. Now we need to update the installations to make sure these environment variables are set automatically.
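As a rough illustration of setting these variables automatically, the snippet below defaults both of them to mlx4_0 unless the user has already chosen a value. It is only a sketch: where it would live (a profile script, a modulefile equivalent, or a job prologue) is an assumption, not a description of how the installations were actually changed.

```shell
# Hypothetical login or job-prologue snippet; its location is an assumption.
# Default both MPI stacks to the Mellanox HCA, keeping any value the user already set.
: "${MV2_IBA_HCA:=mlx4_0}"                          # mvapich2
: "${OMPI_MCA_btl_openib_if_include:=mlx4_0:1}"     # OpenMPI, device mlx4_0 port 1
export MV2_IBA_HCA OMPI_MCA_btl_openib_if_include
```

Using the ":=" form rather than a plain assignment leaves room for a user to override the device for testing.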

Updates (10/24/13-11/3/2013): Tested the following apps with success on both MIC and non-MIC-enabled nodes using the flags above:

  • Gromacs benchmark (/home/arcadm/gromacs/gromacs_mictest.sh) - seemed like a good test of an app built against our MPI stack.
  • Abaqus multinode (modified version of the Ithaca example here)
  • I don't have permission to run Ansys, the other key application with its own MPI implementation, so Gene Cliff tested an Ansys Fluent run and it ran cleanly on the MIC nodes. So we should be okay there, too.
  • Amit ran some tests using OpenMPI and those ran fine as well.

The changes to the MPI modules are probably ready to roll out to the cluster as a whole.

Update (11/4/2013): Justin updated the mvapich2 and openmpi spec files to set the environment variables identified above and submitted a Jira ticket (UAS-664) to get the appropriate changes made to the modulefiles.

MIC Libraries

Description: Manual copying of libraries to the MICs on BR is required to get basic jobs to run. For example, libiomp5.so needs to be copied to MIC and added to the LD_LIBRARY_PATH for OpenMP jobs to work natively. See Bharath's scripts for more.

Status: Resolved.

Updates

10/13/13: Original Issue: helloflops2 is a simple OpenMP program from the Jeffers/Reinders book on programming for the MIC:

 [jkrometi@br001 jeffers]$ scp helloflops2 mic0:
 Warning: Permanently added 'mic0,10.11.110.1' (RSA) to the list of known hosts.
 helloflops2                                                                                                      100%   13KB  12.9KB/s   00:00
 [jkrometi@br001 jeffers]$ ssh mic0
 Warning: Permanently added 'mic0,10.11.110.1' (RSA) to the list of known hosts.
 [jkrometi@br001-mic0 jkrometi]$ ./helloflops2
 ./helloflops2: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory 

10/15/13: Justin checked on Stampede and when you SSH into a MIC, /opt/apps is surfaced and the only special environment variable is

 LD_LIBRARY_PATH=/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/mic/

10/16/13: Chris and Justin decide to mimic what is seen on Stampede by surfacing /opt/apps and setting:

 LD_LIBRARY_PATH=/opt/apps/intel/13.1/compiler/lib/mic:/opt/apps/intel13_1/mkl/11/lib/intel64

10/21/13: Chris and Brandon work to implement the solution above. This seems to have fixed the issue:

 [jkrometi@br007 omp]$ ssh mic1
 Warning: Permanently added 'mic1,10.11.210.7' (RSA) to the list of known hosts.
 ~ $ cd mic/jeffers/
 ~/mic/jeffers $ ./helloflops2
 Initializing
 Starting Compute
 Gflops =     25.600, Secs =      1.531, GFlops per sec =     16.718


MIC Module Unloading

Description: BlueRidge has a mic module that sets a number of environment variables. However, when the module is unloaded, the environment variables are not unset correctly.

Status: Resolved 10/16/13 by Justin and Chris.

HB MIC Module

Description: BlueRidge has a mic module that sets a number of environment variables. Once the mkl and mic modules are loaded, MKL offloading seems to work. However, HoneyBadger does not have a mic module.

Solution: Create a mic module to set the same environment variables as the BlueRidge module does. (Note: The BR module was changed on 10/16/13 by Justin and Chris to fix an error with the unloading of the module.)

Testing

Bharath's cblas_dgemm example yields:

  • >400 GFlops on BR MIC-enabled nodes once the mic module has been loaded.
  • >400 GFlops on HB once the above environment variables have been set.
  • No more than 270 GFlops on BR MIC-enabled nodes when MKL_MIC_ENABLE is turned off.
  • No more than 270 GFlops on HB when the above environment variables are not set.
  • No more than 270 GFlops on normal (non-MIC) BR nodes.

Offload Performance

Offload performance on BlueRidge appears to be about half of what it should be, and about half of what is attained in native mode. Here is the performance of a simple example from the Jeffers/Reinders book on MIC programming, run on BlueRidge:

 [jkrometi@br001 jeffers]$ MIC_KMP_AFFINITY=scatter ./helloflops3offload
 Initializing
 Starting Compute on 236 threads
 Using 236 threads...
 Gflops =   6041.600, Secs =      7.139, GFlops per sec =    846.243

Here's performance of the same code on Stampede:

 c557-804$ MIC_KMP_AFFINITY=scatter ./helloflops3offload
 Initializing
 Starting Compute on 240 threads
 Using 240 threads...
 Gflops =   6144.000, Secs =      3.360, GFlops per sec =   1828.753

An almost identical code achieved 1786.6 Gflops when running in Native mode on BlueRidge.

Note: Justin tried increasing the number of iterations to check whether this is overhead associated with offload jobs, such as copying arrays to the MIC. The performance did not improve, so this is likely not the case. The computations are simply being performed half as fast for some reason.
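To put a number on "about half", the figures quoted in this section can be compared directly. The small helper below is only a convenience sketch (the name ratio is made up); it divides one throughput figure by another using awk.

```shell
# Ratio of two throughput figures, printed to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 846.243 1828.753   # BlueRidge offload vs. Stampede offload -> 0.46
ratio 846.243 1786.6     # BlueRidge offload vs. BlueRidge native -> 0.47
```

Both comparisons come out slightly under one half.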


Dropped Packets

Description: This network issue is local to the MICs themselves. For some reason, there are tons of dropped packets on the MIC:

 mic0      Link encap:Ethernet  HWaddr CA:51:7D:50:93:08  
           inet addr:10.11.110.1  Bcast:0.0.0.0  Mask:255.255.0.0
           inet6 addr: fe80::c089:5bff:fef6:b637/64 Scope:Link
           UP BROADCAST RUNNING  MTU:1500  Metric:1
           RX packets:212474 errors:0 dropped:167685 overruns:0 frame:0
           TX packets:864 errors:0 dropped:0 overruns:0 carrier:0
           collisions:0 txqueuelen:1000 
           RX bytes:34786729 (33.1 MiB)  TX bytes:133014 (129.8 KiB)
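For scale, the RX counters in this listing can be turned into a drop percentage. The sketch below parses ifconfig output in the older net-tools format shown here; the function name rx_drop_pct is made up, and it assumes that dropped packets are counted within the RX packets total, which can depend on the driver.

```shell
# Print the percentage of received packets that were dropped,
# reading net-tools style ifconfig output on stdin.
rx_drop_pct() {
    awk '/RX packets:/ {
        split($2, rx, ":")    # field looks like "packets:212474"
        split($4, dr, ":")    # field looks like "dropped:167685"
        if (rx[2] > 0) printf "%.1f\n", 100 * dr[2] / rx[2]
    }'
}
```

It could be used as "ifconfig mic0 | rx_drop_pct". For the counters above (167685 dropped out of 212474 received) this works out to roughly 79%.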

The problem is bad enough that mounting /home would not work. As a workaround, we mount /home over TCP rather than the default UDP. Unlike UDP, TCP detects and retransmits lost packets itself.

Even though /home now works inside the MICs, it is still very slow because so many packets have to be retransmitted. We are hoping that upgrading the MIC environment to the latest available release will eliminate this bug.

Brandon has already downloaded the latest RPMs and put them onto the cluster. We just need some time to reprovision the nodes.
