Serpent2 on Kraken Cray XT5


Serpent2 on Kraken Cray XT5

Post by ondrejch » Tue Oct 15, 2013 10:03 pm

Hi everyone,

In this thread I will post information which may be useful for people who want to run Serpent2 on a setup similar to Kraken.

Compiling
Kraken comes with three C compilers: GCC, Intel, and PGI. To switch between them, use "module load PrgEnv-<gnu|intel|pgi>". All compilers are called via the same "cc" wrapper once the respective environment is loaded via modules. The PGI compiler available on Kraken is the 2012 release, which does not support C99's tgmath, so Serpent2 cannot be built with PGI. GCC-compiled Serpent2 runs faster than the Intel-compiled build, so stick with GCC. The only modification to the GCC section of the Makefile is changing gcc to cc:

Code:

###############################################################################
# first, module swap PrgEnv-pgi PrgEnv-gnu
# GNU Compiler:
CC       = cc   # gcc 
CFLAGS   = -Wall -ansi -ffast-math -O3
LDFLAGS  = -lm

# Parallel calculation using Open MP:
CFLAGS  += -DOPEN_MP
CFLAGS  += -fopenmp
LDFLAGS += -fopenmp

# Parallel calculation using MPI:
#CC      = mpicc      # the MPI wrapper is cc in Cray!
CFLAGS  += -DMPI
To build Serpent2 with plotting support, you have to build your own GD library, install it into $MYINSTALL, and modify the Serpent2 Makefile so that all the symbols from GD's dependencies are linked in as well. This is apparently because executables for the Cray XT compute nodes are statically linked. (A rough sketch of the GD build itself follows the Makefile lines below.)

Code:

# GD graphics library:
LDFLAGS += -L${MYINSTALL}/lib
CFLAGS  += -I${MYINSTALL}/include

LDFLAGS += -lgd -ldl -lpng -lz 
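The GD build itself is the usual autotools procedure; below is only a rough sketch of a static build into $MYINSTALL (the source directory name and configure flags are assumptions to adapt for the GD release you use):

Code:

# rough sketch only: static GD build into $MYINSTALL
# (directory name and configure flags are assumptions, adjust as needed)
module swap PrgEnv-pgi PrgEnv-gnu   # same compiler environment as for Serpent2
cd gd-2.x.y                         # unpacked GD source tree (placeholder)
./configure CC=cc --prefix=${MYINSTALL} --enable-static --disable-shared
make && make install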
Submitting jobs
Below is an example script to submit a job on 48 cores.

Code:

#PBS -S /bin/bash
#PBS -N serpent2testGCC
#PBS -q small
#PBS -l size=48,walltime=01:00:00
#PBS -A UT-NTNL0202

set -x   # echo commands as they are executed (bash equivalent of csh's "set echo")
cd /lustre/scratch/ondrejch/s2test/gcc/

# run on 4 nodes: 4 MPI tasks x 12 OpenMP threads = 48 cores
aprun -n 4 -d 12  /lustre/scratch/ondrejch/serpent/bin/sss2.gcc -omp 12 serpent2test.inp
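The script is then submitted with qsub as usual (the file name is just a placeholder):

Code:

qsub serpent2test.pbs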
Observations
* Intel-compiled code takes about 11% longer to run than GCC-compiled code.

* Serpent2 executed with the job script above ran for 12 minutes on 48x 2.6 GHz Opteron cores, compared to 18 minutes on 24x 3.4 GHz Xeon cores, which roughly matches the ratio of aggregate core-GHz (48 x 2.6 vs. 24 x 3.4).

* The main challenge we faced is memory limitation. Kraken has 16 GB of RAM per 12-core node, which quickly becomes an issue in depletion calculations. I tried the following energy grid reconstruction parameters, but this is still not enough and the code gets killed by the OOM killer.

Code:

set egrid 5E-4 5e-9 10
Using the double indexing method seems to work fine, however:

Code:

set dix 1
Ondrej Chvala, UT Knoxville


Re: Serpent2 on Kraken Cray XT5

Post by Jaakko Leppänen » Wed Oct 16, 2013 11:23 am

Serpent has an appetite for computer memory, which is a problem especially for burnup calculations. Grid thinning is a solution that was implemented in Serpent 1, because there was no other way around the problem. In Serpent 2, I suggest you use one of the lower optimization modes instead:

http://ttuki.vtt.fi/serpent/viewtopic.php?f=24&t=1648

In practice, mode 4 (the default, equivalent to the methods used in Serpent 1) is intended for assembly burnup calculations involving < 100 depletion zones, and mode 1 for full-core problems with thousands of burnable materials. Mode 2 can be used to replace mode 1 if there is enough memory. Mode 3 is the leftover combination of options that probably has very little practical use. The most significant impact on performance is seen in group constant generation, where mode 4 really makes a difference. Group constant generation is switched off by default in modes 1 and 2.
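
In the input this is a single card; for example, for a full-core burnup problem (just a sketch):

Code:

% lower optimization mode for a full-core problem with many burnable materials
set opti 1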

Another thing to consider is that MPI parallelization should be used only to distribute the calculation between different nodes. If a single node is running multiple MPI tasks, the memory consumption is multiplied correspondingly. To divide the calculation between CPUs or cores inside a single node, use OpenMP parallelization instead.
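
To illustrate with the aprun syntax from the job script earlier in this thread (binary path and input name shortened to placeholders):

Code:

# one MPI task per node, 12 OpenMP threads each: one memory image per node
aprun -n 4 -d 12 sss2 -omp 12 input
# 12 single-threaded MPI tasks per node: 12 copies of the memory image per node
aprun -n 48 -d 1 sss2 input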

With so many different applications and optimization modes, parallel scalability becomes an extremely complicated issue. I've tried to collect some of the unanswered questions under an unofficial parallel scalability benchmark:

http://ttuki.vtt.fi/serpent/viewforum.php?f=27
- Jaakko


Re: Serpent2 on Kraken Cray XT5

Post by ondrejch » Wed Oct 16, 2013 11:43 am

Thank you for the reply, Jaakko. Concerning the optimizations, is there a relation between "set opti N" and "set dix 1"? Is there a rule as to when to use one over the other? Can they be combined?

Concerning MPI, I should probably have explained the aprun syntax a bit. The total number of MPI processes is set by "-n", and the number of cores per MPI process by "-d". Kraken has 12-core nodes, so:

Code:

aprun -n 4 -d 12  /lustre/scratch/ondrejch/serpent/bin/sss2.gcc -omp 12 serpent2test.inp
will run on 4 nodes with one MPI task per node and 12 OpenMP threads each, thus 48 cores in total.
Ondrej Chvala, UT Knoxville


Re: Serpent2 on Kraken Cray XT5

Post by Jaakko Leppänen » Wed Oct 16, 2013 1:32 pm

Double-indexing ("set dix 1") is available only in Serpent 1. The idea is that the microscopic nuclide-wise cross sections are stored using their original ACE grids, and an indexing table is created for fast access between the unionized and the nuclide-wise grids. Optimization mode 3 in Serpent 2 is similar, but not exactly the same. Microscopic cross sections are not reconstructed, but the code pre-calculates macroscopic cross sections using the unionized grid. Memory consumption per nuclide is reduced compared to modes 2 and 4, but consumption per material is similar to mode 4.
- Jaakko
