
Serpent 2 scalability benchmark

Posted: Fri Nov 23, 2012 8:22 pm
by Jaakko Leppänen
After hearing about some very promising results regarding the parallel scalability of Serpent 2 from UC Berkeley, I thought it might be a good idea to post some ideas about different things that should be tested in parallel mode. Several users have access to computing resources far beyond what we have at VTT, so extending the scalability tests to hundreds or even thousands of CPUs would really help us get an idea of how the code is actually performing, and point out the problems that need to be fixed.

You can think of this proposal as an unofficial parallel scalability benchmark for Serpent 2. The idea is to define a set of test problems that any user can run on their own computers, and produce results that can be compared to others'. If we could get a few sets of test runs on "supercomputers" and several more on smaller clusters, we should have enough material to publish the results as a joint paper.

Let me know if you are interested in contributing to the tests, or have any suggestions on how the tests should be run. I have listed some of my own ideas below.

General remarks

Serpent is a reactor physics code, so the test cases should reflect the typical applications:

1) Group constant generation for deterministic reactor simulator codes, involving 2D/3D assembly-level transport and burnup calculations.
2) Full-core calculations for test reactors

To push the boundaries, I suggest we also look at:

3) Full-core calculations for power reactors
4) Multi-physics applications (after the methodology is ready for that)

The selection of the test case should also depend on the available computing resources -- there is no point in running a 2D assembly-level calculation with 1000 CPUs, or a full-core simulation on a desktop workstation.

Parallelization in Serpent 2 is based on a hybrid OpenMP / MPI approach, and probably more than anything else the scalability depends on the fraction of CPU time spent inside OpenMP-parallelized loops. I have listed some of the relevant factors below.

History length

OpenMP parallelization divides the particle histories over different threads at the beginning of each criticality source cycle. Everything that is done between the cycles is done in serial, which affects the scalability. I believe this is the reason why fast reactor and HTGR calculations scale up better than LWR calculations -- the longer the neutron histories, the larger the fraction of CPU time spent in the parallel loop.
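The effect can be estimated with Amdahl's law; here is a minimal Python sketch (the serial fractions used below are made-up illustrative values, not measured ones):

```python
def amdahl_speedup(n_threads: int, serial_fraction: float) -> float:
    """Ideal speed-up when `serial_fraction` of the work (the
    inter-cycle processing) runs in serial and the rest (the
    transport loop) parallelizes perfectly over the threads."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# Longer histories mean a smaller serial fraction and better scaling:
print(round(amdahl_speedup(16, 0.05), 1))  # -> 9.1  (shorter histories, 5 % serial)
print(round(amdahl_speedup(16, 0.01), 1))  # -> 13.9 (longer histories, 1 % serial)
```

This is only a first-order model, but it shows why shaving even a few percent off the serial inter-cycle work matters at high thread counts.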

OpenMP reproducibility

In order to reproduce the same random number sequences, the banked fission neutrons must be sorted before they are re-indexed for the next criticality cycle. This takes some CPU time, and the procedure is done outside the parallel loop. The reproducibility mode is on by default, and it can be switched off by "set repro 0". I changed the sorting algorithm in 2.1.9 from bubble- to insertion-sort, but switching the reproducibility option off should still improve scalability to some extent.

Burnup calculation mode

The processing and burnup routines that are run between transport simulations in burnup mode are parallelized by distributing burnable materials to different OpenMP threads. The scalability is not as good as for the transport routine, and if the number of threads is larger than the number of materials, some of the available computing power is left unused. The overall scalability of a burnup calculation therefore depends on the number of burnable materials. The method used for solving the depletion equations ("set bumode") should not have a major effect, but it may be worth testing in very large burnup calculation problems.
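The thread-count vs. material-count effect can be sketched with a simple bound (an illustrative model, not Serpent's actual scheduler):

```python
import math

def burnup_phase_speedup(n_threads: int, n_materials: int) -> float:
    """Upper bound on the depletion-phase speed-up when burnable
    materials are distributed over OpenMP threads: the phase lasts
    as long as the busiest thread, i.e. ceil(materials / threads)
    material solves, and threads beyond the material count sit idle."""
    busiest = math.ceil(n_materials / n_threads)
    return n_materials / busiest

print(burnup_phase_speedup(16, 10))            # -> 10.0 (capped by material count)
print(round(burnup_phase_speedup(8, 100), 2))  # -> 7.69 (load imbalance, not a full 8x)
```

So the overall scalability of a burnup run improves with the number of burnable materials, exactly as described above.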

Optimization modes

Serpent 2 has different optimization modes, intended for small and large burnup calculation problems. These modes change the way the cross sections are handled, which affects memory usage and performance. Since the lower optimization modes do not use pre-calculated macroscopic cross sections, some of the processing routines are not run in burnup mode. Different optimization modes should therefore result in different scalability. Mode 4 is used by default. Mode 3 is somewhat uninteresting, but modes 2 and 1 should be tested.

Number of tallies

Increasing the number of detectors increases the time needed for processing tally statistics, which is done outside the parallel loop. In addition to user-defined detectors, Serpent calculates group constants, power distributions, mesh plots, etc., which have a similar impact on scalability. There are different ways to improve the performance, such as increasing the batching interval (results are collected after N cycles instead of after every cycle).

Score buffering

Serpent stores scores in a temporary buffer, which in OpenMP parallel mode can be either private (each thread writes to its own buffer) or shared (all threads write to the same buffer). Using private buffers increases memory usage to some extent, but it should improve scalability, since no barriers need to be set to protect memory. The default is private; shared mode can be set using "set shbuf 1" (available from version 2.1.10 on).

Division between MPI and OpenMP parallelization

Since OpenMP and MPI are two completely different techniques, the running time for a calculation divided into <M> MPI tasks and <N> OpenMP threads is generally different from a run divided into <N> tasks and <M> threads. In my opinion the best (most practical and "realistic") way to divide the calculation is to use MPI only for division between computational units (nodes), and OpenMP inside them. It may not always be the optimal division, but it is the most efficient in terms of memory usage.
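The memory argument can be illustrated with a back-of-the-envelope model: each MPI task holds its own copy of the large shared data (cross sections, geometry), while OpenMP threads within a task share it. A Python sketch with made-up numbers (the sizes are purely illustrative):

```python
def job_memory_gb(mpi_tasks: int, omp_threads_per_task: int,
                  shared_data_gb: float, per_thread_gb: float) -> float:
    """Rough memory footprint of one node: the shared data is
    replicated once per MPI task, while per-thread buffers scale
    with the total thread count.  Illustrative model only."""
    threads = mpi_tasks * omp_threads_per_task
    return mpi_tasks * shared_data_gb + threads * per_thread_gb

# 32 cores on one node, 10 GB shared data, 0.1 GB of buffers per thread:
print(job_memory_gb(1, 32, 10.0, 0.1))   # -> 13.2  (pure OpenMP within the node)
print(job_memory_gb(32, 1, 10.0, 0.1))   # -> 323.2 (pure MPI: data replicated 32x)
```

This is why using MPI only between nodes and OpenMP within them keeps the per-node memory footprint manageable, even if it is not always the fastest split.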

Test cases

To ensure that the results produced by all participants are comparable, I suggest that we define a set of representative test problems. Feel free to perform additional tests on your own cases, but if we plan to publish the results, the calculations should be more or less standardized and run by several participants.

Below are some general outlines for six test cases that I've come up with so far. Additional suggestions are welcome.

BWR assembly burnup calculation
  • 2D model of a BWR assembly with burnable absorber
  • Represents a typical group constant generation run with burnup calculation
  • Division into different number of depletion zones
  • Optimization mode must be reduced from 4 to 2 or 1 as the number of burnable materials increases, to save memory; group constant generation is performed only in mode 4
Hoogenboom-Martin PWR performance benchmark
  • Large PWR core with power distribution tallied in over 6 million positions
  • Things to test include batching, reproducibility, score buffering, number of detectors, etc.
Full-core pebble-bed reactor calculation
  • Serpent has an explicit geometry type for modeling fuel particle and pebble distributions in HTGRs
  • Geometry model is a full-scale pebble bed reactor core with more than 100,000 pebbles in an unstructured configuration
  • Calculation of pebble-wise power distribution
  • Points out the differences between LWR and HTGR calculations
VVER-440 full-core burnup calculation
  • Simplified but realistic geometry for testing full-core burnup calculations
  • Over 130,000 depletion zones, must be run in the lowest optimization mode
  • Demonstration that Serpent can handle very large burnup calculation problems
Full-core SFR calculation
  • Need a suitable test case (does anyone have a Serpent model for the ESFR core?)
  • We could run a Hoogenboom-Martin type power distribution calculation, a full-core burnup calculation, or both
Dynamic RIA simulation with temperature feedback
  • This test case is related to the multi-physics capability, which is still under development
  • Simulation of a reactivity-initiated accident with fuel temperature feedback
  • Test case for the scalability of an external source simulation and the internal temperature feedback module
I will start working on the input files and put the first test cases at the website soon. The RIA case requires some development in the calculation methods, so it will take some time before we can start running the test calculations. I will also write some Matlab / Octave scripts that can be used to extract the relevant information from the output files.


Since we are running calculations in different computing environments, the main parameter of interest is not the total running time, but the speed-up factor (running time for a single-CPU calculation divided by the running time for the parallel calculation). Because the single-CPU running time is always needed as a reference, we probably cannot run the Hoogenboom-Martin benchmark all the way to the 1% statistical accuracy. For the same reason, we probably need to limit the number of burnup steps in the full-core burnup calculations.

Here are some general instructions on how the calculations should be performed:
  • Use one of the latest versions of Serpent 2 (preferably the latest available, but at least version 2.1.10, which I will distribute soon).
  • Run the calculations without debug and profiler modes (-DDEBUG and -pg flags commented out in Makefile). Use the same compilation for single-CPU and parallel runs.
  • Use the variables printed in the _res.m output in the comparisons. The running time used for calculating the speed-up factors is given by RUNNING_TIME - INIT_TIME. There are also additional variables, PROCESS_TIME, TRANSPORT_CYCLE_TIME, and BURNUP_CYCLE_TIME, which can be used to calculate the fraction of running time used for processing, transport simulation and burnup calculation. Serpent also calculates other parameters, but those should not be used in the comparison. In particular, the ratio of TOT_CPU_TIME and RUNNING_TIME does NOT give the speed-up factor, because both depend on the parallelization.
  • Pay attention to any changes that you make in the input files. Changing the optimization mode, grid reconstruction tolerance, group constant generation, etc. also changes the running times. Make sure that the reference single-CPU calculation is always run using the same input as the parallel calculation.
  • We are not really interested in actual results of the simulation (K-eff, etc.), but you should check that you get the same results for single-CPU and parallel calculations (to within statistics). If you want to test the reproducibility, you need to define the random number seed manually ("set seed <N>"). MPI parallelization is not reproducible by default, and switching the option on deteriorates the performance so much that it is practical only for debugging purposes. Reproducibility in OpenMP mode is on by default, but the random number sequence is sometimes lost because of roundoff errors in burnup calculation.
  • Remember that the code is still very much a beta-version, so report any crashes, errors and unexpected results asap. Also note that the parallel performance may improve with later updates, so if your access to the computer resources is limited to a certain number of CPU hours, keep in mind that you may want to repeat your calculations later.
and finally:
  • We are not trying to break records, but to get an idea on how the code performs in different types of applications and computing environments. The most interesting results are often those that reveal something that needs improvement.
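Until the Matlab / Octave scripts are available, the speed-up factor described above can also be extracted with a short Python sketch (the _res.m line format assumed here is `NAME (idx, ...) = [ value ];`, which may differ between code versions; the file paths are hypothetical):

```python
import re

def read_res_variable(path: str, name: str) -> float:
    """Pick the last value of a scalar variable from a Serpent
    _res.m output file, assuming Matlab-style lines of the form
    NAME (idx, ...) = [ value ];  (the format may vary by version)."""
    pattern = re.compile(rf"^{name}\s*\(.*=\s*\[\s*([0-9.Ee+-]+)")
    value = None
    with open(path) as f:
        for line in f:
            m = pattern.match(line)
            if m:
                value = float(m.group(1))  # keep the last burnup step
    if value is None:
        raise KeyError(name)
    return value

def speedup(serial_res: str, parallel_res: str) -> float:
    """Speed-up factor from RUNNING_TIME - INIT_TIME, as above."""
    t1 = (read_res_variable(serial_res, "RUNNING_TIME")
          - read_res_variable(serial_res, "INIT_TIME"))
    tn = (read_res_variable(parallel_res, "RUNNING_TIME")
          - read_res_variable(parallel_res, "INIT_TIME"))
    return t1 / tn
```

Usage would be something like speedup("case_serial_res.m", "case_omp16_res.m").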
I would like to keep all communication here, so that new ideas, questions and answers, results, troubleshooting, etc. can directly benefit all participants. So rather than sending me e-mail, share your thoughts here. The discussion area is password-protected, so there is a certain level of privacy.

Re: Serpent 2 scalability benchmark

Posted: Sun Jan 06, 2013 2:43 pm
by Ville Rintala
A couple of things to note when calculating these benchmarks.

We have a 16-node cluster here in Lappeenranta, with two E5-2660 Xeons (2.2 GHz) per calculation node (8+8 cores). These Xeons (like almost every recent processor) have Turbo Boost, which overclocks some cores when only a partial load is applied. I reserved one calculation node to test the OpenMP performance of Serpent 2.1.11 and tried to run a single-core calculation to obtain the baseline performance. Had I done this with just a single process running, the clock rate would have been 3 GHz. That reference case would then run at a very different speed from the later cases where more cores are used, and the parallel performance would be underestimated, because the single-core run performs much better than it should. My solution was to run 16 single-core runs at the same time, then 8 runs with -omp 2, 4 runs with -omp 4, and so on. This requires a lot of memory, of course.

Depending on system settings, another issue can arise from SpeedStep, which saves energy by underclocking the processor. Depending on the power-saving settings of the operating system, the processor might run at a lower clock rate under partial load. I noticed this kind of odd behaviour on our other cluster, with older processors that do not have Turbo Boost: if only one process was running, the clock rate was less than nominal. Again, this can be avoided by making sure there is a full load during the calculation.

Hyper-threading (HT) can also have a great impact on the results, because the operating system scheduler first assigns one thread per (physical) core, and only adds a second thread per core if there are more threads to run. For example, on a quad core with HT, running a single process is a very different case from running five or more. Again, all of these problems are avoided if there is a full load during each calculation.

Re: Serpent 2 scalability benchmark

Posted: Fri Jul 12, 2013 4:36 pm
by siefman
If this project is still ongoing, I could help with these benchmark runs on two clusters that I have access to, at the University of Florida and at PSI. I can't get 1,000 CPUs, but 120 would be reasonable. Also, regarding an ESFR model: Levon K. Ghasabyan recently performed a Serpent analysis of it at PSI. See this link

Maybe with permission his model would be acceptable.

Re: Serpent 2 scalability benchmark

Posted: Mon Aug 20, 2018 5:57 pm
by s.pfeiffer

Have there been any publications or final results from this benchmark testing?

Re: Serpent 2 scalability benchmark

Posted: Thu Aug 30, 2018 10:32 am
by Jaakko Leppänen
Nothing official. If there are any publications, please let us know so they can be included in Serpent Wiki.

Re: Serpent 2 scalability benchmark

Posted: Fri Nov 27, 2020 2:13 pm
by mathieu.hursin
In the framework of an M&C 2021 paper, I'm comparing the performance of MCNP and Serpent for single-statepoint (no depletion) 2D burnt-assembly calculations (a large number of isotopes but a low number of detectors). I'm using Serpent v2.1.29.
I have access to a cluster allowing up to ~352 CPUs per job without hyperthreading, and double that amount with HT enabled.
I followed the recommendation of the manual for parallel runs (tasks bound to processors). The neutron population considered is:
set pop 1000000 50 25 (the number of active cycles is low to limit wall-time). Wall-time is used as the metric.
I used 2 MPI tasks and an increasing number of OMP threads. With HT, the number of CPUs is divided by two, to reflect the fact that without HT the dual core would still book both CPUs, keeping one idle.

The results are shown in the attached figure, HT.png.
HT always helps, but its benefit decreases as the number of CPUs increases.

Have other users observed the same performance when using HT, or am I doing something wrong?



Re: Serpent 2 scalability benchmark

Posted: Mon Nov 30, 2020 11:30 am
by Jaakko Leppänen
Are you comparing the running time to a case with 1 OpenMP thread per 2 MPI tasks? If so, I think there may be some overhead from data transfer between the tasks. Try comparing with a single-CPU run instead.

I think the superior performance for hyper-threading results from more efficient use of CPU cache or some similar factor.

Re: Serpent 2 scalability benchmark

Posted: Thu Dec 03, 2020 6:55 pm
by mathieu.hursin

Our cluster has two processors per node, for a total of 44 cores per node, or 88 cores with hyperthreading enabled. Hence the initial use of two MPI tasks per node, binding the memory to each MPI task as suggested in the wiki.

But I did what you recommended: used a single MPI task while increasing the number of OMP threads from 1 to 44, then did the same with HT switched on.
The gain for a low number of threads is clear: the runtime is 15 h for 1 thread and 11 h for 1 thread with HT enabled; with 8 threads, it is 2 h without HT and 1 h 30 min with. When using the full potential of the node (44 threads, 88 with HT), the benefit is almost negligible: 34 min vs. 32 min...
Any thoughts on what might be the problem here?



Re: Serpent 2 scalability benchmark

Posted: Fri Dec 04, 2020 2:26 pm
by Ana Jambrina
I might be mistaken here, but I think the cause is that the code is reaching the theoretical maximum (floating-point) performance of the CPU's arithmetic units: HT does not increase the maximum throughput of the execution units.

(HT tries to get more utilization out of the execution units without actually duplicating much of the core. If one thread has enough instruction-level parallelism and rarely misses the cache, it will mostly fill up the core's resources on its own, and HT's effect will be negligible as long as the data fits into cache. For compute-bound, highly parallelized code with data that does not fit into cache, HT still probably won't help much, since both hardware threads share the same execution units and together need more than the full cache; performance may even be lost, because the threads compete for cache. It would depend on the operations and the data access patterns.)