You can think of this proposal as an unofficial parallel scalability benchmark for Serpent 2. The idea is to define a set of test problems that any user can run on their own computer, producing results that can be compared with those of others. If we could get a few sets of test runs on "supercomputers" and several more on smaller clusters, we should have enough material to publish the results as a joint paper.
Let me know if you are interested in contributing to the tests, or have any suggestions on how the tests should be run. I have listed some of my own ideas below.
Serpent is a reactor physics code, so the test cases should reflect the typical applications:
1) Group constant generation for deterministic reactor simulator codes, involving 2D/3D assembly-level transport and burnup calculations.
2) Full-core calculations for test reactors
To push the boundaries, I suggest we also look at:
3) Full-core calculations for power reactors
4) Multi-physics applications (after the methodology is ready for that)
The selection of the test case should also depend on the available computing resources -- there is no point in running a 2D assembly-level calculation with 1000 CPUs, or a full-core simulation on a desktop workstation.
Parallelization in Serpent 2 is based on a hybrid OpenMP / MPI approach, and probably more than anything else the scalability depends on the fraction of CPU time spent inside OpenMP-parallelized loops. I have listed some of the factors below.
Criticality source simulation
OpenMP parallelization divides the particle histories over different threads at the beginning of each criticality source cycle. Everything between the cycles is done serially, which limits the scalability. I believe this is the reason why fast reactor and HTGR calculations scale up better than LWR calculations -- the longer the neutron histories, the larger the fraction of CPU time spent in the parallel loop.
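The penalty from the serial between-cycle work can be estimated with Amdahl's law. Here is a minimal sketch in Python; the serial fractions are made-up illustrations, not measured Serpent values:

```python
def amdahl_speedup(serial_fraction, n_threads):
    """Ideal speed-up when a fixed fraction of the work remains serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_threads)

# Illustrative numbers only: shorter LWR histories mean a larger serial
# fraction between cycles than in an HTGR or fast reactor run.
print(round(amdahl_speedup(0.05, 16), 1))  # LWR-like serial fraction
print(round(amdahl_speedup(0.01, 16), 1))  # HTGR-like serial fraction
```

With 16 threads, the difference between a 5% and a 1% serial fraction is already a speed-up of about 9 versus about 14, which matches the observation that longer histories scale better.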
Reproducible random number sequences
In order to reproduce the same random number sequences, the banked fission neutrons must be sorted before they are re-indexed for the next criticality cycle. This takes some CPU time, and the procedure is done outside the parallel loop. The reproducibility mode is on by default, and it can be switched off with "set repro 0". I changed the sorting algorithm in 2.1.9 from bubble sort to insertion sort, but switching the reproducibility option off should still improve scalability to some extent.
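To illustrate why the sorting matters: with OpenMP, the banked fission neutrons are produced in whichever order the threads happen to finish, so the bank must be sorted by a deterministic key before the next cycle is indexed. A toy sketch (this is not Serpent's actual data structure, just the idea):

```python
# Toy fission bank entries: (parent history index, particle data).
# Two runs of the same calculation fill the bank in different thread orders.
bank_run1 = [(3, "n3"), (0, "n0"), (2, "n2"), (1, "n1")]
bank_run2 = [(1, "n1"), (3, "n3"), (0, "n0"), (2, "n2")]

# Sorting by the deterministic key restores the same source order in both
# runs, so the next cycle consumes the same random number sequence.
sorted1 = sorted(bank_run1, key=lambda entry: entry[0])
sorted2 = sorted(bank_run2, key=lambda entry: entry[0])
assert sorted1 == sorted2
```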
Burnup calculation mode
The processing and burnup routines that are run between transport simulations in burnup mode are parallelized by distributing burnable materials to different OpenMP threads. The scalability is not as good as for the transport routine, and if the number of threads is larger than the number of materials, some of the available computing power is left unused. The overall scalability of a burnup calculation therefore depends on the number of burnable materials. The method used for solving the depletion equations ("set bumode") should not have a major effect, but it may be worth testing in very large burnup calculation problems.
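The load-balance argument can be made concrete with a small model. Assuming every burnable material costs the same to process and materials are dealt out evenly, the depletion step is limited by the busiest thread:

```python
import math

def depletion_thread_efficiency(n_materials, n_threads):
    """Fraction of thread capacity used during the depletion step, under the
    simplifying assumption of equal cost per burnable material."""
    if n_materials == 0:
        return 0.0
    busiest = math.ceil(n_materials / n_threads)  # materials on fullest thread
    return n_materials / (n_threads * busiest)

print(depletion_thread_efficiency(100, 16))  # ~0.89: threads stay mostly busy
print(depletion_thread_efficiency(10, 16))   # 0.625: six threads sit idle
```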
Optimization modes
Serpent 2 has different optimization modes, intended for small and large burnup calculation problems. These modes change the way the cross sections are handled, which affects memory usage and performance. Since the lower optimization modes do not use pre-calculated macroscopic cross sections, some of the processing routines are not run in burnup mode. Different optimization modes should therefore result in different scalability. Mode 4 is used by default. Mode 3 is somewhat uninteresting, but modes 2 and 1 should be tested.
Number of tallies
Increasing the number of detectors increases the time needed for processing tally statistics, which is done outside the parallel loop. In addition to user-defined detectors, Serpent calculates group constants, power distributions, mesh plots, etc., which have a similar impact on scalability. There are different ways to improve the performance, such as increasing the batching interval (results are collected after every N cycles instead of after every cycle).
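The effect of the batching interval can be illustrated with a back-of-the-envelope model, assuming a fixed serial cost per collection (the numbers are invented):

```python
def tally_collection_overhead(n_cycles, cost_per_collection, batch_interval):
    """Total serial time spent collecting tally statistics over a run,
    assuming every collection costs the same (illustrative model only)."""
    n_collections = n_cycles // batch_interval
    return n_collections * cost_per_collection

# 500 active cycles at a hypothetical 0.2 s per collection:
print(tally_collection_overhead(500, 0.2, 1))   # collect every cycle
print(tally_collection_overhead(500, 0.2, 20))  # collect every 20 cycles
```

Collecting every 20 cycles cuts the serial overhead by a factor of 20, which directly increases the fraction of time spent in the parallel loop.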
Score buffering
Serpent stores scores in a temporary buffer, which in OpenMP parallel mode can be either private (each thread writes to its own buffer) or shared (all threads write to the same buffer). Using private buffers increases memory usage to some extent, but it should improve scalability, since no barriers are needed to protect memory. The default is private; shared mode can be set with "set shbuf 1" (available from version 2.1.10 on).
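The difference between the two buffering modes can be sketched with ordinary threads. This is a Python analogy, not Serpent's implementation: the shared buffer needs a lock around every score, while private buffers are merged once at the end:

```python
import threading

N_THREADS, N_SCORES = 4, 1000

# Shared buffer: every score must be protected (the analogue of the memory
# barriers needed in shared mode).
shared_total = 0.0
lock = threading.Lock()

def score_shared():
    global shared_total
    for _ in range(N_SCORES):
        with lock:               # serializes every score
            shared_total += 1.0

# Private buffers: each thread writes only its own slot, no locking needed.
private = [0.0] * N_THREADS

def score_private(tid):
    for _ in range(N_SCORES):
        private[tid] += 1.0

workers = [threading.Thread(target=score_shared) for _ in range(N_THREADS)]
workers += [threading.Thread(target=score_private, args=(i,))
            for i in range(N_THREADS)]
for t in workers:
    t.start()
for t in workers:
    t.join()

# Both schemes accumulate the same total; the private version defers the
# merge to a single serial reduction at the end.
assert shared_total == sum(private) == float(N_THREADS * N_SCORES)
```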
Division between MPI and OpenMP parallelization
Since OpenMP and MPI are two completely different techniques, the running time for a calculation divided into <M> MPI tasks and <N> OpenMP threads is generally different from that of a run divided into <N> tasks and <M> threads. In my opinion the best (most practical and "realistic") way to divide the calculation is to use MPI only for division between computational units (nodes), and OpenMP inside them. It may not always be the optimal division, but it is the most efficient in terms of memory usage.
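As a concrete example of the suggested division, a run on 8 nodes with 16 cores each would use 8 MPI tasks with 16 OpenMP threads per task. A small helper to build the command line (the `-omp` option and the plain `mpirun` syntax are my assumptions about the local setup; adjust to your batch system):

```python
def hybrid_run_command(n_nodes, cores_per_node, input_file):
    """One MPI task per node, OpenMP threads inside each node.
    Assumes Serpent 2 is invoked as 'sss2' with the '-omp' option and that
    the MPI launcher is a plain 'mpirun'; both are site-dependent."""
    return f"mpirun -np {n_nodes} sss2 -omp {cores_per_node} {input_file}"

print(hybrid_run_command(8, 16, "bwr_assembly"))
```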
To ensure that the results produced by all participants are comparable, I suggest that we define a set of representative test problems. Feel free to perform additional tests on your own cases, but if we plan to publish the results, the calculations should be more or less standardized and run by several participants.
Below are some general outlines for six test cases that I've come up with so far. Additional suggestions are welcome.
BWR assembly burnup calculation
- 2D model of a BWR assembly with burnable absorber
- Represents a typical group constant generation run with burnup calculation
- Division into different numbers of depletion zones
- Optimization mode must be reduced from 4 to 2 or 1 as the number of burnable materials increases, to save memory; group constant generation is performed only in mode 4
Full-core PWR calculation (Hoogenboom-Martin benchmark)
- Large PWR core with power distribution tallied in over 6 million positions
- Things to test include batching, reproducibility, score buffering, number of detectors, etc.
HTGR full-core calculation
- Serpent has an explicit geometry type for modeling fuel particle and pebble distributions in HTGR reactors
- Geometry model is a full-scale pebble bed reactor core with more than 100,000 pebbles in an unstructured configuration
- Calculation of pebble-wise power distribution
- Points out the differences between LWR and HTGR calculations
Full-core burnup calculation
- Simplified but realistic geometry for testing full-core burnup calculations
- Over 130,000 depletion zones, must be run in the lowest optimization mode
- Demonstration that Serpent can handle very large burnup calculation problems
Fast reactor calculation
- Need a suitable test case (does anyone have a Serpent model for the ESFR core?)
- We could run a Hoogenboom-Martin type power distribution calculation, a full-core burnup calculation, or both
Multi-physics test case
- This test case is related to the multi-physics capability, which is still under development
- Simulation of a reactivity-initiated accident with fuel temperature feedback
- Test case for the scalability of an external source simulation and the internal temperature feedback module
Since we are running calculations in different computing environments, the main parameter of interest is not the total running time, but the speed-up factor (running time for a single-CPU calculation divided by the running time for the parallel calculation). Because the single-CPU running time is always needed as a reference, we probably cannot run the Hoogenboom-Martin benchmark all the way to the 1% statistical accuracy. For the same reason, we probably need to limit the number of burnup steps in the full-core burnup calculations.
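For reference, the quantities we are after can be written out explicitly. A short sketch with hypothetical timings:

```python
def speedup(serial_time, parallel_time):
    """Speed-up factor: single-CPU running time over parallel running time."""
    return serial_time / parallel_time

def efficiency(serial_time, parallel_time, n_cpus):
    """Parallel efficiency: speed-up per CPU; 1.0 would be ideal scaling."""
    return speedup(serial_time, parallel_time) / n_cpus

# Hypothetical timings: 3600 s on one CPU, 300 s on 16 CPUs.
print(speedup(3600.0, 300.0))         # speed-up factor of 12
print(efficiency(3600.0, 300.0, 16))  # efficiency of 0.75
```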
Here are some general instructions on how the calculations should be performed:
- Use one of the latest versions of Serpent 2 (preferably the latest available, but at least version 2.1.10, which I will distribute soon).
- Run the calculations without debug and profiler modes (-DDEBUG and -pg flags commented out in Makefile). Use the same compilation for single-CPU and parallel runs.
- Use the variables printed in the _res.m output in the comparisons. The running time used for calculating the speed-up factors is given by RUNNING_TIME - INIT_TIME. There are also additional variables, PROCESS_TIME, TRANSPORT_CYCLE_TIME, and BURNUP_CYCLE_TIME, that can be used to calculate the fraction of running time used for processing, transport simulation, and burnup calculation. Serpent also calculates other parameters, but those should not be used in the comparison. In particular, the ratio of TOT_CPU_TIME to RUNNING_TIME does NOT give the speed-up factor, because both depend on the parallelization.
- Pay attention to any changes that you make in the input files. Changing the optimization mode, grid reconstruction tolerance, group constant generation, etc. also changes the running times. Make sure that the reference single-CPU calculation is always run using the same input as the parallel calculation.
- We are not really interested in the actual results of the simulation (K-eff, etc.), but you should check that you get the same results for single-CPU and parallel calculations (to within statistics). If you want to test the reproducibility, you need to define the random number seed manually ("set seed <N>"). MPI parallelization is not reproducible by default, and switching the option on deteriorates the performance so much that it is practical only for debugging purposes. Reproducibility in OpenMP mode is on by default, but the random number sequence is sometimes lost because of round-off errors in the burnup calculation.
- Remember that the code is still very much a beta-version, so report any crashes, errors and unexpected results asap. Also note that the parallel performance may improve with later updates, so if your access to the computer resources is limited to a certain number of CPU hours, keep in mind that you may want to repeat your calculations later.
- We are not trying to break records, but to get an idea on how the code performs in different types of applications and computing environments. The most interesting results are often those that reveal something that needs improvement.
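To make the timing comparison concrete, here is a sketch of how the _res.m variables combine into the numbers to be compared. The values are hypothetical; only the variable names and the RUNNING_TIME - INIT_TIME definition come from the instructions above:

```python
# Hypothetical timing values, using the variable names from _res.m.
timings = {
    "RUNNING_TIME": 120.0,
    "INIT_TIME": 5.0,
    "PROCESS_TIME": 23.0,
    "TRANSPORT_CYCLE_TIME": 80.0,
    "BURNUP_CYCLE_TIME": 12.0,
}

# Running time used for the speed-up factors:
run_time = timings["RUNNING_TIME"] - timings["INIT_TIME"]

# Fraction of that time spent in processing, transport and burnup routines:
for key in ("PROCESS_TIME", "TRANSPORT_CYCLE_TIME", "BURNUP_CYCLE_TIME"):
    print(key, round(timings[key] / run_time, 3))
```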
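For the single-CPU versus parallel consistency check, a simple criterion is that the two K-eff estimates agree to within a few combined standard deviations. A sketch (the three-sigma cut-off is my choice, not a Serpent convention):

```python
import math

def within_statistics(k1, s1, k2, s2, n_sigma=3.0):
    """True if two estimates with standard deviations s1, s2 differ by less
    than n_sigma combined standard deviations."""
    return abs(k1 - k2) < n_sigma * math.sqrt(s1 ** 2 + s2 ** 2)

# Hypothetical single-CPU and parallel K-eff results:
print(within_statistics(1.00123, 0.00030, 1.00098, 0.00030))  # consistent
print(within_statistics(1.00123, 0.00010, 1.00323, 0.00010))  # not consistent
```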