
Re: Threading with OpenMP

Posted: Sat Nov 12, 2011 5:43 pm
by Jaakko Leppänen
I've been able to improve the performance with OpenMP parallelization, mainly by restructuring arrays that are accessed during the transport cycle. There is still some work to be done, but so far I've achieved a factor of 8-10 decrease in calculation time with 12 CPU cores for the parallelized part in a BWR assembly calculation (transport only). Before the changes the factor was about 2-3. In burnup mode the scalability is not as good for the processing and burnup routines, although I've only run a few tests in optimization mode 4.
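
One common way to restructure such arrays is to give each thread its own copy of the scoring data and combine the copies once per cycle, so that no synchronization is needed inside the transport loop. The sketch below only illustrates this general idea; the names (ScoreTally, ReduceTallies, NBINS) and layout are hypothetical and not taken from the Serpent source.

[code]
#include <omp.h>
#include <stdlib.h>
#include <string.h>

#define NBINS 1000

static double *thread_tallies;   /* NBINS doubles per thread */

void InitTallies(void)
{
  int nthreads = omp_get_max_threads();
  thread_tallies = calloc((size_t)nthreads * NBINS, sizeof(double));
}

/* Called inside the parallel transport loop: each thread writes only to
   its own slice, so no atomics or critical sections are needed */
void ScoreTally(int bin, double val)
{
  thread_tallies[(size_t)omp_get_thread_num() * NBINS + bin] += val;
}

/* Called once per cycle, outside the parallel region */
void ReduceTallies(double *total)
{
  memset(total, 0, NBINS * sizeof(double));

  for (int t = 0; t < omp_get_max_threads(); t++)
    for (int i = 0; i < NBINS; i++)
      total[i] += thread_tallies[(size_t)t * NBINS + i];
}
[/code]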

I have a few more ideas I want to try out, but at some point the simplicity and maintainability of the source code becomes more important than squeezing out the last few percent of performance.

Re: Threading with OpenMP

Posted: Tue Nov 15, 2011 3:40 am
by nelsonag
That's good news. Looking forward to working with it.

Re: Threading with OpenMP

Posted: Tue Nov 15, 2011 12:43 pm
by Jaakko Leppänen
After running some tests with different systems, I believe I have now identified the main bottlenecks...

In criticality source simulation mode, the code stores neutron data structures in two buffers: one from which the neutron source is read, and one where new source points are written for the next cycle. Accessing these buffers must be protected by OpenMP critical pragmas to ensure that two threads are not reading or writing the same data. These operations limit the scalability, because each thread has to wait for the previous thread to complete the operation. The overhead basically depends on how much CPU time is spent simulating a single neutron history. In my LWR test calculations, in which the neutron histories are relatively short, I get a speed-up factor of about 10 when using 12 CPU cores. In an HTGR full-core calculation, the same factor is almost 12, because the neutron histories are much longer, and the probability of having two threads accessing the buffers simultaneously is much lower.
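
Structurally the idea is roughly as in the sketch below: every read from the current source buffer and every write to the next-cycle buffer goes through a critical section, so contention grows when the histories are short. The buffer layout and names (FromSrc, ToNew, SourcePoint) are illustrative only, not the actual Serpent data structures.

[code]
#include <omp.h>

#define BUF_SZ 100000

typedef struct { double x, y, z, E, wgt; long idx; } SourcePoint;

static SourcePoint src_buf[BUF_SZ];  /* source read during the current cycle */
static SourcePoint new_buf[BUF_SZ];  /* new source points for the next cycle */
static long n_src = 0, n_new = 0;

/* Pop one source point; serialized so two threads never grab the same slot */
long FromSrc(SourcePoint *p)
{
  long i = -1;

#pragma omp critical (src_buffer)
  {
    if (n_src > 0)
      {
        i = --n_src;
        *p = src_buf[i];
      }
  }

  return i;
}

/* Store a new fission source point; this is where threads end up waiting
   for each other when histories are short */
void ToNew(const SourcePoint *p)
{
#pragma omp critical (new_buffer)
  {
    if (n_new < BUF_SZ)
      new_buf[n_new++] = *p;
  }
}
[/code]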

The second problem arises from reproducibility. The random number generator is initialized at the beginning of each history, based on the seed number and the history index. When the threads write new source points in the buffer for the next cycle, the order in which the points are stored depends on CPU load, and is different every time the code is run. To maintain reproducibility, the source buffer is sorted after each cycle based on the history indexes, which ensures that the same index is assigned to the same history every time the code is run. If the neutron population is large, the sorting operation starts to take up a significant fraction of the overall CPU time, and since the sorting is done outside the parallel loop, it starts to affect scalability.
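
In pseudocode terms the scheme looks something like the sketch below: the RNG state is a pure function of the seed and the history index, and the next-cycle buffer is sorted by history index between cycles. The seeding formula and names are placeholders, not the actual generator used in Serpent; the point is only that the sort is a serial step outside the parallel loop.

[code]
#include <stdlib.h>

typedef struct { double x, y, z, E, wgt; long idx; } SourcePoint;

/* RNG state depends only on the global seed and the history index, so the
   same history sees the same random number sequence on every run
   (placeholder mixing function, not the real generator) */
unsigned long InitHistoryRNG(unsigned long seed, long history_idx)
{
  return seed ^ (0x9E3779B97F4A7C15UL * (unsigned long)(history_idx + 1));
}

static int CompareIdx(const void *a, const void *b)
{
  long ia = ((const SourcePoint *)a)->idx;
  long ib = ((const SourcePoint *)b)->idx;

  return (ia > ib) - (ia < ib);
}

/* Called between cycles, outside the parallel loop: this is the serial
   O(N log N) step that starts to dominate for large neutron populations */
void SortSource(SourcePoint *buf, long n)
{
  qsort(buf, (size_t)n, sizeof(SourcePoint), CompareIdx);
}
[/code]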

I'll post some example results later...

A speed-up factor of 10 with 12 cores sounds pretty good, but the problem is that I only have 12 cores at my disposal, and I have no way of knowing whether the factor is still 10 with 20 or 100 cores. The second problem is easier to overcome, simply by introducing an option to skip the sorting altogether. Reproducibility is lost, but no statistical laws are violated, so the results should be OK.

Re: Threading with OpenMP

Posted: Thu Nov 24, 2011 3:12 pm
by Jaakko Leppänen
Here are some example results...

The first two figures below show the scalability in the transport calculation. The test cases are selected from the standard inputs that I use to validate Serpent against MCNP in infinite lattice calculations (see related topic). Figure 1 shows different LWR cases and Figure 2 other reactor types. There was some occasional CPU load on the calculation node while running the VVER-440 and SFR cases, which explains the slight notch at 11 threads. As I mentioned in my previous post, the best scalability is attained in the HTGR cases, which I believe results from the longer neutron histories.

Figure 1. Speed-up factor as a function of the number of OpenMP threads in LWR cases. Transport calculation only.

Figure 2. Speed-up factor as a function of the number of OpenMP threads in CANDU, HTGR and SFR cases. Transport calculation only.

The next two figures show the scalability in burnup calculation. The situation becomes much more complicated when additional CPU time is spent on the burnup and processing routines after each step. The first case is a PWR assembly burnup calculation with 65 depletion zones, and the second case is a PBMR fuel pebble with a single burnable material region. In the PWR case the speed-up factor for the transport routine is similar to what is seen in Figure 1. The scalability is not as good for the burnup and processing routines, which leaves the overall speed-up factor at just below 7 with 12 CPU cores. The main bottleneck is a processing routine that calculates material-wise total cross sections before running the transport simulation. Most of the CPU time in the PBMR case is spent in the transport cycle, and the good scalability is reflected in the overall speed-up factor.
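
For reference, that kind of processing step has roughly the structure sketched below: reconstructing material-wise total macroscopic cross sections as density-weighted sums over the nuclides in each material. The loop structure, names and arguments are hypothetical, not the actual Serpent routine; the sketch only illustrates that the outer loop over materials is the natural place for OpenMP work-sharing.

[code]
#include <omp.h>

/* mat_tot[m][e] : total macroscopic xs of material m on an energy grid
   nuc_xs[m]     : nuclide-wise xs data of material m (flattened, n_nuc[m] x n_erg)
   adens[m][i]   : atomic density of nuclide i in material m
   n_nuc[m]      : number of nuclides in material m                          */
void CalculateTotalXS(double **mat_tot, double **nuc_xs, double **adens,
                      const int *n_nuc, int n_mat, long n_erg)
{
#pragma omp parallel for schedule(dynamic)
  for (int m = 0; m < n_mat; m++)
    for (long e = 0; e < n_erg; e++)
      {
        double sum = 0.0;

        /* Sigma_tot(E) = sum_i N_i * sigma_i(E) */
        for (int i = 0; i < n_nuc[m]; i++)
          sum += adens[m][i] * nuc_xs[m][i * n_erg + e];

        mat_tot[m][e] = sum;
      }
}
[/code]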

Figure 3. Speed-up factor as a function of the number of OpenMP threads in a PWR assembly burnup calculation with 65 depletion zones.

Figure 4. Speed-up factor as a function of the number of OpenMP threads in a PBMR burnup calculation with a single burnable material region.

All calculations presented here were run in optimization mode 4, which corresponds to the methods used in Serpent 1. Lower optimization modes designed to save memory probably behave differently. I'll look into that next...

Re: Threading with OpenMP

Posted: Fri Dec 02, 2011 3:15 pm
by Jaakko Leppänen
As expected, the scalability is better with lower optimization modes, in which the transport routine takes up a larger fraction of the overall running time. The figure below shows the speed-up factors for optimization mode 1 in the PWR burnup case. By optimization modes I am referring to the options introduced in my presentation at the Dresden meeting.
[Figure: speed-up factor as a function of the number of OpenMP threads for optimization mode 1 in the PWR burnup case.]