
Test case 2: PWR full-core calculation

Posted: Thu Dec 13, 2012 12:59 pm
by Jaakko Leppänen
The input file for the second case is found at:

http://virtual.vtt.fi/virtual/montecarl ... /PWR_CORE/

The case is taken from the Hoogenboom-Martin Monte Carlo performance benchmark, described at:

http://www.oecd-nea.org/dbprog/MonteCar ... chmark.htm

The challenge here is the large number of tallies -- six mesh plots, a full-core power distribution with over 6 million zones ("set cpd ...") and three detectors. What I expect to see here is good scalability if all tally calculation is switched off, but a saturation in performance after 5 or 6 CPUs when all tallies are calculated.

Things that will most likely affect the scalability include (a sketch of the corresponding input cards is given at the end of this post):

1) Batching interval (the 5th parameter in "set pop") -- increasing the number means that the code runs more criticality cycles without processing statistics between them.
2) Score buffering -- a private buffer means that each OpenMP thread writes tally scores into its own memory space. With this option, there is no need to set barriers when the data is written, which should improve scalability. On the other hand, collecting the data from the private buffers when the statistics are processed (done after each criticality cycle or batch) requires extra CPU time.
3) OpenMP reproducibility -- the population size is relatively large in this case, and if the reproducibility option is on, the CPU time required for sorting the banked fission source before each cycle may affect scalability.

Optimization mode most likely does not affect the scalability because no burnup calculation is run. The uniform fission source method, invoked by "set ufs ...", changes the total number of neutron histories run, so changing or removing this parameter changes your entire calculation. I don't expect this parameter to have a major impact on scalability.
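
For reference, here is a rough sketch of how the options discussed above might look in the input. The fifth entry on "set pop" (the batching interval) is taken from the cards quoted later in this thread; the "set shbuf", "set repro" and "set opti" card names and value conventions are quoted from memory and should be checked against the input manual for your Serpent 2 version:

Code:

% population size, cycles, inactive cycles, initial keff guess, batching interval (point 1)
set pop 500000 1000 200 1.0 10

% score buffering: per-thread (private) buffers (point 2; card name and value convention assumed)
set shbuf 0

% OpenMP reproducibility off, skips sorting of the banked fission source (point 3; assumed)
set repro 0

% optimization mode; not expected to affect scalability here, since no burnup calculation is run
set opti 4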

Re: Test case 2: PWR full-core calculation

Posted: Sun Jan 06, 2013 2:50 pm
by Ville Rintala
Some results from Lappeenranta University of Technology:

See this post: Re: Serpent 2 scalability benchmark

Calculation nodes: 2 x E5-2660 Xeon (8+8 cores, Hyper-threading off), 128 GiB

No changes were made to the input file. I ran the following calculations (multiple serial jobs could not be started in one SLURM batch file with MPI support compiled in, so a different executable was used in the first four cases; a minimal launch-script sketch is shown after the list):

One calculation node, OpenMP support only:
1. -omp 1, 16 cases running
2. -omp 2, 8 cases running
3. -omp 4, 4 cases running
4. -omp 8, 2 cases running
With MPI support:
5. -mpi 1 -omp 16
6. -mpi 2 -omp 8
7. -mpi 4 -omp 4
8. -mpi 8 -omp 2
Two-node calculations:
9. -mpi 4 -omp 8
10. -mpi 8 -omp 4
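
One way to start several independent runs from a single batch script with the OpenMP-only executable (how exactly this was done in the original runs is not stated) is something like the following minimal SLURM sketch for case 3, assuming the executable is called sss2 and the input has been copied to core_1 ... core_4 (both names are placeholders):

Code:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4

# start four independent 4-thread Serpent runs in the background and wait for all of them
for i in 1 2 3 4; do
    ./sss2 -omp 4 core_$i > core_$i.log &
done
wait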

Results:

Code:

----------------------------------------------------
MPI OMP PARA :  Total        :  Transport          :
----------------------------------------------------
  1   1    1 :   714.8   1.0 :   714.7   1.0  1.00 : 
  1   2    2 :   375.5   1.9 :   375.5   1.9  1.00 : 
  1   4    4 :   200.6   3.6 :   200.6   3.6  1.00 : 
  1   8    8 :   114.9   6.2 :   114.9   6.2  1.00 : 
  1  16   16 :    78.3   9.1 :    78.3   9.1  1.00 : 
  2   8   16 :    68.9  10.4 :    68.9  10.4  1.00 : 
  4   4   16 :    64.4  11.1 :    64.4  11.1  1.00 : 
  8   2   16 :    64.1  11.1 :    64.1  11.1  1.00 : 
  4   8   32 :    47.2  15.1 :    47.2  15.1  1.00 : 
  8   4   32 :    40.7  17.6 :    40.7  17.6  1.00 : 
----------------------------------------------------
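
As a quick check of the numbers above: the second value in each column pair is the speed-up relative to the single-thread run, e.g. for the pure-OpenMP 16-thread case

714.8 / 78.3 ≈ 9.1, i.e. a parallel efficiency of 9.1 / 16 ≈ 0.57,

and for the best 32-way run (8 MPI tasks x 4 OpenMP threads over two nodes)

714.8 / 40.7 ≈ 17.6, i.e. an efficiency of 17.6 / 32 ≈ 0.55.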

Re: Test case 2: PWR full-core calculation

Posted: Mon Jan 14, 2013 12:25 pm
by hartanto
These are the results from KAIST for test case 2. No changes were made to the input file.

The CPU type is Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz.

Code:

---------------------------------------------------------
NODE MPI OMP PARA :  Total        :  Transport          :
---------------------------------------------------------
  1    1   1    1 :   646.3   1.0 :   646.3   1.0  1.00 : 
  1    1   2    2 :   336.5   1.9 :   336.5   1.9  1.00 : 
  1    2   1    2 :   325.4   2.0 :   325.4   2.0  1.00 : 
  1    1   4    4 :   185.6   3.5 :   185.6   3.5  1.00 : 
  1    2   2    4 :   181.2   3.6 :   181.2   3.6  1.00 : 
  1    4   1    4 :   177.3   3.6 :   177.3   3.6  1.00 : 
  1    1   8    8 :   110.4   5.9 :   110.4   5.9  1.00 : 
  1    2   4    8 :   105.3   6.1 :   105.3   6.1  1.00 : 
  1    4   2    8 :   104.8   6.2 :   104.8   6.2  1.00 : 
  1    8   1    8 :   101.4   6.4 :   101.4   6.4  1.00 : 
  1    1  16   16 :    88.4   7.3 :    88.4   7.3  1.00 : 
  1    2   8   16 :    69.7   9.3 :    69.7   9.3  1.00 : 
  1    4   4   16 :    65.8   9.8 :    65.8   9.8  1.00 : 
  1    8   2   16 :    64.1  10.1 :    64.1  10.1  1.00 : 
  1   16   1   16 :    65.7   9.8 :    65.7   9.8  1.00 : 
  2    2   8   16 :    72.6   8.9 :    72.6   8.9  1.00 : 
  4    4   4   16 :    61.5  10.5 :    61.5  10.5  1.00 : 
  2    2  16   32 :    71.3   9.1 :    71.3   9.1  1.00 : 
  2   32   1   32 :    38.9  16.6 :    38.9  16.6  1.00 : 
  4    4   8   32 :    55.5  11.7 :    55.5  11.7  1.00 : 
  8    8   4   32 :    41.3  15.6 :    41.3  15.6  1.00 : 
---------------------------------------------------------
[Attachment: plot.jpg - PWR scalability plot]

Re: Test case 2: PWR full-core calculation

Posted: Mon Jan 14, 2013 2:48 pm
by Jaakko Leppänen
This is pretty much what I expected -- the large number of tallies has a negative impact on scalability. As I mentioned in the introduction of this benchmark, this results from the excessive CPU time spent on processing the results between criticality cycles. I will try to see if some of that processing could be done in parallel. Another way to improve the scalability is to increase the batching interval, which means that the results are collected over multiple cycles before they are processed. I ran a quick test with 12 CPUs using OpenMP parallelization and a batching interval of 1:

Code:

set pop 500000 1000 200 1.0 1
and 10:

Code:

set pop 500000 1000 200 1.0 10
The running times that I got were 86 and 59 minutes, respectively, so increasing the batching interval will probably improve the scalability as well.

While doing these tests I noticed a bug in the calculation routines, which results in some biases and over-estimated statistical errors for some output variables. This will be fixed in the next update. The problem shouldn't affect the running times, so the results of these scalability tests should still be valid.

Re: Test case 2: PWR full-core calculation

Posted: Tue Nov 06, 2018 6:31 pm
by gavin.ridley.utk
Hi,

Is the input for this still available? I'd like to test it on our cluster.

Thanks,

Re: Test case 2: PWR full-core calculation

Posted: Wed Nov 07, 2018 1:29 pm
by Jaakko Leppänen
The server settings no longer allow directory listing, but the input file can be accessed at:

http://virtual.vtt.fi/virtual/montecarl ... _CORE/core