
Threading with OpenMP

Posted: Tue Jan 18, 2011 1:07 am
by Jaakko Leppänen
I've made some test calculations with OpenMP to figure out the best way to handle parallelization with shared memory (threading). The results are not exactly what I was hoping for, and this is a topic where I would welcome any suggestions and advice from more experienced OpenMP programmers. My expertise so far is limited to what is written in Wikipedia.

What I'm trying to do is to parallelize a simple loop in a test routine that picks energy points from the grid and checks that the sum of partial cross sections matches the total for each material, and that the material cross sections are below the majorant used for delta-tracking. This is essentially a test for some fundamental subroutines that are called each time the neutron path length and the reaction at the collision point are sampled. The calculations involve grid search and linear interpolation of cross sections between tabulated points. The operations are not computationally expensive, but they involve a lot of memory access.
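
To illustrate the kind of operation involved, a minimal sketch of such a lookup might look like this (hypothetical names, not the actual routines): a binary search for the energy interval, followed by linear interpolation between the tabulated points.

Code: Select all

#include <stdio.h>

/* Find i such that E[i] <= x < E[i + 1] by binary search (sketch) */

long GridSearch(const double *E, long N, double x)
{
  long lo = 0, hi = N - 1, mid;

  while (hi - lo > 1)
    {
      mid = (lo + hi)/2;

      if (x < E[mid])
        hi = mid;
      else
        lo = mid;
    }

  return lo;
}

/* Linearly interpolate cross section xs on energy grid E at energy x */

double InterpXS(const double *E, const double *xs, long N, double x)
{
  long i = GridSearch(E, N, x);
  double f = (x - E[i])/(E[i + 1] - E[i]);

  return xs[i] + f*(xs[i + 1] - xs[i]);
}

int main(void)
{
  double E[]  = {1.0, 2.0, 4.0, 8.0};
  double xs[] = {10.0, 20.0, 40.0, 80.0};

  /* Expect 30.0 at E = 3.0 */

  printf("xs(3.0) = %1.2f\n", InterpXS(E, xs, 4, 3.0));

  return 0;
}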

The problem is that I get very different results in terms of speed-up in different systems:

machine: Intel(R) Core(TM)2 CPU U7600 @ 1.20GHz (dual-core)
compiler: gcc 4.2.4

Code: Select all

Threads: 1 CPU time: 3.10 Wall-clock time: 3.11 (1.00 x)
Threads: 2 CPU time: 3.69 Wall-clock time: 1.87 (1.66 x)

machine: Intel(R) Xeon(R) CPU X5450 @ 3.00GHz (2 quad-core processors)
compiler: gcc 4.1.2

Code: Select all

Threads: 1 CPU time: 1.05 Wall-clock time: 1.05 (1.00 x)
Threads: 2 CPU time: 3.56 Wall-clock time: 1.79 (0.59 x)
Threads: 3 CPU time: 3.85 Wall-clock time: 1.28 (0.82 x)
Threads: 4 CPU time: 4.69 Wall-clock time: 1.18 (0.89 x)
Threads: 5 CPU time: 4.62 Wall-clock time: 0.92 (1.14 x)
Threads: 6 CPU time: 5.00 Wall-clock time: 0.83 (1.26 x)
Threads: 7 CPU time: 4.85 Wall-clock time: 0.69 (1.51 x)
Threads: 8 CPU time: 5.39 Wall-clock time: 0.67 (1.56 x)

machine: AMD Opteron(tm) Processor 252 (two 2.6 GHz CPUs)
compiler: gcc 4.4.3

Code: Select all

Threads: 1 CPU time: 1.51 Wall-clock time: 1.52 (1.00 x)
Threads: 2 CPU time: 2.97 Wall-clock time: 1.50 (1.01 x)

machine: 2.53 GHz Intel Core 2 Duo (Mac)
compiler: gcc 4.2.1

Code: Select all

Threads: 1 CPU time: 1.29 Wall-clock time: 1.29 (1.00 x)
Threads: 2 CPU time: 1.35 Wall-clock time: 0.69 (1.87 x)
So for two systems I get a good, close to a factor of two increase in speed (at least for two threads), and for two systems poor performance with hardly any speed-up at all. The dual quad-core Intel Xeon case especially bothers me, because the calculation runs significantly slower with two threads than with one, and the speed-up factor for 8 threads is only 1.56. There is also a clear difference in the CPU time required by the calculation: the two "good" cases take about the same CPU time regardless of the number of threads, but the two "bad" cases require increasingly more CPU time as the number of threads is increased.

I suspect that the problem is related to the way the routines access data in memory, but the question is: am I doing something wrong here, or is it just that memory-intensive calculations like this don't parallelize that well on some systems? And if so, is there anything that can be done about the problem?

Any ideas?

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 1:38 pm
by Michal Kvasnicka
First: a simple OpenMP code example would be a better starting point than a description of the problem.

Second: the Wikipedia OpenMP page is not a very good tutorial. Take a look at https://computing.llnl.gov/tutorials/openMP/, for example. The best source of relevant information is the following book: Using OpenMP - Portable Shared Memory Parallel Programming (http://mitpress.mit.edu/catalog/item/de ... &tid=11387).

Michal

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 3:28 pm
by Michal Kvasnicka
I would like to recommend the following "OpenMP Microbenchmark" for evaluating your PCs.

http://www2.epcc.ed.ac.uk/computing/res ... index.html

The benchmark is available for Fortran and C. Let us know what results you get.

Michal

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 3:53 pm
by Michal Kvasnicka
And finally, "Hybrid OpenMP MPI Benchmark": http://homb.sourceforge.net/

These benchmarks should help you decide which hardware and/or OS platform is suitable for your multi-core parallel programming, I hope.

How to write efficient parallelized OpenMP vs. MPI code is, of course, a completely different task; it is still more art than routine programming, but ... things change!

Michal

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 7:11 pm
by Jaakko Leppänen
Michal,

Thank you for the links, I'll look into those.

My current plan for combined MPI + OpenMP parallelization is to have two levels:

1) The calculation is divided into a number of MPI tasks at the beginning of the transport simulation and the results are combined at the end (this is the way things are done at the moment in Serpent 1)
2) Each criticality cycle is threaded using the "omp for" pragma, so that the neutron histories within the cycle are divided among the threads (a rough sketch is given below)

Something similar is done for other parts of the calculation, such as data processing and the solution of depletion equations.
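
To make the two-level scheme concrete, here is a rough sketch of how the structure might look (hypothetical names throughout; TrackHistory is just a dummy stand-in for the actual tracking routine, and the cycle and history counts are arbitrary):

Code: Select all

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Dummy stand-in for tracking a single neutron history */

double TrackHistory(long n)
{
  return (double)(n % 7);
}

/* Level 2: one criticality cycle, histories divided among OpenMP threads */

double RunCycle(long nhist)
{
  long n;
  double res = 0.0;

#pragma omp parallel for reduction(+:res)
  for (n = 0; n < nhist; n++)
    res += TrackHistory(n);

  return res;
}

int main(int argc, char **argv)
{
  int rank, ntasks;
  long cycle;
  double tally = 0.0, total = 0.0;

  /* Level 1: divide the calculation into MPI tasks */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

  /* Each task runs its own share of cycles / histories */

  for (cycle = 0; cycle < 100; cycle++)
    tally += RunCycle(10000);

  printf("task %d of %d: tally = %E\n", rank, ntasks, tally);

  /* Combine the results at the end of the transport simulation */

  MPI_Reduce(&tally, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    printf("total = %E\n", total);

  MPI_Finalize();

  return 0;
}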

Even though Monte Carlo calculation is ideally suited for parallelization, the unfortunate thing is that neutron transport differs from most problems discussed in textbooks and tutorials. The calculation doesn't involve CPU-intensive mathematical routines, but rather frequent access to reaction data stored in memory, which can amount to several megabytes or even gigabytes. For parallelization with MPI this means rather slow data transfer, but since the tasks communicate only at the beginning and end of the transport cycle, the speed-up is almost linear. The drawback is that the memory demand is multiplied by the number of tasks.

For OpenMP and shared memory things are a bit different. I suspect that the size of the data block and the way the data is organized in memory form a bottleneck when different threads try to access it. For some reason this behavior seems to be system-dependent (see the running times in my first post), which suggests that there might be something that can be done about it.

To reduce the complexity of the problem to a minimum, I wrote another simple test routine that allocates a large block of memory, fills it with random numbers, and performs simple operations that are timed.

omptester.c:

Code: Select all

#include "header.h"
#include "locations.h"

#define FUNCTION_NAME "OmpTester:"

void OmpTester()
{  
  long max_threads, ptr, n, nt, N, test;
  double t0, sum, sum0, *res;

  /* Reset variables */

  sum = 0.0;
  sum0 = 0.0;
  t0 = 0.0;

  /* Initialize data structure */

  InitData();

  /* Set block size */

  N = 100000000;

  /* Allocate memory */

  ptr = ReallocMem(DATA_ARRAY, N);

  /* Fill data block with random numbers */

  for (n = 0; n < N; n++)
    DATA[ptr + n] = RandF();
  
  /* Get maximum number of threads */

  max_threads = omp_get_max_threads();

  /* Allocate memory for results array */

  res = (double *)calloc(max_threads, sizeof(double));

  /* Loop over threads */

  for (nt = 0; nt < max_threads; nt++)
    {
      /* reset results array */

      for (n = 0; n < max_threads; n++)
        res[n] = 0.0;

      /* Set number of threads */

      omp_set_num_threads(nt + 1);

      /* Reset and start timer */

      ResetTimer(0);
      StartTimer(0);

      /* Set test mode */

      test = 4;

      /* Parallelized loop */

#pragma omp parallel
      {

#pragma omp for

        for (n = 0; n < N; n++)
          {
            /* Operations */

            if (test == 1)
              res[omp_get_thread_num()] += erf(DATA[ptr + n]);
            else if (test == 2)
              res[omp_get_thread_num()] += DATA[ptr + n];
            else if (test == 3)
              res[omp_get_thread_num()] += erf(n);
            else if (test == 4)
              erf(n);
          }
      }

      /* Stop timer */

      StopTimer(0);

      /* Calculate sum */

      sum = 0.0;
      for (n = 0; n < max_threads; n++)
        sum = sum + res[n];

      /* Save single-thread time and sum */

      if (nt == 0)
        {
          t0 = TimerVal(0);
          sum0 = sum;
        }

      /* Print */

      printf("Threads: %ld check: %8.1E time: %1.2f speed-up factor: %1.2f\n",
             nt + 1, sum/sum0 - 1.0, TimerVal(0), t0/TimerVal(0));
    }

  /* Free memory */
  
  FreeMem();
  free(res);

  /* Terminate calculation */

  exit(-1);
}
The routine was written for the new version, but it should compile with the Serpent 1 source code as well, because the function names are the same. The compiler (gcc) requires the additional -fopenmp flag to handle the pragmas. To run the test, a call to the subroutine must be placed in main.c before anything else.
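
For reference, compiling the test routine might then look something like this (illustrative; only the relevant flag is shown):

Code: Select all

gcc -O2 -fopenmp -c omptester.c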

I tried the routine with four operations, set by the "test" variable:

1) Calculate erf() of a value stored in the data block and add to sum
2) Add value stored in data block to sum
3) Calculate erf() of loop index and add to sum
4) Calculate erf() of loop index

Options 1 and 2 involve access to the large data block, option 3 only to the small results array, and option 4 to no data at all. The error function is there just to add some number crunching. The sum is calculated to ensure that the result is the same regardless of the number of threads. The value in the output after "check:" should be close to zero (or NAN with option 4, since nothing is summed).

Here's the output for some test runs (3 GHz Intel Xeon, two quad-core processors):

Code: Select all

test = 1
Threads: 1 check:  0.0E+00 time: 2.07 speed-up factor: 1.00
Threads: 2 check:  5.7E-14 time: 2.32 speed-up factor: 0.89
Threads: 3 check:  6.5E-14 time: 2.11 speed-up factor: 0.98
Threads: 4 check:  5.3E-14 time: 1.96 speed-up factor: 1.06
Threads: 5 check: -9.7E-14 time: 1.99 speed-up factor: 1.04
Threads: 6 check:  3.2E-14 time: 2.34 speed-up factor: 0.88
Threads: 7 check: -2.8E-14 time: 1.88 speed-up factor: 1.10
Threads: 8 check: -1.1E-15 time: 1.75 speed-up factor: 1.18

Code: Select all

test = 2
Threads: 1 check:  0.0E+00 time: 0.52 speed-up factor: 1.00
Threads: 2 check:  7.2E-14 time: 0.64 speed-up factor: 0.81
Threads: 3 check:  4.9E-14 time: 0.70 speed-up factor: 0.74
Threads: 4 check:  5.2E-14 time: 0.74 speed-up factor: 0.70
Threads: 5 check:  5.0E-14 time: 0.82 speed-up factor: 0.64
Threads: 6 check:  5.6E-14 time: 0.94 speed-up factor: 0.55
Threads: 7 check:  5.5E-14 time: 0.80 speed-up factor: 0.65
Threads: 8 check:  5.5E-14 time: 0.69 speed-up factor: 0.75

Code: Select all

test = 3
Threads: 1 check:  0.0E+00 time: 1.01 speed-up factor: 1.00
Threads: 2 check:  0.0E+00 time: 0.66 speed-up factor: 1.53
Threads: 3 check:  0.0E+00 time: 1.10 speed-up factor: 0.91
Threads: 4 check:  0.0E+00 time: 0.84 speed-up factor: 1.20
Threads: 5 check:  0.0E+00 time: 1.08 speed-up factor: 0.93
Threads: 6 check:  0.0E+00 time: 1.56 speed-up factor: 0.64
Threads: 7 check:  0.0E+00 time: 1.33 speed-up factor: 0.75
Threads: 8 check:  0.0E+00 time: 1.22 speed-up factor: 0.83

Code: Select all

test = 4
Threads: 1 check:      NAN time: 0.70 speed-up factor: 1.00
Threads: 2 check:      NAN time: 0.35 speed-up factor: 2.00
Threads: 3 check:      NAN time: 0.24 speed-up factor: 2.96
Threads: 4 check:      NAN time: 0.18 speed-up factor: 4.01
Threads: 5 check:      NAN time: 0.14 speed-up factor: 5.00
Threads: 6 check:      NAN time: 0.12 speed-up factor: 5.78
Threads: 7 check:      NAN time: 0.10 speed-up factor: 6.84
Threads: 8 check:      NAN time: 0.09 speed-up factor: 7.58
So again I get very poor performance whenever data is read or written. What amazes me is that even mode 3 doesn't show any speed-up, even though the value of erf(n) is stored in an array of only 8 values. The increase in speed is linear only if the value is not stored anywhere. This doesn't really make any sense, so I assume I am doing something wrong...

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 9:16 pm
by Michal Kvasnicka
You seem to be missing

#include <omp.h>

in your test code...

Michal

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 10:21 pm
by Michal Kvasnicka
Your parallelized loop should look like this:

Code: Select all

/* Parallelized loop */
#pragma omp parallel reduction(+:res)
       for (n = 0; n < N; n++)
         {
           /* Operations */

           if (test == 1)
             res[omp_get_thread_num()] += erf(DATA[ptr + n]);
           else if (test == 2)
             res[omp_get_thread_num()] += DATA[ptr + n];
           else if (test == 3)
             res[omp_get_thread_num()] += erf(n);
           else if (test == 4)
             erf(n);
         }

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 11:12 pm
by Jaakko Leppänen
If I try to use res in the reduction clause, I get a compiler error:

Code: Select all

omptester.c: In function 'OmpTester':
omptester.c:86: error: 'res' has invalid type for 'reduction'
make: *** [omptester.o] Error 1
But it works if I use the sum variable directly. I also get much better performance for modes 1 and 3. For mode 2 the speed-up factor is about 2.6 with 8 threads, so the behavior is similar to what I see in my other test routines. The actual transport simulation involves some computation as well, so the performance there may be better.
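
For reference, the variant that compiles and runs might then look something like this (a sketch for test mode 1, reducing directly into sum):

Code: Select all

      /* Parallelized loop, reducing into the scalar sum instead of the array */

      sum = 0.0;

#pragma omp parallel for reduction(+:sum)
      for (n = 0; n < N; n++)
        sum += erf(DATA[ptr + n]);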

The problem is that using private variables is not very practical for storing tally data. The number of values is very large, which makes things complicated and increases the memory demand.

In any case I probably need to have a working tracking routine before I can say anything about the actual performance. I'll continue working with other things and return to this issue later.

Thanks for your help!

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 11:29 pm
by Jaakko Leppänen
Still, there remains the question of why the speed-up factor was significant for two systems and poor for the other two, even though the test routine was the same.

Re: Threading with OpenMP

Posted: Tue Jan 18, 2011 11:51 pm
by Michal Kvasnicka
OK ... well, there are a lot of possible explanations:
- CPU caching and RAM access behavior differ between CPU types
- memory-to-core and inter-core communication and the CPU cache hierarchy (L1, L2, ...) differ as well
- the gcc compiler still produces somewhat odd binaries that do not perform equally well on different CPUs (for serious parallel programming I recommend the Intel or PGI C/C++ compilers)
- etc.

Please try to run the OpenMP Microbenchmark and post the final results for your systems here.

Michal