
Re: Threading with OpenMP

Posted: Wed Jan 19, 2011 12:13 am
by Jaakko Leppänen
Michal Kvasnicka wrote: Please, try to perform OpenMP Microbenchmark and post here the final results for your systems.
OK, I'll do that tomorrow. I did some test runs today and it seems that there is a lot of output. Is there anything particular I should look at?

Re: Threading with OpenMP

Posted: Wed Jan 19, 2011 11:13 am
by Michal Kvasnicka
These OpenMP Microbenchmarks (three separate tests) are intended to measure the overheads of synchronisation (first), loop scheduling (second) and array operations (third) in the OpenMP runtime library. So, each benchmark measures the overhead of OpenMP communication, but each test from a different point of view.

Regarding your needs, I propose starting with the first test.

These OpenMP benchmarks are programmed very well, so if you again obtain significant differences in the results between different HW (CPU) platforms, there must definitely be some real reason. Please check whether any "advanced" CPU BIOS options (intelligent CPU power-frequency management, etc.) are disabled on your PCs.

On the other hand, if the results are more or less identical on each HW platform, the reason for your (above mentioned) problems should be in your omptester code or somewhere else on your side... :)


P.S. But, if I understand correctly, your intention is to combine the OpenMP and MPI approaches to get "optimal" parallel performance. So, as the next step I propose running the HOMB benchmark, too. This benchmark was created exactly for evaluating this specific kind of task.

Re: Threading with OpenMP

Posted: Wed Jan 19, 2011 6:13 pm
by Jaakko Leppänen
I ran the tests on two machines. The results are:

array test with 1.2GHz Intel Core 2 CPU
schedule test with 1.2GHz Intel Core 2 CPU
sync test with 1.2GHz Intel Core 2 CPU

array test with 3.0GHz Intel Xeon 2 x quad-core
schedule test with 3.0GHz Intel Xeon 2 x quad-core
sync test with 3.0GHz Intel Xeon 2 x quad-core

The first set is from the laptop where I got good scalability in my own tests, and the second set is from the cluster with poor performance.

So what exactly am I looking at here?

PS. The question of optimization between OpenMP and MPI really comes down to memory issues. The current parallelization with MPI used in Serpent 1 is very simple and efficient. Shared memory and OpenMP is needed only to get the memory consumption to an acceptable level. I cannot imagine OpenMP actually outperforming MPI in this type of simulation.

Re: Threading with OpenMP

Posted: Tue Sep 20, 2011 11:31 am
by Jukka Mettälä
Hello !

I tried your test routine with my Intel Q6600 workstation, and the results were something like this:

test = 1
Threads: 1 check: 0.0E+00 time: 3.12 factor: 1.00
Threads: 2 check: 4.8E-14 time: 2.37 factor: 1.32
Threads: 3 check: 6.0E-14 time: 2.14 factor: 1.46
Threads: 4 check: 4.4E-14 time: 2.24 factor: 1.40

test = 2
Threads: 1 check: 0.0E+00 time: 0.39 factor: 1.00
Threads: 2 check: 5.1E-14 time: 0.33 factor: 1.19
Threads: 3 check: -5.7E-15 time: 0.23 factor: 1.68
Threads: 4 check: 1.2E-14 time: 0.22 factor: 1.75

test = 3
Threads: 1 check: 0.0E+00 time: 1.13 factor: 1.00
Threads: 2 check: 0.0E+00 time: 1.11 factor: 1.02
Threads: 3 check: 0.0E+00 time: 1.33 factor: 0.85
Threads: 4 check: 0.0E+00 time: 1.21 factor: 0.93

test = 4
Threads: 1 check: 0.0E+00 time: 1.00 factor: 1.00
Threads: 2 check: -NAN time: 0.51 factor: 1.96
Threads: 3 check: -NAN time: 0.34 factor: 2.95
Threads: 4 check: -NAN time: 0.27 factor: 3.65

So it seems that the performance problem is not only in your machine..


Re: Threading with OpenMP

Posted: Tue Sep 20, 2011 4:54 pm
by Jaakko Leppänen

What kind of system and compiler did you use in your calculations?

Re: Threading with OpenMP

Posted: Wed Sep 21, 2011 8:51 am
by Jukka Mettälä
Here's some info about the machine:

Intel Core2 quad Q6600 @ 2.4GHz (Dell Optiplex 755)
gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4)

I also did some googling and found, in the introduction part of some OpenMP pages, something suggesting that there could be performance issues in OpenMP itself.

Re: Threading with OpenMP

Posted: Wed Nov 02, 2011 4:00 pm
by nelsonag
I'm sure you are well past the stage where this post matters, but I just stumbled upon it now and thought I would throw my two cents in.

It looks to me like what is happening is 'false sharing'. This is a common issue in shared-memory programs and can happen in a parallelized loop as simple as this:

#pragma omp parallel for shared(a) schedule(static,1)
for (int i = 0; i < n; i++)
    a[i] += 1.0;   /* any write to a[i] will do */

So when thread 0 updates a[0], it then has to share its chunk of cache (which includes some number of neighbouring bytes of the array a) with all the other threads that hold it, to make sure it all matches. Well, it will not match, and so this sharing and updating of cache lines has to happen for all the interested threads.

This link explains some workarounds ( ... 06s07.html ), but unfortunately it's not so easy, since (as you encountered) it is very processor dependent: caching is a hardware-level thing and up to the chip designer.


Re: Threading with OpenMP

Posted: Thu Nov 03, 2011 12:40 am
by Jaakko Leppänen
Thanks for the link. Reading from and writing to global data that is accessed from multiple threads is something that cannot really be avoided, because cross sections need to be shared and the data is constantly accessed.

The third tip: "Don't store temporary, thread specific data in an array indexed by the thread id or rank", is something that is constantly violated, so that might be worth looking into.

The second and fourth tips I don't really understand. Is it even possible to tell the code which part of memory to read into the cache?

Re: Threading with OpenMP

Posted: Thu Nov 03, 2011 4:32 am
by nelsonag
Agreed on your first two points. As for the 2nd and 4th tips, the best I can come up with (and I am a nuclear engineer as well, not a computer scientist) is to waste memory: instead of storing variable a (from the example above) in a 1-D array, make it 2-D, and have the 2nd dimension be garbage (padding) with a size that matches the cache line size of the processor at hand.

I did a quick search to see how to determine the cache line size, and it's not easy. It seems it can be done, but not easily. At least on Linux machines there is a file in the /sys/devices/system/cpu/cpu0/cache/ directory which seems to have the necessary info (this is my reference: ... -line-size). I frankly think it can be a major waste of space to do that to every shared variable.
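For what it's worth, a quick way to read that value on Linux (the /sys path is the one mentioned above; the getconf fallback is my assumption and works on glibc systems):

```shell
# Query the cache line size in bytes (typically 64 on x86).
f=/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
if [ -r "$f" ]; then
    size=$(cat "$f")
else
    size=$(getconf LEVEL1_DCACHE_LINESIZE)
fi
echo "cache line size: $size bytes"
```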

I just had another idea, and I also put this in the 'could be very wrong since I am not a computer scientist' category: do you think it would be possible to do a 'const_cast' on the cross-section data (or whatever other problem parameters are constant after initialization), which would then allow the compiler to say 'OK, checking all the values in each thread's cache is no longer necessary, since I can assume they don't change'? This could allow you to avoid false sharing in certain cases.


Re: Threading with OpenMP

Posted: Thu Nov 10, 2011 1:32 am
by Jaakko Leppänen
After reading a few chapters from the ThreadSpotter documentation, I'm getting pretty convinced that the scalability issues in Serpent are caused by false sharing and the overhead from the cache operations. I have a few ideas on how to solve the problems, and I'll get back to the topic when I start working with the OpenMP mode again.

One thing that still puzzles me, though, is the hardware dependence. What is so different in the two machines that give me good scalability, compared to the other two, in which the scalability is really poor?