MPI vs OMP

Parallelization with OpenMP and MPI, scalability, reproducibility, errors, problems, suggestions

MPI vs OMP

Post by orca.blu » Thu Mar 01, 2012 9:31 pm

Some running-time results attached.
I am trying quite long burnup calculations with Serpent 2.
With only 2 burn zones, opt 3 seems to be the best choice.
I have 32 cores and 60 GB of memory (i.e., no RAM limitation).
I tried different combinations of OMP and MPI, looking for the best solution.
I thought that [32 MPI_TASKS * 1 OMP_THREADS] would be the best solution by far.
This is not the case. Maybe the reason is that with many MPI_TASKS a lot of time is wasted
waiting for all the tasks to finish after each transport cycle?
It seems that in my case the best option is to have a similar number of MPI_TASKS and OMP_THREADS
(e.g., 8 MPI_TASKS * 4 OMP_THREADS).
Up to 8 OMP_THREADS, transport_cpu_usage is close to 1.
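
As a minimal sketch of what such a hybrid layout means (generic MPI+OpenMP code, not Serpent itself; it assumes an MPI compiler wrapper with OpenMP support and OMP_NUM_THREADS set in the environment), each MPI task runs its own team of OpenMP threads:

/* hybrid_check.c - report how MPI tasks and OpenMP threads share the node.
 * Build: mpicc -fopenmp hybrid_check.c -o hybrid_check
 * Run:   mpirun -np 8 ./hybrid_check   (with OMP_NUM_THREADS=4 exported)
 */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, ntasks;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Each task reports its own thread team; total workers = tasks * threads. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("task %d of %d: %d OpenMP threads\n",
               rank, ntasks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Launched with 8 tasks and OMP_NUM_THREADS=4, it reports 8 x 4 = 32 workers, i.e. one per core on this machine.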
[Attachment: RUNNING_TIME.png - running time]
[Attachment: CPU_USAGE.png - CPU usage]
[Attachment: MEMSIZE.png - memory size]
Details:
45 burnup steps
45 x 2 (predictor-corrector) transport cycles
5.5e6 total neutrons per cycle
dots: 10000 x (500 + 50 skip)
solid lines: 25000 x (200 + 20 skip)
MPI_REPRODUCIBILITY 0
OMP_REPRODUCIBILITY 1
OPTIMIZATION_MODE 3
"LELI" [10 10]
xscalc 2
no ures
case: molten salt fast reactor

Note:
The [32 MPI_TASKS * 1 OMP_THREADS] calculations were performed after recompiling the code
without the OpenMP parallel options.
Manuele Aufiero
LPSC/IN2P3/CNRS Grenoble


Re: MPI vs OMP

Post by Jaakko Leppänen » Fri Mar 02, 2012 1:02 am

Thank you for the results!

Like you said, with only 2 burnable material regions optimization mode 3 is probably the best choice. Since the parallelization of burnup and processing routines is (mostly) handled by dividing the materials between MPI tasks and OpenMP threads, the speed-up factor for those routines cannot be much higher than 2. What kind of values did you get for processing time (PROCESS_TIME)?
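
Schematically (just an illustration of the idea, not the actual source), dealing the burnable materials out over the parallel workers looks something like the loop below, and with only two materials at most two workers ever get any work:

/* Schematic decomposition of burnable materials over parallel workers
 * (an illustration only, NOT Serpent's actual code): material m goes to
 * task (m % ntasks). With N_MAT = 2 only two tasks ever get any work,
 * so the burnup/processing speed-up saturates near 2.
 */
#include <stdio.h>

#define N_MAT 2   /* burnable materials in this case */

static void process_material(int m) { printf("  material %d\n", m); }

int main(void)
{
    int ntasks = 8;   /* MPI tasks (illustrative) */

    for (int task = 0; task < ntasks; task++) {
        printf("task %d handles:\n", task);
        /* round-robin: this task owns materials task, task+ntasks, ... */
        for (int m = task; m < N_MAT; m += ntasks)
            process_material(m);
    }
    return 0;
}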

A lot of data is passed between the MPI tasks, especially during the processing and burnup routines, and the overhead should be proportional to the number of tasks. In this case it doesn't seem to have a major impact. Since MPI reproducibility is switched off, the transport routine doesn't involve any communication between the tasks until the end.

The moderate performance of MPI parallelization might result from the fact that some tasks are slower than others, and the calculation has to wait for the slowest to complete. Another factor is that since the population size is divided by the number of tasks, some output routines become relatively more time-consuming. This is especially the case for the mesh plotter. Do you have any large mesh plots in the calculation?
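
As an illustration of the waiting effect (again a generic MPI sketch, not code from Serpent), each task can time its own transport work and then the idle time it spends at the synchronization point:

/* Generic sketch of measuring per-task load imbalance with MPI (not code
 * from Serpent): each task times its own transport work, then the time it
 * spends idle at the synchronization point shows how long it waited for
 * the slowest task.
 */
#include <stdio.h>
#include <mpi.h>

/* stand-in for the per-task transport work */
static void transport_cycle(int rank) { (void)rank; /* ... track histories ... */ }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0   = MPI_Wtime();
    transport_cycle(rank);               /* useful work */
    double work = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Barrier(MPI_COMM_WORLD);         /* wait for the slowest task */
    double wait = MPI_Wtime() - t0;

    printf("task %d: work %.3f s, waited %.3f s\n", rank, work, wait);

    MPI_Finalize();
    return 0;
}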

Did you run the calculations with or without the debug option, and have you compared the running times to a single-CPU calculation?
- Jaakko


Re: MPI vs OMP

Post by orca.blu » Fri Mar 02, 2012 2:51 am

I can't access my data now.
As far as I remember, I always got PROCESS_TIME on the order of a few minutes (X.XXXE+00).
(Not sure)

The compilation was done without the DEBUG or GD options.

In the average CPU utilization graph (attached), in the 32 MPI * 1 OMP calculation,
it seems as if all the CPUs are continuously working at 100%...
so... what are they doing? :)
[Attachment: cpu.jpg - CPU utilization]
Anyway, I think that Serpent has very good parallelization capabilities,
at least for my needs (I know that 2 burning materials are not representative).
I'll try opt 4 with as many MPI_TASKS as I can next week.
Manuele Aufiero
LPSC/IN2P3/CNRS Grenoble


Re: MPI vs OMP

Post by Jaakko Leppänen » Fri Mar 02, 2012 11:04 am

The CPU utilization could be explained by the fact that in MPI mode all CPUs are constantly working, except when they are waiting for the others. In OpenMP parallelization, only parts of the calculation are parallelized, and the rest is handled by a single CPU.
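
A simple Amdahl-type estimate illustrates the difference (the parallel fraction of 0.95 below is an assumed number, not one measured from these runs):

/* Amdahl-type estimate of average CPU utilization when only a fraction p
 * of the run is OpenMP-parallel and the rest runs on a single thread.
 * p = 0.95 is an assumed value, not one measured from these runs.
 */
#include <stdio.h>

int main(void)
{
    double p = 0.95;                          /* parallel fraction (assumed) */

    for (int n = 1; n <= 32; n *= 2) {
        double time    = (1.0 - p) + p / n;   /* run time, serial run = 1      */
        double speedup = 1.0 / time;
        double usage   = speedup / n;         /* average utilization of n CPUs */
        printf("%2d threads: speed-up %5.2f, average CPU usage %.2f\n",
               n, speedup, usage);
    }
    return 0;
}

With a fixed serial part, the average utilization of n CPUs falls further below 1 as n grows, whereas in MPI mode every task keeps its own CPU busy.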
- Jaakko


Re: MPI vs OMP

Post by orca.blu » Fri Mar 02, 2012 2:25 pm

OK, thank you.

PROCESS_TIME(last idx,1) is always less than 3 min for every case except
[32 MPI * 1 OMP with the executable compiled with OMP parallelization ON]
(not presented in the previous post).

In that case, total PROCESS_TIME increases considerably (from 2.5 min to 36 min).
Manuele Aufiero
LPSC/IN2P3/CNRS Grenoble


Re: MPI vs OMP

Post by Jaakko Leppänen » Fri Mar 02, 2012 2:48 pm

As a last-minute addition to update 2.1.3, I added a timer for MPI routines (MPI_OVERHEAD_TIME in the _res.m output). I'm not 100% sure this measures the true overhead from MPI parallelization, but it should give some estimate of the time spent on communication between the tasks and on waiting for the slowest task to complete. To be precise, the waiting time is actually calculated relative to task 0, so if the other tasks finish earlier, the waiting time is also zero.
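
Conceptually (a sketch of the general idea only; the actual bookkeeping behind MPI_OVERHEAD_TIME may differ), such a counter can be accumulated by wrapping the MPI calls in a timer, so that both the data transfer and the wait for the slowest task end up in the same total:

/* Sketch of accumulating an MPI overhead counter (general idea only; the
 * bookkeeping behind MPI_OVERHEAD_TIME may differ): wrapping a collective
 * in a timer lumps the data transfer and the wait for the slowest task
 * into the same total.
 */
#include <stdio.h>
#include <mpi.h>

static double mpi_overhead = 0.0;   /* accumulated over the whole run */

static void timed_allreduce(double *buf, int n)
{
    double t0 = MPI_Wtime();
    MPI_Allreduce(MPI_IN_PLACE, buf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    mpi_overhead += MPI_Wtime() - t0;   /* waiting + communication */
}

int main(int argc, char **argv)
{
    double tally[4] = {1.0, 2.0, 3.0, 4.0};   /* dummy results to combine */
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    timed_allreduce(tally, 4);

    if (rank == 0)
        printf("MPI overhead: %.3f s\n", mpi_overhead);

    MPI_Finalize();
    return 0;
}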
- Jaakko


Re: MPI vs OMP

Post by orca.blu » Fri Mar 02, 2012 8:35 pm

I tried a shorter burnup case (5x2 steps) with the latest sss2 version.

MPI_OVERHEAD_TIME always seems to be quite low.
[Attachment: TIMES.png - times]
