Often when running ParFlow on a cluster it is common to wonder about parallel speedup. One of the really nice things about ParFlow is how easy it is to run in parallel. First of all, ParFlow can be run in parallel simply by changing the process topology keys,
P
,
Q
and
R
. The computational domain is split up according to those values,
nx/
P
,
ny/
Q
and
nz/
R
for a total of
P
*
Q
*
R
processors.
There are a number of reasons to run in parallel, faster execution times or to run a problem that is too large to fit in the memory of a single machine. It is also important to know what an optimal number of processors is for your particular problem. This can be very helpful in setting up a run and there are a few metrics that can help determine this.
A common approach (e.g.
Kollet and Maxwell, 2006;
Jones and Woodard, 2001) is to use scaled efficiency,
E, is defined as
E(n,p) = T(n, 1)/T(pn,p), where
T is the run time (i.e. “wall clock” time) as a function of the problem size,
n, and the number of processors is
p. The important factor here is that the problem grows linearly with the number of processors, so the number of compute cells for each processor always stays the same. If everything is perfect, parallel code and parallel hardware,
E(n,p) = 1: doubling the problem size and the number of processors will result in the same run time.
The world is not perfect, of course, and many factors influence scaling efficiency. The biggest factors are things that don’t speed up in parallel (e.g. problem set up and inter-processor communication) compared to things that do speed up in parallel. The biggest factor is usually the amount of time the code and architecture spend communicating between processors versus the amount of time a processor spends solving a portion of the problem. This boils down to how big a problem you are solving and how much of that problem each processor has. In other words, if a problem contains a large number of cells and each processor has a large number of cells, scaling will be much more efficient than if a processor has a fewer number of cells because each cpu will spend more time computing and less time communicating. This is illustrated by the graph below:
In this graph scaled parallel efficiency plotted as a function of processor number for small and large problems sizes (from
Kollet and Maxwell, 2006, used w/ permission). We see that the larger problem reaches a stable scaled efficiency after just a few processors and maintains this efficiency out to 100 processors. While the small problem size shows much poorer efficiency and does not reach a stable value. This is due to the ratio of work the system does communicating compared to the work the system does solving the problem.
It is very easy to run a scalability study with ParFlow. Below I’ll modify the
harvey_flow.tcl
script and use tcl variables to automatically increase the problem size and the number of processors simultaneously.
At the beginning the of the script, we can add a scaling variable using tcl, I’m just scaling the number of cells, the number of processors and the domain size in
y but this could be done in more than one dimension easily. We do this by adding the tcl variable
sy
and setting the processor topology using it:
set sy 1
# Process Topology
pfset Process.Topology.P 1
pfset Process.Topology.Q $sy
pfset Process.Topology.R 1
Then we set the computational grid as a multiplier of
sy
:
pfset ComputationalGrid.NX 50
pfset ComputationalGrid.NY [ expr 100*$sy ]
pfset ComputationalGrid.NZ 100
Then the domain, upper and lower aquifer geometries:
pfset Geom.domain.Upper.Y [ expr 34.2*$sy ]
...
pfset Geom.upper_aquifer.Upper.Y [ expr 34.2*$sy ]
...
pfset Geom.lower_aquifer.Upper.Y [ expr 34.2*$sy ]
...
We now have a problem that increases the problem size and the number of processors by changing a single variable. Lastly, we can delete the looping/Monte Carlo portion of the original script and replace it with a run command that uses
sy
so that our output, namely the log file which contains the run time information, will contain the processor/problem size information:
pfrun pc_scale.$sy
pfundist pc_scale.$sy
I ran this same script on my laptop (a dual-core cpu) and on my linux cluster to look at scaling efficiency. The run times are contained in the log file, named
pc_scale.out.log
if you use my
pfrun
command syntax above. The results are below:
We can see that for this problem the scaling on the cluster is pretty good out to 8p and that it is quite a bit better than using one or two cores of my laptop. The scaled efficiency comparable to, but not as good as the plot I showed above from Kollet and Maxwell however, meaning the communication overhead might be still large and we might benefit from running an even larger problem (500,000 cells/cpu is not that many, really). This curve does tell us that we can expect to see reasonable decreases in simulation time out to eight processors of the cluster.
We can run another test that is useful, particularly if we have a problem of a given size in mind and want to determine how many CPU’s to run it on. I used the same script, modified as above, but I increase the number of processors but keep the problem size constant at 500,000. You can do this by keeping
sy
at 1 and changing the
Q processor topology manually, for example. When I do this on my laptop or cluster, I plot scaled execution time (time to run on
n cpus / time to run on 1 cpu) as a function of number of processors. This is plotted below:
In this plot we clearly see the point where splitting the problem up on additional processors yields no decrease in execution time. In fact, moving from 4 to 8p slightly
increases execution time. We can also see differences in the decrease in execution time from using 1-2 cores on the laptop as opposed to more processors on the cluster. These results depend
entirely on problem size (recall, that 500,000 cells is a pretty small problem to begin with), computer architecture and load balancing. Running a larger problem would change these numbers (as demonstrated by the scaled efficiency graph above) running on different machines changes these numbers (e.g. laptop v. cluster) and how the problem is split up (in
x,
y or
z) will also change these numbers. How the problem is split up, or allocated to different processors (i.e. load balancing) needs to account for various aspects of the problem being solved. This depends quite a lot on the physics of the problem. For the example I’ve shown here we are solving fully-saturated flow using the linear solver and there are not big differences in problem physics over the domain. This changes if we solve using the
Richards’ equation nonlinear solver, where parts of the problem are linear (below the water table) and parts of the problem are non-linear (above the water table). This point is made very clearly in
Jones and Woodward (2001), where scaling efficiencies splitting a problem in
x and
y are shown to be much better than those from splitting the problem in
z. It is easy to imagine situations using the overland flow boundary condition that would be even more sensitive to this, for example if the problem is split in
x and
y but with a river on some processors (yielding extra per-processor work) but not others. The best approach is to run your problem or something very much like it for a few trial runs to determine an optimal number of processors and topology. I hope this provides some background information to get things started.