**Leslie3D**

Leslie3D is a “Large Eddy Simulation” code that performs finite differences on a 3-D grid. Leslie3D uses 3-D decomposition so the mesh is divided among the processors as small cubes.
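As a rough sketch of how such a 3-D decomposition assigns a small cube to each rank (the processor-grid shape and the helper below are illustrative, not Leslie3D's actual code; the example grid dimensions are assumptions):

```python
def local_cube(global_dims, proc_grid):
    """Points per side of each rank's sub-block for a 3-D decomposition.

    global_dims: (nx, ny, nz) global grid points per dimension
    proc_grid:   (px, py, pz) ranks per dimension; px*py*pz = total ranks
    Assumes each dimension divides evenly (an illustrative simplification).
    """
    return tuple(n // p for n, p in zip(global_dims, proc_grid))

# Hypothetical example: a 240 x 240 x 120 grid on 2048 = 16 x 16 x 8 ranks
# gives each rank a 15 x 15 x 15 cube, matching the size quoted later.
print(local_cube((240, 240, 120), (16, 16, 8)))  # (15, 15, 15)
```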

Following is a scaling chart for running the application up to 2048 cores.

We see that the scaling departs from linear at around 512 processors. Looking at the scaling of individual routines reveals a couple of potential scaling bottlenecks. This is an example of “strong scaling”; that is, the problem size stays the same, and as the number of processors increases, the work per processor decreases.

In fact, all of the computational routines are obtaining superlinear speedup, attaining factors as high as 51 going from 64 to 2048 cores. This is what is termed “superlinear scaling”: as the processor count increases, the performance increases by an even larger factor. In this case, one of the routines sped up by a factor of 51 while the processor count increased only by a factor of 32. The reason for this phenomenon is that as the working set on each processor becomes smaller and smaller, it fits into cache, and array references that require actual memory transfers are reduced significantly.
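The superlinear numbers quoted above can be checked with a little arithmetic (figures taken from the text):

```python
base_ranks, scaled_ranks = 64, 2048
observed_speedup = 51.0                    # fastest routine, from the text

rank_ratio = scaled_ranks / base_ranks     # 32x more processors
efficiency = observed_speedup / rank_ratio # parallel efficiency

# Efficiency above 1.0 (100%) is the signature of superlinear scaling,
# here driven by the shrinking per-rank working set fitting in cache.
print(rank_ratio, efficiency)  # 32.0 1.59375
```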

So the computation is scaling extremely well, and the load balance of the application is good and never really changes as we scale up. One large inefficiency occurs in gridmap at the higher processor counts. Fortunately, this routine is only called during initialization, so if we run the computation for more timesteps, its time will not grow. Unfortunately, the grid generation for a run at 4096 cores took almost an hour to complete. There are cases where initialization time grows exponentially with processor count. For very large problem sizes, we may have to worry about the time to initialize the grid. In this particular case, the grid partitioning is not parallelized. There are parallel grid partitioning algorithms; however, their value is very problem dependent.

If we ignore the initialization, the other issue impacting the scaling is the MPI routines. While the time they take is not increasing significantly with core count, it is also not decreasing, so their percentage of the total time increases and the MPI calls start to dominate at larger processor counts. At 64 processors, MPI takes 8.7% of the total time; at 2048, it takes 26% of the time. The actual time taken by the MPI routines drops only from 50 seconds on 64 processors to 22 seconds on 2048 processors. Since the computation scales so much better than the communication, the communication quickly becomes the bottleneck.
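The MPI figures above imply total runtimes that we can back out as a consistency check on the quoted numbers:

```python
# MPI time and its share of total runtime, as quoted in the text
mpi_64, share_64 = 50.0, 0.087      # seconds, fraction at 64 ranks
mpi_2048, share_2048 = 22.0, 0.26   # seconds, fraction at 2048 ranks

total_64 = mpi_64 / share_64        # ~575 s implied total at 64 ranks
total_2048 = mpi_2048 / share_2048  # ~85 s implied total at 2048 ranks

# Communication shrinks only ~2.3x while the total shrinks ~6.8x,
# which is why MPI's share of the runtime climbs from 8.7% to 26%.
print(round(total_64), round(total_2048))
```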

Given these statistics, we can conclude that the problem size, which is fixed in this case, is becoming too small. At 2048 processors, each processor holds a cube only 15 points on a side. This is definitely a case where we need a larger problem.
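To see why a 15-point cube is so small, consider its memory footprint (the variables-per-cell count below is an assumption for illustration, not a figure from Leslie3D):

```python
cells = 15 ** 3            # per-rank sub-cube at 2048 ranks (from the text)
vars_per_cell = 20         # assumed number of double-precision fields per cell
bytes_per_var = 8          # double precision

working_set = cells * vars_per_cell * bytes_per_var
print(working_set / 1024)  # ~527 KiB: small enough to live in on-chip cache
```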