Programming for Hybrid Multi Many-core MPP Systems

Chapter 5 Excerpts

5.1 Gathering Statistics for a Large Parallel Program

The larger and more complex the application, the more difficult it is to gather statistics about its execution. There are numerous statistics-gathering approaches. The first is traditional profiling: either a pre-processor or the compiler instruments the binary to gather summary information on every subroutine and function called during the execution of the program. The typical default for profiling tools is to instrument everything. The problem with this approach is that the instrumentation introduces overhead that can skew the results for very small subroutines and functions. Additionally, an application that makes a very large number of calls to small routines may run significantly longer when instrumented.
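To see why instrumentation penalizes tiny routines, consider the following schematic sketch (hypothetical code, not how any particular tool is implemented) of an entry/exit timing hook around a small kernel. When the body of the routine is only a few operations, the two clock reads cost roughly as much as the work being measured, so the reported time is inflated and the whole run slows down.

      ! Schematic illustration of routine-level instrumentation (hypothetical).
      subroutine instrumented_small_kernel(x, y, n)
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: x(n)
        real(8), intent(out) :: y(n)
        integer(8) :: t0, t1, rate
        integer :: i
        call system_clock(t0, rate)   ! entry hook: read the clock
        do i = 1, n                   ! the routine's actual work
           y(i) = exp(x(i))
        end do
        call system_clock(t1)         ! exit hook: read the clock again
        ! (t1 - t0)/rate would be accumulated into this routine's summary
        ! record; for a routine this small the hooks dominate the measurement.
      end subroutine instrumented_small_kernel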

A second approach is to sample the executing program periodically to see where it is spending its time. Sampling does not introduce as much overhead, and it has the added benefit of gathering information at a finer granularity than routine boundaries: with sampling it is possible to get line-level execution profiling without additional overhead. The disadvantage of this approach is that you do not get the detailed per-call information that the traditional profiling discussed above provides.

Developers at Cray Inc. came up with an interesting combination of the two approaches that gives you the best of both (Ref 10). Unfortunately, you do have to run your application twice: first using the sampling approach and then using a filtered profiling approach. The first run uses sampling to identify the most important routines in the application; it also identifies the small routines that use a minuscule amount of time on each call, so that they can be excluded from instrumentation. This first run gives a nice overall summary of the execution of the application. Consider the following sampling profile obtained from running a combustion application on 6000 processors. Notice the time is broken into three sections: MPI time, User time, and ETC. The MPI time is broken down into the MPI routines that account for the majority of the time, in this case MPI_WAIT and MPI_WAITALL. Next, the User time, which accounts for 35.5% of the overall runtime, is broken down into the individual routines that make up that time. Finally, ETC lists library routines and the time they use. By default, no element is displayed that accounts for less than 1% of the time. The first column in the table gives the percentage of samples, the second the number of samples, the third the imbalance of the sampling across all 6000 processors, and the fourth the imbalance percentage. The table also shows one of the weaknesses of sampling: there is no way of knowing which routines call the important library routines.

Table 1:  Profile by Function

 Samp % |  Samp | Imb. Samp |   Imb. |Experiment=1
        |       |           | Samp % |Group
        |       |           |        | Function
        |       |           |        |  PE='HIDE'

 100.0% | 89319 |        -- |     -- |Total
|------------------------------------------------
|  46.8% | 41827 |        -- |     -- |MPI
||-----------------------------------------------
||  43.8% | 39138 |   4328.78 |  10.0% |mpi_wait_
||   1.0% |   864 |   3287.25 |  79.2% |mpi_waitall_
||===============================================
|  35.5% | 31720 |        -- |     -- |USER
||-----------------------------------------------
||   5.1% |  4560 |    380.49 |   7.7% |mcavis_new_looptool_
||   4.1% |  3632 |   1000.14 |  21.6% |rhsf_
||   2.8% |  2499 |    203.94 |   7.5% |diffflux_proc_looptool_
||   2.3% |  2070 |    162.47 |   7.3% |rdot_
||   1.9% |  1721 |    173.56 |   9.2% |ratx_
||   1.8% |  1617 |    392.03 |  19.5% |integrate_erk_jstage_lt_
||   1.7% |  1534 |    161.81 |   9.5% |rdsmh_
||   1.5% |  1333 |    325.02 |  19.6% |computeheatflux_looptool_
||   1.4% |  1286 |    122.76 |   8.7% |ratt_
||   1.4% |  1237 |    158.32 |  11.4% |chemkin_m_reaction_rate_bounds_
||   1.0% |   883 |    105.70 |  10.7% |qssa_
||   1.0% |   870 |    313.07 |  26.5% |thermchem_m_calc_inv_avg_mol_wt_
||===============================================
|  17.7% | 15772 |        -- |     -- |ETC
||-----------------------------------------------
||  11.6% | 10347 |    416.23 |   3.9% |__fmth_i_dexp_gh
||   1.5% |  1360 |    142.68 |   9.5% |__fvdexp_gh
||   1.4% |  1217 |    277.71 |  18.6% |__c_mcopy8
||   1.0% |   858 |    121.15 |  12.4% |__fmth_i_dlog10_gh
|================================================

Several important facts can be drawn from the table. At this processor count the application is slightly computation-dominant: MPI uses less than 50% of the time. When we delve into optimizing this application, we therefore want to look first at how to improve the computation. There is also significant load imbalance in some of the computational routines. A second piece of information we can obtain from sampling is the actual source line responsible for the most time. The following table gives these line-level details for this run.

|  28.9% | 76550 |        -- |     -- |USER
||-------------------------------------------------
||   4.2% | 11238 |        -- |     -- |mcavis_new_looptool_
3|        |       |           |        | source/f77_files/johnmc/mcavis_new_lt_gen.f
||||-----------------------------------------------
4|||  4.0% | 10616 |  1010.30 |   8.7% |line.103
||||===============================================
||   3.4% |  8983 |        -- |     -- |rhsf_
3|        |       |           |        | source/f90_files/solve/rhsf.f90
||   2.0% |  5185 |    259.74 |   4.8% |rdot_
3|        |       |           |        | f77_files/C7H16_52species/std/getrates.f
||||-----------------------------------------------
4|||       |       |          |        |line.2628
||||===============================================
||   2.0% |  5183 |        -- |     -- |diffflux_proc_looptool_
3|        |       |           |        | source/f77_files/johnmc/diffflux_gen_uj.f
||||-----------------------------------------------
4|||  1.9% |  5161 |   521.01 |   9.2% |line.199
||||===============================================

Notice that within the subroutine mcavis_new_looptool, which uses 4.2% of the compute time, line number 103 accounts for 4.0% of the compute time. In other words, the Fortran statement that starts at line 103 accounts for all but 0.2% of the time in the entire routine. Undoubtedly line 103 is the start of a DO loop. In addition to this useful information, we can now apply the profiling approach, instrumenting only the top-level routines as well as all of the MPI library. From a second run of the application with this instrumentation we can extract more pertinent information.

Table 1:  Profile by Function Group and Function

 Time %  |        Time |  Imb. Time |   Imb.  |      Calls |Experiment=1
         |             |            |  Time % |            |Group
         |             |            |         |            | Function
         |             |            |         |            |  PE='HIDE'

 100.0%  | 1530.892958 |         -- |      -- | 27414118.0 |Total
|---------------------------------------------------------------------
|  52.0%  |  796.046937 |         -- |      -- | 22403802.0 |USER
||--------------------------------------------------------------------
||  22.3%  |  341.176468 |   3.482338 |    1.0% | 19200000.0 |getrates_
||  17.4%  |  266.542501 |  35.451437 |   11.7% |     1200.0 |rhsf_
||   5.1%  |   78.772615 |   0.532703 |    0.7% |  3200000.0 |mcavis_new_looptool_
||   2.6%  |   40.477488 |   2.889609 |    6.7% |     1200.0 |diffflux_proc_looptool_
||   2.1%  |   31.666938 |   6.785575 |   17.6% |      200.0 |integrate_erk_jstage_lt_
||   1.4%  |   21.318895 |   5.042270 |   19.1% |     1200.0 |computeheatflux_looptool_
||   1.1%  |   16.091956 |   6.863891 |   29.9% |        1.0 |main
||====================================================================
|  47.4%  |  725.049709 |         -- |      -- |  5006632.0 |MPI
||--------------------------------------------------------------------
||  43.8%  |  670.742304 |  83.143600 |   11.0% |  2389440.0 |mpi_wait_
||   1.9%  |   28.821882 | 281.694997 |   90.7% |  1284320.0 |mpi_isend_
|=====================================================================

In this table we see only the routines we instrumented, plus the main program. Any routine called from within these listed routines that is not itself instrumented has its time added to the time of the instrumented caller. For example, getrates was one of the major callers of dexp, so dexp's time is included in the time for getrates. Note that the numbers in the second and third columns of this table are now in seconds. Once again, the fourth column contains the load-imbalance percentage.

A second table from this run shows where in the program the MPI routines are called.

Table 4:  MPI Message Stats by Caller

 MPI Msg Bytes |   MPI Msg |  MsgSz | 16B<= |  256B<= |     4KB<= | 64KB<= |Experiment=1
               |     Count |   <16B | MsgSz |   MsgSz |     MsgSz |  MsgSz |Function
               |           |  Count | <256B |    <4KB |     <64KB |   <1MB | Caller
               |           |        | Count |   Count |     Count |  Count |  PE[mmm]

 25634067336.0 | 1287990.0 | 3596.0 |  65.0 | 89606.0 | 1194722.0 |    1.0 |Total
|----------------------------------------------------------------------------------------
| 25633587200.0 | 1284320.0 |     -- |    -- | 89600.0 | 1194720.0 |     -- |mpi_isend_
||---------------------------------------------------------------------------------------
||  6635520000.0 |  259200.0 |     -- |    -- |      -- |  259200.0 |     -- |derivative_z_send_
3|               |           |        |       |         |           |        |  rhsf_
4|               |           |        |       |         |           |        |   integrate_erk_
5|               |           |        |       |         |           |        |    integrate_
6|               |           |        |       |         |           |        |     solve_driver_
7|               |           |        |       |         |           |        |      MAIN_
||||||||---------------------------------------------------------------------------------
8|||||||  6635520000.0 |  259200.0 |  -- |  -- |  -- |  259200.0 |  -- |pe.1801
8|||||||  6635520000.0 |  259200.0 |  -- |  -- |  -- |  259200.0 |  -- |pe.4663
8|||||||  6635520000.0 |  259200.0 |  -- |  -- |  -- |  259200.0 |  -- |pe.3666
||||||||=================================================================================

The callers of the MPI routines are ordered by the amount of data they communicate. Notice that MPI_ISEND is the routine responsible for sending most of the data, which helps explain why so much time is spent in MPI_WAIT. MPI_ISEND is a non-blocking operation: the send is initiated and the program continues computing while the message is transferred. To find out when a message has completed, the program must call MPI_WAIT for a particular message or MPI_WAITALL for a group of messages. Additionally, we see that 6,635,520,000 of the 25,633,587,200 bytes are sent from derivative_z_send. The table continues with the other routines that call MPI_ISEND. If we look into derivative_z_send, we will probably be able to find the corresponding MPI_WAIT calls.
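The pattern the profile is pointing at looks roughly like the following sketch (a minimal illustration of non-blocking message passing, not code taken from S3D): the receives and sends are posted, independent computation proceeds, and whatever transfer time has not been hidden is charged to MPI_WAIT.

      ! Minimal sketch of a non-blocking exchange with one neighbor.
      subroutine halo_exchange_sketch(sendbuf, recvbuf, n, neighbor, comm)
        use mpi
        implicit none
        integer, intent(in)  :: n, neighbor, comm
        real(8), intent(in)  :: sendbuf(n)
        real(8), intent(out) :: recvbuf(n)
        integer :: req_send, req_recv, ierr
        integer :: status(MPI_STATUS_SIZE)

        call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, req_recv, ierr)
        call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, neighbor, 0, comm, req_send, ierr)

        ! ... computation that does not need the incoming data proceeds here ...

        call MPI_Wait(req_recv, status, ierr)   ! blocked time appears as mpi_wait_
        call MPI_Wait(req_send, status, ierr)
      end subroutine halo_exchange_sketch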

In addition to the MPI information, we also have hardware counter information from this run. The following table gives hardware counter data for getrates. Notice that four PAPI counters are used; unfortunately, on most systems only a limited number of hardware counters can be active at any given time. The four are L1 data cache misses, TLB data misses, L1 data cache accesses, and floating point operations. From these four raw counts the software computes several derived metrics for us: computational intensity, MFLOP rate, and TLB and L1 references per miss (the short calculation after the listing shows how these follow from the raw counters). Additionally, we see that this routine achieves 9.7% of peak. These statistics are summed over all 6000 processors and the averages are displayed. From these numbers we see that the TLB utilization is excellent and L1 cache reuse is very good. The metric that might be a little low is the computational intensity, which indicates that there are approximately equal numbers of memory accesses and floating point operations. One would like this to be somewhat higher; however, it is not always possible to increase the computational intensity. From this information we have an excellent place to start strategizing an approach to improve the performance of this application.

========================================================================
USER / getrates_
------------------------------------------------------------------------
  Time%                                               22.3%
  Time                                           341.176468 secs
  Imb.Time                                         3.482338 secs
  Imb.Time%                                            1.0%
  Calls                          0.055M/sec      19200000.0 calls
  PAPI_L1_DCM                    6.291M/sec      2191246517 misses
  PAPI_TLB_DM                    0.024M/sec         8250904 misses
  PAPI_L1_DCA                 1044.783M/sec    363928615376 refs
  PAPI_FP_OPS                 1008.976M/sec    351456000000 ops
  User time (approx)           348.329 secs    905656290778 cycles   100.0%Time
  Average Time per Call                            0.000018 sec
  CrayPat Overhead : Time          3.8%
  HW FP Ops / User time       1008.976M/sec    351456000000 ops      9.7%peak(DP)
  HW FP Ops / WCT             1008.976M/sec
  Computational intensity         0.39 ops/cycle      0.97 ops/ref
  MFLOPS (aggregate)          6053856.92M/sec
  TLB utilization             44107.73 refs/miss     86.148 avg uses
  D1 cache hit,miss ratios        99.4% hits           0.6% misses
  D1 cache utilization (misses)  166.08 refs/miss     20.760 avg hits
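The derived metrics in the listing are simple arithmetic on the raw counters and times above; for example (values rounded):

  Computational intensity = PAPI_FP_OPS / PAPI_L1_DCA
                          = 351456000000 / 363928615376    ~ 0.97 ops/ref
  MFLOP rate (per core)   = PAPI_FP_OPS / user time
                          = 351456000000 / 348.329 sec     ~ 1009 MFLOPS
  TLB utilization         = PAPI_L1_DCA / PAPI_TLB_DM
                          = 363928615376 / 8250904         ~ 44108 refs/miss
  D1 cache hit ratio      = 1 - PAPI_L1_DCM / PAPI_L1_DCA
                          = 1 - 2191246517 / 363928615376  ~ 99.4% hits

The aggregate MFLOPS line is consistent with this per-core rate summed over the 6000 processors, and the 9.7% of peak figure follows from comparing roughly 1009 MFLOPS against the core's double precision peak, about 10.4 GFLOPS on the processor implied by these numbers.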

The previous profile data was from the S3D run in figure 5.1 at 6000 processors. The following profile is from the S3D run at 48,000 processors.

Table 1:  Profile by Function Group and Function

 Time %  |        Time |  Imb. Time |   Imb.  |      Calls |Experiment=1
         |             |            |  Time % |            |Group
         |             |            |         |            | Function
         |             |            |         |            |  PE='HIDE'

 100.0%  | 1730.555208 |         -- |      -- | 16090113.8 |Total
|---------------------------------------------------------------------
|  76.9%  | 1330.111350 |         -- |      -- |  4882627.8 |MPI
||--------------------------------------------------------------------
||  72.1%  | 1247.436960 |  54.277263 |    4.2% |  2389440.0 |mpi_wait_
||   1.3%  |   22.712017 | 101.212360 |   81.7% |  1234718.3 |mpi_isend_
||   1.0%  |   17.623757 |   4.642004 |   20.9% |        1.0 |mpi_comm_dup_
||   1.0%  |   16.849281 |  71.805979 |   81.0% |  1234718.3 |mpi_irecv_
||   1.0%  |   16.835691 | 192.820387 |   92.0% |    19999.2 |mpi_waitall_
||====================================================================
|  22.2%  |  384.978417 |         -- |      -- | 11203802.0 |USER
||--------------------------------------------------------------------
||   9.9%  |  171.440025 |   1.929439 |    1.1% |  9600000.0 |getrates_
||   7.7%  |  133.599580 |  19.572807 |   12.8% |     1200.0 |rhsf_
||   2.3%  |   39.465572 |   0.600168 |    1.5% |  1600000.0 |mcavis_new_looptool_
|=====================================================================

These data tell a completely different story: communication now takes 76.9% of the time. We will investigate how to address the optimization of S3D at 6000 cores in the chapter on node optimization, and the 48,000-core problem in the chapter on communication. If we want to run production at 6000 cores, computation is the bottleneck; if, on the other hand, we want to run production at 48,000 cores, we had better do something about the communication.

In this case we are assuming that the MPP system itself is not causing the bottleneck. In some cases the design of the parallel system is simply not meant to handle runs on more than 5,000-10,000 processors.

In these profiles we did not see any I/O show up; however, that is not always the case. In fact, for most applications being moved to hundreds of thousands of processors, I/O becomes the critical issue.

The following chapters are organized to address each of these potential bottlenecks:

a)      Computation or User time is using a majority of the time (Chapter 6)
b)      Communication is using a majority of the time (Chapter 7)
c)      I/O is using a majority of the time (Chapter 7)

So where does OpenMP come in? In Chapter 7 we will see that oftentimes the solution to improving an application that doesn't scale well is to introduce OpenMP, and Chapter 8 will discuss the issues of using OpenMP efficiently.
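As a preview of that discussion, introducing OpenMP typically means threading the dominant loop nests so that a node runs fewer MPI ranks, each with several threads, reducing the number of messages while keeping all the cores busy. The following is a minimal, hypothetical sketch of such a threaded loop, not code from the application:

      subroutine threaded_kernel_sketch(q, rate, n)
        implicit none
        integer, intent(in)  :: n
        real(8), intent(in)  :: q(n)
        real(8), intent(out) :: rate(n)
        integer :: i
      !$omp parallel do
        do i = 1, n                  ! outer loop split across the node's cores
           rate(i) = exp(-1.0d0 / max(q(i), 1.0d-30))
        end do
      !$omp end parallel do
      end subroutine threaded_kernel_sketch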
