Programming for Hybrid Multi/Manycore MPP Systems
  • Home
  • Table of Contents
    • Chapter 1 / Introduction
  • Previous Publications
    • High Performance Computing
    • A Guidebook
  • About the Authors
    • John Levesque
    • Aaron Vose

Chapter 3


3.1   What does “TLB” stand for, and what does a TLB do?
 

3.2   In the array statement A(IA(:)) = B(IA(:)) + C(:), describe what (if any) cache or TLB issues one would encounter. Consider if IA(I) = I and then consider different domains of IA(:). What about when IA(:) accesses 1 KB, 1 MB, and 1 GB?
 

3.3   What is the associativity of KNL’s MCDRAM when run in cache mode, and why is this so important from a performance perspective?
 

3.4   What does the numactl --preferred=1 command do? What does numactl --membind=1 do?
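For exercise 3.4, the two flags are easiest to compare by running the same binary under each. This is a sketch only: `./app` is a placeholder, and the node number assumes a KNL booted in flat mode where NUMA node 1 is the MCDRAM.

```shell
# --preferred=1: allocate from node 1 while it has free pages,
#                then silently fall back to the other nodes.
numactl --preferred=1 ./app

# --membind=1: allocate only from node 1; allocations fail once
#              node 1 is exhausted.
numactl --membind=1 ./app

# Show the NUMA topology the node numbers above refer to.
numactl --hardware
```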
 

3.5   What compiler directive can be used to tell the compiler to place an array in high-bandwidth memory?
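For exercise 3.5, one concrete spelling is Intel Fortran’s FASTMEM attribute, shown below as a sketch rather than the only answer: other compilers (e.g., Cray) use their own directives for the same purpose, and the allocation is serviced through the memkind library on KNL.

```fortran
program hbm_example
  implicit none
  real, allocatable :: a(:)
  ! Intel Fortran: request that a be allocated from high-bandwidth
  ! memory (MCDRAM) via memkind.  Other compilers spell this differently.
  !DIR$ ATTRIBUTES FASTMEM :: a
  allocate(a(10**6))
  a = 0.0
  print *, sum(a)
end program hbm_example
```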
 

3.6   Which KNL clustering mode is frequently found the best and is thus a reasonable default?
 

3.7   What is the bandwidth of NVLink? How does this compare to PCIe?
 

3.8   Construct a case study: Revisit the application and target system selected in the exercise section at the end of Chapter 2 – this time with respect to data motion.
a. Characterize the target system in terms of:
      i. Number of NUMA nodes per system node.
     ii. Size, bandwidth, and latency of memory per NUMA node.
    iii. Size, bandwidth, and latency of each cache level (including per-cycle load/store bandwidth between the level-1 cache and the core).
    iv. Draw a tree/graph of the memory hierarchy/topology of the target system’s node architecture including each core, cache, NUMA node, and memory, as well as their logical connections.
b. For the selected application:
      i. Identify three levels of parallelism in the application (e.g., grid decomposition, nested loops, independent tasks).
     ii. Map the three levels of parallelism identified in the application to the three levels available on the target system: MPI, threads, and vectors.
c. For each level of parallelism identified in the application and mapped to the target system, compare the working set of the application at that level to the amount of memory and bandwidth available on the target system:
      i. Compare the working set size of the chosen vectorizable loops to the size and bandwidth of each cache level.
     ii. Compare the working set size of each thread to the total amount of (private) cache available to a single core.
    iii. Compare the working set size of an MPI rank to the size and bandwidth of memory on a system node. If HBM is available, can it hold the working sets of all the ranks on the node?
Do these comparisons show a good fit between the parallelism identified in the application and the target system? If not, try to identify alternative parallelization strategies.
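The comparisons in part (c) reduce to simple arithmetic once the sizes are known. A small Python sketch of that bookkeeping is below; all sizes are illustrative placeholders, and should be replaced with the values measured for your application and target system.

```python
# Compare a working set against the capacity of each memory level.
# Capacities below are typical KNL-like placeholders, not measurements.
CAPACITY_BYTES = {
    "L1 (per core)": 32 * 1024,
    "L2 (per tile)": 1024 * 1024,
    "MCDRAM (per node)": 16 * 1024**3,
}

def working_set_bytes(n_elements: int, n_arrays: int, elem_bytes: int = 8) -> int:
    """Working set of a loop touching n_arrays arrays of n_elements doubles."""
    return n_elements * n_arrays * elem_bytes

def fits(working_set: int, capacity: int) -> bool:
    return working_set <= capacity

# Example: a vectorizable loop over 1000 elements and 3 arrays (A, B, C).
ws = working_set_bytes(1000, 3)  # 24 000 bytes
for level, cap in CAPACITY_BYTES.items():
    verdict = "fits in" if fits(ws, cap) else "exceeds"
    print(f"working set of {ws} bytes {verdict} {level} ({cap} bytes)")
```

The same helper applies at the thread and MPI-rank levels by swapping in per-thread or per-rank working sets and the corresponding private-cache or node-memory capacities.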
 

3.9   In Table 3.6.2, why doesn’t the placement of important arrays into MCDRAM result in better performance?
 

3.10   Consider Table 3.6.2. Can you describe a case where the placement of the important arrays might give a better performance gain?
 

3.11   Describe the different NUMA regions for KNL for each clustering mode. Start with the first region being the tile containing two processors, two level one caches, and one level two cache.
 

3.12   Give application characteristics where Quadrant-Cache would not be the best configuration on a KNL node. How much of an improvement over Quadrant-Cache would be expected for each clustering/MCDRAM configuration?