Programming for Hybrid Multi/Manycore MPP Systems
  • Home
  • Table of Contents
    • Chapter 1 / Introduction
  • Previous Publications
    • High Performance Computing
    • A Guidebook
  • About the Authors
    • John Levesque
    • Aaron Vose
Chapter 1 Introduction, 1
1.1 INTRODUCTION
1.2 CHAPTER OVERVIEWS 

Chapter 2 Determining an Exaflop Strategy, 7 
2.1  FOREWORD BY JOHN LEVESQUE
2.2  INTRODUCTION
2.3  LOOKING AT THE APPLICATION 
2.4  DEGREE OF HYBRIDIZATION REQUIRED 
2.5  DECOMPOSITION AND I/O 
2.6  PARALLEL AND VECTOR LENGTHS 
2.7  PRODUCTIVITY AND PERFORMANCE PORTABILITY 
2.8  CONCLUSION 
2.9  EXERCISES 

Chapter 3 Target Hybrid Multi/Manycore System, 21
3.1  FOREWORD BY JOHN LEVESQUE 
3.2  UNDERSTANDING THE ARCHITECTURE 
3.3  CACHE ARCHITECTURES  
        3.3.1  Xeon Cache
        3.3.2  NVIDIA GPU Cache
3.4  MEMORY HIERARCHY
        3.4.1 Knights Landing Cache
3.5  KNL CLUSTERING MODES
3.6  KNL MCDRAM MODES
3.7  IMPORTANCE OF VECTORIZATION
3.8 ALIGNMENT FOR VECTORIZATION 
3.9 EXERCISES 

Chapter 4 How Compilers Optimize Programs, 43
4.1  FOREWORD BY JOHN LEVESQUE 
4.2  INTRODUCTION 
4.3  MEMORY ALLOCATION 
4.4  MEMORY ALIGNMENT 
4.5  COMMENT-LINE DIRECTIVE 
4.6  INTERPROCEDURAL ANALYSIS 
4.7  COMPILER SWITCHES 
4.8  FORTRAN 2003 AND INEFFICIENCIES 
        4.8.1  Array Syntax 
        4.8.2  Use Optimized Libraries 
        4.8.3  Passing Array Sections 
        4.8.4  Using Modules for Local Variables 
        4.8.5  Derived Types 
4.9  C/C++ AND INEFFICIENCIES 
4.10 COMPILER SCALAR OPTIMIZATIONS
        4.10.1 Strength Reduction
        4.10.2 Avoiding Floating Point Exponents
        4.10.3 Common Subexpression Elimination
4.11 EXERCISES

Chapter 5 Gathering Runtime Statistics for Optimizing, 67 
5.1  FOREWORD BY JOHN LEVESQUE 
5.2  INTRODUCTION 
5.3  WHAT’S IMPORTANT TO PROFILE 
        5.3.1  Profiling NAS BT 
        5.3.2  Profiling VH1 
5.4  CONCLUSION 
5.5  EXERCISES 

Chapter 6 Utilization of Available Memory Bandwidth, 79
6.1  FOREWORD BY JOHN LEVESQUE 
6.2  INTRODUCTION 
6.3  IMPORTANCE OF CACHE OPTIMIZATION 
6.4  VARIABLE ANALYSIS IN MULTIPLE LOOPS 
6.5  OPTIMIZING FOR THE CACHE HIERARCHY 
6.6  COMBINING MULTIPLE LOOPS 
6.7  CONCLUSION 
6.8  EXERCISES 

Chapter 7 Vectorization, 97
7.1  FOREWORD BY JOHN LEVESQUE
7.2  INTRODUCTION 
7.3  VECTORIZATION INHIBITORS 
7.4  VECTORIZATION REJECTION FROM INEFFICIENCIES 
        7.4.1 Access Modes and Computational Intensity
        7.4.2 Conditionals
7.5  STRIDING VERSUS CONTIGUOUS ACCESSING 
7.6  WRAP-AROUND SCALAR 
7.7  LOOPS SAVING MAXIMA AND MINIMA 
7.8  MULTINESTED LOOP STRUCTURES 
7.9  THERE’S MATMUL AND THEN THERE’S MATMUL 
7.10  DECISION PROCESSES IN LOOPS 
         7.10.1 Loop-Independent Conditionals 
         7.10.2 Conditionals Directly Testing Indices
         7.10.3 Loop-Dependent Conditionals
         7.10.4 Conditionals Causing Early Loop Exit
7.11  HANDLING FUNCTION CALLS WITHIN LOOPS 
7.12  RANK EXPANSION 
7.13  OUTER LOOP VECTORIZATION 
7.14  EXERCISES 

Chapter 8 Hybridization of an Application, 147
8.1 FOREWORD BY JOHN LEVESQUE
8.2  INTRODUCTION
8.3  THE NODE’S NUMA ARCHITECTURE
8.4  FIRST TOUCH IN THE HIMENO BENCHMARK
8.5  IDENTIFYING WHICH LOOPS TO THREAD
8.6  SPMD OPENMP
8.7  EXERCISES

Chapter 9 Porting Entire Applications, 169
9.1  FOREWORD BY JOHN LEVESQUE
9.2  INTRODUCTION
9.3  SPEC OPENMP BENCHMARKS
        9.3.1 WUPWISE 
        9.3.2 MGRID 
        9.3.3 GALGEL
        9.3.4 APSI
        9.3.5 FMA3D
        9.3.6 AMMP
        9.3.7 SWIM
        9.3.8 APPLU
        9.3.9 EQUAKE
        9.3.10 ART
9.4  NAS PARALLEL BENCHMARK (NPB) BT
9.5  REFACTORING VH-1 
9.6  REFACTORING LESLIE3D 
9.7  REFACTORING S3D – 2016 PRODUCTION VERSION 
9.8  PERFORMANCE PORTABLE – S3D ON TITAN 
9.9  EXERCISES 

Chapter 10 Future Hardware Advancements, 243
10.1  INTRODUCTION 
10.2  FUTURE X86 CPUS
         10.2.1 Intel Skylake
         10.2.2 Intel Knights Hill
         10.2.3 AMD Zen
10.3  FUTURE ARM CPUS 
         10.3.1 Scalable Vector Extension
         10.3.2 Broadcom Vulcan
         10.3.3 Cavium ThunderX
         10.3.4 Fujitsu Post-K
         10.3.5 Qualcomm Centriq 
10.4  FUTURE MEMORY TECHNOLOGIES
         10.4.1 Die-Stacking Technologies
         10.4.2 Compute Near Data
10.5  FUTURE HARDWARE CONCLUSIONS 
         10.5.1 Increased Thread Counts
         10.5.2 Wider Vectors 
         10.5.3 Increasingly Complex Memory Hierarchies

Appendix A:  Supercomputer Cache Architectures
A.1 ASSOCIATIVITY

Appendix B:  The Translation Look-Aside Buffer
B.1 INTRODUCTION TO THE TLB

Appendix C:  Command Line Options / Compiler Directives
C.1 COMMAND LINE OPTIONS AND COMPILER DIRECTIVES

Appendix D:  Previously Used Optimizations
D.1  LOOP REORDERING
D.2  INDEX REORDERING
D.3  LOOP UNROLLING
D.4  LOOP FISSION
D.5  SCALAR PROMOTION
D.6  REMOVAL OF LOOP-INDEPENDENT IFS
D.7  USE OF INTRINSICS TO REMOVE IFS
D.8  STRIP MINING
D.9  SUBROUTINE INLINING
D.10  PULLING LOOPS INTO SUBROUTINES
D.11 CACHE BLOCKING
D.12 LOOP FUSION
D.13 OUTER LOOP VECTORIZATION

Appendix E:  I/O Optimization
E.1  INTRODUCTION
E.2  I/O STRATEGIES
       E.2.1 Spokesperson
       E.2.2  Multiple Writers – Multiple Files 
       E.2.3  Collective I/O to Single or Multiple Files 
E.3  LUSTRE MECHANICS 

Appendix F:  Terminology 
F.1 SELECTED DEFINITIONS

Appendix G:  12-Step Process 
G.1 INTRODUCTION
G.2 PROCESS