Chapter 1 Introduction, 1
1.1 INTRODUCTION
1.2 CHAPTER OVERVIEWS
Chapter 2 Determining an Exaflop Strategy, 7
2.1 FOREWORD BY JOHN LEVESQUE
2.2 INTRODUCTION
2.3 LOOKING AT THE APPLICATION
2.4 DEGREE OF HYBRIDIZATION REQUIRED
2.5 DECOMPOSITION AND I/O
2.6 PARALLEL AND VECTOR LENGTHS
2.7 PRODUCTIVITY AND PERFORMANCE PORTABILITY
2.8 CONCLUSION
2.9 EXERCISES
Chapter 3 Target Hybrid Multi/Manycore System, 21
3.1 FOREWORD BY JOHN LEVESQUE
3.2 UNDERSTANDING THE ARCHITECTURE
3.3 CACHE ARCHITECTURES
3.3.1 Xeon Cache
3.3.2 NVIDIA GPU Cache
3.4 MEMORY HIERARCHY
3.4.1 Knights Landing Cache
3.5 KNL CLUSTERING MODES
3.6 KNL MCDRAM MODES
3.7 IMPORTANCE OF VECTORIZATION
3.8 ALIGNMENT FOR VECTORIZATION
3.9 EXERCISES
Chapter 4 How Compilers Optimize Programs, 43
4.1 FOREWORD BY JOHN LEVESQUE
4.2 INTRODUCTION
4.3 MEMORY ALLOCATION
4.4 MEMORY ALIGNMENT
4.5 COMMENT-LINE DIRECTIVE
4.6 INTERPROCEDURAL ANALYSIS
4.7 COMPILER SWITCHES
4.8 FORTRAN 2003 AND INEFFICIENCIES
4.8.1 Array Syntax
4.8.2 Use Optimized Libraries
4.8.3 Passing Array Sections
4.8.4 Using Modules for Local Variables
4.8.5 Derived Types
4.9 C/C++ AND INEFFICIENCIES
4.10 COMPILER SCALAR OPTIMIZATIONS
4.10.1 Strength Reduction
4.10.2 Avoiding Floating Point Exponents
4.10.3 Common Subexpression Elimination
4.11 EXERCISES
Chapter 5 Gathering Runtime Statistics for Optimizing, 67
5.1 FOREWORD BY JOHN LEVESQUE
5.2 INTRODUCTION
5.3 WHAT’S IMPORTANT TO PROFILE
5.3.1 Profiling NAS BT
5.3.2 Profiling VH1
5.4 CONCLUSION
5.5 EXERCISES
Chapter 6 Utilization of Available Memory Bandwidth, 79
6.1 FOREWORD BY JOHN LEVESQUE
6.2 INTRODUCTION
6.3 IMPORTANCE OF CACHE OPTIMIZATION
6.4 VARIABLE ANALYSIS IN MULTIPLE LOOPS
6.5 OPTIMIZING FOR THE CACHE HIERARCHY
6.6 COMBINING MULTIPLE LOOPS
6.7 CONCLUSION
6.8 EXERCISES
Chapter 7 Vectorization, 97
7.1 FOREWORD BY JOHN LEVESQUE
7.2 INTRODUCTION
7.3 VECTORIZATION INHIBITORS
7.4 VECTORIZATION REJECTION FROM INEFFICIENCIES
7.4.1 Access Modes and Computational Intensity
7.4.2 Conditionals
7.5 STRIDING VERSUS CONTIGUOUS ACCESSING
7.6 WRAP-AROUND SCALAR
7.7 LOOPS SAVING MAXIMA AND MINIMA
7.8 MULTINESTED LOOP STRUCTURES
7.9 THERE’S MATMUL AND THEN THERE’S MATMUL
7.10 DECISION PROCESSES IN LOOPS
7.10.1 Loop-Independent Conditionals
7.10.2 Conditionals Directly Testing Indices
7.10.3 Loop-Dependent Conditionals
7.10.4 Conditionals Causing Early Loop Exit
7.11 HANDLING FUNCTION CALLS WITHIN LOOPS
7.12 RANK EXPANSION
7.13 OUTER LOOP VECTORIZATION
7.14 EXERCISES
Chapter 8 Hybridization of an Application, 147
8.1 FOREWORD BY JOHN LEVESQUE
8.2 INTRODUCTION
8.3 THE NODE’S NUMA ARCHITECTURE
8.4 FIRST TOUCH IN THE HIMENO BENCHMARK
8.5 IDENTIFYING WHICH LOOPS TO THREAD
8.6 SPMD OPENMP
8.7 EXERCISES
Chapter 9 Porting Entire Applications, 169
9.1 FOREWORD BY JOHN LEVESQUE
9.2 INTRODUCTION
9.3 SPEC OPENMP BENCHMARKS
9.3.1 WUPWISE
9.3.2 MGRID
9.3.3 GALGEL
9.3.4 APSI
9.3.5 FMA3D
9.3.6 AMMP
9.3.7 SWIM
9.3.8 APPLU
9.3.9 EQUAKE
9.3.10 ART
9.4 NAS PARALLEL BENCHMARK (NPB) BT
9.5 REFACTORING VH-1
9.6 REFACTORING LESLIE3D
9.7 REFACTORING S3D – 2016 PRODUCTION VERSION
9.8 PERFORMANCE PORTABLE – S3D ON TITAN
9.9 EXERCISES
Chapter 10 Future Hardware Advancements, 243
10.1 INTRODUCTION
10.2 FUTURE X86 CPUS
10.2.1 Intel Skylake
10.2.2 Intel Knights Hill
10.2.3 AMD Zen
10.3 FUTURE ARM CPUS
10.3.1 Scalable Vector Extension
10.3.2 Broadcom Vulcan
10.3.3 Cavium ThunderX
10.3.4 Fujitsu Post-K
10.3.5 Qualcomm Centriq
10.4 FUTURE MEMORY TECHNOLOGIES
10.4.1 Die-Stacking Technologies
10.4.2 Compute Near Data
10.5 FUTURE HARDWARE CONCLUSIONS
10.5.1 Increased Thread Counts
10.5.2 Wider Vectors
10.5.3 Increasingly Complex Memory Hierarchies
Appendix A: Supercomputer Cache Architectures
A.1 ASSOCIATIVITY
Appendix B: The Translation Look-Aside Buffer
B.1 INTRODUCTION TO THE TLB
Appendix C: Command Line Options / Compiler Directives
C.1 COMMAND LINE OPTIONS AND COMPILER DIRECTIVES
Appendix D: Previously Used Optimizations
D.1 LOOP REORDERING
D.2 INDEX REORDERING
D.3 LOOP UNROLLING
D.4 LOOP FISSION
D.5 SCALAR PROMOTION
D.6 REMOVAL OF LOOP-INDEPENDENT IFS
D.7 USE OF INTRINSICS TO REMOVE IFS
D.8 STRIP MINING
D.9 SUBROUTINE INLINING
D.10 PULLING LOOPS INTO SUBROUTINES
D.11 CACHE BLOCKING
D.12 LOOP FUSION
D.13 OUTER LOOP VECTORIZATION
Appendix E: I/O Optimization
E.1 INTRODUCTION
E.2 I/O STRATEGIES
E.2.1 Spokesperson
E.2.2 Multiple Writers – Multiple Files
E.2.3 Collective I/O to Single or Multiple Files
E.3 LUSTRE MECHANICS
Appendix F: Terminology
F.1 SELECTED DEFINITIONS
Appendix G: 12-Step Process
G.1 INTRODUCTION
G.2 PROCESS