Programming for Hybrid Multi/Manycore MPP Systems

Chapter 4 Excerpts

4.3.4 False Sharing in OpenMP 
Whenever the DO loop over the innermost index of an array is chosen as the OpenMP parallel dimension, there is the potential for severely degraded performance if two threads share a cache line. Consider the following DO loop.

!$OMP PARALLEL DO PRIVATE(I,II,IBD,ICD)
               DO I = 2,NX-1
                  II  =  I + IADD
                  IBD = II - IBDD
                  ICD = II + IBDD
                  DUDX(I) =
     >               DXI * ABD * ((U(IBD,J,K) - U(ICD,J,K)) +
     >                    8.0D0 * (U( II,J,K) - U(IBD,J,K))) * R6I
                  DVDX(I) =
     >               DXI * ABD * ((V(IBD,J,K) - V(ICD,J,K)) +
     >                    8.0D0 * (V( II,J,K) - V(IBD,J,K))) * R6I
                  DWDX(I) =
     >               DXI * ABD * ((W(IBD,J,K) - W(ICD,J,K)) +
     >                    8.0D0 * (W( II,J,K) - W(IBD,J,K))) * R6I
                  DTDX(I) =
     >               DXI * ABD * ((T(IBD,J,K) - T(ICD,J,K)) +
     >                    8.0D0 * (T( II,J,K) - T(IBD,J,K))) * R6I
               END DO

Say we have four threads working on this OpenMP loop and NX is 200. The loop runs from I = 2 to 199, 198 iterations in all, so with a default static schedule thread 0 handles roughly the first 50 iterations (I = 2 through 51) and thread 1 picks up at I = 52. When thread 0 is storing into DUDX(51) while thread 1 is storing into DUDX(52), there will be a problem if both of these elements of the DUDX array are contained in the same cache line. At these boundaries between the iteration ranges of two threads, we can have multiple threads fighting over ownership of the cache line that contains the elements of the arrays each thread wants to store into.

Evidently, one should avoid using OpenMP on the innermost index of an array, or, if one does, take care to distribute the iterations to the threads along cache-line boundaries rather than simply dividing up the range of the DO loop. This only matters when storing into an array, and in this case it may be very difficult to divide the iterations so that the stores into all four arrays avoid overlapping requests for the same cache line. One cache-line-aware distribution is sketched below.
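
One possibility is to hand each thread a whole number of cache lines rather than a simple quarter of the index range. The fragment below is only a sketch, not code from the application: it assumes 64-byte cache lines, REAL(8) arrays, and that DUDX, DVDX, DWDX, and DTDX all start on a cache-line boundary (if their alignments differ, no single split works for all four, which is exactly the difficulty noted above). The names MYID, NTHREADS, NLINES, I1, and I2 are introduced here for illustration.

C     Sketch: give each thread a whole number of 8-word (64-byte)
C     cache lines of the I index, so no two threads store into the
C     same cache line.
      INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
      INTEGER MYID, NTHREADS, NLINES, I1, I2

!$OMP PARALLEL PRIVATE(I,II,IBD,ICD,MYID,NTHREADS,NLINES,I1,I2)
      MYID     = OMP_GET_THREAD_NUM()
      NTHREADS = OMP_GET_NUM_THREADS()
      NLINES   = (NX + 7) / 8
      I1 = MAX(2,    ( MYID    * NLINES / NTHREADS) * 8 + 1)
      I2 = MIN(NX-1, ((MYID+1) * NLINES / NTHREADS) * 8)
      DO I = I1, I2
         II  = I + IADD
         IBD = II - IBDD
         ICD = II + IBDD
         DUDX(I) =
     >      DXI * ABD * ((U(IBD,J,K) - U(ICD,J,K)) +
     >           8.0D0 * (U( II,J,K) - U(IBD,J,K))) * R6I
C        ... DVDX(I), DWDX(I), and DTDX(I) computed as before ...
      END DO
!$OMP END PARALLEL

Parallelizing an outer J or K loop instead, with DUDX, DVDX, DWDX, and DTDX made PRIVATE, avoids the issue entirely and is usually the simpler choice.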
False sharing also occurs when several threads are writing into the same cache line of a shared array. Another application from the SPEC_OMP benchmark suite has this OpenMP parallel loop:

C
      DO 25 I=1,NUMTHREADS
         WWIND1(I)=0.0
         WSQ1(I)=0.0
 25   CONTINUE


!$OMP PARALLEL
!$OMP+PRIVATE(I,K,DV,TOPOW,HELPA1,HELP1,AN1,BN1,CN1,MY_CPU_ID)
       MY_CPU_ID = OMP_GET_THREAD_NUM() + 1
!$OMP DO
      DO 30 J=1,NY
         DO 40 I=1,NX
            HELP1(1)=0.0D0
            HELP1(NZ)=0.0D0
            DO 10 K=2,NZTOP
               IF(NY.EQ.1) THEN
                  DV=0.0D0
               ELSE
                  DV=DVDY(I,J,K)
               ENDIF
               HELP1(K)=FILZ(K)*(DUDX(I,J,K)+DV)
 10         CONTINUE
C
C      SOLVE IMPLICITLY FOR THE W FOR EACH VERTICAL LAYER
C
            CALL DWDZ(NZ,ZET,HVAR,HELP1,HELPA1,AN1,BN1,CN1,ITY)
            DO 20 K=2,NZTOP
               TOPOW=UX(I,J,K)*EX(I,J)+VY(I,J,K)*EY(I,J)
               WZ(I,J,K)=HELP1(K)+TOPOW
               WWIND1(MY_CPU_ID)=WWIND1(MY_CPU_ID)+WZ(I,J,K)
               WSQ1(MY_CPU_ID)=WSQ1(MY_CPU_ID)+WZ(I,J,K)**2
 20         CONTINUE
 40      CONTINUE
 30   CONTINUE
!$OMP END DO
!$OMP END PARALLEL


      DO 35 I=1,NUMTHREADS
         WWIND=WWIND+WWIND1(I)
         WSQ=WSQ+WSQ1(I)
35    CONTINUE

Notice that the WWIND1 and WSQ1 arrays are written into by all the threads. While the threads are writing into different elements of each array, those elements all fall within the same cache line. When a thread (a core) updates an element, it must have the cache line containing that element in its Level 1 cache. If one thread holds the cache line and another thread needs it, the thread that holds the line must flush it out to memory so the other thread can fetch it into its Level 1 cache. This OpenMP loop runs very poorly. An improvement will be mentioned in Chapter 9.
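
One common mitigation, shown here only as a sketch and not necessarily the improvement discussed in Chapter 9, is to pad the per-thread accumulator arrays so that each thread's element occupies its own cache line. The fragment below assumes 64-byte cache lines and REAL(8) data; the leading dimension of 8 provides the padding.

C     Sketch: pad WWIND1 and WSQ1 so each thread's accumulator sits
C     in its own 64-byte cache line (8 REAL(8) words per line).
      REAL*8 WWIND1(8,NUMTHREADS), WSQ1(8,NUMTHREADS)
C
      DO 25 I=1,NUMTHREADS
         WWIND1(1,I)=0.0
         WSQ1(1,I)=0.0
 25   CONTINUE
C
C     ... inside the parallel loop only the accumulations change ...
               WWIND1(1,MY_CPU_ID)=WWIND1(1,MY_CPU_ID)+WZ(I,J,K)
               WSQ1(1,MY_CPU_ID)=WSQ1(1,MY_CPU_ID)+WZ(I,J,K)**2
C
      DO 35 I=1,NUMTHREADS
         WWIND=WWIND+WWIND1(1,I)
         WSQ=WSQ+WSQ1(1,I)
 35   CONTINUE

Alternatively, the per-thread arrays can be removed altogether by accumulating into the scalars WWIND and WSQ with a REDUCTION(+:WWIND,WSQ) clause on the parallel loop.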
