4.3.4 False Sharing in OpenMP
Whenever the DO loop over the innermost index of an array is chosen as the OpenMP parallel dimension, there is the potential for severe performance degradation if two threads share a cache line. Consider the following DO loop.
!$OMP PARALLEL DO PRIVATE(I,II,IBD,ICD)
      DO I = 2,NX-1
        II  = I + IADD
        IBD = II - IBDD
        ICD = II + IBDD
        DUDX(I) =
     >      DXI * ABD * ((U(IBD,J,K) - U(ICD,J,K)) +
     >      8.0D0 * (U( II,J,K) - U(IBD,J,K))) * R6I
        DVDX(I) =
     >      DXI * ABD * ((V(IBD,J,K) - V(ICD,J,K)) +
     >      8.0D0 * (V( II,J,K) - V(IBD,J,K))) * R6I
        DWDX(I) =
     >      DXI * ABD * ((W(IBD,J,K) - W(ICD,J,K)) +
     >      8.0D0 * (W( II,J,K) - W(IBD,J,K))) * R6I
        DTDX(I) =
     >      DXI * ABD * ((T(IBD,J,K) - T(ICD,J,K)) +
     >      8.0D0 * (T( II,J,K) - T(IBD,J,K))) * R6I
      END DO
Say we have four threads working on this OpenMP loop and NX is 200. The loop then runs from I = 2 to 199, a total of 198 iterations, so under the usual static schedule the first two threads get 50 iterations each and the other two get 49: thread 0 covers I = 2-51 and thread 1 covers I = 52-101. When thread 0 is storing into DUDX(51) while thread 1 is storing into DUDX(52), there will be a problem if both of these elements of the DUDX array fall on the same cache line. At these boundaries between the iteration ranges of two threads, we can have multiple threads fighting over ownership of the cache line that contains the elements of the arrays each thread wants to store into.
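One way to act on this observation is to compute each thread's iteration range explicitly so that the split points fall on cache-line boundaries of the stored array. The sketch below is only an illustration: it assumes 64-byte cache lines and REAL*8 elements (so 8 consecutive elements of DUDX share a line), it assumes DUDX(1) itself starts on a cache-line boundary, and the helper variables MYID, NTHR, NCHNK, ILO, and IHI are hypothetical additions, not part of the original code.
C     A minimal sketch of distributing iterations on cache-line
C     boundaries, assuming 64-byte lines, REAL*8 data (8 elements per
C     line), and DUDX(1) aligned on a line boundary.  MYID, NTHR,
C     NCHNK, ILO, and IHI are helper variables added for this example;
C     OMP_GET_THREAD_NUM and OMP_GET_NUM_THREADS come from the OpenMP
C     runtime library (declare them via omp_lib).
!$OMP PARALLEL PRIVATE(I,II,IBD,ICD,MYID,NTHR,NCHNK,ILO,IHI)
      MYID  = OMP_GET_THREAD_NUM()
      NTHR  = OMP_GET_NUM_THREADS()
C     Whole cache lines (8 iterations each) handed to each thread
      NCHNK = ((NX/8) + NTHR - 1) / NTHR
C     This thread's first and last iteration, snapped to multiples of 8
      ILO   = MAX(2,    MYID*NCHNK*8 + 1)
      IHI   = MIN(NX-1, (MYID+1)*NCHNK*8)
      DO I = ILO, IHI
        II  = I + IADD
        IBD = II - IBDD
        ICD = II + IBDD
        DUDX(I) =
     >      DXI * ABD * ((U(IBD,J,K) - U(ICD,J,K)) +
     >      8.0D0 * (U( II,J,K) - U(IBD,J,K))) * R6I
C       ... DVDX(I), DWDX(I) and DTDX(I) computed as in the loop above
      END DO
!$OMP END PARALLEL
With NX = 200 and four threads, each thread is given seven whole cache lines' worth of iterations (the first thread starts at I = 2 and the last stops at I = 199), so no two threads ever store into the same line of DUDX. The simpler cure, of course, is to put the OpenMP parallelism on an outer (J or K) loop instead of the innermost index.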
The lesson is that one should avoid using OpenMP on the innermost index of an array; if it cannot be avoided, care should be taken to distribute the iterations to the threads along cache-line boundaries rather than by simply splitting the range of the DO loop, as the decomposition sketched above illustrates. The problem arises only when storing into an array, and in this case it may be very difficult to divide the iterations so that all four of the stores avoid contending for the same cache lines, since DUDX, DVDX, DWDX, and DTDX need not all start at the same cache-line offset. False sharing also occurs when several threads write into the same cache line of a shared array. Another application from the SPEC OMP benchmark suite contains the following OpenMP parallel loop:
C
      DO 25 I=1,NUMTHREADS
        WWIND1(I)=0.0
        WSQ1(I)=0.0
   25 CONTINUE
!$OMP PARALLEL
!$OMP+PRIVATE(I,K,DV,TOPOW,HELPA1,HELP1,AN1,BN1,CN1,MY_CPU_ID)
      MY_CPU_ID = OMP_GET_THREAD_NUM() + 1
!$OMP DO
      DO 30 J=1,NY
        DO 40 I=1,NX
          HELP1(1)=0.0D0
          HELP1(NZ)=0.0D0
          DO 10 K=2,NZTOP
            IF(NY.EQ.1) THEN
              DV=0.0D0
            ELSE
              DV=DVDY(I,J,K)
            ENDIF
            HELP1(K)=FILZ(K)*(DUDX(I,J,K)+DV)
   10     CONTINUE
C
C         SOLVE IMPLICITLY FOR THE W FOR EACH VERTICAL LAYER
C
          CALL DWDZ(NZ,ZET,HVAR,HELP1,HELPA1,AN1,BN1,CN1,ITY)
          DO 20 K=2,NZTOP
            TOPOW=UX(I,J,K)*EX(I,J)+VY(I,J,K)*EY(I,J)
            WZ(I,J,K)=HELP1(K)+TOPOW
            WWIND1(MY_CPU_ID)=WWIND1(MY_CPU_ID)+WZ(I,J,K)
            WSQ1(MY_CPU_ID)=WSQ1(MY_CPU_ID)+WZ(I,J,K)**2
   20     CONTINUE
   40   CONTINUE
   30 CONTINUE
!$OMP END DO
!$OMP END PARALLEL
      DO 35 I=1,NUMTHREADS
        WWIND=WWIND+WWIND1(I)
        WSQ=WSQ+WSQ1(I)
   35 CONTINUE
Notice that the WWIND1 and WSQ1 arrays are written into by all of the threads. While each thread writes into a different element of these arrays, those elements all lie within the same cache line. When a thread (a core) updates an element, it must have the cache line containing that element in its Level 1 cache. If one thread has the cache line and another thread needs it, the thread that has the line must flush it out so that the other thread can fetch it into its own Level 1 cache. The line therefore bounces back and forth between the cores on every update, and this OpenMP loop runs very poorly. An improvement will be mentioned in Chapter 9.
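One common remedy for this pattern, sketched below, is to pad the per-thread accumulators so that each thread's element sits on its own cache line. This is only an illustration of the padding idea, under the assumption of 64-byte cache lines and 8-byte REAL elements (hence the padding factor of 8); it is not necessarily the improvement discussed in Chapter 9.
C     A minimal sketch of padding the per-thread partial sums so that
C     each thread's accumulator occupies its own cache line.  Assumes
C     64-byte lines and 8-byte REAL elements; only row 1 of each
C     column is actually used, the other 7 rows are padding.
      REAL*8 WWIND1(8,NUMTHREADS), WSQ1(8,NUMTHREADS)
C     ... inside the parallel DO loops, accumulate into the padded slot:
      WWIND1(1,MY_CPU_ID)=WWIND1(1,MY_CPU_ID)+WZ(I,J,K)
      WSQ1(1,MY_CPU_ID)=WSQ1(1,MY_CPU_ID)+WZ(I,J,K)**2
C     ... after the parallel region, sum only the used slots:
      DO 35 I=1,NUMTHREADS
        WWIND=WWIND+WWIND1(1,I)
        WSQ=WSQ+WSQ1(1,I)
   35 CONTINUE
An alternative that avoids the shared arrays altogether is to add REDUCTION(+:WWIND,WSQ) to the !$OMP DO directive and accumulate directly into WWIND and WSQ; the OpenMP runtime then keeps a private partial sum per thread and combines them at the end of the loop.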