diff options
Diffstat (limited to 'Documentation/memory-barriers.txt')
-rw-r--r-- | Documentation/memory-barriers.txt | 139 |
1 files changed, 100 insertions, 39 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 102dc19c4119..556f951f8626 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -608,26 +608,30 @@ as follows: b = p; /* BUG: Compiler can reorder!!! */ do_something(); -The solution is again ACCESS_ONCE(), which preserves the ordering between -the load from variable 'a' and the store to variable 'b': +The solution is again ACCESS_ONCE() and barrier(), which preserves the +ordering between the load from variable 'a' and the store to variable 'b': q = ACCESS_ONCE(a); if (q) { + barrier(); ACCESS_ONCE(b) = p; do_something(); } else { + barrier(); ACCESS_ONCE(b) = p; do_something_else(); } -You could also use barrier() to prevent the compiler from moving -the stores to variable 'b', but barrier() would not prevent the -compiler from proving to itself that a==1 always, so ACCESS_ONCE() -is also needed. +The initial ACCESS_ONCE() is required to prevent the compiler from +proving the value of 'a', and the pair of barrier() invocations are +required to prevent the compiler from pulling the two identical stores +to 'b' out from the legs of the "if" statement. It is important to note that control dependencies absolutely require a a conditional. For example, the following "optimized" version of -the above example breaks ordering: +the above example breaks ordering, which is why the barrier() invocations +are absolutely required if you have identical stores in both legs of +the "if" statement: q = ACCESS_ONCE(a); ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */ @@ -643,9 +647,11 @@ It is of course legal for the prior load to be part of the conditional, for example, as follows: if (ACCESS_ONCE(a) > 0) { + barrier(); ACCESS_ONCE(b) = q / 2; do_something(); } else { + barrier(); ACCESS_ONCE(b) = q / 3; do_something_else(); } @@ -659,9 +665,11 @@ the needed conditional. For example: q = ACCESS_ONCE(a); if (q % MAX) { + barrier(); ACCESS_ONCE(b) = p; do_something(); } else { + barrier(); ACCESS_ONCE(b) = p; do_something_else(); } @@ -723,8 +731,13 @@ In summary: use smb_rmb(), smp_wmb(), or, in the case of prior stores and later loads, smp_mb(). + (*) If both legs of the "if" statement begin with identical stores + to the same variable, a barrier() statement is required at the + beginning of each leg of the "if" statement. + (*) Control dependencies require at least one run-time conditional - between the prior load and the subsequent store. If the compiler + between the prior load and the subsequent store, and this + conditional must involve the prior load. If the compiler is able to optimize the conditional away, it will have also optimized away the ordering. Careful use of ACCESS_ONCE() can help to preserve the needed conditional. @@ -1249,6 +1262,23 @@ The ACCESS_ONCE() function can prevent any number of optimizations that, while perfectly safe in single-threaded code, can be fatal in concurrent code. Here are some examples of these sorts of optimizations: + (*) The compiler is within its rights to reorder loads and stores + to the same variable, and in some cases, the CPU is within its + rights to reorder loads to the same variable. This means that + the following code: + + a[0] = x; + a[1] = x; + + Might result in an older value of x stored in a[1] than in a[0]. + Prevent both the compiler and the CPU from doing this as follows: + + a[0] = ACCESS_ONCE(x); + a[1] = ACCESS_ONCE(x); + + In short, ACCESS_ONCE() provides cache coherence for accesses from + multiple CPUs to a single variable. + (*) The compiler is within its rights to merge successive loads from the same variable. Such merging can cause the compiler to "optimize" the following code: @@ -1371,7 +1401,7 @@ code. Here are some examples of these sorts of optimizations: process_message(msg); } - There is nothing to prevent the the compiler from transforming + There is nothing to prevent the compiler from transforming process_level() to the following, in fact, this might well be a win for single-threaded code: @@ -1644,12 +1674,12 @@ for each construct. These operations all imply certain barriers: Memory operations issued after the ACQUIRE will be completed after the ACQUIRE operation has completed. - Memory operations issued before the ACQUIRE may be completed after the - ACQUIRE operation has completed. An smp_mb__before_spinlock(), combined - with a following ACQUIRE, orders prior loads against subsequent stores and - stores and prior stores against subsequent stores. Note that this is - weaker than smp_mb()! The smp_mb__before_spinlock() primitive is free on - many architectures. + Memory operations issued before the ACQUIRE may be completed after + the ACQUIRE operation has completed. An smp_mb__before_spinlock(), + combined with a following ACQUIRE, orders prior loads against + subsequent loads and stores and also orders prior stores against + subsequent stores. Note that this is weaker than smp_mb()! The + smp_mb__before_spinlock() primitive is free on many architectures. (2) RELEASE operation implication: @@ -1694,24 +1724,21 @@ may occur as: ACQUIRE M, STORE *B, STORE *A, RELEASE M -This same reordering can of course occur if the lock's ACQUIRE and RELEASE are -to the same lock variable, but only from the perspective of another CPU not -holding that lock. - -In short, a RELEASE followed by an ACQUIRE may -not- be assumed to be a full -memory barrier because it is possible for a preceding RELEASE to pass a -later ACQUIRE from the viewpoint of the CPU, but not from the viewpoint -of the compiler. Note that deadlocks cannot be introduced by this -interchange because if such a deadlock threatened, the RELEASE would -simply complete. - -If it is necessary for a RELEASE-ACQUIRE pair to produce a full barrier, the -ACQUIRE can be followed by an smp_mb__after_unlock_lock() invocation. This -will produce a full barrier if either (a) the RELEASE and the ACQUIRE are -executed by the same CPU or task, or (b) the RELEASE and ACQUIRE act on the -same variable. The smp_mb__after_unlock_lock() primitive is free on many -architectures. Without smp_mb__after_unlock_lock(), the critical sections -corresponding to the RELEASE and the ACQUIRE can cross: +When the ACQUIRE and RELEASE are a lock acquisition and release, +respectively, this same reordering can occur if the lock's ACQUIRE and +RELEASE are to the same lock variable, but only from the perspective of +another CPU not holding that lock. In short, a ACQUIRE followed by an +RELEASE may -not- be assumed to be a full memory barrier. + +Similarly, the reverse case of a RELEASE followed by an ACQUIRE does not +imply a full memory barrier. If it is necessary for a RELEASE-ACQUIRE +pair to produce a full barrier, the ACQUIRE can be followed by an +smp_mb__after_unlock_lock() invocation. This will produce a full barrier +if either (a) the RELEASE and the ACQUIRE are executed by the same +CPU or task, or (b) the RELEASE and ACQUIRE act on the same variable. +The smp_mb__after_unlock_lock() primitive is free on many architectures. +Without smp_mb__after_unlock_lock(), the CPU's execution of the critical +sections corresponding to the RELEASE and the ACQUIRE can cross, so that: *A = a; RELEASE M @@ -1722,7 +1749,36 @@ could occur as: ACQUIRE N, STORE *B, STORE *A, RELEASE M -With smp_mb__after_unlock_lock(), they cannot, so that: +It might appear that this reordering could introduce a deadlock. +However, this cannot happen because if such a deadlock threatened, +the RELEASE would simply complete, thereby avoiding the deadlock. + + Why does this work? + + One key point is that we are only talking about the CPU doing + the reordering, not the compiler. If the compiler (or, for + that matter, the developer) switched the operations, deadlock + -could- occur. + + But suppose the CPU reordered the operations. In this case, + the unlock precedes the lock in the assembly code. The CPU + simply elected to try executing the later lock operation first. + If there is a deadlock, this lock operation will simply spin (or + try to sleep, but more on that later). The CPU will eventually + execute the unlock operation (which preceded the lock operation + in the assembly code), which will unravel the potential deadlock, + allowing the lock operation to succeed. + + But what if the lock is a sleeplock? In that case, the code will + try to enter the scheduler, where it will eventually encounter + a memory barrier, which will force the earlier unlock operation + to complete, again unraveling the deadlock. There might be + a sleep-unlock race, but the locking primitive needs to resolve + such races properly in any case. + +With smp_mb__after_unlock_lock(), the two critical sections cannot overlap. +For example, with the following code, the store to *A will always be +seen by other CPUs before the store to *B: *A = a; RELEASE M @@ -1730,13 +1786,18 @@ With smp_mb__after_unlock_lock(), they cannot, so that: smp_mb__after_unlock_lock(); *B = b; -will always occur as either of the following: +The operations will always occur in one of the following orders: - STORE *A, RELEASE, ACQUIRE, STORE *B - STORE *A, ACQUIRE, RELEASE, STORE *B + STORE *A, RELEASE, ACQUIRE, smp_mb__after_unlock_lock(), STORE *B + STORE *A, ACQUIRE, RELEASE, smp_mb__after_unlock_lock(), STORE *B + ACQUIRE, STORE *A, RELEASE, smp_mb__after_unlock_lock(), STORE *B If the RELEASE and ACQUIRE were instead both operating on the same lock -variable, only the first of these two alternatives can occur. +variable, only the first of these alternatives can occur. In addition, +the more strongly ordered systems may rule out some of the above orders. +But in any case, as noted earlier, the smp_mb__after_unlock_lock() +ensures that the store to *A will always be seen as happening before +the store to *B. Locks and semaphores may not provide any guarantee of ordering on UP compiled systems, and so cannot be counted on in such a situation to actually achieve @@ -2757,7 +2818,7 @@ in that order, but, without intervention, the sequence may have almost any combination of elements combined or discarded, provided the program's view of the world remains consistent. Note that ACCESS_ONCE() is -not- optional in the above example, as there are architectures where a given CPU might -interchange successive loads to the same location. On such architectures, +reorder successive loads to the same location. On such architectures, ACCESS_ONCE() does whatever is necessary to prevent this, for example, on Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the special ld.acq and st.rel instructions that prevent such reordering. |