From 2ecf810121c7eae34473b8fa108112036bc61127 Mon Sep 17 00:00:00 2001 From: "Paul E. McKenney" Date: Wed, 11 Dec 2013 13:59:04 -0800 Subject: [PATCH] Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt The Documentation/memory-barriers.txt file was written before the need for ACCESS_ONCE() was fully appreciated. It therefore contains no ACCESS_ONCE() calls, which can be a problem when people lift examples from it. This commit therefore adds ACCESS_ONCE() calls. Signed-off-by: Paul E. McKenney Reviewed-by: Josh Triplett Reviewed-by: Peter Zijlstra Cc: Cc: Linus Torvalds Cc: Andrew Morton Link: http://lkml.kernel.org/r/1386799151-2219-1-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar --- Documentation/memory-barriers.txt | 206 ++++++++++++++++++------------ 1 file changed, 126 insertions(+), 80 deletions(-) diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 020cccdbdd0c..1d067235b0bc 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt @@ -194,18 +194,22 @@ There are some minimal guarantees that may be expected of a CPU: (*) On any given CPU, dependent memory accesses will be issued in order, with respect to itself. This means that for: - Q = P; D = *Q; + ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q); the CPU will issue the following memory operations: Q = LOAD P, D = LOAD *Q - and always in that order. + and always in that order. On most systems, smp_read_barrier_depends() + does nothing, but it is required for DEC Alpha. The ACCESS_ONCE() + is required to prevent compiler mischief. Please note that you + should normally use something like rcu_dereference() instead of + open-coding smp_read_barrier_depends(). (*) Overlapping loads and stores within a particular CPU will appear to be ordered within that CPU. This means that for: - a = *X; *X = b; + a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b; the CPU will only issue the following sequence of memory operations: @@ -213,7 +217,7 @@ There are some minimal guarantees that may be expected of a CPU: And for: - *X = c; d = *X; + ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X); the CPU will only issue: @@ -224,6 +228,41 @@ There are some minimal guarantees that may be expected of a CPU: And there are a number of things that _must_ or _must_not_ be assumed: + (*) It _must_not_ be assumed that the compiler will do what you want with + memory references that are not protected by ACCESS_ONCE(). Without + ACCESS_ONCE(), the compiler is within its rights to do all sorts + of "creative" transformations: + + (-) Repeat the load, possibly getting a different value on the second + and subsequent loads. This is especially prone to happen when + register pressure is high. + + (-) Merge adjacent loads and stores to the same location. The most + familiar example is the transformation from: + + while (a) + do_something(); + + to something like: + + if (a) + for (;;) + do_something(); + + Using ACCESS_ONCE() as follows prevents this sort of optimization: + + while (ACCESS_ONCE(a)) + do_something(); + + (-) "Store tearing", where a single store in the source code is split + into smaller stores in the object code. Note that gcc really + will do this on some architectures when storing certain constants. + It can be cheaper to do a series of immediate stores than to + form the constant in a register and then to store that register. + + (-) "Load tearing", which splits loads in a manner analogous to + store tearing. + (*) It _must_not_ be assumed that independent loads and stores will be issued in the order given. This means that for: @@ -450,14 +489,14 @@ The usage requirements of data dependency barriers are a little subtle, and it's not always obvious that they're needed. To illustrate, consider the following sequence of events: - CPU 1 CPU 2 - =============== =============== + CPU 1 CPU 2 + =============== =============== { A == 1, B == 2, C = 3, P == &A, Q == &C } B = 4; - P = &B - Q = P; - D = *Q; + ACCESS_ONCE(P) = &B + Q = ACCESS_ONCE(P); + D = *Q; There's a clear data dependency here, and it would seem that by the end of the sequence, Q must be either &A or &B, and that: @@ -477,15 +516,15 @@ Alpha). To deal with this, a data dependency barrier or better must be inserted between the address load and the data load: - CPU 1 CPU 2 - =============== =============== + CPU 1 CPU 2 + =============== =============== { A == 1, B == 2, C = 3, P == &A, Q == &C } B = 4; - P = &B - Q = P; - - D = *Q; + ACCESS_ONCE(P) = &B + Q = ACCESS_ONCE(P); + + D = *Q; This enforces the occurrence of one of the two implications, and prevents the third possibility from arising. @@ -504,21 +543,22 @@ Another example of where data dependency barriers might be required is where a number is read from memory and then used to calculate the index for an array access: - CPU 1 CPU 2 - =============== =============== + CPU 1 CPU 2 + =============== =============== { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 } M[1] = 4; - P = 1 - Q = P; - - D = M[Q]; + ACCESS_ONCE(P) = 1 + Q = ACCESS_ONCE(P); + + D = M[Q]; -The data dependency barrier is very important to the RCU system, for example. -See rcu_dereference() in include/linux/rcupdate.h. This permits the current -target of an RCU'd pointer to be replaced with a new modified target, without -the replacement target appearing to be incompletely initialised. +The data dependency barrier is very important to the RCU system, +for example. See rcu_assign_pointer() and rcu_dereference() in +include/linux/rcupdate.h. This permits the current target of an RCU'd +pointer to be replaced with a new modified target, without the replacement +target appearing to be incompletely initialised. See also the subsection on "Cache Coherency" for a more thorough example. @@ -530,22 +570,23 @@ A control dependency requires a full read memory barrier, not simply a data dependency barrier to make it work correctly. Consider the following bit of code: - q = &a; + q = ACCESS_ONCE(a); if (p) { - q = &b; + q = ACCESS_ONCE(b); } x = *q; This will not have the desired effect because there is no actual data -dependency, but rather a control dependency that the CPU may short-circuit by -attempting to predict the outcome in advance. In such a case what's actually -required is: +dependency, but rather a control dependency that the CPU may short-circuit +by attempting to predict the outcome in advance, so that other CPUs see +the load from b as having happened before the load from a. In such a +case what's actually required is: - q = &a; + q = ACCESS_ONCE(a); if (p) { - q = &b; + q = ACCESS_ONCE(b); } x = *q; @@ -561,23 +602,23 @@ barrier, though a general barrier would also be viable. Similarly a read barrier or a data dependency barrier should always be paired with at least an write barrier, though, again, a general barrier is viable: - CPU 1 CPU 2 - =============== =============== - a = 1; + CPU 1 CPU 2 + =============== =============== + ACCESS_ONCE(a) = 1; - b = 2; x = b; - - y = a; + ACCESS_ONCE(b) = 2; x = ACCESS_ONCE(b); + + y = ACCESS_ONCE(a); Or: - CPU 1 CPU 2 - =============== =============================== + CPU 1 CPU 2 + =============== =============================== a = 1; - b = &a; x = b; - - y = *x; + ACCESS_ONCE(b) = &a; x = ACCESS_ONCE(b); + + y = *x; Basically, the read barrier always has to be there, even though it can be of the "weaker" type. @@ -586,13 +627,13 @@ the "weaker" type. match the loads after the read barrier or the data dependency barrier, and vice versa: - CPU 1 CPU 2 - =============== =============== - a = 1; }---- --->{ v = c - b = 2; } \ / { w = d - \ - c = 3; } / \ { x = a; - d = 4; }---- --->{ y = b; + CPU 1 CPU 2 + =================== =================== + ACCESS_ONCE(a) = 1; }---- --->{ v = ACCESS_ONCE(c); + ACCESS_ONCE(b) = 2; } \ / { w = ACCESS_ONCE(d); + \ + ACCESS_ONCE(c) = 3; } / \ { x = ACCESS_ONCE(a); + ACCESS_ONCE(d) = 4; }---- --->{ y = ACCESS_ONCE(b); EXAMPLES OF MEMORY BARRIER SEQUENCES @@ -1435,12 +1476,12 @@ three CPUs; then should the following sequence of events occur: CPU 1 CPU 2 =============================== =============================== - *A = a; *E = e; + ACCESS_ONCE(*A) = a; ACCESS_ONCE(*E) = e; LOCK M LOCK Q - *B = b; *F = f; - *C = c; *G = g; + ACCESS_ONCE(*B) = b; ACCESS_ONCE(*F) = f; + ACCESS_ONCE(*C) = c; ACCESS_ONCE(*G) = g; UNLOCK M UNLOCK Q - *D = d; *H = h; + ACCESS_ONCE(*D) = d; ACCESS_ONCE(*H) = h; Then there is no guarantee as to what order CPU 3 will see the accesses to *A through *H occur in, other than the constraints imposed by the separate locks @@ -1460,17 +1501,17 @@ However, if the following occurs: CPU 1 CPU 2 =============================== =============================== - *A = a; - LOCK M [1] - *B = b; - *C = c; - UNLOCK M [1] - *D = d; *E = e; - LOCK M [2] - *F = f; - *G = g; - UNLOCK M [2] - *H = h; + ACCESS_ONCE(*A) = a; + LOCK M [1] + ACCESS_ONCE(*B) = b; + ACCESS_ONCE(*C) = c; + UNLOCK M [1] + ACCESS_ONCE(*D) = d; ACCESS_ONCE(*E) = e; + LOCK M [2] + ACCESS_ONCE(*F) = f; + ACCESS_ONCE(*G) = g; + UNLOCK M [2] + ACCESS_ONCE(*H) = h; CPU 3 might see: @@ -2177,11 +2218,11 @@ A programmer might take it for granted that the CPU will perform memory operations in exactly the order specified, so that if the CPU is, for example, given the following piece of code to execute: - a = *A; - *B = b; - c = *C; - d = *D; - *E = e; + a = ACCESS_ONCE(*A); + ACCESS_ONCE(*B) = b; + c = ACCESS_ONCE(*C); + d = ACCESS_ONCE(*D); + ACCESS_ONCE(*E) = e; they would then expect that the CPU will complete the memory operation for each instruction before moving on to the next one, leading to a definite sequence of @@ -2228,12 +2269,12 @@ However, it is guaranteed that a CPU will be self-consistent: it will see its _own_ accesses appear to be correctly ordered, without the need for a memory barrier. For instance with the following code: - U = *A; - *A = V; - *A = W; - X = *A; - *A = Y; - Z = *A; + U = ACCESS_ONCE(*A); + ACCESS_ONCE(*A) = V; + ACCESS_ONCE(*A) = W; + X = ACCESS_ONCE(*A); + ACCESS_ONCE(*A) = Y; + Z = ACCESS_ONCE(*A); and assuming no intervention by an external influence, it can be assumed that the final result will appear to be: @@ -2250,7 +2291,12 @@ accesses: in that order, but, without intervention, the sequence may have almost any combination of elements combined or discarded, provided the program's view of -the world remains consistent. +the world remains consistent. Note that ACCESS_ONCE() is -not- optional +in the above example, as there are architectures where a given CPU might +interchange successive loads to the same location. On such architectures, +ACCESS_ONCE() does whatever is necessary to prevent this, for example, on +Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the +special ld.acq and st.rel instructions that prevent such reordering. The compiler may also combine, discard or defer elements of the sequence before the CPU even sees them. @@ -2264,13 +2310,13 @@ may be reduced to: *A = W; -since, without a write barrier, it can be assumed that the effect of the -storage of V to *A is lost. Similarly: +since, without either a write barrier or an ACCESS_ONCE(), it can be +assumed that the effect of the storage of V to *A is lost. Similarly: *A = Y; Z = *A; -may, without a memory barrier, be reduced to: +may, without a memory barrier or an ACCESS_ONCE(), be reduced to: *A = Y; Z = Y; -- 2.34.1