Add a detailed note to explain the potential side effects of accessing the
gp pointer with a plain load, i.e. without using the rcu_dereference()
macro, which might trip up neighboring code that does use
rcu_dereference().
I haven't verified this with a compiler, but this is what I gather from
the link below describing Will's experience with READ_ONCE().
Link: https://lore.kernel.org/all/20230728124412.GA21303@willie-the-truck/
Cc: Will Deacon <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
.../RCU/Design/Requirements/Requirements.rst | 32 +++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
index f3b605285a87..e0b896d3fb9b 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.rst
+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -376,6 +376,38 @@ mechanism, most commonly locking or reference counting
.. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
.. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
+Note that there can be strange side effects (due to compiler optimizations) if
+``gp`` is ever accessed using a plain load (i.e., without ``READ_ONCE()`` or
+``rcu_dereference()``), potentially hurting any succeeding
+``rcu_dereference()``. For example, consider the following code:
+
+ ::
+
+ 1 bool do_something_gp(void)
+ 2 {
+ 3 struct foo *p; void *tmp;
+ 4 rcu_read_lock();
+ 5 tmp = gp; // Plain-load of GP.
+ 6 printk("Point gp = %p\n", tmp);
+ 7
+ 8 p = rcu_dereference(gp);
+ 9 if (p) {
+ 10 do_something(p->a, p->b);
+ 11 rcu_read_unlock();
+ 12 return true;
+ 13 }
+ 14 rcu_read_unlock();
+ 15 return false;
+ 16 }
+
+The behavior of plain accesses involved in a data race is non-deterministic in
+the face of compiler optimizations. Since accesses to the ``gp`` pointer are
+by design involved in a data race, the compiler could trip up this code by
+caching the value of ``gp`` into a register on line 5, and then using the value
+of that register to satisfy the loads on line 10. Thus it is important to never
+mix plain accesses of a memory location with ``rcu_dereference()`` of the same
+memory location in code involved in a data race.
+
In short, updaters use rcu_assign_pointer() and readers use
rcu_dereference(), and these two RCU API elements work together to
ensure that readers have a consistent view of newly added data elements.
--
2.41.0.585.gd2178a4bd4-goog
The compiler has the ability to cause misordering by destroying
address-dependency barriers if comparison operations are used. Add a
note about this to memory-barriers.txt and point to rcu_dereference.rst
for more information.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
Documentation/memory-barriers.txt | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 06e14efd8662..acc8ec5ce563 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -435,6 +435,11 @@ Memory barriers come in four basic varieties:
variables such as READ_ONCE() and rcu_dereference() provide implicit
address-dependency barriers.
+ [!] Note that address-dependency barriers can be destroyed by comparing a
+ pointer obtained via a marked accessor such as READ_ONCE() or
+ rcu_dereference() against some value. For an example of this, see
+ rcu_dereference.rst (the part where pointer comparisons are discussed).
+
(3) Read (or load) memory barriers.
A read barrier is an address-dependency barrier plus a guarantee that all
--
2.41.0.585.gd2178a4bd4-goog
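The pointer-comparison hazard the note above refers to can be sketched roughly as
follows (the names here are made up, in the spirit of the example in
Documentation/RCU/rcu_dereference.rst):

struct foo default_foo;         /* made-up fallback object */
struct foo __rcu *gp;
struct foo *p;

p = rcu_dereference(gp);
if (p == &default_foo) {
        /*
         * Within this branch the compiler knows p == &default_foo, so it
         * may load p->a and p->b through &default_foo, i.e. through a
         * pointer that did not come from rcu_dereference(), discarding
         * the address dependency and the ordering it provided.
         */
        do_something(p->a, p->b);
}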
> On Aug 3, 2023, at 11:24, Joel Fernandes (Google) <[email protected]> wrote:
> +The behavior of plain accesses involved in a data race is non-deterministic in
> +the face of compiler optimizations. Since accesses to the ``gp`` pointer are
> +by design involved in a data race, the compiler could trip up this code by
> +caching the value of ``gp`` into a register on line 5, and then using the value
> +of that register to satisfy the loads on line 10. Thus it is important to never
Will’s example is:
// Assume *ptr is initially 0 and somebody else writes it to 1
// concurrently
foo = *ptr;
bar = READ_ONCE(*ptr);
baz = *ptr;
Then the compiler is within its right to reorder it to:
foo = *ptr;
baz = *ptr;
bar = READ_ONCE(*ptr);
So, the result foo == baz == 0 but bar == 1 is perfectly legal.
But the example here is different: the compiler cannot use the value loaded on line 5
unless it can deduce that tmp is equal to p, in which case the address dependency
doesn't exist anymore.
What am I missing here?
> On Aug 3, 2023, at 8:09 AM, Alan Huang <[email protected]> wrote:
>
>
> So, the result foo == baz == 0 but bar == 1 is perfectly legal.
Yes, a bad outcome is perfectly legal amidst a data race. Who said it is not legal?
>
> But the example here is different,
That is intentional. Will's discussion partially triggered this. Though I am wondering
if we should document that as well.
> the compiler cannot use the value loaded on line 5
> unless it can deduce that tmp is equal to p, in which case the address dependency
> doesn't exist anymore.
>
> What am I missing here?
Maybe, like I did, you are trying too hard to rationalize that the sequence mentioned
cannot result in a counter-intuitive outcome?
The point AFAIU is not just about line 10 but that the compiler can replace any of the
lines after the plain access with the cached value.
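Something like the following is the kind of transformed code I am worried about
(again, I have not verified this with an actual compiler; it reuses gp, struct foo
and do_something() from the example in the patch):

bool do_something_gp(void)
{
        struct foo *p;
        void *tmp;

        rcu_read_lock();
        tmp = gp;                       // Plain load, value cached in a register.
        printk("Point gp = %p\n", tmp);

        p = rcu_dereference(gp);        // The volatile load itself cannot be elided, but ...
        if (p) {
                // ... the worry is that the dereferences below could be fed
                // from the cached plain load instead of from p, losing the
                // address dependency that rcu_dereference() provides:
                do_something(((struct foo *)tmp)->a, ((struct foo *)tmp)->b);
                rcu_read_unlock();
                return true;
        }
        rcu_read_unlock();
        return false;
}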
Thanks.
> On Aug 3, 2023, at 8:35 PM, Joel Fernandes <[email protected]> wrote:
>> So, the result foo == baz == 0 but bar == 1 is perfectly legal.
>
> Yes, a bad outcome is perfectly legal amidst a data race. Who said it is not legal?
My understanding is that it is legal even without a data race, and the compiler only keeps the order of volatile accesses.
>
> The point AFAIU is not just about line 10 but that the compiler can replace any of the
> lines after the plain access with the cached value.
Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
However, what if a write is not protected with WRITE_ONCE() and the read is marked with READ_ONCE()?
That's also a data race, right? But the kernel considers it okay if the write is machine-word aligned.
On Thu, Aug 3, 2023 at 9:36 AM Alan Huang <[email protected]> wrote:
> My understanding is that it is legal even without a data race, and the compiler only keeps the order of volatile accesses.
Yes, but I can bet that the author of the code would not have
intended such an outcome; if they did, then Will wouldn't have been
debugging it ;-). That's why I called it a bad outcome. The goal of
this patch is to document such a possible unintentional outcome.
> However, what if a write is not protected with WRITE_ONCE() and the read is marked with READ_ONCE()?
> That's also a data race, right? But the kernel considers it okay if the write is machine-word aligned.
Yes, but there is a compiler between the HLL code and what the
processor sees which can tear the write. How can not using
WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
it is OK only if the reader did a NULL check. In that case the torn
result will not change the semantics of the program. But otherwise,
that's bad.
[1] https://lwn.net/Articles/793253/#Store%20Tearing
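For example (made-up names, just to sketch the tearing hazard described in [1]):

struct foo *global_ptr;         /* shared with a concurrent reader */

/* Plain store: the compiler is allowed to emit this as two or more narrower
 * stores, so a concurrent reader can observe a half-written pointer. */
global_ptr = new_item;

/* Marked store: emitted as a single store of the aligned full word. */
WRITE_ONCE(global_ptr, new_item);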
thanks,
- Joel
> On Aug 4, 2023, at 00:01, Joel Fernandes <[email protected]> wrote:
> Yes, but there is a compiler between the HLL code and what the
> processor sees which can tear the write. How can not using
> WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
> it is OK only if the reader did a NULL check. In that case the torn
Yes, a write-write data race where the value is the same is also fine.
But they are still data races; if the compiler is within its rights to do anything it likes (due to the data race),
we still need WRITE_ONCE() in these cases, even though it's semantically safe.
IIUC, even with _ONCE(), the compiler is within its rights to do anything according to the standard (at least before the upcoming C23), because the standard doesn't consider a volatile access to be atomic.
However, the kernel considers the volatile access to be atomic, right?
BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded on line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of the kernel.
> result will not change the semantics of the program. But otherwise,
> that's bad.
On Thu, Aug 03, 2023 at 03:24:07AM +0000, Joel Fernandes (Google) wrote:
> + [!] Note that address-dependency barriers can be destroyed by comparing a
> + pointer obtained via a marked accessor such as READ_ONCE() or
> + rcu_dereference() against some value. For an example of this, see
> + rcu_dereference.rst (the part where pointer comparisons are discussed).
Hmmm...
Given that this is in a section marked "historical" (for the old
smp_read_barrier_depends() API), why not instead add a pointer to
Documentation/RCU/rcu_dereference.rst to the beginning of the section,
noted as the updated material?
Thanx, Paul
> On Aug 3, 2023, at 3:26 PM, Alan Huang <[email protected]> wrote:
>
>
Please trim replies if possible.
> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded on line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of the kernel.
I am actually a bit lost with what you are trying to say. Are you saying that mixing
plain accesses with marked accesses is an acceptable practice?
I would like others to weigh in as well, since I am not seeing what Alan is suggesting.
AFAICS, in the absence of barrier(), any optimization caused by a plain access
makes mixing it with marked accesses of the same location a bad practice.
Thanks,
- Joel
On Fri, Aug 04, 2023 at 03:25:57AM +0800, Alan Huang wrote:
> IIUC, even with _ONCE(), the compiler is within its rights to do anything according to the standard (at least before the upcoming C23), because the standard doesn't consider a volatile access to be atomic.
Volatile accesses are not specified very well in the standard. However,
as a practical matter, compilers that wish to be able to compile device drivers
(whether in kernels or userspace applications) must compile those volatile
accesses in such a way as to allow reliable device drivers to be written.
> However, the kernel considers the volatile access to be atomic, right?
The compiler must therefore act as if a volatile access to an aligned
machine-word size location is atomic. To see this, consider accesses
to memory that is shared by a device driver and that device's firmware,
both of which are written in either C or C++.
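For example (a sketch only, with made-up names), a driver might publish work to
the firmware through a naturally aligned one-word status field:

u32 *status = &shmem->status;   /* shared, aligned machine-word location */

/* Driver side: must reach memory as one single store, never torn. */
WRITE_ONCE(*status, STATUS_READY);

/* Polling the other side's reply: each load must observe either the old
 * value or the new one, never a mixture of the two. */
while (READ_ONCE(*status) != STATUS_DONE)
        cpu_relax();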
Does that help?
Thanx, Paul
On Thu, Aug 03, 2023 at 11:52:06AM -0700, Paul E. McKenney wrote:
> Hmmm...
>
> Given that this is in a section marked "historical" (for the old
> smp_read_barrier_depends() API), why not instead add a pointer to
> Documentation/RCU/rcu_dereference.rst to the beginning of the section,
> noted as the updated material?
Sounds good. There's also another section in the same file on Address
dependency barriers (also marked historical). So something like the
following?
---8<-----------------------
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index acc8ec5ce563..ba50220716ca 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -396,6 +396,10 @@ Memory barriers come in four basic varieties:
(2) Address-dependency barriers (historical).
+ [!] This section is marked as HISTORICAL: For more up-to-date
+ information, including how compiler transformations related to pointer
+ comparisons can sometimes cause problems, see
+ Documentation/RCU/rcu_dereference.rst.
An address-dependency barrier is a weaker form of read barrier. In the
case where two loads are performed such that the second depends on the
@@ -561,6 +565,9 @@ There are certain things that the Linux kernel memory barriers do not guarantee:
ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
----------------------------------------
+[!] This section is marked as HISTORICAL: For more up-to-date information,
+including how compiler transformations related to pointer comparisons can
+sometimes cause problems, see Documentation/RCU/rcu_dereference.rst.
As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
DEC Alpha, which means that about the only people who need to pay attention
> On Aug 3, 2023, at 8:01 PM, Paul E. McKenney <[email protected]> wrote:
> Volatile accesses are not specified very well in the standard. However,
> as a practical matter, compilers that wish to be able to compile device drivers
> (whether in kernels or userspace applications) must compile those volatile
> accesses in such a way as to allow reliable device drivers to be written.
Agreed.
>
>> However, the kernel considers the volatile access to be atomic, right?
>
> The compiler must therefore act as if a volatile access to an aligned
> machine-word size location is atomic. To see this, consider accesses
> to memory that is shared by a device driver and that device's firmware,
> both of which are written in either C or C++.
Btw, it appears TSAN complains bitterly about even volatile 4-byte data races.
Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
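For instance, a rough C11 equivalent of a READ_ONCE()/WRITE_ONCE() pair that
TSAN is happy with looks something like this (a sketch only; C++ code such as
Chrome would use the std::atomic equivalents):

#include <stdatomic.h>

static _Atomic int shared_flag;

/* Writer: relaxed atomic store, which TSAN does not report as a race. */
atomic_store_explicit(&shared_flag, 1, memory_order_relaxed);

/* Reader: relaxed atomic load of the same location. */
int seen = atomic_load_explicit(&shared_flag, memory_order_relaxed);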
Thanks,
Joel
On Fri, Aug 04, 2023 at 05:11:27AM +0000, Joel Fernandes wrote:
> Sounds good. There's also another section in the same file on Address
> dependency barriers (also marked historical). So something like the
> following?
Given a Signed-off-by and so forth, I would be happy to take this one.
Thanx, Paul
On Fri, Aug 04, 2023 at 08:33:55AM -0400, Joel Fernandes wrote:
> > On Aug 3, 2023, at 8:01 PM, Paul E. McKenney <[email protected]> wrote:
> > On Fri, Aug 04, 2023 at 03:25:57AM +0800, Alan Huang wrote:
> >>> 2023年8月4日 00:01,Joel Fernandes <[email protected]> 写道:
> >>> On Thu, Aug 3, 2023 at 9:36 AM Alan Huang <[email protected]> wrote:
> >>>>> 2023年8月3日 下午8:35,Joel Fernandes <[email protected]> 写道:
> >>>>>>> On Aug 3, 2023, at 8:09 AM, Alan Huang <[email protected]> wrote:
> >>>>>>>> 2023年8月3日 11:24,Joel Fernandes (Google) <[email protected]> 写道:
> >>>>>>>> Add a detailed note to explain the potential side effects of
> >>>>>>>> plain-accessing the gp pointer using a plain load, without using the
> >>>>>>>> rcu_dereference() macros; which might trip neighboring code that does
> >>>>>>>> use rcu_dereference().
> >>>>>>>>
> >>>>>>>> I haven't verified this with a compiler, but this is what I gather from
> >>>>>>>> the below link using Will's experience with READ_ONCE().
> >>>>>>>>
> >>>>>>>> Link: https://lore.kernel.org/all/20230728124412.GA21303@willie-the-truck/
> >>>>>>>> Cc: Will Deacon <[email protected]>
> >>>>>>>> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> >>>>>>>> ---
> >>>>>>>> .../RCU/Design/Requirements/Requirements.rst | 32 +++++++++++++++++++
> >>>>>>>> 1 file changed, 32 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/Documentation/RCU/Design/Requirements/Requirements.rst b/Documentation/RCU/Design/Requirements/Requirements.rst
> >>>>>>>> index f3b605285a87..e0b896d3fb9b 100644
> >>>>>>>> --- a/Documentation/RCU/Design/Requirements/Requirements.rst
> >>>>>>>> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> >>>>>>>> @@ -376,6 +376,38 @@ mechanism, most commonly locking or reference counting
> >>>>>>>> .. |high-quality implementation of C11 memory_order_consume [PDF]| replace:: high-quality implementation of C11 ``memory_order_consume`` [PDF]
> >>>>>>>> .. _high-quality implementation of C11 memory_order_consume [PDF]: http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf
> >>>>>>>>
> >>>>>>>> +Note that, there can be strange side effects (due to compiler optimizations) if
> >>>>>>>> +``gp`` is ever accessed using a plain load (i.e. without ``READ_ONCE()`` or
> >>>>>>>> +``rcu_dereference()``) potentially hurting any succeeding
> >>>>>>>> +``rcu_dereference()``. For example, consider the code:
> >>>>>>>> +
> >>>>>>>> + ::
> >>>>>>>> +
> >>>>>>>> + 1 bool do_something_gp(void)
> >>>>>>>> + 2 {
> >>>>>>>> + 3 void *tmp;
> >>>>>>>> + 4 rcu_read_lock();
> >>>>>>>> + 5 tmp = gp; // Plain-load of GP.
> >>>>>>>> + 6 printk("Point gp = %p\n", tmp);
> >>>>>>>> + 7
> >>>>>>>> + 8 p = rcu_dereference(gp);
> >>>>>>>> + 9 if (p) {
> >>>>>>>> + 10 do_something(p->a, p->b);
> >>>>>>>> + 11 rcu_read_unlock();
> >>>>>>>> + 12 return true;
> >>>>>>>> + 13 }
> >>>>>>>> + 14 rcu_read_unlock();
> >>>>>>>> + 15 return false;
> >>>>>>>> + 16 }
> >>>>>>>> +
> >>>>>>>> +The behavior of plain accesses involved in a data race is non-deterministic in
> >>>>>>>> +the face of compiler optimizations. Since accesses to the ``gp`` pointer is
> >>>>>>>> +by-design a data race, the compiler could trip this code by caching the value
> >>>>>>>> +of ``gp`` into a register in line 5, and then using the value of the register
> >>>>>>>> +to satisfy the load in line 10. Thus it is important to never mix
> >>>>>>>
> >>>>>>> Will’s example is:
> >>>>>>>
> >>>>>>> // Assume *ptr is initially 0 and somebody else writes it to 1
> >>>>>>> // concurrently
> >>>>>>>
> >>>>>>> foo = *ptr;
> >>>>>>> bar = READ_ONCE(*ptr);
> >>>>>>> baz = *ptr;
> >>>>>>>
> >>>>>>> Then the compiler is within its right to reorder it to:
> >>>>>>>
> >>>>>>> foo = *ptr;
> >>>>>>> baz = *ptr;
> >>>>>>> bar = READ_ONCE(*ptr);
> >>>>>>>
> >>>>>>> So, the result foo == baz == 0 but bar == 1 is perfectly legal.
> >>>>>>
> >>>>>> Yes, a bad outcome is perfectly legal amidst data race. Who said it is not legal?
> >>>>>
> >>>>> My understanding is that it is legal even without data race, and the compiler only keeps the order of volatile access.
> >>>
> >>> Yes, but I can bet on it the author of the code would not have
> >>> intended such an outcome, if they did then Will wouldn't have been
> >>> debugging it ;-). That's why I called it a bad outcome. The goal of
> >>> this patch is to document such a possible unintentional outcome.
> >>>
> >>>>>> But the example here is different,
> >>>>>
> >>>>> That is intentional. Wills discussion partially triggered this. Though I am wondering
> >>>>> if we should document that as well.
> >>>>>
> >>>>>> the compiler can not use the value loaded from line 5
> >>>>>> unless the compiler can deduce that the tmp is equals to p in which case the address dependency
> >>>>>> doesn’t exist anymore.
> >>>>>>
> >>>>>> What am I missing here?
> >>>>>
> >>>>> Maybe you are trying to rationalize too much that the sequence mentioned cannot result
> >>>>> in a counter intuitive outcome like I did?
> >>>>>
> >>>>> The point AFAIU is not just about line 10 but that the compiler can replace any of the
> >>>>> lines after the plain access with the cached value.
> >>>>
> >>>> Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
> >>>>
> >>>> However, what if a write is not protected with WRITE_ONCE and the read is marked with READ_ONCE?
> >>>> That’s also a data race, right? But the kernel considers it is Okay if the write is machine word aligned.
> >>>
> >>> Yes, but there is a compiler between the HLL code and what the
> >>> processor sees which can tear the write. How can not using
> >>> WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
> >>> it is OK only if the reader did a NULL check. In that case the torn
> >>
> >> Yes, a write-write data race where the value is the same is also fine.
> >>
> >> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
> >> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
> >>
> >> IIUC, even with _ONCE(), the compiler is within its right do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
> >
> > Volatile accesses are not specified very well in the standard. However,
> > as a practical matter, compilers that wish to be able to support device drivers
> > (whether in kernels or userspace applications) must compile those volatile
> > accesses in such a way to allow reliable device drivers to be written.
>
> Agreed.
>
> >
> >> However, the kernel consider the volatile access to be atomic, right?
> >
> > The compiler must therefore act as if a volatile access to an aligned
> > machine-word size location is atomic. To see this, consider accesses
> > to memory that is shared by a device driver and that device's firmware,
> > both of which are written in either C or C++.
>
> Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
> Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
That might have been a conscious and deliberate choice on the part of
the TSAN guys. Volatile does not imply any ordering in the standard
(other than the compiler avoiding emitting volatile operations out of
order), but some compilers did emit memory-barrier instructions for
volatile accesses. Which resulted in a lot of problems when such code
found compilers that did not cause the CPU to order volatile operations.
So a lot of people decided to throw the volatile baby out with the
unordered bathwater. ;-)
Thanx, Paul
> Thanks,
> Joel
>
>
>
> > Does that help?
> >
> > Thanx, Paul
> >
> >> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded from line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of kernel.
> >>
> >>> result will not change the semantics of the program. But otherwise,
> >>> that's bad.
> >>>
> >>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
> >>>
> >>> thanks,
> >>>
> >>> - Joel
> >>>
> >>>
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
> >>>>>>> +location, in code involved in a data race.
> >>>>>>> +
> >>>>>>> In short, updaters use rcu_assign_pointer() and readers use
> >>>>>>> rcu_dereference(), and these two RCU API elements work together to
> >>>>>>> ensure that readers have a consistent view of newly added data elements.
> >>>>>>> --
> >>>>>>> 2.41.0.585.gd2178a4bd4-goog
> >>
> >>
On Fri, Aug 04, 2023 at 06:52:32AM -0700, Paul E. McKenney wrote:
> On Fri, Aug 04, 2023 at 05:11:27AM +0000, Joel Fernandes wrote:
> > On Thu, Aug 03, 2023 at 11:52:06AM -0700, Paul E. McKenney wrote:
> > > On Thu, Aug 03, 2023 at 03:24:07AM +0000, Joel Fernandes (Google) wrote:
> > > > The compiler has the ability to cause misordering by destroying
> > > > address-dependency barriers if comparison operations are used. Add a
> > > > note about this to memory-barriers.txt and point to rcu-dereference.rst
> > > > for more information.
> > > >
> > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > ---
> > > > Documentation/memory-barriers.txt | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > > index 06e14efd8662..acc8ec5ce563 100644
> > > > --- a/Documentation/memory-barriers.txt
> > > > +++ b/Documentation/memory-barriers.txt
> > > > @@ -435,6 +435,11 @@ Memory barriers come in four basic varieties:
> > > > variables such as READ_ONCE() and rcu_dereference() provide implicit
> > > > address-dependency barriers.
> > > >
> > > > + [!] Note that address dependency barriers can be destroyed by comparison
> > > > + of a pointer obtained by a marked accessor such as READ_ONCE() or
> > > > + rcu_dereference() with some value. For an example of this, see
> > > > + rcu_dereference.rst (part where the comparison of pointers is discussed).
> > >
> > > Hmmm...
> > >
> > > Given that this is in a section marked "historical" (for the old
> > > smp_read_barrier_depends() API), why not instead add a pointer to
> > > Documentation/RCU/rcu_dereference.rst to the beginning of the section,
> > > noted as the updated material?
> >
> > Sounds good. There's also another section in the same file on Address
> > dependency barriers (also marked historical). So something like the
> > following?
>
> Given a Signed-off-by and so forth, I would be happy to take this one.
Thank you for helping me improve the docs; here it goes:
---8<-----------------------
From: "Joel Fernandes (Google)" <[email protected]>
Subject: [PATCH] docs: memory-barriers: Add note on compiler transformation
and address deps
The compiler has the ability to cause misordering by destroying
address-dependency barriers if comparison operations are used. Add a
note about this to memory-barriers.txt in the beginning of both the
historical address-dependency sections and point to rcu-dereference.rst
for more information.
Signed-off-by: Joel Fernandes (Google) <[email protected]>
---
Documentation/memory-barriers.txt | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index acc8ec5ce563..ba50220716ca 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -396,6 +396,10 @@ Memory barriers come in four basic varieties:
(2) Address-dependency barriers (historical).
+ [!] This section is marked as HISTORICAL: For more up-to-date
+ information, including how compiler transformations related to pointer
+ comparisons can sometimes cause problems, see
+ Documentation/RCU/rcu_dereference.rst.
An address-dependency barrier is a weaker form of read barrier. In the
case where two loads are performed such that the second depends on the
@@ -561,6 +565,9 @@ There are certain things that the Linux kernel memory barriers do not guarantee:
ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
----------------------------------------
+[!] This section is marked as HISTORICAL: For more up-to-date information,
+including how compiler transformations related to pointer comparisons can
+sometimes cause problems, see Documentation/RCU/rcu_dereference.rst.
As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
DEC Alpha, which means that about the only people who need to pay attention
--
2.41.0.585.gd2178a4bd4-goog
[...]
> > >> However, the kernel consider the volatile access to be atomic, right?
> > >
> > > The compiler must therefore act as if a volatile access to an aligned
> > > machine-word size location is atomic. To see this, consider accesses
> > > to memory that is shared by a device driver and that device's firmware,
> > > both of which are written in either C or C++.
> >
> > Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
> > Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
>
> That might have been a conscious and deliberate choice on the part of
> the TSAN guys. Volatile does not imply any ordering in the standard
> (other than the compiler avoiding emitting volatile operations out of
> order), but some compilers did emit memory-barrier instructions for
> volatile accesses. Which resulted in a lot of problems when such code
> found compilers that did not cause the CPU to order volatile operations.
>
> So a lot of people decided to throw the volatile baby out with the
> unordered bathwater. ;-)
Thanks for the input, I think TSAN was indeed worried about
memory ordering even if relaxed ordering was intended. I think there
is a way to tell TSAN to shut up in such situations, but in my last
Chrome sprint, I just used the atomic API with relaxed ordering and
called it a day. :-)
- Joel
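For reference, the kind of marked access described above might look roughly like
the following C11 sketch (the real Chrome code is C++ and uses its own atomics
wrappers; the names below are made up for illustration). TSAN treats these
explicit atomic accesses as intentional and does not report them as data races:

#include <stdatomic.h>

/* Hypothetical counter shared between threads; illustrative only. */
static _Atomic int shared_counter;

/* Writer: a relaxed RMW is enough when no ordering is required. */
void bump_counter(void)
{
        atomic_fetch_add_explicit(&shared_counter, 1, memory_order_relaxed);
}

/* Reader: relaxed load; TSAN sees a marked atomic access, not a plain racy load. */
int sample_counter(void)
{
        return atomic_load_explicit(&shared_counter, memory_order_relaxed);
}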
>
>>>
>>>>>> But the example here is different,
>>>>>
>>>>> That is intentional. Wills discussion partially triggered this. Though I am wondering
>>>>> if we should document that as well.
>>>>>
>>>>>> the compiler can not use the value loaded from line 5
>>>>>> unless the compiler can deduce that the tmp is equals to p in which case the address dependency
>>>>>> doesn’t exist anymore.
>>>>>>
>>>>>> What am I missing here?
>>>>>
>>>>> Maybe you are trying to rationalize too much that the sequence mentioned cannot result
>>>>> in a counter intuitive outcome like I did?
>>>>>
>>>>> The point AFAIU is not just about line 10 but that the compiler can replace any of the
>>>>> lines after the plain access with the cached value.
>>>>
>>>> Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
>>>>
>>>> However, what if a write is not protected with WRITE_ONCE and the read is marked with READ_ONCE?
>>>> That’s also a data race, right? But the kernel considers it is Okay if the write is machine word aligned.
>>>
>>> Yes, but there is a compiler between the HLL code and what the
>>> processor sees which can tear the write. How can not using
>>> WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
>>> it is OK only if the reader did a NULL check. In that case the torn
>>
>> Yes, a write-write data race where the value is the same is also fine.
>>
>> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
>> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
>>
>> IIUC, even with _ONCE(), the compiler is within its right do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
>>
>> However, the kernel consider the volatile access to be atomic, right?
>>
>> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded from line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of kernel.
>
> I am actually a bit lost with what you are trying to say. Are you saying that mixing
> plain accesses with marked accesses is an acceptable practice?
I’m trying to say that sometimes a data race is fine; that’s why we have data_race().
Even if the standard says a data race results in UB.
And IMHO, the possible data race at line 5 in this example is also fine, unless the compiler
deduces that the value of gp is always the same.
>
> I would like others to weigh in as well since I am not seeing what Alan is suggesting.
> AFAICS, in the absence of barrier(), any optimization caused by plain access
> makes it a bad practice to mix it.
>
> Thanks,
>
> - Joel
>
>
>
>>
>>> result will not change the semantics of the program. But otherwise,
>>> that's bad.
>>>
>>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
>>>
>>> thanks,
>>>
>>> - Joel
>>>
>>>
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
>>>>>>> +location, in code involved in a data race.
>>>>>>> +
>>>>>>> In short, updaters use rcu_assign_pointer() and readers use
>>>>>>> rcu_dereference(), and these two RCU API elements work together to
>>>>>>> ensure that readers have a consistent view of newly added data elements.
>>>>>>> --
>>>>>>> 2.41.0.585.gd2178a4bd4-goog
>> Yes, a write-write data race where the value is the same is also fine.
>>
>> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
>> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
>>
>> IIUC, even with _ONCE(), the compiler is within its right do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
>
> Volatile accesses are not specified very well in the standard. However,
> as a practical matter, compilers that wish to be able to support device drivers
> (whether in kernels or userspace applications) must compile those volatile
> accesses in such a way to allow reliable device drivers to be written.
>
>> However, the kernel consider the volatile access to be atomic, right?
>
> The compiler must therefore act as if a volatile access to an aligned
> machine-word size location is atomic. To see this, consider accesses
> to memory that is shared by a device driver and that device's firmware,
> both of which are written in either C or C++.
I learned these things a few months ago. But still, thank you!
The real problem is that there may be a data race at line 5, so Joel is correct that the compiler
can cache the value loaded from line 5 according to the standard, given that the standard says that
a data race results in undefined behavior, so the compiler might be allowed to do anything. But from the
perspective of the kernel, line 5 is likely a diagnostic read, so it’s fine without READ_ONCE() and the
compiler is not allowed to cache the value.
This situation is similar to the volatile-access case.
Am I missing something?
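If the diagnostic read really is wanted, one way to keep it without mixing plain
and marked accesses is to mark it as well, for example with rcu_access_pointer(),
which is meant for fetching an RCU-protected pointer that will not be
dereferenced. A minimal sketch, not part of the proposed patch; the struct and
helper names are invented for illustration:

#include <linux/rcupdate.h>
#include <linux/printk.h>
#include <linux/types.h>

struct foo {
        int a;
        int b;
};

/* Assumed to be published elsewhere via rcu_assign_pointer(). */
static struct foo __rcu *gp;

static void do_something(int a, int b) { }

static bool do_something_gp(void)
{
        struct foo *tmp;
        struct foo *p;

        rcu_read_lock();
        tmp = rcu_access_pointer(gp);   /* marked diagnostic load, no plain/marked mix */
        pr_info("Point gp = %p\n", tmp);

        p = rcu_dereference(gp);
        if (p) {
                do_something(p->a, p->b);
                rcu_read_unlock();
                return true;
        }
        rcu_read_unlock();
        return false;
}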
>
> Does that help?
>
> Thanx, Paul
>
>> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded from line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of kernel.
>>
>>> result will not change the semantics of the program. But otherwise,
>>> that's bad.
>>>
>>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
>>>
>>> thanks,
>>>
>>> - Joel
>>>
>>>
>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
>>>>>>> +location, in code involved in a data race.
>>>>>>> +
>>>>>>> In short, updaters use rcu_assign_pointer() and readers use
>>>>>>> rcu_dereference(), and these two RCU API elements work together to
>>>>>>> ensure that readers have a consistent view of newly added data elements.
>>>>>>> --
>>>>>>> 2.41.0.585.gd2178a4bd4-goog
On Fri, Aug 04, 2023 at 12:17:58PM -0400, Joel Fernandes wrote:
> [...]
> > > >> However, the kernel consider the volatile access to be atomic, right?
> > > >
> > > > The compiler must therefore act as if a volatile access to an aligned
> > > > machine-word size location is atomic. To see this, consider accesses
> > > > to memory that is shared by a device driver and that device's firmware,
> > > > both of which are written in either C or C++.
> > >
> > > Btw it appears TSAN complains bitterly about even volatile 4-byte data races.
> > > Hence we have to explicitly use the atomic API for data-race accesses in Chrome.
> >
> > That might have been a conscious and deliberate choice on the part of
> > the TSAN guys. Volatile does not imply any ordering in the standard
> > (other than the compiler avoiding emitting volatile operations out of
> > order), but some compilers did emit memory-barrier instructions for
> > volatile accesses. Which resulted in a lot of problems when such code
> > found compilers that did not cause the CPU to order volatile operations.
> >
> > So a lot of people decided to throw the volatile baby out with the
> > unordered bathwater. ;-)
>
> Thanks for the input, I think TSAN was indeed worried about
> memory ordering even if relaxed ordering was intended. I think there
> is a way to tell TSAN to shut up in such situations, but in my last
> Chrome sprint, I just used the atomic API with relaxed ordering and
> called it a day. :-)
Fair enough!
Note that the Linux kernel's version of TSAN, which is KCSAN, does
interpret volatile accesses more or less as if they were relaxed atomics.
So TSAN could change, but I don't have a dog in that fight. ;-)
Thanx, Paul
On Fri, Aug 4, 2023 at 11:47 AM Alan Huang <[email protected]> wrote:
>
> >
> >>>
> >>>>>> But the example here is different,
> >>>>>
> >>>>> That is intentional. Wills discussion partially triggered this. Though I am wondering
> >>>>> if we should document that as well.
> >>>>>
> >>>>>> the compiler can not use the value loaded from line 5
> >>>>>> unless the compiler can deduce that the tmp is equals to p in which case the address dependency
> >>>>>> doesn’t exist anymore.
> >>>>>>
> >>>>>> What am I missing here?
> >>>>>
> >>>>> Maybe you are trying to rationalize too much that the sequence mentioned cannot result
> >>>>> in a counter intuitive outcome like I did?
> >>>>>
> >>>>> The point AFAIU is not just about line 10 but that the compiler can replace any of the
> >>>>> lines after the plain access with the cached value.
> >>>>
> >>>> Well, IIUC, according to the C standard, the compiler can do anything if there is a data race (undefined behavior).
> >>>>
> >>>> However, what if a write is not protected with WRITE_ONCE and the read is marked with READ_ONCE?
> >>>> That’s also a data race, right? But the kernel considers it is Okay if the write is machine word aligned.
> >>>
> >>> Yes, but there is a compiler between the HLL code and what the
> >>> processor sees which can tear the write. How can not using
> >>> WRITE_ONCE() prevent store-tearing? See [1]. My understanding is that
> >>> it is OK only if the reader did a NULL check. In that case the torn
> >>
> >> Yes, a write-write data race where the value is the same is also fine.
> >>
> >> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
> >> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
> >>
> >> IIUC, even with _ONCE(), the compiler is within its right do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
> >>
> >> However, the kernel consider the volatile access to be atomic, right?
> >>
> >> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded from line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of kernel.
> >
> > I am actually a bit lost with what you are trying to say. Are you saying that mixing
> > plain accesses with marked accesses is an acceptable practice?
>
>
> I’m trying to say that sometimes a data race is fine; that’s why we have data_race().
>
> Even if the standard says a data race results in UB.
>
> And IMHO, the possible data race at line 5 in this example is also fine, unless the compiler
> deduces that the value of gp is always the same.
IMHO, no one is saying it is not "fine". As in, such behavior is
neither a compiler bug nor strictly a kernel bug. More a wtf that the
programmer should know of (does not hurt to know).
I will rest my case with AlanH pending any input from people who know
more than me. If there is a better way to represent such matters in
the docs, I am happy to make changes to this patch.
Cheers,
- Joel
On Sat, Aug 05, 2023 at 12:33:03AM +0800, Alan Huang wrote:
>
> >> Yes, a write-write data race where the value is the same is also fine.
> >>
> >> But they are still data race, if the compiler is within its right to do anything it likes (due to data race),
> >> we still need WRITE_ONCE() in these cases, though it’s semantically safe.
> >>
> >> IIUC, even with _ONCE(), the compiler is within its right do anything according to the standard (at least before the upcoming C23), because the standard doesn’t consider a volatile access to be atomic.
> >
> > Volatile accesses are not specified very well in the standard. However,
> > as a practical matter, compilers that wish to be able to support device drivers
> > (whether in kernels or userspace applications) must compile those volatile
> > accesses in such a way to allow reliable device drivers to be written.
> >
> >> However, the kernel consider the volatile access to be atomic, right?
> >
> > The compiler must therefore act as if a volatile access to an aligned
> > machine-word size location is atomic. To see this, consider accesses
> > to memory that is shared by a device driver and that device's firmware,
> > both of which are written in either C or C++.
>
> I learned these things a few months ago. But still, thank you!
>
> The real problem is that there may be a data race at line 5, so Joel is correct that the compiler
> can cache the value loaded from line 5 according to the standard, given that the standard says that
> a data race results in undefined behavior, so the compiler might be allowed to do anything. But from the
> perspective of the kernel, line 5 is likely a diagnostic read, so it’s fine without READ_ONCE() and the
> compiler is not allowed to cache the value.
>
> This situation is similar to the volatile-access case.
>
> Am I missing something?
I think you have it right.
The point is that we are sometimes more concerned about focusing KCSAN
diagnostics on the core concurrent algorithm, and are willing to take
the very low risk of messed-up diagnostic output in order to get simpler
and better KCSAN diagnostics on the main algorithm.
So in that case, we use data_race() on the diagnostics and other markings
in the main algorithm.
For example, suppose that we had a core algorithm that relied on strict
locking. In that case, we want to use unmarked plain C-language accesses
in the core algorithm, which will allow KCSAN to flag any accesses that
are not protected by the lock. But it might be bad for the diagnostic
code to acquire that lock, as this would suppress diagnostics in the case
where the lock was held for too long a time period. Using data_race()
in the diagnostic code addresses this situation.
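A rough sketch of that pattern (not taken from the kernel source; the lock,
counter, and function names are made up for illustration):

#include <linux/spinlock.h>
#include <linux/compiler.h>
#include <linux/printk.h>

/* Hypothetical state that the core algorithm touches only under foo_lock. */
static DEFINE_SPINLOCK(foo_lock);
static unsigned long foo_count;

/*
 * Core algorithm: plain accesses, so KCSAN flags any racing access,
 * that is, any access to foo_count not protected by foo_lock.
 */
void foo_inc(void)
{
        spin_lock(&foo_lock);
        foo_count++;
        spin_unlock(&foo_lock);
}

/*
 * Diagnostic code: reads without taking the lock; data_race() tells KCSAN
 * that this particular race is intentional, keeping reports focused on the
 * core algorithm.
 */
void foo_dump(void)
{
        pr_info("foo_count is approximately %lu\n", data_race(foo_count));
}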
Thanx, Paul
> > Does that help?
> >
> > Thanx, Paul
> >
> >> BTW, line 5 in the example is likely to be optimized away. And yes, the compiler can cache the value loaded from line 5 from the perspective of undefined behavior, even if I believe it would be a compiler bug from the perspective of kernel.
> >>
> >>> result will not change the semantics of the program. But otherwise,
> >>> that's bad.
> >>>
> >>> [1] https://lwn.net/Articles/793253/#Store%20Tearing
> >>>
> >>> thanks,
> >>>
> >>> - Joel
> >>>
> >>>
> >>>>
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> +plain accesses of a memory location with rcu_dereference() of the same memory
> >>>>>>> +location, in code involved in a data race.
> >>>>>>> +
> >>>>>>> In short, updaters use rcu_assign_pointer() and readers use
> >>>>>>> rcu_dereference(), and these two RCU API elements work together to
> >>>>>>> ensure that readers have a consistent view of newly added data elements.
> >>>>>>> --
> >>>>>>> 2.41.0.585.gd2178a4bd4-goog
>
>
On Fri, Aug 04, 2023 at 04:27:45PM +0000, Joel Fernandes wrote:
> On Fri, Aug 04, 2023 at 06:52:32AM -0700, Paul E. McKenney wrote:
> > On Fri, Aug 04, 2023 at 05:11:27AM +0000, Joel Fernandes wrote:
> > > On Thu, Aug 03, 2023 at 11:52:06AM -0700, Paul E. McKenney wrote:
> > > > On Thu, Aug 03, 2023 at 03:24:07AM +0000, Joel Fernandes (Google) wrote:
> > > > > The compiler has the ability to cause misordering by destroying
> > > > > address-dependency barriers if comparison operations are used. Add a
> > > > > note about this to memory-barriers.txt and point to rcu-dereference.rst
> > > > > for more information.
> > > > >
> > > > > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > > > > ---
> > > > > Documentation/memory-barriers.txt | 5 +++++
> > > > > 1 file changed, 5 insertions(+)
> > > > >
> > > > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > > > > index 06e14efd8662..acc8ec5ce563 100644
> > > > > --- a/Documentation/memory-barriers.txt
> > > > > +++ b/Documentation/memory-barriers.txt
> > > > > @@ -435,6 +435,11 @@ Memory barriers come in four basic varieties:
> > > > > variables such as READ_ONCE() and rcu_dereference() provide implicit
> > > > > address-dependency barriers.
> > > > >
> > > > > + [!] Note that address dependency barriers can be destroyed by comparison
> > > > > + of a pointer obtained by a marked accessor such as READ_ONCE() or
> > > > > + rcu_dereference() with some value. For an example of this, see
> > > > > + rcu_dereference.rst (part where the comparison of pointers is discussed).
> > > >
> > > > Hmmm...
> > > >
> > > > Given that this is in a section marked "historical" (for the old
> > > > smp_read_barrier_depends() API), why not instead add a pointer to
> > > > Documentation/RCU/rcu_dereference.rst to the beginning of the section,
> > > > noted as the updated material?
> > >
> > > Sounds good. There's also another section in the same file on Address
> > > dependency barriers (also marked historical). So something like the
> > > following?
> >
> > Given a Signed-off-by and so forth, I would be happy to take this one.
>
> Thank you for helping me improve the docs; here it goes:
>
> ---8<-----------------------
>
> From: "Joel Fernandes (Google)" <[email protected]>
> Subject: [PATCH] docs: memory-barriers: Add note on compiler transformation
> and address deps
>
> The compiler has the ability to cause misordering by destroying
> address-dependency barriers if comparison operations are used. Add a
> note about this to memory-barriers.txt in the beginning of both the
> historical address-dependency sections and point to rcu-dereference.rst
> for more information.
>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
Queued and pushed, thank you!
Thanx, Paul
> ---
> Documentation/memory-barriers.txt | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index acc8ec5ce563..ba50220716ca 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -396,6 +396,10 @@ Memory barriers come in four basic varieties:
>
>
> (2) Address-dependency barriers (historical).
> + [!] This section is marked as HISTORICAL: For more up-to-date
> + information, including how compiler transformations related to pointer
> + comparisons can sometimes cause problems, see
> + Documentation/RCU/rcu_dereference.rst.
>
> An address-dependency barrier is a weaker form of read barrier. In the
> case where two loads are performed such that the second depends on the
> @@ -561,6 +565,9 @@ There are certain things that the Linux kernel memory barriers do not guarantee:
>
> ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
> ----------------------------------------
> +[!] This section is marked as HISTORICAL: For more up-to-date information,
> +including how compiler transformations related to pointer comparisons can
> +sometimes cause problems, see Documentation/RCU/rcu_dereference.rst.
>
> As of v4.15 of the Linux kernel, an smp_mb() was added to READ_ONCE() for
> DEC Alpha, which means that about the only people who need to pay attention
> --
> 2.41.0.585.gd2178a4bd4-goog
>