Frederic,
In the perf ring buffer code we have this in perf_output_put_handle():
if (!local_dec_and_test(&rb->nest))
goto out;
/*
* Publish the known good head. Rely on the full barrier implied
* by atomic_dec_and_test() order the rb->head read and this
* write.
*/
rb->user_page->data_head = head;
The comment says atomic_dec_and_test() but the code is
local_dec_and_test().
On powerpc, local_dec_and_test() doesn't have a memory barrier but
atomic_dec_and_test() does. Is the comment wrong, or is
local_dec_and_test() supposed to imply a memory barrier too and we have
it wrongly implemented on powerpc?
My guess is that local_dec_and_test() is correct but we need to add an
explicit memory barrier like below:
(Kudos to Victor Kaplansky for finding this)
Mikey
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..95768c6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,10 @@ again:
goto out;
/*
- * Publish the known good head. Rely on the full barrier implied
- * by atomic_dec_and_test() order the rb->head read and this
- * write.
+ * Publish the known good head. We need a memory barrier to order the
+ * order the rb->head read and this write.
*/
+ smp_mb ();
rb->user_page->data_head = head;
/*
See below.
Michael Neuling <[email protected]> wrote on 10/23/2013 02:54:54 AM:
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..95768c6 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,10 @@ again:
> goto out;
>
> /*
> - * Publish the known good head. Rely on the full barrier implied
> - * by atomic_dec_and_test() order the rb->head read and this
> - * write.
> + * Publish the known good head. We need a memory barrier to order the
> + * order the rb->head read and this write.
> */
> + smp_mb ();
> rb->user_page->data_head = head;
>
> /*
1. As far as I understand, smp_mb() is superfluous in this case; smp_wmb()
should be enough.
(The same goes for the space between the function name and the open
parenthesis :-) )
2. Again, as far as I understand from ./Documentation/atomic_ops.txt, it is
a mistake in architecture-independent code to rely on memory barriers being
implied by atomic operations, all the more so by "local" operations.
3. The solution above is sub-optimal on architectures where a memory barrier
is already part of the "local" operation, since we would execute two
consecutive barriers. So, maybe, it would be better to use
smp_mb__after_atomic_dec() (see the sketch after this list).
4. I'm not sure, but I think there is another, unrelated potential problem
in function perf_output_put_handle() - the write to "data_head" -
kernel/events/ring_buffer.c:

	/*
	 * Publish the known good head. Rely on the full barrier implied
	 * by atomic_dec_and_test() order the rb->head read and this
	 * write.
	 */
	rb->user_page->data_head = head;

As data_head is 64-bit wide, the update should be done by an
atomic64_set().
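To make point 3 concrete, the alternative would look roughly like the sketch
below. It is only a sketch: it assumes smp_mb__after_atomic_dec() is also a
legitimate companion to a local_*() op, which, per point 2, atomic_ops.txt
does not actually promise.

	if (!local_dec_and_test(&rb->nest))
		goto out;

	/*
	 * Sketch only: avoid issuing a second barrier on architectures
	 * whose local_dec_and_test() is already a barrier-implying
	 * atomic op; elsewhere this expands to a full smp_mb().
	 */
	smp_mb__after_atomic_dec();
	rb->user_page->data_head = head;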
Regards,
-- Victor
On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> Frederic,
>
> In the perf ring buffer code we have this in perf_output_get_handle():
>
> if (!local_dec_and_test(&rb->nest))
> goto out;
>
> /*
> * Publish the known good head. Rely on the full barrier implied
> * by atomic_dec_and_test() order the rb->head read and this
> * write.
> */
> rb->user_page->data_head = head;
>
> The comment says atomic_dec_and_test() but the code is
> local_dec_and_test().
>
> On powerpc, local_dec_and_test() doesn't have a memory barrier but
> atomic_dec_and_test() does. Is the comment wrong, or is
> local_dec_and_test() suppose to imply a memory barrier too and we have
> it wrongly implemented in powerpc?
>
> My guess is that local_dec_and_test() is correct but we to add an
> explicit memory barrier like below:
>
> (Kudos to Victor Kaplansky for finding this)
>
> Mikey
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..95768c6 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,10 @@ again:
> goto out;
>
> /*
> - * Publish the known good head. Rely on the full barrier implied
> - * by atomic_dec_and_test() order the rb->head read and this
> - * write.
> + * Publish the known good head. We need a memory barrier to order the
> + * order the rb->head read and this write.
> */
> + smp_mb ();
> rb->user_page->data_head = head;
>
> /*
I'm adding Peter in Cc since he wrote that code.
I agree that local_dec_and_test() doesn't need to imply an SMP barrier.
All it has to guarantee is atomicity against local concurrent
operations (interrupts, preemption, ...).
Now I'm a bit confused about this barrier.
I think we want this ordering:
    Kernel                                  User

    READ rb->user_page->data_tail           READ rb->user_page->data_head
    smp_mb()                                smp_mb()
    WRITE rb data                           READ rb data
    smp_mb()                                smp_mb()
    rb->user_page->data_head                WRITE rb->user_page->data_tail
So yeah, we want a barrier between the data being published and the user-visible
data_head.
But this ordering concerns a wider layout than just rb->head and rb->user_page->data_head.
And BTW I can see an smp_rmb() after we read rb->user_page->data_tail. This is probably the
first kernel barrier in my example above (not sure if rmb() alone is enough though).
2013/10/23 Frederic Weisbecker <[email protected]>:
> On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
>> Frederic,
>>
>> In the perf ring buffer code we have this in perf_output_get_handle():
>>
>> if (!local_dec_and_test(&rb->nest))
>> goto out;
>>
>> /*
>> * Publish the known good head. Rely on the full barrier implied
>> * by atomic_dec_and_test() order the rb->head read and this
>> * write.
>> */
>> rb->user_page->data_head = head;
>>
>> The comment says atomic_dec_and_test() but the code is
>> local_dec_and_test().
>>
>> On powerpc, local_dec_and_test() doesn't have a memory barrier but
>> atomic_dec_and_test() does. Is the comment wrong, or is
>> local_dec_and_test() suppose to imply a memory barrier too and we have
>> it wrongly implemented in powerpc?
>>
>> My guess is that local_dec_and_test() is correct but we to add an
>> explicit memory barrier like below:
>>
>> (Kudos to Victor Kaplansky for finding this)
>>
>> Mikey
>>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index cd55144..95768c6 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -87,10 +87,10 @@ again:
>> goto out;
>>
>> /*
>> - * Publish the known good head. Rely on the full barrier implied
>> - * by atomic_dec_and_test() order the rb->head read and this
>> - * write.
>> + * Publish the known good head. We need a memory barrier to order the
>> + * order the rb->head read and this write.
>> */
>> + smp_mb ();
>> rb->user_page->data_head = head;
>>
>> /*
>
>
> I'm adding Peter in Cc since he wrote that code.
> I agree that local_dec_and_test() doesn't need to imply an smp barrier.
> All it has to provide as a guarantee is the atomicity against local concurrent
> operations (interrupts, preemption, ...).
>
> Now I'm a bit confused about this barrier.
>
> I think we want this ordering:
>
> Kernel User
>
> READ rb->user_page->data_tail READ rb->user_page->data_head
> smp_mb() smp_mb()
> WRITE rb data READ rb data
> smp_mb() smp_mb()
> rb->user_page->data_head WRITE rb->user_page->data_tail
^^ I meant a write above for data_head.
On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> > Frederic,
> >
> > The comment says atomic_dec_and_test() but the code is
> > local_dec_and_test().
> >
> > On powerpc, local_dec_and_test() doesn't have a memory barrier but
> > atomic_dec_and_test() does. Is the comment wrong, or is
> > local_dec_and_test() suppose to imply a memory barrier too and we have
> > it wrongly implemented in powerpc?
My bad; I converted from atomic to local without actually thinking, it
seems. Your implementation of the local primitives is fine.
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index cd55144..95768c6 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -87,10 +87,10 @@ again:
> > goto out;
> >
> > /*
> > - * Publish the known good head. Rely on the full barrier implied
> > - * by atomic_dec_and_test() order the rb->head read and this
> > - * write.
> > + * Publish the known good head. We need a memory barrier to order the
> > + * order the rb->head read and this write.
> > */
> > + smp_mb ();
> > rb->user_page->data_head = head;
> >
> > /*
Right; so that would indeed be what the comment suggests it should be.
However I think the comment is now actively wrong too :-)
Since on the kernel side the buffer is strictly per-cpu, we don't need
memory barriers there.
> I think we want this ordering:
>
> Kernel User
>
> READ rb->user_page->data_tail READ rb->user_page->data_head
> smp_mb() smp_mb()
> WRITE rb data READ rb data
> smp_mb() smp_mb()
> rb->user_page->data_head WRITE rb->user_page->data_tail
>
I would argue for:
    READ ->data_tail                        READ ->data_head
    smp_rmb()   (A)                         smp_rmb()   (C)
    WRITE $data                             READ $data
    smp_wmb()   (B)                         smp_mb()    (D)
    STORE ->data_head                       WRITE ->data_tail
Where A pairs with D, and B pairs with C.
I don't think A needs to be a full barrier because we won't in fact
write data until we see the store from userspace. So we simply don't
issue the data WRITE until we observe it.
OTOH, D needs to be a full barrier since it separates the data READ from
the tail WRITE.
For B a WMB is sufficient since it separates two WRITEs, and for C an
RMB is sufficient since it separates two READs.
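For illustration, the user-space (consumer) side of that diagram would look
roughly like the sketch below; 'base', 'mask' and consume_record() are
made-up names, and rmb()/mb() stand in for whatever user-space spelling of
smp_rmb()/smp_mb() is available -- this is not the actual perf tool code.

	#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))

	static void drain_ring(struct perf_event_mmap_page *up,
			       unsigned char *base, __u64 mask)
	{
		__u64 tail = up->data_tail;
		__u64 head = ACCESS_ONCE(up->data_head); /* READ ->data_head */

		rmb();					/* (C): pairs with (B) */

		while (tail != head) {
			struct perf_event_header *hdr =
				(void *)(base + (tail & mask));

			consume_record(hdr);		/* READ $data */
			tail += hdr->size;
		}

		mb();					/* (D): pairs with (A) */
		up->data_tail = tail;			/* WRITE ->data_tail */
	}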
---
kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144270b5..c91274ef4e23 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
goto out;
/*
- * Publish the known good head. Rely on the full barrier implied
- * by atomic_dec_and_test() order the rb->head read and this
- * write.
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ * kernel user
+ *
+ * READ ->data_tail READ ->data_head
+ * smp_rmb() (A) smp_rmb() (C)
+ * WRITE $data READ $data
+ * smp_wmb() (B) smp_mb() (D)
+ * STORE ->data_head WRITE ->data_tail
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * I don't think A needs to be a full barrier because we won't in fact
+ * write data until we see the store from userspace. So we simply don't
+ * issue the data WRITE until we observe it.
+ *
+ * OTOH, D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for C
+ * an RMB is sufficient since it separates two READs.
+ *
+ * See perf_output_begin().
*/
+ smp_wmb();
rb->user_page->data_head = head;
/*
@@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output_handle *handle,
* Userspace could choose to issue a mb() before updating the
* tail pointer. So that all reads will be completed before the
* write is issued.
+ *
+ * See perf_output_put_handle().
*/
tail = ACCESS_ONCE(rb->user_page->data_tail);
smp_rmb();
> I would argue for:
>
> READ ->data_tail READ ->data_head
> smp_rmb() (A) smp_rmb() (C)
> WRITE $data READ $data
> smp_wmb() (B) smp_mb() (D)
> STORE ->data_head WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
>
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
>
> OTOH, D needs to be a full barrier since it separates the data READ from
> the tail WRITE.
>
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.
FWIW the testing Victor did confirms WMB is good enough on powerpc.
Thanks,
Mikey
>
> ---
> kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
> 1 file changed, 26 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144270b5..c91274ef4e23 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> goto out;
>
> /*
> - * Publish the known good head. Rely on the full barrier implied
> - * by atomic_dec_and_test() order the rb->head read and this
> - * write.
> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + * kernel user
> + *
> + * READ ->data_tail READ ->data_head
> + * smp_rmb() (A) smp_rmb() (C)
> + * WRITE $data READ $data
> + * smp_wmb() (B) smp_mb() (D)
> + * STORE ->data_head WRITE ->data_tail
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * I don't think A needs to be a full barrier because we won't in fact
> + * write data until we see the store from userspace. So we simply don't
> + * issue the data WRITE until we observe it.
> + *
> + * OTOH, D needs to be a full barrier since it separates the data READ
> + * from the tail WRITE.
> + *
> + * For B a WMB is sufficient since it separates two WRITEs, and for C
> + * an RMB is sufficient since it separates two READs.
> + *
> + * See perf_output_begin().
> */
> + smp_wmb();
> rb->user_page->data_head = head;
>
> /*
> @@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output_handle *handle,
> * Userspace could choose to issue a mb() before updating the
> * tail pointer. So that all reads will be completed before the
> * write is issued.
> + *
> + * See perf_output_put_handle().
> */
> tail = ACCESS_ONCE(rb->user_page->data_tail);
> smp_rmb();
>
Peter Zijlstra <[email protected]> wrote on 10/25/2013 07:37:49 PM:
> I would argue for:
>
> READ ->data_tail READ ->data_head
> smp_rmb() (A) smp_rmb() (C)
> WRITE $data READ $data
> smp_wmb() (B) smp_mb() (D)
> STORE ->data_head WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
1. I agree. My only concern is that architectures which do use atomic
operations with memory barriers will now issue two consecutive barriers,
which is sub-optimal.
2. I think the comment in "include/linux/perf_event.h" describing "data_head"
and "data_tail" for user space needs an update as well. Current version -

	/*
	 * Control data for the mmap() data buffer.
	 *
	 * User-space reading the @data_head value should issue an rmb(), on
	 * SMP capable platforms, after reading this value -- see
	 * perf_event_wakeup().
	 *
	 * When the mapping is PROT_WRITE the @data_tail value should be
	 * written by userspace to reflect the last read data. In this case
	 * the kernel will not over-write unread data.
	 */
	__u64   data_head;		/* head in the data section */
	__u64   data_tail;		/* user-space written tail */

- says nothing about the need for a memory barrier before the "data_tail" write.
-- Victor
On Sun, Oct 27, 2013 at 11:00:33AM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <[email protected]> wrote on 10/25/2013 07:37:49 PM:
>
> > I would argue for:
> >
> > READ ->data_tail READ ->data_head
> > smp_rmb() (A) smp_rmb() (C)
> > WRITE $data READ $data
> > smp_wmb() (B) smp_mb() (D)
> > STORE ->data_head WRITE ->data_tail
> >
> > Where A pairs with D, and B pairs with C.
>
> 1. I agree. My only concern is that architectures which do use atomic
> operations
> with memory barriers, will issue two consecutive barriers now, which is
> sub-optimal.
Yeah, although that would be fairly easy for the CPUs themselves to optimize;
not sure they actually do this though.
But we don't really have much choice aside from introducing things like
smp_wmb__after_local_$op; and I'm fairly sure people won't like adding a
ton of conditional barriers like that either.
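Just to make that concrete, such a (non-existent) primitive would presumably
end up looking something like the sketch below, defined per architecture; the
macro name and the config symbol are made up.

	/* Hypothetical -- nothing like this exists in the tree today. */
	#ifdef CONFIG_ARCH_HAS_BARRIER_IN_LOCAL_OPS
	/* local_dec_and_test() already ends with a full barrier here */
	#define smp_wmb__after_local_dec()	barrier()
	#else
	#define smp_wmb__after_local_dec()	smp_wmb()
	#endif

which perf_output_put_handle() would then use as:

	if (!local_dec_and_test(&rb->nest))
		goto out;

	smp_wmb__after_local_dec();
	rb->user_page->data_head = head;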
> 2. I think the comment in "include/linux/perf_event.h" describing
> "data_head" and
> "data_tail" for user space need an update as well. Current version -
Oh, indeed. Thanks; I'll update that too!
2013/10/25 Peter Zijlstra <[email protected]>:
> On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> I would argue for:
>
> READ ->data_tail READ ->data_head
> smp_rmb() (A) smp_rmb() (C)
> WRITE $data READ $data
> smp_wmb() (B) smp_mb() (D)
> STORE ->data_head WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
>
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
>
> OTOH, D needs to be a full barrier since it separates the data READ from
> the tail WRITE.
>
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.
Hmm, I need to defer to you on that, I'm not yet comfortable with
picking specific barrier flavours when both writes and reads are
involved on the same side :)
> From: Frederic Weisbecker <[email protected]>
>
> 2013/10/25 Peter Zijlstra <[email protected]>:
> > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> > I would argue for
> >
> > READ ->data_tail READ ->data_head
> > smp_rmb() (A) smp_rmb() (C)
> > WRITE $data READ $data
> > smp_wmb() (B) smp_mb() (D)
> > STORE ->data_head WRITE ->data_tail
> >
> > Where A pairs with D, and B pairs with C.
> >
> > I don't think A needs to be a full barrier because we won't in fact
> > write data until we see the store from userspace. So we simply don't
> > issue the data WRITE until we observe it.
> >
> > OTOH, D needs to be a full barrier since it separates the data READ
from
> > the tail WRITE.
> >
> > For B a WMB is sufficient since it separates two WRITEs, and for C an
> > RMB is sufficient since it separates two READs.
>
> Hmm, I need to defer on you for that, I'm not yet comfortable with
> picking specific barrier flavours when both write and read are
> involved in a same side :)
I think you have a point :) IMO, memory barrier (A) is superfluous.
On the producer side we need to ensure that "WRITE $data" is not committed
to memory before "READ ->data_tail" has seen a new value, when the old
value indicated that there is not enough space for a new entry. All this is
already guaranteed by the control flow dependency on a single CPU - writes
will not be committed to memory if the read value of "data_tail" doesn't
indicate enough free space in the ring buffer.
Likewise, on the consumer side, we can make use of the natural data
dependency and the memory ordering guarantee for a single CPU and try to
replace the "smp_mb" with a more light-weight "smp_rmb":
    READ ->data_tail                        READ ->data_head
    // ...                                  smp_rmb()   (C)
    WRITE $data                             READ $data
    smp_wmb()   (B)                         smp_rmb()   (D)
                                            READ $header_size
    STORE ->data_head                       WRITE ->data_tail = $old_data_tail + $header_size
We ensure that all $data is read before "data_tail" is written by doing
"READ $header_size" after all other data is read, and we rely on the
natural data dependency between the "data_tail" write and the
"header_size" read.
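In code, the consumer side of that scheme would look roughly like the sketch
below. This is only a sketch of the idea under the assumptions above (a
single reader, records starting with a struct perf_event_header); 'up',
'base', 'mask' and consume_record() are illustrative names, and smp_rmb()
stands for whatever user-space spelling of it is available.

	__u64 tail = up->data_tail;
	__u64 head = ACCESS_ONCE(up->data_head);	/* READ ->data_head */

	smp_rmb();					/* (C) */

	while (tail != head) {
		struct perf_event_header *hdr =
			(void *)(base + (tail & mask));

		consume_record(hdr);			/* READ $data */

		smp_rmb();			/* (D), downgraded from smp_mb() */
		/*
		 * Read the size last; the store below carries a data
		 * dependency on it, so it cannot be issued before this
		 * read (and, via the smp_rmb(), before the $data reads)
		 * has completed.
		 */
		tail += ACCESS_ONCE(hdr->size);		/* READ $header_size */
		up->data_tail = tail;			/* WRITE ->data_tail */
	}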
-- Victor
On Mon, Oct 28, 2013 at 02:38:29PM +0200, Victor Kaplansky wrote:
> > 2013/10/25 Peter Zijlstra <[email protected]>:
> > > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> > > I would argue for
> > >
> > > READ ->data_tail READ ->data_head
> > > smp_rmb() (A) smp_rmb() (C)
> > > WRITE $data READ $data
> > > smp_wmb() (B) smp_mb() (D)
> > > STORE ->data_head WRITE ->data_tail
> > >
> > > Where A pairs with D, and B pairs with C.
> > >
> > > I don't think A needs to be a full barrier because we won't in fact
> > > write data until we see the store from userspace. So we simply don't
> > > issue the data WRITE until we observe it.
> > >
> > > OTOH, D needs to be a full barrier since it separates the data READ from
> > > the tail WRITE.
> > >
> > > For B a WMB is sufficient since it separates two WRITEs, and for C an
> > > RMB is sufficient since it separates two READs.
<snip>
> I think you have a point :) IMO, memory barrier (A) is superfluous.
> At producer side we need to ensure that "WRITE $data" is not committed
> to memory before "READ ->data_tail" had seen a new value and if the
> old one indicated that there is no enough space for a new entry. All
> this is already guaranteed by control flow dependancy on single CPU -
> writes will not be committed to the memory if read value of
> "data_tail" doesn't specify enough free space in the ring buffer.
>
> Likewise, on consumer side, we can make use of natural data dependency and
> memory ordering guarantee for single CPU and try to replace "smp_mb" by
> a more light-weight "smp_rmb":
>
> READ ->data_tail READ ->data_head
> // ... smp_rmb() (C)
> WRITE $data READ $data
> smp_wmb() (B) smp_rmb() (D)
> READ $header_size
> STORE ->data_head WRITE ->data_tail = $old_data_tail +
> $header_size
>
> We ensure that all $data is read before "data_tail" is written by
> doing "READ $header_size" after all other data is read and we rely on
> natural data dependancy between "data_tail" write and "header_size"
> read.
I'm not entirely sure I get the $header_size trickery; need to think
more on that. But yes, I did consider the other one. However, I had
trouble having no pairing barrier for (D).
ISTR something like Alpha being able to miss the update (for a long
while) if you don't issue the RMB.
Let's add Paul and Oleg to the thread; this is getting far more 'fun'
than it should be ;-)
For completeness; below the patch as I had queued it.
---
Subject: perf: Fix perf ring buffer memory ordering
From: Peter Zijlstra <[email protected]>
Date: Mon Oct 28 13:55:29 CET 2013
The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.
When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.
Cc: [email protected]
Cc: [email protected]
Cc: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Reported-by: Victor Kaplansky <[email protected]>
Tested-by: Victor Kaplansky <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/uapi/linux/perf_event.h | 12 +++++++-----
kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
2 files changed, 33 insertions(+), 8 deletions(-)
Index: linux-2.6/include/uapi/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/perf_event.h
+++ linux-2.6/include/uapi/linux/perf_event.h
@@ -479,13 +479,15 @@ struct perf_event_mmap_page {
/*
* Control data for the mmap() data buffer.
*
- * User-space reading the @data_head value should issue an rmb(), on
- * SMP capable platforms, after reading this value -- see
- * perf_event_wakeup().
+ * User-space reading the @data_head value should issue an smp_rmb(),
+ * after reading this value.
*
* When the mapping is PROT_WRITE the @data_tail value should be
- * written by userspace to reflect the last read data. In this case
- * the kernel will not over-write unread data.
+ * written by userspace to reflect the last read data, after issueing
+ * an smp_mb() to separate the data read from the ->data_tail store.
+ * In this case the kernel will not over-write unread data.
+ *
+ * See perf_output_put_handle() for the data ordering.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
Index: linux-2.6/kernel/events/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/events/ring_buffer.c
+++ linux-2.6/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
goto out;
/*
- * Publish the known good head. Rely on the full barrier implied
- * by atomic_dec_and_test() order the rb->head read and this
- * write.
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ * kernel user
+ *
+ * READ ->data_tail READ ->data_head
+ * smp_rmb() (A) smp_rmb() (C)
+ * WRITE $data READ $data
+ * smp_wmb() (B) smp_mb() (D)
+ * STORE ->data_head WRITE ->data_tail
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * I don't think A needs to be a full barrier because we won't in fact
+ * write data until we see the store from userspace. So we simply don't
+ * issue the data WRITE until we observe it.
+ *
+ * OTOH, D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for C
+ * an RMB is sufficient since it separates two READs.
+ *
+ * See perf_output_begin().
*/
+ smp_wmb();
rb->user_page->data_head = head;
/*
@@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output
* Userspace could choose to issue a mb() before updating the
* tail pointer. So that all reads will be completed before the
* write is issued.
+ *
+ * See perf_output_put_handle().
*/
tail = ACCESS_ONCE(rb->user_page->data_tail);
smp_rmb();
On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2013 at 02:38:29PM +0200, Victor Kaplansky wrote:
> > > 2013/10/25 Peter Zijlstra <[email protected]>:
> > > > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> > > > I would argue for
> > > >
> > > > READ ->data_tail READ ->data_head
> > > > smp_rmb() (A) smp_rmb() (C)
> > > > WRITE $data READ $data
> > > > smp_wmb() (B) smp_mb() (D)
> > > > STORE ->data_head WRITE ->data_tail
> > > >
> > > > Where A pairs with D, and B pairs with C.
> > > >
> > > > I don't think A needs to be a full barrier because we won't in fact
> > > > write data until we see the store from userspace. So we simply don't
> > > > issue the data WRITE until we observe it.
> > > >
> > > > OTOH, D needs to be a full barrier since it separates the data READ from
> > > > the tail WRITE.
> > > >
> > > > For B a WMB is sufficient since it separates two WRITEs, and for C an
> > > > RMB is sufficient since it separates two READs.
>
> <snip>
>
> > I think you have a point :) IMO, memory barrier (A) is superfluous.
> > At producer side we need to ensure that "WRITE $data" is not committed
> > to memory before "READ ->data_tail" had seen a new value and if the
> > old one indicated that there is no enough space for a new entry. All
> > this is already guaranteed by control flow dependancy on single CPU -
> > writes will not be committed to the memory if read value of
> > "data_tail" doesn't specify enough free space in the ring buffer.
> >
> > Likewise, on consumer side, we can make use of natural data dependency and
> > memory ordering guarantee for single CPU and try to replace "smp_mb" by
> > a more light-weight "smp_rmb":
> >
> > READ ->data_tail READ ->data_head
> > // ... smp_rmb() (C)
> > WRITE $data READ $data
> > smp_wmb() (B) smp_rmb() (D)
> > READ $header_size
> > STORE ->data_head WRITE ->data_tail = $old_data_tail +
> > $header_size
> >
> > We ensure that all $data is read before "data_tail" is written by
> > doing "READ $header_size" after all other data is read and we rely on
> > natural data dependancy between "data_tail" write and "header_size"
> > read.
>
> I'm not entirely sure I get the $header_size trickery; need to think
> more on that. But yes, I did consider the other one. However, I had
> trouble having no pairing barrier for (D).
>
> ISTR something like Alpha being able to miss the update (for a long
> while) if you don't issue the RMB.
>
> Lets add Paul and Oleg to the thread; this is getting far more 'fun'
> that it should be ;-)
>
> For completeness; below the patch as I had queued it.
> ---
> Subject: perf: Fix perf ring buffer memory ordering
> From: Peter Zijlstra <[email protected]>
> Date: Mon Oct 28 13:55:29 CET 2013
>
> The PPC64 people noticed a missing memory barrier and crufty old
> comments in the perf ring buffer code. So update all the comments and
> add the missing barrier.
>
> When the architecture implements local_t using atomic_long_t there
> will be double barriers issued; but short of introducing more
> conditional barrier primitives this is the best we can do.
>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: [email protected]
> Cc: Paul McKenney <[email protected]>
> Cc: Michael Neuling <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Reported-by: Victor Kaplansky <[email protected]>
> Tested-by: Victor Kaplansky <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> include/uapi/linux/perf_event.h | 12 +++++++-----
> kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
> 2 files changed, 33 insertions(+), 8 deletions(-)
>
> Index: linux-2.6/include/uapi/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
> /*
> * Control data for the mmap() data buffer.
> *
> - * User-space reading the @data_head value should issue an rmb(), on
> - * SMP capable platforms, after reading this value -- see
> - * perf_event_wakeup().
> + * User-space reading the @data_head value should issue an smp_rmb(),
> + * after reading this value.
> *
> * When the mapping is PROT_WRITE the @data_tail value should be
> - * written by userspace to reflect the last read data. In this case
> - * the kernel will not over-write unread data.
> + * written by userspace to reflect the last read data, after issueing
> + * an smp_mb() to separate the data read from the ->data_tail store.
> + * In this case the kernel will not over-write unread data.
> + *
> + * See perf_output_put_handle() for the data ordering.
> */
> __u64 data_head; /* head in the data section */
> __u64 data_tail; /* user-space written tail */
> Index: linux-2.6/kernel/events/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/ring_buffer.c
> +++ linux-2.6/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
> goto out;
>
> /*
> - * Publish the known good head. Rely on the full barrier implied
> - * by atomic_dec_and_test() order the rb->head read and this
> - * write.
> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + * kernel user
> + *
> + * READ ->data_tail READ ->data_head
> + * smp_rmb() (A) smp_rmb() (C)
Given that both of the kernel's subsequent operations are stores/writes,
doesn't (A) need to be smp_mb()?
Thanx, Paul
> + * WRITE $data READ $data
> + * smp_wmb() (B) smp_mb() (D)
> + * STORE ->data_head WRITE ->data_tail
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * I don't think A needs to be a full barrier because we won't in fact
> + * write data until we see the store from userspace. So we simply don't
> + * issue the data WRITE until we observe it.
> + *
> + * OTOH, D needs to be a full barrier since it separates the data READ
> + * from the tail WRITE.
> + *
> + * For B a WMB is sufficient since it separates two WRITEs, and for C
> + * an RMB is sufficient since it separates two READs.
> + *
> + * See perf_output_begin().
> */
> + smp_wmb();
> rb->user_page->data_head = head;
>
> /*
> @@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output
> * Userspace could choose to issue a mb() before updating the
> * tail pointer. So that all reads will be completed before the
> * write is issued.
> + *
> + * See perf_output_put_handle().
> */
> tail = ACCESS_ONCE(rb->user_page->data_tail);
> smp_rmb();
>
On 10/28, Peter Zijlstra wrote:
>
> Lets add Paul and Oleg to the thread; this is getting far more 'fun'
> that it should be ;-)
Heh. All I can say is that I would like to see the authoritative answer,
you know who can shed some light ;)
But to avoid confusion, the wmb() added by this patch looks "obviously
correct" to me.
> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + * kernel user
> + *
> + * READ ->data_tail READ ->data_head
> + * smp_rmb() (A) smp_rmb() (C)
> + * WRITE $data READ $data
> + * smp_wmb() (B) smp_mb() (D)
> + * STORE ->data_head WRITE ->data_tail
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * I don't think A needs to be a full barrier because we won't in fact
> + * write data until we see the store from userspace.
this matches the intuition, but ...
> So we simply don't
> + * issue the data WRITE until we observe it.
why do we need any barrier (rmb) then? How can it help to serialize with
"WRITE $data"?
(Of course there could be other reasons for this rmb(), I just can't
really understand "A pairs with D".)
And this reminds me of the memory barrier in kfifo.c which I was not
able to understand. Hmm, it has already gone away, and now I do not
understand kfifo.c at all ;) But I have found the commit, attached below.
Note that that commit added the full barrier into __kfifo_put(). And to
me it looks the same as "A" above. Following the logic above we could say
that we do not need a barrier (at least not a full one), since we won't in
fact write into the "unread" area until we see the store to ->out from
__kfifo_get()?
In short, I am confused. I _feel_ that "A" has to be a full barrier, but
I can't prove this. And let me suggest an artificial/simplified example:
	bool	BUSY;
	data_t	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();

		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// XXXXXXXX: do we really need it? I think yes.
		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}
Again, following the description above we could turn the mb() in _put()
into barrier(), but I do not think we can rely on the control dependency.
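For reference, the barrier()-only variant in question would be (a sketch of
the alternative, not a recommendation):

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		barrier();	/* compiler barrier only: relies on the
				 * control dependency from the BUSY load
				 * to order the stores below */
		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}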
Oleg.
---
commit a45bce49545739a940f8bd4ca85c3b7435564893
Author: Paul E. McKenney <[email protected]>
Date: Fri Sep 29 02:00:11 2006 -0700
[PATCH] memory ordering in __kfifo primitives
Both __kfifo_put() and __kfifo_get() have header comments stating that if
there is but one concurrent reader and one concurrent writer, locking is not
necessary. This is almost the case, but a couple of memory barriers are
needed. Another option would be to change the header comments to remove the
bit about locking not being needed, and to change the those callers who
currently don't use locking to add the required locking. The attachment
analyzes this approach, but the patch below seems simpler.
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Stelian Pop <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/kernel/kfifo.c b/kernel/kfifo.c
index 64ab045..5d1d907 100644
--- a/kernel/kfifo.c
+++ b/kernel/kfifo.c
@@ -122,6 +122,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,
len = min(len, fifo->size - fifo->in + fifo->out);
+ /*
+ * Ensure that we sample the fifo->out index -before- we
+ * start putting bytes into the kfifo.
+ */
+
+ smp_mb();
+
/* first put the data starting from fifo->in to buffer end */
l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
@@ -129,6 +136,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,
/* then put the rest (if any) at the beginning of the buffer */
memcpy(fifo->buffer, buffer + l, len - l);
+ /*
+ * Ensure that we add the bytes to the kfifo -before-
+ * we update the fifo->in index.
+ */
+
+ smp_wmb();
+
fifo->in += len;
return len;
@@ -154,6 +168,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,
len = min(len, fifo->in - fifo->out);
+ /*
+ * Ensure that we sample the fifo->in index -before- we
+ * start removing bytes from the kfifo.
+ */
+
+ smp_rmb();
+
/* first get the data from fifo->out until the end of the buffer */
l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
@@ -161,6 +182,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,
/* then get the rest (if any) from the beginning of the buffer */
memcpy(buffer + l, fifo->buffer, len - l);
+ /*
+ * Ensure that we remove the bytes from the kfifo -before-
+ * we update the fifo->out index.
+ */
+
+ smp_mb();
+
fifo->out += len;
return len;
On 10/28, Paul E. McKenney wrote:
>
> On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote:
> > --- linux-2.6.orig/kernel/events/ring_buffer.c
> > +++ linux-2.6/kernel/events/ring_buffer.c
> > @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
> > goto out;
> >
> > /*
> > - * Publish the known good head. Rely on the full barrier implied
> > - * by atomic_dec_and_test() order the rb->head read and this
> > - * write.
> > + * Since the mmap() consumer (userspace) can run on a different CPU:
> > + *
> > + * kernel user
> > + *
> > + * READ ->data_tail READ ->data_head
> > + * smp_rmb() (A) smp_rmb() (C)
>
> Given that both of the kernel's subsequent operations are stores/writes,
> doesn't (A) need to be smp_mb()?
Yes, this is my understanding^Wfeeling too, but I have to admit that
I can't really explain to myself why _exactly_ we need mb() here.
And let me copy-and-paste the artificial example from my previous
email,
	bool	BUSY;
	data_t	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();

		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// XXXXXXXX: do we really need it? I think yes.
		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}
(just in case, the code above obviously assumes that _get or _put
can't race with itself, but they can race with each other).
Could you confirm that try_to_put() actually needs the mb() between
LOAD(BUSY) and STORE(DATA)?
I am sure it actually does, but I would appreciate it if you could
explain why. IOW, how is it possible that, without the mb(), try_to_put()
can overwrite DATA before try_to_get() completes its "*data = DATA"
in this particular case?
Perhaps this can happen if, say, the reader and writer share a level of
cache or something like this...
Oleg.
Oleg Nesterov <[email protected]> wrote on 10/28/2013 10:17:35 PM:
> mb(); // XXXXXXXX: do we really need it? I think yes.
Oh, it is hard to argue with feelings. Also, it is easy to be on the
conservative side and put the barrier here just in case.
But I still insist that the barrier is redundant in your example.
-- Victor
On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> Oleg Nesterov <[email protected]> wrote on 10/28/2013 10:17:35 PM:
>
> > mb(); // XXXXXXXX: do we really need it? I think yes.
>
> Oh, it is hard to argue with feelings. Also, it is easy to be on
> conservative side and put the barrier here just in case.
I'll make it a full mb for now, and I too am curious to see the end of this
discussion explaining things ;-)
On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > Oleg Nesterov <[email protected]> wrote on 10/28/2013 10:17:35 PM:
> >
> > > mb(); // XXXXXXXX: do we really need it? I think yes.
> >
> > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > conservative side and put the barrier here just in case.
>
> I'll make it a full mb for now and too am curious to see the end of this
> discussion explaining things ;-)
That is, I've now got this queued:
---
Subject: perf: Fix perf ring buffer memory ordering
From: Peter Zijlstra <[email protected]>
Date: Mon Oct 28 13:55:29 CET 2013
The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.
When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.
Cc: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reported-by: Victor Kaplansky <[email protected]>
Tested-by: Victor Kaplansky <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
include/uapi/linux/perf_event.h | 12 +++++++-----
kernel/events/ring_buffer.c | 31 +++++++++++++++++++++++++++----
2 files changed, 34 insertions(+), 9 deletions(-)
Index: linux-2.6/include/uapi/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/perf_event.h
+++ linux-2.6/include/uapi/linux/perf_event.h
@@ -479,13 +479,15 @@ struct perf_event_mmap_page {
/*
* Control data for the mmap() data buffer.
*
- * User-space reading the @data_head value should issue an rmb(), on
- * SMP capable platforms, after reading this value -- see
- * perf_event_wakeup().
+ * User-space reading the @data_head value should issue an smp_rmb(),
+ * after reading this value.
*
* When the mapping is PROT_WRITE the @data_tail value should be
- * written by userspace to reflect the last read data. In this case
- * the kernel will not over-write unread data.
+ * written by userspace to reflect the last read data, after issueing
+ * an smp_mb() to separate the data read from the ->data_tail store.
+ * In this case the kernel will not over-write unread data.
+ *
+ * See perf_output_put_handle() for the data ordering.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
Index: linux-2.6/kernel/events/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/events/ring_buffer.c
+++ linux-2.6/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
goto out;
/*
- * Publish the known good head. Rely on the full barrier implied
- * by atomic_dec_and_test() order the rb->head read and this
- * write.
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ * kernel user
+ *
+ * READ ->data_tail READ ->data_head
+ * smp_mb() (A) smp_rmb() (C)
+ * WRITE $data READ $data
+ * smp_wmb() (B) smp_mb() (D)
+ * STORE ->data_head WRITE ->data_tail
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * I don't think A needs to be a full barrier because we won't in fact
+ * write data until we see the store from userspace. So we simply don't
+ * issue the data WRITE until we observe it. Be conservative for now.
+ *
+ * OTOH, D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for C
+ * an RMB is sufficient since it separates two READs.
+ *
+ * See perf_output_begin().
*/
+ smp_wmb();
rb->user_page->data_head = head;
/*
@@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
* Userspace could choose to issue a mb() before updating the
* tail pointer. So that all reads will be completed before the
* write is issued.
+ *
+ * See perf_output_put_handle().
*/
tail = ACCESS_ONCE(rb->user_page->data_tail);
- smp_rmb();
+ smp_mb();
offset = head = local_read(&rb->head);
head += size;
if (unlikely(!perf_output_space(rb, tail, offset, head)))
On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote:
> @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
> * Userspace could choose to issue a mb() before updating the
> * tail pointer. So that all reads will be completed before the
> * write is issued.
> + *
> + * See perf_output_put_handle().
> */
> tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> + smp_mb();
> offset = head = local_read(&rb->head);
> head += size;
> if (unlikely(!perf_output_space(rb, tail, offset, head)))
That said; it would be very nice to be able to remove this barrier. This
is in every event write path :/
Commit-ID: bf378d341e4873ed928dc3c636252e6895a21f50
Gitweb: http://git.kernel.org/tip/bf378d341e4873ed928dc3c636252e6895a21f50
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 28 Oct 2013 13:55:29 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Oct 2013 12:01:19 +0100
perf: Fix perf ring buffer memory ordering
The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.
When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.
Reported-by: Victor Kaplansky <[email protected]>
Tested-by: Victor Kaplansky <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Mathieu Desnoyers <[email protected]>
Cc: [email protected]
Cc: Paul McKenney <[email protected]>
Cc: Michael Neuling <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/perf_event.h | 12 +++++++-----
kernel/events/ring_buffer.c | 31 +++++++++++++++++++++++++++----
2 files changed, 34 insertions(+), 9 deletions(-)
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 009a655..2fc1602 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -456,13 +456,15 @@ struct perf_event_mmap_page {
/*
* Control data for the mmap() data buffer.
*
- * User-space reading the @data_head value should issue an rmb(), on
- * SMP capable platforms, after reading this value -- see
- * perf_event_wakeup().
+ * User-space reading the @data_head value should issue an smp_rmb(),
+ * after reading this value.
*
* When the mapping is PROT_WRITE the @data_tail value should be
- * written by userspace to reflect the last read data. In this case
- * the kernel will not over-write unread data.
+ * written by userspace to reflect the last read data, after issueing
+ * an smp_mb() to separate the data read from the ->data_tail store.
+ * In this case the kernel will not over-write unread data.
+ *
+ * See perf_output_put_handle() for the data ordering.
*/
__u64 data_head; /* head in the data section */
__u64 data_tail; /* user-space written tail */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..9c2ddfb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ again:
goto out;
/*
- * Publish the known good head. Rely on the full barrier implied
- * by atomic_dec_and_test() order the rb->head read and this
- * write.
+ * Since the mmap() consumer (userspace) can run on a different CPU:
+ *
+ * kernel user
+ *
+ * READ ->data_tail READ ->data_head
+ * smp_mb() (A) smp_rmb() (C)
+ * WRITE $data READ $data
+ * smp_wmb() (B) smp_mb() (D)
+ * STORE ->data_head WRITE ->data_tail
+ *
+ * Where A pairs with D, and B pairs with C.
+ *
+ * I don't think A needs to be a full barrier because we won't in fact
+ * write data until we see the store from userspace. So we simply don't
+ * issue the data WRITE until we observe it. Be conservative for now.
+ *
+ * OTOH, D needs to be a full barrier since it separates the data READ
+ * from the tail WRITE.
+ *
+ * For B a WMB is sufficient since it separates two WRITEs, and for C
+ * an RMB is sufficient since it separates two READs.
+ *
+ * See perf_output_begin().
*/
+ smp_wmb();
rb->user_page->data_head = head;
/*
@@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output_handle *handle,
* Userspace could choose to issue a mb() before updating the
* tail pointer. So that all reads will be completed before the
* write is issued.
+ *
+ * See perf_output_put_handle().
*/
tail = ACCESS_ONCE(rb->user_page->data_tail);
- smp_rmb();
+ smp_mb();
offset = head = local_read(&rb->head);
head += size;
if (unlikely(!perf_output_space(rb, tail, offset, head)))
On Tue, 29 Oct 2013, Peter Zijlstra wrote:
> On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
> /*
> * Control data for the mmap() data buffer.
> *
> - * User-space reading the @data_head value should issue an rmb(), on
> - * SMP capable platforms, after reading this value -- see
> - * perf_event_wakeup().
> + * User-space reading the @data_head value should issue an smp_rmb(),
> + * after reading this value.
So where's the patch fixing perf to use the new recommendations?
Is this purely a performance thing or a correctness change?
A change like this is a bit of a pain, especially as userspace doesn't really
have nice access to smp_mb() defines, so a lot of cut-and-pasting will
ensue for everyone who's trying to parse the mmap buffer.
Vince
On 10/29, Peter Zijlstra wrote:
>
> On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote:
> > @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
> > * Userspace could choose to issue a mb() before updating the
> > * tail pointer. So that all reads will be completed before the
> > * write is issued.
> > + *
> > + * See perf_output_put_handle().
> > */
> > tail = ACCESS_ONCE(rb->user_page->data_tail);
> > - smp_rmb();
> > + smp_mb();
> > offset = head = local_read(&rb->head);
> > head += size;
> > if (unlikely(!perf_output_space(rb, tail, offset, head)))
>
> That said; it would be very nice to be able to remove this barrier. This
> is in every event write path :/
Yes... And I'm very much afraid that I simply confused you. Perhaps Victor
is right and we do not need this mb(). So I am waiting for the end of
this story too.
And btw I do not understand why we need it (or smp_rmb) right after
ACCESS_ONCE(data_tail).
Oleg.
Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > > Oleg Nesterov <[email protected]> wrote on 10/28/2013 10:17:35 PM:
> > >
> > > > mb(); // XXXXXXXX: do we really need it? I think yes.
> > >
> > > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > > conservative side and put the barrier here just in case.
> >
> > I'll make it a full mb for now and too am curious to see the end of this
> > discussion explaining things ;-)
>
> That is, I've now got this queued:
Can we also CC [email protected]? This has been around for a while.
Mikey
>
> ---
> Subject: perf: Fix perf ring buffer memory ordering
> From: Peter Zijlstra <[email protected]>
> Date: Mon Oct 28 13:55:29 CET 2013
>
> The PPC64 people noticed a missing memory barrier and crufty old
> comments in the perf ring buffer code. So update all the comments and
> add the missing barrier.
>
> When the architecture implements local_t using atomic_long_t there
> will be double barriers issued; but short of introducing more
> conditional barrier primitives this is the best we can do.
>
> Cc: Mathieu Desnoyers <[email protected]>
> Cc: [email protected]
> Cc: Paul McKenney <[email protected]>
> Cc: Michael Neuling <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Reported-by: Victor Kaplansky <[email protected]>
> Tested-by: Victor Kaplansky <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> include/uapi/linux/perf_event.h | 12 +++++++-----
> kernel/events/ring_buffer.c | 31 +++++++++++++++++++++++++++----
> 2 files changed, 34 insertions(+), 9 deletions(-)
>
> Index: linux-2.6/include/uapi/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
> /*
> * Control data for the mmap() data buffer.
> *
> - * User-space reading the @data_head value should issue an rmb(), on
> - * SMP capable platforms, after reading this value -- see
> - * perf_event_wakeup().
> + * User-space reading the @data_head value should issue an smp_rmb(),
> + * after reading this value.
> *
> * When the mapping is PROT_WRITE the @data_tail value should be
> - * written by userspace to reflect the last read data. In this case
> - * the kernel will not over-write unread data.
> + * written by userspace to reflect the last read data, after issueing
> + * an smp_mb() to separate the data read from the ->data_tail store.
> + * In this case the kernel will not over-write unread data.
> + *
> + * See perf_output_put_handle() for the data ordering.
> */
> __u64 data_head; /* head in the data section */
> __u64 data_tail; /* user-space written tail */
> Index: linux-2.6/kernel/events/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/ring_buffer.c
> +++ linux-2.6/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
> goto out;
>
> /*
> - * Publish the known good head. Rely on the full barrier implied
> - * by atomic_dec_and_test() order the rb->head read and this
> - * write.
> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + * kernel user
> + *
> + * READ ->data_tail READ ->data_head
> + * smp_mb() (A) smp_rmb() (C)
> + * WRITE $data READ $data
> + * smp_wmb() (B) smp_mb() (D)
> + * STORE ->data_head WRITE ->data_tail
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * I don't think A needs to be a full barrier because we won't in fact
> + * write data until we see the store from userspace. So we simply don't
> + * issue the data WRITE until we observe it. Be conservative for now.
> + *
> + * OTOH, D needs to be a full barrier since it separates the data READ
> + * from the tail WRITE.
> + *
> + * For B a WMB is sufficient since it separates two WRITEs, and for C
> + * an RMB is sufficient since it separates two READs.
> + *
> + * See perf_output_begin().
> */
> + smp_wmb();
> rb->user_page->data_head = head;
>
> /*
> @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
> * Userspace could choose to issue a mb() before updating the
> * tail pointer. So that all reads will be completed before the
> * write is issued.
> + *
> + * See perf_output_put_handle().
> */
> tail = ACCESS_ONCE(rb->user_page->data_tail);
> - smp_rmb();
> + smp_mb();
> offset = head = local_read(&rb->head);
> head += size;
> if (unlikely(!perf_output_space(rb, tail, offset, head)))
>
On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> Oleg Nesterov <[email protected]> wrote on 10/28/2013 10:17:35 PM:
>
> > mb(); // XXXXXXXX: do we really need it? I think yes.
>
> Oh, it is hard to argue with feelings. Also, it is easy to be on
> conservative side and put the barrier here just in case.
> But I still insist that the barrier is redundant in your example.
If you were to back up that insistence with a description of the orderings
you are relying on, why other orderings are not important, and how the
important orderings are enforced, I might be tempted to pay attention
to your opinion.
Thanx, Paul
On Tue, Oct 29, 2013 at 03:27:05PM -0400, Vince Weaver wrote:
> On Tue, 29 Oct 2013, Peter Zijlstra wrote:
>
> > On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> > --- linux-2.6.orig/include/uapi/linux/perf_event.h
> > +++ linux-2.6/include/uapi/linux/perf_event.h
> > @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
> > /*
> > * Control data for the mmap() data buffer.
> > *
> > - * User-space reading the @data_head value should issue an rmb(), on
> > - * SMP capable platforms, after reading this value -- see
> > - * perf_event_wakeup().
> > + * User-space reading the @data_head value should issue an smp_rmb(),
> > + * after reading this value.
>
> so where's the patch fixing perf to use the new recommendations?
Fair enough, thanks for reminding me about that. See below.
> Is this purely a performance thing or a correctness change?
Correctness, although I suppose on most archs you'd be hard pushed to
notice.
> A change like this a bit of a pain, especially as userspace doesn't really
> have nice access to smb_mb() defines so a lot of cut-and-pasting will
> ensue for everyone who's trying to parse the mmap buffer.
Agreed; we should maybe push for a user visible asm/barrier.h or so.
---
Subject: perf, tool: Add required memory barriers
To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
ordering") change userspace to also adhere to the ordering outlined.
Most barrier implementations were gleaned from
arch/*/include/asm/barrier.h and with the exception of metag I'm fairly
sure they're correct.
Cc: James Hogan <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
tools/perf/perf.h | 39 +++++++++++++++++++++++++++++++++++++--
tools/perf/util/evlist.h | 2 +-
2 files changed, 38 insertions(+), 3 deletions(-)
diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index f61c230beec4..1b8a0a2a63d4 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -4,6 +4,8 @@
#include <asm/unistd.h>
#if defined(__i386__)
+#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
+#define wmb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
#define rmb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
#define cpu_relax() asm volatile("rep; nop" ::: "memory");
#define CPUINFO_PROC "model name"
@@ -13,6 +15,8 @@
#endif
#if defined(__x86_64__)
+#define mb() asm volatile("mfence" ::: "memory")
+#define wmb() asm volatile("sfence" ::: "memory")
#define rmb() asm volatile("lfence" ::: "memory")
#define cpu_relax() asm volatile("rep; nop" ::: "memory");
#define CPUINFO_PROC "model name"
@@ -23,20 +27,28 @@
#ifdef __powerpc__
#include "../../arch/powerpc/include/uapi/asm/unistd.h"
+#define mb() asm volatile ("sync" ::: "memory")
+#define wmb() asm volatile ("sync" ::: "memory")
#define rmb() asm volatile ("sync" ::: "memory")
#define cpu_relax() asm volatile ("" ::: "memory");
#define CPUINFO_PROC "cpu"
#endif
#ifdef __s390__
+#define mb() asm volatile("bcr 15,0" ::: "memory")
+#define wmb() asm volatile("bcr 15,0" ::: "memory")
#define rmb() asm volatile("bcr 15,0" ::: "memory")
#define cpu_relax() asm volatile("" ::: "memory");
#endif
#ifdef __sh__
#if defined(__SH4A__) || defined(__SH5__)
+# define mb() asm volatile("synco" ::: "memory")
+# define wmb() asm volatile("synco" ::: "memory")
# define rmb() asm volatile("synco" ::: "memory")
#else
+# define mb() asm volatile("" ::: "memory")
+# define wmb() asm volatile("" ::: "memory")
# define rmb() asm volatile("" ::: "memory")
#endif
#define cpu_relax() asm volatile("" ::: "memory")
@@ -44,24 +56,38 @@
#endif
#ifdef __hppa__
+#define mb() asm volatile("" ::: "memory")
+#define wmb() asm volatile("" ::: "memory")
#define rmb() asm volatile("" ::: "memory")
#define cpu_relax() asm volatile("" ::: "memory");
#define CPUINFO_PROC "cpu"
#endif
#ifdef __sparc__
+#ifdef __LP64__
+#define mb() asm volatile("ba,pt %%xcc, 1f\n" \
+ "membar #StoreLoad\n" \
+ "1:\n"":::"memory")
+#else
+#define mb() asm volatile("":::"memory")
+#endif
+#define wmb() asm volatile("":::"memory")
#define rmb() asm volatile("":::"memory")
#define cpu_relax() asm volatile("":::"memory")
#define CPUINFO_PROC "cpu"
#endif
#ifdef __alpha__
+#define mb() asm volatile("mb" ::: "memory")
+#define wmb() asm volatile("wmb" ::: "memory")
#define rmb() asm volatile("mb" ::: "memory")
#define cpu_relax() asm volatile("" ::: "memory")
#define CPUINFO_PROC "cpu model"
#endif
#ifdef __ia64__
+#define mb() asm volatile ("mf" ::: "memory")
+#define wmb() asm volatile ("mf" ::: "memory")
#define rmb() asm volatile ("mf" ::: "memory")
#define cpu_relax() asm volatile ("hint @pause" ::: "memory")
#define CPUINFO_PROC "model name"
@@ -72,35 +98,44 @@
* Use the __kuser_memory_barrier helper in the CPU helper page. See
* arch/arm/kernel/entry-armv.S in the kernel source for details.
*/
+#define mb() ((void(*)(void))0xffff0fa0)()
+#define wmb() ((void(*)(void))0xffff0fa0)()
#define rmb() ((void(*)(void))0xffff0fa0)()
#define cpu_relax() asm volatile("":::"memory")
#define CPUINFO_PROC "Processor"
#endif
#ifdef __aarch64__
-#define rmb() asm volatile("dmb ld" ::: "memory")
+#define mb() asm volatile("dmb ish" ::: "memory")
+#define wmb() asm volatile("dmb ishld" ::: "memory")
+#define rmb() asm volatile("dmb ishst" ::: "memory")
#define cpu_relax() asm volatile("yield" ::: "memory")
#endif
#ifdef __mips__
-#define rmb() asm volatile( \
+#define mb() asm volatile( \
".set mips2\n\t" \
"sync\n\t" \
".set mips0" \
: /* no output */ \
: /* no input */ \
: "memory")
+#define wmb() mb()
+#define rmb() mb()
#define cpu_relax() asm volatile("" ::: "memory")
#define CPUINFO_PROC "cpu model"
#endif
#ifdef __arc__
+#define mb() asm volatile("" ::: "memory")
+#define wmb() asm volatile("" ::: "memory")
#define rmb() asm volatile("" ::: "memory")
#define cpu_relax() rmb()
#define CPUINFO_PROC "Processor"
#endif
#ifdef __metag__
+/* XXX no clue */
#define rmb() asm volatile("" ::: "memory")
#define cpu_relax() asm volatile("" ::: "memory")
#define CPUINFO_PROC "CPU"
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 6e8acc9abe38..8ab1b5ae4a0e 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -189,7 +189,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md,
/*
* ensure all reads are done before we write the tail out.
*/
- /* mb(); */
+ mb();
pc->data_tail = tail;
}
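For tool writers following along, the consumer side that these barriers pair
with looks roughly like the sketch below; it ignores events that wrap the end
of the data area, and process_event() is a hypothetical callback, so treat it
as an illustration of the barrier placement only:

#include <linux/perf_event.h>

extern void process_event(struct perf_event_header *hdr);	/* hypothetical */

static void read_ring(struct perf_event_mmap_page *pc, unsigned char *data,
		      __u64 mask)
{
	__u64 tail = pc->data_tail;
	__u64 head = pc->data_head;

	rmb();			/* C: read ->data_head before the event data */

	while (tail != head) {
		struct perf_event_header *hdr = (void *)&data[tail & mask];

		process_event(hdr);
		tail += hdr->size;
	}

	mb();			/* D: finish all data reads before freeing the space */
	pc->data_tail = tail;
}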
On Wed, Oct 30, 2013 at 02:27:25AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > Oleg Nesterov <[email protected]> wrote on 10/28/2013 10:17:35 PM:
> >
> > > mb(); // XXXXXXXX: do we really need it? I think yes.
> >
> > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > conservative side and put the barrier here just in case.
> > But I still insist that the barrier is redundant in your example.
>
> If you were to back up that insistence with a description of the orderings
> you are relying on, why other orderings are not important, and how the
> important orderings are enforced, I might be tempted to pay attention
> to your opinion.
OK, so let me try.. a slightly less convoluted version of the code in
kernel/events/ring_buffer.c coupled with a userspace consumer would look
something like the below.
One important detail is that the kbuf part and kbuf_write() are
strictly per cpu and we can thus rely on implicit ordering for those.
Only the userspace consumer can possibly run on another cpu, and thus we
need to ensure data consistency for those.
struct buffer {
u64 size;
u64 tail;
u64 head;
void *data;
};
struct buffer *kbuf, *ubuf;
/*
* Determine there's space in the buffer to store data at @offset to
* @head without overwriting data at @tail.
*/
bool space(u64 tail, u64 offset, u64 head)
{
offset = (offset - tail) % kbuf->size;
head = (head - tail) % kbuf->size;
return (s64)(head - offset) >= 0;
}
/*
* If there's space in the buffer; store the data @buf; otherwise
* discard it.
*/
void kbuf_write(int sz, void *buf)
{
u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
u64 offset = kbuf->head; /* we already know where we last wrote */
u64 head = offset + sz;
if (!space(tail, offset, head)) {
/* discard @buf */
return;
}
/*
* Ensure that if we see the userspace tail (ubuf->tail) such
* that there is space to write @buf without overwriting data
* userspace hasn't seen yet, we won't in fact store data before
* that read completes.
*/
smp_mb(); /* A, matches with D */
write(kbuf->data + offset, buf, sz);
kbuf->head = head % kbuf->size;
/*
* Ensure that we write all the @buf data before we update the
* userspace visible ubuf->head pointer.
*/
smp_wmb(); /* B, matches with C */
ubuf->head = kbuf->head;
}
/*
* Consume the buffer data and update the tail pointer to indicate to
* kernel space there's 'free' space.
*/
void ubuf_read(void)
{
u64 head, tail;
tail = ACCESS_ONCE(ubuf->tail);
head = ACCESS_ONCE(ubuf->head);
/*
* Ensure we read the buffer boundaries before the actual buffer
* data...
*/
smp_rmb(); /* C, matches with B */
while (tail != head) {
obj = ubuf->data + tail;
/* process obj */
tail += obj->size;
tail %= ubuf->size;
}
/*
* Ensure all data reads are complete before we issue the
* ubuf->tail update; once that update hits, kbuf_write() can
* observe and overwrite data.
*/
smp_mb(); /* D, matches with A */
ubuf->tail = tail;
}
Now the whole crux of the question is if we need barrier A at all, since
the STORES issued by the @buf writes are dependent on the ubuf->tail
read.
If the read shows no available space, we simply will not issue those
writes -- therefore we could argue we can avoid the memory barrier.
However, that leaves D unpaired and me confused. We must have D because
otherwise the CPU could reorder that write ahead of the previous reads and
the kernel could start overwriting data we're still reading.. which
seems like a bad deal.
Also, I'm not entirely sure on C, that too seems like a dependency, we
simply cannot read the buffer @tail before we've read the tail itself,
now can we? Similarly we cannot compare tail to head without having the
head read completed.
Could we replace A and C with an smp_read_barrier_depends()?
Hi Peter,
On 30/10/13 10:42, Peter Zijlstra wrote:
> Subject: perf, tool: Add required memory barriers
>
> To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
> ordering") change userspace to also adhere to the ordering outlined.
>
> Most barrier implementations were gleaned from
> arch/*/include/asm/barrier.h and with the exception of metag I'm fairly
> sure they're correct.
Yeh...
Short answer:
For Meta you're probably best off assuming
CONFIG_METAG_SMP_WRITE_REORDERING=n and just using compiler barriers.
Long answer:
The issue with write reordering between Meta's hardware threads beyond
the cache is only with a particular SoC, and SMP is not used in
production on it.
It is possible to make the LINSYSEVENT_WR_COMBINE_FLUSH register
writable to userspace (it's in a non-mappable region already) but even
then the write to that register needs odd placement to be effective
(before the shmem write rather than after - which isn't a place any
existing barriers are guaranteed to be placed). I'm fairly confident we
get away with it in the kernel, and userland normally just uses linked
load/store instructions for atomicity which works fine.
Cheers
James
On Wed, Oct 30, 2013 at 11:48:44AM +0000, James Hogan wrote:
> Hi Peter,
>
> On 30/10/13 10:42, Peter Zijlstra wrote:
> > Subject: perf, tool: Add required memory barriers
> >
> > To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
> > ordering") change userspace to also adhere to the ordering outlined.
> >
> > Most barrier implementations were gleaned from
> > arch/*/include/asm/barrier.h and with the exception of metag I'm fairly
> > sure they're correct.
>
> Yeh...
>
> Short answer:
> For Meta you're probably best off assuming
> CONFIG_METAG_SMP_WRITE_REORDERING=n and just using compiler barriers.
Thanks, fixed it that way.
> Long answer:
> The issue with write reordering between Meta's hardware threads beyond
> the cache is only with a particular SoC, and SMP is not used in
> production on it.
> It is possible to make the LINSYSEVENT_WR_COMBINE_FLUSH register
> writable to userspace (it's in a non-mappable region already) but even
> then the write to that register needs odd placement to be effective
> (before the shmem write rather than after - which isn't a place any
> existing barriers are guaranteed to be placed). I'm fairly confident we
> get away with it in the kernel, and userland normally just uses linked
> load/store instructions for atomicity which works fine.
Urgh.. sounds like way 'fun' for you ;-)
"Paul E. McKenney" <[email protected]> wrote on 10/30/2013
11:27:25 AM:
> If you were to back up that insistence with a description of the
orderings
> you are relying on, why other orderings are not important, and how the
> important orderings are enforced, I might be tempted to pay attention
> to your opinion.
>
> Thanx, Paul
NP, though, I feel too embarrassed to explain things about memory barriers
when one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
Disclaimer: it is anyway impossible to prove lack of *any* problem.
Having said that, lets look into an example in
Documentation/circular-buffers.txt:
> THE PRODUCER
> ------------
>
> The producer will look something like this:
>
> spin_lock(&producer_lock);
>
> unsigned long head = buffer->head;
> unsigned long tail = ACCESS_ONCE(buffer->tail);
>
> if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
> /* insert one item into the buffer */
> struct item *item = buffer[head];
>
> produce_item(item);
>
> smp_wmb(); /* commit the item before incrementing the head */
>
> buffer->head = (head + 1) & (buffer->size - 1);
>
> /* wake_up() will make sure that the head is committed before
> * waking anyone up */
> wake_up(consumer);
> }
>
> spin_unlock(&producer_lock);
We can see that the authors of the document didn't put any memory barrier
after the "buffer->tail" read and before "produce_item(item)", and I think
they have a good reason.
Let's consider an imaginary smp_mb() right before "produce_item(item);".
Such a barrier will ensure that -
- the memory read on "buffer->tail" is completed
  before the store to the memory pointed to by "item" is committed.
However, that store anyway cannot be completed before the conditional
branch implied by the "if ()" is proven to execute the body of the if(),
and the latter cannot be proven before the read of "buffer->tail" is
completed.
Let's see this another way. Let's imagine that somehow a store to the data
pointed to by "item" is completed before we read "buffer->tail". That would
mean that the store was completed speculatively. But speculative execution
of conditional stores is prohibited by the C/C++ standard, otherwise any
conditional store at any random place in the code could pollute shared
memory.
On the other hand, if the compiler or processor can prove that the condition
in the above if() is going to be true (or if the speculative store writes the
same value as was there before the write), the speculative store *is*
allowed. In this case we should not be bothered by the fact that the memory
pointed to by "item" is written before a read from "buffer->tail" is
completed.
Regards,
-- Victor
Peter Zijlstra <[email protected]> wrote on 10/30/2013 01:25:26 PM:
> Also, I'm not entirely sure on C, that too seems like a dependency, we
> simply cannot read the buffer @tail before we've read the tail itself,
> now can we? Similarly we cannot compare tail to head without having the
> head read completed.
No, this one we cannot omit, because our problem on the consumer side is not
with @tail, which is written exclusively by the consumer, but with @head.
BTW, that is why you also don't need ACCESS_ONCE() around @tail, but only
around the @head read.
-- Victor
On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <[email protected]> wrote on 10/30/2013 01:25:26 PM:
>
> > Also, I'm not entirely sure on C, that too seems like a dependency, we
> > simply cannot read the buffer @tail before we've read the tail itself,
> > now can we? Similarly we cannot compare tail to head without having the
> > head read completed.
>
> No, this one we cannot omit, because our problem on consumer side is not
> with @tail, which is written exclusively by consumer, but with @head.
Ah indeed, my argument was flawed in that @head is the important part.
But we still do a comparison of @tail against @head before we do further
reads.
Although I suppose speculative reads are allowed -- they don't have the
destructive behaviour speculative writes have -- and thus we could in
fact get reorder issues.
But since it is still a dependent load in that we do that @tail vs @head
comparison before doing other loads, wouldn't a read_barrier_depends()
be sufficient? Or do we still need a complete rmb?
> BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> around
> @head read.
Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
updating tail, so there's no problem with the value changing
unexpectedly.
On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
>
> Disclaimer: it is anyway impossible to prove lack of *any* problem.
>
> Having said that, lets look into an example in
> Documentation/circular-buffers.txt:
>
> We can see that authors of the document didn't put any memory barrier
Note that both documents have the same author list ;-)
Anyway, I didn't know about the circular thing, I suppose I should use
CIRC_SPACE() thing :-)
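(For reference, the helpers in <linux/circ_buf.h> are roughly the following;
they assume the buffer size is a power of two:)

/* Return count in buffer. */
#define CIRC_CNT(head, tail, size)	(((head) - (tail)) & ((size) - 1))

/* Return space available; one slot is kept free to tell full from empty. */
#define CIRC_SPACE(head, tail, size)	CIRC_CNT((tail), ((head) + 1), (size))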
Peter Zijlstra <[email protected]> wrote on 10/30/2013 05:39:31 PM:
> Although I suppose speculative reads are allowed -- they don't have the
> destructive behaviour speculative writes have -- and thus we could in
> fact get reorder issues.
I agree.
>
> But since it is still a dependent load in that we do that @tail vs @head
> comparison before doing other loads, wouldn't a read_barrier_depends()
> be sufficient? Or do we still need a complete rmb?
We need a complete rmb() here IMO. I think there is a fundamental difference
between loads and stores in this aspect. Loads are allowed to be hoisted by
the compiler or executed speculatively by the HW. To prevent the load
"*(ubuf->data + tail)" from being hoisted beyond the "ubuf->head" load you
would need something like this:
void
ubuf_read(void)
{
u64 head, tail;
tail = ubuf->tail;
head = ACCESS_ONCE(ubuf->head);
/*
* Ensure we read the buffer boundaries before the actual buffer
* data...
*/
while (tail != head) {
smp_read_barrier_depends(); /* for Alpha */
obj = *(ubuf->data + head - 128);
/* process obj */
tail += obj->size;
tail %= ubuf->size;
}
/*
* Ensure all data reads are complete before we issue the
* ubuf->tail update; once that update hits, kbuf_write() can
* observe and overwrite data.
*/
smp_mb(); /* D, matches with A */
ubuf->tail = tail;
}
(note that "head" is part of address calculation of obj load now).
But, even in this demo example some "smp_read_barrier_depends()" before
"obj = *(ubuf->data + head - 128);" is required for architectures
like Alpha. Though, on more sane architectures "smp_read_barrier_depends()"
will be translated to nothing.
Regards,
-- Victor
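(A rough sketch of why it is nearly free: the usual definitions of that era
boil down to something like the below, with only Alpha paying for a real
barrier; the authoritative versions live in the per-arch barrier.h headers.)

#ifdef __alpha__
#define read_barrier_depends()		mb()	/* dependent loads need a real barrier */
#else
#define read_barrier_depends()		do { } while (0)	/* no-op elsewhere */
#endif

#ifdef CONFIG_SMP
#define smp_read_barrier_depends()	read_barrier_depends()
#else
#define smp_read_barrier_depends()	do { } while (0)
#endif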
On Wed, Oct 30, 2013 at 07:14:29PM +0200, Victor Kaplansky wrote:
> We need a complete rmb() here IMO. I think there is a fundamental
> difference between load and stores in this aspect. Load are allowed to
> be hoisted by compiler or executed speculatively by HW. To prevent
> load "*(ubuf->data + tail)" to be hoisted beyond "ubuf->head" load you
> would need something like this:
Indeed, we could compute and load ->data + tail the moment we've
completed the tail load but before we've completed the head load and
done the comparison.
So yes, full rmb() it is.
> void
> ubuf_read(void)
> {
> u64 head, tail;
>
> tail = ubuf->tail;
> head = ACCESS_ONCE(ubuf->head);
>
> /*
> * Ensure we read the buffer boundaries before the actual buffer
> * data...
> */
>
> while (tail != head) {
> smp_read_barrier_depends(); /* for Alpha */
> obj = *(ubuf->data + head - 128);
> /* process obj */
> tail += obj->size;
> tail %= ubuf->size;
> }
>
> /*
> * Ensure all data reads are complete before we issue the
> * ubuf->tail update; once that update hits, kbuf_write() can
> * observe and overwrite data.
> */
> smp_mb(); /* D, matches with A */
>
> ubuf->tail = tail;
> }
>
> (note that "head" is part of address calculation of obj load now).
Right, explicit dependent loads; I was hoping the conditional in between
might be enough, but as argued above it is not. The above cannot work in
our case though, we must use tail to find the obj since we have variable
size objects.
> But, even in this demo example some "smp_read_barrier_depends()" before
> "obj = *(ubuf->data + head - 100);" is required for architectures
> like Alpha. Though, on more sane architectures "smp_read_barrier_depends()"
> will be translated to nothing.
Sure.. I know all about that.
On Wed, Oct 30, 2013 at 04:51:16PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
> >
> > Disclaimer: it is anyway impossible to prove lack of *any* problem.
> >
> > Having said that, lets look into an example in
> > Documentation/circular-buffers.txt:
>
> >
> > We can see that authors of the document didn't put any memory barrier
>
> Note that both documents have the same author list ;-)
>
> Anyway, I didn't know about the circular thing, I suppose I should use
> CIRC_SPACE() thing :-)
The below removes 80 bytes from ring_buffer.o of which 50 bytes are from
perf_output_begin(), it also removes 30 lines of code, so yay!
(x86_64 build)
And it appears to still work.. although I've not stressed the no-space
bits.
---
kernel/events/ring_buffer.c | 74 ++++++++++++++-------------------------------
1 file changed, 22 insertions(+), 52 deletions(-)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 9c2ddfbf4525..e4a51fa10595 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -12,40 +12,10 @@
#include <linux/perf_event.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>
+#include <linux/circ_buf.h>
#include "internal.h"
-static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
- unsigned long offset, unsigned long head)
-{
- unsigned long sz = perf_data_size(rb);
- unsigned long mask = sz - 1;
-
- /*
- * check if user-writable
- * overwrite : over-write its own tail
- * !overwrite: buffer possibly drops events.
- */
- if (rb->overwrite)
- return true;
-
- /*
- * verify that payload is not bigger than buffer
- * otherwise masking logic may fail to detect
- * the "not enough space" condition
- */
- if ((head - offset) > sz)
- return false;
-
- offset = (offset - tail) & mask;
- head = (head - tail) & mask;
-
- if ((int)(head - offset) < 0)
- return false;
-
- return true;
-}
-
static void perf_output_wakeup(struct perf_output_handle *handle)
{
atomic_set(&handle->rb->poll, POLL_IN);
@@ -115,8 +85,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
rb->user_page->data_head = head;
/*
- * Now check if we missed an update, rely on the (compiler)
- * barrier in atomic_dec_and_test() to re-read rb->head.
+ * Now check if we missed an update.
*/
if (unlikely(head != local_read(&rb->head))) {
local_inc(&rb->nest);
@@ -135,7 +104,7 @@ int perf_output_begin(struct perf_output_handle *handle,
{
struct ring_buffer *rb;
unsigned long tail, offset, head;
- int have_lost;
+ int have_lost, page_shift;
struct perf_sample_data sample_data;
struct {
struct perf_event_header header;
@@ -161,7 +130,7 @@ int perf_output_begin(struct perf_output_handle *handle,
goto out;
have_lost = local_read(&rb->lost);
- if (have_lost) {
+ if (unlikely(have_lost)) {
lost_event.header.size = sizeof(lost_event);
perf_event_header__init_id(&lost_event.header, &sample_data,
event);
@@ -171,32 +140,33 @@ int perf_output_begin(struct perf_output_handle *handle,
perf_output_get_handle(handle);
do {
- /*
- * Userspace could choose to issue a mb() before updating the
- * tail pointer. So that all reads will be completed before the
- * write is issued.
- *
- * See perf_output_put_handle().
- */
tail = ACCESS_ONCE(rb->user_page->data_tail);
- smp_mb();
offset = head = local_read(&rb->head);
- head += size;
- if (unlikely(!perf_output_space(rb, tail, offset, head)))
+ if (!rb->overwrite &&
+ unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
goto fail;
+ head += size;
} while (local_cmpxchg(&rb->head, offset, head) != offset);
+ /*
+ * Userspace SHOULD issue an MB before writing the tail; see
+ * perf_output_put_handle().
+ */
+ smp_mb();
+
if (head - local_read(&rb->wakeup) > rb->watermark)
local_add(rb->watermark, &rb->wakeup);
- handle->page = offset >> (PAGE_SHIFT + page_order(rb));
- handle->page &= rb->nr_pages - 1;
- handle->size = offset & ((PAGE_SIZE << page_order(rb)) - 1);
- handle->addr = rb->data_pages[handle->page];
- handle->addr += handle->size;
- handle->size = (PAGE_SIZE << page_order(rb)) - handle->size;
+ page_shift = PAGE_SHIFT + page_order(rb);
+
+ handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
+
+ offset &= page_shift - 1;
+
+ handle->addr = rb->data_pages[handle->page] + offset;
+ handle->size = (1 << page_shift) - offset;
- if (have_lost) {
+ if (unlikely(have_lost)) {
lost_event.header.type = PERF_RECORD_LOST;
lost_event.header.misc = 0;
lost_event.id = event->id;
On Wed, Oct 30, 2013 at 07:29:30PM +0100, Peter Zijlstra wrote:
> + page_shift = PAGE_SHIFT + page_order(rb);
> +
> + handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
> +
> + offset &= page_shift - 1;
offset &= (1UL << page_shift) - 1;
Weird that it even appeared to work.. /me wonders if he even booted the
right kernel.
> +
> + handle->addr = rb->data_pages[handle->page] + offset;
> + handle->size = (1 << page_shift) - offset;
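(To make the bug concrete: with PAGE_SHIFT = 12 and page_order(rb) = 0,
page_shift - 1 is 11 (binary 1011), so the buggy mask keeps a near-random
subset of offset bits, while (1UL << 12) - 1 = 0xfff keeps the whole in-page
offset. A throwaway userspace check of the arithmetic:)

#include <stdio.h>

int main(void)
{
	unsigned long page_shift = 12, offset = 0x1234;

	printf("buggy:   %#lx\n", offset & (page_shift - 1));		/* prints 0     */
	printf("correct: %#lx\n", offset & ((1UL << page_shift) - 1));	/* prints 0x234 */
	return 0;
}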
On Wed, Oct 30, 2013 at 04:51:16PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
> >
> > Disclaimer: it is anyway impossible to prove lack of *any* problem.
> >
> > Having said that, lets look into an example in
> > Documentation/circular-buffers.txt:
>
> >
> > We can see that authors of the document didn't put any memory barrier
>
> Note that both documents have the same author list ;-)
>
> Anyway, I didn't know about the circular thing, I suppose I should use
> CIRC_SPACE() thing :-)
Interesting that we didn't seem to supply a definition... ;-)
Thanx, Paul
On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <[email protected]> wrote on 10/30/2013
> 11:27:25 AM:
>
> > If you were to back up that insistence with a description of the
> orderings
> > you are relying on, why other orderings are not important, and how the
> > important orderings are enforced, I might be tempted to pay attention
> > to your opinion.
> >
> > Thanx, Paul
>
> NP, though, I feel too embarrassed to explain things about memory barriers
> when one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
>
> Disclaimer: it is anyway impossible to prove lack of *any* problem.
If you want to play the "omit memory barriers" game, then proving a
negative is in fact the task before you.
> Having said that, lets look into an example in
> Documentation/circular-buffers.txt:
And the correctness of this code has been called into question. :-(
An embarrassingly long time ago -- I need to get this either proven
or fixed.
> > THE PRODUCER
> > ------------
> >
> > The producer will look something like this:
> >
> > spin_lock(&producer_lock);
> >
> > unsigned long head = buffer->head;
> > unsigned long tail = ACCESS_ONCE(buffer->tail);
> >
> > if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
> > /* insert one item into the buffer */
> > struct item *item = buffer[head];
> >
> > produce_item(item);
> >
> > smp_wmb(); /* commit the item before incrementing the head
> */
> >
> > buffer->head = (head + 1) & (buffer->size - 1);
> >
> > /* wake_up() will make sure that the head is committed
> before
> > * waking anyone up */
> > wake_up(consumer);
> > }
> >
> > spin_unlock(&producer_lock);
>
> We can see that authors of the document didn't put any memory barrier
> after "buffer->tail" read and before "produce_item(item)" and I think they
> have a good reason.
>
> Lets consider an imaginary smp_mb() right before "produce_item(item);".
> Such a barrier will ensure that -
>
> - the memory read on "buffer->tail" is completed
> before store to memory pointed by "item" is committed.
>
> However, the store to "buffer->tail" anyway cannot be completed before
> conditional branch implied by "if ()" is proven to execute body statement
> of the if(). And the latter cannot be proven before read of "buffer->tail"
> is completed.
>
> Lets see this other way. Lets imagine that somehow a store to the data
> pointed by "item" is completed before we read "buffer->tail". That would
> mean, that the store was completed speculatively. But speculative execution
> of conditional stores is prohibited by C/C++ standard, otherwise any
> conditional store at any random place of code could pollute shared memory.
Before C/C++11, the closest thing to such a prohibition is use of
volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to
use atomics to get anything resembling this prohibition.
If you just use normal variables, the compiler is within its rights
to transform something like the following:
if (a)
b = 1;
else
b = 42;
Into:
b = 42;
if (a)
b = 1;
Many other similar transformations are permitted. Some are used to allow
vector instructions to be used -- the compiler can do a write with an
overly wide vector instruction, then clean up the clobbered variables
later, if it wishes. Again, if the variables are not marked volatile,
or, in C/C++11, atomic.
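(For reference, the kernel's ACCESS_ONCE() of this era is roughly the
volatile cast below; marking the accesses this way is what rules out the
invented store in the example above, since volatile accesses may not be
omitted, merged, or fabricated by the compiler:)

#define ACCESS_ONCE(x)	(*(volatile typeof(x) *)&(x))

	/* The earlier example, now immune to store invention: */
	if (a)
		ACCESS_ONCE(b) = 1;
	else
		ACCESS_ONCE(b) = 42;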
> On the other hand, if compiler or processor can prove that condition in
> above if() is going to be true (or if speculative store writes the same
> value as it was before write), the speculative store *is* allowed. In this
> case we should not be bothered by the fact that memory pointed by "item"
> is written before a read from "buffer->tail" is completed.
The compilers don't always know as much as they might about the underlying
hardware's memory model. Of course, if this code is architecture specific,
it can avoid DEC Alpha's fun and games, which could also violate your
assumptions in the above paragraph:
http://www.openvms.compaq.com/wizard/wiz_2637.html
Anyway, proving or fixing the code in Documentation/circular-buffers.txt
has been on my list for too long, so I will take a closer look at it.
Thanx, Paul
On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> Before C/C++11, the closest thing to such a prohibition is use of
> volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to
> use atomics to get anything resembing this prohibition.
>
> If you just use normal variables, the compiler is within its rights
> to transform something like the following:
>
> if (a)
> b = 1;
> else
> b = 42;
>
> Into:
>
> b = 42;
> if (a)
> b = 1;
>
> Many other similar transformations are permitted. Some are used to all
> vector instructions to be used -- the compiler can do a write with an
> overly wide vector instruction, then clean up the clobbered variables
> later, if it wishes. Again, if the variables are not marked volatile,
> or, in C/C++11, atomic.
While I've heard you tell this story before, my mind keeps boggling how
we've been able to use shared memory at all, all these years.
It seems to me stuff should have broken left, right and center if
compilers were really aggressive about this.
"Paul E. McKenney" <[email protected]> wrote on 10/31/2013
06:32:58 AM:
> If you want to play the "omit memory barriers" game, then proving a
> negative is in fact the task before you.
Generally it is not fair. Otherwise, anyone could put an smp_mb() at a
random place, and expect others to "prove" that it is not needed.
It is also not fair because it should be virtually impossible to prove the
lack of any problem. OTOH, if a problem exists, it should be easy for
proponents of a memory barrier to build a test case or design a scenario
demonstrating the problem.
Actually, advocates of the memory barrier in our case do have an argument --
the rule of thumb saying that barriers should be paired. I consider this
rule only as a general recommendation to look into potentially risky
places.
And indeed, in our case if the store to the circular buffer wasn't
conditional, it would require a memory barrier to prevent the store from
being performed before the read of @tail. But in our case the store is
conditional, so no memory barrier is required.
> And the correctness of this code has been called into question. :-(
> An embarrassingly long time ago -- I need to get this either proven
> or fixed.
I agree.
> Before C/C++11, the closest thing to such a prohibition is use of
> volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to
> use atomics to get anything resembing this prohibition.
>
> If you just use normal variables, the compiler is within its rights
> to transform something like the following:
>
> if (a)
> b = 1;
> else
> b = 42;
>
> Into:
>
> b = 42;
> if (a)
> b = 1;
>
> Many other similar transformations are permitted. Some are used to all
> vector instructions to be used -- the compiler can do a write with an
> overly wide vector instruction, then clean up the clobbered variables
> later, if it wishes. Again, if the variables are not marked volatile,
> or, in C/C++11, atomic.
All this can justify only a compiler barrier(), which is almost free from a
performance point of view, since current gcc anyway doesn't perform the
store hoisting optimization in our case.
(And I'm not getting into a philosophical discussion of whether kernel code
should consider possible future bugs/features in gcc or the C/C++11
standard.)
> The compilers don't always know as much as they might about the underlying
> hardware's memory model.
That's correct in general. But can you point out a problem that really
exists?
> Of course, if this code is architecture specific,
> it can avoid DEC Alpha's fun and games, which could also violate your
> assumptions in the above paragraph:
>
> http://www.openvms.compaq.com/wizard/wiz_2637.html
Are you talking about this paragraph from the above link:
"For instance, your producer must issue a "memory barrier" instruction
after writing the data to shared memory and before inserting it on
the queue; likewise, your consumer must issue a memory barrier
instruction after removing an item from the queue and before reading
from its memory. Otherwise, you risk seeing stale data, since, while
the Alpha processor does provide coherent memory, it does not provide
implicit ordering of reads and writes. (That is, the write of the
producer's data might reach memory after the write of the queue, such
that the consumer might read the new item from the queue but get the
previous values from the item's memory."
If yes, I don't think it explains the need of memory barrier on Alpha
in our case (we all agree about the need of smp_wmb() right before @head
update by producer). If not, could you please point to specific paragraph?
>
> Anyway, proving or fixing the code in Documentation/circular-buffers.txt
> has been on my list for too long, so I will take a closer look at it.
Thanks!
I'm concerned more about performance overhead imposed by the full memory
barrier in kfifo circular buffers. Even if it is needed on Alpha (I don't
understand why) we could try to solve this with some memory barrier which
is effective only on architectures which really need it.
Regards,
-- Victor
> "For instance, your producer must issue a "memory barrier" instruction
> after writing the data to shared memory and before inserting it on
> the queue; likewise, your consumer must issue a memory barrier
> instruction after removing an item from the queue and before reading
> from its memory. Otherwise, you risk seeing stale data, since, while
> the Alpha processor does provide coherent memory, it does not provide
> implicit ordering of reads and writes. (That is, the write of the
> producer's data might reach memory after the write of the queue, such
> that the consumer might read the new item from the queue but get the
> previous values from the item's memory."
>
> If yes, I don't think it explains the need of memory barrier on Alpha
> in our case (we all agree about the need of smp_wmb() right before @head
> update by producer). If not, could you please point to specific paragraph?
My understanding is that the extra read barrier the alpha needs isn't to
control the order the cpu performs the memory cycles in, but rather to
wait while the cache system performs all outstanding operations.
So even though the wmb() in the writer ensures the writes are correctly
ordered, the reader can read the old value from the second location from
its local cache.
David
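(The canonical illustration, paraphrased from Documentation/memory-barriers.txt
and the Alpha wizard article: even a data-dependent load on the consumer side
can return stale data unless it is ordered after the pointer load.)

	/* CPU 1 (producer) */
	p->data = 42;
	smp_wmb();			/* order the data before publishing the pointer */
	ACCESS_ONCE(global_ptr) = p;

	/* CPU 2 (consumer) */
	q = ACCESS_ONCE(global_ptr);
	smp_read_barrier_depends();	/* a no-op everywhere except Alpha */
	d = q->data;			/* without the barrier, may still see pre-42 data */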
"David Laight" <[email protected]> wrote on 10/31/2013 02:28:56 PM:
> So even though the wmb() in the writer ensures the writes are correctly
> ordered, the reader can read the old value from the second location from
> its local cache.
In the case of the circular buffer, the only thing that the producer reads
is @tail, and nothing wrong will happen if the producer reads an old value
of @tail. Moreover, adherents of smp_mb() insert it *after* the read of
@tail, so it cannot prevent reading an old value anyway.
-- Victor
On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> > Before C/C++11, the closest thing to such a prohibition is use of
> > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to
> > use atomics to get anything resembing this prohibition.
> >
> > If you just use normal variables, the compiler is within its rights
> > to transform something like the following:
> >
> > if (a)
> > b = 1;
> > else
> > b = 42;
> >
> > Into:
> >
> > b = 42;
> > if (a)
> > b = 1;
> >
> > Many other similar transformations are permitted. Some are used to all
> > vector instructions to be used -- the compiler can do a write with an
> > overly wide vector instruction, then clean up the clobbered variables
> > later, if it wishes. Again, if the variables are not marked volatile,
> > or, in C/C++11, atomic.
>
> While I've heard you tell this story before, my mind keeps boggling how
> we've been able to use shared memory at all, all these years.
>
> It seems to me stuff should have broken left, right and center if
> compilers were really aggressive about this.
Sometimes having stupid compilers is a good thing. But they really are
getting more aggressive.
Thanx, Paul
On Thu, Oct 31, 2013 at 08:07:56AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> > > Before C/C++11, the closest thing to such a prohibition is use of
> > > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to
> > > use atomics to get anything resembing this prohibition.
> > >
> > > If you just use normal variables, the compiler is within its rights
> > > to transform something like the following:
> > >
> > > if (a)
> > > b = 1;
> > > else
> > > b = 42;
> > >
> > > Into:
> > >
> > > b = 42;
> > > if (a)
> > > b = 1;
> > >
> > > Many other similar transformations are permitted. Some are used to all
> > > vector instructions to be used -- the compiler can do a write with an
> > > overly wide vector instruction, then clean up the clobbered variables
> > > later, if it wishes. Again, if the variables are not marked volatile,
> > > or, in C/C++11, atomic.
> >
> > While I've heard you tell this story before, my mind keeps boggling how
> > we've been able to use shared memory at all, all these years.
> >
> > It seems to me stuff should have broken left, right and center if
> > compilers were really aggressive about this.
>
> Sometimes having stupid compilers is a good thing. But they really are
> getting more aggressive.
But surely we cannot go mark all data structures lodged in shared memory
as volatile, that's insane.
I'm sure you're quite worried about this as well. Suppose we have:
struct foo {
unsigned long value;
void *ptr;
unsigned long value1;
};
And our ptr member is RCU managed. Then while the assignment using
rcu_assign_pointer() will use the volatile cast, what stops the compiler
from wrecking ptr while writing either of the value* members and
'fixing' her up after?
This is a completely untenable position.
How do the C/C++ people propose to deal with this?