In perf_output_put_handle(), an IRQ/NMI can happen at the location below and
write records to the same ring buffer:
...
local_dec_and_test(&rb->nest)
... <-- an IRQ/NMI can happen here
rb->user_page->data_head = head;
...
In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ, where A > B. As a result,
data_head temporarily decreases from A to B, and a reader may see
data_head < data_tail if it reads the buffer frequently enough, which
causes unexpected behavior.
This can be fixed by moving the decrement of rb->nest to after updating
data_head, which prevents an IRQ/NMI arriving at that point from updating
data_head itself.
Signed-off-by: Yabin Cui <[email protected]>
---
kernel/events/ring_buffer.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 674b35383491..0b9aefe13b04 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -54,8 +54,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
* IRQ/NMI can happen here, which means we can miss a head update.
*/
- if (!local_dec_and_test(&rb->nest))
+ if (local_read(&rb->nest) > 1) {
+ local_dec(&rb->nest);
goto out;
+ }
/*
* Since the mmap() consumer (userspace) can run on a different CPU:
@@ -86,6 +88,13 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
smp_wmb(); /* B, matches C */
rb->user_page->data_head = head;
+ /*
+ * Clear rb->nest after updating data_head. This prevents IRQ/NMI from
+ * updating data_head before us. If that happens, we will expose a
+ * temporarily decreased data_head.
+ */
+ local_set(&rb->nest, 0);
+
/*
* Now check if we missed an update -- rely on previous implied
* compiler barriers to force a re-read.
--
2.21.0.1020.gf2820cf01a-goog
Yabin Cui <[email protected]> writes:
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 674b35383491..0b9aefe13b04 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -54,8 +54,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> * IRQ/NMI can happen here, which means we can miss a head update.
> */
>
> - if (!local_dec_and_test(&rb->nest))
> + if (local_read(&rb->nest) > 1) {
> + local_dec(&rb->nest);
What stops rb->nest changing between local_read() and local_dec()?
Regards,
--
Alex
On Wed, May 15, 2019 at 09:51:07AM +0300, Alexander Shishkin wrote:
> Yabin Cui <[email protected]> writes:
>
> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index 674b35383491..0b9aefe13b04 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -54,8 +54,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> > * IRQ/NMI can happen here, which means we can miss a head update.
> > */
> >
> > - if (!local_dec_and_test(&rb->nest))
> > + if (local_read(&rb->nest) > 1) {
> > + local_dec(&rb->nest);
>
> What stops rb->nest changing between local_read() and local_dec()?
Nothing, however it must remain the same :-)
That is the cryptic way of saying that since these buffers are strictly
per-cpu, the only change can come from interrupts, and they must have a
net 0 change. Or rather, an equal amount of decrements to increments.
So if it changes, it must also change back to where it was.
Peter Zijlstra <[email protected]> writes:
> On Wed, May 15, 2019 at 09:51:07AM +0300, Alexander Shishkin wrote:
>> Yabin Cui <[email protected]> writes:
>>
>> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> > index 674b35383491..0b9aefe13b04 100644
>> > --- a/kernel/events/ring_buffer.c
>> > +++ b/kernel/events/ring_buffer.c
>> > @@ -54,8 +54,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>> > * IRQ/NMI can happen here, which means we can miss a head update.
>> > */
>> >
>> > - if (!local_dec_and_test(&rb->nest))
>> > + if (local_read(&rb->nest) > 1) {
>> > + local_dec(&rb->nest);
>>
>> What stops rb->nest changing between local_read() and local_dec()?
>
> Nothing, however it must remain the same :-)
>
> That is the cryptic way of saying that since these buffers are strictly
> per-cpu, the only change can come from interrupts, and they must have a
> net 0 change. Or rather, an equal amount of decrements to increments.
>
> So if it changes, it must also change back to where it was.
Ah that's true. So the whole ->nest thing can be done with
READ_ONCE()/WRITE_ONCE() instead?
Because the use of local_dec_and_test() creates an impression that we
rely on atomicity of it, which in actuality we don't.
Regards,
--
Alex
On Tue, May 14, 2019 at 05:30:59PM -0700, Yabin Cui wrote:
> In perf_output_put_handle(), an IRQ/NMI can happen in below location and
> write records to the same ring buffer:
> ...
> local_dec_and_test(&rb->nest)
> ... <-- an IRQ/NMI can happen here
> rb->user_page->data_head = head;
> ...
>
> In this case, a value A is written to data_head in the IRQ, then a value
> B is written to data_head after the IRQ. And A > B. As a result,
> data_head is temporarily decreased from A to B. And a reader may see
> data_head < data_tail if it read the buffer frequently enough, which
> creates unexpected behaviors.
Indeed, good catch! Have you observed this, or is this patch due to code
inspection?
> This can be fixed by moving dec(&rb->nest) to after updating data_head,
> which prevents the IRQ/NMI above from updating data_head.
>
> Signed-off-by: Yabin Cui <[email protected]>
> ---
> kernel/events/ring_buffer.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 674b35383491..0b9aefe13b04 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -54,8 +54,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> * IRQ/NMI can happen here, which means we can miss a head update.
> */
>
> - if (!local_dec_and_test(&rb->nest))
> + if (local_read(&rb->nest) > 1) {
> + local_dec(&rb->nest);
> goto out;
> + }
>
> /*
> * Since the mmap() consumer (userspace) can run on a different CPU:
> @@ -86,6 +88,13 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> smp_wmb(); /* B, matches C */
> rb->user_page->data_head = head;
At the very least this needs:
barrier();
> + /*
> + * Clear rb->nest after updating data_head. This prevents IRQ/NMI from
> + * updating data_head before us. If that happens, we will expose a
> + * temporarily decreased data_head.
> + */
> + local_set(&rb->nest, 0);
Since you rely on the 'decrement' (1->0) to happen after we've written
head.
And similarly, you need:
barrier();
Because we must re-check the head after we've 'decremented'.
> /*
> * Now check if we missed an update -- rely on previous implied
> * compiler barriers to force a re-read.
A more paranoid person might've written:
WARN_ON_ONCE(local_cmpxchg(&rb->nest, 1, 0) != 1);
which would've also implied both barrier()s.
On Wed, May 15, 2019 at 01:04:16PM +0300, Alexander Shishkin wrote:
> Peter Zijlstra <[email protected]> writes:
>
> > On Wed, May 15, 2019 at 09:51:07AM +0300, Alexander Shishkin wrote:
> >> Yabin Cui <[email protected]> writes:
> >>
> >> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> >> > index 674b35383491..0b9aefe13b04 100644
> >> > --- a/kernel/events/ring_buffer.c
> >> > +++ b/kernel/events/ring_buffer.c
> >> > @@ -54,8 +54,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
> >> > * IRQ/NMI can happen here, which means we can miss a head update.
> >> > */
> >> >
> >> > - if (!local_dec_and_test(&rb->nest))
> >> > + if (local_read(&rb->nest) > 1) {
> >> > + local_dec(&rb->nest);
> >>
> >> What stops rb->nest changing between local_read() and local_dec()?
> >
> > Nothing, however it must remain the same :-)
> >
> > That is the cryptic way of saying that since these buffers are strictly
> > per-cpu, the only change can come from interrupts, and they must have a
> > net 0 change. Or rather, an equal amount of decrements to increments.
> >
> > So if it changes, it must also change back to where it was.
>
> Ah that's true. So the whole ->nest thing can be done with
> READ_ONCE()/WRITE_ONCE() instead?
> Because the use of local_dec_and_test() creates an impression that we
> rely on atomicity of it, which in actuality we don't.
Yes, I think we can get away with that. And that might be a worth-while
optimization for !x86.
In perf_output_put_handle(), an IRQ/NMI can happen at the location below and
write records to the same ring buffer:
...
local_dec_and_test(&rb->nest)
... <-- an IRQ/NMI can happen here
rb->user_page->data_head = head;
...
In this case, a value A is written to data_head in the IRQ, then a value
B is written to data_head after the IRQ, where A > B. As a result,
data_head temporarily decreases from A to B, and a reader may see
data_head < data_tail if it reads the buffer frequently enough, which
causes unexpected behavior.
This can be fixed by moving the decrement of rb->nest to after updating
data_head, which prevents an IRQ/NMI arriving at that point from updating
data_head itself.
Signed-off-by: Yabin Cui <[email protected]>
---
v1 -> v2: change rb->nest from local_t to unsigned int, and add barriers.
---
kernel/events/internal.h | 2 +-
kernel/events/ring_buffer.c | 24 ++++++++++++++++++------
2 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/kernel/events/internal.h b/kernel/events/internal.h
index 79c47076700a..0a8c003b9bcf 100644
--- a/kernel/events/internal.h
+++ b/kernel/events/internal.h
@@ -24,7 +24,7 @@ struct ring_buffer {
atomic_t poll; /* POLL_ for wakeups */
local_t head; /* write position */
- local_t nest; /* nested writers */
+ unsigned int nest; /* nested writers */
local_t events; /* event limit */
local_t wakeup; /* wakeup stamp */
local_t lost; /* nr records lost */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 674b35383491..c677beb01fb1 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -38,7 +38,8 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
struct ring_buffer *rb = handle->rb;
preempt_disable();
- local_inc(&rb->nest);
+ rb->nest++;
+ barrier();
handle->wakeup = local_read(&rb->wakeup);
}
@@ -54,8 +55,10 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
* IRQ/NMI can happen here, which means we can miss a head update.
*/
- if (!local_dec_and_test(&rb->nest))
+ if (rb->nest > 1) {
+ rb->nest--;
goto out;
+ }
/*
* Since the mmap() consumer (userspace) can run on a different CPU:
@@ -84,14 +87,23 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
* See perf_output_begin().
*/
smp_wmb(); /* B, matches C */
- rb->user_page->data_head = head;
+ WRITE_ONCE(rb->user_page->data_head, head);
+
+ /*
+ * Clear rb->nest after updating data_head. This prevents IRQ/NMI from
+ * updating data_head before us. If that happens, we will expose a
+ * temporarily decreased data_head.
+ */
+ WRITE_ONCE(rb->nest, 0);
/*
- * Now check if we missed an update -- rely on previous implied
- * compiler barriers to force a re-read.
+ * Now check if we missed an update -- use barrier() to force a
+ * re-read.
*/
+ barrier();
if (unlikely(head != local_read(&rb->head))) {
- local_inc(&rb->nest);
+ rb->nest++;
+ barrier();
goto again;
}
--
2.21.0.1020.gf2820cf01a-goog
> Indeed, good catch! Have you observed this, or is this patch due to code
> inspection?
I observed this. I was using perf_event_open instead of the perf tool to
monitor a process that continuously forked new processes, generating a lot
of mmap records, and a sample record was written while a mmap record was
being generated.
As for the missing barriers, I thought the local_xxx operations provided
implied barriers, but I found atomic_ops.txt says they do not. Thanks for
the reminder! I will add them in patch v2.
I also think it's a good idea to use READ_ONCE/WRITE_ONCE for rb->nest. Will
use it in patch v2.
On Thu, May 16, 2019 at 11:40:10AM -0700, Yabin Cui wrote:
> In perf_output_put_handle(), an IRQ/NMI can happen in below location and
> write records to the same ring buffer:
> ...
> local_dec_and_test(&rb->nest)
> ... <-- an IRQ/NMI can happen here
> rb->user_page->data_head = head;
> ...
>
> In this case, a value A is written to data_head in the IRQ, then a value
> B is written to data_head after the IRQ. And A > B. As a result,
> data_head is temporarily decreased from A to B. And a reader may see
> data_head < data_tail if it read the buffer frequently enough, which
> creates unexpected behaviors.
>
> This can be fixed by moving dec(&rb->nest) to after updating data_head,
> which prevents the IRQ/NMI above from updating data_head.
>
> Signed-off-by: Yabin Cui <[email protected]>
> ---
>
> v1 -> v2: change rb->nest from local_t to unsigned int, and add barriers.
>
> ---
> kernel/events/internal.h | 2 +-
> kernel/events/ring_buffer.c | 24 ++++++++++++++++++------
> 2 files changed, 19 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/events/internal.h b/kernel/events/internal.h
> index 79c47076700a..0a8c003b9bcf 100644
> --- a/kernel/events/internal.h
> +++ b/kernel/events/internal.h
> @@ -24,7 +24,7 @@ struct ring_buffer {
> atomic_t poll; /* POLL_ for wakeups */
>
> local_t head; /* write position */
> - local_t nest; /* nested writers */
> + unsigned int nest; /* nested writers */
> local_t events; /* event limit */
> local_t wakeup; /* wakeup stamp */
> local_t lost; /* nr records lost */
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 674b35383491..c677beb01fb1 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -38,7 +38,8 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
> struct ring_buffer *rb = handle->rb;
>
> preempt_disable();
> - local_inc(&rb->nest);
> + rb->nest++;
> + barrier();
Urgh; almost but not quite. You just lost the 'volatile' qualifier and
now the compiler can mess things up for you.
> handle->wakeup = local_read(&rb->wakeup);
> }
What I'm going to do is split this into two patches, one fixes the
problem and marked for backport, and one changing away from local_t.
> > - local_inc(&rb->nest);
> > + rb->nest++;
> > + barrier();
> Urgh; almost but not quite. You just lost the 'volatile' qualifier and
> now the compiler can mess things up for you.
I thought the added barriers would force the compiler to forget what it knows
about rb->nest and perform the write as told. I'd appreciate it if you could
tell me more about it. In any case, it's a safer choice to always use
WRITE_ONCE/READ_ONCE for rb->nest.
> What I'm going to do is split this into two patches, one fixes the
> problem and marked for backport, and one changing away from local_t.
I read the split patches. They totally LGTM. Thanks for all your help
and rapid reply! I appreciate it :)