2009-03-28 19:51:46

by Peter Zijlstra

Subject: [PATCH 2/9] perf_counter: fix update_userpage()

It just occurred to me that it is possible to have multiple contending
updates of the userpage (mmap information vs overflow vs counter).
This would break the seqlock logic.

It appears the arch code uses this from NMI context, so we cannot
possibly serialize its use; therefore, separate the data_head update
from it and let it return to its original use.

The arch code needs to make sure there are no contending callers by
disabling the counter before using it -- powerpc appears to do this
nicely.

Signed-off-by: Peter Zijlstra <[email protected]>
---
 include/linux/perf_counter.h |   35 +++++++++++++++++++++++++++++++++++
 kernel/perf_counter.c        |   38 +++++++++++++++++++++++---------------
2 files changed, 58 insertions(+), 15 deletions(-)

Index: linux-2.6/include/linux/perf_counter.h
===================================================================
--- linux-2.6.orig/include/linux/perf_counter.h
+++ linux-2.6/include/linux/perf_counter.h
@@ -160,10 +160,45 @@ struct perf_counter_hw_event {
 struct perf_counter_mmap_page {
 	__u32	version;		/* version number of this structure */
 	__u32	compat_version;		/* lowest version this is compat with */
+
+	/*
+	 * Bits needed to read the hw counters in user-space.
+	 *
+	 * The index and offset should be read atomically using the seqlock:
+	 *
+	 *   __u32 seq, index;
+	 *   __s64 offset;
+	 *
+	 * again:
+	 *   rmb();
+	 *   seq = pc->lock;
+	 *
+	 *   if (unlikely(seq & 1)) {
+	 *           cpu_relax();
+	 *           goto again;
+	 *   }
+	 *
+	 *   index = pc->index;
+	 *   offset = pc->offset;
+	 *
+	 *   rmb();
+	 *   if (pc->lock != seq)
+	 *           goto again;
+	 *
+	 * After this, index contains architecture specific counter index + 1,
+	 * so that 0 means unavailable, offset contains the value to be added
+	 * to the result of the raw timer read to obtain this counter's value.
+	 */
 	__u32	lock;			/* seqlock for synchronization */
 	__u32	index;			/* hardware counter identifier */
 	__s64	offset;			/* add to hardware counter value */
 
+	/*
+	 * Control data for the mmap() data buffer.
+	 *
+	 * User-space reading this value should issue an rmb(), on SMP capable
+	 * platforms, after reading this value -- see perf_counter_wakeup().
+	 */
 	__u32	data_head;		/* head in the data section */
 };

Index: linux-2.6/kernel/perf_counter.c
===================================================================
--- linux-2.6.orig/kernel/perf_counter.c
+++ linux-2.6/kernel/perf_counter.c
@@ -1316,10 +1316,22 @@ static long perf_ioctl(struct file *file
 	return err;
 }
 
-static void __perf_counter_update_userpage(struct perf_counter *counter,
-					   struct perf_mmap_data *data)
+/*
+ * Callers need to ensure there can be no nesting of this function, otherwise
+ * the seqlock logic goes bad. We can not serialize this because the arch
+ * code calls this from NMI context.
+ */
+void perf_counter_update_userpage(struct perf_counter *counter)
 {
-	struct perf_counter_mmap_page *userpg = data->user_page;
+	struct perf_mmap_data *data;
+	struct perf_counter_mmap_page *userpg;
+
+	rcu_read_lock();
+	data = rcu_dereference(counter->data);
+	if (!data)
+		goto unlock;
+
+	userpg = data->user_page;
 
 	/*
 	 * Disable preemption so as to not let the corresponding user-space
@@ -1333,20 +1345,10 @@ static void __perf_counter_update_userpa
 	if (counter->state == PERF_COUNTER_STATE_ACTIVE)
 		userpg->offset -= atomic64_read(&counter->hw.prev_count);
 
-	userpg->data_head = atomic_read(&data->head);
 	smp_wmb();
 	++userpg->lock;
 	preempt_enable();
-}
-
-void perf_counter_update_userpage(struct perf_counter *counter)
-{
-	struct perf_mmap_data *data;
-
-	rcu_read_lock();
-	data = rcu_dereference(counter->data);
-	if (data)
-		__perf_counter_update_userpage(counter, data);
+unlock:
 	rcu_read_unlock();
 }
 
@@ -1547,7 +1549,13 @@ void perf_counter_wakeup(struct perf_cou
 	data = rcu_dereference(counter->data);
 	if (data) {
 		(void)atomic_xchg(&data->wakeup, POLL_IN);
-		__perf_counter_update_userpage(counter, data);
+		/*
+		 * Ensure all data writes are issued before updating the
+		 * user-space data head information. The matching rmb()
+		 * will be in userspace after reading this value.
+		 */
+		smp_wmb();
+		data->user_page->data_head = atomic_read(&data->head);
 	}
 	rcu_read_unlock();


--

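The seqlock read loop documented in the new mmap-page comment can be turned into a small self-contained C sketch (illustrative only: the `mmap_page` struct here mirrors just the three fields the comment uses, and rmb() is reduced to a compiler barrier since this demonstration is single-threaded):

```c
#include <stdint.h>

/* Illustrative mirror of the perf_counter_mmap_page fields used below. */
struct mmap_page {
        uint32_t lock;          /* seqlock; odd means an update is in progress */
        uint32_t index;         /* hw counter index + 1; 0 means unavailable */
        int64_t  offset;        /* value to add to the raw counter read */
};

/* Compiler barrier; real user-space code would use rmb() on SMP. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Read index and offset consistently, retrying if the kernel updated
 * the page in between: the loop from the perf_counter.h comment.
 */
static void read_index_offset(volatile struct mmap_page *pc,
                              uint32_t *index, int64_t *offset)
{
        uint32_t seq;

again:
        barrier();
        seq = pc->lock;

        if (seq & 1)            /* writer in progress, spin */
                goto again;

        *index = pc->index;
        *offset = pc->offset;

        barrier();
        if (pc->lock != seq)    /* kernel updated the page; retry */
                goto again;
}
```

With an even `lock` value the loop completes on the first pass; a kernel update in the window between the two `lock` reads would force a retry.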

2009-03-29 00:25:42

by Paul Mackerras

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

Peter Zijlstra writes:

> It just occurred to me that it is possible to have multiple contending
> updates of the userpage (mmap information vs overflow vs counter).
> This would break the seqlock logic.
>
> It appears the arch code uses this from NMI context, so we cannot
> possibly serialize its use; therefore, separate the data_head update
> from it and let it return to its original use.

That sounds reasonable, and thanks for putting in a big comment.

Acked-by: Paul Mackerras <[email protected]>

> --- linux-2.6.orig/include/linux/perf_counter.h
> +++ linux-2.6/include/linux/perf_counter.h
> @@ -160,10 +160,45 @@ struct perf_counter_hw_event {
> struct perf_counter_mmap_page {
> __u32 version; /* version number of this structure */
> __u32 compat_version; /* lowest version this is compat with */
> +
> + /*
> + * Bits needed to read the hw counters in user-space.
> + *
> + * The index and offset should be read atomically using the seqlock:
> + *
> + *   __u32 seq, index;
> + *   __s64 offset;
> + *
> + * again:
> + *   rmb();
> + *   seq = pc->lock;
> + *
> + *   if (unlikely(seq & 1)) {
> + *           cpu_relax();
> + *           goto again;
> + *   }
> + *
> + *   index = pc->index;
> + *   offset = pc->offset;
> + *
> + *   rmb();
> + *   if (pc->lock != seq)
> + *           goto again;
> + *
> + * After this, index contains architecture specific counter index + 1,
> + * so that 0 means unavailable, offset contains the value to be added
> + * to the result of the raw timer read to obtain this counter's value.
> + */
> __u32 lock; /* seqlock for synchronization */
> __u32 index; /* hardware counter identifier */
> __s64 offset; /* add to hardware counter value */

I think we can simplify this (in a follow-on patch).

It has occurred to me that we don't need to do all this on the
userspace side, because we are necessarily reading and writing these
fields on the same CPU. If the reader and writer were on different
CPUs, that would make no sense since they would be accessing different
hardware counter registers.

That means that we don't need any CPU memory barriers on either side.
All the kernel needs to do is to increment `lock' when it updates
things, and the user side can be:

do {
        seq = pc->lock;
        index = pc->index;
        offset = pc->offset;
        barrier();
} while (pc->lock != seq);

and all that's needed is a compiler barrier to stop the compiler from
optimizing too much.

Paul.

2009-04-02 08:50:20

by Peter Zijlstra

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

On Sun, 2009-03-29 at 11:24 +1100, Paul Mackerras wrote:

> > --- linux-2.6.orig/include/linux/perf_counter.h
> > +++ linux-2.6/include/linux/perf_counter.h
> > @@ -160,10 +160,45 @@ struct perf_counter_hw_event {
> > struct perf_counter_mmap_page {
> > __u32 version; /* version number of this structure */
> > __u32 compat_version; /* lowest version this is compat with */
> > +
> > + /*
> > + * Bits needed to read the hw counters in user-space.
> > + *
> > + * The index and offset should be read atomically using the seqlock:
> > + *
> > + *   __u32 seq, index;
> > + *   __s64 offset;
> > + *
> > + * again:
> > + *   rmb();
> > + *   seq = pc->lock;
> > + *
> > + *   if (unlikely(seq & 1)) {
> > + *           cpu_relax();
> > + *           goto again;
> > + *   }
> > + *
> > + *   index = pc->index;
> > + *   offset = pc->offset;
> > + *
> > + *   rmb();
> > + *   if (pc->lock != seq)
> > + *           goto again;
> > + *
> > + * After this, index contains architecture specific counter index + 1,
> > + * so that 0 means unavailable, offset contains the value to be added
> > + * to the result of the raw timer read to obtain this counter's value.
> > + */
> > __u32 lock; /* seqlock for synchronization */
> > __u32 index; /* hardware counter identifier */
> > __s64 offset; /* add to hardware counter value */
>
> I think we can simplify this (in a follow-on patch).
>
> It has occurred to me that we don't need to do all this on the
> userspace side, because we are necessarily reading and writing these
> fields on the same CPU. If the reader and writer were on different
> CPUs, that would make no sense since they would be accessing different
> hardware counter registers.
>
> That means that we don't need any CPU memory barriers on either side.
> All the kernel needs to do is to increment `lock' when it updates
> things, and the user side can be:
>
> do {
>         seq = pc->lock;
>         index = pc->index;
>         offset = pc->offset;
>         barrier();
> } while (pc->lock != seq);
>
> and all that's needed is a compiler barrier to stop the compiler from
> optimizing too much.

Can this work at all?

I mean, user-space could get preempted/rescheduled after we read the
mmap() data using that seqlock and before we actually did the read-pmc
bit.

In that case, the counter can have changed underneath us and we're
reading rubbish.

2009-04-02 09:00:41

by Peter Zijlstra

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

On Thu, 2009-04-02 at 10:50 +0200, Peter Zijlstra wrote:
> On Sun, 2009-03-29 at 11:24 +1100, Paul Mackerras wrote:
>
> > > --- linux-2.6.orig/include/linux/perf_counter.h
> > > +++ linux-2.6/include/linux/perf_counter.h
> > > @@ -160,10 +160,45 @@ struct perf_counter_hw_event {
> > > struct perf_counter_mmap_page {
> > > __u32 version; /* version number of this structure */
> > > __u32 compat_version; /* lowest version this is compat with */
> > > +
> > > + /*
> > > + * Bits needed to read the hw counters in user-space.
> > > + *
> > > + * The index and offset should be read atomically using the seqlock:
> > > + *
> > > + *   __u32 seq, index;
> > > + *   __s64 offset;
> > > + *
> > > + * again:
> > > + *   rmb();
> > > + *   seq = pc->lock;
> > > + *
> > > + *   if (unlikely(seq & 1)) {
> > > + *           cpu_relax();
> > > + *           goto again;
> > > + *   }
> > > + *
> > > + *   index = pc->index;
> > > + *   offset = pc->offset;
> > > + *
> > > + *   rmb();
> > > + *   if (pc->lock != seq)
> > > + *           goto again;
> > > + *
> > > + * After this, index contains architecture specific counter index + 1,
> > > + * so that 0 means unavailable, offset contains the value to be added
> > > + * to the result of the raw timer read to obtain this counter's value.
> > > + */
> > > __u32 lock; /* seqlock for synchronization */
> > > __u32 index; /* hardware counter identifier */
> > > __s64 offset; /* add to hardware counter value */
> >
> > I think we can simplify this (in a follow-on patch).
> >
> > It has occurred to me that we don't need to do all this on the
> > userspace side, because we are necessarily reading and writing these
> > fields on the same CPU. If the reader and writer were on different
> > CPUs, that would make no sense since they would be accessing different
> > hardware counter registers.
> >
> > That means that we don't need any CPU memory barriers on either side.
> > All the kernel needs to do is to increment `lock' when it updates
> > things, and the user side can be:
> >
> > do {
> >         seq = pc->lock;
> >         index = pc->index;
> >         offset = pc->offset;
> >         barrier();
> > } while (pc->lock != seq);
> >
> > and all that's needed is a compiler barrier to stop the compiler from
> > optimizing too much.
>
> Can this work at all?
>
> I mean, user-space could get preempted/rescheduled after we read the
> mmap() data using that seqlock and before we actually did the read-pmc
> bit.
>
> In that case, the counter can have changed underneath us and we're
> reading rubbish.

The below might work:

u32 seq;
s64 count;

again:
        seq = pc->lock;

        if (unlikely(seq & 1)) {
                cpu_relax();
                goto again;
        }

        count = pmc_read(pc->index);
        count += pc->offset;

        barrier();
        if (pc->lock != seq)
                goto again;



2009-04-02 09:22:12

by Paul Mackerras

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

Peter Zijlstra writes:

> The below might work:
>
> u32 seq;
> s64 count;
>
> again:
>         seq = pc->lock;
>
>         if (unlikely(seq & 1)) {

I don't believe we can ever see this condition, since pc->lock is
updated in the kernel either at interrupt level on the cpu this task
is running on, or in the kernel in the context of this task. So this
userspace code can never run in the middle of the kernel updating
things.

Paul.

2009-04-02 09:22:30

by Paul Mackerras

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

Peter Zijlstra writes:

> > That means that we don't need any CPU memory barriers on either side.
> > All the kernel needs to do is to increment `lock' when it updates
> > things, and the user side can be:
> >
> > do {
> >         seq = pc->lock;
> >         index = pc->index;
> >         offset = pc->offset;
> >         barrier();
> > } while (pc->lock != seq);
> >
> > and all that's needed is a compiler barrier to stop the compiler from
> > optimizing too much.
>
> Can this work at all?
>
> I mean, user-space could get preempted/rescheduled after we read the
> mmap() data using that seqlock and before we actually did the read-pmc
> bit.
>
> In that case, the counter can have changed underneath us and we're
> reading rubbish.

Good point. This should work, though:

do {
        seq = pc->lock;
        barrier();
        value = read_pmc(pc->index) + pc->offset;
        barrier();
} while (pc->lock != seq);
return value;

No?

Paul.

2009-04-02 09:28:20

by Peter Zijlstra

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

On Thu, 2009-04-02 at 20:21 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > The below might work:
> >
> > u32 seq;
> > s64 count;
> >
> > again:
> >         seq = pc->lock;
> >
> >         if (unlikely(seq & 1)) {
>
> I don't believe we can ever see this condition, since pc->lock is
> updated in the kernel either at interrupt level on the cpu this task
> is running on, or in the kernel in the context of this task. So this
> userspace code can never run in the middle of the kernel updating
> things.

Colour me paranoid ;-)

2009-04-02 09:35:37

by Peter Zijlstra

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

On Thu, 2009-04-02 at 20:15 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > > That means that we don't need any CPU memory barriers on either side.
> > > All the kernel needs to do is to increment `lock' when it updates
> > > things, and the user side can be:
> > >
> > > do {
> > >         seq = pc->lock;
> > >         index = pc->index;
> > >         offset = pc->offset;
> > >         barrier();
> > > } while (pc->lock != seq);
> > >
> > > and all that's needed is a compiler barrier to stop the compiler from
> > > optimizing too much.
> >
> > Can this work at all?
> >
> > I mean, user-space could get preempted/rescheduled after we read the
> > mmap() data using that seqlock and before we actually did the read-pmc
> > bit.
> >
> > In that case, the counter can have changed underneath us and we're
> > reading rubbish.
>
> Good point. This should work, though:
>
> do {
>         seq = pc->lock;
>         barrier();
>         value = read_pmc(pc->index) + pc->offset;
>         barrier();
> } while (pc->lock != seq);
> return value;

I don't think you need the first barrier(); all you need to avoid is the
compiler reusing the first pc->lock read, so one should suffice.

Also, you need to handle the !pc->index case.

But yeah.

2009-04-02 09:59:29

by Paul Mackerras

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

Peter Zijlstra writes:

> > Good point. This should work, though:
> >
> > do {
> >         seq = pc->lock;
> >         barrier();
> >         value = read_pmc(pc->index) + pc->offset;
> >         barrier();
> > } while (pc->lock != seq);
> > return value;
>
> I don't think you need the first barrier(); all you need to avoid is
> the compiler reusing the first pc->lock read, so one should suffice.

I need it to make sure that the compiler doesn't put the load of
pc->index or pc->offset before the first load of pc->lock. The second
barrier is needed to make sure the compiler puts the second load of
pc->lock after the loads of pc->index and pc->offset. So I think I do
need two barrier()s (but only compiler barriers, not CPU memory
barriers).

> Also, you need to handle the !pc->index case.

Hmmm, yeah. I claim that read_pmc(0) always returns 0. :)

Paul.

2009-04-02 10:35:45

by Peter Zijlstra

Subject: Re: [PATCH 2/9] perf_counter: fix update_userpage()

On Thu, 2009-04-02 at 20:58 +1100, Paul Mackerras wrote:
> Peter Zijlstra writes:
>
> > > Good point. This should work, though:
> > >
> > > do {
> > >         seq = pc->lock;
> > >         barrier();
> > >         value = read_pmc(pc->index) + pc->offset;
> > >         barrier();
> > > } while (pc->lock != seq);
> > > return value;
> >
> > I don't think you need the first barrier(); all you need to avoid is
> > the compiler reusing the first pc->lock read, so one should suffice.
>
> I need it to make sure that the compiler doesn't put the load of
> pc->index or pc->offset before the first load of pc->lock. The second
> barrier is needed to make sure the compiler puts the second load of
> pc->lock after the loads of pc->index and pc->offset. So I think I do
> need two barrier()s (but only compiler barriers, not CPU memory
> barriers).

Ah, you're right indeed.

> > Also, you need to handle the !pc->index case.
>
> Hmmm, yeah. I claim that read_pmc(0) always returns 0. :)

Hehe :-)

Ok, updated that patch.
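Pulling the thread's conclusions together, the final user-space read sequence can be sketched as a self-contained C program (illustrative only: read_pmc() is stubbed out here, where real code would use rdpmc on x86 or mfspr on powerpc, and the struct layout is an assumption, not the eventual ABI):

```c
#include <stdint.h>

/* Assumed mirror of the mmap'ed control page fields used below. */
struct mmap_page {
        uint32_t lock;          /* incremented around every kernel update */
        uint32_t index;         /* hw counter index + 1; 0 means unavailable */
        int64_t  offset;        /* value to add to the raw counter read */
};

/* Compiler-only barrier; per the discussion, reader and writer share a CPU. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Stand-in for the architecture counter read. Per the thread, read_pmc(0)
 * is defined to return 0, which handles the !pc->index case.
 */
static int64_t read_pmc(uint32_t index)
{
        if (!index)
                return 0;
        return 1000 * (int64_t)index;   /* fake counter value for the sketch */
}

/*
 * Paul's final loop: both barriers are needed so the compiler keeps the
 * index/offset loads between the two loads of pc->lock. If the task is
 * rescheduled mid-loop, the kernel bumps pc->lock and the re-check forces
 * a retry with the new index and offset.
 */
static int64_t read_counter(volatile struct mmap_page *pc)
{
        uint32_t seq;
        int64_t value;

        do {
                seq = pc->lock;
                barrier();
                value = read_pmc(pc->index) + pc->offset;
                barrier();
        } while (pc->lock != seq);

        return value;
}
```

Note the two barrier()s are compiler barriers only; no CPU memory barriers are needed on the user side because the counter can only meaningfully be read on the CPU that the kernel updates it from.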