Hello
the changeset 2e939d1d perf: Limit event refresh to sampling event
changes the behavior of
ioctl( , PERF_EVENT_IOC_REFRESH, )
before the changeset, you could have a counter group where only one of the
subevents was sampling. PERF_EVENT_IOC_REFRESH would correctly enable
sampling for only that subevent.
With the changeset applied, this fails with EINVALID. Now you can only
sample in a group leader.
Was this an intended change? It breaks various of our PAPI regression
tests that worked fine before the change was committed.
Vince
[email protected]
On Mon, 2011-05-23 at 16:04 -0400, Vince Weaver wrote:
> Hello
>
> the changeset 2e939d1d perf: Limit event refresh to sampling event
>
> changes the behavior of
> ioctl( , PERF_EVENT_IOC_REFRESH, )
>
> before the changeset, you could have a counter group where only one of the
> subevents was sampling. PERF_EVENT_IOC_REFRESH would correctly enable
> sampling for only that subevent.
But how? it adds to the event_limit of the event you call the ioctl()
on, not on any of the group siblings.
> With the changeset applied, this fails with EINVALID. Now you can only
> sample in a group leader.
I'm not quite seeing how group-leaders are relevant here, you can only
call this ioctl() on sampling events, regardless of if they're they
group leader or a sibling.
> Was this an intended change? It breaks various of our PAPI regression
> tests that worked fine before the change was committed.
I'm not quite seeing what's wrong yet, the change seemed simple enough
in that calling that ioctl() on an event that wasn't capable of
generating samples doesn't make sense, since it will increase the event
limit for the event you call it on.
On Mon, 23 May 2011, Peter Zijlstra wrote:
> > before the changeset, you could have a counter group where only one of the
> > subevents was sampling. PERF_EVENT_IOC_REFRESH would correctly enable
> > sampling for only that subevent.
>
> But how? it adds to the event_limit of the event you call the ioctl()
> on, not on any of the group siblings.
the old behavior did.
> > With the changeset applied, this fails with EINVALID. Now you can only
> > sample in a group leader.
>
> I'm not quite seeing how group-leaders are relevant here, you can only
> call this ioctl() on sampling events, regardless of if they're they
> group leader or a sibling.
previous behavior was that if you called it on a group leader that wasn't
sampling, a sibling event that was sampling would be activated.
> > Was this an intended change? It breaks various of our PAPI regression
> > tests that worked fine before the change was committed.
>
> I'm not quite seeing what's wrong yet, the change seemed simple enough
> in that calling that ioctl() on an event that wasn't capable of
> generating samples doesn't make sense, since it will increase the event
> limit for the event you call it on.
attached is some code that will return a sampled count on older kernels
but gives EINVAL on current kernels.
The old behavior might have been unintentional, but PAPI unfortunately
depends on it (not code I originaly wrote, but code I have to maintain).
Vince
On Tue, 2011-05-24 at 02:20 -0400, Vince Weaver wrote:
> On Mon, 23 May 2011, Peter Zijlstra wrote:
>
> > > before the changeset, you could have a counter group where only one of the
> > > subevents was sampling. PERF_EVENT_IOC_REFRESH would correctly enable
> > > sampling for only that subevent.
> >
> > But how? it adds to the event_limit of the event you call the ioctl()
> > on, not on any of the group siblings.
>
> the old behavior did.
>
> > > With the changeset applied, this fails with EINVALID. Now you can only
> > > sample in a group leader.
> >
> > I'm not quite seeing how group-leaders are relevant here, you can only
> > call this ioctl() on sampling events, regardless of if they're they
> > group leader or a sibling.
>
> previous behavior was that if you called it on a group leader that wasn't
> sampling, a sibling event that was sampling would be activated.
>
> > > Was this an intended change? It breaks various of our PAPI regression
> > > tests that worked fine before the change was committed.
> >
> > I'm not quite seeing what's wrong yet, the change seemed simple enough
> > in that calling that ioctl() on an event that wasn't capable of
> > generating samples doesn't make sense, since it will increase the event
> > limit for the event you call it on.
>
> attached is some code that will return a sampled count on older kernels
> but gives EINVAL on current kernels.
>
> The old behavior might have been unintentional, but PAPI unfortunately
> depends on it (not code I originaly wrote, but code I have to maintain).
Right, so what the code does is create a group of which the leader is
disabled, which effectively has the whole group disabled. It then abuses
REFRESH,0 to enable the leader.
The code should use ENABLE (surprise, surprise!) to enable the leader
and get things going.
This is very clear abuse of the API and I'm not really inclined to fix
it, in fact, I'd be tempted to add an additional test to verify the
argument to REFRESH > 0.
Then again, I do appreciate you're having a problem there, how soon can
you push this trivial fix into all maintained PAPI branches? We could
revert this for one release and then properly shut this abuse down the
next release.
On Tue, 24 May 2011, Peter Zijlstra wrote:
> > The old behavior might have been unintentional, but PAPI unfortunately
> > depends on it (not code I originaly wrote, but code I have to maintain).
>
> Right, so what the code does is create a group of which the leader is
> disabled, which effectively has the whole group disabled. It then abuses
> REFRESH,0 to enable the leader.
yes. This is currently what PAPI does.
> The code should use ENABLE (surprise, surprise!) to enable the leader
> and get things going.
It was unclear from the documentation that enabling a group leader that
had siblings with overflow setup would start them triggering
without some sort of REFRESH first. It does seem to work though.
But then again, so did the original behavior.
> This is very clear abuse of the API and I'm not really inclined to fix
> it, in fact, I'd be tempted to add an additional test to verify the
> argument to REFRESH > 0.
Well, when there's no documentation then people write to the code.
As I said before, I didn't write this code. I just get to pick up the
pieces when it breaks.
As an aside, I also wasted 6 hours last night finding out that you don't
get signaled on overflow if you don't have a ring-buffer mmap()ed, even
if you never read from the buffer and you only are interested in counting
the number of overflows. I should stop complaining though or you'll
start telling me to fix the documentation. Which maybe I would do if I
didn't spend all my time writing code to reproduce problems in the perf
ABI for bisection purposes.
> Then again, I do appreciate you're having a problem there, how soon can
> you push this trivial fix into all maintained PAPI branches? We could
> revert this for one release and then properly shut this abuse down the
> next release.
well since you apparently aren't going to retroactively revert it for
2.6.37, 2.6.38, or 2.6.39 then it would be silly to do it just for 2.6.40.
I'll audit the PAPI code to see how widespread it is.
To warn you though, we still have people complain within hours when we
break support for 2.6.16 kernels, so it's not like the people who use
PAPI are the kind to run out and install a minor stable release. They're
going to upgrade their distro or move their binary to a newer machine and
suddenly start geting EINVAL returns on their previously working code and
then start noisily complaining.
Vince
On Tue, 2011-05-24 at 11:04 -0400, Vince Weaver wrote:
> 2.6.37, 2.6.38, or 2.6.39 then it would be silly to do it just for
> 2.6.40.
Oh, I assumed it was recent and .39/.40 would suffice.
On Tue, 2011-05-24 at 11:04 -0400, Vince Weaver wrote:
> As an aside, I also wasted 6 hours last night finding out that you don't
> get signaled on overflow if you don't have a ring-buffer mmap()ed, even
> if you never read from the buffer and you only are interested in counting
> the number of overflows.
That sounds like something we could fix, let me investigate.
> I should stop complaining though or you'll
> start telling me to fix the documentation. Which maybe I would do if I
> didn't spend all my time writing code to reproduce problems in the perf
> ABI for bisection purposes.
Both are appreciated.
On Tue, 2011-05-24 at 17:11 +0200, Peter Zijlstra wrote:
> On Tue, 2011-05-24 at 11:04 -0400, Vince Weaver wrote:
> > As an aside, I also wasted 6 hours last night finding out that you don't
> > get signaled on overflow if you don't have a ring-buffer mmap()ed, even
> > if you never read from the buffer and you only are interested in counting
> > the number of overflows.
>
> That sounds like something we could fix, let me investigate.
Does the below cure this?
---
kernel/events/core.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c09767f..bd1ba5e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5018,9 +5018,12 @@ static int __perf_event_overflow(struct perf_event *event, int nmi,
event->pending_kill = POLL_HUP;
if (nmi) {
event->pending_disable = 1;
+ event->pending_wakeup = 1;
irq_work_queue(&event->pending);
- } else
+ } else {
perf_event_disable(event);
+ perf_event_wakeup(event);
+ }
}
if (event->overflow_handler)
* Peter Zijlstra <[email protected]> wrote:
> On Tue, 2011-05-24 at 11:04 -0400, Vince Weaver wrote:
>
> > 2.6.37, 2.6.38, or 2.6.39 then it would be silly to do it just for
> > 2.6.40.
No, this commit was added in v2.6.38 so v2.6.37 should be fine.
> Oh, I assumed it was recent and .39/.40 would suffice.
Btw., how did it happen that the PAPI tests did not get run against upstream
over the course of about half a year, two full stable kernels released:
Date: Mon Mar 14 18:20:32 2011 -0700 Linux 2.6.38
Date: Wed May 18 21:06:34 2011 -0700 Linux 2.6.39
?
I'd suggest periodically running the PAPI tests on the perf development tree:
http://people.redhat.com/mingo/tip.git/README
doing that would have caught this problem 6 months ago.
The upstream policy is that regressions are generally recognized before the
next kernel gets released: i.e. in the stabilization period after -rc1, the
roughly two months until the final kernel gets released. That is the window
when we can still fix regressions relatively cheaply.
Yes, there are exceptions, but if a piece of user-space code did not get tested
with upstream over months and months then that moves into the 'fix it if we
can' category - not a regression per se.
So the upstream message is: we can only care about you if you care testing
upstream.
So if it's easy to fix we can certainly fix this bug and mark it for a -stable
backport, but this is not a regression that got reported to us in any timely
manner.
Btw., to get such assumptions tested more frequently than twice a year i'd
suggest moving these usecases into 'perf test' or so - that it gets run every
day:
$ perf test
1: vmlinux symtab matches kallsyms: FAILED!
2: detect open syscall event: Ok
3: detect open syscall event on all cpus: Ok
4: read samples using the mmap interface: Ok
Thanks,
Ingo
On Tue, 24 May 2011, Peter Zijlstra wrote:
> On Tue, 2011-05-24 at 11:04 -0400, Vince Weaver wrote:
> > 2.6.37, 2.6.38, or 2.6.39 then it would be silly to do it just for
> > 2.6.40.
>
> Oh, I assumed it was recent and .39/.40 would suffice.
It turns out the problem was only introduced in PAPI in November ( I had
assumed it dated back further). So maybe not as dire as it seemed at
first.
The fix to PAPI does seem to be trivial, though the whole set of
bad code was introduced to work around a different problem so I need
to verify the fix doesn't break other things.
Thanks,
Vince
[email protected]
On Tue, 24 May 2011, Ingo Molnar wrote:
>
> Btw., how did it happen that the PAPI tests did not get run against upstream
> over the course of about half a year, two full stable kernels released:
we run regresion tests nightly. There was a bug in our
"create two events and sample on the second"
test, where it was actualy sampling on the first counter by mistake.
When I fixed the test to do what it claimed to do it found the bug.
PAPI runs on at least 5 different operating systems, 3 different
Linux perf counter implementations, and on kernels dating back to 2.4.
Plus numerous architectures. While we try to test against recent
Linux perf_events, we just don't have the hardware to be comprehensive.
It doesn't help that our test machines are primarily used for GPGPU work
during the day, so we're limited to what kernels we can have installed
due to driver issues.
> suggest moving these usecases into 'perf test' or so - that it gets run every
> day:
you can feel free to install PAPI on your test machines and give it a run
daily too. It's open source
setenv CVSROOT :pserver:[email protected]:/cvs/homes/papi
cvs login
cvs co all
cd papi
./configure
make
./run_tests
There are often false negatives on the tests that can be a pain
to track down. Welcome to my world. We gladly accept patches.
Vince
On Tue, 24 May 2011, Peter Zijlstra wrote:
> On Tue, 2011-05-24 at 17:11 +0200, Peter Zijlstra wrote:
> > On Tue, 2011-05-24 at 11:04 -0400, Vince Weaver wrote:
> > > As an aside, I also wasted 6 hours last night finding out that you don't
> > > get signaled on overflow if you don't have a ring-buffer mmap()ed, even
> > > if you never read from the buffer and you only are interested in counting
> > > the number of overflows.
> >
> > That sounds like something we could fix, let me investigate.
>
> Does the below cure this?
>
well that was an interesting interaction between debian-unstable and
linus-git. Luckily a udev update seemed to help.
Anyway, your patch did not fix the problem. I still don't get overflow
signals if the fd doesn't have a ring-buffer mmap()ed to it.
For a test case you can take my previous test case and comment out the
mmap call. Since this test case is probably the only code in the world
trying to do this, maybe it's not that important.
I can test further patches, but my fastest build machine here at home
takes 3 hours to build a kernel so there will be some latency in the
response.
Vince
* Vince Weaver <[email protected]> wrote:
> > suggest moving these usecases into 'perf test' or so - that it
> > gets run every day:
>
> you can feel free to install PAPI on your test machines and give it
> a run daily too. It's open source
Well, i gave you a suggestion of how to prevent such regressions in
the future, should you be worried about such regressions.
The standing upstream policy is that while we try to do a good effort
and obviously fix everything we find ourselves or which gets reported
to us, we cannot test everything so if you want us to fix bugs you
need to pick one or more of these options:
- run our code
- or help us build better tests, either to 'perf test' (or to LTP)
- or get your users to test recent upstream kernels for you
- or ignore us and deal with regressions whenever you get hit by them
Your choice really.
Thanks,
Ingo
On Wed, 25 May 2011, Ingo Molnar wrote:
> * Vince Weaver <[email protected]> wrote:
>
> Well, i gave you a suggestion of how to prevent such regressions in
> the future, should you be worried about such regressions.
>
> - or help us build better tests, either to 'perf test' (or to LTP)
I do have a perf_event validation test suite that I maintain to help when
debugging PAPI issues.
http://web.eecs.utk.edu/~vweaver1/projects/perf-events/validation.html
it's only about 12 tests right now, but feel free to use them if you'd
like (they're GPL). I really don't have any interest in merging them into
the perf tree though.
Vince
On Tue, 24 May 2011, Peter Zijlstra wrote:
>
> This is very clear abuse of the API and I'm not really inclined to fix
> it, in fact, I'd be tempted to add an additional test to verify the
> argument to REFRESH > 0.
on that note (and while trying to document exactly what the ioctls do) it
seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
than one does not work on kernels 2.6.36 and newer. The behavior acts
as if 1 was passed, even if you pass in, say, 3.
Is it worth bisecting this, or has this become the new official behavior
since no one noticed until now?
To reproduce you can grab my tests from here:
http://web.eecs.utk.edu/~vweaver1/projects/perf-events/validation.html
and run the
./validation/simple_overflow_leader
test
Vince
On Fri, 2011-05-27 at 23:38 -0400, Vince Weaver wrote:
> on that note (and while trying to document exactly what the ioctls do) it
> seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
> than one does not work on kernels 2.6.36 and newer. The behavior acts
> as if 1 was passed, even if you pass in, say, 3.
Urgh, no that should definitely work. Thanks for the test-case, I'll
work on that (probably not until Monday though, but who knows).
On Sat, 28 May 2011, Peter Zijlstra wrote:
> On Fri, 2011-05-27 at 23:38 -0400, Vince Weaver wrote:
> > on that note (and while trying to document exactly what the ioctls do) it
> > seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
> > than one does not work on kernels 2.6.36 and newer. The behavior acts
> > as if 1 was passed, even if you pass in, say, 3.
>
> Urgh, no that should definitely work. Thanks for the test-case, I'll
> work on that (probably not until Monday though, but who knows).
So wait, the two regressions I found in 2.6.37 are WONTFIX because they
are too old, even though they break existing userspace code?
And this older regression in 2.6.36 is going to be fixed, even though
perf, PAPI, and libpfm4 don't trigger the buggy functionality at all?
I think it's time to redefine the PERF_EVENT_IOC_REFRESH ioctl to just
refresh once (as that's what it actually does on 2.6.36 - 2.6.39) and
if we need to refresh multiple we should add a new
PERF_EVENT_IOC_REFRESH_COUNT ioctl.
I know I am being difficult, but the perf-event ABI is a mess to program
for in a backward compatible fashion.
Vince
On Sat, 28 May 2011, Peter Zijlstra wrote:
> On Fri, 2011-05-27 at 23:38 -0400, Vince Weaver wrote:
> > on that note (and while trying to document exactly what the ioctls do) it
> > seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
> > than one does not work on kernels 2.6.36 and newer. The behavior acts
> > as if 1 was passed, even if you pass in, say, 3.
>
> Urgh, no that should definitely work. Thanks for the test-case, I'll
> work on that (probably not until Monday though, but who knows).
>
after a painfully long bisection, it turns out that this problem was in
theory introduced by the following commit:
d57e34fdd60be7ffd0b1d86bfa1a553df86b7172
perf: Simplify the ring-buffer logic: make perf_buffer_alloc() do everything needed
I'll see if I can come up with a patch, but it's a bit non-obvious why
this commit is affecting the REFRESH value at all.
Vince
[email protected]
On Sun, 29 May 2011, Vince Weaver wrote:
> On Sat, 28 May 2011, Peter Zijlstra wrote:
>
> > On Fri, 2011-05-27 at 23:38 -0400, Vince Weaver wrote:
> > > on that note (and while trying to document exactly what the ioctls do) it
> > > seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
> > > than one does not work on kernels 2.6.36 and newer. The behavior acts
> > > as if 1 was passed, even if you pass in, say, 3.
> >
> > Urgh, no that should definitely work. Thanks for the test-case, I'll
> > work on that (probably not until Monday though, but who knows).
> >
>
> after a painfully long bisection, it turns out that this problem was in
> theory introduced by the following commit:
>
> d57e34fdd60be7ffd0b1d86bfa1a553df86b7172
>
> perf: Simplify the ring-buffer logic: make perf_buffer_alloc() do everything needed
>
> I'll see if I can come up with a patch, but it's a bit non-obvious why
> this commit is affecting the REFRESH value at all.
the problem was the mentioned commit tried to optimize the use of
watermark and wakeup_watermark without taking into account that
wakeup_watermark is a union with wakeup_events.
The patch below *should* fix it, but something unrelated has broken
overflow support between 2.6.39 and 3.0-rc1 which I haven't had time to
investigate. The overflow count is suddenly about 10x what it should be
though. So the below is semi-untested and I possibly need to do another
bisect. *sigh*
Vince
Signed-off-by: Vince Weaver <[email protected]>
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d863b3c..e3ff283 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3403,12 +3403,14 @@ unlock:
static unsigned long perf_data_size(struct perf_buffer *buffer);
static void
-perf_buffer_init(struct perf_buffer *buffer, long watermark, int flags)
+perf_buffer_init(struct perf_buffer *buffer,
+ long watermark,
+ long wakeup_watermark, int flags)
{
long max_size = perf_data_size(buffer);
if (watermark)
- buffer->watermark = min(max_size, watermark);
+ buffer->watermark = min(max_size, wakeup_watermark);
if (!buffer->watermark)
buffer->watermark = max_size / 2;
@@ -3451,7 +3453,8 @@ static void *perf_mmap_alloc_page(int cpu)
}
static struct perf_buffer *
-perf_buffer_alloc(int nr_pages, long watermark, int cpu, int flags)
+perf_buffer_alloc(int nr_pages, long watermark,
+ long wakeup_watermark, int cpu, int flags)
{
struct perf_buffer *buffer;
unsigned long size;
@@ -3476,7 +3479,7 @@ perf_buffer_alloc(int nr_pages, long watermark, int cpu, int flags)
buffer->nr_pages = nr_pages;
- perf_buffer_init(buffer, watermark, flags);
+ perf_buffer_init(buffer, watermark, wakeup_watermark, flags);
return buffer;
@@ -3568,7 +3571,8 @@ static void perf_buffer_free(struct perf_buffer *buffer)
}
static struct perf_buffer *
-perf_buffer_alloc(int nr_pages, long watermark, int cpu, int flags)
+perf_buffer_alloc(int nr_pages, long watermark,
+ long wakeup_watermark, int cpu, int flags)
{
struct perf_buffer *buffer;
unsigned long size;
@@ -3592,7 +3596,7 @@ perf_buffer_alloc(int nr_pages, long watermark, int cpu, int flags)
buffer->page_order = ilog2(nr_pages);
buffer->nr_pages = 1;
- perf_buffer_init(buffer, watermark, flags);
+ perf_buffer_init(buffer, watermark, wakeup_watermark, flags);
return buffer;
@@ -3787,7 +3791,9 @@ static int perf_mmap(struct file *file, struct vm_area_struct *vma)
if (vma->vm_flags & VM_WRITE)
flags |= PERF_BUFFER_WRITABLE;
- buffer = perf_buffer_alloc(nr_pages, event->attr.wakeup_watermark,
+ buffer = perf_buffer_alloc(nr_pages,
+ event->attr.watermark,
+ event->attr.wakeup_watermark,
event->cpu, flags);
if (!buffer) {
ret = -ENOMEM;
On Mon, 2011-05-30 at 21:33 -0400, Vince Weaver wrote:
> On Sun, 29 May 2011, Vince Weaver wrote:
>
> > On Sat, 28 May 2011, Peter Zijlstra wrote:
> >
> > > On Fri, 2011-05-27 at 23:38 -0400, Vince Weaver wrote:
> > > > on that note (and while trying to document exactly what the ioctls do) it
> > > > seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
> > > > than one does not work on kernels 2.6.36 and newer. The behavior acts
> > > > as if 1 was passed, even if you pass in, say, 3.
> > >
> > > Urgh, no that should definitely work. Thanks for the test-case, I'll
> > > work on that (probably not until Monday though, but who knows).
> > >
> >
> > after a painfully long bisection, it turns out that this problem was in
> > theory introduced by the following commit:
> >
> > d57e34fdd60be7ffd0b1d86bfa1a553df86b7172
> >
> > perf: Simplify the ring-buffer logic: make perf_buffer_alloc() do everything needed
> >
> > I'll see if I can come up with a patch, but it's a bit non-obvious why
> > this commit is affecting the REFRESH value at all.
>
> the problem was the mentioned commit tried to optimize the use of
> watermark and wakeup_watermark without taking into account that
> wakeup_watermark is a union with wakeup_events.
>
> The patch below *should* fix it,
Awesome thanks!
> but something unrelated has broken
> overflow support between 2.6.39 and 3.0-rc1 which I haven't had time to
> investigate. The overflow count is suddenly about 10x what it should be
> though. So the below is semi-untested and I possibly need to do another
> bisect. *sigh*
Yeah, I noticed, I was hunting that as well..
On Mon, 2011-05-30 at 21:33 -0400, Vince Weaver wrote:
> the problem was the mentioned commit tried to optimize the use of
> watermark and wakeup_watermark without taking into account that
> wakeup_watermark is a union with wakeup_events.
Note that wake_events isn't related to IOC_REFRESH, wake_events is how
much events to buffer in the mmap-buffer before issuing a wakeup.
IOC_REFRESH increments event_limit, which is how many events to run
before disabling yourself.
What I gather is that due to that SIGIO bug (fixed by f506b3dc0e), you
had to have both an mmap and a wakeup in order for that signal to
arrive.
On Tue, 31 May 2011, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 21:33 -0400, Vince Weaver wrote:
> > the problem was the mentioned commit tried to optimize the use of
> > watermark and wakeup_watermark without taking into account that
> > wakeup_watermark is a union with wakeup_events.
>
> Note that wake_events isn't related to IOC_REFRESH, wake_events is how
> much events to buffer in the mmap-buffer before issuing a wakeup.
>
> IOC_REFRESH increments event_limit, which is how many events to run
> before disabling yourself.
>
> What I gather is that due to that SIGIO bug (fixed by f506b3dc0e), you
> had to have both an mmap and a wakeup in order for that signal to
> arrive.
yes, but due to a bug in the mentioned changeset, the buffer watermark
value was being set to a low value even if *watermark* was 0. So if you
were using IOC_REFRESH to set the *wakeup_events* value, it was also
setting the *wakeup_watermark* value (it's a union) and the buffer setup
was then unconditionally setting the buffer watermark to the value of the
supposedly unrelated *wakeup_watermark*. Normally the wakeup watermark
would default to something like 2048, but if you were trying to set the
wakeup_events value to something like 3 then wakeup_watermark would be set
to that too, causing a lot more overflow events.
I verified all the above painfully using a lot of printks.
I agree this does seem to be a combination of bugs, as even with a
properlyu set value on affected kernels you'd get spurious watermark
overflow events if you weren't consuming the ring buffer.
In any case, I can provide a cleaner patch than the one before that isn't
as intrusive.
I'm also bisecting the other problem I mentioned, the one where overflows
are 10x too large on 3.0-rc1. I'm at work with a Nehalem machine so the
bisect should go faster than the bisect I had to do on an atom machine
this weekend.
A power outage over the weekend has taken part of the
network down here though so my e-mail access is a bit limited, so I
apologize if I've been missing comments sent to my other e-mail address.
Vince
On Tue, 2011-05-31 at 09:49 -0400, Vince Weaver wrote:
> On Tue, 31 May 2011, Peter Zijlstra wrote:
>
> > On Mon, 2011-05-30 at 21:33 -0400, Vince Weaver wrote:
> > > the problem was the mentioned commit tried to optimize the use of
> > > watermark and wakeup_watermark without taking into account that
> > > wakeup_watermark is a union with wakeup_events.
> >
> > Note that wake_events isn't related to IOC_REFRESH, wake_events is how
> > much events to buffer in the mmap-buffer before issuing a wakeup.
> >
> > IOC_REFRESH increments event_limit, which is how many events to run
> > before disabling yourself.
> >
> > What I gather is that due to that SIGIO bug (fixed by f506b3dc0e), you
> > had to have both an mmap and a wakeup in order for that signal to
> > arrive.
>
> yes, but due to a bug in the mentioned changeset, the buffer watermark
> value was being set to a low value even if *watermark* was 0. So if you
> were using IOC_REFRESH to set the *wakeup_events* value,
IOC_REFRESH sets event->event_limit, not wakeup_events.
> it was also
> setting the *wakeup_watermark* value (it's a union) and the buffer setup
> was then unconditionally setting the buffer watermark to the value of the
> supposedly unrelated *wakeup_watermark*. Normally the wakeup watermark
> would default to something like 2048, but if you were trying to set the
> wakeup_events value to something like 3 then wakeup_watermark would be set
> to that too, causing a lot more overflow events.
poll() wakeups, which were inadvertly linked to SIGIOs
> I verified all the above painfully using a lot of printks.
I prefer to use trace_printk() and /debug/tracing/, that doesn't slow
stuff down as much.
> I agree this does seem to be a combination of bugs, as even with a
> properlyu set value on affected kernels you'd get spurious watermark
> overflow events if you weren't consuming the ring buffer.
*nod*
> In any case, I can provide a cleaner patch than the one before that isn't
> as intrusive.
Appreciated.
> I'm also bisecting the other problem I mentioned, the one where overflows
> are 10x too large on 3.0-rc1. I'm at work with a Nehalem machine so the
> bisect should go faster than the bisect I had to do on an atom machine
> this weekend.
It wouldn't be the SIGIO fix would it?, with that every overflow
generates a SIGIO, not only the poll() wakeups. And ouch at bisecting
(or even building a kernel) on an Atom, those things are horridly slow.
> A power outage over the weekend has taken part of the
> network down here though so my e-mail access is a bit limited, so I
> apologize if I've been missing comments sent to my other e-mail address.
I'm afraid not, I've been mostly tied up with fixing some scheduler
regressions.
Also, it looks like I just broke stuff even worse in -tip, am bisecting
that now.
On Tue, 31 May 2011, Peter Zijlstra wrote:
>
> IOC_REFRESH sets event->event_limit, not wakeup_events.
ahh, yes.
So it's a userspace "bug".
The test code called a "IOC_REFRESH, 3" whenever it got signalled.
It didn't distinguish between whether it was a plain overflow or if it was
a ring-buffer overflow (can it distinguish?).
Thus the watermark bug was confusing the user-space code into refreshing
when it was not necessary.
> > I'm also bisecting the other problem I mentioned, the one where overflows
> > are 10x too large on 3.0-rc1. I'm at work with a Nehalem machine so the
> > bisect should go faster than the bisect I had to do on an atom machine
> > this weekend.
>
> It wouldn't be the SIGIO fix would it?, with that every overflow
> generates a SIGIO, not only the poll() wakeups. And ouch at bisecting
> (or even building a kernel) on an Atom, those things are horridly slow.
Oh, it could be the SIGIO fix. I hadn't realized that got merged already.
And yes, bisect on atom is painful, but my alternatives were a 1.7GHz P4
(with small disk, so would have been over NFS), a 600MHz G3 iBook, or an
armv6 machine. So atom it was.
Vince
[email protected]
* Vince Weaver <[email protected]> wrote:
> On Sat, 28 May 2011, Peter Zijlstra wrote:
>
> > On Fri, 2011-05-27 at 23:38 -0400, Vince Weaver wrote:
> > > on that note (and while trying to document exactly what the ioctls do) it
> > > seems that a PERF_EVENT_IOC_REFRESH with an argument of anything higher
> > > than one does not work on kernels 2.6.36 and newer. The behavior acts
> > > as if 1 was passed, even if you pass in, say, 3.
> >
> > Urgh, no that should definitely work. Thanks for the test-case, I'll
> > work on that (probably not until Monday though, but who knows).
>
> So wait, the two regressions I found in 2.6.37 are WONTFIX because
> they are too old, even though they break existing userspace code?
>
> And this older regression in 2.6.36 is going to be fixed, even
> though perf, PAPI, and libpfm4 don't trigger the buggy
> functionality at all?
Btw., these considerations are flexible and we can reconsider and
change the WONTFIX if there's a patch available and doesn't look
horrible to backport. We can also mark fixes that havent been marked
-stable originally as -stable later on, etc.
So please don't feel needlessly bitter about past decisions: when
there's some good technical solution to a problem (or we were plain
out wrong about a decision) we try hard not to stand in the way.
Thanks,
Ingo