2011-02-10 19:08:56

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

Em Wed, Feb 09, 2011 at 06:50:09PM -0800, Arun Sharma escreveu:
> [ Not sure if this is known and fixed in newer kernels. My machine
> becomes unusable after this warning triggers ]
>
> perf record -g -p <pid> cs -o csw.data -- sleep 3

What kernel are you using? Peter, rings any bells?

- Arnaldo

> Feb 7 15:28:36 localhost kernel: WARNING: at kernel/perf_event.c:1302
> task_ctx_sched_out+0x45/0x67()
> ..
> Feb 7 15:28:36 localhost kernel: [<ffffffff8103c60d>]
> warn_slowpath_null+0x15/0x17
> Feb 7 15:28:36 localhost kernel: [<ffffffff8109e3b6>]
> task_ctx_sched_out+0x45/0x67
> Feb 7 15:28:36 localhost kernel: [<ffffffff8109e3b6>]
> task_ctx_sched_out+0x45/0x67
> Feb 7 15:28:36 localhost kernel: [<ffffffff810a207f>]
> perf_event_exit_task+0xca/0x196
> Feb 7 15:28:36 localhost kernel: [<ffffffff810a207f>]
> perf_event_exit_task+0xca/0x196
> Feb 7 15:28:36 localhost kernel: [<ffffffff8103fb3f>] do_exit+0x2c5/0x6e6
> Feb 7 15:28:36 localhost kernel: [<ffffffff8103fb3f>] do_exit+0x2c5/0x6e6
> Feb 7 15:28:36 localhost kernel: [<ffffffff81040021>] complete_and_exit+0x0/0x1e
> Feb 7 15:28:36 localhost kernel: [<ffffffff81040021>] complete_and_exit+0x0/0x1e
> Feb 7 15:28:36 localhost kernel: [<ffffffff8100293b>]
> system_call_fastpath+0x16/0x1b
> Feb 7 15:28:36 localhost kernel: [<ffffffff8100293b>]
> system_call_fastpath+0x16/0x1b
>
>
> 1301
> 1302 if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
> 1303 return;
>
> -Arun


2011-02-10 20:02:23

by Arun Sharma

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

On Thu, Feb 10, 2011 at 11:08 AM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Em Wed, Feb 09, 2011 at 06:50:09PM -0800, Arun Sharma escreveu:
>> [ Not sure if this is known and fixed in newer kernels. My machine
>> becomes unusable after this warning triggers ]
>>
>> perf record -g -p <pid> cs -o csw.data -- sleep 3
>
> What kernel are you using? Peter, rings any bells?

Stock 2.6.37.

-Arun

2011-02-10 20:20:30

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

Em Thu, Feb 10, 2011 at 12:02:20PM -0800, Arun Sharma escreveu:
> On Thu, Feb 10, 2011 at 11:08 AM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Em Wed, Feb 09, 2011 at 06:50:09PM -0800, Arun Sharma escreveu:
> >> [ Not sure if this is known and fixed in newer kernels. My machine
> >> becomes unusable after this warning triggers ]
> >>
> >> perf record -g -p <pid> cs -o csw.data -- sleep 3

Arun, are you shure the above line is right? I guess it should read:

perf record -g -p <pid> -e cs -o csw.data -- sleep 3

To specify the context switches soft event, right?

> > What kernel are you using? Peter, rings any bells?
>
> Stock 2.6.37.

- Arnaldo

2011-02-10 20:46:10

by Arun Sharma

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

On Thu, Feb 10, 2011 at 12:20 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
>
>> >> perf record -g -p <pid> cs -o csw.data -- sleep 3
>
> Arun, are you shure the above line is right? I guess it should read:
>
>
> perf record -g -p <pid> -e cs -o csw.data -- sleep 3
>
> To specify the context switches soft event, right?
>

You caught a cut and paste error. I'm pretty sure I had the -e in
there when the warning triggered. I tried this command a few times,
just to verify and here's what I found:

* Under low loads, everything works fine.
* Under a heavy work load - I'm not able to reproduce the warning, but
hitting very similar symptoms:

[ perf record: Captured and wrote 2.282 MB /tmp/junk.data (~99721 samples) ]
[ perf record: Captured and wrote 1.734 MB /tmp/junk.data (~75740 samples) ]
[ perf record: Captured and wrote 0.091 MB /tmp/junk.data (~3975
samples) ] <--- bad run

The bad run made my shell unresponsive and took around 30-40 seconds
to complete (whereas the good runs completed in less than 5 secs).
Could this be some kind of a feedback loop where what the measurement
machinery is perturbing what's being measured?

-Arun

2011-02-11 00:42:45

by Arnaldo Carvalho de Melo

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

Em Thu, Feb 10, 2011 at 12:46:07PM -0800, Arun Sharma escreveu:
> On Thu, Feb 10, 2011 at 12:20 PM, Arnaldo Carvalho de Melo wrote:
> >> >> perf record -g -p <pid> cs -o csw.data -- sleep 3

> > Arun, are you shure the above line is right? I guess it should read:

> > perf record -g -p <pid> -e cs -o csw.data -- sleep 3

> > To specify the context switches soft event, right?
>
> You caught a cut and paste error. I'm pretty sure I had the -e in
> there when the warning triggered. I tried this command a few times,
> just to verify and here's what I found:

> * Under low loads, everything works fine.
> * Under a heavy work load - I'm not able to reproduce the warning, but
> hitting very similar symptoms:

> [ perf record: Captured and wrote 2.282 MB /tmp/junk.data (~99721 samples) ]
> [ perf record: Captured and wrote 1.734 MB /tmp/junk.data (~75740 samples) ]
> [ perf record: Captured and wrote 0.091 MB /tmp/junk.data (~3975
> samples) ] <--- bad run

> The bad run made my shell unresponsive and took around 30-40 seconds
> to complete (whereas the good runs completed in less than 5 secs).
> Could this be some kind of a feedback loop where what the measurement
> machinery is perturbing what's being measured?

Is it possible for you to test this with 2.6.38-rc4? At least the user
level tools, just do:

[acme@felicio linux]$ make help | grep perf
perf-tar-src-pkg - Build perf-2.6.38-rc3.tar source tarball
perf-targz-src-pkg - Build perf-2.6.38-rc3.tar.gz source tarball
perf-tarbz2-src-pkg - Build perf-2.6.38-rc3.tar.bz2 source tarball
[acme@felicio linux]$

Pick one of these targets on the source tree for 2.6.38-rc4, move the
tarball to the machine where you need to run the older kernel (.37,
right?) and try building and running it there.

Either or just build the new tools and run it on the older kernel.

There were several changes to better inform about lost events due to
heavy load that may be at play in your case.

- Arnaldo

2011-02-11 00:57:42

by Arun Sharma

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

On Thu, Feb 10, 2011 at 4:42 PM, Arnaldo Carvalho de Melo
<[email protected]> wrote:
> Is it possible for you to test this with 2.6.38-rc4? At least the user
> level tools, just do:

I tried the 2.6.38rc3 perf binary on two machines:

kernel = 2.6.37: The machine hung solid on the first try.
kernel = 2.6.38rc3: I could run it 10 times in a loop and no bad runs.

Looks like this is already fixed in newer kernels.

-Arun

2011-02-11 13:13:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: 2.6.37 kernel warning in perf_events code

On Thu, 2011-02-10 at 16:57 -0800, Arun Sharma wrote:
> On Thu, Feb 10, 2011 at 4:42 PM, Arnaldo Carvalho de Melo
> <[email protected]> wrote:
> > Is it possible for you to test this with 2.6.38-rc4? At least the user
> > level tools, just do:
>
> I tried the 2.6.38rc3 perf binary on two machines:
>
> kernel = 2.6.37: The machine hung solid on the first try.
> kernel = 2.6.38rc3: I could run it 10 times in a loop and no bad runs.
>
> Looks like this is already fixed in newer kernels.

I couldn't actually reproduce on .37 with the given details, so while
its already fixed it would be nice to figure out what patch fixed it so
we can make sure it ends up in .37-stable.

Could you provide exact reproduction details or maybe bisect the thing?