2014-02-21 20:23:41

by Vince Weaver

[permalink] [raw]
Subject: perf_fuzzer causes reboot


So I'm not sure who exactly to report this to. Some perf people CC'd as
I trigger it while using the perf_fuzzer.

This is with 3.14-rc3 on a core2 machine, although I've had the reboots
happen throughout at least 3.14-rc*

I'm having a hard time coming up with a reproducible test case. Using the
random seed that caused the below will cause the perf_fuzzer to segfault
but not reboot.

The log isn't very helpful, it reboots so fast that the oops doesn't
finish printing and the serial log just moves to the bootloader...

[ 4466.804123] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
[ 4466.808014] IP: [<ffffffff81111783>] cache_reap+0x5e/0x1c5
[ 4466.808014] PGD 0
[ 4466.808014] Oops: 0000 [#1] GNU GRUB version 2.00-17


Vince


2014-02-21 22:11:50

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot


cc'ing x32 people

On Fri, 21 Feb 2014, Vince Weaver wrote:

> So I'm not sure who exactly to report this to. Some perf people CC'd as
> I trigger it while using the perf_fuzzer.
>
> This is with 3.14-rc3 on a core2 machine, although I've had the reboots
> happen throughout at least 3.14-rc*
>
> I'm having a hard time coming up with a reproducible test case. Using the
> random seed that caused the below will cause the perf_fuzzer to segfault
> but not reboot.
>
> The log isn't very helpful, it reboots so fast that the oops doesn't
> finish printing and the serial log just moves to the bootloader...
>
> [ 4466.804123] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> [ 4466.808014] IP: [<ffffffff81111783>] cache_reap+0x5e/0x1c5
> [ 4466.808014] PGD 0
> [ 4466.808014] Oops: 0000 [#1] GNU GRUB version 2.00-17

Maybe related, this is on an x32-compiled binary.

When trying to reproduce the perf_fuzzer myseriously segfaults on what
appears to be perfectly valid mmap'd perf ring-buffers.

(running under gdb)

Program received signal SIGSEGV, Segmentation fault.
0x0041efbb in __memset_sse2 ()

=> 0x0041efbb <+2203>: movdqa %xmm0,(%rdi)

rdi 0xf7f61000 4160098304

f7f61000-f7f72000 rw-s 00000000 00:08 4475 anon_inode:[perf_event]

So I'm not sure if somehow something is wrong with the page mapping, that
makes a valid write fail and sometimes (possibly due to address space
randomization) reboot the system?

Vince

2014-02-21 22:32:13

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 21 Feb 2014, Vince Weaver wrote:

> On Fri, 21 Feb 2014, Vince Weaver wrote:
>
> > So I'm not sure who exactly to report this to. Some perf people CC'd as
> > I trigger it while using the perf_fuzzer.
> >
> > This is with 3.14-rc3 on a core2 machine, although I've had the reboots
> > happen throughout at least 3.14-rc*
> >
> > I'm having a hard time coming up with a reproducible test case. Using the
> > random seed that caused the below will cause the perf_fuzzer to segfault
> > but not reboot.
> >
> > The log isn't very helpful, it reboots so fast that the oops doesn't
> > finish printing and the serial log just moves to the bootloader...
> >
> > [ 4466.804123] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050
> > [ 4466.808014] IP: [<ffffffff81111783>] cache_reap+0x5e/0x1c5
> > [ 4466.808014] PGD 0
> > [ 4466.808014] Oops: 0000 [#1] GNU GRUB version 2.00-17
>
> Maybe related, this is on an x32-compiled binary.
>
> When trying to reproduce the perf_fuzzer myseriously segfaults on what
> appears to be perfectly valid mmap'd perf ring-buffers.
>
> (running under gdb)
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0041efbb in __memset_sse2 ()
>
> => 0x0041efbb <+2203>: movdqa %xmm0,(%rdi)
>
> rdi 0xf7f61000 4160098304
>
> f7f61000-f7f72000 rw-s 00000000 00:08 4475 anon_inode:[perf_event]
>
> So I'm not sure if somehow something is wrong with the page mapping, that
> makes a valid write fail and sometimes (possibly due to address space
> randomization) reboot the system?

also strange, when I look at the core dumps it always shows the bad memory
address happening at the beginning of an mmap page as expected for the
ip listed, but the segfault listed in the kernel happens at some
completely unrelated address that isn't even page aligned and shouldn't be
possible based on the gdb/coredump results?

[ 1560.313863] perf_fuzzer[2826]: segfault at 503283ff ip 000000000041efbb
sp 00000000ffd367d8 error 6 in perf_fuzzer[400000+d1000]
[ 1704.673245] perf_fuzzer[2835]: segfault at 503283ff ip 000000000041efbb
sp 00000000ff972be8 error 6 in perf_fuzzer[400000+d1000]
[ 2978.101276] perf_fuzzer[2841]: segfault at 503283ff ip 000000000041efbb
sp 00000000ff92ba68 error 6 in perf_fuzzer[400000+d1000]
[ 4907.185366] perf_fuzzer[2868]: segfault at 503283ff ip 000000000041efbb
sp 00000000ffadcd28 error 6 in perf_fuzzer[400000+d1000]
[ 9570.793746] perf_fuzzer[6183]: segfault at 4d0bf28e ip 000000000041efbb
sp 00000000ff83f688 error 6 in perf_fuzzer[400000+d1000]
[ 9743.888431] perf_fuzzer[6187]: segfault at 91734d5 ip 000000000041efbb
sp 00000000ffb4e288 error 6 in perf_fuzzer[400000+d1000]

Vince

2014-02-22 04:48:42

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot


So I changed the perf_fuzzer so when it randomly stomps all over the
perf_event_mmap_page, it uses a constant value of 0xdeadbeef rather
than a random value.

The result is below. The segfaults make a bit more sense now, it
almost looks like what is happening is we are corrupting an address
value somehow (head? tail?) and the kernel then uses the corrupt address
and writes to memory outside of the mmap ring buffer.

I still haven't figured out how to trigger this exactly, but you can
see when over-written with 0xdeadbeef the memory address written to is
consistently some small multiple of 0x120.

I imagine it would be a bad thing if it turned out to be possible to
select what memory address got written to. Although since I've
only reproduced this on x32 maybe it won't be possible to over-write
the kernel; but I have seen this bug cause a reboot when the
wrong thing got over-written.

[28002.850192] perf_fuzzer[7083]: segfault at 2be0 ip 000000000041efab sp 00000000ff826748 error 6 in perf_fuzzer[400000+d1000]
[28639.769869] perf_fuzzer[7100]: segfault at 1320 ip 000000000041efab sp 00000000ffa65038 error 6 in perf_fuzzer[400000+d1000]
[29396.986242] perf_fuzzer[7120]: segfault at 10e0 ip 000000000041efab sp 00000000ffd48e68 error 6 in perf_fuzzer[400000+d1000]
[29738.892931] perf_fuzzer[7128]: segfault at 18c0 ip 000000000041efab sp 00000000ffcdcd88 error 6 in perf_fuzzer[400000+d1000]
[29815.550210] perf_fuzzer[7132]: segfault at 120 ip 000000000041efab sp 00000000ffe673b8 error 6 in perf_fuzzer[400000+d1000]
[30173.455348] perf_fuzzer[7141]: segfault at 120 ip 000000000041efab sp 00000000ffda1948 error 6 in perf_fuzzer[400000+d1000]
[30570.625642] perf_fuzzer[7156]: segfault at 1680 ip 000000000041efab sp 00000000ffaad028 error 6 in perf_fuzzer[400000+d1000]
[31047.887784] perf_fuzzer[7169]: segfault at 60c0 ip 000000000041efab sp 00000000ffaa86e8 error 6 in perf_fuzzer[400000+d1000]
[31300.168714] perf_fuzzer[7175]: segfault at 3a80 ip 000000000041efab sp 00000000ffd83228 error 6 in perf_fuzzer[400000+d1000]
[31984.727278] perf_fuzzer[7193]: segfault at 7e0 ip 000000000041efab sp 00000000ff9db1f8 error 6 in perf_fuzzer[400000+d1000]

Vince

2014-02-22 05:05:02

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

Those are segfaults in user space, though?

On February 21, 2014 8:50:38 PM PST, Vince Weaver <[email protected]> wrote:
>
>So I changed the perf_fuzzer so when it randomly stomps all over the
>perf_event_mmap_page, it uses a constant value of 0xdeadbeef rather
>than a random value.
>
>The result is below. The segfaults make a bit more sense now, it
>almost looks like what is happening is we are corrupting an address
>value somehow (head? tail?) and the kernel then uses the corrupt
>address
>and writes to memory outside of the mmap ring buffer.
>
>I still haven't figured out how to trigger this exactly, but you can
>see when over-written with 0xdeadbeef the memory address written to is
>consistently some small multiple of 0x120.
>
>I imagine it would be a bad thing if it turned out to be possible to
>select what memory address got written to. Although since I've
>only reproduced this on x32 maybe it won't be possible to over-write
>the kernel; but I have seen this bug cause a reboot when the
>wrong thing got over-written.
>
>[28002.850192] perf_fuzzer[7083]: segfault at 2be0 ip 000000000041efab
>sp 00000000ff826748 error 6 in perf_fuzzer[400000+d1000]
>[28639.769869] perf_fuzzer[7100]: segfault at 1320 ip 000000000041efab
>sp 00000000ffa65038 error 6 in perf_fuzzer[400000+d1000]
>[29396.986242] perf_fuzzer[7120]: segfault at 10e0 ip 000000000041efab
>sp 00000000ffd48e68 error 6 in perf_fuzzer[400000+d1000]
>[29738.892931] perf_fuzzer[7128]: segfault at 18c0 ip 000000000041efab
>sp 00000000ffcdcd88 error 6 in perf_fuzzer[400000+d1000]
>[29815.550210] perf_fuzzer[7132]: segfault at 120 ip 000000000041efab
>sp 00000000ffe673b8 error 6 in perf_fuzzer[400000+d1000]
>[30173.455348] perf_fuzzer[7141]: segfault at 120 ip 000000000041efab
>sp 00000000ffda1948 error 6 in perf_fuzzer[400000+d1000]
>[30570.625642] perf_fuzzer[7156]: segfault at 1680 ip 000000000041efab
>sp 00000000ffaad028 error 6 in perf_fuzzer[400000+d1000]
>[31047.887784] perf_fuzzer[7169]: segfault at 60c0 ip 000000000041efab
>sp 00000000ffaa86e8 error 6 in perf_fuzzer[400000+d1000]
>[31300.168714] perf_fuzzer[7175]: segfault at 3a80 ip 000000000041efab
>sp 00000000ffd83228 error 6 in perf_fuzzer[400000+d1000]
>[31984.727278] perf_fuzzer[7193]: segfault at 7e0 ip 000000000041efab
>sp 00000000ff9db1f8 error 6 in perf_fuzzer[400000+d1000]
>
>Vince

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-22 06:27:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/21/2014 08:50 PM, Vince Weaver wrote:
>
> So I changed the perf_fuzzer so when it randomly stomps all over the
> perf_event_mmap_page, it uses a constant value of 0xdeadbeef rather
> than a random value.
>
> The result is below. The segfaults make a bit more sense now, it
> almost looks like what is happening is we are corrupting an address
> value somehow (head? tail?) and the kernel then uses the corrupt address
> and writes to memory outside of the mmap ring buffer.
>

That seems unlikely:

handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
offset &= (1UL << page_shift) - 1;

The masking to the number of pages should make that not possible, even
if a completely bogus value is written.

> I still haven't figured out how to trigger this exactly, but you can
> see when over-written with 0xdeadbeef the memory address written to is
> consistently some small multiple of 0x120.
>
> I imagine it would be a bad thing if it turned out to be possible to
> select what memory address got written to. Although since I've
> only reproduced this on x32 maybe it won't be possible to over-write
> the kernel; but I have seen this bug cause a reboot when the
> wrong thing got over-written.
>
> [28002.850192] perf_fuzzer[7083]: segfault at 2be0 ip 000000000041efab sp 00000000ff826748 error 6 in perf_fuzzer[400000+d1000]
> [28639.769869] perf_fuzzer[7100]: segfault at 1320 ip 000000000041efab sp 00000000ffa65038 error 6 in perf_fuzzer[400000+d1000]
> [29396.986242] perf_fuzzer[7120]: segfault at 10e0 ip 000000000041efab sp 00000000ffd48e68 error 6 in perf_fuzzer[400000+d1000]
> [29738.892931] perf_fuzzer[7128]: segfault at 18c0 ip 000000000041efab sp 00000000ffcdcd88 error 6 in perf_fuzzer[400000+d1000]
> [29815.550210] perf_fuzzer[7132]: segfault at 120 ip 000000000041efab sp 00000000ffe673b8 error 6 in perf_fuzzer[400000+d1000]
> [30173.455348] perf_fuzzer[7141]: segfault at 120 ip 000000000041efab sp 00000000ffda1948 error 6 in perf_fuzzer[400000+d1000]
> [30570.625642] perf_fuzzer[7156]: segfault at 1680 ip 000000000041efab sp 00000000ffaad028 error 6 in perf_fuzzer[400000+d1000]
> [31047.887784] perf_fuzzer[7169]: segfault at 60c0 ip 000000000041efab sp 00000000ffaa86e8 error 6 in perf_fuzzer[400000+d1000]
> [31300.168714] perf_fuzzer[7175]: segfault at 3a80 ip 000000000041efab sp 00000000ffd83228 error 6 in perf_fuzzer[400000+d1000]
> [31984.727278] perf_fuzzer[7193]: segfault at 7e0 ip 000000000041efab sp 00000000ff9db1f8 error 6 in perf_fuzzer[400000+d1000]

Error 6 reflects a write in userspace to a not-present page.

Since your previous trace indicates that the value of the register in
question is a different one, I'm guessing that what we have here is PEBS
getting activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit
PEBS record.

-hpa

2014-02-23 05:16:20

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 21 Feb 2014, H. Peter Anvin wrote:

> Error 6 reflects a write in userspace to a not-present page.
>
> Since your previous trace indicates that the value of the register in question
> is a different one, I'm guessing that what we have here is PEBS getting
> activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS record.

I'm having problems generating a replayable syscall trace that exhibits
the problem.

It turns out that the segfault address listed (the multiple of 0x120)
happens to be the value in the RBP register at the time of the segfault.

That's odd, as the instruction is
movdqa %xmm0,(%rdi)
and rdi is the valid mmap address of the perf ring buffer
rdi 0xf7768000 4151738368

so I'm not sure why RBP is involved at all.

In all of the cases I've investigated the precise_ip value has been set
for the problem event... but none of the events have been hardware events
(software and breakpoint so far). So probably not PEBS related?

Vince

2014-02-23 05:25:09

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

What is the instructions around it, by any chance?

On February 22, 2014 9:18:17 PM PST, Vince Weaver <[email protected]> wrote:
>On Fri, 21 Feb 2014, H. Peter Anvin wrote:
>
>> Error 6 reflects a write in userspace to a not-present page.
>>
>> Since your previous trace indicates that the value of the register in
>question
>> is a different one, I'm guessing that what we have here is PEBS
>getting
>> activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS
>record.
>
>I'm having problems generating a replayable syscall trace that exhibits
>
>the problem.
>
>It turns out that the segfault address listed (the multiple of 0x120)
>happens to be the value in the RBP register at the time of the
>segfault.
>
>That's odd, as the instruction is
> movdqa %xmm0,(%rdi)
>and rdi is the valid mmap address of the perf ring buffer
> rdi 0xf7768000 4151738368
>
>so I'm not sure why RBP is involved at all.
>
>In all of the cases I've investigated the precise_ip value has been set
>
>for the problem event... but none of the events have been hardware
>events
>(software and breakpoint so far). So probably not PEBS related?
>
>Vince

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-23 06:08:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

I'd be interested in how rbp gets set, too. It might just be a coincidence and the value in rbp has some other meaning here.

On February 22, 2014 9:18:17 PM PST, Vince Weaver <[email protected]> wrote:
>On Fri, 21 Feb 2014, H. Peter Anvin wrote:
>
>> Error 6 reflects a write in userspace to a not-present page.
>>
>> Since your previous trace indicates that the value of the register in
>question
>> is a different one, I'm guessing that what we have here is PEBS
>getting
>> activated. 0x120 is 2*0x90, and 0x90 is the size of a 64-bit PEBS
>record.
>
>I'm having problems generating a replayable syscall trace that exhibits
>
>the problem.
>
>It turns out that the segfault address listed (the multiple of 0x120)
>happens to be the value in the RBP register at the time of the
>segfault.
>
>That's odd, as the instruction is
> movdqa %xmm0,(%rdi)
>and rdi is the valid mmap address of the perf ring buffer
> rdi 0xf7768000 4151738368
>
>so I'm not sure why RBP is involved at all.
>
>In all of the cases I've investigated the precise_ip value has been set
>
>for the problem event... but none of the events have been hardware
>events
>(software and breakpoint so far). So probably not PEBS related?
>
>Vince

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-23 14:03:02

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Sat, 22 Feb 2014, H. Peter Anvin wrote:

> I'd be interested in how rbp gets set, too. It might just be a
> coincidence and the value in rbp has some other meaning here.

The code in question does this:

i=find_random_active_event();
if (i<0) return;
if ((event_data[i].mmap)) {
value=0xdeadbeef;
memset(event_data[i].mmap,value,getpagesize());


[New LWP 10526]
Core was generated by `./perf_fuzzer -t OCIRMQWPpAi -r 1392938876'.
Program terminated with signal 11, Segmentation fault.
#0 0x0041efab in __memset_sse2 ()
(gdb) bt
#0 0x0041efab in __memset_sse2 ()
#1 0x004017ec in trash_random_mmap () at perf_fuzzer.c:808
#2 main (argc=<optimized out>, argv=<optimized out>) at perf_fuzzer.c:1604


So rbp is set by the imul below, it is the offset into the
event_data[i] array where the elements have size of 0x120

0x004017bd <+3085>: callq 0x402ee0 <find_random_active_event>
0x004017c2 <+3090>: test %eax,%eax
0x004017c4 <+3092>: js 0x4011e8 <main+1592>
0x004017ca <+3098>: imul $0x120,%eax,%ebp
0x004017d0 <+3104>: mov 0x756b2c(%ebp),%eax

0x004017d7 <+3111>: test %eax,%eax
0x004017d9 <+3113>: je 0x40183b <main+3211>

0x004017db <+3115>: mov 0xc(%esp),%edx
0x004017e0 <+3120>: mov %eax,%edi
0x004017e2 <+3122>: mov $0xdeadbeef,%esi
0x004017e7 <+3127>: callq 0x400260
0x004017ec <+3132>: testb $0x20,0x353e76(%rip) # 0x755669 <logging+$

400260: ff 25 ce 0e 2d 00 jmpq *0x2d0ece(%rip) # 6d1134 $

0x6d1134: 0x0041e710

Dump of assembler code for function __memset_sse2:

0x0041e710 <+0>: cmp $0x1,%rdx
0x0041e714 <+4>: mov %rdi,%rax
0x0041e717 <+7>: jne 0x41e71d <__memset_sse2+13>
0x0041e719 <+9>: mov %sil,(%rdi)

and as far as I can tell nothing touches rbp again until the segfault.
Nothing in _memset_sse2 does as far as I can tell.

Vince

2014-02-24 03:00:17

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Sun, 23 Feb 2014, Vince Weaver wrote:
>
> and as far as I can tell nothing touches rbp again until the segfault.
> Nothing in _memset_sse2 does as far as I can tell.

I only know enough about ftrace to be dangerous, but here is what I think
is the trace of the problem:

perf_fuzzer-11492 [000] 197077.488363: function: intel_get_event_constraints
perf_fuzzer-11492 [000] 197077.488363: function: intel_pebs_constraints
perf_fuzzer-11492 [000] 197077.488365: function: intel_put_event_constraints
perf_fuzzer-11492 [000] 197077.488365: function: intel_pmu_enable_all
perf_fuzzer-11492 [000] 197077.488366: function: intel_pmu_pebs_enable_all
perf_fuzzer-11492 [000] 197077.488367: function: intel_pmu_lbr_enable_all
perf_fuzzer-11492 [000] 197077.488366: function: intel_pmu_pebs_enable_all
perf_fuzzer-11492 [000] 197077.488367: function: intel_pmu_lbr_enable_all
perf_fuzzer-11492 [000] 197077.488368: function: mutex_unlock
perf_fuzzer-11492 [000] 197077.488368: function: mutex_lock
perf_fuzzer-11492 [000] 197077.488369: function: _cond_resched
perf_fuzzer-11492 [000] 197077.488370: function: _raw_spin_lock_irq
perf_fuzzer-11492 [000] 197077.488370: function: mutex_unlock
perf_fuzzer-11492 [000] 197077.488371: function: mutex_lock
perf_fuzzer-11492 [000] 197077.488371: function: _cond_resched
perf_fuzzer-11492 [000] 197077.488372: function: _raw_spin_lock_irq
perf_fuzzer-11492 [000] 197077.488373: function: mutex_unlock
perf_fuzzer-11492 [000] 197077.488373: function: mutex_lock
perf_fuzzer-11492 [000] 197077.488374: function: _cond_resched
perf_fuzzer-11492 [000] 197077.488374: function: _raw_spin_lock_irq
perf_fuzzer-11492 [000] 197077.488375: function: smp_call_function_single
perf_fuzzer-11492 [000] 197077.488376: function: _raw_spin_lock
perf_fuzzer-11492 [000] 197077.488377: function: mutex_unlock
perf_fuzzer-11492 [000] 197077.488378: function: mutex_lock
perf_fuzzer-11492 [000] 197077.488378: function: _cond_resched
perf_fuzzer-11492 [000] 197077.488379: function: _raw_spin_lock_irq
perf_fuzzer-11492 [000] 197077.488380: function: smp_call_function_single
perf_fuzzer-11492 [000] 197077.488380: function: _raw_spin_lock
perf_fuzzer-11492 [000] 197077.488381: function: mutex_unlock
perf_fuzzer-11492 [000] 197077.488382: function: mutex_unlock
perf_fuzzer-11492 [000] 197077.488383: function: syscall_trace_leave
perf_fuzzer-11492 [000] 197077.488383: sys_exit: NR 1073741981 = 0
perf_fuzzer-11492 [000] 197077.488387: function: do_device_not_available
perf_fuzzer-11492 [000] 197077.488387: function: math_state_restore
perf_fuzzer-11492 [000] 197077.488390: function: trace_do_page_fault
perf_fuzzer-11492 [000] 197077.488391: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6
perf_fuzzer-11492 [000] 197077.488395: function: perf_callchain
perf_fuzzer-11492 [000] 197077.488396: function: copy_from_user_nmi
perf_fuzzer-11492 [000] 197077.488397: function: trace_do_page_fault
perf_fuzzer-11492 [000] 197077.488398: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
perf_fuzzer-11492 [000] 197077.488399: function: __do_page_fault
perf_fuzzer-11492 [000] 197077.488400: function: bad_area_nosemaphore
perf_fuzzer-11492 [000] 197077.488401: function: __bad_area_nosemaphore
perf_fuzzer-11492 [000] 197077.488401: function: no_context
perf_fuzzer-11492 [000] 197077.488402: function: fixup_exception
perf_fuzzer-11492 [000] 197077.488403: function: search_exception_tables
perf_fuzzer-11492 [000] 197077.488403: function: search_extable
perf_fuzzer-11492 [000] 197077.488405: function: copy_user_handle_tail
perf_fuzzer-11492 [000] 197077.488406: function: trace_do_page_fault
perf_fuzzer-11492 [000] 197077.488406: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
perf_fuzzer-11492 [000] 197077.488407: function: __do_page_fault
perf_fuzzer-11492 [000] 197077.488408: function: bad_area_nosemaphore
perf_fuzzer-11492 [000] 197077.488409: function: __bad_area_nosemaphore
perf_fuzzer-11492 [000] 197077.488409: function: no_context
perf_fuzzer-11492 [000] 197077.488410: function: fixup_exception
perf_fuzzer-11492 [000] 197077.488410: function: search_exception_tables
perf_fuzzer-11492 [000] 197077.488411: function: search_extable
perf_fuzzer-11492 [000] 197077.488413: function: perf_output_begin
perf_fuzzer-11492 [000] 197077.488414: function: perf_output_copy
perf_fuzzer-11492 [000] 197077.488415: function: perf_output_copy
perf_fuzzer-11492 [000] 197077.488415: function: perf_output_copy
perf_fuzzer-11492 [000] 197077.488416: function: perf_output_copy
perf_fuzzer-11492 [000] 197077.488418: function: perf_output_copy
perf_fuzzer-11492 [000] 197077.488419: function: perf_output_copy
perf_fuzzer-11492 [000] 197077.488419: function: perf_output_end
perf_fuzzer-11492 [000] 197077.488420: function: perf_output_put_handle
perf_fuzzer-11492 [000] 197077.488421: function: __do_page_fault
perf_fuzzer-11492 [000] 197077.488422: function: down_read_trylock
perf_fuzzer-11492 [000] 197077.488423: function: _cond_resched
perf_fuzzer-11492 [000] 197077.488423: function: find_vma
perf_fuzzer-11492 [000] 197077.488424: function: bad_area
perf_fuzzer-11492 [000] 197077.488425: function: up_read
perf_fuzzer-11492 [000] 197077.488426: function: __bad_area_nosemaphore
perf_fuzzer-11492 [000] 197077.488426: function: is_prefetch
perf_fuzzer-11492 [000] 197077.488427: function: convert_ip_to_linear
perf_fuzzer-11492 [000] 197077.488428: function: unhandled_signal
perf_fuzzer-11492 [000] 197077.488429: function: __printk_ratelimit
perf_fuzzer-11492 [000] 197077.488430: function: _raw_spin_trylock
perf_fuzzer-11492 [000] 197077.488430: function: _raw_spin_unlock_irqrestore
perf_fuzzer-11492 [000] 197077.488431: function: printk
perf_fuzzer-11492 [000] 197077.488432: function: vprintk_emit
perf_fuzzer-11492 [000] 197077.488434: function: _raw_spin_lock
perf_fuzzer-11492 [000] 197077.488443: function: cont_add
perf_fuzzer-11492 [000] 197077.488444: function: console_trylock
perf_fuzzer-11492 [000] 197077.488445: function: down_trylock
perf_fuzzer-11492 [000] 197077.488445: function: _raw_spin_lock_irqsave
perf_fuzzer-11492 [000] 197077.488446: function: _raw_spin_unlock_irqrestore
perf_fuzzer-11492 [000] 197077.488447: function: console_unlock
perf_fuzzer-11492 [000] 197077.488447: function: _raw_spin_lock_irqsave
perf_fuzzer-11492 [000] 197077.488449: function: print_time
perf_fuzzer-11492 [000] 197077.488452: function: T.950
perf_fuzzer-11492 [000] 197077.488453: console: [197179.420735] perf_fuzzer[11492]: segfault at 22e0 ip 000000000041efab sp 00000000ffda0938 error 6

2014-02-24 05:23:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/23/2014 07:02 PM, Vince Weaver wrote:
> On Sun, 23 Feb 2014, Vince Weaver wrote:
>>
>> and as far as I can tell nothing touches rbp again until the segfault.
>> Nothing in _memset_sse2 does as far as I can tell.
>
> I only know enough about ftrace to be dangerous, but here is what I think
> is the trace of the problem:
>
> perf_fuzzer-11492 [000] 197077.488420: function: perf_output_put_handle
> perf_fuzzer-11492 [000] 197077.488421: function: __do_page_fault

So we do a write to the buffer rather immediately before this happens,
and in particular that will update the head:

rb->user_page->data_head = head;

However, that doesn't explain what is going on and in particular the
write to whatever address was in %rbp. The rest pretty much seems to be
the page fault logic.

Incidentally, I doubt that this is x32-related in any way; there seems
to be absolutely no difference between x86-64 perf and x32 perf; more
likely it just makes the error more reproducible because the address
space is so much smaller.

-hpa

2014-02-24 15:33:37

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Sun, 23 Feb 2014, H. Peter Anvin wrote:

> So we do a write to the buffer rather immediately before this happens,
> and in particular that will update the head:
>
> rb->user_page->data_head = head;
>
> However, that doesn't explain what is going on and in particular the
> write to whatever address was in %rbp. The rest pretty much seems to be
> the page fault logic.

It turns out you don't even have to over-write rb->user_page->data_head.
Just touching the mmap page with a write of a single byte (it doesn't
matter where) is enough to trigger the bug.

This is a pain to track down, it would be easier if I could get a
replayable syscall trace, but even though the segfault is very
reproducible with my fuzzer, it's very sensitive to extra syscalls in the
trace path and the fuzzer logger/replayer path has a different number of
write syscalls and won't trigger the problem.

> Incidentally, I doubt that this is x32-related in any way; there seems
> to be absolutely no difference between x86-64 perf and x32 perf; more
> likely it just makes the error more reproducible because the address
> space is so much smaller.

quite possibly. I only began chasing the problem because when compiled
for x32 this bug apparently will reboot the machine now and then (not just
segfault the program). I never saw that failure mode with x86_64, but
again maybe it's just easier to hit with the reduced address space as you
say.

Vince

2014-02-24 16:32:32

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, Vince Weaver wrote:

> Just touching the mmap page with a write of a single byte (it doesn't
> matter where) is enough to trigger the bug.

OK, investigating this more.

perf_fuzzer-2971 [000] 154.944114: page_fault_user: address=0xf7729000 ip=0x41efab error_code=0x6
perf_fuzzer-2971 [000] 154.944118: function: ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840
perf_fuzzer-2971 [000] 154.944119: function: ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5
perf_fuzzer-2971 [000] 154.944120: function: ip=0xffffffff8153837c parent_ip=0xffffffff81535432
perf_fuzzer-2971 [000] 154.944121: page_fault_kernel: address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0

It looks like there are two page faults. The first is caused by the user
code accessing the mmap'd page. It looks sort of normal and what you'd
expect if the perf_event mmap ring buffer is being accessed for the first
time.

What follows is a kernel page fault, and this is the one where for
whatever reason CR2 has obtained the value of the userspace RBP register.

Vince

2014-02-24 16:47:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

Ok, so the obvious question is what is at that kernel address?

On February 24, 2014 8:34:30 AM PST, Vince Weaver <[email protected]> wrote:
>On Mon, 24 Feb 2014, Vince Weaver wrote:
>
>> Just touching the mmap page with a write of a single byte (it doesn't
>
>> matter where) is enough to trigger the bug.
>
>OK, investigating this more.
>
>perf_fuzzer-2971 [000] 154.944114: page_fault_user:
>address=0xf7729000 ip=0x41efab error_code=0x6
>perf_fuzzer-2971 [000] 154.944118: function:
>ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840
>perf_fuzzer-2971 [000] 154.944119: function:
>ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5
>perf_fuzzer-2971 [000] 154.944120: function:
>ip=0xffffffff8153837c parent_ip=0xffffffff81535432
>perf_fuzzer-2971 [000] 154.944121: page_fault_kernel:
>address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0
>
>It looks like there are two page faults. The first is caused by the
>user
>code accessing the mmap'd page. It looks sort of normal and what you'd
>expect if the perf_event mmap ring buffer is being accessed for the
>first
>time.
>
>What follows is a kernel page fault, and this is the one where for
>whatever reason CR2 has obtained the value of the userspace RBP
>register.
>
>Vince

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-24 17:08:51

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, H. Peter Anvin wrote:

> On February 24, 2014 8:34:30 AM PST, Vince Weaver <[email protected]> wrote:
> >On Mon, 24 Feb 2014, Vince Weaver wrote:
> >
> >> Just touching the mmap page with a write of a single byte (it doesn't
> >
> >> matter where) is enough to trigger the bug.
> >
> >OK, investigating this more.
> >
> >perf_fuzzer-2971 [000] 154.944114: page_fault_user:
> >address=0xf7729000 ip=0x41efab error_code=0x6
> >perf_fuzzer-2971 [000] 154.944118: function:
> >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840
> >perf_fuzzer-2971 [000] 154.944119: function:
> >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5
> >perf_fuzzer-2971 [000] 154.944120: function:
> >ip=0xffffffff8153837c parent_ip=0xffffffff81535432
> >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel:
> >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0

> Ok, so the obvious question is what is at that kernel address?
>

It's in copy_user_generic_string()
rep movsq %ds:(%rsi),%es:(%rdi)

And looking at the ftrace:
perf_fuzzer-2979 [000] 161.475920: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6
perf_fuzzer-2979 [000] 161.475922: function: perf_callchain
perf_fuzzer-2979 [000] 161.475922: function: copy_from_user_nmi
perf_fuzzer-2979 [000] 161.475923: function: trace_do_page_fault
perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0

What is likely happening is the user page fault is triggering
code to do a "perf_callchain" dump, which is calling copy_from_user_nmi()
which calls copy_user_generic_string() which is somehow getting the user
RBP in the RDI register somehow?

Vince

2014-02-24 17:25:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, Feb 24, 2014 at 12:10:44PM -0500, Vince Weaver wrote:
> On Mon, 24 Feb 2014, H. Peter Anvin wrote:
>
> > On February 24, 2014 8:34:30 AM PST, Vince Weaver <[email protected]> wrote:
> > >On Mon, 24 Feb 2014, Vince Weaver wrote:
> > >
> > >> Just touching the mmap page with a write of a single byte (it doesn't
> > >
> > >> matter where) is enough to trigger the bug.
> > >
> > >OK, investigating this more.
> > >
> > >perf_fuzzer-2971 [000] 154.944114: page_fault_user:
> > >address=0xf7729000 ip=0x41efab error_code=0x6
> > >perf_fuzzer-2971 [000] 154.944118: function:
> > >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840
> > >perf_fuzzer-2971 [000] 154.944119: function:
> > >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5
> > >perf_fuzzer-2971 [000] 154.944120: function:
> > >ip=0xffffffff8153837c parent_ip=0xffffffff81535432
> > >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel:
> > >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0
>
> > Ok, so the obvious question is what is at that kernel address?
> >
>
> It's in copy_user_generic_string()
> rep movsq %ds:(%rsi),%es:(%rdi)
>
> And looking at the ftrace:
> perf_fuzzer-2979 [000] 161.475920: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6
> perf_fuzzer-2979 [000] 161.475922: function: perf_callchain
> perf_fuzzer-2979 [000] 161.475922: function: copy_from_user_nmi
> perf_fuzzer-2979 [000] 161.475923: function: trace_do_page_fault
> perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
>
> What is likely happening is the user page fault is triggering
> code to do a "perf_callchain" dump, which is calling copy_from_user_nmi()
> which calls copy_user_generic_string() which is somehow getting the user
> RBP in the RDI register somehow?

So that code very much relies on the 'recursive' NMI/iret magic from
Steve, patch 3f3c8b8c4b2a3 (and assorted fixes later).

If CR2 is getting corrupted; 7fbb98c5cb075 seems relevant.

Peter, does x32 have a slightly different ABI/calling convention that
would make any of these patches just slightly 'off'?

2014-02-24 17:30:54

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, Peter Zijlstra wrote:

> On Mon, Feb 24, 2014 at 12:10:44PM -0500, Vince Weaver wrote:
> > On Mon, 24 Feb 2014, H. Peter Anvin wrote:
> >
> > > On February 24, 2014 8:34:30 AM PST, Vince Weaver <[email protected]> wrote:
> > > >On Mon, 24 Feb 2014, Vince Weaver wrote:
> > > >
> > > >> Just touching the mmap page with a write of a single byte (it doesn't
> > > >
> > > >> matter where) is enough to trigger the bug.
> > > >
> > > >OK, investigating this more.
> > > >
> > > >perf_fuzzer-2971 [000] 154.944114: page_fault_user:
> > > >address=0xf7729000 ip=0x41efab error_code=0x6
> > > >perf_fuzzer-2971 [000] 154.944118: function:
> > > >ip=0xffffffff810d40e7 parent_ip=0xffffffff810d0840
> > > >perf_fuzzer-2971 [000] 154.944119: function:
> > > >ip=0xffffffff812a91a5 parent_ip=0xffffffff81013ff5
> > > >perf_fuzzer-2971 [000] 154.944120: function:
> > > >ip=0xffffffff8153837c parent_ip=0xffffffff81535432
> > > >perf_fuzzer-2971 [000] 154.944121: page_fault_kernel:
> > > >address=0x22e0 ip=0xffffffff812a7d5c error_code=0x0
> >
> > > Ok, so the obvious question is what is at that kernel address?
> > >
> >
> > It's in copy_user_generic_string()
> > rep movsq %ds:(%rsi),%es:(%rdi)
> >
> > And looking at the ftrace:
> > perf_fuzzer-2979 [000] 161.475920: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6
> > perf_fuzzer-2979 [000] 161.475922: function: perf_callchain
> > perf_fuzzer-2979 [000] 161.475922: function: copy_from_user_nmi
> > perf_fuzzer-2979 [000] 161.475923: function: trace_do_page_fault
> > perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
> >
> > What is likely happening is the user page fault is triggering
> > code to do a "perf_callchain" dump, which is calling copy_from_user_nmi()
> > which calls copy_user_generic_string() which is somehow getting the user
> > RBP in the RDI register somehow?
>
> So that code very much relies on the 'recursive' NMI/iret magic from
> Steve, patch 3f3c8b8c4b2a3 (and assorted fixes later).
>
> If CR2 is getting corrupted; 7fbb98c5cb075 seems relevant.
>
> Peter, does x32 have a slightly different ABI/calling convention that
> would make any of these patches just slightly 'off'?

I do note that
perf_callchain_user();

Does
fp = (void __user *)regs->bp;

...

bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));


And in my particular executable RBP has nothing to do with a frame
pointer, but is instead being used as a general purpose register.

Am I missing something here? Though in that case I'm not sure why this
wouldn't be easier to trigger.

Vince

2014-02-24 17:39:11

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, Vince Weaver wrote:

> I do note that
> perf_callchain_user();
>
> Does
> fp = (void __user *)regs->bp;
>
> ...
>
> bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
>
>
> And in my particular executable RBP has nothing to do with a frame
> pointer, but is instead being used as a general purpose register.

and as a reminder, I'm seeing this on an x32 executable, so
perf_callchain_user32() is probably coming into play.

So maybe it is an x32 issue after all.

Vince

2014-02-24 17:41:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, Feb 24, 2014 at 12:32:39PM -0500, Vince Weaver wrote:
> I do note that
> perf_callchain_user();
>
> Does
> fp = (void __user *)regs->bp;
>
> ...
>
> bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
>
>
> And in my particular executable RBP has nothing to do with a frame
> pointer, but is instead being used as a general purpose register.
>
> Am I missing something here? Though in that case I'm not sure why this
> wouldn't be easier to trigger.

Ah, in case the frame doesn't actually exist we would expect to fault
and get the fixup treatment, returning a short copy (the return value
being bytes _NOT_ copied).

When that happens;

if (bytes != 0)
break;

At which point we'll terminate the stack frame iteration.

This is where we rely on being able to take a fault from NMI context,
the fault iret will re-enable NMIs, necessitating all the magic Steve
did.

2014-02-24 17:41:46

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/24/2014 09:32 AM, Vince Weaver wrote:
>>
>> Peter, does x32 have a slightly different ABI/calling convention that
>> would make any of these patches just slightly 'off'?
>
> I do note that
> perf_callchain_user();
>
> Does
> fp = (void __user *)regs->bp;
>
> ...
>
> bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
>
>
> And in my particular executable RBP has nothing to do with a frame
> pointer, but is instead being used as a general purpose register.
>
> Am I missing something here? Though in that case I'm not sure why this
> wouldn't be easier to trigger.
>

Neither x86-64 nor x32 are typically compiled with fixed frame pointers
(which would be %rbp if they are). So I'm guessing the perf_callchain
logic is only applicable to a user-space binary explicitly compiled with
frame pointers turned on.

So copy_from_user_nmi() stumbles onto a nonexistent page and takes a
page fault. This isn't a big deal, because perf_callchain_user() is set
up to handle that (and just terminates the trace), *except* now CR2 is
corrupt, and we took this event while handling a page fault already...
and apparently before we even did read_cr2() in __do_page_fault.

The description of copy_from_user_nmi() states:

/*
* We rely on the nested NMI work to allow atomic faults from the NMI
path; the
* nested NMI paths are careful to preserve CR2.
*/

... but that doesn't seem to happen here for whatever reason.

There is no hint in your trace what happens after the kernel page fault
so that makes it hard to know.

-hpa

2014-02-24 17:43:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/24/2014 09:41 AM, Vince Weaver wrote:
> On Mon, 24 Feb 2014, Vince Weaver wrote:
>
>> I do note that
>> perf_callchain_user();
>>
>> Does
>> fp = (void __user *)regs->bp;
>>
>> ...
>>
>> bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
>>
>>
>> And in my particular executable RBP has nothing to do with a frame
>> pointer, but is instead being used as a general purpose register.
>
> and as a reminder, I'm seeing this on an x32 executable, so
> perf_callchain_user32() is probably coming into play.
>
> So maybe it is an x32 issue after all.
>

No.

if (!test_thread_flag(TIF_IA32))
return 0;

TIF_IA32 is clear for an x32 process.

-hpa

2014-02-24 17:52:55

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/24/2014 09:25 AM, Peter Zijlstra wrote:
>>
>> What is likely happening is the user page fault is triggering
>> code to do a "perf_callchain" dump, which is calling copy_from_user_nmi()
>> which calls copy_user_generic_string() which is somehow getting the user
>> RBP in the RDI register somehow?
>
> So that code very much relies on the 'recursive' NMI/iret magic from
> Steve, patch 3f3c8b8c4b2a3 (and assorted fixes later).
>
> If CR2 is getting corrupted; 7fbb98c5cb075 seems relevant.
>
> Peter, does x32 have a slightly different ABI/calling convention that
> would make any of these patches just slightly 'off'?
>

As long as we're talking kernel code, x32 isn't even involved (we do not
support compiling the kernel as x32 and most likely never will.)

-hpa

2014-02-24 17:58:19

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, H. Peter Anvin wrote:

> On 02/24/2014 09:32 AM, Vince Weaver wrote:
> >>
> >> Peter, does x32 have a slightly different ABI/calling convention that
> >> would make any of these patches just slightly 'off'?
> >
> > I do note that
> > perf_callchain_user();
> >
> > Does
> > fp = (void __user *)regs->bp;
> >
> > ...
> >
> > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
> >
> >
> > And in my particular executable RBP has nothing to do with a frame
> > pointer, but is instead being used as a general purpose register.
> >
> > Am I missing something here? Though in that case I'm not sure why this
> > wouldn't be easier to trigger.
> >
>
> Neither x86-64 nor x32 are typically compiled with fixed frame pointers
> (which would be %rbp if they are). So I'm guessing the perf_callchain
> logic is only applicable to a user-space binary explicitly compiled with
> frame pointers turned on.
>
> So copy_from_user_nmi() stumbles onto a nonexistent page and takes a
> page fault. This isn't a big deal, because perf_callchain_user() is set
> up to handle that (and just terminates the trace), *except* now CR2 is
> corrupt, and we took this event while handling a page fault already...
> and apparently before we even did read_cr2() in __do_page_fault.
>
> The description of copy_from_user_nmi() states:
>
> /*
> * We rely on the nested NMI work to allow atomic faults from the NMI
> path; the
> * nested NMI paths are careful to preserve CR2.
> */
>
> ... but that doesn't seem to happen here for whatever reason.
>
> There is no hint in your trace what happens after the kernel page fault
> so that makes it hard to know.

Ahh, ftrace, the cause of and solution to all my perf_fuzzing problems.

Anyway I've attached the full tail end of the trace if you want to see
everything that happens.

Vince


Attachments:
little.trace.bz2 (59.99 kB)

2014-02-24 18:05:16

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, Vince Weaver wrote:

> On Mon, 24 Feb 2014, H. Peter Anvin wrote:
>
> > On 02/24/2014 09:32 AM, Vince Weaver wrote:
> > >>
> > >> Peter, does x32 have a slightly different ABI/calling convention that
> > >> would make any of these patches just slightly 'off'?
> > >
> > > I do note that
> > > perf_callchain_user();
> > >
> > > Does
> > > fp = (void __user *)regs->bp;
> > >
> > > ...
> > >
> > > bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
> > >
> > >
> > > And in my particular executable RBP has nothing to do with a frame
> > > pointer, but is instead being used as a general purpose register.
> > >
> > > Am I missing something here? Though in that case I'm not sure why this
> > > wouldn't be easier to trigger.
> > >
> >
> > Neither x86-64 nor x32 are typically compiled with fixed frame pointers
> > (which would be %rbp if they are). So I'm guessing the perf_callchain
> > logic is only applicable to a user-space binary explicitly compiled with
> > frame pointers turned on.
> >
> > So copy_from_user_nmi() stumbles onto a nonexistent page and takes a
> > page fault. This isn't a big deal, because perf_callchain_user() is set
> > up to handle that (and just terminates the trace), *except* now CR2 is
> > corrupt, and we took this event while handling a page fault already...
> > and apparently before we even did read_cr2() in __do_page_fault.
> >
> > The description of copy_from_user_nmi() states:
> >
> > /*
> > * We rely on the nested NMI work to allow atomic faults from the NMI
> > path; the
> > * nested NMI paths are careful to preserve CR2.
> > */
> >
> > ... but that doesn't seem to happen here for whatever reason.
> >
> > There is no hint in your trace what happens after the kernel page fault
> > so that makes it hard to know.
>
> Ahh, ftrace, the cause of and solution to all my perf_fuzzing problems.
>
> Anyway I've attached the full tail end of the trace if you want to see
> everything that happens.

and then I note there are *two* kernel page faults.

perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
address=0x1 ip=0xffffffff812a7d9c error_code=0x0
perf_fuzzer-2979 [000] 161.475924: function: __do_page_fault
perf_fuzzer-2979 [000] 161.475924: function: bad_area_nosemaphore
perf_fuzzer-2979 [000] 161.475925: function: __bad_area_nosemaphore
perf_fuzzer-2979 [000] 161.475925: function: no_context
perf_fuzzer-2979 [000] 161.475925: function: fixup_exception
perf_fuzzer-2979 [000] 161.475926: function: search_exception_tables
perf_fuzzer-2979 [000] 161.475926: function: search_extable
perf_fuzzer-2979 [000] 161.475927: function: copy_user_handle_tail
perf_fuzzer-2979 [000] 161.475927: function: trace_do_page_fault
perf_fuzzer-2979 [000] 161.475928: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
address=0x1 ip=0xffffffff812a92bb error_code=0x0
perf_fuzzer-2979 [000] 161.475928: function: __do_page_fault
perf_fuzzer-2979 [000] 161.475928: function: bad_area_nosemaphore
perf_fuzzer-2979 [000] 161.475929: function: __bad_area_nosemaphore
perf_fuzzer-2979 [000] 161.475929: function: no_context
perf_fuzzer-2979 [000] 161.475929: function: fixup_exception
perf_fuzzer-2979 [000] 161.475929: function: search_exception_tables
perf_fuzzer-2979 [000] 161.475930: function: search_extable
perf_fuzzer-2979 [000] 161.475931: function: perf_output_begin
perf_fuzzer-2979 [000] 161.475931: function: perf_output_copy

That second one is in copy_user_handle_tail()

Sorry for the sloppy analysis here, I did most of the initial tracing last
night at 1am typing one-handed with a sick crying baby draped over one
shoulder, so not really operating at my best.

Vince

2014-02-24 18:34:58

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/24/2014 10:07 AM, Vince Weaver wrote:
>>
>> Anyway I've attached the full tail end of the trace if you want to see
>> everything that happens.
>
> and then I note there are *two* kernel page faults.
>
> perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
> address=0x1 ip=0xffffffff812a7d9c error_code=0x0
> perf_fuzzer-2979 [000] 161.475924: function: __do_page_fault
> perf_fuzzer-2979 [000] 161.475924: function: bad_area_nosemaphore
> perf_fuzzer-2979 [000] 161.475925: function: __bad_area_nosemaphore
> perf_fuzzer-2979 [000] 161.475925: function: no_context
> perf_fuzzer-2979 [000] 161.475925: function: fixup_exception
> perf_fuzzer-2979 [000] 161.475926: function: search_exception_tables
> perf_fuzzer-2979 [000] 161.475926: function: search_extable
> perf_fuzzer-2979 [000] 161.475927: function: copy_user_handle_tail
> perf_fuzzer-2979 [000] 161.475927: function: trace_do_page_fault
> perf_fuzzer-2979 [000] 161.475928: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
> address=0x1 ip=0xffffffff812a92bb error_code=0x0
> perf_fuzzer-2979 [000] 161.475928: function: __do_page_fault
> perf_fuzzer-2979 [000] 161.475928: function: bad_area_nosemaphore
> perf_fuzzer-2979 [000] 161.475929: function: __bad_area_nosemaphore
> perf_fuzzer-2979 [000] 161.475929: function: no_context
> perf_fuzzer-2979 [000] 161.475929: function: fixup_exception
> perf_fuzzer-2979 [000] 161.475929: function: search_exception_tables
> perf_fuzzer-2979 [000] 161.475930: function: search_extable
> perf_fuzzer-2979 [000] 161.475931: function: perf_output_begin
> perf_fuzzer-2979 [000] 161.475931: function: perf_output_copy
>
> That second one is in copy_user_handle_tail()
>

Either way, it really seems like we have a case of CR2 leakage out of
the NMI context.

-hpa

2014-02-24 19:13:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014 10:34:13 -0800
"H. Peter Anvin" <[email protected]> wrote:

> On 02/24/2014 10:07 AM, Vince Weaver wrote:
> >>
> >> Anyway I've attached the full tail end of the trace if you want to see
> >> everything that happens.
> >
> > and then I note there are *two* kernel page faults.
> >
> > perf_fuzzer-2979 [000] 161.475924: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
> > address=0x1 ip=0xffffffff812a7d9c error_code=0x0
> > perf_fuzzer-2979 [000] 161.475924: function: __do_page_fault
> > perf_fuzzer-2979 [000] 161.475924: function: bad_area_nosemaphore
> > perf_fuzzer-2979 [000] 161.475925: function: __bad_area_nosemaphore
> > perf_fuzzer-2979 [000] 161.475925: function: no_context
> > perf_fuzzer-2979 [000] 161.475925: function: fixup_exception
> > perf_fuzzer-2979 [000] 161.475926: function: search_exception_tables
> > perf_fuzzer-2979 [000] 161.475926: function: search_extable
> > perf_fuzzer-2979 [000] 161.475927: function: copy_user_handle_tail
> > perf_fuzzer-2979 [000] 161.475927: function: trace_do_page_fault
> > perf_fuzzer-2979 [000] 161.475928: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
> > address=0x1 ip=0xffffffff812a92bb error_code=0x0
> > perf_fuzzer-2979 [000] 161.475928: function: __do_page_fault
> > perf_fuzzer-2979 [000] 161.475928: function: bad_area_nosemaphore
> > perf_fuzzer-2979 [000] 161.475929: function: __bad_area_nosemaphore
> > perf_fuzzer-2979 [000] 161.475929: function: no_context
> > perf_fuzzer-2979 [000] 161.475929: function: fixup_exception
> > perf_fuzzer-2979 [000] 161.475929: function: search_exception_tables
> > perf_fuzzer-2979 [000] 161.475930: function: search_extable
> > perf_fuzzer-2979 [000] 161.475931: function: perf_output_begin
> > perf_fuzzer-2979 [000] 161.475931: function: perf_output_copy
> >
> > That second one is in copy_user_handle_tail()
> >
>
> Either way, it really seems like we have a case of CR2 leakage out of
> the NMI context.

Ah, and x86_64 saves off the cr2 register when entering NMI and restores
it before returning. But it seems to be missing from the i386 code.

-- Steve

2014-02-24 19:16:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/24/2014 11:13 AM, Steven Rostedt wrote:
>>
>> Either way, it really seems like we have a case of CR2 leakage out of
>> the NMI context.
>
> Ah, and x86_64 saves off the cr2 register when entering NMI and restores
> it before returning. But it seems to be missing from the i386 code.
>

OK, that might be a problem, but this is the 64-bit kernel.

-hpa

2014-02-24 19:30:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote:
> Ah, and x86_64 saves off the cr2 register when entering NMI and restores
> it before returning. But it seems to be missing from the i386 code.

arch/x86/kernel/nmi.c:

#define nmi_nesting_preprocess(regs) \
do { \
if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \
this_cpu_write(nmi_state, NMI_LATCHED); \
return; \
} \
this_cpu_write(nmi_state, NMI_EXECUTING); \
this_cpu_write(nmi_cr2, read_cr2()); \
} while (0); \
nmi_restart:

#define nmi_nesting_postprocess() \
do { \
if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
write_cr2(this_cpu_read(nmi_cr2)); \
if (this_cpu_dec_return(nmi_state)) \
goto nmi_restart; \
} while (0)

That very much looks like saving/restoring CR2 to me.

FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic
are so completely different.

2014-02-24 19:32:52

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014 20:30:43 +0100
Peter Zijlstra <[email protected]> wrote:

> On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote:
> > Ah, and x86_64 saves off the cr2 register when entering NMI and restores
> > it before returning. But it seems to be missing from the i386 code.
>
> arch/x86/kernel/nmi.c:
>
> #define nmi_nesting_preprocess(regs) \
> do { \
> if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \
> this_cpu_write(nmi_state, NMI_LATCHED); \
> return; \
> } \
> this_cpu_write(nmi_state, NMI_EXECUTING); \
> this_cpu_write(nmi_cr2, read_cr2()); \
> } while (0); \
> nmi_restart:
>
> #define nmi_nesting_postprocess() \
> do { \
> if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
> write_cr2(this_cpu_read(nmi_cr2)); \
> if (this_cpu_dec_return(nmi_state)) \
> goto nmi_restart; \
> } while (0)
>
> That very much looks like saving/restoring CR2 to me.
>

Ah, I forgot to look in the C file.

> FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic
> are so completely different.

Yeah, it sucks. But that's because the architecture on handling NMIs is
completely different. x86_64 swaps the stack automatically, where as
i386 uses the same stack if you are already in the kernel.

-- Steve

2014-02-25 03:50:24

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/24/2014 11:30 AM, Peter Zijlstra wrote:
> On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote:
>> Ah, and x86_64 saves off the cr2 register when entering NMI and restores
>> it before returning. But it seems to be missing from the i386 code.
>
> arch/x86/kernel/nmi.c:
>
> #define nmi_nesting_preprocess(regs) \
> do { \
> if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \
> this_cpu_write(nmi_state, NMI_LATCHED); \
> return; \
> } \
> this_cpu_write(nmi_state, NMI_EXECUTING); \
> this_cpu_write(nmi_cr2, read_cr2()); \
> } while (0); \
> nmi_restart:
>
> #define nmi_nesting_postprocess() \
> do { \
> if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
> write_cr2(this_cpu_read(nmi_cr2)); \
> if (this_cpu_dec_return(nmi_state)) \
> goto nmi_restart; \
> } while (0)
>
> That very much looks like saving/restoring CR2 to me.
>
> FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic
> are so completely different.

Is there any way that nmi_cr2 can end up getting overwritten by multiple
nestings of some kind? I would have thought it would have made more
sense to spill cr2 onto the stack after the stack has been properly set up.

-hpa

2014-02-25 14:05:53

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Mon, 24 Feb 2014, H. Peter Anvin wrote:

> On 02/24/2014 11:30 AM, Peter Zijlstra wrote:
> > On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote:
> >> Ah, and x86_64 saves off the cr2 register when entering NMI and restores
> >> it before returning. But it seems to be missing from the i386 code.
> >
> > arch/x86/kernel/nmi.c:
> >
> > #define nmi_nesting_preprocess(regs) \
> > do { \
> > if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \
> > this_cpu_write(nmi_state, NMI_LATCHED); \
> > return; \
> > } \
> > this_cpu_write(nmi_state, NMI_EXECUTING); \
> > this_cpu_write(nmi_cr2, read_cr2()); \
> > } while (0); \
> > nmi_restart:
> >
> > #define nmi_nesting_postprocess() \
> > do { \
> > if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
> > write_cr2(this_cpu_read(nmi_cr2)); \
> > if (this_cpu_dec_return(nmi_state)) \
> > goto nmi_restart; \
> > } while (0)
> >
> > That very much looks like saving/restoring CR2 to me.
> >
> > FWIW; I hate how the x86_64 and i386 versions of this NMI nesting magic
> > are so completely different.
>
> Is there any way that nmi_cr2 can end up getting overwritten by multiple
> nestings of some kind? I would have thought it would have made more
> sense to spill cr2 onto the stack after the stack has been properly set up.

So how can I help with debugging this?

While the missing cr2 issue made debugging frustrating, I find the other
aspects of the bug more serious:

1. Programs that are doing valid memory accesses can segfault
and worse
2. This bug can cause an instant-reboot of the system (I assume somehow
with the right combination of memory accesses it causes a
triple-fault?)

#2 is why I spent all of this time tracking this down, I couldn't leave a
machine fuzzing overnight without the machine rebooting.

Vince

2014-02-25 14:36:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

#2 is what I really don't understand.

I worry something else is going on there

On February 25, 2014 6:07:51 AM PST, Vince Weaver <[email protected]> wrote:
>On Mon, 24 Feb 2014, H. Peter Anvin wrote:
>
>> On 02/24/2014 11:30 AM, Peter Zijlstra wrote:
>> > On Mon, Feb 24, 2014 at 02:13:29PM -0500, Steven Rostedt wrote:
>> >> Ah, and x86_64 saves off the cr2 register when entering NMI and
>restores
>> >> it before returning. But it seems to be missing from the i386
>code.
>> >
>> > arch/x86/kernel/nmi.c:
>> >
>> > #define nmi_nesting_preprocess(regs) \
>> > do { \
>> > if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \
>> > this_cpu_write(nmi_state, NMI_LATCHED); \
>> > return; \
>> > } \
>> > this_cpu_write(nmi_state, NMI_EXECUTING); \
>> > this_cpu_write(nmi_cr2, read_cr2()); \
>> > } while (0); \
>> > nmi_restart:
>> >
>> > #define nmi_nesting_postprocess() \
>> > do { \
>> > if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
>> > write_cr2(this_cpu_read(nmi_cr2)); \
>> > if (this_cpu_dec_return(nmi_state)) \
>> > goto nmi_restart; \
>> > } while (0)
>> >
>> > That very much looks like saving/restoring CR2 to me.
>> >
>> > FWIW; I hate how the x86_64 and i386 versions of this NMI nesting
>magic
>> > are so completely different.
>>
>> Is there any way that nmi_cr2 can end up getting overwritten by
>multiple
>> nestings of some kind? I would have thought it would have made more
>> sense to spill cr2 onto the stack after the stack has been properly
>set up.
>
>So how can I help with debugging this?
>
>While the missing cr2 issue made debugging frustrating, I find the
>other
>aspects of the bug more serious:
>
> 1. Programs that are doing valid memory accesses can segfault
>and worse
>2. This bug can cause an instant-reboot of the system (I assume
>somehow
> with the right combination of memory accesses it causes a
> triple-fault?)
>
>#2 is why I spent all of this time tracking this down, I couldn't leave
>a
>machine fuzzing overnight without the machine rebooting.
>
>Vince

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-25 14:47:04

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Tue, 25 Feb 2014 06:34:55 -0800
"H. Peter Anvin" <[email protected]> wrote:

> #2 is what I really don't understand.
>
> I worry something else is going on there

Yeah, me too.

-- Steve

>
> >
> >While the missing cr2 issue made debugging frustrating, I find the
> >other
> >aspects of the bug more serious:
> >
> > 1. Programs that are doing valid memory accesses can segfault
> >and worse
> >2. This bug can cause an instant-reboot of the system (I assume
> >somehow
> > with the right combination of memory accesses it causes a
> > triple-fault?)
> >
> >#2 is why I spent all of this time tracking this down, I couldn't leave
> >a
> >machine fuzzing overnight without the machine rebooting.
> >
> >Vince
>

2014-02-25 15:31:32

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Tue, 25 Feb 2014, Steven Rostedt wrote:

> On Tue, 25 Feb 2014 06:34:55 -0800
> "H. Peter Anvin" <[email protected]> wrote:
>
> > #2 is what I really don't understand.
> >
> > I worry something else is going on there
>
> Yeah, me too.
>

OK, well I'll work on isolating that next, I was hoping the segfault issue
was related. It's a lot harder getting traces out of a machine that
insta-reboots so fast that it can't even finish printing the oops message.

I need my testing machine for some unrelated experiments so I might not
get a chance to work on this further for a day or two.

Vince

2014-02-26 15:04:28

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Tue, 25 Feb 2014, Vince Weaver wrote:

> On Tue, 25 Feb 2014, Steven Rostedt wrote:
>
> > On Tue, 25 Feb 2014 06:34:55 -0800
> > "H. Peter Anvin" <[email protected]> wrote:
> >
> > > #2 is what I really don't understand.
> > >
> > > I worry something else is going on there
> >
> > Yeah, me too.
> >
>
> OK, well I'll work on isolating that next, I was hoping the segfault issue
> was related. It's a lot harder getting traces out of a machine that
> insta-reboots so fast that it can't even finish printing the oops message.

OK, dug up the random seeds that are causing the system to reboot, and it
turns out that it's hard to reproduce the actual reboot, but in the cases
where it doesn't reboot it segfaults in the same code path that was
causing the other segfaults in this thread. So I think it's all the same
bug, although I can investigate further if needed.

Vince

2014-02-27 22:04:29

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot


I spent some more time on this.
I managed to get a trace that exhibited the bug practically right
away, but still unable to generate a reproducible trace :(

So instead I'm adding WARN's and trace_printks to see what I can find out.

Here's a summary of what I think is happneing. Please let me know if I'm
wildly wrong in analyzing this.



Userspace accesses the perf
ring-buffer-user-page of an event for the first time (in this trace
it's a SW_CPU_CLOCK event).

This triggers a page_fault to bring in the page.

This triggers a TRACEPOINT event (task/task_newtask) which
has PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_ID|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_STREAM_ID|PERF_SAMPLE_WEIGHT|PERF_SAMPLE_DATA_SRC|PERF_SAMPLE_TRANSACTION;
and exclude_callchain_kernel=1, triggering a perf_callchain().

Here is a dump of the stack right after we enter perf_callchain()

vince@core2:~$ [ 202.320444] ------------[ cut here ]------------
[ 202.324001] WARNING: CPU: 1 PID: 2873 at kernel/events/callchain.c:168 perf_callchain+0x67/0x211()
hermal_sys ehci_pci ehci_hcd sg sd_mod usbcore usb_common
[ 202.324001] CPU: 1 PID: 2873 Comm: perf_fuzzer Not tainted 3.14.0-rc3+ #24
[ 202.324001] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
[ 202.324001] 00000000000000a8 ffff880119ec1b48 ffffffff81531dc1 00000000000000a8
[ 202.324001] 0000000000000000 ffff880119ec1b88 ffffffff81040ce4 ffff880119ec1b88
[ 202.324001] ffffffff810d416f ffff880119ec1e38 ffff8800cba67800 0000000000000000
[ 202.324001] Call Trace:
[ 202.324001] [<ffffffff81531dc1>] dump_stack+0x49/0x60
[ 202.324001] [<ffffffff81040ce4>] warn_slowpath_common+0x81/0x9b
[ 202.324001] [<ffffffff810d416f>] ? perf_callchain+0x67/0x211
[ 202.324001] [<ffffffff81040d18>] warn_slowpath_null+0x1a/0x1c
[ 202.324001] [<ffffffff810d416f>] perf_callchain+0x67/0x211
[ 202.324001] [<ffffffff8106b0ab>] ? local_clock+0x1b/0x24
[ 202.324001] [<ffffffff810d0872>] perf_prepare_sample+0x7b/0x304
[ 202.324001] [<ffffffff810d1079>] __perf_event_overflow+0x156/0x1c1
[ 202.324001] [<ffffffff810f4613>] ? free_pgtables+0xa7/0xc9
[ 202.324001] [<ffffffff810d125e>] perf_swevent_overflow+0x41/0x5b
[ 202.324001] [<ffffffff810d12ea>] perf_swevent_event+0x72/0x74
[ 202.324001] [<ffffffff810d1478>] perf_tp_event+0xea/0x1ef
[ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba
[ 202.324001] [<ffffffff810351b6>] perf_trace_x86_exceptions+0xa8/0xba
[ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba
[ 202.324001] [<ffffffff81538439>] trace_do_page_fault+0x48/0x99
[ 202.324001] [<ffffffff815354b2>] trace_page_fault+0x22/0x30
[ 202.324001] ---[ end trace 31ccd31b4e82cb42 ]---

This triggers a kernel pagefault because the BP register is not valid and
copy_from_user_nmi() tries to copy the user stack area from there.

This *should* be OK? And it maybe looks like it works, but then

copy_user_handle_tail()

causes another page fault again at the invaid BP register.

But then the code continues to
perf_output_begin()


237.265689: page_fault_kernel: address=0x17a0 ip=copy_user_handle_tail error_code=0x0
237.265689: function: __do_page_fault
237.265689: function: bad_area_nosemaphore
237.265690: function: __bad_area_nosemaphore
237.265690: function: no_context
237.265690: function: fixup_exception
237.265690: function: search_exception_tables
237.265690: function: search_extable
237.265691: function: perf_output_begin
237.265692: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e7

how are we back in __do_page_fault again here?

237.265692: function: __do_page_fault
237.265692: function: down_read_trylock
237.265692: function: _cond_resched
237.265693: function: find_vma
237.265693: function: bad_area
237.265693: function: up_read
237.265693: function: __bad_area_nosemaphore
237.265694: function: is_prefetch
237.265694: function: convert_ip_to_linear
237.265695: function: unhandled_signal
237.265695: function: __printk_ratelimit
237.265695: function: _raw_spin_trylock
237.265695: function: _raw_spin_unlock_irqrestore
237.265696: function: printk
237.265696: function: vprintk_emit

[ 202.877004] perf_fuzzer[2873]: segfault at 17a0 ip 00000000004017fd sp 00000000ffd19d10 error 6 in perf_fuzzer[400000+d1000]


Vince

2014-02-27 22:32:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, 27 Feb 2014 17:06:36 -0500 (EST)
Vince Weaver <[email protected]> wrote:

>
> I spent some more time on this.
> I managed to get a trace that exhibited the bug practically right
> away, but still unable to generate a reproducible trace :(
>
> So instead I'm adding WARN's and trace_printks to see what I can find out.
>
> Here's a summary of what I think is happneing. Please let me know if I'm
> wildly wrong in analyzing this.
>
>
>
> Userspace accesses the perf
> ring-buffer-user-page of an event for the first time (in this trace
> it's a SW_CPU_CLOCK event).
>
> This triggers a page_fault to bring in the page.
>
> This triggers a TRACEPOINT event (task/task_newtask) which
> has PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CALLCHAIN|PERF_SAMPLE_ID|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_STREAM_ID|PERF_SAMPLE_WEIGHT|PERF_SAMPLE_DATA_SRC|PERF_SAMPLE_TRANSACTION;
> and exclude_callchain_kernel=1, triggering a perf_callchain().
>
> Here is a dump of the stack right after we enter perf_callchain()
>
> vince@core2:~$ [ 202.320444] ------------[ cut here ]------------
> [ 202.324001] WARNING: CPU: 1 PID: 2873 at kernel/events/callchain.c:168 perf_callchain+0x67/0x211()
> hermal_sys ehci_pci ehci_hcd sg sd_mod usbcore usb_common
> [ 202.324001] CPU: 1 PID: 2873 Comm: perf_fuzzer Not tainted 3.14.0-rc3+ #24
> [ 202.324001] Hardware name: AOpen DE7000/nMCP7ALPx-DE R1.06 Oct.19.2012, BIOS 080015 10/19/2012
> [ 202.324001] 00000000000000a8 ffff880119ec1b48 ffffffff81531dc1 00000000000000a8
> [ 202.324001] 0000000000000000 ffff880119ec1b88 ffffffff81040ce4 ffff880119ec1b88
> [ 202.324001] ffffffff810d416f ffff880119ec1e38 ffff8800cba67800 0000000000000000
> [ 202.324001] Call Trace:
> [ 202.324001] [<ffffffff81531dc1>] dump_stack+0x49/0x60
> [ 202.324001] [<ffffffff81040ce4>] warn_slowpath_common+0x81/0x9b
> [ 202.324001] [<ffffffff810d416f>] ? perf_callchain+0x67/0x211
> [ 202.324001] [<ffffffff81040d18>] warn_slowpath_null+0x1a/0x1c
> [ 202.324001] [<ffffffff810d416f>] perf_callchain+0x67/0x211
> [ 202.324001] [<ffffffff8106b0ab>] ? local_clock+0x1b/0x24
> [ 202.324001] [<ffffffff810d0872>] perf_prepare_sample+0x7b/0x304
> [ 202.324001] [<ffffffff810d1079>] __perf_event_overflow+0x156/0x1c1
> [ 202.324001] [<ffffffff810f4613>] ? free_pgtables+0xa7/0xc9
> [ 202.324001] [<ffffffff810d125e>] perf_swevent_overflow+0x41/0x5b
> [ 202.324001] [<ffffffff810d12ea>] perf_swevent_event+0x72/0x74
> [ 202.324001] [<ffffffff810d1478>] perf_tp_event+0xea/0x1ef
> [ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba
> [ 202.324001] [<ffffffff810351b6>] perf_trace_x86_exceptions+0xa8/0xba
> [ 202.324001] [<ffffffff8103515a>] ? perf_trace_x86_exceptions+0x4c/0xba
> [ 202.324001] [<ffffffff81538439>] trace_do_page_fault+0x48/0x99
> [ 202.324001] [<ffffffff815354b2>] trace_page_fault+0x22/0x30
> [ 202.324001] ---[ end trace 31ccd31b4e82cb42 ]---
>
> This triggers a kernel pagefault because the BP register is not valid and
> copy_from_user_nmi() tries to copy the user stack area from there.
>
> This *should* be OK? And it maybe looks like it works, but then
>
> copy_user_handle_tail()
>
> causes another page fault again at the invaid BP register.
>
> But then the code continues to
> perf_output_begin()
>
>
> 237.265689: page_fault_kernel: address=0x17a0 ip=copy_user_handle_tail error_code=0x0
> 237.265689: function: __do_page_fault
> 237.265689: function: bad_area_nosemaphore
> 237.265690: function: __bad_area_nosemaphore
> 237.265690: function: no_context
> 237.265690: function: fixup_exception
> 237.265690: function: search_exception_tables
> 237.265690: function: search_extable
> 237.265691: function: perf_output_begin

You can also allow perf to be traced. Remove the CFLAGS_REMOVE_core.o
line from kernel/events/Makefile.

> 237.265692: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e7
>
> how are we back in __do_page_fault again here?

Well, the perf ring buffer is vmalloced, right? That can cause a page
fault too.

>
> 237.265692: function: __do_page_fault

Try adding a trace_printk() to see all the page faults?

> 237.265692: function: down_read_trylock
> 237.265692: function: _cond_resched
> 237.265693: function: find_vma
> 237.265693: function: bad_area
> 237.265693: function: up_read
> 237.265693: function: __bad_area_nosemaphore
> 237.265694: function: is_prefetch
> 237.265694: function: convert_ip_to_linear
> 237.265695: function: unhandled_signal
> 237.265695: function: __printk_ratelimit
> 237.265695: function: _raw_spin_trylock
> 237.265695: function: _raw_spin_unlock_irqrestore
> 237.265696: function: printk
> 237.265696: function: vprintk_emit
>
> [ 202.877004] perf_fuzzer[2873]: segfault at 17a0 ip 00000000004017fd sp 00000000ffd19d10 error 6 in perf_fuzzer[400000+d1000]
>

Yeah, something is getting mesed up.

-- Steve

2014-02-27 22:53:48

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/27/2014 02:31 PM, Steven Rostedt wrote:
>
> Yeah, something is getting mesed up.
>

What it *looks* like to me is that we try to nest the cr2 save/restore,
which doesn't nest because it is a percpu variable.

... except in the x86-64 case, we *ALSO* save/restore cr2 inside
entry_64.S, which makes the stuff in do_nmi completely redundant and
there for no good reason.

I would actually suggest we do the equivalent on i386 as well.

Vince, could you try this patch as an experiment?



Attachments:
patch (952.00 B)

2014-02-27 23:30:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, 27 Feb 2014 14:52:54 -0800
"H. Peter Anvin" <[email protected]> wrote:

> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
> >
> > Yeah, something is getting mesed up.
> >
>
> What it *looks* like to me is that we try to nest the cr2 save/restore,
> which doesn't nest because it is a percpu variable.
>
> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
> entry_64.S, which makes the stuff in do_nmi completely redundant and
> there for no good reason.

Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32
section. That is, it isn't even executed. That's i386 code. The only
place the cr2 is saved for x86_64 is in entry_64.S.

-- Steve

>
> I would actually suggest we do the equivalent on i386 as well.
>
> Vince, could you try this patch as an experiment?
>
>

2014-02-27 23:47:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/27/2014 03:30 PM, Steven Rostedt wrote:
> On Thu, 27 Feb 2014 14:52:54 -0800
> "H. Peter Anvin" <[email protected]> wrote:
>
>> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
>>>
>>> Yeah, something is getting mesed up.
>>>
>>
>> What it *looks* like to me is that we try to nest the cr2 save/restore,
>> which doesn't nest because it is a percpu variable.
>>
>> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
>> entry_64.S, which makes the stuff in do_nmi completely redundant and
>> there for no good reason.
>
> Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32
> section. That is, it isn't even executed. That's i386 code. The only
> place the cr2 is saved for x86_64 is in entry_64.S.
>

Right, egg on my face. However, I still think it would make more sense
for it to nest the way entry_64.S does if at all possible.

That makes this even more confusing, though. I would still like to see
what happens with the patch I sent Vince.

-hpa

2014-02-28 00:58:00

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, 27 Feb 2014, H. Peter Anvin wrote:

> On 02/27/2014 03:30 PM, Steven Rostedt wrote:
> > On Thu, 27 Feb 2014 14:52:54 -0800
> > "H. Peter Anvin" <[email protected]> wrote:
> >
> >> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
> >>>
> >>> Yeah, something is getting mesed up.
> >>>
> >>
> >> What it *looks* like to me is that we try to nest the cr2 save/restore,
> >> which doesn't nest because it is a percpu variable.
> >>
> >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
> >> entry_64.S, which makes the stuff in do_nmi completely redundant and
> >> there for no good reason.
> >
> > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32
> > section. That is, it isn't even executed. That's i386 code. The only
> > place the cr2 is saved for x86_64 is in entry_64.S.
> >
>
> Right, egg on my face. However, I still think it would make more sense
> for it to nest the way entry_64.S does if at all possible.
>
> That makes this even more confusing, though. I would still like to see
> what happens with the patch I sent Vince.

I'll try your patch momentarily, first I had some other changes I started
running before I left work (for some reason it recompiled the whole
kernel).

8: function: perf_output_begin
8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e
8: bputs: perf_output_begin: VMW: before rcu_dereference
9: function: __do_page_fault
9: function: down_read_trylock
9: function: _cond_resched
9: function: find_vma

so it looks like the fault happens

rcu_read_lock();

116 /*
117 * For inherited events we send all the output towards the parent.
118 */
119 if (event->parent)
120 event = event->parent;
121

somewhere between here

122 rb = rcu_dereference(event->rb);
123 if (unlikely(!rb))
124 goto out;

and here

125
126 if (unlikely(!rb->nr_pages))
127 goto out;

although if rcu locks do anything to turn off tracing then this could be
suspect.

Vince

2014-02-28 01:32:34

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, 27 Feb 2014, H. Peter Anvin wrote:

> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
> >
> > Yeah, something is getting mesed up.
> >
>
> What it *looks* like to me is that we try to nest the cr2 save/restore,
> which doesn't nest because it is a percpu variable.
>
> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
> entry_64.S, which makes the stuff in do_nmi completely redundant and
> there for no good reason.
>
> I would actually suggest we do the equivalent on i386 as well.
>
> Vince, could you try this patch as an experiment?

OK with your patch applied it does not segfault.

The trace looks like you'd expect too:

perf_fuzzer-2910 [000] 255.355331: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
perf_fuzzer-2910 [000] 255.355331: function: __do_page_fault
perf_fuzzer-2910 [000] 255.355331: function: bad_area_nosemaphore
perf_fuzzer-2910 [000] 255.355331: function: __bad_area_nosemaphore
perf_fuzzer-2910 [000] 255.355332: function: no_context
perf_fuzzer-2910 [000] 255.355332: function: fixup_exception
perf_fuzzer-2910 [000] 255.355332: function: search_exception_tables
perf_fuzzer-2910 [000] 255.355332: function: search_extable
perf_fuzzer-2910 [000] 255.355333: function: perf_output_begin
perf_fuzzer-2910 [000] 255.355333: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e7
perf_fuzzer-2910 [000] 255.355333: bprint: perf_output_begin: VMW: before event->parent (nil)
perf_fuzzer-2910 [000] 255.355334: bprint: perf_output_begin: VMW: before rcu_dereference (nil)
perf_fuzzer-2910 [000] 255.355334: function: __do_page_fault
perf_fuzzer-2910 [000] 255.355334: function: down_read_trylock
perf_fuzzer-2910 [000] 255.355334: function: _cond_resched
perf_fuzzer-2910 [000] 255.355335: function: find_vma
perf_fuzzer-2910 [000] 255.355335: function: handle_mm_fault
perf_fuzzer-2910 [000] 255.355335: function: __do_fault
perf_fuzzer-2910 [000] 255.355335: bputs: perf_mmap_fault: VMW: perf_mmap_fault
perf_fuzzer-2910 [000] 255.355336: bprint: perf_mmap_fault: VMW: perf_mmap_fault 0xffff8800cba6a980
perf_fuzzer-2910 [000] 255.355336: function: perf_mmap_to_page
perf_fuzzer-2910 [000] 255.355336: function: _cond_resched
perf_fuzzer-2910 [000] 255.355336: function: unlock_page
perf_fuzzer-2910 [000] 255.355336: function: page_waitqueue
perf_fuzzer-2910 [000] 255.355337: function: __wake_up_bit
perf_fuzzer-2910 [000] 255.355337: bputs: perf_mmap_fault: VMW: perf_mmap_fault
perf_fuzzer-2910 [000] 255.355337: function: _cond_resched
perf_fuzzer-2910 [000] 255.355337: function: _raw_spin_lock
perf_fuzzer-2910 [000] 255.355337: function: page_add_file_rmap
perf_fuzzer-2910 [000] 255.355338: function: __inc_zone_page_state
perf_fuzzer-2910 [000] 255.355338: function: __inc_zone_state
perf_fuzzer-2910 [000] 255.355338: function: set_page_dirty
perf_fuzzer-2910 [000] 255.355338: function: page_mapping
perf_fuzzer-2910 [000] 255.355338: function: anon_set_page_dirty
perf_fuzzer-2910 [000] 255.355339: function: unlock_page


Vince

2014-02-28 02:18:53

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

Ok... I think we're definitely talking about a cr2 leak. The reboot might be a race condition in the NMI nesting handling maybe?

On February 27, 2014 5:34:34 PM PST, Vince Weaver <[email protected]> wrote:
>On Thu, 27 Feb 2014, H. Peter Anvin wrote:
>
>> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
>> >
>> > Yeah, something is getting mesed up.
>> >
>>
>> What it *looks* like to me is that we try to nest the cr2
>save/restore,
>> which doesn't nest because it is a percpu variable.
>>
>> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
>> entry_64.S, which makes the stuff in do_nmi completely redundant and
>> there for no good reason.
>>
>> I would actually suggest we do the equivalent on i386 as well.
>>
>> Vince, could you try this patch as an experiment?
>
>OK with your patch applied it does not segfault.
>
>The trace looks like you'd expect too:
>
>perf_fuzzer-2910 [000] 255.355331: page_fault_kernel:
>address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
>perf_fuzzer-2910 [000] 255.355331: function:
>__do_page_fault
>perf_fuzzer-2910 [000] 255.355331: function:
>bad_area_nosemaphore
>perf_fuzzer-2910 [000] 255.355331: function:
>__bad_area_nosemaphore
>perf_fuzzer-2910 [000] 255.355332: function:
>no_context
>perf_fuzzer-2910 [000] 255.355332: function:
> fixup_exception
>perf_fuzzer-2910 [000] 255.355332: function:
> search_exception_tables
>perf_fuzzer-2910 [000] 255.355332: function:
> search_extable
>perf_fuzzer-2910 [000] 255.355333: function:
>perf_output_begin
>perf_fuzzer-2910 [000] 255.355333: bprint:
>perf_output_begin: VMW: event type 2 config 2a st: 2c3e7
>perf_fuzzer-2910 [000] 255.355333: bprint:
>perf_output_begin: VMW: before event->parent (nil)
>perf_fuzzer-2910 [000] 255.355334: bprint:
>perf_output_begin: VMW: before rcu_dereference (nil)
>perf_fuzzer-2910 [000] 255.355334: function:
>__do_page_fault
>perf_fuzzer-2910 [000] 255.355334: function:
>down_read_trylock
>perf_fuzzer-2910 [000] 255.355334: function:
>_cond_resched
>perf_fuzzer-2910 [000] 255.355335: function: find_vma
>perf_fuzzer-2910 [000] 255.355335: function:
>handle_mm_fault
>perf_fuzzer-2910 [000] 255.355335: function:
>__do_fault
>perf_fuzzer-2910 [000] 255.355335: bputs:
>perf_mmap_fault: VMW: perf_mmap_fault
>perf_fuzzer-2910 [000] 255.355336: bprint:
>perf_mmap_fault: VMW: perf_mmap_fault 0xffff8800cba6a980
>perf_fuzzer-2910 [000] 255.355336: function:
>perf_mmap_to_page
>perf_fuzzer-2910 [000] 255.355336: function:
>_cond_resched
>perf_fuzzer-2910 [000] 255.355336: function:
>unlock_page
>perf_fuzzer-2910 [000] 255.355336: function:
>page_waitqueue
>perf_fuzzer-2910 [000] 255.355337: function:
>__wake_up_bit
>perf_fuzzer-2910 [000] 255.355337: bputs:
>perf_mmap_fault: VMW: perf_mmap_fault
>perf_fuzzer-2910 [000] 255.355337: function:
>_cond_resched
>perf_fuzzer-2910 [000] 255.355337: function:
>_raw_spin_lock
>perf_fuzzer-2910 [000] 255.355337: function:
>page_add_file_rmap
>perf_fuzzer-2910 [000] 255.355338: function:
>__inc_zone_page_state
>perf_fuzzer-2910 [000] 255.355338: function:
> __inc_zone_state
>perf_fuzzer-2910 [000] 255.355338: function:
>set_page_dirty
>perf_fuzzer-2910 [000] 255.355338: function:
>page_mapping
>perf_fuzzer-2910 [000] 255.355338: function:
>anon_set_page_dirty
>perf_fuzzer-2910 [000] 255.355339: function:
>unlock_page
>
>
>Vince

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-28 02:57:30

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, 27 Feb 2014 20:34:34 -0500 (EST)
Vince Weaver <[email protected]> wrote:


> > I would actually suggest we do the equivalent on i386 as well.
> >
> > Vince, could you try this patch as an experiment?
>
> OK with your patch applied it does not segfault.
>

Vince, Great! Can you remove Peter's patch, and try this one. It
removes the crud to save the cr2 from entry_64.S and makes both i386
and x86_64 do the same thing in regards to cr2 handling.

-- Steve

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c36..937cb8d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1854,29 +1854,11 @@ end_repeat_nmi:
call save_paranoid
DEFAULT_FRAME 0

- /*
- * Save off the CR2 register. If we take a page fault in the NMI then
- * it could corrupt the CR2 value. If the NMI preempts a page fault
- * handler before it was able to read the CR2 register, and then the
- * NMI itself takes a page fault, the page fault that was preempted
- * will read the information from the NMI page fault and not the
- * origin fault. Save it off and restore it if it changes.
- * Use the r12 callee-saved register.
- */
- movq %cr2, %r12
-
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp,%rdi
movq $-1,%rsi
call do_nmi

- /* Did the NMI take a page fault? Restore cr2 if it did */
- movq %cr2, %rcx
- cmpq %rcx, %r12
- je 1f
- movq %r12, %cr2
-1:
-
testl %ebx,%ebx /* swapgs needed? */
jnz nmi_restore
nmi_swapgs:
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 6fcb49c..f1a6294 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -443,7 +443,6 @@ enum nmi_states {
NMI_LATCHED,
};
static DEFINE_PER_CPU(enum nmi_states, nmi_state);
-static DEFINE_PER_CPU(unsigned long, nmi_cr2);

#define nmi_nesting_preprocess(regs) \
do { \
@@ -452,14 +451,11 @@ static DEFINE_PER_CPU(unsigned long, nmi_cr2);
return; \
} \
this_cpu_write(nmi_state, NMI_EXECUTING); \
- this_cpu_write(nmi_cr2, read_cr2()); \
} while (0); \
nmi_restart:

#define nmi_nesting_postprocess() \
do { \
- if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
- write_cr2(this_cpu_read(nmi_cr2)); \
if (this_cpu_dec_return(nmi_state)) \
goto nmi_restart; \
} while (0)
@@ -512,8 +508,21 @@ static inline void nmi_nesting_postprocess(void)
dotraplinkage notrace __kprobes void
do_nmi(struct pt_regs *regs, long error_code)
{
+ unsigned long cr2;
+
nmi_nesting_preprocess(regs);

+ /*
+ * Save off the CR2 register. If we take a page fault in the NMI then
+ * it could corrupt the CR2 value. If the NMI preempts a page fault
+ * handler before it was able to read the CR2 register, and then the
+ * NMI itself takes a page fault, the page fault that was preempted
+ * will read the information from the NMI page fault and not the
+ * origin fault. Save it off and restore it if it changes.
+ * Use the r12 callee-saved register.
+ */
+ cr2 = read_cr2();
+
nmi_enter();

inc_irq_stat(__nmi_count);
@@ -523,6 +532,10 @@ do_nmi(struct pt_regs *regs, long error_code)

nmi_exit();

+ /* Reads are cheaper than writes */
+ if (unlikely(cr2 != read_cr2()))
+ write_cr2(cr2);
+
/* On i386, may loop back to preprocess */
nmi_nesting_postprocess();
}

2014-02-28 09:39:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, Feb 27, 2014 at 05:31:50PM -0500, Steven Rostedt wrote:
> Well, the perf ring buffer is vmalloced, right? That can cause a page
> fault too.

On x86 they're typically not -- although we have a debug CONFIG option
to test that code on x86 too. On SPARC/ARM etc.. we have to use
vmalloc_user() because of cache aliasing (yay for VIPT).

2014-02-28 11:11:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, Feb 27, 2014 at 09:57:26PM -0500, Steven Rostedt wrote:
> @@ -512,8 +508,21 @@ static inline void nmi_nesting_postprocess(void)
> dotraplinkage notrace __kprobes void
> do_nmi(struct pt_regs *regs, long error_code)
> {
> + unsigned long cr2;
> +
> nmi_nesting_preprocess(regs);
>
> + /*
> + * Save off the CR2 register. If we take a page fault in the NMI then
> + * it could corrupt the CR2 value. If the NMI preempts a page fault
> + * handler before it was able to read the CR2 register, and then the
> + * NMI itself takes a page fault, the page fault that was preempted
> + * will read the information from the NMI page fault and not the
> + * origin fault. Save it off and restore it if it changes.

> + * Use the r12 callee-saved register.

You might want to make that line go away :-)

> + */
> + cr2 = read_cr2();
> +
> nmi_enter();
>
> inc_irq_stat(__nmi_count);
> @@ -523,6 +532,10 @@ do_nmi(struct pt_regs *regs, long error_code)
>
> nmi_exit();
>
> + /* Reads are cheaper than writes */
> + if (unlikely(cr2 != read_cr2()))
> + write_cr2(cr2);
> +
> /* On i386, may loop back to preprocess */
> nmi_nesting_postprocess();
> }

2014-02-28 13:38:10

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 12:11:11 +0100
Peter Zijlstra <[email protected]> wrote:

> On Thu, Feb 27, 2014 at 09:57:26PM -0500, Steven Rostedt wrote:
> > @@ -512,8 +508,21 @@ static inline void nmi_nesting_postprocess(void)
> > dotraplinkage notrace __kprobes void
> > do_nmi(struct pt_regs *regs, long error_code)
> > {
> > + unsigned long cr2;
> > +
> > nmi_nesting_preprocess(regs);
> >
> > + /*
> > + * Save off the CR2 register. If we take a page fault in the NMI then
> > + * it could corrupt the CR2 value. If the NMI preempts a page fault
> > + * handler before it was able to read the CR2 register, and then the
> > + * NMI itself takes a page fault, the page fault that was preempted
> > + * will read the information from the NMI page fault and not the
> > + * origin fault. Save it off and restore it if it changes.
>
> > + * Use the r12 callee-saved register.
>
> You might want to make that line go away :-)
>

Gah! I noticed that before sending and deleted the line, recreated the
patch and sent that out. The one step I forgot to do was to save the
file after deleting it :-p

-- Steve

2014-02-28 14:13:36

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, 27 Feb 2014, Steven Rostedt wrote:

> On Thu, 27 Feb 2014 20:34:34 -0500 (EST)
> Vince Weaver <[email protected]> wrote:
>
>
> > > I would actually suggest we do the equivalent on i386 as well.
> > >
> > > Vince, could you try this patch as an experiment?
> >
> > OK with your patch applied it does not segfault.
> >
>
> Vince, Great! Can you remove Peter's patch, and try this one. It
> removes the crud to save the cr2 from entry_64.S and makes both i386
> and x86_64 do the same thing in regards to cr2 handling.

no, with only this patch applied it segfaults as per previous:

[ 126.396049] perf_fuzzer[2904]: segfault at 17a0 ip 00000000004017fd sp 00000000ffaff3f0 error 6 in perf_fuzzer[400000+d1000]

Vince

2014-02-28 14:23:44

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 09:15:33 -0500 (EST)
Vince Weaver <[email protected]> wrote:

> On Thu, 27 Feb 2014, Steven Rostedt wrote:
>
> > On Thu, 27 Feb 2014 20:34:34 -0500 (EST)
> > Vince Weaver <[email protected]> wrote:
> >
> >
> > > > I would actually suggest we do the equivalent on i386 as well.
> > > >
> > > > Vince, could you try this patch as an experiment?
> > >
> > > OK with your patch applied it does not segfault.
> > >
> >
> > Vince, Great! Can you remove Peter's patch, and try this one. It
> > removes the crud to save the cr2 from entry_64.S and makes both i386
> > and x86_64 do the same thing in regards to cr2 handling.
>
> no, with only this patch applied it segfaults as per previous:
>
> [ 126.396049] perf_fuzzer[2904]: segfault at 17a0 ip 00000000004017fd sp 00000000ffaff3f0 error 6 in perf_fuzzer[400000+d1000]

Interesting. Are you doing a perf function trace?

And just in case, can you add this patch and make sure the copy is
called by NMI.

-- Steve

diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index ddf9ecb..ca943cd 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -29,6 +29,7 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
*/
pagefault_disable();
ret = __copy_from_user_inatomic(to, from, n);
+ trace_dump_stack(2)
pagefault_enable();

return ret;

2014-02-28 15:05:21

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014, Steven Rostedt wrote:

> Interesting. Are you doing a perf function trace?
>
> And just in case, can you add this patch and make sure the copy is
> called by NMI.

199.900682: function: trace_do_page_fault
199.900683: page_fault_user: address=__per_cpu_end ip=__per_cpu_end error_code=0x6
199.900683: function: perf_swevent_get_recursion_context
199.900684: function: perf_tp_event
199.900684: function: perf_swevent_event
199.900684: function: perf_swevent_overflow
199.900684: function: __perf_event_overflow
199.900685: function: perf_prepare_sample
199.900685: function: __perf_event_header__init_id
199.900685: function: task_tgid_nr_ns
199.900685: function: perf_event_tid
199.900686: function: __task_pid_nr_ns
199.900686: function: perf_callchain
199.900687: function: copy_from_user_nmi
199.900687: function: trace_do_page_fault
199.900687: page_fault_kernel: address=irq_stack_union ip=copy_user_generic_string error_code=0x0
199.900688: function: __do_page_fault
199.900688: function: bad_area_nosemaphore
199.900688: function: __bad_area_nosemaphore
199.900689: function: no_context
199.900689: function: fixup_exception
199.900689: function: search_exception_tables
199.900689: function: search_extable
199.900691: function: copy_user_handle_tail
199.900691: function: trace_do_page_fault
199.900691: page_fault_kernel: address=irq_stack_union ip=copy_user_handle_tail error_code=0x0
199.900691: function: __do_page_fault
199.900692: function: bad_area_nosemaphore
199.900692: function: __bad_area_nosemaphore
199.900692: function: no_context
199.900692: function: fixup_exception
199.900692: function: search_exception_tables
199.900692: function: search_extable
199.900693: function: save_stack_trace
199.900693: function: dump_trace
199.900694: function: print_context_stack
199.900694: function: __kernel_text_address
199.900694: function: is_module_text_address
199.900695: function: __module_text_address
199.900695: function: __module_address
199.900695: function: __kernel_text_address
199.900695: function: is_module_text_address
199.900696: function: __module_text_address
199.900696: function: __module_address
...
199.900705: function: __kernel_text_address
199.900809: kernel_stack: <stack trace>
=> perf_callchain (ffffffff810d35a2)
=> perf_prepare_sample (ffffffff810cfae3)
=> __perf_event_overflow (ffffffff810d02f4)
=> perf_swevent_overflow (ffffffff810d04e3)
=> perf_swevent_event (ffffffff810d0574)
=> perf_tp_event (ffffffff810d070c)
=> perf_trace_x86_exceptions (ffffffff810341b6)
=> trace_do_page_fault (ffffffff81537702)
=> trace_page_fault (ffffffff81534772)
199.900810: function: perf_output_begin
199.900810: function: __do_page_fault
199.900810: function: __perf_sw_event
199.900810: function: perf_swevent_get_recursion_context
199.900811: function: down_read_trylock
199.900811: function: _cond_resched
199.900811: function: find_vma
199.900811: function: bad_area
199.900812: function: up_read
199.900812: function: __bad_area_nosemaphore
199.900812: function: is_prefetch
199.900812: function: convert_ip_to_linear
199.900813: function: unhandled_signal
199.900813: function: __printk_ratelimit
199.900813: function: _raw_spin_trylock
199.900813: function: _raw_spin_unlock_irqrestore
199.900814: function: printk
199.900814: function: vprintk_emit

2014-02-28 15:14:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

If I'm reading this right we end up going from the page fault tracepoint to copy_from_user_nmi() without going through NMI, and the cr2 corruption is obvious. I guess the assumption that only the NMI path needed to save cr2 is flawed?

On February 28, 2014 7:07:29 AM PST, Vince Weaver <[email protected]> wrote:
>On Fri, 28 Feb 2014, Steven Rostedt wrote:
>
>> Interesting. Are you doing a perf function trace?
>>
>> And just in case, can you add this patch and make sure the copy is
>> called by NMI.
>
>199.900682: function: trace_do_page_fault
>199.900683: page_fault_user: address=__per_cpu_end
>ip=__per_cpu_end error_code=0x6
>199.900683: function: perf_swevent_get_recursion_context
>199.900684: function: perf_tp_event
>199.900684: function: perf_swevent_event
>199.900684: function: perf_swevent_overflow
>199.900684: function: __perf_event_overflow
>199.900685: function: perf_prepare_sample
>199.900685: function:
>__perf_event_header__init_id
>199.900685: function: task_tgid_nr_ns
>199.900685: function: perf_event_tid
>199.900686: function: __task_pid_nr_ns
>199.900686: function: perf_callchain
>199.900687: function: copy_from_user_nmi
>199.900687: function: trace_do_page_fault
>199.900687: page_fault_kernel: address=irq_stack_union
>ip=copy_user_generic_string error_code=0x0
>199.900688: function: __do_page_fault
>199.900688: function: bad_area_nosemaphore
>199.900688: function: __bad_area_nosemaphore
>199.900689: function: no_context
>199.900689: function: fixup_exception
>199.900689: function:
>search_exception_tables
>199.900689: function: search_extable
>199.900691: function: copy_user_handle_tail
>199.900691: function: trace_do_page_fault
>199.900691: page_fault_kernel: address=irq_stack_union
>ip=copy_user_handle_tail error_code=0x0
>199.900691: function: __do_page_fault
>199.900692: function: bad_area_nosemaphore
>199.900692: function: __bad_area_nosemaphore
>199.900692: function: no_context
>199.900692: function: fixup_exception
>199.900692: function:
>search_exception_tables
>199.900692: function: search_extable
>199.900693: function: save_stack_trace
>199.900693: function: dump_trace
>199.900694: function: print_context_stack
>199.900694: function: __kernel_text_address
>199.900694: function: is_module_text_address
>199.900695: function: __module_text_address
>199.900695: function: __module_address
>199.900695: function: __kernel_text_address
>199.900695: function: is_module_text_address
>199.900696: function: __module_text_address
>199.900696: function: __module_address
>...
>199.900705: function: __kernel_text_address
>199.900809: kernel_stack: <stack trace>
>=> perf_callchain (ffffffff810d35a2)
>=> perf_prepare_sample (ffffffff810cfae3)
>=> __perf_event_overflow (ffffffff810d02f4)
>=> perf_swevent_overflow (ffffffff810d04e3)
>=> perf_swevent_event (ffffffff810d0574)
>=> perf_tp_event (ffffffff810d070c)
>=> perf_trace_x86_exceptions (ffffffff810341b6)
>=> trace_do_page_fault (ffffffff81537702)
>=> trace_page_fault (ffffffff81534772)
>199.900810: function: perf_output_begin
>199.900810: function: __do_page_fault
>199.900810: function: __perf_sw_event
>199.900810: function:
>perf_swevent_get_recursion_context
>199.900811: function: down_read_trylock
>199.900811: function: _cond_resched
>199.900811: function: find_vma
>199.900811: function: bad_area
>199.900812: function: up_read
>199.900812: function: __bad_area_nosemaphore
>199.900812: function: is_prefetch
>199.900812: function: convert_ip_to_linear
>199.900813: function: unhandled_signal
>199.900813: function: __printk_ratelimit
>199.900813: function: _raw_spin_trylock
>199.900813: function: _raw_spin_unlock_irqrestore
>199.900814: function: printk
>199.900814: function: vprintk_emit

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-28 15:20:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 10:07:29 -0500 (EST)
Vince Weaver <[email protected]> wrote:

> On Fri, 28 Feb 2014, Steven Rostedt wrote:

> 199.900696: function: __module_address
> ...
> 199.900705: function: __kernel_text_address
> 199.900809: kernel_stack: <stack trace>
> => perf_callchain (ffffffff810d35a2)
> => perf_prepare_sample (ffffffff810cfae3)
> => __perf_event_overflow (ffffffff810d02f4)
> => perf_swevent_overflow (ffffffff810d04e3)
> => perf_swevent_event (ffffffff810d0574)
> => perf_tp_event (ffffffff810d070c)
> => perf_trace_x86_exceptions (ffffffff810341b6)
> => trace_do_page_fault (ffffffff81537702)
> => trace_page_fault (ffffffff81534772)

Thank you!!! You just found the bug :-)

The bug was caused by:

commit 25c74b10bacead867478480170083f69cfc0db48
x86, trace: Register exception handler to trace IDT

With this code:

dotraplinkage void __kprobes
trace_do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
enum ctx_state prev_state;

prev_state = exception_enter();
trace_page_fault_entries(regs, error_code);
__do_page_fault(regs, error_code);
exception_exit(prev_state);
}

The trace_page_fault_entries() which is called before the cr2 is saved
can fault by perf doing a userspace stack trace. But the cr2 is not
restored when calling __do_page_fault() and that gets the wrong cr2.

Below is a patch that should fix this. Please remove all other patches
and try this out.

Thanks,

-- Steve

> 199.900810: function: perf_output_begin
> 199.900810: function: __do_page_fault
> 199.900810: function: __perf_sw_event
> 199.900810: function: perf_swevent_get_recursion_context
> 199.900811: function: down_read_trylock
> 199.900811: function: _cond_resched
> 199.900811: function: find_vma
> 199.900811: function: bad_area
> 199.900812: function: up_read
> 199.900812: function: __bad_area_nosemaphore
> 199.900812: function: is_prefetch
> 199.900812: function: convert_ip_to_linear
> 199.900813: function: unhandled_signal
> 199.900813: function: __printk_ratelimit
> 199.900813: function: _raw_spin_trylock
> 199.900813: function: _raw_spin_unlock_irqrestore
> 199.900814: function: printk
> 199.900814: function: vprintk_emit

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6dea040..66b636d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1271,9 +1271,15 @@ dotraplinkage void __kprobes
trace_do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
enum ctx_state prev_state;
+ unsigned long cr2;

prev_state = exception_enter();
+ /* The trace might fault, save the cr2 register */
+ cr2 = read_cr2();
trace_page_fault_entries(regs, error_code);
+ /* Put back the original cr2 if needed */
+ if (cr2 != read_cr2())
+ write_cr2(cr2);
__do_page_fault(regs, error_code);
exception_exit(prev_state);
}

2014-02-28 15:30:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 10:20:00 -0500
Steven Rostedt <[email protected]> wrote:

> Below is a patch that should fix this. Please remove all other patches
> and try this out.

Updated patch, as Peter Zijlstra on IRC asked me if the
exception_enter() can be traced. And looking at it, it sure can be.

-- Steve

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6dea040..f606b67 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1271,9 +1271,15 @@ dotraplinkage void __kprobes
trace_do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
enum ctx_state prev_state;
+ unsigned long cr2;

+ /* The trace might fault, save the cr2 register */
+ cr2 = read_cr2();
prev_state = exception_enter();
trace_page_fault_entries(regs, error_code);
+ /* Put back the original cr2 if needed */
+ if (cr2 != read_cr2())
+ write_cr2(cr2);
__do_page_fault(regs, error_code);
exception_exit(prev_state);
}

2014-02-28 15:40:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 07:13:06AM -0800, H. Peter Anvin wrote:
> If I'm reading this right we end up going from the page fault
> tracepoint to copy_from_user_nmi() without going through NMI, and the
> cr2 corruption is obvious. I guess the assumption that only the NMI
> path needed to save cr2 is flawed?

It was never assumed it would only go through NMI, but that it would be
NMI safe -- and as it turns out, it is that.

What I did assume was that any other callsites would be safe, seeing how
they'd already be running in 'normal' contexts.

I had not considered people putting tracepoints _that_ early in the
exception paths.

Note that there's more tracepoints there than the one mentioned.

2014-02-28 16:16:22

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/28/2014 07:40 AM, Peter Zijlstra wrote:
> On Fri, Feb 28, 2014 at 07:13:06AM -0800, H. Peter Anvin wrote:
>> If I'm reading this right we end up going from the page fault
>> tracepoint to copy_from_user_nmi() without going through NMI, and the
>> cr2 corruption is obvious. I guess the assumption that only the NMI
>> path needed to save cr2 is flawed?
>
> It was never assumed it would only go through NMI, but that it would be
> NMI safe -- and as it turns out, it is that.
>
> What I did assume was that any other callsites would be safe, seeing how
> they'd already be running in 'normal' contexts.
>
> I had not considered people putting tracepoints _that_ early in the
> exception paths.
>
> Note that there's more tracepoints there than the one mentioned.
>

Well, I was talking about the assumption spelled out in the comment
above copy_from_user_nmi() which pretty much states "cr2 is safe because
cr2 is saved/restored in the NMI wrappers."

-hpa

2014-02-28 16:29:49

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 08:15:11 -0800
"H. Peter Anvin" <[email protected]> wrote:

> Well, I was talking about the assumption spelled out in the comment
> above copy_from_user_nmi() which pretty much states "cr2 is safe because
> cr2 is saved/restored in the NMI wrappers."

Yeah, it seems that the name "copy_from_user_nmi()" is a misnomer. As
it can be called outside of nmi context. Perhaps we should have a
copy_from_user_trace() that does the save and restore of the cr2.

As that's the only place that faults, it may be the best answer.

Arnaldo,

Can you test this patch and see if it fixes the bug for you too. Vince
already said it fixes it for him.

I'm attaching it below (it's from H. Peter).

Peter Z., as both Jiri's and my patch fixed the callers of the problem
area, and as we have been discussing, there may be more problem areas,
I'm thinking the best solution is to just use H. Peter's patch instead.
And then we should rename it to copy_from_user_trace().

-- Steve


diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index ddf9ecb..938e45c 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -10,6 +10,8 @@
#include <asm/word-at-a-time.h>
#include <linux/sched.h>

+#include <asm/processor.h>
+
/*
* We rely on the nested NMI work to allow atomic faults from the NMI path; the
* nested NMI paths are careful to preserve CR2.
@@ -18,6 +20,7 @@ unsigned long
copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
{
unsigned long ret;
+ unsigned long cr2;

if (__range_not_ok(from, n, TASK_SIZE))
return 0;
@@ -27,9 +30,11 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
* disable pagefaults so that its behaviour is consistent even when
* called form other contexts.
*/
+ cr2 = read_cr2();
pagefault_disable();
ret = __copy_from_user_inatomic(to, from, n);
pagefault_enable();
+ write_cr2(cr2);

return ret;
}

2014-02-28 19:33:21

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace()

[ H. Peter, Here's the rename patch. I did not include your update. You
can add that first and then massage this patch on top. But this isn't
critical for mainline or stable, where as I believe your patch is. ]

The tracing utilities sometimes need to read from userspace (stack tracing),
and to do this it has as special copy_from_user function called
copy_from_user_nmi(). Well, as tracers can call this from outside of
nmi context, the "_nmi" part is a misnomer and "_trace" is a better
name.

Signed-off-by: Steven Rostedt <[email protected]>
---
arch/x86/include/asm/perf_event.h | 2 +-
arch/x86/include/asm/uaccess.h | 2 +-
arch/x86/kernel/cpu/perf_event.c | 4 ++--
arch/x86/kernel/cpu/perf_event_intel_ds.c | 2 +-
arch/x86/kernel/cpu/perf_event_intel_lbr.c | 2 +-
arch/x86/lib/usercopy.c | 10 ++++++----
arch/x86/oprofile/backtrace.c | 4 ++--
7 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 8249df4..7bf4b25 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -274,6 +274,6 @@ static inline void perf_check_microcode(void) { }
static inline void amd_pmu_disable_virt(void) { }
#endif

-#define arch_perf_out_copy_user copy_from_user_nmi
+#define arch_perf_out_copy_user copy_from_user_trace

#endif /* _ASM_X86_PERF_EVENT_H */
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 8ec57c0..d734baf 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -515,7 +515,7 @@ struct __large_struct { unsigned long buf[100]; };
__put_user_size_ex((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr)))

extern unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n);
+copy_from_user_trace(void *to, const void __user *from, unsigned long n);
extern __must_check long
strncpy_from_user(char *dst, const char __user *src, long count);

diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 8e13293..7eda4af 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1988,7 +1988,7 @@ perf_callchain_user32(struct pt_regs *regs, struct perf_callchain_entry *entry)
frame.next_frame = 0;
frame.return_address = 0;

- bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+ bytes = copy_from_user_trace(&frame, fp, sizeof(frame));
if (bytes != 0)
break;

@@ -2040,7 +2040,7 @@ perf_callchain_user(struct perf_callchain_entry *entry, struct pt_regs *regs)
frame.next_frame = NULL;
frame.return_address = 0;

- bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
+ bytes = copy_from_user_trace(&frame, fp, sizeof(frame));
if (bytes != 0)
break;

diff --git a/arch/x86/kernel/cpu/perf_event_intel_ds.c b/arch/x86/kernel/cpu/perf_event_intel_ds.c
index ae96cfa..95d76c6 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_ds.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_ds.c
@@ -788,7 +788,7 @@ static int intel_pmu_pebs_fixup_ip(struct pt_regs *regs)
u8 *buf = this_cpu_read(insn_buffer);

size = ip - to; /* Must fit our buffer, see above */
- bytes = copy_from_user_nmi(buf, (void __user *)to, size);
+ bytes = copy_from_user_trace(buf, (void __user *)to, size);
if (bytes != 0)
return 0;

diff --git a/arch/x86/kernel/cpu/perf_event_intel_lbr.c b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
index d82d155..0505ada 100644
--- a/arch/x86/kernel/cpu/perf_event_intel_lbr.c
+++ b/arch/x86/kernel/cpu/perf_event_intel_lbr.c
@@ -490,7 +490,7 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
return X86_BR_NONE;

/* may fail if text not present */
- bytes = copy_from_user_nmi(buf, (void __user *)from, size);
+ bytes = copy_from_user_trace(buf, (void __user *)from, size);
if (bytes != 0)
return X86_BR_NONE;

diff --git a/arch/x86/lib/usercopy.c b/arch/x86/lib/usercopy.c
index ddf9ecb..131ff16e 100644
--- a/arch/x86/lib/usercopy.c
+++ b/arch/x86/lib/usercopy.c
@@ -11,11 +11,13 @@
#include <linux/sched.h>

/*
- * We rely on the nested NMI work to allow atomic faults from the NMI path; the
- * nested NMI paths are careful to preserve CR2.
+ * Used by tracing, needs to restore state of the cr2 register if
+ * the copy triggered a page fault. That's because tracing can happen
+ * between the time a normal page fault occurred and the cr2 is read
+ * by the handler.
*/
unsigned long
-copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
+copy_from_user_trace(void *to, const void __user *from, unsigned long n)
{
unsigned long ret;

@@ -33,4 +35,4 @@ copy_from_user_nmi(void *to, const void __user *from, unsigned long n)

return ret;
}
-EXPORT_SYMBOL_GPL(copy_from_user_nmi);
+EXPORT_SYMBOL_GPL(copy_from_user_trace);
diff --git a/arch/x86/oprofile/backtrace.c b/arch/x86/oprofile/backtrace.c
index 5d04be5..7831650 100644
--- a/arch/x86/oprofile/backtrace.c
+++ b/arch/x86/oprofile/backtrace.c
@@ -46,7 +46,7 @@ dump_user_backtrace_32(struct stack_frame_ia32 *head)
struct stack_frame_ia32 *fp;
unsigned long bytes;

- bytes = copy_from_user_nmi(bufhead, head, sizeof(bufhead));
+ bytes = copy_from_user_trace(bufhead, head, sizeof(bufhead));
if (bytes != 0)
return NULL;

@@ -92,7 +92,7 @@ static struct stack_frame *dump_user_backtrace(struct stack_frame *head)
struct stack_frame bufhead[2];
unsigned long bytes;

- bytes = copy_from_user_nmi(bufhead, head, sizeof(bufhead));
+ bytes = copy_from_user_trace(bufhead, head, sizeof(bufhead));
if (bytes != 0)
return NULL;

--
1.8.1.4

2014-02-28 20:34:15

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Thu, Feb 27, 2014 at 08:00:04PM -0500, Vince Weaver wrote:
> On Thu, 27 Feb 2014, H. Peter Anvin wrote:
>
> > On 02/27/2014 03:30 PM, Steven Rostedt wrote:
> > > On Thu, 27 Feb 2014 14:52:54 -0800
> > > "H. Peter Anvin" <[email protected]> wrote:
> > >
> > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
> > >>>
> > >>> Yeah, something is getting mesed up.
> > >>>
> > >>
> > >> What it *looks* like to me is that we try to nest the cr2 save/restore,
> > >> which doesn't nest because it is a percpu variable.
> > >>
> > >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
> > >> entry_64.S, which makes the stuff in do_nmi completely redundant and
> > >> there for no good reason.
> > >
> > > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32
> > > section. That is, it isn't even executed. That's i386 code. The only
> > > place the cr2 is saved for x86_64 is in entry_64.S.
> > >
> >
> > Right, egg on my face. However, I still think it would make more sense
> > for it to nest the way entry_64.S does if at all possible.
> >
> > That makes this even more confusing, though. I would still like to see
> > what happens with the patch I sent Vince.
>
> I'll try your patch momentarily, first I had some other changes I started
> running before I left work (for some reason it recompiled the whole
> kernel).
>
> 8: function: perf_output_begin
> 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e
> 8: bputs: perf_output_begin: VMW: before rcu_dereference
> 9: function: __do_page_fault
> 9: function: down_read_trylock
> 9: function: _cond_resched
> 9: function: find_vma
>
> so it looks like the fault happens
>
> rcu_read_lock();
>
> 116 /*
> 117 * For inherited events we send all the output towards the parent.
> 118 */
> 119 if (event->parent)
> 120 event = event->parent;
> 121
>
> somewhere between here
>
> 122 rb = rcu_dereference(event->rb);
> 123 if (unlikely(!rb))
> 124 goto out;
>
> and here
>
> 125
> 126 if (unlikely(!rb->nr_pages))
> 127 goto out;
>
> although if rcu locks do anything to turn off tracing then this could be
> suspect.

The most likely suspect is of course event->rb in the rcu_dereference.
I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock()
currently interact with tracing. ;-)

Thanx, Paul

2014-02-28 20:39:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

Now we need to figure out if the reboot problem and the segfault problem are actually the same... I have a nasty feeling they might be different problems.

On February 28, 2014 7:07:29 AM PST, Vince Weaver <[email protected]> wrote:
>On Fri, 28 Feb 2014, Steven Rostedt wrote:
>
>> Interesting. Are you doing a perf function trace?
>>
>> And just in case, can you add this patch and make sure the copy is
>> called by NMI.
>
>199.900682: function: trace_do_page_fault
>199.900683: page_fault_user: address=__per_cpu_end
>ip=__per_cpu_end error_code=0x6
>199.900683: function: perf_swevent_get_recursion_context
>199.900684: function: perf_tp_event
>199.900684: function: perf_swevent_event
>199.900684: function: perf_swevent_overflow
>199.900684: function: __perf_event_overflow
>199.900685: function: perf_prepare_sample
>199.900685: function:
>__perf_event_header__init_id
>199.900685: function: task_tgid_nr_ns
>199.900685: function: perf_event_tid
>199.900686: function: __task_pid_nr_ns
>199.900686: function: perf_callchain
>199.900687: function: copy_from_user_nmi
>199.900687: function: trace_do_page_fault
>199.900687: page_fault_kernel: address=irq_stack_union
>ip=copy_user_generic_string error_code=0x0
>199.900688: function: __do_page_fault
>199.900688: function: bad_area_nosemaphore
>199.900688: function: __bad_area_nosemaphore
>199.900689: function: no_context
>199.900689: function: fixup_exception
>199.900689: function:
>search_exception_tables
>199.900689: function: search_extable
>199.900691: function: copy_user_handle_tail
>199.900691: function: trace_do_page_fault
>199.900691: page_fault_kernel: address=irq_stack_union
>ip=copy_user_handle_tail error_code=0x0
>199.900691: function: __do_page_fault
>199.900692: function: bad_area_nosemaphore
>199.900692: function: __bad_area_nosemaphore
>199.900692: function: no_context
>199.900692: function: fixup_exception
>199.900692: function:
>search_exception_tables
>199.900692: function: search_extable
>199.900693: function: save_stack_trace
>199.900693: function: dump_trace
>199.900694: function: print_context_stack
>199.900694: function: __kernel_text_address
>199.900694: function: is_module_text_address
>199.900695: function: __module_text_address
>199.900695: function: __module_address
>199.900695: function: __kernel_text_address
>199.900695: function: is_module_text_address
>199.900696: function: __module_text_address
>199.900696: function: __module_address
>...
>199.900705: function: __kernel_text_address
>199.900809: kernel_stack: <stack trace>
>=> perf_callchain (ffffffff810d35a2)
>=> perf_prepare_sample (ffffffff810cfae3)
>=> __perf_event_overflow (ffffffff810d02f4)
>=> perf_swevent_overflow (ffffffff810d04e3)
>=> perf_swevent_event (ffffffff810d0574)
>=> perf_tp_event (ffffffff810d070c)
>=> perf_trace_x86_exceptions (ffffffff810341b6)
>=> trace_do_page_fault (ffffffff81537702)
>=> trace_page_fault (ffffffff81534772)
>199.900810: function: perf_output_begin
>199.900810: function: __do_page_fault
>199.900810: function: __perf_sw_event
>199.900810: function:
>perf_swevent_get_recursion_context
>199.900811: function: down_read_trylock
>199.900811: function: _cond_resched
>199.900811: function: find_vma
>199.900811: function: bad_area
>199.900812: function: up_read
>199.900812: function: __bad_area_nosemaphore
>199.900812: function: is_prefetch
>199.900812: function: convert_ip_to_linear
>199.900813: function: unhandled_signal
>199.900813: function: __printk_ratelimit
>199.900813: function: _raw_spin_trylock
>199.900813: function: _raw_spin_unlock_irqrestore
>199.900814: function: printk
>199.900814: function: vprintk_emit

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-02-28 20:46:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace()

On Fri, Feb 28, 2014 at 02:33:16PM -0500, Steven Rostedt wrote:
> [ H. Peter, Here's the rename patch. I did not include your update. You
> can add that first and then massage this patch on top. But this isn't
> critical for mainline or stable, where as I believe your patch is. ]
>
> The tracing utilities sometimes need to read from userspace (stack tracing),
> and to do this it has as special copy_from_user function called
> copy_from_user_nmi(). Well, as tracers can call this from outside of
> nmi context, the "_nmi" part is a misnomer and "_trace" is a better
> name.

NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly
copy_from_user_nmi() is an NMI safe copy from user.

Furthermore, there's exactly 0 trace users, so the proposed name is
actively worse.

2014-02-28 20:46:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 12:38:52 -0800
"H. Peter Anvin" <[email protected]> wrote:

> Now we need to figure out if the reboot problem and the segfault problem are actually the same... I have a nasty feeling they might be different problems.

I wonder if there was any recursion problem. Although, I believe perf
has recursion checks. But perhaps something recursed before the checks?

-- Steve

2014-02-28 20:47:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 12:34:05 -0800
"Paul E. McKenney" <[email protected]> wrote:

> On Thu, Feb 27, 2014 at 08:00:04PM -0500, Vince Weaver wrote:
> > On Thu, 27 Feb 2014, H. Peter Anvin wrote:
> >
> > > On 02/27/2014 03:30 PM, Steven Rostedt wrote:
> > > > On Thu, 27 Feb 2014 14:52:54 -0800
> > > > "H. Peter Anvin" <[email protected]> wrote:
> > > >
> > > >> On 02/27/2014 02:31 PM, Steven Rostedt wrote:
> > > >>>
> > > >>> Yeah, something is getting mesed up.
> > > >>>
> > > >>
> > > >> What it *looks* like to me is that we try to nest the cr2 save/restore,
> > > >> which doesn't nest because it is a percpu variable.
> > > >>
> > > >> ... except in the x86-64 case, we *ALSO* save/restore cr2 inside
> > > >> entry_64.S, which makes the stuff in do_nmi completely redundant and
> > > >> there for no good reason.
> > > >
> > > > Peter, look at the code. That percpu cr2 is in a #ifdef CONFIG_X86_32
> > > > section. That is, it isn't even executed. That's i386 code. The only
> > > > place the cr2 is saved for x86_64 is in entry_64.S.
> > > >
> > >
> > > Right, egg on my face. However, I still think it would make more sense
> > > for it to nest the way entry_64.S does if at all possible.
> > >
> > > That makes this even more confusing, though. I would still like to see
> > > what happens with the patch I sent Vince.
> >
> > I'll try your patch momentarily, first I had some other changes I started
> > running before I left work (for some reason it recompiled the whole
> > kernel).
> >
> > 8: function: perf_output_begin
> > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e
> > 8: bputs: perf_output_begin: VMW: before rcu_dereference
> > 9: function: __do_page_fault
> > 9: function: down_read_trylock
> > 9: function: _cond_resched
> > 9: function: find_vma
> >
> > so it looks like the fault happens
> >
> > rcu_read_lock();
> >
> > 116 /*
> > 117 * For inherited events we send all the output towards the parent.
> > 118 */
> > 119 if (event->parent)
> > 120 event = event->parent;
> > 121
> >
> > somewhere between here
> >
> > 122 rb = rcu_dereference(event->rb);
> > 123 if (unlikely(!rb))
> > 124 goto out;
> >
> > and here
> >
> > 125
> > 126 if (unlikely(!rb->nr_pages))
> > 127 goto out;
> >
> > although if rcu locks do anything to turn off tracing then this could be
> > suspect.
>
> The most likely suspect is of course event->rb in the rcu_dereference.
> I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock()
> currently interact with tracing. ;-)

These are all perf related. You'll need to defer to Peter Zijlstra ;-)

-- Steve

2014-02-28 20:51:37

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace()

On Fri, 28 Feb 2014 21:46:21 +0100
Peter Zijlstra <[email protected]> wrote:

> On Fri, Feb 28, 2014 at 02:33:16PM -0500, Steven Rostedt wrote:
> > [ H. Peter, Here's the rename patch. I did not include your update. You
> > can add that first and then massage this patch on top. But this isn't
> > critical for mainline or stable, where as I believe your patch is. ]
> >
> > The tracing utilities sometimes need to read from userspace (stack tracing),
> > and to do this it has as special copy_from_user function called
> > copy_from_user_nmi(). Well, as tracers can call this from outside of
> > nmi context, the "_nmi" part is a misnomer and "_trace" is a better
> > name.
>
> NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly
> copy_from_user_nmi() is an NMI safe copy from user.
>
> Furthermore, there's exactly 0 trace users, so the proposed name is
> actively worse.

Heh, I consider perf and oprofile special tracers ;-)

-- Steve

2014-02-28 20:54:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 03:47:16PM -0500, Steven Rostedt wrote:
> > > I'll try your patch momentarily, first I had some other changes I started
> > > running before I left work (for some reason it recompiled the whole
> > > kernel).
> > >
> > > 8: function: perf_output_begin
> > > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e
> > > 8: bputs: perf_output_begin: VMW: before rcu_dereference
> > > 9: function: __do_page_fault
> > > 9: function: down_read_trylock
> > > 9: function: _cond_resched
> > > 9: function: find_vma
> > >
> > > so it looks like the fault happens
> > >
> > > rcu_read_lock();
> > >
> > > 116 /*
> > > 117 * For inherited events we send all the output towards the parent.
> > > 118 */
> > > 119 if (event->parent)
> > > 120 event = event->parent;
> > > 121
> > >
> > > somewhere between here
> > >
> > > 122 rb = rcu_dereference(event->rb);
> > > 123 if (unlikely(!rb))
> > > 124 goto out;
> > >
> > > and here
> > >
> > > 125
> > > 126 if (unlikely(!rb->nr_pages))
> > > 127 goto out;
> > >
> > > although if rcu locks do anything to turn off tracing then this could be
> > > suspect.
> >
> > The most likely suspect is of course event->rb in the rcu_dereference.
> > I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock()
> > currently interact with tracing. ;-)
>
> These are all perf related. You'll need to defer to Peter Zijlstra ;-)

I'm lost.. :/

pretty much all perf objects are RCU freed.

2014-02-28 20:55:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 08:15:11AM -0800, H. Peter Anvin wrote:
> On 02/28/2014 07:40 AM, Peter Zijlstra wrote:
> > On Fri, Feb 28, 2014 at 07:13:06AM -0800, H. Peter Anvin wrote:
> >> If I'm reading this right we end up going from the page fault
> >> tracepoint to copy_from_user_nmi() without going through NMI, and the
> >> cr2 corruption is obvious. I guess the assumption that only the NMI
> >> path needed to save cr2 is flawed?
> >
> > It was never assumed it would only go through NMI, but that it would be
> > NMI safe -- and as it turns out, it is that.
> >
> > What I did assume was that any other callsites would be safe, seeing how
> > they'd already be running in 'normal' contexts.
> >
> > I had not considered people putting tracepoints _that_ early in the
> > exception paths.
> >
> > Note that there's more tracepoints there than the one mentioned.
> >
>
> Well, I was talking about the assumption spelled out in the comment
> above copy_from_user_nmi() which pretty much states "cr2 is safe because
> cr2 is saved/restored in the NMI wrappers."

That is because we assumed NMI was the only thing that could interrupt a
fault handler before it would read CR2. Never thinking someone would put
a tracepoint there.

2014-02-28 20:56:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 11:29:46AM -0500, Steven Rostedt wrote:
> On Fri, 28 Feb 2014 08:15:11 -0800
> "H. Peter Anvin" <[email protected]> wrote:
>
> > Well, I was talking about the assumption spelled out in the comment
> > above copy_from_user_nmi() which pretty much states "cr2 is safe because
> > cr2 is saved/restored in the NMI wrappers."
>
> Yeah, it seems that the name "copy_from_user_nmi()" is a misnomer. As
> it can be called outside of nmi context. Perhaps we should have a
> copy_from_user_trace() that does the save and restore of the cr2.
>
> As that's the only place that faults, it may be the best answer.
>
> Arnaldo,
>
> Can you test this patch and see if it fixes the bug for you too. Vince
> already said it fixes it for him.
>
> I'm attaching it below (it's from H. Peter).
>
> Peter Z., as both Jiri's and my patch fixed the callers of the problem
> area, and as we have been discussing, there may be more problem areas,
> I'm thinking the best solution is to just use H. Peter's patch instead.
> And then we should rename it to copy_from_user_trace().

Like already said; _trace is an absolutely abysmal name. Also you
_really_ don't want an unconditional CR2 write in there, that's just
stupidly expensive.

Also, this function is called a _LOT_ under certain workloads, I don't
know how cheap a CR2 read is, but it had better be really cheap.

2014-02-28 20:58:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace()

On Fri, Feb 28, 2014 at 03:51:30PM -0500, Steven Rostedt wrote:
> On Fri, 28 Feb 2014 21:46:21 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Feb 28, 2014 at 02:33:16PM -0500, Steven Rostedt wrote:
> > > [ H. Peter, Here's the rename patch. I did not include your update. You
> > > can add that first and then massage this patch on top. But this isn't
> > > critical for mainline or stable, where as I believe your patch is. ]
> > >
> > > The tracing utilities sometimes need to read from userspace (stack tracing),
> > > and to do this it has as special copy_from_user function called
> > > copy_from_user_nmi(). Well, as tracers can call this from outside of
> > > nmi context, the "_nmi" part is a misnomer and "_trace" is a better
> > > name.
> >
> > NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly
> > copy_from_user_nmi() is an NMI safe copy from user.
> >
> > Furthermore, there's exactly 0 trace users, so the proposed name is
> > actively worse.
>
> Heh, I consider perf and oprofile special tracers ;-)

They're first and foremost profilers; which is an entirely different
thing. But other people seem to get confused on this distinction too.

2014-02-28 21:01:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace()

On Fri, 28 Feb 2014 21:46:21 +0100
Peter Zijlstra <[email protected]> wrote:

> NAK, spin_lock_irq() is very much an IRQ safe lock. Similarly

Actually, that's not true. It doesn't mean it is an IRQ safe lock. That
would mean it is safe to use in IRQs. What it does mean is that it
disables irqs and spin_unlock_irq() enables irqs. And we have
spin_lock_irqsave() which disables irqs and saves the flags for a
spin_lock_irqrestore()

> copy_from_user_nmi() is an NMI safe copy from user.
>
> Furthermore, there's exactly 0 trace users, so the proposed name is
> actively worse.

OK, I'll go with that. So instead of an _nmi() which it has nothing to
do with NMIs, how about making it like spin_lock_irq() and a more
descriptive name about what it does?

copy_from_user_savecr2()

?

-- Steve

2014-02-28 21:06:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 21:56:38 +0100
Peter Zijlstra <[email protected]> wrote:


> Like already said; _trace is an absolutely abysmal name. Also you
> _really_ don't want an unconditional CR2 write in there, that's just
> stupidly expensive.

But a read isn't. Which is why we only do a write if the copy caused a
page fault (which is already expensive).

The proposed patch will have:

if (cr2 != read_cr2())
write_cr2(cr2);

>
> Also, this function is called a _LOT_ under certain workloads, I don't
> know how cheap a CR2 read is, but it had better be really cheap.

That's a HPA question.

-- Steve

2014-02-28 21:16:13

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014, H. Peter Anvin wrote:

> Now we need to figure out if the reboot problem and the segfault problem
> are actually the same... I have a nasty feeling they might be different
> problems.

I'm currently running a script that tries setting EBP to all possible
32-bit pages and running the test to see if that triggers anything.

If I look at my notes the original reboot crash might have happened when I
had the fuzzer also generating overflow signals (my current tests do not)
so I'm not sure if having all this mess triggered from inside a signal
handler could make it reboot somehow.

I was away from the computer this afternoon and of course I have scores of
e-mails on this topic now with lots of competing patches. Is there one
in particular I'm supposed to be testing?

Thanks,

Vince

2014-02-28 21:17:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86: Rename copy_from_user_nmi() to copy_from_user_trace()

On Fri, Feb 28, 2014 at 04:01:39PM -0500, Steven Rostedt wrote:
> OK, I'll go with that. So instead of an _nmi() which it has nothing to
> do with NMIs,

Well, how is 'safe to use from NMI' nothing to do with NMIs?

> how about making it like spin_lock_irq() and a more
> descriptive name about what it does?
>
> copy_from_user_savecr2()

So how expensive are these CR2 reads? Because if they're more expensive
than one or two cycles (normal instructions) I'd really rather not put
them in there.

I seem to have missed what was wrong with Jiri's patch. Are we worried
about the function trace events?

So IF we're going to push them into this side of things; I'd really
rather see the CR2 foo in perf_callchain_user() and make
arch_perf_out_copy_user() a real function which includes the CR2 goo
too.

2014-02-28 21:17:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 09:54:09PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 28, 2014 at 03:47:16PM -0500, Steven Rostedt wrote:
> > > > I'll try your patch momentarily, first I had some other changes I started
> > > > running before I left work (for some reason it recompiled the whole
> > > > kernel).
> > > >
> > > > 8: function: perf_output_begin
> > > > 8: bprint: perf_output_begin: VMW: event type 2 config 2a st: 2c3e
> > > > 8: bputs: perf_output_begin: VMW: before rcu_dereference
> > > > 9: function: __do_page_fault
> > > > 9: function: down_read_trylock
> > > > 9: function: _cond_resched
> > > > 9: function: find_vma
> > > >
> > > > so it looks like the fault happens
> > > >
> > > > rcu_read_lock();
> > > >
> > > > 116 /*
> > > > 117 * For inherited events we send all the output towards the parent.
> > > > 118 */
> > > > 119 if (event->parent)
> > > > 120 event = event->parent;
> > > > 121
> > > >
> > > > somewhere between here
> > > >
> > > > 122 rb = rcu_dereference(event->rb);
> > > > 123 if (unlikely(!rb))
> > > > 124 goto out;
> > > >
> > > > and here
> > > >
> > > > 125
> > > > 126 if (unlikely(!rb->nr_pages))
> > > > 127 goto out;
> > > >
> > > > although if rcu locks do anything to turn off tracing then this could be
> > > > suspect.
> > >
> > > The most likely suspect is of course event->rb in the rcu_dereference.
> > > I have to defer to Steven on how rcu_read_lock() and rcu_read_unlock()
> > > currently interact with tracing. ;-)
> >
> > These are all perf related. You'll need to defer to Peter Zijlstra ;-)
>
> I'm lost.. :/
>
> pretty much all perf objects are RCU freed.

This code isn't running in idle context is it? If so, RCU will happily
free out from under it. CONFIG_PROVE_RCU should detect this sort of thing,
though.

Thanx, Paul

2014-02-28 21:27:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote:
> This code isn't running in idle context is it? If so, RCU will happily
> free out from under it. CONFIG_PROVE_RCU should detect this sort of thing,
> though.

Well, interrupts/NMIs can happen when idle, but the interrupt/NMI
entry code deals with the idle state AFAIK.

2014-02-28 21:30:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 16:18:23 -0500 (EST)
Vince Weaver <[email protected]> wrote:

> I was away from the computer this afternoon and of course I have scores of
> e-mails on this topic now with lots of competing patches. Is there one
> in particular I'm supposed to be testing?

I was poking fun at you on IRC for this exact reason:

<rostedt> poor Vince, I keep sending him new patches. "No, don't test this patch, now test this one. Oh wait, try this one instead"
* peterz sees Vince thinking: "stop... sending.. me.. damn... patches... already... !!!11!"
<rostedt> or at least, "Let me finish this test before I cancel it again for another damn patch"
<rostedt> then he's probably doing "I'm not going to run any tests now, until I wait a while to see if there's a new patch to test"

:-)

Anyway, I would say wait a bit while we sort this out. At least we have
a strong idea of what the bug is. Now we need to agree on the solution.

Vince, you have been really terrific in helping us figure out what was
wrong. I don't want to waste your time until we can all agree on what
the proper fix is. When we do, it would be great if you can test it.

But for now, have a great weekend and thanks for all your hard work in
narrowing down this issue.

-- Steve

2014-02-28 21:51:56

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote:
> On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote:
> > This code isn't running in idle context is it? If so, RCU will happily
> > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing,
> > though.
>
> Well, interrupts/NMIs can happen when idle, but the interrupt/NMI
> entry code deals with the idle state AFAIK.

Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried
about this being code invoked from some energy-efficiency driver or
another within the idle loop.

Thanx, Paul

2014-02-28 21:55:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 01:51:50PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote:
> > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote:
> > > This code isn't running in idle context is it? If so, RCU will happily
> > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing,
> > > though.
> >
> > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI
> > entry code deals with the idle state AFAIK.
>
> Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried
> about this being code invoked from some energy-efficiency driver or
> another within the idle loop.

Right, so any tracepoint can end up there; but I thought there was
already the rule that tracepoints needed RCU enabled.

2014-02-28 22:05:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 22:55:11 +0100
Peter Zijlstra <[email protected]> wrote:

> On Fri, Feb 28, 2014 at 01:51:50PM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote:
> > > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote:
> > > > This code isn't running in idle context is it? If so, RCU will happily
> > > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing,
> > > > though.
> > >
> > > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI
> > > entry code deals with the idle state AFAIK.
> >
> > Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried
> > about this being code invoked from some energy-efficiency driver or
> > another within the idle loop.
>
> Right, so any tracepoint can end up there; but I thought there was
> already the rule that tracepoints needed RCU enabled.

There is and we have special tracepoint caller for those cases we want a
tracepoint out of RCU scope. These will reactivate rcu in the
tracepoint code.

trace_<tp_name>_rcuidle(...)

-- Steve

2014-02-28 22:23:33

by Paul E. McKenney

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, Feb 28, 2014 at 05:05:53PM -0500, Steven Rostedt wrote:
> On Fri, 28 Feb 2014 22:55:11 +0100
> Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Feb 28, 2014 at 01:51:50PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 28, 2014 at 10:27:00PM +0100, Peter Zijlstra wrote:
> > > > On Fri, Feb 28, 2014 at 01:17:33PM -0800, Paul E. McKenney wrote:
> > > > > This code isn't running in idle context is it? If so, RCU will happily
> > > > > free out from under it. CONFIG_PROVE_RCU should detect this sort of thing,
> > > > > though.
> > > >
> > > > Well, interrupts/NMIs can happen when idle, but the interrupt/NMI
> > > > entry code deals with the idle state AFAIK.
> > >
> > > Yep, rcu_irq_enter() and rcu_nmi_enter() deal with that. More worried
> > > about this being code invoked from some energy-efficiency driver or
> > > another within the idle loop.
> >
> > Right, so any tracepoint can end up there; but I thought there was
> > already the rule that tracepoints needed RCU enabled.
>
> There is and we have special tracepoint caller for those cases we want a
> tracepoint out of RCU scope. These will reactivate rcu in the
> tracepoint code.
>
> trace_<tp_name>_rcuidle(...)

OK, I finally looked at the emails leading up to this in the thread...
I believe that I am doing premature debugging.

Thanx, Paul

2014-02-28 23:31:58

by Vince Weaver

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014, Steven Rostedt wrote:

> On Fri, 28 Feb 2014 16:18:23 -0500 (EST)
> Vince Weaver <[email protected]> wrote:
>
> > I was away from the computer this afternoon and of course I have scores of
> > e-mails on this topic now with lots of competing patches. Is there one
> > in particular I'm supposed to be testing?
>
> I was poking fun at you on IRC for this exact reason:
>
> <rostedt> poor Vince, I keep sending him new patches. "No, don't test this patch, now test this one. Oh wait, try this one instead"
> * peterz sees Vince thinking: "stop... sending.. me.. damn... patches... already... !!!11!"
> <rostedt> or at least, "Let me finish this test before I cancel it again for another damn patch"
> <rostedt> then he's probably doing "I'm not going to run any tests now, until I wait a while to see if there's a new patch to test"

Well while it might appear that I spend all of my days finding perf_event
bugs, I actually am a college professor so I do occasionally have to run
off to teach a class, meet with students, or write papers/grants for other
academics to reject.

It's nice others can reproduce the issue now, it would have saved me a lot
of trouble, although now in theory I have a much better handle of how to
use/abuse ftrace so I guess it was worth it.

Once the fix gets into git I'm sure the relentless perf_fuzzer will let us
know if there are any other issues left. I do look forward to the day
when I can leave it running overnight and have a clean syslog the next
morning.

Vince

2014-03-01 00:44:37

by H. Peter Anvin

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On 02/28/2014 03:34 PM, Vince Weaver wrote:
>
> Well while it might appear that I spend all of my days finding perf_event
> bugs, I actually am a college professor so I do occasionally have to run
> off to teach a class, meet with students, or write papers/grants for other
> academics to reject.
>

We really appreciate your help. This has been really critical.

>
> It's nice others can reproduce the issue now, it would have saved me a lot
> of trouble, although now in theory I have a much better handle of how to
> use/abuse ftrace so I guess it was worth it.
>
> Once the fix gets into git I'm sure the relentless perf_fuzzer will let us
> know if there are any other issues left. I do look forward to the day
> when I can leave it running overnight and have a clean syslog the next
> morning.
>

We all do, definitely, and your help has been a huge step in that direction.

-hpa

2014-03-01 03:36:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: perf_fuzzer compiled for x32 causes reboot

On Fri, 28 Feb 2014 18:34:00 -0500 (EST)
Vince Weaver <[email protected]> wrote:

> > I was poking fun at you on IRC for this exact reason:
> >
> > <rostedt> poor Vince, I keep sending him new patches. "No, don't test this patch, now test this one. Oh wait, try this one instead"
> > * peterz sees Vince thinking: "stop... sending.. me.. damn... patches... already... !!!11!"
> > <rostedt> or at least, "Let me finish this test before I cancel it again for another damn patch"
> > <rostedt> then he's probably doing "I'm not going to run any tests now, until I wait a while to see if there's a new patch to test"

I hope you were not offended by this. It was actually a testament to
the work you've given us.

>
> Well while it might appear that I spend all of my days finding perf_event
> bugs, I actually am a college professor so I do occasionally have to run
> off to teach a class, meet with students, or write papers/grants for other
> academics to reject.

But perf_event bug finder is a much more prestigious title than
"college professor" ;-)

>
> It's nice others can reproduce the issue now, it would have saved me a lot
> of trouble, although now in theory I have a much better handle of how to
> use/abuse ftrace so I guess it was worth it.

I always enjoy finding bugs in other people's code. That's usually the
best way to learn what their code does.

>
> Once the fix gets into git I'm sure the relentless perf_fuzzer will let us
> know if there are any other issues left. I do look forward to the day
> when I can leave it running overnight and have a clean syslog the next
> morning.

BTW, is the perf_fuzzer code posted somewhere? It sounds like it can be
really useful for us to do our own testing too.

Thanks,

-- Steve