Peter,
I recently encountered a strange problem using a 3.13-rc* kernel that
did not happen in a 3.12 kernel. When I ran the high_systime benchmark
of the AIM7 test suite, I saw the following errors:
Child terminated by signal #11
core dumped
Child terminated by signal #11
Child process called exit(), status = 139
core dumped
Child terminated by signal #11
This only happened when I monitored the benchmark with "perf record -g".
There was no problem if callchain recording was not enabled.
Adding debug code to the kernel showed the following stack trace:
Call Trace:
<NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
[<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
[<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8105f1f1>] force_sig_info+0x131/0x140
[<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
[<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
[<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
[<ffffffff81042cc9>] no_context+0x159/0x1f0
[<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
[<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
[<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
[<ffffffff81578fc2>] __do_page_fault+0x362/0x480
[<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
[<ffffffff815791be>] do_page_fault+0xe/0x10
[<ffffffff81575962>] page_fault+0x22/0x30
[<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
[<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
[<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
[<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
[<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
[<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
[<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
[<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
[<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
[<ffffffff8157664a>] nmi_handle+0x8a/0x170
[<ffffffff81576848>] default_do_nmi+0x68/0x210
[<ffffffff81576a80>] do_nmi+0x90/0xe0
[<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
[<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
<<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
[<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
[<ffffffff81578fc2>] __do_page_fault+0x362/0x480
[<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
[<ffffffff815791be>] do_page_fault+0xe/0x10
[<ffffffff81575962>] page_fault+0x22/0x30
---[ end trace 037bf09d279751ec ]---
So this is a double page fault. Looking at the relevant changes in
the 3.13 kernel, I spotted the following patch that modified the
perf_callchain_user() function shown in the stack trace above:
perf: Fix arch_perf_out_copy_user default
@@ -2041,7 +2041,7 @@ perf_callchain_user(struct perf_callchain_entry
*entry, struct pt_regs *regs)
frame.return_address = 0;
bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
- if (bytes != sizeof(frame))
+ if (bytes != 0)
break;
if (!valid_user_frame(fp, sizeof(frame)))
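For context, the frame-walking loop around that hunk in
arch/x86/kernel/cpu/perf_event.c looks roughly like this (paraphrased
from my 3.13-rc tree and trimmed, so treat it as a sketch rather than
an exact copy):

	fp = (void __user *)regs->bp;
	while (entry->nr < PERF_MAX_STACK_DEPTH) {
		unsigned long bytes;

		frame.next_frame     = NULL;
		frame.return_address = 0;

		/* read one stack_frame from the user stack pointer fp */
		bytes = copy_from_user_nmi(&frame, fp, sizeof(frame));
		if (bytes != 0)
			break;

		if (!valid_user_frame(fp, sizeof(frame)))
			break;

		perf_callchain_store(entry, frame.return_address);
		fp = frame.next_frame;
	}

So when the copy faults, the loop just breaks out and the callchain is
truncated.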
I wondered if it was the cause of the SIGSEGV. Please let me know your
thoughts on that.
-Longman
On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
> Peter,
>
> Call Trace:
> <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
> [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
> [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
> [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
> [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
> [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
> [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
> [<ffffffff81042cc9>] no_context+0x159/0x1f0
> [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
> [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
> [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
> [<ffffffff815791be>] do_page_fault+0xe/0x10
> [<ffffffff81575962>] page_fault+0x22/0x30
> [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
> [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
> [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
> [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
> [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
> [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
> [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
> [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
> [<ffffffff8157664a>] nmi_handle+0x8a/0x170
> [<ffffffff81576848>] default_do_nmi+0x68/0x210
> [<ffffffff81576a80>] do_nmi+0x90/0xe0
> [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
> [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
> [<ffffffff815791be>] do_page_fault+0xe/0x10
> [<ffffffff81575962>] page_fault+0x22/0x30
> ---[ end trace 037bf09d279751ec ]---
>
> So this is a double page fault. Looking at the relevant changes in
> the 3.13 kernel, I spotted the following patch that modified the
> perf_callchain_user() function shown in the stack trace above:
>
Hurm, that's an expected double fault, not something we should take the
process down for.
I'll have to look at how all that works for a bit.
On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
> > Peter,
> >
> > Call Trace:
> > <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
> > [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
> > [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
> > [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
> > [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
> > [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
> > [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
> > [<ffffffff81042cc9>] no_context+0x159/0x1f0
> > [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
> > [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
> > [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
> > [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
> > [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
> > [<ffffffff815791be>] do_page_fault+0xe/0x10
> > [<ffffffff81575962>] page_fault+0x22/0x30
> > [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
> > [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
> > [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
> > [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
> > [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
> > [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
> > [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
> > [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
> > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
> > [<ffffffff8157664a>] nmi_handle+0x8a/0x170
> > [<ffffffff81576848>] default_do_nmi+0x68/0x210
> > [<ffffffff81576a80>] do_nmi+0x90/0xe0
> > [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
> > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
> > [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
> > [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
> > [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
> > [<ffffffff815791be>] do_page_fault+0xe/0x10
> > [<ffffffff81575962>] page_fault+0x22/0x30
> > ---[ end trace 037bf09d279751ec ]---
> >
> > So this is a double page fault. Looking at the relevant changes in
> > the 3.13 kernel, I spotted the following patch that modified the
> > perf_callchain_user() function shown in the stack trace above:
> >
>
> Hurm, that's an expected double fault, not something we should take the
> process down for.
>
> I'll have to look at how all that works for a bit.
How easily can you reproduce this? Could you test something like the
below, which would allow us to take double faults from NMI context.
---
arch/x86/mm/fault.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ff85bb8dd69..18c498d4274d 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs)) {
- if (current_thread_info()->sig_on_uaccess_error && signal) {
+ if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
tsk->thread.trap_nr = X86_TRAP_PF;
tsk->thread.error_code = error_code | PF_USER;
tsk->thread.cr2 = address;
On Fri, Jan 10, 2014 at 06:02:23PM +0100, Peter Zijlstra wrote:
> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
> > > Peter,
> > >
> > > Call Trace:
> > > <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
> > > [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
> > > [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
> > > [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
> > > [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
> > > [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
> > > [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
> > > [<ffffffff81042cc9>] no_context+0x159/0x1f0
> > > [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
> > > [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
> > > [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
> > > [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
> > > [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
> > > [<ffffffff815791be>] do_page_fault+0xe/0x10
> > > [<ffffffff81575962>] page_fault+0x22/0x30
> > > [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
> > > [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
> > > [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
> > > [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
> > > [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
> > > [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
> > > [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
> > > [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > > [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
> > > [<ffffffff8157664a>] nmi_handle+0x8a/0x170
> > > [<ffffffff81576848>] default_do_nmi+0x68/0x210
> > > [<ffffffff81576a80>] do_nmi+0x90/0xe0
> > > [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
> > > <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
> > > [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
> > > [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
> > > [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
> > > [<ffffffff815791be>] do_page_fault+0xe/0x10
> > > [<ffffffff81575962>] page_fault+0x22/0x30
> > > ---[ end trace 037bf09d279751ec ]---
> > >
> > > So this is a double page fault. Looking at the relevant changes in
> > > the 3.13 kernel, I spotted the following patch that modified the
> > > perf_callchain_user() function shown in the stack trace above:
> > >
> >
> > Hurm, that's an expected double fault, not something we should take the
> > process down for.
> >
> > I'll have to look at how all that works for a bit.
Andy introduced all this in 4fc3490114bb ("x86-64: Set siginfo and
context on vsyscall emulation faults").
It looks like your initial userspace fault hit the magic button and ends
up in emulate_vsyscall. Right at that point we trigger a PMI, which
tries to do a stack-trace. That stack-trace also stumbles into unmapped
memory (might be the same) and faults again.
Now at that point, we usually just give up on the callchain and proceed
like normal, however because of this double fault emulate-vsyscall
SIGSEGV magic you lose.
So the below might well be a valid fix.. Anybody? Andy?
> How easily can you reproduce this? Could you test something like the
> below, which would allow us to take double faults from NMI context.
>
> ---
> arch/x86/mm/fault.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ff85bb8dd69..18c498d4274d 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>
> /* Are we prepared to handle this kernel fault? */
> if (fixup_exception(regs)) {
> - if (current_thread_info()->sig_on_uaccess_error && signal) {
> + if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
> tsk->thread.trap_nr = X86_TRAP_PF;
> tsk->thread.error_code = error_code | PF_USER;
> tsk->thread.cr2 = address;
On Fri, Jan 10, 2014 at 9:41 AM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 10, 2014 at 06:02:23PM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote:
>> > On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
>> > > Peter,
>> > >
>> > > Call Trace:
>> > > <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
>> > > [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
>> > > [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
>> > > [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
>> > > [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
>> > > [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
>> > > [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
>> > > [<ffffffff81042cc9>] no_context+0x159/0x1f0
>> > > [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
>> > > [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
>> > > [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>> > > [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>> > > [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
>> > > [<ffffffff815791be>] do_page_fault+0xe/0x10
>> > > [<ffffffff81575962>] page_fault+0x22/0x30
>> > > [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
>> > > [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
>> > > [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
>> > > [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
>> > > [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
>> > > [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
>> > > [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
>> > > [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
>> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>> > > [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
>> > > [<ffffffff8157664a>] nmi_handle+0x8a/0x170
>> > > [<ffffffff81576848>] default_do_nmi+0x68/0x210
>> > > [<ffffffff81576a80>] do_nmi+0x90/0xe0
>> > > [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
>> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>> > > [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>> > > <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
>> > > [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>> > > [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>> > > [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
>> > > [<ffffffff815791be>] do_page_fault+0xe/0x10
>> > > [<ffffffff81575962>] page_fault+0x22/0x30
>> > > ---[ end trace 037bf09d279751ec ]---
>> > >
>> > > So this is a double page fault. Looking at the relevant changes in
>> > > the 3.13 kernel, I spotted the following patch that modified the
>> > > perf_callchain_user() function shown in the stack trace above:
>> > >
>> >
>> > Hurm, that's an expected double fault, not something we should take the
>> > process down for.
>> >
>> > I'll have to look at how all that works for a bit.
>
> Andy introduced all this in 4fc3490114bb ("x86-64: Set siginfo and
> context on vsyscall emulation faults").
>
> It looks like your initial userspace fault hit the magic button and ends
> up in emulate_vsyscall. Right at that point we trigger a PMI, which
> tries to do a stack-trace. That stack-trace also stumbles into unmapped
> memory (might be the same) and faults again.
>
> Now at that point, we usually just give up on the callchain and proceed
> like normal, however because of this double fault emulate-vsyscall
> SIGSEGV magic you lose.
>
> So the below might well be a valid fix.. Anybody? Andy?
Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
(there's nothing particularly special about NMIs here, I think) would
try to access user memory. The fix below looks okay, but IMO it needs
a big fat comment explaining what's going on.
Is there a way to ask whether the previous entry into the kernel came
from user space? The valid "sig_on_uaccess_error" case happens when
the current fault was triggered by a fault from userspace. The
invalid case (and any invalid case from, say, an int3 that a
tracepoint stuck in there) would be a page fault triggered by a fault
handler that in turn started in kernel space (in particular, in
emulate_vsyscall).
For that matter, why does current_thread_info() work from an NMI at
all? Does the NMI vector not have its own stack? The call that
installs it is set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK).
In any case, this at least needs a comment. I don't see why this same
bug couldn't be triggered by non-NMI based tracing mechanisms, though.
Sigh, corner cases of corner cases...
>
>> How easily can you reproduce this? Could you test something like the
>> below, which would allow us to take double faults from NMI context.
>>
>> ---
>> arch/x86/mm/fault.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 9ff85bb8dd69..18c498d4274d 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>>
>> /* Are we prepared to handle this kernel fault? */
>> if (fixup_exception(regs)) {
>> - if (current_thread_info()->sig_on_uaccess_error && signal) {
>> + if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
>> tsk->thread.trap_nr = X86_TRAP_PF;
>> tsk->thread.error_code = error_code | PF_USER;
>> tsk->thread.cr2 = address;
--Andy
On 01/10/2014 12:02 PM, Peter Zijlstra wrote:
> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote:
>> On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
>>> Peter,
>>>
>>> Call Trace:
>>> <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
>>> [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
>>> [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
>>> [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
>>> [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
>>> [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
>>> [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
>>> [<ffffffff81042cc9>] no_context+0x159/0x1f0
>>> [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
>>> [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
>>> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>>> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>>> [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
>>> [<ffffffff815791be>] do_page_fault+0xe/0x10
>>> [<ffffffff81575962>] page_fault+0x22/0x30
>>> [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
>>> [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
>>> [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
>>> [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
>>> [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
>>> [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
>>> [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
>>> [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>> [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
>>> [<ffffffff8157664a>] nmi_handle+0x8a/0x170
>>> [<ffffffff81576848>] default_do_nmi+0x68/0x210
>>> [<ffffffff81576a80>] do_nmi+0x90/0xe0
>>> [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>> <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
>>> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>>> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>>> [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
>>> [<ffffffff815791be>] do_page_fault+0xe/0x10
>>> [<ffffffff81575962>] page_fault+0x22/0x30
>>> ---[ end trace 037bf09d279751ec ]---
>>>
>>> So this is a double page fault. Looking at the relevant changes in
>>> the 3.13 kernel, I spotted the following patch that modified the
>>> perf_callchain_user() function shown in the stack trace above:
>>>
>> Hurm, that's an expected double fault, not something we should take the
>> process down for.
>>
>> I'll have to look at how all that works for a bit.
> How easily can you reproduce this? Could you test something like the
> below, which would allow us to take double faults from NMI context.
The error is readily reproducible in my current setup.
> ---
> arch/x86/mm/fault.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ff85bb8dd69..18c498d4274d 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>
> /* Are we prepared to handle this kernel fault? */
> if (fixup_exception(regs)) {
> - if (current_thread_info()->sig_on_uaccess_error && signal) {
> + if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
> tsk->thread.trap_nr = X86_TRAP_PF;
> tsk->thread.error_code = error_code | PF_USER;
> tsk->thread.cr2 = address;
Yes, this change fixed the error that I got. I no longer see SIGSEGV
when I run the test.
I did try to back out your "perf: Fix arch_perf_out_copy_user default"
patch, but it didn't fix the problem.
Thanks for the quick turnaround!
-Longman
On 01/10/2014 01:54 PM, Andy Lutomirski wrote:
> On Fri, Jan 10, 2014 at 9:41 AM, Peter Zijlstra<[email protected]> wrote:
>> On Fri, Jan 10, 2014 at 06:02:23PM +0100, Peter Zijlstra wrote:
>>> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote:
>>>> On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
>>>>> Peter,
>>>>>
>>>>> Call Trace:
>>>>> <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
>>>>> [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
>>>>> [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
>>>>> [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
>>>>> [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
>>>>> [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
>>>>> [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
>>>>> [<ffffffff81042cc9>] no_context+0x159/0x1f0
>>>>> [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
>>>>> [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
>>>>> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>>>>> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>>>>> [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
>>>>> [<ffffffff815791be>] do_page_fault+0xe/0x10
>>>>> [<ffffffff81575962>] page_fault+0x22/0x30
>>>>> [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
>>>>> [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
>>>>> [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
>>>>> [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
>>>>> [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
>>>>> [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
>>>>> [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
>>>>> [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>> [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
>>>>> [<ffffffff8157664a>] nmi_handle+0x8a/0x170
>>>>> [<ffffffff81576848>] default_do_nmi+0x68/0x210
>>>>> [<ffffffff81576a80>] do_nmi+0x90/0xe0
>>>>> [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>> <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
>>>>> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>>>>> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>>>>> [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
>>>>> [<ffffffff815791be>] do_page_fault+0xe/0x10
>>>>> [<ffffffff81575962>] page_fault+0x22/0x30
>>>>> ---[ end trace 037bf09d279751ec ]---
>>>>>
>>>>> So this is a double page fault. Looking at the relevant changes in
>>>>> the 3.13 kernel, I spotted the following patch that modified the
>>>>> perf_callchain_user() function shown in the stack trace above:
>>>>>
>>>> Hurm, that's an expected double fault, not something we should take the
>>>> process down for.
>>>>
>>>> I'll have to look at how all that works for a bit.
>> Andy introduced all this in 4fc3490114bb ("x86-64: Set siginfo and
>> context on vsyscall emulation faults").
>>
>> It looks like your initial userspace fault hit the magic button and ends
>> up in emulate_vsyscall. Right at that point we trigger a PMI, which
>> tries to do a stack-trace. That stack-trace also stumbles into unmapped
>> memory (might be the same) and faults again.
>>
>> Now at that point, we usually just give up on the callchain and proceed
>> like normal, however because of this double fault emulate-vsyscall
>> SIGSEGV magic you lose.
>>
>> So the below might well be a valid fix.. Anybody? Andy?
> Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
> (there's nothing particularly special about NMIs here, I think) would
> try to access user memory. The fix below looks okay, but IMO it needs
> a big fat comment explaining what's going on.
>
> Is there a way to ask whether the previous entry into the kernel came
> from user space? The valid "sig_on_uaccess_error" case happens when
> the current fault was triggered by a fault from userspace. The
> invalid case (and any invalid case from, say, an int3 that a
> tracepoint stuck in there) would be a page fault triggered by a fault
> handler that in turn started in kernel space (in particular, in
> emulate_vsyscall).
The processes that got the SIGSEGV were all running shell scripts. I am
not totally sure that they were running in user space when getting the
PMIs, but that is likely the case.
-Longman
On Fri, Jan 10, 2014 at 11:43 AM, Waiman Long <[email protected]> wrote:
> On 01/10/2014 01:54 PM, Andy Lutomirski wrote:
>>
>> On Fri, Jan 10, 2014 at 9:41 AM, Peter Zijlstra<[email protected]>
>> wrote:
>>>
>>> On Fri, Jan 10, 2014 at 06:02:23PM +0100, Peter Zijlstra wrote:
>>>>
>>>> On Fri, Jan 10, 2014 at 05:58:22PM +0100, Peter Zijlstra wrote:
>>>>>
>>>>> On Fri, Jan 10, 2014 at 10:29:13AM -0500, Waiman Long wrote:
>>>>>>
>>>>>> Peter,
>>>>>>
>>>>>> Call Trace:
>>>>>> <NMI> [<ffffffff815710af>] dump_stack+0x49/0x62
>>>>>> [<ffffffff8104e3bc>] warn_slowpath_common+0x8c/0xc0
>>>>>> [<ffffffff8104e40a>] warn_slowpath_null+0x1a/0x20
>>>>>> [<ffffffff8105f1f1>] force_sig_info+0x131/0x140
>>>>>> [<ffffffff81042a4f>] force_sig_info_fault+0x5f/0x70
>>>>>> [<ffffffff8106d8da>] ? search_exception_tables+0x2a/0x50
>>>>>> [<ffffffff81043b3d>] ? fixup_exception+0x1d/0x70
>>>>>> [<ffffffff81042cc9>] no_context+0x159/0x1f0
>>>>>> [<ffffffff81042e8d>] __bad_area_nosemaphore+0x12d/0x230
>>>>>> [<ffffffff81042e8d>] ? __bad_area_nosemaphore+0x12d/0x230
>>>>>> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>>>>>> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>>>>>> [<ffffffff81578fc2>] ? __do_page_fault+0x362/0x480
>>>>>> [<ffffffff815791be>] do_page_fault+0xe/0x10
>>>>>> [<ffffffff81575962>] page_fault+0x22/0x30
>>>>>> [<ffffffff815817e4>] ? bad_to_user+0x5e/0x66b
>>>>>> [<ffffffff81285316>] copy_from_user_nmi+0x76/0x90
>>>>>> [<ffffffff81017a20>] perf_callchain_user+0xd0/0x360
>>>>>> [<ffffffff8111f64f>] perf_callchain+0x1af/0x1f0
>>>>>> [<ffffffff81117693>] perf_prepare_sample+0x2f3/0x3a0
>>>>>> [<ffffffff8111a2af>] __perf_event_overflow+0x10f/0x220
>>>>>> [<ffffffff8111ab14>] perf_event_overflow+0x14/0x20
>>>>>> [<ffffffff8101f69e>] intel_pmu_handle_irq+0x1de/0x3c0
>>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>>> [<ffffffff81576e64>] perf_event_nmi_handler+0x34/0x60
>>>>>> [<ffffffff8157664a>] nmi_handle+0x8a/0x170
>>>>>> [<ffffffff81576848>] default_do_nmi+0x68/0x210
>>>>>> [<ffffffff81576a80>] do_nmi+0x90/0xe0
>>>>>> [<ffffffff81575c67>] end_repeat_nmi+0x1e/0x2e
>>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>>> [<ffffffff81008e44>] ? emulate_vsyscall+0x144/0x390
>>>>>> <<EOE>> [<ffffffff81042f7d>] __bad_area_nosemaphore+0x21d/0x230
>>>>>> [<ffffffff81042fa3>] bad_area_nosemaphore+0x13/0x20
>>>>>> [<ffffffff81578fc2>] __do_page_fault+0x362/0x480
>>>>>> [<ffffffff8113cfbc>] ? vm_mmap_pgoff+0xbc/0xe0
>>>>>> [<ffffffff815791be>] do_page_fault+0xe/0x10
>>>>>> [<ffffffff81575962>] page_fault+0x22/0x30
>>>>>> ---[ end trace 037bf09d279751ec ]---
>>>>>>
>>>>>> So this is a double page fault. Looking at the relevant changes in
>>>>>> the 3.13 kernel, I spotted the following patch that modified the
>>>>>> perf_callchain_user() function shown in the stack trace above:
>>>>>>
>>>>> Hurm, that's an expected double fault, not something we should take the
>>>>> process down for.
>>>>>
>>>>> I'll have to look at how all that works for a bit.
>>>
>>> Andy introduced all this in 4fc3490114bb ("x86-64: Set siginfo and
>>> context on vsyscall emulation faults").
>>>
>>> It looks like your initial userspace fault hit the magic button and ends
>>> up in emulate_vsyscall. Right at that point we trigger a PMI, which
>>> tries to do a stack-trace. That stack-trace also stumbles into unmapped
>>> memory (might be the same) and faults again.
>>>
>>> Now at that point, we usually just give up on the callchain and proceed
>>> like normal, however because of this double fault emulate-vsyscall
>>> SIGSEGV magic you lose.
>>>
>>> So the below might well be a valid fix.. Anybody? Andy?
>>
>> Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
>> (there's nothing particularly special about NMIs here, I think) would
>> try to access user memory. The fix below looks okay, but IMO it needs
>> a big fat comment explaining what's going on.
>>
>> Is there a way to ask whether the previous entry into the kernel came
>> from user space? The valid "sig_on_uaccess_error" case happens when
>> the current fault was triggered by a fault from userspace. The
>> invalid case (and any invalid case from, say, an int3 that a
>> tracepoint stuck in there) would be a page fault triggered by a fault
>> handler that in turn started in kernel space (in particular, in
>> emulate_vsyscall).
>
>
> The processes that got the SIGSEGV were all running shell scripts. I am not
> totally sure that they were running in user space when getting the PMIs, but
> that is likely the case.
The error you're seeing is:
- Userspace code calls into the vsyscall page (e.g. old code calling
gettimeofday).
- Calls to the vsyscall page trap, so you end up executing in kernel
space in emulate_vsyscall.
- perf is running, so you end up in an NMI handler that triggers
while sig_on_uaccess_error is true.
- The perf callchain code tries to read from a bad userspace pointer
(not sure why -- the ip value in the vsyscall page *is* readable).
That traps (as expected), but the trap handler injects SIGSEGV due to
sig_on_uaccess_error==1.
The cleanest fix I can think of is to change the condition to
sig_on_uaccess_error==1 && (the get_user caller was a single kernel
entry away from userspace). In the failure you're seeing, there was
an NMI in there.
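If it helps, a minimal reproducer ought to look something like the
sketch below. This is untested, and the file name and loop count are
made up; the only real constant is 0xffffffffff600000, which is the
fixed address of gettimeofday() in the legacy vsyscall page:

/* vsyscall-repro.c -- hypothetical, untested sketch */
#include <stdio.h>
#include <sys/time.h>

typedef int (*vgtod_fn)(struct timeval *tv, struct timezone *tz);

int main(void)
{
	/* gettimeofday() entry point in the legacy vsyscall page */
	vgtod_fn vgtod = (vgtod_fn)0xffffffffff600000UL;
	struct timeval tv;
	unsigned long i;

	/* hammer the vsyscall page so PMIs land inside emulate_vsyscall() */
	for (i = 0; i < 100000000UL; i++)
		vgtod(&tv, NULL);

	printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);
	return 0;
}

Build it with gcc, run it under "perf record -g" on a kernel in the
default vsyscall=emulate mode, and if this analysis is right the
process should eventually catch the bogus SIGSEGV.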
--Andy
On Fri, Jan 10, 2014 at 10:54:52AM -0800, Andy Lutomirski wrote:
> Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
> (there's nothing particularly special about NMIs here, I think) would
> try to access user memory. The fix below looks okay, but IMO it needs
> a big fat comment explaining what's going on.
Agreed on both points, we can equally trigger this using software
timers, so any interrupt must be exempt. And yes a comment!
> Is there a way to ask whether the previous entry into the kernel came
> from user space?
Not afaik, but in_interrupt() gets us any interrupt context, whatever
remains must be task context. Still not quite the same, but close enough
I think.
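For reference, the relevant definitions (include/linux/hardirq.h /
preempt_mask.h in that era; quoting from memory, so double-check the
exact file) are roughly:

#define hardirq_count()	(preempt_count() & HARDIRQ_MASK)
#define softirq_count()	(preempt_count() & SOFTIRQ_MASK)
#define irq_count()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK | NMI_MASK))

#define in_irq()	(hardirq_count())
#define in_softirq()	(softirq_count())
#define in_interrupt()	(irq_count())
#define in_nmi()	(preempt_count() & NMI_MASK)

That is, in_interrupt() already includes NMI context on top of hardirq
and softirq, which is why it is the broader test to use here.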
> The valid "sig_on_uaccess_error" case happens when
> the current fault was triggered by a fault from userspace. The
> invalid case (and any invalid case from, say, an int3 that a
> tracepoint stuck in there) would be a page fault triggered by a fault
> handler that in turn started in kernel space (in particular, in
> emulate_vsyscall).
>
> For that matter, why does current_thread_info() work from an NMI at
> all? Does the NMI vector not have its own stack? The call that
> installs it is set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK).
NMIs do have their own stack, however x86_64 grabs kernel_stack from a
per-cpu variable, not rsp.
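That is, current_thread_info() there is roughly (3.13-era
arch/x86/include/asm/thread_info.h, quoted from memory, so take it as
a sketch):

static inline struct thread_info *current_thread_info(void)
{
	struct thread_info *ti;

	/*
	 * kernel_stack is a per-cpu copy of the current task's stack top,
	 * updated at context switch, so this still points at the
	 * interrupted task's thread_info even while running on the
	 * NMI (IST) stack.
	 */
	ti = (void *)(this_cpu_read_stable(kernel_stack) +
		      KERNEL_STACK_OFFSET - THREAD_SIZE);
	return ti;
}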
> In any case, this at least needs a comment. I don't see why this same
> bug couldn't be triggered by non-NMI based tracing mechanisms, though.
>
> Sigh, corner cases of corner cases...
:-)
Something like this perhaps?
---
Subject: x86, mm: Allow double faults from interrupts
Waiman managed to trigger a PMI while in a emulate_vsyscall() fault, the
PMI in turn managed to trigger a fault while obtaining a stack trace.
This triggered the double fault logic and killed the process dead.
Fix this by explicitly excluding interrupts from the double fault logic.
Reported-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/mm/fault.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ff85bb8dd69..4c8e32986aad 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs)) {
+ /*
+ * Any interrupt that takes a fault gets the fixup. This
+ * makes the below double fault logic only apply to a
+ * task double faulting from task context.
+ */
+ if (in_interrupt())
+ return;
+
+ /*
+ * Per the above we're !in_interrupt(), aka. task context.
+ *
+ * In this case we need to make sure we're not double faulting
+ * through the emulate_vsyscall() logic.
+ */
if (current_thread_info()->sig_on_uaccess_error && signal) {
tsk->thread.trap_nr = X86_TRAP_PF;
tsk->thread.error_code = error_code | PF_USER;
@@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* XXX: hwpoison faults will set the wrong code. */
force_sig_info_fault(signal, si_code, address, tsk, 0);
}
+
+ /*
+ * Barring that, we can do the fixup and be happy.
+ */
return;
}
On Fri, Jan 10, 2014 at 02:37:10PM -0500, Waiman Long wrote:
> >@@ -641,7 +641,7 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> >
> > /* Are we prepared to handle this kernel fault? */
> > if (fixup_exception(regs)) {
> >- if (current_thread_info()->sig_on_uaccess_error && signal) {
> >+ if (!in_nmi() && current_thread_info()->sig_on_uaccess_error && signal) {
> > tsk->thread.trap_nr = X86_TRAP_PF;
> > tsk->thread.error_code = error_code | PF_USER;
> > tsk->thread.cr2 = address;
>
> Yes, this change fixed the error that I got. I no longer see SIGSEGV when I
> run the test.
Awesome, I just sent a more elaborate version that should have the same
effect.
> I did try to back out your "perf: Fix arch_perf_out_copy_user default"
> patch, but it didn't fix the problem.
The culprit was: e00b12e64be9 ("perf/x86: Further optimize
copy_from_user_nmi()")
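That commit dropped the old manual page-table walk and routed
copy_from_user_nmi() through the regular uaccess exception-table path;
from memory it now looks roughly like this (a sketch, not the exact
committed code):

unsigned long
copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
{
	unsigned long ret;

	if (__range_not_ok(from, n, TASK_SIZE))
		return n;	/* report everything as not copied */

	/*
	 * A fault on the user pointer now goes through no_context() and
	 * fixup_exception(), which is exactly where the
	 * sig_on_uaccess_error logic fires from NMI context.
	 */
	pagefault_disable();
	ret = __copy_from_user_inatomic(to, from, n);
	pagefault_enable();

	return ret;
}

Before that, we never took a real #PF for the user copy in NMI context,
which is presumably why this corner case stayed hidden until now.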
On Fri, Jan 10, 2014 at 11:56:07AM -0800, Andy Lutomirski wrote:
> - The perf callchain code tries to read from a bad userspace pointer
> (not sure why -- the ip value in the vsyscall page *is* readable).
> That traps (as expected), but the trap handler injects SIGSEGV due to
> sig_on_uaccess_error==1.
So the perf thing tries to walk the entire stack in order to obtain a
callchain. It's fairly common to hit crap when attempting this :-)
On Fri, Jan 10, 2014 at 12:06 PM, Peter Zijlstra <[email protected]> wrote:
> On Fri, Jan 10, 2014 at 10:54:52AM -0800, Andy Lutomirski wrote:
>> Yuck -- when I wrote that thing, I hadn't imagined that an interrupt
>> (there's nothing particularly special about NMIs here, I think) would
>> try to access user memory. The fix below looks okay, but IMO it needs
>> a big fat comment explaining what's going on.
>
> Agreed on both points, we can equally trigger this using software
> timers, so any interrupt must be exempt. And yes a comment!
>
>> Is there a way to ask whether the previous entry into the kernel came
>> from user space?
>
> Not afaik, but in_interrupt() gets us any interrupt context, whatever
> remains must be task context. Still not quite the same, but close enough
> I think.
>
>> The valid "sig_on_uaccess_error" case happens when
>> the current fault was triggered by a fault from userspace. The
>> invalid case (and any invalid case from, say, an int3 that a
>> tracepoint stuck in there) would be a page fault triggered by a fault
>> handler that in turn started in kernel space (in particular, in
>> emulate_vsyscall).
>>
>> For that matter, why does current_thread_info() work from an NMI at
>> all? Does the NMI vector not have its own stack? The call that
>> installs it is set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK).
>
> NMIs do have their own stack, however x86_64 grabs kernel_stack from a
> per-cpu variable, not rsp.
>
>> In any case, this at least needs a comment. I don't see why this same
>> bug couldn't be triggered by non-NMI based tracing mechanisms, though.
>>
>> Sigh, corner cases of corner cases...
>
> :-)
>
> Something like this perhaps?
>
> ---
> Subject: x86, mm: Allow double faults from interrupts
>
> Waiman managed to trigger a PMI while in a emulate_vsyscall() fault, the
> PMI in turn managed to trigger a fault while obtaining a stack trace.
> This triggered the double fault logic and killed the process dead.
>
> Fix this by explicitly excluding interrupts from the double fault logic.
>
> Reported-by: Waiman Long <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> arch/x86/mm/fault.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ff85bb8dd69..4c8e32986aad 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>
> /* Are we prepared to handle this kernel fault? */
> if (fixup_exception(regs)) {
> + /*
> + * Any interrupt that takes a fault gets the fixup. This
> + * makes the below double fault logic only apply to a
> + * task double faulting from task context.
> + */
> + if (in_interrupt())
> + return;
> +
> + /*
> + * Per the above we're !in_interrupt(), aka. task context.
> + *
> + * In this case we need to make sure we're not double faulting
> + * through the emulate_vsyscall() logic.
> + */
> if (current_thread_info()->sig_on_uaccess_error && signal) {
> tsk->thread.trap_nr = X86_TRAP_PF;
> tsk->thread.error_code = error_code | PF_USER;
> @@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> /* XXX: hwpoison faults will set the wrong code. */
> force_sig_info_fault(signal, si_code, address, tsk, 0);
> }
> +
> + /*
> + * Barring that, we can do the fixup and be happy.
> + */
> return;
> }
>
My only real nit is the same thing that confused me the first time I
read your email -- when I see "double fault", I think #DF. How about
something like "It's possible for an interrupt to occur while
sig_on_uaccess_error is set. In that case, uaccess faults that are
caused by the interrupt handler should not send signals."?
--Andy
On 01/10/2014 03:06 PM, Peter Zijlstra wrote:
>
> :-)
>
> Something like this perhaps?
>
> ---
> Subject: x86, mm: Allow double faults from interrupts
>
> Waiman managed to trigger a PMI while in a emulate_vsyscall() fault, the
> PMI in turn managed to trigger a fault while obtaining a stack trace.
> This triggered the double fault logic and killed the process dead.
>
> Fix this by explicitly excluding interrupts from the double fault logic.
>
> Reported-by: Waiman Long<[email protected]>
> Signed-off-by: Peter Zijlstra<[email protected]>
> ---
> arch/x86/mm/fault.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 9ff85bb8dd69..4c8e32986aad 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code,
>
> /* Are we prepared to handle this kernel fault? */
> if (fixup_exception(regs)) {
> + /*
> + * Any interrupt that takes a fault gets the fixup. This
> + * makes the below double fault logic only apply to a
> + * task double faulting from task context.
> + */
> + if (in_interrupt())
> + return;
> +
> + /*
> + * Per the above we're !in_interrupt(), aka. task context.
> + *
> + * In this case we need to make sure we're not double faulting
> + * through the emulate_vsyscall() logic.
> + */
> if (current_thread_info()->sig_on_uaccess_error && signal) {
> tsk->thread.trap_nr = X86_TRAP_PF;
> tsk->thread.error_code = error_code | PF_USER;
> @@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> /* XXX: hwpoison faults will set the wrong code. */
> force_sig_info_fault(signal, si_code, address, tsk, 0);
> }
> +
> + /*
> + * Barring that, we can do the fixup and be happy.
> + */
> return;
> }
>
Are you going to send out an official patch to fix this problem? I would
really like to see it merged into 3.13 before it is released.
-Longman
Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
Gitweb: http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
Author: Peter Zijlstra <[email protected]>
AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 16 Jan 2014 09:19:48 +0100
x86, mm, perf: Allow recursive faults from interrupts
Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
the PMI in turn managed to trigger a fault while obtaining a stack
trace. This triggered the sig_on_uaccess_error recursive fault logic
and killed the process dead.
Fix this by explicitly excluding interrupts from the recursive fault
logic.
Reported-and-Tested-by: Waiman Long <[email protected]>
Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Scott J Norton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/fault.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 9ff85bb..9d591c8 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -641,6 +641,20 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Are we prepared to handle this kernel fault? */
if (fixup_exception(regs)) {
+ /*
+ * Any interrupt that takes a fault gets the fixup. This makes
+ * the below recursive fault logic only apply to a faults from
+ * task context.
+ */
+ if (in_interrupt())
+ return;
+
+ /*
+ * Per the above we're !in_interrupt(), aka. task context.
+ *
+ * In this case we need to make sure we're not recursively
+ * faulting through the emulate_vsyscall() logic.
+ */
if (current_thread_info()->sig_on_uaccess_error && signal) {
tsk->thread.trap_nr = X86_TRAP_PF;
tsk->thread.error_code = error_code | PF_USER;
@@ -649,6 +663,10 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* XXX: hwpoison faults will set the wrong code. */
force_sig_info_fault(signal, si_code, address, tsk, 0);
}
+
+ /*
+ * Barring that, we can do the fixup and be happy.
+ */
return;
}
On 01/16/2014 08:39 AM, tip-bot for Peter Zijlstra wrote:
> Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
> Gitweb: http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
> Author: Peter Zijlstra<[email protected]>
> AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
> Committer: Ingo Molnar<[email protected]>
> CommitDate: Thu, 16 Jan 2014 09:19:48 +0100
>
> x86, mm, perf: Allow recursive faults from interrupts
>
> Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
> the PMI in turn managed to trigger a fault while obtaining a stack
> trace. This triggered the sig_on_uaccess_error recursive fault logic
> and killed the process dead.
>
> Fix this by explicitly excluding interrupts from the recursive fault
> logic.
>
> Reported-and-Tested-by: Waiman Long<[email protected]>
> Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
> Cc: Aswin Chandramouleeswaran<[email protected]>
> Cc: Scott J Norton<[email protected]>
> Cc: Linus Torvalds<[email protected]>
> Cc: Andy Lutomirski<[email protected]>
> Cc: Arnaldo Carvalho de Melo<[email protected]>
> Cc: Andrew Morton<[email protected]>
> Signed-off-by: Peter Zijlstra<[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Ingo Molnar<[email protected]>
> ---
> arch/x86/mm/fault.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
>
Will that be picked up by Linus as it is a 3.13 regression?
-Longman
On Fri, Jan 17, 2014 at 10:10 AM, Waiman Long <[email protected]> wrote:
> On 01/16/2014 08:39 AM, tip-bot for Peter Zijlstra wrote:
>>
>> Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
>> Gitweb:
>> http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
>> Author: Peter Zijlstra<[email protected]>
>> AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
>> Committer: Ingo Molnar<[email protected]>
>> CommitDate: Thu, 16 Jan 2014 09:19:48 +0100
>>
>> x86, mm, perf: Allow recursive faults from interrupts
>>
>> Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
>> the PMI in turn managed to trigger a fault while obtaining a stack
>> trace. This triggered the sig_on_uaccess_error recursive fault logic
>> and killed the process dead.
>>
>> Fix this by explicitly excluding interrupts from the recursive fault
>> logic.
>>
>> Reported-and-Tested-by: Waiman Long<[email protected]>
>> Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
>> Cc: Aswin Chandramouleeswaran<[email protected]>
>> Cc: Scott J Norton<[email protected]>
>> Cc: Linus Torvalds<[email protected]>
>> Cc: Andy Lutomirski<[email protected]>
>> Cc: Arnaldo Carvalho de Melo<[email protected]>
>> Cc: Andrew Morton<[email protected]>
>> Signed-off-by: Peter Zijlstra<[email protected]>
>> Link:
>> http://lkml.kernel.org/r/[email protected]
>> Signed-off-by: Ingo Molnar<[email protected]>
>> ---
>> arch/x86/mm/fault.c | 18 ++++++++++++++++++
>> 1 file changed, 18 insertions(+)
>>
>>
>
> Will that be picked up by Linus as it is a 3.13 regression?
Does anyone actually know why this regressed recently? The buggy code
has been there for quite a while.
--Andy
On 01/17/2014 02:17 PM, Andy Lutomirski wrote:
> On Fri, Jan 17, 2014 at 10:10 AM, Waiman Long<[email protected]> wrote:
>> On 01/16/2014 08:39 AM, tip-bot for Peter Zijlstra wrote:
>>> Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
>>> Gitweb:
>>> http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
>>> Author: Peter Zijlstra<[email protected]>
>>> AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
>>> Committer: Ingo Molnar<[email protected]>
>>> CommitDate: Thu, 16 Jan 2014 09:19:48 +0100
>>>
>>> x86, mm, perf: Allow recursive faults from interrupts
>>>
>>> Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
>>> the PMI in turn managed to trigger a fault while obtaining a stack
>>> trace. This triggered the sig_on_uaccess_error recursive fault logic
>>> and killed the process dead.
>>>
>>> Fix this by explicitly excluding interrupts from the recursive fault
>>> logic.
>>>
>>> Reported-and-Tested-by: Waiman Long<[email protected]>
>>> Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
>>> Cc: Aswin Chandramouleeswaran<[email protected]>
>>> Cc: Scott J Norton<[email protected]>
>>> Cc: Linus Torvalds<[email protected]>
>>> Cc: Andy Lutomirski<[email protected]>
>>> Cc: Arnaldo Carvalho de Melo<[email protected]>
>>> Cc: Andrew Morton<[email protected]>
>>> Signed-off-by: Peter Zijlstra<[email protected]>
>>> Link:
>>> http://lkml.kernel.org/r/[email protected]
>>> Signed-off-by: Ingo Molnar<[email protected]>
>>> ---
>>> arch/x86/mm/fault.c | 18 ++++++++++++++++++
>>> 1 file changed, 18 insertions(+)
>>>
>>>
>> Will that be picked up by Linus as it is a 3.13 regression?
> Does anyone actually know why this regressed recently? The buggy code
> has been there for quite a while.
>
> --Andy
Yes, the bug was there for a while, but a recent change by Peter (see
the "Fixes:" line above) made it much easier to hit it.
-Longman
On Fri, Jan 17, 2014 at 12:08 PM, Waiman Long <[email protected]> wrote:
> On 01/17/2014 02:17 PM, Andy Lutomirski wrote:
>>
>> On Fri, Jan 17, 2014 at 10:10 AM, Waiman Long<[email protected]> wrote:
>>>
>>> On 01/16/2014 08:39 AM, tip-bot for Peter Zijlstra wrote:
>>>>
>>>> Commit-ID: c026b3591e4f2a4993df773183704bb31634e0bd
>>>> Gitweb:
>>>> http://git.kernel.org/tip/c026b3591e4f2a4993df773183704bb31634e0bd
>>>> Author: Peter Zijlstra<[email protected]>
>>>> AuthorDate: Fri, 10 Jan 2014 21:06:03 +0100
>>>> Committer: Ingo Molnar<[email protected]>
>>>> CommitDate: Thu, 16 Jan 2014 09:19:48 +0100
>>>>
>>>> x86, mm, perf: Allow recursive faults from interrupts
>>>>
>>>> Waiman managed to trigger a PMI while in a emulate_vsyscall() fault,
>>>> the PMI in turn managed to trigger a fault while obtaining a stack
>>>> trace. This triggered the sig_on_uaccess_error recursive fault logic
>>>> and killed the process dead.
>>>>
>>>> Fix this by explicitly excluding interrupts from the recursive fault
>>>> logic.
>>>>
>>>> Reported-and-Tested-by: Waiman Long<[email protected]>
>>>> Fixes: e00b12e64be9 ("perf/x86: Further optimize copy_from_user_nmi()")
>>>> Cc: Aswin Chandramouleeswaran<[email protected]>
>>>> Cc: Scott J Norton<[email protected]>
>>>> Cc: Linus Torvalds<[email protected]>
>>>> Cc: Andy Lutomirski<[email protected]>
>>>> Cc: Arnaldo Carvalho de Melo<[email protected]>
>>>> Cc: Andrew Morton<[email protected]>
>>>> Signed-off-by: Peter Zijlstra<[email protected]>
>>>> Link:
>>>>
>>>> http://lkml.kernel.org/r/[email protected]
>>>> Signed-off-by: Ingo Molnar<[email protected]>
>>>> ---
>>>> arch/x86/mm/fault.c | 18 ++++++++++++++++++
>>>> 1 file changed, 18 insertions(+)
>>>>
>>>>
>>> Will that be picked up by Linus as it is a 3.13 regression?
>>
>> Does anyone actually know why this regressed recently? The buggy code
>> has been there for quite a while.
>>
>> --Andy
>
>
> Yes, the bug was there for a while, but a recent change by Peter (see the
> "Fixes:" line above) made it much easier to hit it.
Thanks!
So I feel slightly better now -- this particular bug didn't actually
exist when I wrote the offending code :) But that also means that
this should really be fixed in 3.13.
--Andy