Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S933845AbaFRGgv (ORCPT ); Wed, 18 Jun 2014 02:36:51 -0400
Received: from mga09.intel.com ([134.134.136.24]:28216 "EHLO mga09.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756390AbaFRGgt (ORCPT ); Wed, 18 Jun 2014 02:36:49 -0400
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.01,499,1400050800"; d="scan'208";a="530277596"
From: Liu ShuoX
To: linux-kernel@vger.kernel.org
Cc: yanmin_zhang@linux.intel.com, Zhang Yanmin, Liu Shuox,
	Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
	x86@kernel.org (maintainer:X86 ARCHITECTURE...),
	Ramkumar Ramachandra
Subject: [PATCH v4] perf: fix kernel panic when parsing user space CS saved in pt_regs
Date: Wed, 18 Jun 2014 14:34:31 +0800
Message-Id: <1403073277-29130-1-git-send-email-shuox.liu@intel.com>
X-Mailer: git-send-email 1.8.3.2
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

From: Zhang Yanmin

ChangeLog V4:
	Explain the patch scenario clearly.
ChangeLog V3:
	Keep rsp pointing to pt_regs before sysexit.
ChangeLog V2:
	Before sysexit, a perf NMI might arrive, so there is still a
	race. Here we change rsp to keep it pointing to
	pt_regs->orig_ax. In addition, after sti and before sysexit,
	an irq might arrive. That gives the perf NMI more chances to
	jump in.

We hit a kernel panic when running perf to collect performance data
with callchain info. The kernel is x86_64 and the user space apps are
32bit.

Run command:
	# perf record -a -g -f sleep 30

The kernel usually panics within 30 seconds.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
IP: [] get_segment_base+0x71/0xc0
PGD 6c65f067 PUD 0
Oops: 0000 [#1] PREEMPT SMP
Modules linked in: <...>
CPU: 1 PID: 304 Comm: Binder_2 Tainted: G W O 3.10.20-263902-g184bfbc-dirty #14
task: ffff8800764dc300 ti: ffff88006c6e8000 task.ti: ffff88006c6e8000
RIP: 0010:[] [] get_segment_base+0x71/0xc0
RSP: 0018:ffff88007ea87b98 EFLAGS: 00010092
RAX: 0000000000000024 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000009
RBP: ffff88007ea87ba8 R08: ffffffff83143b3c R09: ffffffff831848a8
R10: 0000000000000000 R11: 00000000001bf2d8 R12: 0000000000000000
R13: ffff88006c6e9fd8 R14: ffff88006c6e9f58 R15: ffff8800764dc300
FS: 0000000000000000(0000) GS:ffff88007ea80000(006b) knlGS:00000000f704add0
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000004 CR3: 0000000076588000 CR4: 00000000001007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Stack:
 ffff88005f266c00 0000000000000000 ffff88007ea87c18 ffffffff82013cac
 ffff88007ea87d58 00000016fe4704a0 00000000000001a7 ffff88007ea87ef8
 ffff88005f266c00 ffff88007ea87ef8 ffff8800!e07b400 ffff88005f266c00
Call Trace:
 [] perf_callchain_user+0x15c/0x240
 [] perf_callchain+0x134/0x180
 [] ? local_clock+0x47/0x60
 [] perf_prepare_sample+0x1bb/0x240
 [] __perf_event_overflow+0x147/0x230
 [] ? x86_perf_event_set_period+0xd8/0x150
 [] perf_event_overflow+0x14/0x20
 [] intel_pmu_handle_irq+0x1c2/0x270
 [] ? call_softirq+0x30/0x30
 [] perf_event_nmi_handler+0x21/0x30
 [] nmi_handle.isra.1+0x59/0x=0
 [] default_do_nmi+0x58/0x240
 [] do_nmi+0xb8/0xf0
 [] end_repeat_nmi+0x1e/0x2e
 [] ? call_softirq+0x30/0x30
 [] ? call_softirq+0x30/0x30
 [] ? call_softirq+0x30/0x30

perf_callchain_user32 calls get_segment_base to get cs/ss base address.
At the time of the panic, get_segment_base treats cs as an LDT index
and uses current->active_mm->context.ldt to access the descriptor. But
the app is 32bit and doesn't use an LDT, so
current->active_mm->context.ldt is NULL. That causes a bad dereference
and the kernel panic.

We dumped pt_regs in perf_callchain_user32. At panic, the values are
incorrect. After collecting lots of logs, we found that it always
happens while the app is running a system call. Sometimes the panic
callchain contains the address of trace_hardirqs_on_thunk.

perf_callchain checks whether pt_regs holds user space register info.
If not, perf_callchain calls task_pt_regs(current) to get the address
of pt_regs at the top of the kernel stack. After checking
sysexit_from_sys_call, we found that pt_regs at the top of the stack
might be overwritten by trace_hardirqs_on_thunk or other interrupt
handlers.

Basically, ia32 uses sysenter to start system calls, and the exit path
is sysexit_from_sys_call=>trace_hardirqs_on_thunk. Before calling
trace_hardirqs_on_thunk, sysexit_from_sys_call has already popped
pt_regs, and register rsp points to pt_regs->ss, almost at the top of
the kernel stack. Then trace_hardirqs_on_thunk is called and it uses
the kernel stack to save local vars, which overwrites the old pt_regs.
If a perf NMI happens here, perf might access a ruined pt_regs when
saving the userspace callchain.

If the app is 64bit, it doesn't go through this path. sysret_check
keeps rsp pointing to pt_regs before executing sysretq to exit to user
space.

The patch fixes it by keeping rsp pointing to pt_regs, like the 64bit
path.

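For illustration only, here is a minimal user-space sketch of why a corrupted CS value is dangerous in this path. It is not the kernel's get_segment_base; the names model_desc, model_mm and model_segment_base are hypothetical. It relies only on the x86 selector layout (bits 0-1 RPL, bit 2 table indicator, bits 3 and up index). The NULL check exists only so the model fails cleanly instead of crashing; the patch does not add such a check, it keeps pt_regs intact so that the selector is never garbage in the first place. The selector values used in the usage note (0x23 and 0x07) are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEL_TI_MASK 0x4u  /* bit 2 of a selector: 1 = LDT, 0 = GDT */

/* Hypothetical, simplified descriptor: only the base address is modelled. */
struct model_desc {
	uint32_t base;
};

/* Hypothetical stand-in for the per-process LDT state. */
struct model_mm {
	const struct model_desc *ldt;  /* NULL when the process has no LDT */
	unsigned int ldt_size;         /* number of entries in ldt */
};

/*
 * Resolve the base address for a selector.
 * Returns 0 and stores the base on success, -1 if the selector points
 * into a table that does not exist or is out of range.
 */
static int model_segment_base(const struct model_mm *mm,
			      const struct model_desc *gdt,
			      unsigned int gdt_size,
			      unsigned int selector,
			      uint32_t *base)
{
	unsigned int idx = selector >> 3;  /* descriptor index */
	const struct model_desc *table;
	unsigned int size;

	assert(mm != NULL && base != NULL);

	if (selector & SEL_TI_MASK) {
		/* A garbage selector with TI set lands here. */
		table = mm->ldt;
		size = mm->ldt_size;
	} else {
		table = gdt;
		size = gdt_size;
	}

	/* Without this check, table[idx] would dereference NULL. */
	if (table == NULL || idx >= size)
		return -1;

	*base = table[idx].base;
	return 0;
}
```

With a GDT of 8 entries and a process that has no LDT, a selector such as 0x23 (TI clear, index 4) resolves through the GDT, while a garbage value such as 0x07 (TI set, index 0) selects the missing LDT and is rejected by the model. In the oops above there is no such rejection, so the lookup faults at a small offset from NULL.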
Signed-off-by: Zhang Yanmin
Signed-off-by: Liu Shuox
---
 arch/x86/ia32/ia32entry.S | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb0..d2f905b 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -172,15 +172,16 @@ sysexit_from_sys_call:
 	andl	$~0x200,EFLAGS-R11(%rsp)
 	movl	RIP-R11(%rsp),%edx		/* User %eip */
 	CFI_REGISTER rip,rdx
-	RESTORE_ARGS 0,24,0,0,0,0
-	xorq	%r8,%r8
+	RESTORE_ARGS 0,-ARG_SKIP,0,0,0,0
+	movq	EFLAGS-R11(%rsp),%r8		/* rflags */
+	movq	RSP-R11(%rsp),%rcx		/* User %esp */
 	xorq	%r9,%r9
 	xorq	%r10,%r10
 	xorq	%r11,%r11
-	popfq_cfi
+	pushq_cfi %r8
 	/*CFI_RESTORE rflags*/
-	popq_cfi %rcx				/* User %esp */
-	CFI_REGISTER rsp,rcx
+	popfq_cfi
+	xorq	%r8,%r8
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS_SYSEXIT32
 
-- 
1.8.3.2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/