Date: Mon, 24 Oct 2016 06:14:02 -0500
From: Josh Poimboeuf
To: Peter Zijlstra
Cc: Vince Weaver, linux-kernel@vger.kernel.org, Ingo Molnar, Arnaldo Carvalho de Melo, Andy Lutomirski
Subject: Re: perf: perf_fuzzer triggers vmalloc_fault (then crashes)
Message-ID: <20161024111402.fv2sswwgnx6qm3ic@treble>
References: <20161024101802.GG3102@twins.programming.kicks-ass.net>
In-Reply-To: <20161024101802.GG3102@twins.programming.kicks-ass.net>

On Mon, Oct 24, 2016 at 12:18:02PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 21, 2016 at 11:05:40PM -0400, Vince Weaver wrote:
> >
> > This is on an AMD a10 system. With paranoid=1. Think it's
> > probably unrelated to the (unresolved) AMD IBS issues.
> > This is 4.9-rc0 just before rc1 (can't get actual rc1 to boot)
> >
> > Machine locks hard after this.
> >
> > [ 8098.085662] BAD LUCK: lost 42 message(s) from NMI context!
> > [ 8098.085663] ------------[ cut here ]------------
> > [ 8098.085664] WARNING: CPU: 0 PID: 21338 at arch/x86/mm/fault.c:435 vmalloc_fault+0x58/0x1f0
> > [ 8098.085668] CPU: 0 PID: 21338 Comm: perf_fuzzer Not tainted 4.8.0+ #37
> > [ 8098.085668] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
> > [ 8098.085670] Call Trace:
> > [ 8098.085670] [] ? dump_stack+0x46/0x59
> > [ 8098.085670] [] ? __warn+0xd5/0xee
> > [ 8098.085671] [] ? vmalloc_fault+0x58/0x1f0
> > [ 8098.085671] [] ? __do_page_fault+0x6d/0x48e
> > [ 8098.085671] [] ? perf_log_throttle+0xa4/0xf4
> > [ 8098.085672] [] ? trace_page_fault+0x22/0x30
> > [ 8098.085672] [] ? __unwind_start+0x28/0x42
> > [ 8098.085672] [] ? perf_callchain_kernel+0x75/0xac
> > [ 8098.085672] [] ? get_perf_callchain+0x13a/0x1f0
> > [ 8098.085673] [] ? perf_callchain+0x6a/0x6c
> > [ 8098.085673] [] ? perf_prepare_sample+0x71/0x2eb
> > [ 8098.085673] [] ? perf_event_output_forward+0x1a/0x54
> > [ 8098.085674] [] ? __default_send_IPI_shortcut+0x10/0x2d
> > [ 8098.085674] [] ? __perf_event_overflow+0xfb/0x167
> > [ 8098.085674] [] ? x86_pmu_handle_irq+0x113/0x150
> > [ 8098.085675] [] ? native_read_msr+0x6/0x34
> > [ 8098.085675] [] ? perf_event_nmi_handler+0x22/0x39
> > [ 8098.085675] [] ? perf_ibs_nmi_handler+0x4a/0x51
> > [ 8098.085676] [] ? perf_event_nmi_handler+0x22/0x39
> > [ 8098.085676] [] ? nmi_handle+0x4d/0xf0
> > [ 8098.085676] [] ? perf_ibs_handle_irq+0x3d1/0x3d1
> > [ 8098.085676] [] ? default_do_nmi+0x3c/0xd5
> > [ 8098.085677] [] ? do_nmi+0x92/0x102
> > [ 8098.085677] [] ? end_repeat_nmi+0x1a/0x1e
> > [ 8098.085677] [] ? entry_SYSCALL_64_after_swapgs+0x12/0x4a
> > [ 8098.085678] [] ? entry_SYSCALL_64_after_swapgs+0x12/0x4a
> > [ 8098.085678] [] ? entry_SYSCALL_64_after_swapgs+0x12/0x4a
> > [ 8098.085678] ---[ end trace 632723104d47d31a ]---
>
> So we get an NMI based stack unwind (without frame pointers) overrun the
> actual stack, and tickle the new guard page thing:
>
> > [ 8098.085679] BUG: stack guard page was hit at ffffc90008500000 (stack is ffffc900084fc000..ffffc900084fffff)
> > [ 8098.085679] kernel stack overflow (page fault): 0000 [#1] SMP
> > [ 8098.085683] CPU: 0 PID: 21338 Comm: perf_fuzzer Tainted: G W 4.8.0+ #37
> > [ 8098.085683] Hardware name: Hewlett-Packard HP Compaq Pro 6305 SFF/1850, BIOS K06 v02.57 08/16/2013
> > [ 8098.085684] task: ffff8802265d2080 task.stack: ffffc900084fc000
> > [ 8098.085684] RIP: 0010:[] [] __unwind_start+0x28/0x42
> > [ 8098.085684] RSP: 0018:ffff88022ec05af0 EFLAGS: 00010006
> > [ 8098.085685] RAX: 00000000ffffffea RBX: ffff88022ec05b08 RCX: ffffc90008500000
> > [ 8098.085685] RDX: ffff88022ec00000 RSI: 0000000000001000 RDI: 000000000000c4d0
> > [ 8098.085685] RBP: ffffc90008500000 R08: ffff88022ec08000 R09: 0000000000000000
> > [ 8098.085686] R10: 0000000000000002 R11: 0000000000000206 R12: ffff88022ec05b70
> > [ 8098.085686] R13: ffff88022ec05ef8 R14: 0000000000000000 R15: 0000000000000001
> > [ 8098.085687] FS: 00007f06e791c700(0000) GS:ffff88022ec00000(0000) knlGS:0000000000000000
> > [ 8098.085687] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 8098.085687] CR2: ffffc90008500000 CR3: 0000000223c25000 CR4: 00000000000407f0
> > [ 8098.085688] DR0: 0000000000000000 DR1: 0000000000005fc8 DR2: 0000000000005fc8
> > [ 8098.085688] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> > [ 8098.085689] Call Trace:
> > [ 8098.085690] [] ? perf_callchain_kernel+0x75/0xac
> > [ 8098.085690] [] ? vsnprintf+0x380/0x3b4
> > [ 8098.085690] [] ? sprintf+0x42/0x4a
> > [ 8098.085691] [] ? __sprint_symbol+0x9d/0xd1
> > [ 8098.085691] [] ? symbol_string+0x51/0x5d
> > [ 8098.085691] [] ? __sprint_symbol+0x9d/0xd1
> > [ 8098.085692] [] ? symbol_string+0x51/0x5d
> > [ 8098.085692] [] ? pointer+0x85/0x379
> > [ 8098.085692] [] ? vsnprintf+0x80/0x3b4
> > [ 8098.085692] [] ? irq_work_queue+0xa/0x66
> > [ 8098.085693] [] ? vprintk_nmi+0x88/0x97
> > [ 8098.085693] [] ? vprintk_nmi+0x88/0x97
> > [ 8098.085693] [] ? printk+0x43/0x4b
> > [ 8098.085694] [] ? __module_text_address+0x9/0x4f
> > [ 8098.085694] [] ? is_module_text_address+0x5/0xc
> > [ 8098.085694] [] ? show_trace_log_lvl+0x108/0x195
> > [ 8098.085694] [] ? no_context+0x102/0x36c
> > [ 8098.085695] [] ? no_context+0x102/0x36c
> > [ 8098.085695] [] ? show_stack_log_lvl+0x15b/0x172
> > [ 8098.085695] [] ? show_regs+0x64/0x136
> > [ 8098.085696] [] ? __die+0x8c/0xc4
> > [ 8098.085696] [] ? die+0x3d/0x56
> > [ 8098.085696] [] ? handle_stack_overflow+0x47/0x51
> > [ 8098.085697] [] ? no_context+0x102/0x36c
> > [ 8098.085697] Code: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
> > [ 8098.085697] IP: [<0000000000000008>] 0x8
> > [ 8098.085698] PGD 2231d5067 PUD 225162067 PMD 0
> > [ 8098.085698] Oops: 0010 [#2] SMP
> > [ 8098.085702]
> > [ 8098.957250] ---[ end trace 632723104d47d31b ]---
> > [ 8098.957250] Kernel panic - not syncing: Fatal exception in interrupt
> > [ 8098.957301] Kernel Offset: disabled
> > [ 8098.973814] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
> > [ 8098.981719] ------------[ cut here ]------------
> > [ 8098.981720] WARNING: CPU: 0 PID: 21338 at arch/x86/kernel/smp.c:127 update_process_times+0x3b/0x45
>
> And then the machine (understandably) goes off the rails entirely..
>
> Josh, Andy, any clue on how I should go about fixing this?

This is a bug in the unwinder. The NMI hit in the entry code right
after setting up the stack pointer from cpu_current_top_of_stack, so the
kernel stack was empty. __unwind_start() tried to dereference the
pointer (0xffffc90008500000) at the top of the stack.

I'll make a patch.

-- 
Josh