Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934067AbcLPOBX (ORCPT ); Fri, 16 Dec 2016 09:01:23 -0500 Received: from mail-wm0-f66.google.com ([74.125.82.66]:35789 "EHLO mail-wm0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757073AbcLPOBN (ORCPT ); Fri, 16 Dec 2016 09:01:13 -0500 Date: Fri, 16 Dec 2016 15:00:43 +0100 From: Michal Hocko To: Vegard Nossum Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Rik van Riel , Matthew Wilcox , Peter Zijlstra , Andrew Morton , Al Viro , Ingo Molnar , Linus Torvalds Subject: Re: crash during oom reaper Message-ID: <20161216140043.GN13940@dhcp22.suse.cz> References: <20161216082202.21044-1-vegard.nossum@oracle.com> <20161216082202.21044-4-vegard.nossum@oracle.com> <20161216090157.GA13940@dhcp22.suse.cz> <20161216101113.GE13940@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.6.0 (2016-04-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6120 Lines: 133 On Fri 16-12-16 14:14:17, Vegard Nossum wrote: [...] > Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child > Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB, > file-rss:112kB, shmem-rss:112kB > BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8 > IP: [] copy_process.part.41+0x2150/0x5580 > PGD c001067 PUD c000067 > PMD 0 > Oops: 0002 [#1] PREEMPT SMP KASAN > Dumping ftrace buffer: > (ftrace buffer empty) > CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317 Hmm, so this was the oom victim initially but we have decided to kill its child 1724 instead. > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > task: ffff88000f9bc440 task.stack: ffff88000c778000 > RIP: 0010:[] [] > copy_process.part.41+0x2150/0x5580 Could you match this to the kernel source please? > RSP: 0018:ffff88000c77fc18 EFLAGS: 00010297 > RAX: 0000000000000000 RBX: ffff88000fa11c00 RCX: 0000000000000000 > RDX: 0000000000000000 RSI: dffffc0000000000 RDI: ffff88000f2a33b0 > RBP: ffff88000c77fdb0 R08: ffff88000c77f900 R09: 0000000000000002 > R10: 00000000cb9401ca R11: 00000000c6eda739 R12: ffff88000f894d00 > R13: ffff88000c7c4700 R14: ffff88000fa11c50 R15: ffff88000f2a3200 > FS: 00007fb7d2a24700(0000) GS:ffff880011b00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00000000000001e8 CR3: 000000001010d000 CR4: 00000000000406e0 > Stack: > 0000000000000046 0000000001200011 ffffed0001f129ac ffff88000f894d60 > 0000000000000000 0000000000000000 ffff88000f894d08 ffff88000f894da0 > ffff88000c7a8620 ffff88000f020318 ffff88000fa11c18 ffff88000f894e40 > Call Trace: > [] ? __cleanup_sighand+0x50/0x50 > [] ? memzero_explicit+0xe/0x10 > [] ? urandom_read+0x232/0x4d0 > [] _do_fork+0x1a4/0xa40 and here we are copying the pid 1650 to its child. I am wondering whether that might be the killed child. But the child is visible only very late during the fork to the oom killer. > [] ? fork_idle+0x180/0x180 > [] ? syscall_trace_enter+0x3aa/0xd40 > [] ? __context_tracking_exit.part.4+0x9a/0x1e0 > [] ? exit_to_usermode_loop+0x150/0x150 > [] ? check_preemption_disabled+0x37/0x1e0 > [] SyS_clone+0x37/0x50 > [] ? ptregs_sys_rt_sigreturn+0x10/0x10 > [] do_syscall_64+0x1af/0x4d0 > [] entry_SYSCALL64_slow_path+0x25/0x25 > Code: be 00 00 00 00 00 fc ff df 48 c1 e8 03 80 3c 30 00 74 08 4c 89 f7 e8 > d0 7d 3c 00 f6 43 51 08 74 11 e8 45 fa 1d 00 48 8b 44 24 20 ff 88 e8 01 > 00 00 e8 34 fa 1d 00 48 8b 44 24 70 48 83 c0 60 > RIP [] copy_process.part.41+0x2150/0x5580 > RSP > CR2: 00000000000001e8 > ---[ end trace b8f81ad60c106e75 ]--- > > Manifestation 2: > > Killed process 1775 (trinity-c21) total-vm:37404kB, anon-rss:232kB, > file-rss:420kB, shmem-rss:116kB > oom_reaper: reaped process 1775 (trinity-c21), now anon-rss:0kB, > file-rss:0kB, shmem-rss:116kB > ================================================================== > BUG: KASAN: use-after-free in p9_client_read+0x8f0/0x960 at addr > ffff880010284d00 > Read of size 8 by task trinity-main/1649 > CPU: 3 PID: 1649 Comm: trinity-main Not tainted 4.9.0+ #318 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > ffff8800068a7770 ffffffff82012301 ffff88001100f600 ffff880010284d00 > ffff880010284d60 ffff880010284d00 ffff8800068a7798 ffffffff8165872c > ffff8800068a7828 ffff880010284d00 ffff88001100f600 ffff8800068a7818 > Call Trace: > [] dump_stack+0x83/0xb2 > [] kasan_object_err+0x1c/0x70 > [] kasan_report_error+0x1f5/0x4e0 > [] ? kasan_slab_alloc+0x12/0x20 > [] ? check_preemption_disabled+0x37/0x1e0 > [] __asan_report_load8_noabort+0x3e/0x40 > [] ? assoc_array_gc+0x1310/0x1330 > [] ? p9_client_read+0x8f0/0x960 > [] p9_client_read+0x8f0/0x960 no idea how we would end up with use after here. Even if I unmapped the page then the read code should be able to cope with that. This smells like a p9 issue to me. [...] > Manifestation 3: > > Out of memory: Kill process 1650 (trinity-main) score 91 or sacrifice child > Killed process 1731 (trinity-main) total-vm:37140kB, anon-rss:192kB, > file-rss:0kB, shmem-rss:0kB > ================================================================== > BUG: KASAN: use-after-free in unlink_file_vma+0xa5/0xb0 at addr > ffff880006689db0 > Read of size 8 by task trinity-main/1731 > CPU: 5 PID: 1731 Comm: trinity-main Not tainted 4.9.0 #314 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > ffff880000aaf7f8 ffffffff81fb1ab1 ffff8800110ed500 ffff880006689c00 > ffff880006689db8 ffff880000aaf998 ffff880000aaf820 ffffffff8162c5ac > ffff880000aaf8b0 ffff880006689c00Out of memory: Kill process 1650 > (trinity-main) score 91 or sacrifice child > Killed process 1650 (trinity-main) total-vm:37140kB, anon-rss:192kB, > file-rss:140kB, shmem-rss:18632kB > oom_reaper: reaped process 1650 (trinity-main), now anon-rss:0kB, > file-rss:0kB, shmem-rss:18632kB > ffff8800110ed500 ffff880000aaf8a0 > Call Trace: > [] dump_stack+0x83/0xb2 > [] kasan_object_err+0x1c/0x70 > [] kasan_report_error+0x1f5/0x4e0 > [] ? vm_normal_page_pmd+0x240/0x240 > [] __asan_report_load8_noabort+0x3e/0x40 > [] ? unlink_file_vma+0xa5/0xb0 > [] unlink_file_vma+0xa5/0xb0 Hmm, the oom repaper doesn't touch vma->vm_file so I do not see how it could be related to the activity of the reaper. I will have a look closer. -- Michal Hocko SUSE Labs