Subject: Re: crash during oom reaper
From: Vegard Nossum
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Rik van Riel, Matthew Wilcox, Peter Zijlstra, Andrew Morton, Al Viro, Ingo Molnar, Linus Torvalds
Date: Fri, 16 Dec 2016 15:25:27 +0100
Message-ID: <2d65449b-5f8a-7a29-e879-9c27bd1d4537@oracle.com>
In-Reply-To: <20161216140043.GN13940@dhcp22.suse.cz>
References: <20161216082202.21044-1-vegard.nossum@oracle.com> <20161216082202.21044-4-vegard.nossum@oracle.com> <20161216090157.GA13940@dhcp22.suse.cz> <20161216101113.GE13940@dhcp22.suse.cz> <20161216140043.GN13940@dhcp22.suse.cz>

On 12/16/2016 03:00 PM, Michal Hocko wrote:
> On Fri 16-12-16 14:14:17, Vegard Nossum wrote:
> [...]
>> Out of memory: Kill process 1650 (trinity-main) score 90 or sacrifice child
>> Killed process 1724 (trinity-c14) total-vm:37280kB, anon-rss:236kB,
>> file-rss:112kB, shmem-rss:112kB
>> BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8
>> IP: [] copy_process.part.41+0x2150/0x5580
>> PGD c001067 PUD c000067
>> PMD 0
>> Oops: 0002 [#1] PREEMPT SMP KASAN
>> Dumping ftrace buffer:
>> (ftrace buffer empty)
>> CPU: 28 PID: 1650 Comm: trinity-main Not tainted 4.9.0-rc6+ #317
>
> Hmm, so this was the oom victim initially but we have decided to kill
> its child 1724 instead.
>
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> task: ffff88000f9bc440 task.stack: ffff88000c778000
>> RIP: 0010:[] []
>> copy_process.part.41+0x2150/0x5580
>
> Could you match this to the kernel source please?

kernel/fork.c:629 dup_mmap()

It's the atomic_dec(&inode->i_writecount), which matches up with
file_inode(file) == NULL:

(gdb) p &((struct inode *)0)->i_writecount
$1 = (atomic_t *) 0x1e8

>> Killed process 1775 (trinity-c21) total-vm:37404kB, anon-rss:232kB,
>> file-rss:420kB, shmem-rss:116kB
>> oom_reaper: reaped process 1775 (trinity-c21), now anon-rss:0kB,
>> file-rss:0kB, shmem-rss:116kB
>> ==================================================================
>> BUG: KASAN: use-after-free in p9_client_read+0x8f0/0x960 at addr
>> ffff880010284d00
>> Read of size 8 by task trinity-main/1649
>> CPU: 3 PID: 1649 Comm: trinity-main Not tainted 4.9.0+ #318
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> ffff8800068a7770 ffffffff82012301 ffff88001100f600 ffff880010284d00
>> ffff880010284d60 ffff880010284d00 ffff8800068a7798 ffffffff8165872c
>> ffff8800068a7828 ffff880010284d00 ffff88001100f600 ffff8800068a7818
>> Call Trace:
>> [] dump_stack+0x83/0xb2
>> [] kasan_object_err+0x1c/0x70
>> [] kasan_report_error+0x1f5/0x4e0
>> [] ? kasan_slab_alloc+0x12/0x20
>> [] ? check_preemption_disabled+0x37/0x1e0
>> [] __asan_report_load8_noabort+0x3e/0x40
>> [] ? assoc_array_gc+0x1310/0x1330
>> [] ? p9_client_read+0x8f0/0x960
>> [] p9_client_read+0x8f0/0x960
>
> no idea how we would end up with use after here. Even if I unmapped the
> page then the read code should be able to cope with that. This smells
> like a p9 issue to me.

This is the fid->clnt dereference at the top of p9_client_read().

Ah, yes, this is the one coming from a page fault:

p9_client_read
v9fs_fid_readpage
v9fs_vfs_readpage
handle_mm_fault
__do_page_fault

The bad fid pointer is filp->private_data. Hm, so I guess the file
itself was NOT freed prematurely (as otherwise we'd probably have seen a
KASAN report for the filp->private_data dereference), but the
->private_data itself was.

Maybe the whole thing is fundamentally a 9p bug and the OOM killer just
happens to trigger it.


Vegard