Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933073AbdLVKNa (ORCPT ); Fri, 22 Dec 2017 05:13:30 -0500
Received: from mail-qt0-f169.google.com ([209.85.216.169]:46414 "EHLO mail-qt0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755893AbdLVKNI (ORCPT ); Fri, 22 Dec 2017 05:13:08 -0500
X-Google-Smtp-Source: ACJfBoutKe7g/95pfh/zirAZIGKRl382XJaNGqKGMHb/Dgp+ultYZEfyuHdMHtuHkWRMRaJNLObQsDz1yWlrxFXfvPw=
MIME-Version: 1.0
In-Reply-To: <87y3lvi480.fsf@xmission.com>
References: <20171219193020.GA9237@codemonkey.org.uk> <878tdy5r5t.fsf@xmission.com> <87mv2e17vz.fsf@xmission.com> <20171220052803.GA17079@codemonkey.org.uk> <871sjp1cjz.fsf@xmission.com> <20171221031606.GA4636@codemonkey.org.uk> <87po78trjm.fsf@xmission.com> <20171221220044.GA4977@codemonkey.org.uk> <87wp1fk0pd.fsf@xmission.com> <20171222033500.GA17273@codemonkey.org.uk> <87y3lvi480.fsf@xmission.com>
From: Alexey Dobriyan
Date: Fri, 22 Dec 2017 12:13:06 +0200
Message-ID:
Subject: Re: proc_flush_task oops
To: "Eric W. Biederman"
Cc: Dave Jones , Linus Torvalds , Al Viro , Linux Kernel , syzkaller-bugs@googlegroups.com, Gargi Sharma , Oleg Nesterov , Rik van Riel , Andrew Morton
Content-Type: multipart/mixed; boundary="94eb2c0399b41b59b60560eb109f"
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 8369
Lines: 178

--94eb2c0399b41b59b60560eb109f
Content-Type: text/plain; charset="UTF-8"

On 12/22/17, Eric W. Biederman wrote:
> Dave Jones writes:
>
>> On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
>> > Dave Jones writes:
>> >
>> > > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
>> > >
>> > > > > with proc_mnt still set to NULL is a mystery to me.
>> > > > >
>> > > > > Is there any chance the idr code doesn't always return the lowest valid
>> > > > > free number? So init gets assigned something other than 1?
>> > > >
>> > > > Well, this theory is easy to test (attached).
>> > >
>> > > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>> >
>> > Scratch one idea.
>> >
>> > If it isn't too much trouble can you try this.
>> >
>> > I am wondering if somehow the proc_mnt that is NULL is somewhere in the
>> > middle of the stack of pid namespaces.
>> >
>> > This adds two warnings. The first just reports which pid namespace in
>> > the stack of pid namespaces is problematic, and the pid number in that
>> > pid namespace. Which should give a whole lot more to go by.
>> >
>> > The second warning complains if we manage to create a pid namespace
>> > where the parent pid namespace is not properly set up. The test to
>> > prevent that looks quite robust, but at this point I don't know where to
>> > look.
>>
>> Progress ?
>>
>> [ 1653.030190] ------------[ cut here ]------------
>> [ 1653.030852] 1/1: 2 no proc_mnt
>> [ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213 alloc_pid+0x24f/0x2a0
>
> Yes. I don't know why Alexey's patch did not fire but this is
> confirmation that the first pid allocated was #2 and not #1.
>
> Which explains the pid_mnt not being set, and it is definitely the new
> code, changing from the pid bitmap+hash table to an idr.
>
> So it looks like idr_alloc_cyclic in some configuration for the first
> allocation returns value #2 instead of value #1.
>
> I don't know that code, and it is quite complicated so I will have to
> stare at it some more to even guess why it is doing that.
>
> This is confirmation that reverting those pid changes will fix the
> problem. As they are definitely at fault.
>
>
> Hmm. After a little more staring I have a hunch what is going wrong.
> It is just possible that there is a failure in alloc_pid during the
> first pid allocation and then idr_next gets left at 2. I need to sleep
> before I can think of a patch to test that.
>
> Hmm. A failure and then restart would also explain why Alexey's patch
> did not fire. An incomplete reset of state.

You are right (you are also right about sysctl :-\)

	unshare
	fork
		alloc_pid in level 1 succeeds
		alloc_pid in level 0 fails, ->idr_next is 2
	fork
		alloc pid 2
	exit

Reliable reproducer and fail injection patch attached

I'd say proper fix is allocating pids in the opposite order
so that failure in the last layer doesn't move IDR cursor
in baby child pidns.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: proc_flush_task+0x7b/0x180
PGD 3dbb5067 P4D 3dbb5067 PUD 3d11c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2775 Comm: trinity Not tainted 4.15.0-rc4-00202-gead68f216110-dirty #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
RIP: 0010:proc_flush_task+0x7b/0x180
RSP: 0018:ffffc9000016bca8 EFLAGS: 00010296
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000032
RDX: ffffc9000016bcbc RSI: ffffc9000016bcc8 RDI: 0000000000000001
RBP: ffffc9000016bcbb R08: 0000000036373732 R09: 0000000000000000
R10: 0000000000000000 R11: ffffc9000016bcbc R12: 0000000000000002
R13: 0000000000000000 R14: 0000000000000002 R15: ffff88003e2aee00
FS: 00007f9b52c86700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003d24b000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 release_task+0x3c/0x430
 ? thread_group_cputime_adjusted+0x35/0x40
 wait_consider_task+0x7a3/0x7d0
 do_wait+0xe5/0x1f0
 kernel_wait4+0x74/0x110
 ? task_stopped_code+0x50/0x50
 SyS_wait4+0x77/0x80
 ? handle_mm_fault+0x4a/0x50
 ? __do_page_fault+0x1a9/0x3e0
 ?
entry_SYSCALL_64_fastpath+0x13/0x6c entry_SYSCALL_64_fastpath+0x13/0x6c --94eb2c0399b41b59b60560eb109f Content-Type: text/plain; charset="US-ASCII"; name="proc-flush-task.diff" Content-Disposition: attachment; filename="proc-flush-task.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: file0 ZGlmZiAtLWdpdCBhL2tlcm5lbC9waWQuYyBiL2tlcm5lbC9waWQuYwppbmRleCBiMTNiNjI0ZTJj NDkuLmM5ZWYzMzI4YTUyYSAxMDA2NDQKLS0tIGEva2VybmVsL3BpZC5jCisrKyBiL2tlcm5lbC9w aWQuYwpAQCAtMTQ2LDYgKzE0Niw4IEBAIHZvaWQgZnJlZV9waWQoc3RydWN0IHBpZCAqcGlkKQog CiBzdHJ1Y3QgcGlkICphbGxvY19waWQoc3RydWN0IHBpZF9uYW1lc3BhY2UgKm5zKQogeworCXN0 YXRpYyBpbnQgdHJ5ID0gMDsKKwlib29sIGZhaWwgPSBmYWxzZTsKIAlzdHJ1Y3QgcGlkICpwaWQ7 CiAJZW51bSBwaWRfdHlwZSB0eXBlOwogCWludCBpLCBucjsKQEAgLTE1Myw2ICsxNTUsOSBAQCBz dHJ1Y3QgcGlkICphbGxvY19waWQoc3RydWN0IHBpZF9uYW1lc3BhY2UgKm5zKQogCXN0cnVjdCB1 cGlkICp1cGlkOwogCWludCByZXR2YWwgPSAtRU5PTUVNOwogCisJaWYgKHN0cm5jbXAoY3VycmVu dC0+Y29tbSwgInRyaW5pdHkiLCA3KSA9PSAwKQorCQlwcmludGsoIiVzOiBwaWRfbnMgbGV2ZWwg JWRcbiIsIF9fZnVuY19fLCBucy0+bGV2ZWwpOworCiAJcGlkID0ga21lbV9jYWNoZV9hbGxvYyhu cy0+cGlkX2NhY2hlcCwgR0ZQX0tFUk5FTCk7CiAJaWYgKCFwaWQpCiAJCXJldHVybiBFUlJfUFRS KHJldHZhbCk7CkBAIC0xNjMsNiArMTY4LDkgQEAgc3RydWN0IHBpZCAqYWxsb2NfcGlkKHN0cnVj dCBwaWRfbmFtZXNwYWNlICpucykKIAlmb3IgKGkgPSBucy0+bGV2ZWw7IGkgPj0gMDsgaS0tKSB7 CiAJCWludCBwaWRfbWluID0gMTsKIAorCQlpZiAoc3RybmNtcChjdXJyZW50LT5jb21tLCAidHJp bml0eSIsIDcpID09IDApCisJCQlwcmludGsoIiVzOiAnJXMnIGxldmVsICVkXG4iLCBfX2Z1bmNf XywgY3VycmVudC0+Y29tbSwgdG1wLT5sZXZlbCk7CisKIAkJaWRyX3ByZWxvYWQoR0ZQX0tFUk5F TCk7CiAJCXNwaW5fbG9ja19pcnEoJnBpZG1hcF9sb2NrKTsKIApAQCAtMTczLDEyICsxODEsMjUg QEAgc3RydWN0IHBpZCAqYWxsb2NfcGlkKHN0cnVjdCBwaWRfbmFtZXNwYWNlICpucykKIAkJaWYg KGlkcl9nZXRfY3Vyc29yKCZ0bXAtPmlkcikgPiBSRVNFUlZFRF9QSURTKQogCQkJcGlkX21pbiA9 IFJFU0VSVkVEX1BJRFM7CiAKKwkJaWYgKGZhaWwgJiYgc3RybmNtcChjdXJyZW50LT5jb21tLCAi dHJpbml0eSIsIDcpID09IDApIHsKKwkJCWZhaWwgPSBmYWxzZTsKKwkJCW5yID0gLUVOT01FTTsK KwkJCWdvdG8geHh4OworCQl9CisKIAkJLyoKIAkJICogU3RvcmUgYSBudWxsIHBvaW50ZXIgc28g 
ZmluZF9waWRfbnMgZG9lcyBub3QgZmluZAogCQkgKiBhIHBhcnRpYWxseSBpbml0aWFsaXplZCBQ SUQgKHNlZSBiZWxvdykuCiAJCSAqLworCQlpZiAoc3RybmNtcChjdXJyZW50LT5jb21tLCAidHJp bml0eSIsIDcpID09IDApIHsKKwkJCXByaW50aygiQkVGT1JFIC0+aWRyX25leHQgJXVcbiIsIHRt cC0+aWRyLmlkcl9uZXh0KTsKKwkJfQogCQluciA9IGlkcl9hbGxvY19jeWNsaWMoJnRtcC0+aWRy LCBOVUxMLCBwaWRfbWluLAogCQkJCSAgICAgIHBpZF9tYXgsIEdGUF9BVE9NSUMpOworCQlpZiAo c3RybmNtcChjdXJyZW50LT5jb21tLCAidHJpbml0eSIsIDcpID09IDApIHsKKwkJCXByaW50aygi IEFGVEVSIC0+aWRyX25leHQgJXVcbiIsIHRtcC0+aWRyLmlkcl9uZXh0KTsKKwkJfQoreHh4Ogog CQlzcGluX3VubG9ja19pcnEoJnBpZG1hcF9sb2NrKTsKIAkJaWRyX3ByZWxvYWRfZW5kKCk7CiAK QEAgLTE4Nyw2ICsyMDgsMTIgQEAgc3RydWN0IHBpZCAqYWxsb2NfcGlkKHN0cnVjdCBwaWRfbmFt ZXNwYWNlICpucykKIAkJCWdvdG8gb3V0X2ZyZWU7CiAJCX0KIAorCQlpZiAodHJ5ID09IDAgJiYg c3RybmNtcChjdXJyZW50LT5jb21tLCAidHJpbml0eSIsIDcpID09IDApIHsKKwkJCWlmICh0bXAt PmxldmVsID09IDEpIHsKKwkJCQlmYWlsID0gdHJ1ZTsKKwkJCX0KKwkJfQorCiAJCXBpZC0+bnVt YmVyc1tpXS5uciA9IG5yOwogCQlwaWQtPm51bWJlcnNbaV0ubnMgPSB0bXA7CiAJCXRtcCA9IHRt cC0+cGFyZW50OwpAQCAtMjI5LDYgKzI1Niw5IEBAIHN0cnVjdCBwaWQgKmFsbG9jX3BpZChzdHJ1 Y3QgcGlkX25hbWVzcGFjZSAqbnMpCiAJc3Bpbl91bmxvY2tfaXJxKCZwaWRtYXBfbG9jayk7CiAK IAlrbWVtX2NhY2hlX2ZyZWUobnMtPnBpZF9jYWNoZXAsIHBpZCk7CisJaWYgKHN0cm5jbXAoY3Vy cmVudC0+Y29tbSwgInRyaW5pdHkiLCA3KSA9PSAwKSB7CisJCXRyeSsrOworCX0KIAlyZXR1cm4g RVJSX1BUUihyZXR2YWwpOwogfQogCg== --94eb2c0399b41b59b60560eb109f Content-Type: text/x-csrc; charset="US-ASCII"; name="proc-flush-task.c" Content-Disposition: attachment; filename="proc-flush-task.c" Content-Transfer-Encoding: base64 X-Attachment-Id: file2 I2RlZmluZSBfR05VX1NPVVJDRQojaW5jbHVkZSA8c2NoZWQuaD4KI2luY2x1ZGUgPHVuaXN0ZC5o PgojaW5jbHVkZSA8c3lzL3R5cGVzLmg+CiNpbmNsdWRlIDxzeXMvd2FpdC5oPgoKaW50IG1haW4o dm9pZCkKewoJaW50IHBpZDsKCgl1bnNoYXJlKENMT05FX05FV1BJRCk7Cglmb3JrKCk7CglwaWQg PSBmb3JrKCk7CglpZiAocGlkID09IDApIHsKCQlyZXR1cm4gMDsKCX0KCXdhaXRwaWQocGlkLCBO VUxMLCAwKTsKCXJldHVybiAwOwp9Cg== --94eb2c0399b41b59b60560eb109f--