From: ebiederm@xmission.com (Eric W. Biederman)
To: Alexey Dobriyan
Cc: Dave Jones, Linus Torvalds, Al Viro, Linux Kernel, syzkaller-bugs@googlegroups.com, Gargi Sharma, Oleg Nesterov, Rik van Riel, Andrew Morton
Subject: Re: proc_flush_task oops
Date: Fri, 22 Dec 2017 08:41:54 -0600
Message-ID: <87lghuesel.fsf@xmission.com>
In-Reply-To: (Alexey Dobriyan's message of "Fri, 22 Dec 2017 12:13:06 +0200")
References: <20171219193020.GA9237@codemonkey.org.uk> <878tdy5r5t.fsf@xmission.com> <87mv2e17vz.fsf@xmission.com> <20171220052803.GA17079@codemonkey.org.uk> <871sjp1cjz.fsf@xmission.com> <20171221031606.GA4636@codemonkey.org.uk> <87po78trjm.fsf@xmission.com> <20171221220044.GA4977@codemonkey.org.uk> <87wp1fk0pd.fsf@xmission.com> <20171222033500.GA17273@codemonkey.org.uk> <87y3lvi480.fsf@xmission.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Alexey Dobriyan writes:

> On 12/22/17, Eric W. Biederman wrote:
>> Dave Jones writes:
>>
>>> On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
>>> > Dave Jones writes:
>>> >
>>> > > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
>>> > >
>>> > > > > with proc_mnt still set to NULL is a mystery to me.
>>> > > > >
>>> > > > > Is there any chance the idr code doesn't always return the lowest valid
>>> > > > > free number? So init gets assigned something other than 1?
>>> > > >
>>> > > > Well, this theory is easy to test (attached).
>>> > >
>>> > > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>>> >
>>> > Scratch one idea.
>>> >
>>> > If it isn't too much trouble can you try this.
>>> >
>>> > I am wondering if somehow the proc_mnt that is NULL is somewhere in the
>>> > middle of the stack of pid namespaces.
>>> >
>>> > This adds two warnings. The first just reports which pid namespace in
>>> > the stack of pid namespaces is problematic, and the pid number in that
>>> > pid namespace. Which should give a whole lot more to go by.
>>> >
>>> > The second warning complains if we manage to create a pid namespace
>>> > where the parent pid namespace is not properly set up. The test to
>>> > prevent that looks quite robust, but at this point I don't know where to
>>> > look.
>>>
>>> Progress ?
>>>
>>> [ 1653.030190] ------------[ cut here ]------------
>>> [ 1653.030852] 1/1: 2 no proc_mnt
>>> [ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213 alloc_pid+0x24f/0x2a0
>>
>> Yes. I don't know why Alexey's patch did not fire but this is
>> confirmation that the first pid allocated was #2 and not #1.
>>
>> Which explains the pid_mnt not being set, and it is definitely the new
>> code, changing from the pid bitmap+hash table to an idr.
>>
>> So it looks like idr_alloc_cyclic in some configuration for the first
>> allocation returns value #2 instead of value #1.
>>
>> I don't know that code, and it is quite complicated so I will have to
>> stare at it some more to even guess why it is doing that.
>>
>> This is confirmation that reverting those pid changes will fix the
>> problem. As they are definitely at fault.
>>
>>
>> Hmm. After a little more staring I have a hunch what is going wrong.
>> It is just possible that there is a failure in alloc_pid during the
>> first pid allocation and then idr_next gets left at 2. I need to sleep
>> before I can think of a patch to test that.
>>
>> Hmm. A failure and then restart would also explain why Alexey's patch
>> did not fire. An incomplete reset of state.
>
> You are right (you are also right about sysctl :-\)
>
>     unshare
>     fork
>         alloc_pid in level 1 succeeds
>         alloc_pid in level 0 fails, ->idr_next is 2
>     fork
>         alloc pid 2
>     exit
>
> Reliable reproducer and fail injection patch attached
>
> I'd say proper fix is allocating pids in the opposite order
> so that failure in the last layer doesn't move IDR cursor
> in baby child pidns.

I agree with you about changing the order. That will make the code
simpler, make the second loop actually conforming C code, and fix the
immediate problem.

I was worrying about the case where the mount of the proc filesystem
fails, but we call disable_pid_allocation in that case, so we won't try
to allocate a pid again. That seems better than any path where we
might have to reset the allocation state.

The nasty thing is that the pid bitmap+hashtable code also did not set
the pointer back, and we did not have any problems. AKA the bug is not
new. Which means that with the new code, allocating pid numbers is
failing much more often than it ever did with the old code. That is
not good. Is it perhaps the GFP_ATOMIC in a context where we could
otherwise sleep?

Eric
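
The sequence in Alexey's reproducer can be sketched in userspace C.
This is only a toy model, not the kernel code: struct toy_pidns,
toy_alloc, toy_alloc_pid and first_pid_in_child are names made up for
illustration, the starting cursor values are arbitrary, and the only
behaviour modelled is a cursor that advances on each successful
allocation and is never moved back when a later level fails.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one pid namespace; only the allocation cursor is modelled. */
struct toy_pidns {
	int next;        /* number the next successful allocation returns */
	bool fail_once;  /* fault injection: make the next allocation fail */
};

/* Hypothetical cyclic allocation: returns the number, or -1 on failure. */
static int toy_alloc(struct toy_pidns *ns)
{
	if (ns->fail_once) {
		ns->fail_once = false;
		return -1;
	}
	return ns->next++;
}

/*
 * Allocate one number per level, ns[0] being the root namespace.
 * child_first selects the order. On failure nothing moves the cursors
 * back, which is the assumption this sketch is meant to illustrate.
 */
static int toy_alloc_pid(struct toy_pidns *ns, int levels, int *nr,
			 bool child_first)
{
	assert(levels > 0);
	for (int k = 0; k < levels; k++) {
		int i = child_first ? levels - 1 - k : k;

		nr[i] = toy_alloc(&ns[i]);
		if (nr[i] < 0)
			return -1;
	}
	return 0;
}

/*
 * Model "unshare; fork (fails at level 0); fork" and return the number
 * the first successfully created task gets in the new namespace.
 */
static int first_pid_in_child(bool child_first)
{
	struct toy_pidns ns[2] = {
		{ .next = 100, .fail_once = true  },	/* level 0, fails once */
		{ .next = 1,   .fail_once = false },	/* level 1, fresh */
	};
	int nr[2];

	if (toy_alloc_pid(ns, 2, nr, child_first) == 0)
		return nr[1];	/* not taken: level 0 fails on the first try */
	if (toy_alloc_pid(ns, 2, nr, child_first) == 0)
		return nr[1];
	return -1;
}
```

Under these assumptions first_pid_in_child(true) returns 2, which
matches the "1/1: 2 no proc_mnt" warning, while first_pid_in_child(false)
returns 1, because the failure at level 0 happens before the cursor of
the new namespace is touched.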