Received: by 2002:a05:6a10:f347:0:0:0:0 with SMTP id d7csp11156426pxu; Thu, 31 Dec 2020 01:23:46 -0800 (PST) X-Google-Smtp-Source: ABdhPJx7yWRohN4QytJsvqbLsAbYOldbhcH3Y9CLUL0yoevK/D4gq+CJYfGyCQfNgS5KVIY+Ir5Z X-Received: by 2002:a17:907:2131:: with SMTP id qo17mr51775573ejb.546.1609406626593; Thu, 31 Dec 2020 01:23:46 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1609406626; cv=none; d=google.com; s=arc-20160816; b=Dx3HY9YUiLcL10zQy9Cz6KkWkku1/UoJ07R0xRyks6BvFUk14mdn7m9zlxKwP3ogSC 7I3+5nbRKL+xaOjb6FChRv8XJt9djZOB/Ige6yrKr7HD1346moPjQmAnupmOC453eitv w90c5A1VM3p1hR1QAm2PKdTZndGNteKftgQPpmIHIibWklG3EEmAK5KbAHnscSKIzYfO TV3FpqieCbuhzouDs5Op+pFkw4i2nRzp8zCGl5deKaQZQVUkv3CisBAG7CATNfdCTKr7 bFYou56Ja71A8PvpzCoejkLKa4c4sFNxTess5/nN9vR4mFemS79/7Q0Jg1aSL2rDRGoC jB3w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=zQ7lelITecvDETFHXoAxZtihKa45IDeI3mmeg28jb7w=; b=x6RzaEI+I0afyVSdRyROcHSWD9PW3WWuzmems3PUQPj2WPbtJzBgJ8EwOoUxRsOA79 zLd4t5thIhWO235fjDeEQjmInswxI/OQGCtLnSVUjQnU+BHs/qCwFL+WP6aYhFIHxRY1 2mO3UCRyTQObUX2iGKFAnISfiET0akvLA6j4zIqMYQlA/G0LbQbGf/B5tu/vkc3Aohm4 sZz1Xg9AeB6jelxzwDDczVlbPAumDuwP5Bap9R7+y8UE3+GuEod+jFdaTmtuR2d6VHfz hq+X0TcKAjUu9eJ0nHsJhjomySr3OQvtWRzAHcBx9plw+BhyAeDcgzgz6n0DCA3DRcQc xTCA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=A7EH1tDe; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id i4si22510356ejg.258.2020.12.31.01.23.21; Thu, 31 Dec 2020 01:23:46 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linuxfoundation.org header.s=korg header.b=A7EH1tDe; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linuxfoundation.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726317AbgLaJV7 (ORCPT + 99 others); Thu, 31 Dec 2020 04:21:59 -0500 Received: from mail.kernel.org ([198.145.29.99]:46864 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726155AbgLaJV6 (ORCPT ); Thu, 31 Dec 2020 04:21:58 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 7F75622262; Thu, 31 Dec 2020 09:21:16 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=linuxfoundation.org; s=korg; t=1609406477; bh=zPY7UKQ8SEjalIA1GReemQgbCBmlAA+Hr7wC7DCWiNw=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=A7EH1tDebsh3S5TUoBrYzDVinroncvKsY9kLmZ5qQFlXgmA/2PU6k+zod0MwzBM+f +XzxWCfMoXq5m50nfbuNi4pmM8MPqV/oJTARdbIoRHcMUptosxNj3t3RXx2HDpj9n/ ioU4x9AYXRMw/IgrXGPKUddSaNikh4g6ao19w1Co= Date: Thu, 31 Dec 2020 10:22:41 +0100 From: Greg Kroah-Hartman To: Wen Yang Cc: Sasha Levin , Xunlei Pang , linux-kernel@vger.kernel.org, Pavel Emelyanov , Oleg Nesterov , Sukadev Bhattiprolu , Paul Menage , "Eric W. Biederman" , stable@vger.kernel.org Subject: Re: [PATCH 00/10] Cover letter: fix a race in release_task when flushing the dentry Message-ID: References: <20201203183204.63759-1-wenyang@linux.alibaba.com> <06bffff8-ed78-e8f5-191e-ecaaec266d46@linux.alibaba.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <06bffff8-ed78-e8f5-191e-ecaaec266d46@linux.alibaba.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Dec 17, 2020 at 10:26:23AM +0800, Wen Yang wrote: > > > 在 2020/12/4 上午2:31, Wen Yang 写道: > > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they > > should be deleted when the process exits. > > > > Suppose the following race appears: > > > > release_task                 dput > > -> proc_flush_task > >                             -> dentry->d_op->d_delete(dentry) > > -> __exit_signal > >                              -> dentry->d_lockref.count--  and return. > > > > In the proc_flush_task(), if another process is using this dentry, it will > > not be deleted. At the same time, in dput(), d_op->d_delete() can be executed > > before __exit_signal(pid has not been hashed), d_delete returns false, so > > this dentry still cannot be deleted. > > > > This dentry will always be cached (although its count is 0 and the > > DCACHE_OP_DELETE flag is set), its parent denry will also be cached too, and > > these dentries can only be deleted when drop_caches is manually triggered. > > > > This will result in wasted memory. What's more troublesome is that these > > dentries reference pid, according to the commit f333c700c610 ("pidns: Add a > > limit on the number of pid namespaces"), if the pid cannot be released, it > > may result in the inability to create a new pid_ns. > > > > This problem occurred in our cluster environment (Linux 4.9 LTS). > > We could reproduce it by manually constructing a test program + adding some > > debugging switches in the kernel: > > * A test program to open the directory (/proc//ns) [1] > > * Adding some debugging switches to the kernel, adding a delay between > > proc_flush_task and __exit_signal in release_task() [2] > > > > The test process is as follows: > > > > A, terminal #1 > > > > Turn on the debug switch: > > echo 1> /proc/sys/vm/dentry_debug_trace > > > > Execute the following unshare command: > > sudo unshare --pid --fork --mount-proc bash > > > > > > B, terminal #2 > > > > Find the pid of the unshare process: > > > > # pstree -p | grep unshare > > | `-sshd(716)---bash(718)--sudo(816)---unshare(817)---bash(818) > > > > > > Find the corresponding dentry: > > # dmesg | grep pid=818 > > [70.424722] XXX proc_pid_instantiate:3119 pid=818 tid=818 entry=818/ffff8802c7b670e8 > > > > > > C, terminal #3 > > > > Execute the opendir program, it will always open the /proc/818/ns/ directory: > > > > # ./a.out /proc/818/ns/ > > pid: 876 > > . > > .. > > net > > uts > > ipc > > pid > > user > > mnt > > cgroup > > > > D, go back to terminal #2 > > > > Turn on the debugging switches to construct the race: > > # echo 818> /proc/sys/vm/dentry_debug_pid > > # echo 1> /proc/sys/vm/dentry_debug_delay > > > > Kill the unshare process (pid 818). Since the debugging switches have been > > turned on, it will get stuck in release_task(): > > # kill -9 818 > > > > Then kill the process that opened the /proc/818/ns/ directory: > > # kill -9 876 > > > > Then turn off these debugging switches to allow the 818 process to exit: > > # echo 0> /proc/sys/vm/dentry_debug_delay > > # echo 0> /proc/sys/vm/dentry_debug_pid > > > > Checking the dmesg, we will find that the dentry(/proc/818/ns) ’s count is 0, > > and the flag is 2800cc (#define DCACHE_OP_DELETE 0x00000008), but it is still > > cached: > > # dmesg | grep ffff8802a3999548 > > … > > [565.559156] XXX dput:853 dentry=ns/ffff8802bea7b528, flag=2800cc, cnt=0, inode=ffff8802b38c2010, pdentry=818/ffff8802c7b670e8, pflag=20008c, pcnt=1, pinode=ffff8802c7812010, keywords: be cached > > > > > > It could also be verified via the crash tool: > > > > crash> dentry.d_flags,d_iname,d_inode,d_lockref -x ffff8802bea7b528 > > d_flags = 0x2800cc > > d_iname = "ns\000kkkkkkkkkkkkkkkkkkkkkkkkkkkk" > > d_inode = 0xffff8802b38c2010 > > d_lockref = { > > { > > lock_count = 0x0, > > { > > lock = { > > { > > rlock = { > > raw_lock = { > > { > > val = { > > counter = 0x0 > > }, > > { > > locked = 0x0, > > pending = 0x0 > > }, > > { > > locked_pending = 0x0, > > tail = 0x0 > > } > > } > > } > > } > > } > > }, > > count = 0x0 > > } > > } > > } > > crash> kmem ffff8802bea7b528 > > CACHE OBJSIZE ALLOCATED TOTAL SLABS SSIZE NAME > > ffff8802dd5f5900 192 23663 26130 871 16k dentry > > SLAB MEMORY NODE TOTAL ALLOCATED FREE > > ffffea000afa9e00 ffff8802bea78000 0 30 25 5 > > FREE / [ALLOCATED] > > [ffff8802bea7b520] > > > > PAGE PHYSICAL MAPPING INDEX CNT FLAGS > > ffffea000afa9ec0 2bea7b000 dead000000000400 0 0 2fffff80000000 > > crash> > > > > This series of patches is to fix this issue. > > > > Regards, > > Wen > > > > Alexey Dobriyan (1): > > proc: use %u for pid printing and slightly less stack > > > > Andreas Gruenbacher (1): > > proc: Pass file mode to proc_pid_make_inode > > > > Christian Brauner (1): > > clone: add CLONE_PIDFD > > > > Eric W. Biederman (6): > > proc: Better ownership of files for non-dumpable tasks in user > > namespaces > > proc: Rename in proc_inode rename sysctl_inodes sibling_inodes > > proc: Generalize proc_sys_prune_dcache into proc_prune_siblings_dcache > > proc: Clear the pieces of proc_inode that proc_evict_inode cares about > > proc: Use d_invalidate in proc_prune_siblings_dcache > > proc: Use a list of inodes to flush from proc > > > > Joel Fernandes (Google) (1): > > pidfd: add polling support > > > > fs/proc/base.c | 242 ++++++++++++++++++++------------------------- > > fs/proc/fd.c | 20 +--- > > fs/proc/inode.c | 67 ++++++++++++- > > fs/proc/internal.h | 22 ++--- > > fs/proc/namespaces.c | 3 +- > > fs/proc/proc_sysctl.c | 45 ++------- > > fs/proc/self.c | 6 +- > > fs/proc/thread_self.c | 5 +- > > include/linux/pid.h | 5 + > > include/linux/proc_fs.h | 4 +- > > include/uapi/linux/sched.h | 1 + > > kernel/exit.c | 5 +- > > kernel/fork.c | 145 ++++++++++++++++++++++++++- > > kernel/pid.c | 3 + > > kernel/signal.c | 11 +++ > > security/selinux/hooks.c | 1 + > > 16 files changed, 357 insertions(+), 228 deletions(-) > > > > [1] A test program to open the directory (/proc//ns) > > #include > > #include > > #include > > #include > > > > int main(int argc, char *argv[]) > > { > > DIR *dip; > > struct dirent *dit; > > > > if (argc < 2) { > > printf("Usage :%s \n", argv[0]); > > return -1; > > } > > > > if ((dip = opendir(argv[1])) == NULL) { > > perror("opendir"); > > return -1; > > } > > > > printf("pid: %d\n", getpid()); > > while((dit = readdir (dip)) != NULL) { > > printf("%s\n", dit->d_name); > > } > > > > while (1) > > sleep (1); > > > > return 0; > > } > > > > [2] Adding some debugging switches to the kernel, also adding a delay between > > proc_flush_task and __exit_signal in release_task(): > > > > diff --git a/fs/dcache.c b/fs/dcache.c > > index 05bad55..fafad37 100644 > > --- a/fs/dcache.c > > +++ b/fs/dcache.c > > @@ -84,6 +84,9 @@ > > int sysctl_vfs_cache_pressure __read_mostly = 100; > > EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure); > > > > +int sysctl_dentry_debug_trace __read_mostly = 0; > > +EXPORT_SYMBOL_GPL(sysctl_dentry_debug_trace); > > + > > __cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock); > > > > EXPORT_SYMBOL(rename_lock); > > @@ -758,6 +761,26 @@ static inline bool fast_dput(struct dentry *dentry) > > return 0; > > } > > > > +#define DENTRY_DEBUG_TRACE(dentry, keywords) \ > > +do { \ > > + if (sysctl_dentry_debug_trace) \ > > + printk("XXX %s:%d " \ > > + "dentry=%s/%p, flag=%x, cnt=%d, inode=%p, " \ > > + "pdentry=%s/%p, pflag=%x, pcnt=%d, pinode=%p, " \ > > + "keywords: %s\n", \ > > + __func__, __LINE__, \ > > + dentry->d_name.name, \ > > + dentry, \ > > + dentry->d_flags, \ > > + dentry->d_lockref.count, \ > > + dentry->d_inode, \ > > + dentry->d_parent->d_name.name, \ > > + dentry->d_parent, \ > > + dentry->d_parent->d_flags, \ > > + dentry->d_parent->d_lockref.count, \ > > + dentry->d_parent->d_inode, \ > > + keywords); \ > > +} while (0) > > > > /* > > * This is dput > > @@ -804,6 +827,8 @@ void dput(struct dentry *dentry) > > > > WARN_ON(d_in_lookup(dentry)); > > > > + DENTRY_DEBUG_TRACE(dentry, "be checked"); > > + > > /* Unreachable? Get rid of it */ > > if (unlikely(d_unhashed(dentry))) > > goto kill_it; > > @@ -812,8 +837,10 @@ void dput(struct dentry *dentry) > > goto kill_it; > > > > if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) { > > - if (dentry->d_op->d_delete(dentry)) > > + if (dentry->d_op->d_delete(dentry)) { > > + DENTRY_DEBUG_TRACE(dentry, "be killed"); > > goto kill_it; > > + } > > } > > > > if (!(dentry->d_flags & DCACHE_REFERENCED)) > > @@ -822,6 +849,9 @@ void dput(struct dentry *dentry) > > > > dentry->d_lockref.count--; > > spin_unlock(&dentry->d_lock); > > + > > + DENTRY_DEBUG_TRACE(dentry, "be cached"); > > + > > return; > > > > kill_it: > > diff --git a/fs/proc/base.c b/fs/proc/base.c > > index b9e4183..419a409 100644 > > --- a/fs/proc/base.c > > +++ b/fs/proc/base.c > > @@ -3090,6 +3090,8 @@ void proc_flush_task(struct task_struct *task) > > } > > } > > > > +extern int sysctl_dentry_debug_trace; > > + > > static int proc_pid_instantiate(struct inode *dir, > > struct dentry * dentry, > > struct task_struct *task, const void *ptr) > > @@ -3111,6 +3113,12 @@ static int proc_pid_instantiate(struct inode *dir, > > d_set_d_op(dentry, &pid_dentry_operations); > > > > d_add(dentry, inode); > > + > > + if (sysctl_dentry_debug_trace) > > + printk("XXX %s:%d pid=%d tid=%d entry=%s/%p\n", > > + __func__, __LINE__, task->pid, task->tgid, > > + dentry->d_name.name, dentry); > > + > > /* Close the race of the process dying before we return the dentry */ > > if (pid_revalidate(dentry, 0)) > > return 0; > > diff --git a/kernel/exit.c b/kernel/exit.c > > index 27f4168..2b3e1b6 100644 > > --- a/kernel/exit.c > > +++ b/kernel/exit.c > > @@ -55,6 +55,8 @@ > > #include > > #include > > > > +#include > > + > > #include > > #include > > #include > > @@ -164,6 +166,8 @@ static void delayed_put_task_struct(struct rcu_head *rhp) > > put_task_struct(tsk); > > } > > > > +int sysctl_dentry_debug_delay __read_mostly = 0; > > +int sysctl_dentry_debug_pid __read_mostly = 0; > > > > void release_task(struct task_struct *p) > > { > > @@ -178,6 +182,11 @@ void release_task(struct task_struct *p) > > > > proc_flush_task(p); > > > > + if (sysctl_dentry_debug_delay && p->pid == sysctl_dentry_debug_pid) { > > + while (sysctl_dentry_debug_delay) > > + mdelay(1); > > + } > > + > > write_lock_irq(&tasklist_lock); > > ptrace_release_task(p); > > __exit_signal(p); > > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > > index 513e6da..27f1395 100644 > > --- a/kernel/sysctl.c > > +++ b/kernel/sysctl.c > > @@ -282,6 +282,10 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, > > static int max_extfrag_threshold = 1000; > > #endif > > > > +extern int sysctl_dentry_debug_trace; > > +extern int sysctl_dentry_debug_delay; > > +extern int sysctl_dentry_debug_pid; > > + > > static struct ctl_table kern_table[] = { > > { > > .procname = "sched_child_runs_first", > > @@ -1498,6 +1502,30 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write, > > .proc_handler = proc_dointvec, > > .extra1 = &zero, > > }, > > + { > > + .procname = "dentry_debug_trace", > > + .data = &sysctl_dentry_debug_trace, > > + .maxlen = sizeof(sysctl_dentry_debug_trace), > > + .mode = 0644, > > + .proc_handler = proc_dointvec, > > + .extra1 = &zero, > > + }, > > + { > > + .procname = "dentry_debug_delay", > > + .data = &sysctl_dentry_debug_delay, > > + .maxlen = sizeof(sysctl_dentry_debug_delay), > > + .mode = 0644, > > + .proc_handler = proc_dointvec, > > + .extra1 = &zero, > > + }, > > + { > > + .procname = "dentry_debug_pid", > > + .data = &sysctl_dentry_debug_pid, > > + .maxlen = sizeof(sysctl_dentry_debug_pid), > > + .mode = 0644, > > + .proc_handler = proc_dointvec, > > + .extra1 = &zero, > > + }, > > #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT > > { > > .procname = "legacy_va_layout", > > > > > > Signed-off-by: Wen Yang > > Cc: Pavel Emelyanov > > Cc: Oleg Nesterov > > Cc: Sukadev Bhattiprolu > > Cc: Paul Menage > > Cc: "Eric W. Biederman" > > Cc: Greg Kroah-Hartman > > Cc: > > > > Hi Greg, > > Could you kindly give some suggestions? I'm sorry, but I don't really understand what this patch series is for. Why is there a patch in the 00/10 email? What stable kernel(s) should this be backported to? And why can't you just use a newer kernel (like 4.19) if you are hitting this issue with much older ones? confused, greg k-h