Date: Sat, 9 Jan 2021 13:36:23 +0100
From: Greg Kroah-Hartman
To: Wen Yang
Cc: Christian Brauner, Sasha Levin, Xunlei Pang, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 4.9 00/10] fix a race in release_task when flushing the dentry
References: <20210107075222.62623-1-wenyang@linux.alibaba.com> <82fb683a-bc9d-2083-f657-116f3e96d785@linux.alibaba.com>
In-Reply-To: <82fb683a-bc9d-2083-f657-116f3e96d785@linux.alibaba.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 08, 2021 at 10:42:47AM +0800, Wen Yang wrote:
>
>
> On 2021/1/8 at 2:28 AM, Greg Kroah-Hartman wrote:
> > On Fri, Jan 08, 2021 at 12:21:38AM +0800, Wen Yang wrote:
> > >
> > >
> > > On 2021/1/7 at 8:17 PM, Greg Kroah-Hartman wrote:
> > > > On Thu, Jan 07, 2021 at 03:52:12PM +0800, Wen Yang wrote:
> > > > > The dentries such as /proc//ns/ have the DCACHE_OP_DELETE flag, they
> > > > > should be deleted when the process exits.
> > > > >
> > > > > Suppose the following race appears:
> > > > >
> > > > > release_task                 dput
> > > > > -> proc_flush_task
> > > > >                              -> dentry->d_op->d_delete(dentry)
> > > > > -> __exit_signal
> > > > >                              -> dentry->d_lockref.count-- and return.
> > > > >
> > > > > In the proc_flush_task(), if another process is using this dentry, it will
> > > > > not be deleted. At the same time, in dput(), d_op->d_delete() can be executed
> > > > > before __exit_signal(pid has not been hashed), d_delete returns false, so
> > > > > this dentry still cannot be deleted.
> > > > >
> > > > > This dentry will always be cached (although its count is 0 and the
> > > > > DCACHE_OP_DELETE flag is set), its parent dentry will also be cached too, and
> > > > > these dentries can only be deleted when drop_caches is manually triggered.
> > > > >
> > > > > This will result in wasted memory. What's more troublesome is that these
> > > > > dentries reference pid, according to the commit f333c700c610 ("pidns: Add a
> > > > > limit on the number of pid namespaces"), if the pid cannot be released, it
> > > > > may result in the inability to create a new pid_ns.
> > > > >
> > > > > This issue was introduced by 60347f6716aa ("pid namespaces: prepare
> > > > > proc_flust_task() to flush entries from multiple proc trees"), exposed by
> > > > > f333c700c610 ("pidns: Add a limit on the number of pid namespaces"), and then
> > > > > fixed by 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc").
> > > >
> > > > Why are you just submitting a series for 4.9 and 4.19, what about 4.14?
> > > > We can't have users move to a newer kernel and then experience old bugs,
> > > > right?
> > > >
> > > Okay, the patches corresponding to 4.14 will be ready later.
> > > > Note for some reason you didn't cc: the stable list for these patches :(
> > > >
> > > > But the larger question is why are you backporting a whole new feature
> > > > here? Why is CLONE_PIDFD needed? That feels really wrong...
> > > >
> > >
> > > The reason for backporting CLONE_PIDFD is because 7bc3e6e55acf ("proc: Use a
> > > list of inodes to flush from proc") relies on wait_pidfd.lock. There are
> > > indeed many associated modifications here. We are also testing it. Please
> > > check the code more.
> >
> > Is the only "issue" here wasted memory? Will it eventually be freed
> > anyway even if you do not echo to the proc file to flush caches?
> >
> > You mention the inability to create a new pid for a specific namespace,
> > is that really a problem? Shouldn't the code handle such issues
> > normally? What breaks without these changes?
> >
> > I think at this point, it might just be time for you to move to a newer
> > kernel release, as adding a whole new userspace feature for this feels
> > really really odd.
> >
> > What is preventing you from doing that today? What holds you to older
> > kernels that will not allow you to move forward?
> >
>
> We have encountered this problem in the cloud server environment. Users will
> frequently create and delete containers, and the corresponding pid_ns will
> accumulate, eventually making it impossible to create a new container.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=208613
>
> The kernels (4.9/4.19) used on a large scale in our current production
> environment (almost tens of thousands of machines) may need to be fixed.

What prevents you from moving them to 5.4 or better yet, 5.10?
You will have to do it soon anyway, I'm sure you have been testing those
kernels to validate that all works well with them on a subset of your
environment, so for those systems that have this problem, why can't you
update the base kernel?

thanks,

greg k-h
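
The interleaving in the quoted patch description (release_task on one side, dput on the other) can be illustrated with a small user-space model. This is only a sketch under loud assumptions: every `toy_*` name is hypothetical, the structures are stand-ins rather than the kernel's `struct dentry` or `struct task_struct`, and dput() is split into two explicit steps so the ordering can be forced without threads. It is not the kernel code and not the fix from 7bc3e6e55acf.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for kernel objects (hypothetical, for illustration only). */
struct toy_task {
    bool pid_attached;      /* cleared by the modelled __exit_signal() */
};

struct toy_dentry {
    int count;              /* models dentry->d_lockref.count */
    bool cached;            /* true while the dentry stays in the cache */
    struct toy_task *task;  /* task whose pid the dentry refers to */
};

/* Models d_op->d_delete(): request deletion only once the pid is gone. */
static bool toy_d_delete(const struct toy_dentry *d)
{
    return !d->task->pid_attached;
}

/* Models proc_flush_task(): a dentry that is still in use is not dropped. */
static void toy_proc_flush_task(struct toy_dentry *d)
{
    if (d->count == 0)
        d->cached = false;
}

/* Models __exit_signal(): detach the pid from the exiting task. */
static void toy_exit_signal(struct toy_task *t)
{
    t->pid_attached = false;
}

/* Second half of the modelled dput(): drop the reference and act on the
 * d_delete() answer that was obtained earlier. */
static void toy_dput_finish(struct toy_dentry *d, bool kill)
{
    assert(d->count > 0);
    d->count--;
    if (kill && d->count == 0)
        d->cached = false;
}

/* Runs one interleaving; returns true if the dentry is left cached
 * with a count of 0. */
bool toy_run(bool racy)
{
    struct toy_task task = { .pid_attached = true };
    struct toy_dentry d = { .count = 1, .cached = true, .task = &task };
    bool kill;

    toy_proc_flush_task(&d);        /* release_task: skipped, count is 1 */
    if (racy) {
        kill = toy_d_delete(&d);    /* dput: pid still attached, so false */
        toy_exit_signal(&task);     /* release_task continues afterwards */
    } else {
        toy_exit_signal(&task);     /* pid detached first */
        kill = toy_d_delete(&d);    /* dput: now answers true */
    }
    toy_dput_finish(&d, kill);
    return d.cached && d.count == 0;
}
```

In this model, `toy_run(true)` follows the racy order from the description and leaves the dentry cached with a count of 0, while `toy_run(false)` lets the pid be detached before d_delete() is consulted and the dentry is dropped.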