Date: Tue, 20 Oct 2020 16:26:32 +0200
From: Christian Brauner
To: Giuseppe Scrivano
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    containers@lists.linux-foundation.org, linux@rasmusvillemoes.dk,
    viro@zeniv.linux.org.uk
Subject: Re: [PATCH v2 1/2] fs, close_range: add flag CLOSE_RANGE_CLOEXEC
Message-ID: <20201020142632.7wllfigtfgqzoou4@wittgenstein>

On Mon, Oct 19, 2020 at 12:26:53PM +0200, Giuseppe Scrivano wrote:
> When the flag CLOSE_RANGE_CLOEXEC is set, close_range doesn't
> immediately close the files but it sets the close-on-exec bit.
>
> It is useful for e.g. container runtimes that usually install a
> seccomp profile "as late as possible" before execv'ing the container
> process itself.  The container runtime could either do:
>                   1                                  2
> - install_seccomp_profile();       - close_range(MIN_FD, MAX_INT, 0);
> - close_range(MIN_FD, MAX_INT, 0); - install_seccomp_profile();
> - execve(...);                     - execve(...);
>
> Both alternatives have some disadvantages.
>
> In the first variant the seccomp_profile cannot block the close_range
> syscall (as well as opendir/read/close/... for the fallback on older
> kernels).
> In the second variant, close_range() can be used only on the fds
> that are not going to be needed by the runtime anymore, and it must be
> potentially called multiple times to account for the different ranges
> that must be closed.
>
> Using close_range(..., ..., CLOSE_RANGE_CLOEXEC) solves these issues.
> The runtime is able to use the open fds and the seccomp profile could
> block close_range() and the syscalls used for its fallback.

I see, so you want those fds to be closed after exec but still use them
before. Yeah, this is a good use-case. (I proposed this extension quite
a while ago when we started discussing this syscall. Thanks for working
on this!)

>
> Signed-off-by: Giuseppe Scrivano
> ---
>  fs/file.c                        | 44 ++++++++++++++++++++++++--------
>  include/uapi/linux/close_range.h |  3 +++
>  2 files changed, 37 insertions(+), 10 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index 21c0893f2f1d..0295d4f7c5ef 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -672,6 +672,35 @@ int __close_fd(struct files_struct *files, unsigned fd)
>  }
>  EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
>
> +static inline void __range_cloexec(struct files_struct *cur_fds,
> +				   unsigned int fd, unsigned int max_fd)
> +{
> +	struct fdtable *fdt;
> +
> +	if (fd > max_fd)
> +		return;

Looks like formatting issues here.
> +
> +	spin_lock(&cur_fds->file_lock);
> +	fdt = files_fdtable(cur_fds);
> +	bitmap_set(fdt->close_on_exec, fd, max_fd - fd + 1);

I think that this is ok and that there's no reason to make this any more
complex unless we somehow really see performance issues, which I doubt.

If Al is ok with doing it this way and doesn't see any obvious issues,
I'll take this for some testing and then come back to ack it and pick it
up.

> +	spin_unlock(&cur_fds->file_lock);
> +}
> +
> +static inline void __range_close(struct files_struct *cur_fds, unsigned int fd,
> +				 unsigned int max_fd)
> +{
> +	while (fd <= max_fd) {
> +		struct file *file;
> +
> +		file = pick_file(cur_fds, fd++);
> +		if (!file)
> +			continue;
> +
> +		filp_close(file, cur_fds);
> +		cond_resched();
> +	}
> +}
> +
>  /**
>   * __close_range() - Close all file descriptors in a given range.
>   *
> @@ -687,7 +716,7 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
>  	struct task_struct *me = current;
>  	struct files_struct *cur_fds = me->files, *fds = NULL;
>
> -	if (flags & ~CLOSE_RANGE_UNSHARE)
> +	if (flags & ~(CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC))
>  		return -EINVAL;
>
>  	if (fd > max_fd)
> @@ -725,16 +754,11 @@ int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
>  	}
>
>  	max_fd = min(max_fd, cur_max);
> -	while (fd <= max_fd) {
> -		struct file *file;
>
> -		file = pick_file(cur_fds, fd++);
> -		if (!file)
> -			continue;
> -
> -		filp_close(file, cur_fds);
> -		cond_resched();
> -	}
> +	if (flags & CLOSE_RANGE_CLOEXEC)
> +		__range_cloexec(cur_fds, fd, max_fd);
> +	else
> +		__range_close(cur_fds, fd, max_fd);
>
>  	if (fds) {
>  		/*
> diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h
> index 6928a9fdee3c..2d804281554c 100644
> --- a/include/uapi/linux/close_range.h
> +++ b/include/uapi/linux/close_range.h
> @@ -5,5 +5,8 @@
>  /* Unshare the file descriptor table before closing file descriptors. */
>  #define CLOSE_RANGE_UNSHARE	(1U << 1)
>
> +/* Set the FD_CLOEXEC bit instead of closing the file descriptor. */
> +#define CLOSE_RANGE_CLOEXEC	(1U << 2)
> +
>  #endif /* _UAPI_LINUX_CLOSE_RANGE_H */
>
> --
> 2.26.2
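
For illustration, here is a rough userspace sketch of how a runtime
might use the new flag while still coping with kernels that lack it.
None of this is part of the patch: cloexec_range() and
cloexec_range_fallback() are hypothetical helper names, the flag value
is copied from the uapi hunk above, and the fcntl() loop bounded by
sysconf(_SC_OPEN_MAX) is a simplifying assumption (a real fallback
would more likely walk /proc/self/fd with opendir/read/close, as the
commit message hints).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Value copied from the patch; older libc headers may not define it. */
#ifndef CLOSE_RANGE_CLOEXEC
#define CLOSE_RANGE_CLOEXEC (1U << 2)
#endif

/*
 * Hypothetical fallback: set FD_CLOEXEC on every open fd in
 * [first, last] one by one. Descriptors that are not open are skipped.
 */
static int cloexec_range_fallback(unsigned int first, unsigned int last)
{
	long open_max = sysconf(_SC_OPEN_MAX);
	unsigned long long fd, end = last;

	/* Assumption: no open fd sits at or above the current limit. */
	if (open_max > 0 && end >= (unsigned long long)open_max)
		end = (unsigned long long)open_max - 1;

	for (fd = first; fd <= end; fd++) {
		int flags = fcntl((int)fd, F_GETFD);

		if (flags < 0)
			continue;	/* fd not open, nothing to mark */
		if (fcntl((int)fd, F_SETFD, flags | FD_CLOEXEC) < 0)
			return -1;
	}
	return 0;
}

/*
 * Hypothetical wrapper: try close_range(CLOSE_RANGE_CLOEXEC) first and
 * fall back to the loop if the syscall fails for any reason (ENOSYS,
 * EINVAL for an unknown flag, or a seccomp denial).
 */
static int cloexec_range(unsigned int first, unsigned int last)
{
#ifdef SYS_close_range
	if (syscall(SYS_close_range, first, last, CLOSE_RANGE_CLOEXEC) == 0)
		return 0;
#endif
	return cloexec_range_fallback(first, last);
}
```

With something like this, the sequence from the commit message would
become cloexec_range(MIN_FD, ~0U); install_seccomp_profile();
execve(...); so the runtime keeps its fds until exec and the seccomp
profile is free to block close_range() and the fallback syscalls.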