2020-06-02 20:44:52

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v5 0/3] close_range()

Hey everyone,

This is a resend of the close_range() syscall, as discussed in [1]. There weren't any outstanding
discussions anymore and this was in mergeable shape. I simply hadn't gotten around to moving this
into my for-next the last few cycles and then forgot about it. Thanks to Kyle and the Python people,
and others for consistenly reminding me before every merge window and mea culpa for not moving on
this sooner. I plan on moving this into for-next after v5.8-rc1 has been released and targeting the
v5.9 merge window.

As mentioned before, I was contacted by FreeBSD as they wanted to have the same close_range()
syscall as we proposed here. We've coordinated this and in the meantime, Kyle was fast enough to
merge close_range() into FreeBSD already in April:
https://reviews.freebsd.org/D21627
https://svnweb.freebsd.org/base?view=revision&revision=359836
and the current plan is to backport close_range() to FreeBSD 12.2 (cf. [2]) once its merged in
Linux too. Python is in the process of switching to close_range() on FreeBSD and they are waiting on
us to merge this to switch on Linux as well: https://bugs.python.org/issue38061

The missing close_range() syscall is also the reason why we still have that gap between 435 and 437
in the syscall tables as 436 was the syscall number reserved for close_range().

So again, sorry for the delay.

Thanks!
Christian

[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://twitter.com/kaevans91/status/1267907092406566912

Christian Brauner (3):
open: add close_range()
arch: wire-up close_range()
tests: add close_range() tests

arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 62 +++++++-
fs/open.c | 20 +++
include/linux/fdtable.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 7 +
.../testing/selftests/core/close_range_test.c | 149 ++++++++++++++++++
27 files changed, 258 insertions(+), 10 deletions(-)
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c


base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
--
2.26.2


2020-06-02 20:45:07

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v5 1/3] open: add close_range()

This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

I've also coordinated with some FreeBSD developers who got in touch with
me (Cced below). FreeBSD intends to add the same syscall once we merged it.
Quite a bunch of projects in userspace are waiting on this syscall
including Python and systemd.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating file
descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
int dir_fd;
DIR *dir;
struct dirent *direntp;

dir = opendir("/proc/self/fd");
if (!dir)
return -1;
dir_fd = dirfd(dir);
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir);
return 0;
}

to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us

2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Kyle Evans <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: [email protected]
---
/* v2 */
- Linus Torvalds <[email protected]>:
- add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <[email protected]>:
- add cond_resched() to yield cpu when closing a lot of file descriptors

/* v3 */
unchanged

/* v4 */
- Oleg Nesterov <[email protected]>:
- fix braino: s/max()/min()/

/* v5 */
unchanged
---
fs/file.c | 62 ++++++++++++++++++++++++++++++++++------
fs/open.c | 20 +++++++++++++
include/linux/fdtable.h | 2 ++
include/linux/syscalls.h | 2 ++
4 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index abb8b7081d7a..e260bfe687d1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -10,6 +10,7 @@
#include <linux/syscalls.h>
#include <linux/export.h>
#include <linux/fs.h>
+#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/slab.h>
@@ -620,12 +621,9 @@ void fd_install(unsigned int fd, struct file *file)

EXPORT_SYMBOL(fd_install);

-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
{
- struct file *file;
+ struct file *file = NULL;
struct fdtable *fdt;

spin_lock(&files->file_lock);
@@ -637,15 +635,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
- spin_unlock(&files->file_lock);
- return filp_close(file, files);

out_unlock:
spin_unlock(&files->file_lock);
- return -EBADF;
+ return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+ struct file *file;
+
+ file = pick_file(files, fd);
+ if (!file)
+ return -EBADF;
+
+ return filp_close(file, files);
}
EXPORT_SYMBOL(__close_fd); /* for ksys_close() */

+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+ unsigned int cur_max;
+
+ if (fd > max_fd)
+ return -EINVAL;
+
+ rcu_read_lock();
+ cur_max = files_fdtable(files)->max_fds;
+ rcu_read_unlock();
+
+ /* cap to last valid index into fdtable */
+ max_fd = min(max_fd, (cur_max - 1));
+ while (fd <= max_fd) {
+ struct file *file;
+
+ file = pick_file(files, fd++);
+ if (!file)
+ continue;
+
+ filp_close(file, files);
+ cond_resched();
+ }
+
+ return 0;
+}
+
/*
* variant of __close_fd that gets a ref on the file for later fput.
* The caller must ensure that filp_close() called on the file, and then
diff --git a/fs/open.c b/fs/open.c
index 719b320ede52..87e076e9e127 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1279,6 +1279,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
return retval;
}

+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+ unsigned int, flags)
+{
+ if (flags)
+ return -EINVAL;
+
+ return __close_range(current->files, fd, max_fd);
+}
+
/*
* This routine simulates a hangup on the tty, to arrange that users
* are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
extern int __close_fd(struct files_struct *files,
unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+ unsigned int max_fd);
extern int __close_fd_get_file(unsigned int fd, struct file **res);

extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1815065d52f3..18fea399329b 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -442,6 +442,8 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
asmlinkage long sys_openat2(int dfd, const char __user *filename,
struct open_how *how, size_t size);
asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags);
asmlinkage long sys_vhangup(void);

/* fs/pipe.c */
--
2.26.2

2020-06-02 20:47:05

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v5 3/3] tests: add close_range() tests

This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there there
are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum

Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
/* v2 */
unchanged

/* v3 */
- Christian Brauner <[email protected]>:
- verify that close_range() correctly closes a single file descriptor

/* v4 */
- Christian Brauner <[email protected]>:
- add missing Cc for Shuah
- add missing Cc for linux-kselftest

/* v5 */
- Michael Ellerman <[email protected]>:
- remove include of unexported kernel headers
- Christian Brauner <[email protected]>:
- add missing SPDX header to Makefile
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 7 +
.../testing/selftests/core/close_range_test.c | 149 ++++++++++++++++++
4 files changed, 158 insertions(+)
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 2ff68702fd41..2e387ce83311 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -6,6 +6,7 @@ TARGETS += breakpoints
TARGETS += capabilities
TARGETS += cgroup
TARGETS += clone3
+TARGETS += core
TARGETS += cpufreq
TARGETS += cpu-hotplug
TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
new file mode 100644
index 000000000000..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
new file mode 100644
index 000000000000..f6f2d6f473c6
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0-only
+CFLAGS += -g -I../../../../usr/include/
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index 000000000000..6d92e239a228
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+#ifndef __NR_close_range
+#define __NR_close_range -1
+#endif
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags)
+{
+ return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+ const char *test_name = "close_range";
+ int i, ret;
+ int open_fds[101];
+ int fd_max, fd_mid, fd_min;
+
+ ksft_set_plan(9);
+
+ for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+ int fd;
+
+ fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+ if (fd < 0) {
+ if (errno == ENOENT)
+ ksft_exit_skip(
+ "%s test: skipping test since /dev/null does not exist\n",
+ test_name);
+
+ ksft_exit_fail_msg(
+ "%s test: %s - failed to open /dev/null\n",
+ strerror(errno), test_name);
+ }
+
+ open_fds[i] = fd;
+ }
+
+ fd_min = open_fds[0];
+ fd_max = open_fds[99];
+
+ ret = sys_close_range(fd_min, fd_max, 1);
+ if (!ret)
+ ksft_exit_fail_msg(
+ "%s test: managed to pass invalid flag value\n",
+ test_name);
+ if (errno == ENOSYS)
+ ksft_exit_skip("%s test: close_range() syscall not supported\n", test_name);
+
+ ksft_test_result_pass("do not allow invalid flag values for close_range()\n");
+
+ fd_mid = open_fds[50];
+ ret = sys_close_range(fd_min, fd_mid, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from %d to %d\n",
+ test_name, fd_min, fd_mid);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+ for (i = 0; i <= 50; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from %d to %d\n",
+ test_name, fd_min, fd_mid);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_min, fd_mid);
+
+ /* create a couple of gaps */
+ close(57);
+ close(78);
+ close(81);
+ close(82);
+ close(84);
+ close(90);
+
+ fd_mid = open_fds[51];
+ /* Choose slightly lower limit and leave some fds for a later test */
+ fd_max = open_fds[92];
+ ret = sys_close_range(fd_mid, fd_max, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+ for (i = 51; i <= 92; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+ fd_mid = open_fds[93];
+ fd_max = open_fds[99];
+ /* test that the kernel caps and still closes all fds */
+ ret = sys_close_range(fd_mid, UINT_MAX, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+ for (i = 93; i < 100; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+ ret = sys_close_range(open_fds[100], open_fds[100], 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close single file descriptor\n",
+ test_name);
+ ksft_test_result_pass("close_range() closed single file descriptor\n");
+
+ ret = fcntl(open_fds[100], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close single file descriptor\n",
+ test_name);
+ ksft_test_result_pass("fcntl() verify closed single file descriptor\n");
+
+ return ksft_exit_pass();
+}
--
2.26.2

2020-06-02 20:47:32

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v5 2/3] arch: wire-up close_range()

This wires up the close_range() syscall into all arches at once.

Suggested-by: Arnd Bergmann <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
Acked-by: Arnd Bergmann <[email protected]>
Acked-by: Michael Ellerman <[email protected]> (powerpc)
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
/* v2 */
not present

/* v3 */
not present

/* v4 */
introduced
- Arnd Bergmann <[email protected]>:
- split into two patches:
1. add close_range()
2. add syscall to all arches at once
- bump __NR_compat_syscalls in arch/arm64/include/asm/unistd.h

/* v5 */
unchanged
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/uapi/asm-generic/unistd.h | 4 +++-
19 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 36d42da7466a..67ef02ead4da 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,5 +475,6 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
+546 common close_range sys_close_range
547 common openat2 sys_openat2
548 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 4d1cf74a2caa..13c5652137fb 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,5 +449,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 803039d504de..3b859596840d 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)

-#define __NR_compat_syscalls 439
+#define __NR_compat_syscalls 440
#endif

#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c1c61635f89c..902bfb136002 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_close_range 436
+__SYSCALL(__NR_close_range, sys_close_range)
#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)
#define __NR_pidfd_getfd 438
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 042911e670b8..df2e14da6a29 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,5 +356,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index f4f49fcb76d0..553b8858e667 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,5 +435,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 __sys_clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 4c67b11f9c9e..4467e2211d3c 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,5 +441,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 1f9e8ad636cc..82ad4cce163a 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,5 +374,6 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
435 n32 clone3 __sys_clone3
+436 n32 close_range sys_close_range
437 n32 openat2 sys_openat2
438 n32 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index c0b9d802dbf6..232934c26d07 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,5 +350,6 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
435 n64 clone3 __sys_clone3
+436 n64 close_range sys_close_range
437 n64 openat2 sys_openat2
438 n64 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index ac586774c980..d63000c8e769 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,5 +423,6 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
435 o32 clone3 __sys_clone3
+436 o32 close_range sys_close_range
437 o32 openat2 sys_openat2
438 o32 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 52a15f5cd130..8612458afda6 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,5 +433,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 220ae11555f2..ac92f5d7279d 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -525,5 +525,6 @@
435 32 clone3 ppc_clone3 sys_clone3
435 64 clone3 sys_clone3
435 spu clone3 sys_ni_syscall
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index bd7bd3581a0f..371bb1f2bfc3 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,5 +438,6 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
+436 common close_range sys_close_range sys_close_range
437 common openat2 sys_openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index c7a30fcd135f..4db428e7acdd 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,5 +438,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index f13615ecdecc..c9233f79a11b 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,5 +481,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 54581ac671b4..49ea7190351a 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,5 +440,6 @@
433 i386 fspick sys_fspick
434 i386 pidfd_open sys_pidfd_open
435 i386 clone3 sys_clone3
+436 i386 close_range sys_close_range
437 i386 openat2 sys_openat2
438 i386 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 37b844f839bc..c2b50b16a24c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd

diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 85a9ab1bc04d..d5dd7e506893 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,5 +406,6 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+436 common close_range sys_close_range
437 common openat2 sys_openat2
438 common pidfd_getfd sys_pidfd_getfd
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 3a3201e4618e..ed4e7c2a557f 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,6 +850,8 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
#endif
+#define __NR_close_range 436
+__SYSCALL(__NR_close_range, sys_close_range)

#define __NR_openat2 437
__SYSCALL(__NR_openat2, sys_openat2)
@@ -857,7 +859,7 @@ __SYSCALL(__NR_openat2, sys_openat2)
__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)

#undef __NR_syscalls
-#define __NR_syscalls 439
+#define __NR_syscalls 440

/*
* 32 bit systems traditionally used different
--
2.26.2

2020-06-02 21:06:05

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] close_range()

On Tue, Jun 2, 2020 at 1:42 PM Christian Brauner
<[email protected]> wrote:
>
> This is a resend of the close_range() syscall, as discussed in [1]. There weren't any outstanding
> discussions anymore and this was in mergeable shape. I simply hadn't gotten around to moving this
> into my for-next the last few cycles and then forgot about it. Thanks to Kyle and the Python people,
> and others for consistenly reminding me before every merge window and mea culpa for not moving on
> this sooner. I plan on moving this into for-next after v5.8-rc1 has been released and targeting the
> v5.9 merge window.

Btw, I did have one reaction that I can't find in the original thread,
which probably means that it got lost.

If one of the designed uses for this is for dropping file descriptors
just before execve(), it's possible that we'd want to have the option
to say "unshare my fd array" as part of close_range().

Yes, yes, you can do

unshare(CLONE_FILES);
close_range(3,~0u);

to do it as two operations (and you had that as the example typical
use), but it would actually be better to be able to do

close_range(3, ~0ul, CLOSE_RANGE_UNSHARE);

instead. Because otherwise we just waste time copying the file
descriptors first in the unshare, and then closing them after.. Double
the work..

And maybe this _did_ get mentioned last time, and I just don't find
it. I also don't see anything like that in the patches, although the
flags argument is there.

Linus

2020-06-02 23:35:25

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

* Christian Brauner:

> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
> int dir_fd;
> DIR *dir;
> struct dirent *direntp;
>
> dir = opendir("/proc/self/fd");
> if (!dir)
> return -1;
> dir_fd = dirfd(dir);
> while ((direntp = readdir(dir))) {
> int fd;
> if (strcmp(direntp->d_name, ".") == 0)
> continue;
> if (strcmp(direntp->d_name, "..") == 0)
> continue;
> fd = atoi(direntp->d_name);
> if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> continue;
> close(fd);
> }
> closedir(dir);
> return 0;
> }
>

> [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
> Note that this is an internal implementation that is not exported.
> Currently, libc seems to not provide an exported version of this
> because of missing kernel support to do this.

Just to be clear, this code is not compiled into glibc anymore in
typical configurations. I have posted a patch to turn grantpt into a
no-op: <https://sourceware.org/pipermail/libc-alpha/2020-May/114379.html>

I'm not entirely convinced that it's safe to keep iterating over
/proc/self/fd while also closing descriptors. Ideally, I think an
application should call getdents64, process the file names for
descriptors in the buffer, and if any have been closed, seek to zero
before the next getdents64 call. Maybe procfs is different, but with
other file systems, unlinking files can trigger directory reordering,
and then you get strange effects. The d_ino behavior for
/proc/self/fd is a bit strange as well (it's not consistently
descriptor plus 3).

2020-06-02 23:36:19

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] close_range()

On Tue, Jun 02, 2020 at 02:03:09PM -0700, Linus Torvalds wrote:
> On Tue, Jun 2, 2020 at 1:42 PM Christian Brauner
> <[email protected]> wrote:
> >
> > This is a resend of the close_range() syscall, as discussed in [1]. There weren't any outstanding
> > discussions anymore and this was in mergeable shape. I simply hadn't gotten around to moving this
> > into my for-next the last few cycles and then forgot about it. Thanks to Kyle and the Python people,
> > and others for consistenly reminding me before every merge window and mea culpa for not moving on
> > this sooner. I plan on moving this into for-next after v5.8-rc1 has been released and targeting the
> > v5.9 merge window.
>
> Btw, I did have one reaction that I can't find in the original thread,
> which probably means that it got lost.
>
> If one of the designed uses for this is for dropping file descriptors
> just before execve(), it's possible that we'd want to have the option
> to say "unshare my fd array" as part of close_range().
>
> Yes, yes, you can do
>
> unshare(CLONE_FILES);
> close_range(3,~0u);
>
> to do it as two operations (and you had that as the example typical
> use), but it would actually be better to be able to do
>
> close_range(3, ~0ul, CLOSE_RANGE_UNSHARE);
>
> instead. Because otherwise we just waste time copying the file
> descriptors first in the unshare, and then closing them after.. Double
> the work..
>
> And maybe this _did_ get mentioned last time, and I just don't find
> it. I also don't see anything like that in the patches, although the
> flags argument is there.

I spent some good time digging and I couldn't find this mentioned
anywhere so maybe it just never got sent to the list?
It sounds pretty useful, so yeah let me add a patch for this tomorrow.

Christian

2020-06-02 23:43:07

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

On Wed, Jun 03, 2020 at 01:30:57AM +0200, Florian Weimer wrote:
> * Christian Brauner:
>
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> > int dir_fd;
> > DIR *dir;
> > struct dirent *direntp;
> >
> > dir = opendir("/proc/self/fd");
> > if (!dir)
> > return -1;
> > dir_fd = dirfd(dir);
> > while ((direntp = readdir(dir))) {
> > int fd;
> > if (strcmp(direntp->d_name, ".") == 0)
> > continue;
> > if (strcmp(direntp->d_name, "..") == 0)
> > continue;
> > fd = atoi(direntp->d_name);
> > if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> > continue;
> > close(fd);
> > }
> > closedir(dir);
> > return 0;
> > }
> >
>
> > [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
> > Note that this is an internal implementation that is not exported.
> > Currently, libc seems to not provide an exported version of this
> > because of missing kernel support to do this.
>
> Just to be clear, this code is not compiled into glibc anymore in
> typical configurations. I have posted a patch to turn grantpt into a
> no-op: <https://sourceware.org/pipermail/libc-alpha/2020-May/114379.html>

That's great! (I remember commenting on that thread.)

2020-06-03 00:13:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] close_range()

On Tue, Jun 2, 2020 at 4:33 PM Christian Brauner
<[email protected]> wrote:
> >
> > And maybe this _did_ get mentioned last time, and I just don't find
> > it. I also don't see anything like that in the patches, although the
> > flags argument is there.
>
> I spent some good time digging and I couldn't find this mentioned
> anywhere so maybe it just never got sent to the list?

It's entirely possible that it was just a private musing, and you
re-opening this issue just resurrected the thought.

I'm not sure how simple it would be to implement, but looking at it it
shouldn't be problematic to add a "max_fd" argument to unshare_fd()
and dup_fd().

Although the range for unsharing is obviously reversed, so I'd suggest
not trying to make "dup_fd()" take the exact range into account.

More like just making __close_range() do basically something like

rcu_read_lock();
cur_max = files_fdtable(files)->max_fds;
rcu_read_unlock();

if (flags & CLOSE_RANGE_UNSHARE) {
unsigned int max_unshare_fd = ~0u;
if (cur_max >= max_fd)
max_unshare_fd = fd;
unshare_fd(max_unsgare_fd);
}

.. do the rest of __close_range() here ..

and all that "max_unsgare_fd" would do would be to limit the top end
of the file descriptor table unsharing: we'd still do the exact range
handling in __close_range() itself.

Because teaching unshare_fd() and dup_fd() about anything more complex
than the above doesn't sound worth it, but adding a way to just avoid
the unnecessary copy of any high file descriptors sounds simple
enough.

But I haven't thought deeply about this. I might have missed something.

Linus

Subject: Re: [PATCH v5 1/3] open: add close_range()

Hi Christian,

Could we have a manual page for this API (best before it's merged)?

Thanks,

Michael

On Tue, 2 Jun 2020 at 22:44, Christian Brauner
<[email protected]> wrote:
>
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> I've also coordinated with some FreeBSD developers who got in touch with
> me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> Quite a bunch of projects in userspace are waiting on this syscall
> including Python and systemd.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
> /* that exec is sensitive */
> unshare(CLONE_FILES);
> /* we don't want anything past stderr here */
> close_range(3, ~0U);
> execve(....);
>
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating file
> descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
> - Python (cf. [2])
> - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).
>
> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
>
> static int close_all_fds(void)
> {
> int dir_fd;
> DIR *dir;
> struct dirent *direntp;
>
> dir = opendir("/proc/self/fd");
> if (!dir)
> return -1;
> dir_fd = dirfd(dir);
> while ((direntp = readdir(dir))) {
> int fd;
> if (strcmp(direntp->d_name, ".") == 0)
> continue;
> if (strcmp(direntp->d_name, "..") == 0)
> continue;
> fd = atoi(direntp->d_name);
> if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> continue;
> close(fd);
> }
> closedir(dir);
> return 0;
> }
>
> to close_range() yields:
> 1. closing 4 open files:
> - close_all_fds(): ~280 us
> - close_range(): ~24 us
>
> 2. closing 1000 open files:
> - close_all_fds(): ~5000 us
> - close_range(): ~800 us
>
> close_range() is designed to allow for some flexibility. Specifically, it
> does not simply always close all open file descriptors of a task. Instead,
> callers can specify an upper bound.
> This is e.g. useful for scenarios where specific file descriptors are
> created with well-known numbers that are supposed to be excluded from
> getting closed.
> For extra paranoia close_range() comes with a flags argument. This can e.g.
> be used to implement extension. Once can imagine userspace wanting to stop
> at the first error instead of ignoring errors under certain circumstances.
> There might be other valid ideas in the future. In any case, a flag
> argument doesn't hurt and keeps us on the safe side.
>
> From an implementation side this is kept rather dumb. It saw some input
> from David and Jann but all nonsense is obviously my own!
> - Errors to close file descriptors are currently ignored. (Could be changed
> by setting a flag in the future if needed.)
> - __close_range() is a rather simplistic wrapper around __close_fd().
> My reasoning behind this is based on the nature of how __close_fd() needs
> to release an fd. But maybe I misunderstood specifics:
> We take the files_lock and rcu-dereference the fdtable of the calling
> task, we find the entry in the fdtable, get the file and need to release
> files_lock before calling filp_close().
> In the meantime the fdtable might have been altered so we can't just
> retake the spinlock and keep the old rcu-reference of the fdtable
> around. Instead we need to grab a fresh reference to the fdtable.
> If my reasoning is correct then there's really no point in fancyfying
> __close_range(): We just need to rcu-dereference the fdtable of the
> calling task once to cap the max_fd value correctly and then go on
> calling __close_fd() in a loop.
>
> /* References */
> [1]: https://lore.kernel.org/lkml/[email protected]/
> [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
> [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
> [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
> [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
> [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
> Note that this is an internal implementation that is not exported.
> Currently, libc seems to not provide an exported version of this
> because of missing kernel support to do this.
> [7]: https://github.com/rust-lang/rust/issues/12148
> [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
> Rust's solution is slightly different but is equally unperformant.
> Rust calls getdtablesize() which is a glibc library function that
> simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
> goes on to call close() on each fd. That's obviously overkill for most
> tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
> OPEN_MAX.
> Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
> to 1024. Even in this case, there's a very high chance that in the
> common case Rust is calling the close() syscall 1021 times pointlessly
> if the task just has 0, 1, and 2 open.
>
> Suggested-by: Al Viro <[email protected]>
> Signed-off-by: Christian Brauner <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Kyle Evans <[email protected]>
> Cc: Jann Horn <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Dmitry V. Levin <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Florian Weimer <[email protected]>
> Cc: [email protected]
> ---
> /* v2 */
> - Linus Torvalds <[email protected]>:
> - add cond_resched() to yield cpu when closing a lot of file descriptors
> - Al Viro <[email protected]>:
> - add cond_resched() to yield cpu when closing a lot of file descriptors
>
> /* v3 */
> unchanged
>
> /* v4 */
> - Oleg Nesterov <[email protected]>:
> - fix braino: s/max()/min()/
>
> /* v5 */
> unchanged
> ---
> fs/file.c | 62 ++++++++++++++++++++++++++++++++++------
> fs/open.c | 20 +++++++++++++
> include/linux/fdtable.h | 2 ++
> include/linux/syscalls.h | 2 ++
> 4 files changed, 78 insertions(+), 8 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index abb8b7081d7a..e260bfe687d1 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -10,6 +10,7 @@
> #include <linux/syscalls.h>
> #include <linux/export.h>
> #include <linux/fs.h>
> +#include <linux/kernel.h>
> #include <linux/mm.h>
> #include <linux/sched/signal.h>
> #include <linux/slab.h>
> @@ -620,12 +621,9 @@ void fd_install(unsigned int fd, struct file *file)
>
> EXPORT_SYMBOL(fd_install);
>
> -/*
> - * The same warnings as for __alloc_fd()/__fd_install() apply here...
> - */
> -int __close_fd(struct files_struct *files, unsigned fd)
> +static struct file *pick_file(struct files_struct *files, unsigned fd)
> {
> - struct file *file;
> + struct file *file = NULL;
> struct fdtable *fdt;
>
> spin_lock(&files->file_lock);
> @@ -637,15 +635,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
> goto out_unlock;
> rcu_assign_pointer(fdt->fd[fd], NULL);
> __put_unused_fd(files, fd);
> - spin_unlock(&files->file_lock);
> - return filp_close(file, files);
>
> out_unlock:
> spin_unlock(&files->file_lock);
> - return -EBADF;
> + return file;
> +}
> +
> +/*
> + * The same warnings as for __alloc_fd()/__fd_install() apply here...
> + */
> +int __close_fd(struct files_struct *files, unsigned fd)
> +{
> + struct file *file;
> +
> + file = pick_file(files, fd);
> + if (!file)
> + return -EBADF;
> +
> + return filp_close(file, files);
> }
> EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
>
> +/**
> + * __close_range() - Close all file descriptors in a given range.
> + *
> + * @fd: starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + */
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + max_fd = min(max_fd, (cur_max - 1));
> + while (fd <= max_fd) {
> + struct file *file;
> +
> + file = pick_file(files, fd++);
> + if (!file)
> + continue;
> +
> + filp_close(file, files);
> + cond_resched();
> + }
> +
> + return 0;
> +}
> +
> /*
> * variant of __close_fd that gets a ref on the file for later fput.
> * The caller must ensure that filp_close() called on the file, and then
> diff --git a/fs/open.c b/fs/open.c
> index 719b320ede52..87e076e9e127 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1279,6 +1279,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
> return retval;
> }
>
> +/**
> + * close_range() - Close all file descriptors in a given range.
> + *
> + * @fd: starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + * @flags: reserved for future extensions
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + * Currently, errors to close a given file descriptor are ignored.
> + */
> +SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
> + unsigned int, flags)
> +{
> + if (flags)
> + return -EINVAL;
> +
> + return __close_range(current->files, fd, max_fd);
> +}
> +
> /*
> * This routine simulates a hangup on the tty, to arrange that users
> * are given clean terminals at login time.
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index f07c55ea0c22..fcd07181a365 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
> unsigned int fd, struct file *file);
> extern int __close_fd(struct files_struct *files,
> unsigned int fd);
> +extern int __close_range(struct files_struct *files, unsigned int fd,
> + unsigned int max_fd);
> extern int __close_fd_get_file(unsigned int fd, struct file **res);
>
> extern struct kmem_cache *files_cachep;
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 1815065d52f3..18fea399329b 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -442,6 +442,8 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
> asmlinkage long sys_openat2(int dfd, const char __user *filename,
> struct open_how *how, size_t size);
> asmlinkage long sys_close(unsigned int fd);
> +asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
> + unsigned int flags);
> asmlinkage long sys_vhangup(void);
>
> /* fs/pipe.c */
> --
> 2.26.2
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-06-03 23:58:13

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] close_range()

On Tue, Jun 02, 2020 at 05:08:22PM -0700, Linus Torvalds wrote:
> On Tue, Jun 2, 2020 at 4:33 PM Christian Brauner
> <[email protected]> wrote:
> > >
> > > And maybe this _did_ get mentioned last time, and I just don't find
> > > it. I also don't see anything like that in the patches, although the
> > > flags argument is there.
> >
> > I spent some good time digging and I couldn't find this mentioned
> > anywhere so maybe it just never got sent to the list?
>
> It's entirely possible that it was just a private musing, and you
> re-opening this issue just resurrected the thought.
>
> I'm not sure how simple it would be to implement, but looking at it it
> shouldn't be problematic to add a "max_fd" argument to unshare_fd()
> and dup_fd().
>
> Although the range for unsharing is obviously reversed, so I'd suggest
> not trying to make "dup_fd()" take the exact range into account.
>
> More like just making __close_range() do basically something like
>
> rcu_read_lock();
> cur_max = files_fdtable(files)->max_fds;
> rcu_read_unlock();
>
> if (flags & CLOSE_RANGE_UNSHARE) {
> unsigned int max_unshare_fd = ~0u;
> if (cur_max >= max_fd)
> max_unshare_fd = fd;
> unshare_fd(max_unsgare_fd);
> }
>
> .. do the rest of __close_range() here ..
>
> and all that "max_unsgare_fd" would do would be to limit the top end
> of the file descriptor table unsharing: we'd still do the exact range
> handling in __close_range() itself.
>
> Because teaching unshare_fd() and dup_fd() about anything more complex
> than the above doesn't sound worth it, but adding a way to just avoid
> the unnecessary copy of any high file descriptors sounds simple
> enough.

Ok, here's what I have. (I think in your example above cur_max and
max_fd are switched or I might have missed your point completely.) I was
a little in doubt whether capping dup_fd() between NR_OPEN_DEFAULT and
open_files was a sane thing to do but I think it is. Torture testing
this with a proper test-suite and with all debugging options enabled
didn't yet find any obvious issues. Does the below look somewhat sane?:

From 4ee3fdac02f3cd70e31669e35d3f494913f3fd3f Mon Sep 17 00:00:00 2001
From: Christian Brauner <[email protected]>
Date: Wed, 3 Jun 2020 21:48:55 +0200
Subject: [PATCH 1/2] close_range: add CLOSE_RANGE_UNSHARE

One of the use-cases of close_range() is to drop file descriptors just before
execve(). This would usually be expressed in the sequence:

unshare(CLONE_FILES);
close_range(3, ~0U);

as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

This expands {dup,unshare)_fd() to take a max_fds argument that indicates the
maximum number of file descriptors to copy from the old struct files. When the
user requests that all file descriptors are supposed to be closed via
close_range(min, max) then we can cap via unshare_fd(min) and hence don't need
to do any of the heavy fput() work for everything above min.

The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
fact currently share our file descriptor table we create a new private copy.
We then close all fds in the requested range and finally after we're done we
install the new fd table.

Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
fs/file.c | 65 ++++++++++++++++++++++++++++----
fs/open.c | 5 +--
include/linux/fdtable.h | 8 ++--
include/uapi/linux/close_range.h | 9 +++++
kernel/fork.c | 11 +++---
5 files changed, 79 insertions(+), 19 deletions(-)
create mode 100644 include/uapi/linux/close_range.h

diff --git a/fs/file.c b/fs/file.c
index e260bfe687d1..718356ed6682 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -19,6 +19,7 @@
#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
+#include <linux/close_range.h>

unsigned int sysctl_nr_open __read_mostly = 1024*1024;
unsigned int sysctl_nr_open_min = BITS_PER_LONG;
@@ -265,12 +266,22 @@ static unsigned int count_open_files(struct fdtable *fdt)
return i;
}

+static unsigned int sane_fdtable_size(struct fdtable *fdt, unsigned int max_fds)
+{
+ unsigned int count;
+
+ count = count_open_files(fdt);
+ if (max_fds < NR_OPEN_DEFAULT)
+ max_fds = NR_OPEN_DEFAULT;
+ return min(count, max_fds);
+}
+
/*
* Allocate a new files structure and copy contents from the
* passed in files structure.
* errorp will be valid only when the returned files_struct is NULL.
*/
-struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
+struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp)
{
struct files_struct *newf;
struct file **old_fds, **new_fds;
@@ -297,7 +308,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)

spin_lock(&oldf->file_lock);
old_fdt = files_fdtable(oldf);
- open_files = count_open_files(old_fdt);
+ open_files = sane_fdtable_size(old_fdt, max_fds);

/*
* Check whether we need to allocate a larger fd array and fd set.
@@ -328,7 +339,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
*/
spin_lock(&oldf->file_lock);
old_fdt = files_fdtable(oldf);
- open_files = count_open_files(old_fdt);
+ open_files = sane_fdtable_size(old_fdt, max_fds);
}

copy_fd_bitmaps(new_fdt, old_fdt, open_files);
@@ -665,30 +676,70 @@ EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
* This closes a range of file descriptors. All file descriptors
* from @fd up to and including @max_fd are closed.
*/
-int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
{
unsigned int cur_max;
+ struct task_struct *me = current;
+ struct files_struct *cur_fds = me->files, *fds = NULL;
+
+ if (flags & ~CLOSE_RANGE_UNSHARE)
+ return -EINVAL;

if (fd > max_fd)
return -EINVAL;

rcu_read_lock();
- cur_max = files_fdtable(files)->max_fds;
+ cur_max = files_fdtable(cur_fds)->max_fds;
rcu_read_unlock();

+ if (flags & CLOSE_RANGE_UNSHARE) {
+ int ret;
+ unsigned int max_unshare_fds = NR_OPEN_MAX;
+
+ /*
+ * If the requested range is greater than the current maximum,
+ * we're closing everything so only copy all file descriptors
+ * beneath the lowest file descriptor.
+ */
+ if ((max_fd + 1) >= cur_max)
+ max_unshare_fds = fd;
+
+ ret = unshare_fd(CLONE_FILES, max_unshare_fds, &fds);
+ if (ret)
+ return ret;
+
+ /*
+ * We used to share our file descriptor table, and have now
+ * created a private one, make sure we're using it below.
+ */
+ if (fds)
+ swap(cur_fds, fds);
+ }
+
/* cap to last valid index into fdtable */
max_fd = min(max_fd, (cur_max - 1));
while (fd <= max_fd) {
struct file *file;

- file = pick_file(files, fd++);
+ file = pick_file(cur_fds, fd++);
if (!file)
continue;

- filp_close(file, files);
+ filp_close(file, cur_fds);
cond_resched();
}

+ if (fds) {
+ /*
+ * We're done closing the files we were supposed to. Time to install
+ * the new file descriptor table and drop the old one.
+ */
+ task_lock(me);
+ me->files = cur_fds;
+ task_unlock(me);
+ put_files_struct(fds);
+ }
+
return 0;
}

diff --git a/fs/open.c b/fs/open.c
index 87e076e9e127..28c561734ece 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1293,10 +1293,7 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
unsigned int, flags)
{
- if (flags)
- return -EINVAL;
-
- return __close_range(current->files, fd, max_fd);
+ return __close_range(fd, max_fd, flags);
}

/*
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index fcd07181a365..a32bf47c593e 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -22,6 +22,7 @@
* as this is the granularity returned by copy_fdset().
*/
#define NR_OPEN_DEFAULT BITS_PER_LONG
+#define NR_OPEN_MAX ~0U

struct fdtable {
unsigned int max_fds;
@@ -109,7 +110,7 @@ struct files_struct *get_files_struct(struct task_struct *);
void put_files_struct(struct files_struct *fs);
void reset_files_struct(struct files_struct *);
int unshare_files(struct files_struct **);
-struct files_struct *dup_fd(struct files_struct *, int *) __latent_entropy;
+struct files_struct *dup_fd(struct files_struct *, unsigned, int *) __latent_entropy;
void do_close_on_exec(struct files_struct *);
int iterate_fd(struct files_struct *, unsigned,
int (*)(const void *, struct file *, unsigned),
@@ -121,9 +122,10 @@ extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
extern int __close_fd(struct files_struct *files,
unsigned int fd);
-extern int __close_range(struct files_struct *files, unsigned int fd,
- unsigned int max_fd);
+extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
extern int __close_fd_get_file(unsigned int fd, struct file **res);
+extern int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
+ struct files_struct **new_fdp);

extern struct kmem_cache *files_cachep;

diff --git a/include/uapi/linux/close_range.h b/include/uapi/linux/close_range.h
new file mode 100644
index 000000000000..6928a9fdee3c
--- /dev/null
+++ b/include/uapi/linux/close_range.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_CLOSE_RANGE_H
+#define _UAPI_LINUX_CLOSE_RANGE_H
+
+/* Unshare the file descriptor table before closing file descriptors. */
+#define CLOSE_RANGE_UNSHARE (1U << 1)
+
+#endif /* _UAPI_LINUX_CLOSE_RANGE_H */
+
diff --git a/kernel/fork.c b/kernel/fork.c
index 48ed22774efa..836976aa7ab9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1466,7 +1466,7 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
goto out;
}

- newf = dup_fd(oldf, &error);
+ newf = dup_fd(oldf, NR_OPEN_MAX, &error);
if (!newf)
goto out;

@@ -2894,14 +2894,15 @@ static int unshare_fs(unsigned long unshare_flags, struct fs_struct **new_fsp)
/*
* Unshare file descriptor table if it is being shared
*/
-static int unshare_fd(unsigned long unshare_flags, struct files_struct **new_fdp)
+int unshare_fd(unsigned long unshare_flags, unsigned int max_fds,
+ struct files_struct **new_fdp)
{
struct files_struct *fd = current->files;
int error = 0;

if ((unshare_flags & CLONE_FILES) &&
(fd && atomic_read(&fd->count) > 1)) {
- *new_fdp = dup_fd(fd, &error);
+ *new_fdp = dup_fd(fd, max_fds, &error);
if (!*new_fdp)
return error;
}
@@ -2961,7 +2962,7 @@ int ksys_unshare(unsigned long unshare_flags)
err = unshare_fs(unshare_flags, &new_fs);
if (err)
goto bad_unshare_out;
- err = unshare_fd(unshare_flags, &new_fd);
+ err = unshare_fd(unshare_flags, NR_OPEN_MAX, &new_fd);
if (err)
goto bad_unshare_cleanup_fs;
err = unshare_userns(unshare_flags, &new_cred);
@@ -3050,7 +3051,7 @@ int unshare_files(struct files_struct **displaced)
struct files_struct *copy = NULL;
int error;

- error = unshare_fd(CLONE_FILES, &copy);
+ error = unshare_fd(CLONE_FILES, NR_OPEN_MAX, &copy);
if (error || !copy) {
*displaced = NULL;
return error;
--
2.27.0

2020-06-04 00:16:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] close_range()

On Wed, Jun 3, 2020 at 4:24 PM Christian Brauner
<[email protected]> wrote:
>
> Ok, here's what I have. Does the below look somewhat sane?

Probably. Needs lots of testing. But this one looks wrong:

> +int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
> {
> + if ((max_fd + 1) >= cur_max)
> + max_unshare_fds = fd;

A normal value for "close everything starting at X" would have a
max_fd value of ~0.

So "max_fd+1" would overflow to 0, and then this would never trigger.

Other than that it looks what what I imagine my feverdreams were about.

Linus

2020-06-04 01:54:45

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] close_range()

On Wed, Jun 03, 2020 at 05:13:36PM -0700, Linus Torvalds wrote:
> On Wed, Jun 3, 2020 at 4:24 PM Christian Brauner
> <[email protected]> wrote:
> >
> > Ok, here's what I have. Does the below look somewhat sane?
>
> Probably. Needs lots of testing. But this one looks wrong:

Right, there's a patch for a test-suite for the new flag too using
CLONE_FILES to create a shared fdtable and the proceeds to close all
(or subsets of) fds:

https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/commit/?h=close_range&id=498e7e844fe6e3f3306b2cd1b5e926e1cd394b99

I've been running that in an endless loop for a while.

>
> > +int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)
> > {
> > + if ((max_fd + 1) >= cur_max)
> > + max_unshare_fds = fd;
>
> A normal value for "close everything starting at X" would have a
> max_fd value of ~0.

Ugh, obvious braino from my side. This should just be:

if (max_fd >= cur_max)
max_unshare_fds = fd;

>
> So "max_fd+1" would overflow to 0, and then this would never trigger.
>
> Other than that it looks what what I imagine my feverdreams were about.

Thanks!
Christian

2020-06-05 14:58:08

by Szabolcs Nagy

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

* Christian Brauner <[email protected]> [2020-06-02 22:42:17 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> I've also coordinated with some FreeBSD developers who got in touch with
> me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> Quite a bunch of projects in userspace are waiting on this syscall
> including Python and systemd.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
> /* that exec is sensitive */
> unshare(CLONE_FILES);
> /* we don't want anything past stderr here */
> close_range(3, ~0U);
> execve(....);

this api needs a documentation patch if there isn't yet.

currently there is no libc interface contract in place that
says which calls may use libc internal fds e.g. i've seen

openlog(...) // opens libc internal syslog fd
...
fork()
closefrom(...) // close syslog fd
open(...) // something that reuses the closed fd
syslog(...) // unsafe: uses the wrong fd
execve(...)

syslog uses a libc internal fd that the user trampled on and
this can go bad in many ways depending on what libc apis are
used between closefrom (or equivalent) and exec.

> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For

why is a switch_to_cloexec_range worse than close_range?
the former seems safer to me. (and allows libc calls
to be made between such switch and exec: libc internal
fds have to be cloexec anyway)

> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating file
> descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)
>
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
> - Python (cf. [2])
> - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).

close_range still seems like a bad operation to expose.

if users really want closing behaviour (instead of marking
fds cloexec) then they likely need coordination with libc
and other libraries.

e.g. this usage does not work:

maxfd = findmaxfd();
call_that_may_leak_fds();
close_range(maxfd,~0U);

as far as i can tell only the close right before exec works.

2020-06-06 02:59:26

by Kyle Evans

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <[email protected]> wrote:
>
> * Christian Brauner <[email protected]> [2020-06-02 22:42:17 +0200]:
> > [... snip ...]
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> > /* that exec is sensitive */
> > unshare(CLONE_FILES);
> > /* we don't want anything past stderr here */
> > close_range(3, ~0U);
> > execve(....);
>
> this api needs a documentation patch if there isn't yet.
>
> currently there is no libc interface contract in place that
> says which calls may use libc internal fds e.g. i've seen
>
> openlog(...) // opens libc internal syslog fd
> ...
> fork()
> closefrom(...) // close syslog fd
> open(...) // something that reuses the closed fd
> syslog(...) // unsafe: uses the wrong fd
> execve(...)
>
> syslog uses a libc internal fd that the user trampled on and
> this can go bad in many ways depending on what libc apis are
> used between closefrom (or equivalent) and exec.
>

Documentation is good. :-) I think you'll find that while this example
seems to be innocuous on FreeBSD (and likely other *BSD), this is an
atypical scenario and generally not advised. You would usually not
start closing until you're actually ready to exec/fail.

> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
>
> why is a switch_to_cloexec_range worse than close_range?
> the former seems safer to me. (and allows libc calls
> to be made between such switch and exec: libc internal
> fds have to be cloexec anyway)
>

I wouldn't say it's worse, but it only solves half the problem. While
closefrom -> exec is the most common usage by a long shot, there are
also times (e.g. post-fork without intent to exec for a daemon/service
type) that you want to go ahead and close everything except maybe a
pipe fd that you've opened for IPC. While uncommon, there's no reason
this needs to devolve into a loop to close 'all the fds' when you can
instead introduce close_range to solve both the exec case and other
less common scenarios.

> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating file
> > descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> > - Python (cf. [2])
> > - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
>
> close_range still seems like a bad operation to expose.
>
> if users really want closing behaviour (instead of marking
> fds cloexec) then they likely need coordination with libc
> and other libraries.
>
> e.g. this usage does not work:
>
> maxfd = findmaxfd();
> call_that_may_leak_fds();
> close_range(maxfd,~0U);
>
> as far as i can tell only the close right before exec works.

I don't have much to say on this one, except that's also an unusual
case. I don't know that I'd anticipate close_range getting used for
that kind of cleanup job (I've never seen a similar use of closefrom),
which seems to just be papering over application/library bugs.

Coordination with libc is generally not much of an issue, because this
is really one of the last things you do before exec() or swiftly
failing miserably. Applications that currently loop over all fd <=
maxfd and close(fd) right now are subject to the very same
constraints, this is just a much more efficient way and
debugger-friendly way to accomplish it. You've absolutely not lived
life until you've had to watch thousands of close() calls painfully
scroll by in truss/strace.

Thank,

Kyle Evans

2020-06-06 03:13:43

by Kyle Evans

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

On Fri, Jun 5, 2020 at 9:54 PM Kyle Evans <[email protected]> wrote:
>
> On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <[email protected]> wrote:
> >
> > * Christian Brauner <[email protected]> [2020-06-02 22:42:17 +0200]:
> > > [... snip ...]
> > >
> > > First, it helps to close all file descriptors of an exec()ing task. This
> > > can be done safely via (quoting Al's example from [1] verbatim):
> > >
> > > /* that exec is sensitive */
> > > unshare(CLONE_FILES);
> > > /* we don't want anything past stderr here */
> > > close_range(3, ~0U);
> > > execve(....);
> >
> > this api needs a documentation patch if there isn't yet.
> >
> > currently there is no libc interface contract in place that
> > says which calls may use libc internal fds e.g. i've seen
> >
> > openlog(...) // opens libc internal syslog fd
> > ...
> > fork()
> > closefrom(...) // close syslog fd
> > open(...) // something that reuses the closed fd
> > syslog(...) // unsafe: uses the wrong fd
> > execve(...)
> >
> > syslog uses a libc internal fd that the user trampled on and
> > this can go bad in many ways depending on what libc apis are
> > used between closefrom (or equivalent) and exec.
> >
>
> Documentation is good. :-) I think you'll find that while this example
> seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> atypical scenario and generally not advised. You would usually not
> start closing until you're actually ready to exec/fail.
>

Minor correction: not innocuous here, either; O_CLOFORK is not yet a thing. :-)

2020-06-06 12:02:13

by Szabolcs Nagy

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

* Kyle Evans <[email protected]> [2020-06-05 21:54:56 -0500]:
> On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <[email protected]> wrote:
> > this api needs a documentation patch if there isn't yet.
> >
> > currently there is no libc interface contract in place that
> > says which calls may use libc internal fds e.g. i've seen
> >
> > openlog(...) // opens libc internal syslog fd
> > ...
> > fork()
> > closefrom(...) // close syslog fd
> > open(...) // something that reuses the closed fd
> > syslog(...) // unsafe: uses the wrong fd
> > execve(...)
> >
> > syslog uses a libc internal fd that the user trampled on and
> > this can go bad in many ways depending on what libc apis are
> > used between closefrom (or equivalent) and exec.
> >
>
> Documentation is good. :-) I think you'll find that while this example
> seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> atypical scenario and generally not advised. You would usually not
> start closing until you're actually ready to exec/fail.

it's a recent bug https://bugs.kde.org/show_bug.cgi?id=420921

but not the first closefrom bug i saw: it is a fundamentally
unsafe operation that frees resources owned by others.

> > > The code snippet above is one way of working around the problem that file
> > > descriptors are not cloexec by default. This is aggravated by the fact that
> > > we can't just switch them over without massively regressing userspace. For
> >
> > why is a switch_to_cloexec_range worse than close_range?
> > the former seems safer to me. (and allows libc calls
> > to be made between such switch and exec: libc internal
> > fds have to be cloexec anyway)
> >
>
> I wouldn't say it's worse, but it only solves half the problem. While
> closefrom -> exec is the most common usage by a long shot, there are
> also times (e.g. post-fork without intent to exec for a daemon/service
> type) that you want to go ahead and close everything except maybe a
> pipe fd that you've opened for IPC. While uncommon, there's no reason
> this needs to devolve into a loop to close 'all the fds' when you can
> instead introduce close_range to solve both the exec case and other
> less common scenarios.

the syslog example shows why post-fork closefrom without
intent to exec does not work: there is no contract about
which api calls behave like syslog, so calling anything
after closefrom can be broken.

libc can introduce new api contracts e.g. that some apis
don't use fds internally or after a closefrom call some
apis behave differently, if this is the expected direction
then it would be nice to propose that on the libc-coord
at openwall.com list.

> Coordination with libc is generally not much of an issue, because this
> is really one of the last things you do before exec() or swiftly
> failing miserably. Applications that currently loop over all fd <=
> maxfd and close(fd) right now are subject to the very same
> constraints, this is just a much more efficient way and
> debugger-friendly way to accomplish it. You've absolutely not lived
> life until you've had to watch thousands of close() calls painfully
> scroll by in truss/strace.

applications do a 'close all fds' operation because there
is no alternative. (i think there are better ways to do
this than looping: you can poll on the fds and only close
the ones that didnt POLLNVAL, this should be more portable
than /proc, but it's besides my point) optimizing this
operation may not be the only way to achive whatever those
applications are trying to do.

if closefrom only works before exec then that should be
documented and callers that do otherwise fixed, if
important users do things between closefrom and exec then
i think a different design is needed with libc maintainers
involved.

2020-06-06 14:46:37

by Kyle Evans

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] open: add close_range()

On Sat, Jun 6, 2020 at 6:55 AM Szabolcs Nagy <[email protected]> wrote:
>
> * Kyle Evans <[email protected]> [2020-06-05 21:54:56 -0500]:
> > On Fri, Jun 5, 2020 at 9:55 AM Szabolcs Nagy <[email protected]> wrote:
> > > this api needs a documentation patch if there isn't yet.
> > >
> > > currently there is no libc interface contract in place that
> > > says which calls may use libc internal fds e.g. i've seen
> > >
> > > openlog(...) // opens libc internal syslog fd
> > > ...
> > > fork()
> > > closefrom(...) // close syslog fd
> > > open(...) // something that reuses the closed fd
> > > syslog(...) // unsafe: uses the wrong fd
> > > execve(...)
> > >
> > > syslog uses a libc internal fd that the user trampled on and
> > > this can go bad in many ways depending on what libc apis are
> > > used between closefrom (or equivalent) and exec.
> > >
> >
> > Documentation is good. :-) I think you'll find that while this example
> > seems to be innocuous on FreeBSD (and likely other *BSD), this is an
> > atypical scenario and generally not advised. You would usually not
> > start closing until you're actually ready to exec/fail.
>
> it's a recent bug https://bugs.kde.org/show_bug.cgi?id=420921
>
> but not the first closefrom bug i saw: it is a fundamentally
> unsafe operation that frees resources owned by others.
>

Yes, close() is an inherently unsafe operation, and they managed this
bug without even having closefrom/close_range. I'm not entirely
convinced folks are going to spontaneously develop a need to massively
close things just because close_range exists. If they have a need,
they're already doing it with what they have available and causing
bugs like the above.

> > > > The code snippet above is one way of working around the problem that file
> > > > descriptors are not cloexec by default. This is aggravated by the fact that
> > > > we can't just switch them over without massively regressing userspace. For
> > >
> > > why is a switch_to_cloexec_range worse than close_range?
> > > the former seems safer to me. (and allows libc calls
> > > to be made between such switch and exec: libc internal
> > > fds have to be cloexec anyway)
> > >
> >
> > I wouldn't say it's worse, but it only solves half the problem. While
> > closefrom -> exec is the most common usage by a long shot, there are
> > also times (e.g. post-fork without intent to exec for a daemon/service
> > type) that you want to go ahead and close everything except maybe a
> > pipe fd that you've opened for IPC. While uncommon, there's no reason
> > this needs to devolve into a loop to close 'all the fds' when you can
> > instead introduce close_range to solve both the exec case and other
> > less common scenarios.
>
> the syslog example shows why post-fork closefrom without
> intent to exec does not work: there is no contract about
> which api calls behave like syslog, so calling anything
> after closefrom can be broken.
>

I think that example shows one scenario where it's not safe, that's
again in firmly "don't do that" territory. You can close arbitrary fds
very early or very late, but not somewhere in the middle of an even
remotely complex application. This problem exists with or without
close_range.

Like I said before, this is already done quite successfully now, along
with other not-even-forked uses. e.g. OpenSSH sshd will closefrom()
just after argument parsing:
https://github.com/openbsd/src/blob/master/usr.bin/ssh/sshd.c#L1582

> libc can introduce new api contracts e.g. that some apis
> don't use fds internally or after a closefrom call some
> apis behave differently, if this is the expected direction
> then it would be nice to propose that on the libc-coord
> at openwall.com list.
>

I suspect it's likely better to document that one should close
arbitrary fds very early or very late instead. Documenting which APIs
are inherently unsafe before/after would seem to be fraught with peril
-- you can enumerate what in libc is a potential problem, but there
are other libs in use by applications that will also use fds
internally and potentially cause problems, but we can't possibly raise
awareness of all of them. We can, however, raise awareness of the
valid and incredibly useful use-cases.

> > Coordination with libc is generally not much of an issue, because this
> > is really one of the last things you do before exec() or swiftly
> > failing miserably. Applications that currently loop over all fd <=
> > maxfd and close(fd) right now are subject to the very same
> > constraints, this is just a much more efficient way and
> > debugger-friendly way to accomplish it. You've absolutely not lived
> > life until you've had to watch thousands of close() calls painfully
> > scroll by in truss/strace.
>
> applications do a 'close all fds' operation because there
> is no alternative. (i think there are better ways to do
> this than looping: you can poll on the fds and only close
> the ones that didnt POLLNVAL, this should be more portable
> than /proc, but it's besides my point) optimizing this
> operation may not be the only way to achive whatever those
> applications are trying to do.

In most cases, those applications really are just trying to close all
fds other than the ones they want to stick around. Polling the fds
before trying to close them would just *double* the current problem
without fixing your other concern at all -- you still can't arbitrary
close open fds without understanding their provenance, and now you've
doubled the number of syscalls required just to close what you don't
need.

fdwalk() + close() is an (IMO) great alternative that's even more
flexible, but that still has its own problems.

Thanks,

Kyle Evans

2020-06-07 12:40:54

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v5 0/3] close_range()

From: Christian Brauner
> Sent: 04 June 2020 00:24
..
> -struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
> +struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int *errorp)

Shouldn't this get changed to use ERR_PTR() etc?

..
> -int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +int __close_range(unsigned fd, unsigned max_fd, unsigned int flags)

If the lowest fd that had ever has CLOEXEC (or CLOFORK) set were
remembered a flag could be passed in to say 'only close the fd
with CLOEXEC set'.

Given that CLOEXEC is almost never cleared, and is typically set
on all but a few fd the 'optimisation' of the bitmap is
probably a pessimisation.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-06-07 13:25:13

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v5 1/3] open: add close_range()

From: Szabolcs Nagy
> Sent: 05 June 2020 15:56
...
> currently there is no libc interface contract in place that
> says which calls may use libc internal fds e.g. i've seen
>
> openlog(...) // opens libc internal syslog fd
> ...
> fork()
> closefrom(...) // close syslog fd
> open(...) // something that reuses the closed fd
> syslog(...) // unsafe: uses the wrong fd
> execve(...)
>
> syslog uses a libc internal fd that the user trampled on and
> this can go bad in many ways depending on what libc apis are
> used between closefrom (or equivalent) and exec.

It is, of course, traditional that daemons only call
close(0); close(1); close(2);
Took us ages to discover that a misspelt fprintf()
was adding data to the stdout buffer and eventually
flushing 10k of ascii text into an inter-process pipe
that had a 32bit field for 'message extension length'.

FWIW isn't syslog() going to go badly wrong after fork()
anyway?
Unless libc's fork() calls closelog().

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

Subject: Re: [PATCH v5 1/3] open: add close_range()

Hey Christian,

Could we please have a manual page for the close_range(2) syscall
that's about to land in 5.9?

Thanks,

Michael

On Wed, 3 Jun 2020 at 12:24, Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> Hi Christian,
>
> Could we have a manual page for this API (best before it's merged)?
>
> Thanks,
>
> Michael
>
> On Tue, 2 Jun 2020 at 22:44, Christian Brauner
> <[email protected]> wrote:
> >
> > This adds the close_range() syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > I've also coordinated with some FreeBSD developers who got in touch with
> > me (Cced below). FreeBSD intends to add the same syscall once we merged it.
> > Quite a bunch of projects in userspace are waiting on this syscall
> > including Python and systemd.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> > /* that exec is sensitive */
> > unshare(CLONE_FILES);
> > /* we don't want anything past stderr here */
> > close_range(3, ~0U);
> > execve(....);
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> > task is multi-threaded and shares the file descriptor table with another
> > thread in which case two threads could race with one thread allocating file
> > descriptors and the other one closing them via close_range(). For the
> > general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> > - Python (cf. [2])
> > - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> > int dir_fd;
> > DIR *dir;
> > struct dirent *direntp;
> >
> > dir = opendir("/proc/self/fd");
> > if (!dir)
> > return -1;
> > dir_fd = dirfd(dir);
> > while ((direntp = readdir(dir))) {
> > int fd;
> > if (strcmp(direntp->d_name, ".") == 0)
> > continue;
> > if (strcmp(direntp->d_name, "..") == 0)
> > continue;
> > fd = atoi(direntp->d_name);
> > if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> > continue;
> > close(fd);
> > }
> > closedir(dir);
> > return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> > - close_all_fds(): ~280 us
> > - close_range(): ~24 us
> >
> > 2. closing 1000 open files:
> > - close_all_fds(): ~5000 us
> > - close_range(): ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extension. Once can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
> >
> > From an implementation side this is kept rather dumb. It saw some input
> > from David and Jann but all nonsense is obviously my own!
> > - Errors to close file descriptors are currently ignored. (Could be changed
> > by setting a flag in the future if needed.)
> > - __close_range() is a rather simplistic wrapper around __close_fd().
> > My reasoning behind this is based on the nature of how __close_fd() needs
> > to release an fd. But maybe I misunderstood specifics:
> > We take the files_lock and rcu-dereference the fdtable of the calling
> > task, we find the entry in the fdtable, get the file and need to release
> > files_lock before calling filp_close().
> > In the meantime the fdtable might have been altered so we can't just
> > retake the spinlock and keep the old rcu-reference of the fdtable
> > around. Instead we need to grab a fresh reference to the fdtable.
> > If my reasoning is correct then there's really no point in fancyfying
> > __close_range(): We just need to rcu-dereference the fdtable of the
> > calling task once to cap the max_fd value correctly and then go on
> > calling __close_fd() in a loop.
> >
> > /* References */
> > [1]: https://lore.kernel.org/lkml/[email protected]/
> > [2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
> > [3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
> > [4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
> > [5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
> > [6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
> > Note that this is an internal implementation that is not exported.
> > Currently, libc seems to not provide an exported version of this
> > because of missing kernel support to do this.
> > [7]: https://github.com/rust-lang/rust/issues/12148
> > [8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
> > Rust's solution is slightly different but is equally unperformant.
> > Rust calls getdtablesize() which is a glibc library function that
> > simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
> > goes on to call close() on each fd. That's obviously overkill for most
> > tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
> > OPEN_MAX.
> > Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
> > to 1024. Even in this case, there's a very high chance that in the
> > common case Rust is calling the close() syscall 1021 times pointlessly
> > if the task just has 0, 1, and 2 open.
> >
> > Suggested-by: Al Viro <[email protected]>
> > Signed-off-by: Christian Brauner <[email protected]>
> > Cc: Arnd Bergmann <[email protected]>
> > Cc: Kyle Evans <[email protected]>
> > Cc: Jann Horn <[email protected]>
> > Cc: David Howells <[email protected]>
> > Cc: Dmitry V. Levin <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Florian Weimer <[email protected]>
> > Cc: [email protected]
> > ---
> > /* v2 */
> > - Linus Torvalds <[email protected]>:
> > - add cond_resched() to yield cpu when closing a lot of file descriptors
> > - Al Viro <[email protected]>:
> > - add cond_resched() to yield cpu when closing a lot of file descriptors
> >
> > /* v3 */
> > unchanged
> >
> > /* v4 */
> > - Oleg Nesterov <[email protected]>:
> > - fix braino: s/max()/min()/
> >
> > /* v5 */
> > unchanged
> > ---
> > fs/file.c | 62 ++++++++++++++++++++++++++++++++++------
> > fs/open.c | 20 +++++++++++++
> > include/linux/fdtable.h | 2 ++
> > include/linux/syscalls.h | 2 ++
> > 4 files changed, 78 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/file.c b/fs/file.c
> > index abb8b7081d7a..e260bfe687d1 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -10,6 +10,7 @@
> > #include <linux/syscalls.h>
> > #include <linux/export.h>
> > #include <linux/fs.h>
> > +#include <linux/kernel.h>
> > #include <linux/mm.h>
> > #include <linux/sched/signal.h>
> > #include <linux/slab.h>
> > @@ -620,12 +621,9 @@ void fd_install(unsigned int fd, struct file *file)
> >
> > EXPORT_SYMBOL(fd_install);
> >
> > -/*
> > - * The same warnings as for __alloc_fd()/__fd_install() apply here...
> > - */
> > -int __close_fd(struct files_struct *files, unsigned fd)
> > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > {
> > - struct file *file;
> > + struct file *file = NULL;
> > struct fdtable *fdt;
> >
> > spin_lock(&files->file_lock);
> > @@ -637,15 +635,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
> > goto out_unlock;
> > rcu_assign_pointer(fdt->fd[fd], NULL);
> > __put_unused_fd(files, fd);
> > - spin_unlock(&files->file_lock);
> > - return filp_close(file, files);
> >
> > out_unlock:
> > spin_unlock(&files->file_lock);
> > - return -EBADF;
> > + return file;
> > +}
> > +
> > +/*
> > + * The same warnings as for __alloc_fd()/__fd_install() apply here...
> > + */
> > +int __close_fd(struct files_struct *files, unsigned fd)
> > +{
> > + struct file *file;
> > +
> > + file = pick_file(files, fd);
> > + if (!file)
> > + return -EBADF;
> > +
> > + return filp_close(file, files);
> > }
> > EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
> >
> > +/**
> > + * __close_range() - Close all file descriptors in a given range.
> > + *
> > + * @fd: starting file descriptor to close
> > + * @max_fd: last file descriptor to close
> > + *
> > + * This closes a range of file descriptors. All file descriptors
> > + * from @fd up to and including @max_fd are closed.
> > + */
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > + unsigned int cur_max;
> > +
> > + if (fd > max_fd)
> > + return -EINVAL;
> > +
> > + rcu_read_lock();
> > + cur_max = files_fdtable(files)->max_fds;
> > + rcu_read_unlock();
> > +
> > + /* cap to last valid index into fdtable */
> > + max_fd = min(max_fd, (cur_max - 1));
> > + while (fd <= max_fd) {
> > + struct file *file;
> > +
> > + file = pick_file(files, fd++);
> > + if (!file)
> > + continue;
> > +
> > + filp_close(file, files);
> > + cond_resched();
> > + }
> > +
> > + return 0;
> > +}
> > +
> > /*
> > * variant of __close_fd that gets a ref on the file for later fput.
> > * The caller must ensure that filp_close() called on the file, and then
> > diff --git a/fs/open.c b/fs/open.c
> > index 719b320ede52..87e076e9e127 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -1279,6 +1279,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
> > return retval;
> > }
> >
> > +/**
> > + * close_range() - Close all file descriptors in a given range.
> > + *
> > + * @fd: starting file descriptor to close
> > + * @max_fd: last file descriptor to close
> > + * @flags: reserved for future extensions
> > + *
> > + * This closes a range of file descriptors. All file descriptors
> > + * from @fd up to and including @max_fd are closed.
> > + * Currently, errors to close a given file descriptor are ignored.
> > + */
> > +SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
> > + unsigned int, flags)
> > +{
> > + if (flags)
> > + return -EINVAL;
> > +
> > + return __close_range(current->files, fd, max_fd);
> > +}
> > +
> > /*
> > * This routine simulates a hangup on the tty, to arrange that users
> > * are given clean terminals at login time.
> > diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> > index f07c55ea0c22..fcd07181a365 100644
> > --- a/include/linux/fdtable.h
> > +++ b/include/linux/fdtable.h
> > @@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
> > unsigned int fd, struct file *file);
> > extern int __close_fd(struct files_struct *files,
> > unsigned int fd);
> > +extern int __close_range(struct files_struct *files, unsigned int fd,
> > + unsigned int max_fd);
> > extern int __close_fd_get_file(unsigned int fd, struct file **res);
> >
> > extern struct kmem_cache *files_cachep;
> > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> > index 1815065d52f3..18fea399329b 100644
> > --- a/include/linux/syscalls.h
> > +++ b/include/linux/syscalls.h
> > @@ -442,6 +442,8 @@ asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
> > asmlinkage long sys_openat2(int dfd, const char __user *filename,
> > struct open_how *how, size_t size);
> > asmlinkage long sys_close(unsigned int fd);
> > +asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
> > + unsigned int flags);
> > asmlinkage long sys_vhangup(void);
> >
> > /* fs/pipe.c */
> > --
> > 2.26.2
> >
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/