2019-05-23 15:50:23

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 0/2] close_range()

Hey,

This is v2 of this patchset.

In accordance with some comments There's a cond_resched() added to the
close loop similar to what is done for close_files().
A common helper pick_file() for __close_fd() and __close_range() has
been split out. This allows to only make a cond_resched() call when
filp_close() has been called similar to what is done in close_files().
Maybe that's not worth it. Jann mentioned that cond_resched() looks
rather cheap.
So it maybe that we could simply do:

while (fd <= max_fd) {
__close(files, fd++);
cond_resched();
}

I also added a missing test for close_range(fd, fd, 0).

Thanks!
Christian

Christian Brauner (2):
open: add close_range()
tests: add close_range() tests

arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 62 +++++++-
fs/open.c | 20 +++
include/linux/fdtable.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 6 +
.../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
26 files changed, 249 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c

--
2.21.0


2019-05-23 15:51:37

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 1/2] open: add close_range()

This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

/* that exec is sensitive */
unshare(CLONE_FILES);
/* we don't want anything past stderr here */
close_range(3, ~0U);
execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
task is multi-threaded and shares the file descriptor table with another
thread in which case two threads could race with one thread allocating
file descriptors and the other one closing them via close_range(). For the
general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
- Python (cf. [2])
- Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
int dir_fd;
DIR *dir;
struct dirent *direntp;

dir = opendir("/proc/self/fd");
if (!dir)
return -1;
dir_fd = dirfd(dir);
while ((direntp = readdir(dir))) {
int fd;
if (strcmp(direntp->d_name, ".") == 0)
continue;
if (strcmp(direntp->d_name, "..") == 0)
continue;
fd = atoi(direntp->d_name);
if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
continue;
close(fd);
}
closedir(dir);
return 0;
}

to close_range() yields:
1. closing 4 open files:
- close_all_fds(): ~280 us
- close_range(): ~24 us

2. closing 1000 open files:
- close_all_fds(): ~5000 us
- close_range(): ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
My reasoning behind this is based on the nature of how __close_fd() needs
to release an fd. But maybe I misunderstood specifics:
We take the files_lock and rcu-dereference the fdtable of the calling
task, we find the entry in the fdtable, get the file and need to release
files_lock before calling filp_close().
In the meantime the fdtable might have been altered so we can't just
retake the spinlock and keep the old rcu-reference of the fdtable
around. Instead we need to grab a fresh reference to the fdtable.
If my reasoning is correct then there's really no point in fancyfying
__close_range(): We just need to rcu-dereference the fdtable of the
calling task once to cap the max_fd value correctly and then go on
calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/[email protected]/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
Note that this is an internal implementation that is not exported.
Currently, libc seems to not provide an exported version of this
because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
Rust's solution is slightly different but is equally unperformant.
Rust calls getdtablesize() which is a glibc library function that
simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
goes on to call close() on each fd. That's obviously overkill for most
tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
OPEN_MAX.
Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
to 1024. Even in this case, there's a very high chance that in the
common case Rust is calling the close() syscall 1021 times pointlessly
if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: [email protected]
---
v1:
- Linus Torvalds <[email protected]>:
- add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <[email protected]>:
- add cond_resched() to yield cpu when closing a lot of file descriptors
v2: unchanged
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 62 ++++++++++++++++++---
fs/open.c | 20 +++++++
include/linux/fdtable.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 4 +-
22 files changed, 99 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
541 common fsconfig sys_fsconfig
542 common fsmount sys_fsmount
543 common fspick sys_fspick
+545 common close_range sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)

/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
431 n32 fsconfig sys_fsconfig
432 n32 fsmount sys_fsmount
433 n32 fspick sys_fspick
+435 n32 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
431 n64 fsconfig sys_fsconfig
432 n64 fsmount sys_fsmount
433 n64 fspick sys_fspick
+435 n64 close_range sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
431 o32 fsconfig sys_fsconfig
432 o32 fsmount sys_fsmount
433 o32 fspick sys_fspick
+435 o32 close_range sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig sys_fsconfig
432 common fsmount sys_fsmount sys_fsmount
433 common fspick sys_fspick sys_fspick
+435 common close_range sys_close_range sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
431 i386 fsconfig sys_fsconfig __ia32_sys_fsconfig
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
+435 i386 close_range sys_close_range __ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
431 common fsconfig __x64_sys_fsconfig
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
+435 common close_range __x64_sys_close_range

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
431 common fsconfig sys_fsconfig
432 common fsmount sys_fsmount
433 common fspick sys_fspick
+435 common close_range sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..53430d68bbd7 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -10,6 +10,7 @@
#include <linux/syscalls.h>
#include <linux/export.h>
#include <linux/fs.h>
+#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/sched/signal.h>
#include <linux/slab.h>
@@ -615,12 +616,9 @@ void fd_install(unsigned int fd, struct file *file)

EXPORT_SYMBOL(fd_install);

-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
{
- struct file *file;
+ struct file *file = NULL;
struct fdtable *fdt;

spin_lock(&files->file_lock);
@@ -632,15 +630,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
goto out_unlock;
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
- spin_unlock(&files->file_lock);
- return filp_close(file, files);

out_unlock:
spin_unlock(&files->file_lock);
- return -EBADF;
+ return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+ struct file *file;
+
+ file = pick_file(files, fd);
+ if (!file)
+ return -EBADF;
+
+ return filp_close(file, files);
}
EXPORT_SYMBOL(__close_fd); /* for ksys_close() */

+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+ unsigned int cur_max;
+
+ if (fd > max_fd)
+ return -EINVAL;
+
+ rcu_read_lock();
+ cur_max = files_fdtable(files)->max_fds;
+ rcu_read_unlock();
+
+ /* cap to last valid index into fdtable */
+ max_fd = max(max_fd, (cur_max - 1));
+ while (fd <= max_fd) {
+ struct file *file;
+
+ file = pick_file(files, fd++);
+ if (!file)
+ continue;
+
+ filp_close(file, files);
+ cond_resched();
+ }
+
+ return 0;
+}
+
/*
* variant of __close_fd that gets a ref on the file for later fput
*/
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
return retval;
}

+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd: starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags: reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+ unsigned int, flags)
+{
+ if (flags)
+ return -EINVAL;
+
+ return __close_range(current->files, fd, max_fd);
+}
+
/*
* This routine simulates a hangup on the tty, to arrange that users
* are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
unsigned int fd, struct file *file);
extern int __close_fd(struct files_struct *files,
unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+ unsigned int max_fd);
extern int __close_fd_get_file(unsigned int fd, struct file **res);

extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
umode_t mode);
asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags);
asmlinkage long sys_vhangup(void);

/* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
__SYSCALL(__NR_fsmount, sys_fsmount)
#define __NR_fspick 433
__SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)

#undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436

/*
* 32 bit systems traditionally used different
--
2.21.0

2019-05-23 15:52:02

by Christian Brauner

[permalink] [raw]
Subject: [PATCH v2 2/2] tests: add close_range() tests

This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there there
are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum

Signed-off-by: Christian Brauner <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: David Howells <[email protected]>
Cc: Dmitry V. Levin <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Florian Weimer <[email protected]>
Cc: [email protected]
---
v1: unchanged
v2:
- Christian Brauner <[email protected]>:
- verify that close_range() correctly closes a single file descriptor
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 6 +
.../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
4 files changed, 150 insertions(+)
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..06e57fabbff9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += bpf
TARGETS += breakpoints
TARGETS += capabilities
TARGETS += cgroup
+TARGETS += core
TARGETS += cpufreq
TARGETS += cpu-hotplug
TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
new file mode 100644
index 000000000000..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
new file mode 100644
index 000000000000..de3ae68aa345
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -g -I../../../../usr/include/ -I../../../../include
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index 000000000000..d6e6079d3d53
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+ unsigned int flags)
+{
+ return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+ const char *test_name = "close_range";
+ int i, ret;
+ int open_fds[101];
+ int fd_max, fd_mid, fd_min;
+
+ ksft_set_plan(9);
+
+ for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+ int fd;
+
+ fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+ if (fd < 0) {
+ if (errno == ENOENT)
+ ksft_exit_skip(
+ "%s test: skipping test since /dev/null does not exist\n",
+ test_name);
+
+ ksft_exit_fail_msg(
+ "%s test: %s - failed to open /dev/null\n",
+ strerror(errno), test_name);
+ }
+
+ open_fds[i] = fd;
+ }
+
+ fd_min = open_fds[0];
+ fd_max = open_fds[99];
+
+ ret = sys_close_range(fd_min, fd_max, 1);
+ if (!ret)
+ ksft_exit_fail_msg(
+ "%s test: managed to pass invalid flag value\n",
+ test_name);
+ ksft_test_result_pass("do not allow invalid flag values for close_range()\n");
+
+ fd_mid = open_fds[50];
+ ret = sys_close_range(fd_min, fd_mid, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from %d to %d\n",
+ test_name, fd_min, fd_mid);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+ for (i = 0; i <= 50; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from %d to %d\n",
+ test_name, fd_min, fd_mid);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_min, fd_mid);
+
+ /* create a couple of gaps */
+ close(57);
+ close(78);
+ close(81);
+ close(82);
+ close(84);
+ close(90);
+
+ fd_mid = open_fds[51];
+ /* Choose slightly lower limit and leave some fds for a later test */
+ fd_max = open_fds[92];
+ ret = sys_close_range(fd_mid, fd_max, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+ for (i = 51; i <= 92; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+ fd_mid = open_fds[93];
+ fd_max = open_fds[99];
+ /* test that the kernel caps and still closes all fds */
+ ret = sys_close_range(fd_mid, UINT_MAX, 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+ for (i = 93; i < 100; i++) {
+ ret = fcntl(open_fds[i], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close range of file descriptors from 51 to 100\n",
+ test_name);
+ }
+ ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+ ret = sys_close_range(open_fds[100], open_fds[100], 0);
+ if (ret < 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close single file descriptor\n",
+ test_name);
+ ksft_test_result_pass("close_range() closed single file descriptor\n");
+
+ ret = fcntl(open_fds[100], F_GETFL);
+ if (ret >= 0)
+ ksft_exit_fail_msg(
+ "%s test: Failed to close single file descriptor\n",
+ test_name);
+ ksft_test_result_pass("fcntl() verify closed single file descriptor\n");
+
+ return ksft_exit_pass();
+}
--
2.21.0

2019-05-23 16:23:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] open: add close_range()

On 05/23, Christian Brauner wrote:
>
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> + unsigned int cur_max;
> +
> + if (fd > max_fd)
> + return -EINVAL;
> +
> + rcu_read_lock();
> + cur_max = files_fdtable(files)->max_fds;
> + rcu_read_unlock();
> +
> + /* cap to last valid index into fdtable */
> + max_fd = max(max_fd, (cur_max - 1));
^^^

Hmm. min() ?

Oleg.

2019-05-23 16:38:12

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] open: add close_range()

On Thu, May 23, 2019 at 06:20:05PM +0200, Oleg Nesterov wrote:
> On 05/23, Christian Brauner wrote:
> >
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > + unsigned int cur_max;
> > +
> > + if (fd > max_fd)
> > + return -EINVAL;
> > +
> > + rcu_read_lock();
> > + cur_max = files_fdtable(files)->max_fds;
> > + rcu_read_unlock();
> > +
> > + /* cap to last valid index into fdtable */
> > + max_fd = max(max_fd, (cur_max - 1));
> ^^^
>
> Hmm. min() ?

Yes, thanks! Massive brainf*rt on my end, sorry.

Christian

2019-05-23 18:23:21

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

> This is v2 of this patchset.

We've sent fdmap(2) back in the day:
https://marc.info/?l=linux-kernel&m=150628359803324&w=4

It can do everything close_range() does and potentially more.

If people ask for it I can rebase it and resend.

P.S.: you are 2 steps behind :-)
https://lwn.net/Articles/490224/

2019-05-23 21:37:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

On Thu, May 23, 2019 at 11:22 AM Alexey Dobriyan <[email protected]> wrote:
>
> > This is v2 of this patchset.
>
> We've sent fdmap(2) back in the day:

Well, if the main point of the exercise is performance, then fdmap()
is clearly inferior.

Sadly, with all the HW security mitigation, system calls are no longer cheap.

Would there ever be any other reason to traverse unknown open files
than to close them?

Linus

2019-05-24 10:31:05

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

On Thu, May 23, 2019 at 02:34:31PM -0700, Linus Torvalds wrote:
> On Thu, May 23, 2019 at 11:22 AM Alexey Dobriyan <[email protected]> wrote:
> >
> > > This is v2 of this patchset.
> >
> > We've sent fdmap(2) back in the day:
>
> Well, if the main point of the exercise is performance, then fdmap()
> is clearly inferior.
>
> Sadly, with all the HW security mitigation, system calls are no longer cheap.
>
> Would there ever be any other reason to traverse unknown open files
> than to close them?

I have had lively discussions and interestingly worded mails on account
of all of this. But noone has brought up this scenario. Florian also
said that it's not needed [1].

If we really want something like that we don't really need a new syscall
I think. We can just do a prctl() command or fcntl() command that will
give you back the next open fd.

There's imho crazy ideas out there what people expect a multi-close file
descriptor solution to look like. Service manager people apparently
think it would be a great idea to have a syscall that takes an array of
fds which the kernel is supposed to leave open and close all others,
basically "close all of the fds only leave out those I tell you".
I think for such a use-cases they can push for a prctl(PR_GET_NEXTFD, 2)
or a fcntl(2, F_GET_NEXTFD) and implement that in userspace.

I really only care about having a performant solution to closing a range
of fds that's a little more flexible than closefrom() without going all
crazy generic and copying (possibly) large bits of data between kernel-
and userspace.

close_range() is really something I've picked up on the side because the
current state has bothered me (and others) a long time whenever I have
to have my userspace hat on. With Al being in favor of it this seemed
like we should do it.
I actually wanted to have Jann's and my clone6() version on the table by
now since that would unblock larger things like the time namespace
patchset.

In any case I'll send v3 with my max()/min() braino fixed that Oleg
thankfully spotted and the split into two patches that Arnd suggested.

[1]: https://lkml.org/lkml/2019/5/21/516

2019-05-24 18:41:24

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

On Thu, May 23, 2019 at 02:34:31PM -0700, Linus Torvalds wrote:
> On Thu, May 23, 2019 at 11:22 AM Alexey Dobriyan <[email protected]> wrote:
> >
> > > This is v2 of this patchset.
> >
> > We've sent fdmap(2) back in the day:
>
> Well, if the main point of the exercise is performance, then fdmap()
> is clearly inferior.

This is not true because there are other usecases.

Current equivalent is readdir() where getdents is essentially bulk fdmap()
with pretty-printing. glibc does getdents into 32KB buffer.

There was a bulk taskstats patch long before meltdown fiasco.

Unfortunately closerange() only closes ranges.
This is why I didn't even tried to send closefrom(2) from OpenBSD.

> Sadly, with all the HW security mitigation, system calls are no longer cheap.
>
> Would there ever be any other reason to traverse unknown open files
> than to close them?

This is what lsof(1) does:

3140 openat(AT_FDCWD, "/proc/29499/fd", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
3140 fstat(4, {st_mode=S_IFDIR|0500, st_size=0, ...}) = 0
3140 getdents(4, /* 6 entries */, 32768) = 144
3140 readlink("/proc/29499/fd/0", "/dev/pts/4", 4096) = 10
3140 lstat("/proc/29499/fd/0", {st_mode=S_IFLNK|0700, st_size=64, ...}) = 0
3140 stat("/proc/29499/fd/0", {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 4), ...}) = 0
3140 openat(AT_FDCWD, "/proc/29499/fdinfo/0", O_RDONLY) = 7
3140 fstat(7, {st_mode=S_IFREG|0400, st_size=0, ...}) = 0
3140 read(7, "pos:\t0\nflags:\t02002\nmnt_id:\t24\n", 1024) = 31
3140 read(7, "", 1024) = 0
3140 close(7)
...

Once fdmap(2) or equivalent is in, more bulk system calls operating on
descriptors can pop up. But closefrom() will remain closefrom().

2019-05-24 18:58:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

On Fri, May 24, 2019 at 11:39 AM Alexey Dobriyan <[email protected]> wrote:
>
> > Would there ever be any other reason to traverse unknown open files
> > than to close them?
>
> This is what lsof(1) does:

I repeat: Would there ever be any other reason to traverse unknown
open files than to close them?

lsof is not AT ALL a relevant argument.

lsof fundamentally wants /proc, because lsof looks at *other*
processes. That has absolutely zero to do with fdmap. lsof does *not*
want fdmap at all. It wants "list other processes files". Which is
very much what /proc is all about.

So your argument that "fdmap is more generic" is bogus.

fdmap is entirely pointless unless you can show a real and relevant
(to performance) use of it.

When you would *possibly* have a "let me get a list of all the file
descriptors I have open, because I didn't track them myself"
situation? That makes no sense. Particularly from a performance
standpoint.

In contrast, "close_range()" makes sense as an operation. I can
explain exactly when it would be used, and I can easily see a
situation where "I've opened a ton of files, now I want to release
them" is a valid model of operation. And it's a valid optimization to
do a bulk operation like that.

IOW, close_range() makes sense as an operation even if you could just
say "ok, I know exactly what files I have open". But it also makes
sense as an operation for the case of "I don't even care what files I
have open, I just want to close them".

In contrast, the "I have opened a ton of files, and I don't even know
what the hell I did, so can you list them for me" makes no sense.

Because outside of "close them", there's no bulk operation that makes
sense on random file handles that you don't know what they are. Unless
you iterate over them and do the stat thing or whatever to figure it
out - which is lsof, but as mentioned, it's about *other* peoples
files.

Linus

2019-05-24 21:29:10

by Alexey Dobriyan

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

On Fri, May 24, 2019 at 11:55:44AM -0700, Linus Torvalds wrote:
> On Fri, May 24, 2019 at 11:39 AM Alexey Dobriyan <[email protected]> wrote:
> >
> > > Would there ever be any other reason to traverse unknown open files
> > > than to close them?
> >
> > This is what lsof(1) does:
>
> I repeat: Would there ever be any other reason to traverse unknown
> open files than to close them?
>
> lsof is not AT ALL a relevant argument.
>
> lsof fundamentally wants /proc, because lsof looks at *other*
> processes. That has absolutely zero to do with fdmap. lsof does *not*
> want fdmap at all. It wants "list other processes files". Which is
> very much what /proc is all about.
>
> So your argument that "fdmap is more generic" is bogus.
>
> fdmap is entirely pointless unless you can show a real and relevant
> (to performance) use of it.
>
> When you would *possibly* have a "let me get a list of all the file
> descriptors I have open, because I didn't track them myself"
> situation? That makes no sense. Particularly from a performance
> standpoint.
>
> In contrast, "close_range()" makes sense as an operation.

What about orthogonality of interfaces?

fdmap()
bulk_close()

Now fdmap() can be reused for lsof/criu and it is only 2 system calls
for close-everything usecase which is OK because readdir is 4(!) minimum:

open
getdents
getdents() = 0
close

Writing all of this I understood how fdmap can be made more faster which
neither getdents() nor even read() have the luxury of: it can return
a flag if more data is available so that application would do next fdmap()
only if truly necessary.

> I can
> explain exactly when it would be used, and I can easily see a
> situation where "I've opened a ton of files, now I want to release
> them" is a valid model of operation. And it's a valid optimization to
> do a bulk operation like that.
>
> IOW, close_range() makes sense as an operation even if you could just
> say "ok, I know exactly what files I have open". But it also makes
> sense as an operation for the case of "I don't even care what files I
> have open, I just want to close them".
>
> In contrast, the "I have opened a ton of files, and I don't even know
> what the hell I did, so can you list them for me" makes no sense.
>
> Because outside of "close them", there's no bulk operation that makes
> sense on random file handles that you don't know what they are. Unless
> you iterate over them and do the stat thing or whatever to figure it
> out - which is lsof, but as mentioned, it's about *other* peoples
> files.

What you're doing is making exactly one usecase take exactly one system
call and leaving everything else deal with /proc. Stracing lsof shows
very clearly how stupid and how wasteful it is. Especially now that
we're post-meltdown era caring about system call costs (yeah suure).

I'm suggesting make close-universe usecase take only 2 system calls.
which is still better than anything /proc can offer.

2019-05-24 23:47:13

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] close_range()

On Sat, May 25, 2019 at 12:27:40AM +0300, Alexey Dobriyan wrote:

> What about orthogonality of interfaces?
>
> fdmap()
> bulk_close()
>
> Now fdmap() can be reused for lsof/criu and it is only 2 system calls
> for close-everything usecase which is OK because readdir is 4(!) minimum:
>
> open
> getdents
> getdents() = 0
> close
>
> Writing all of this I understood how fdmap can be made more faster which
> neither getdents() nor even read() have the luxury of: it can return
> a flag if more data is available so that application would do next fdmap()
> only if truly necessary.

Tactless question: what has traumatised you so badly about string operations?
Because that seems to be the common denominator to a lot of things...

2019-05-26 20:30:10

by Szabolcs Nagy

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] open: add close_range()

* Christian Brauner <[email protected]> [2019-05-23 17:47:46 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
>
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
>
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
>
> /* that exec is sensitive */
> unshare(CLONE_FILES);
> /* we don't want anything past stderr here */
> close_range(3, ~0U);
> execve(....);

this does not work in a hosted c implementation unless the libc
guarantees not to use libc internal fds (e.g. in execve).
(the libc cannot easily abstract fds, so the syscall abi layer
fd semantics is necessarily visible to user code.)

i think this is a new constraint for userspace runtimes.
(not entirely unreasonable though)

> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).

was cloexec_range(a,b) considered?

> (Please note, unshare(CLONE_FILES) should only be needed if the calling
> task is multi-threaded and shares the file descriptor table with another
> thread in which case two threads could race with one thread allocating
> file descriptors and the other one closing them via close_range(). For the
> general case close_range() before the execve() is sufficient.)

assuming there is no unblocked signal handler that may open fds.

a syscall that tramples on fds not owned by the caller is ugly
(not generally safe to use and may break things if it gets used),
i don't have a better solution for fd leaks or missing cloexec,
but i think it needs more analysis how it can be used.

2019-05-28 02:36:12

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tests: add close_range() tests

Christian Brauner <[email protected]> writes:
> This adds basic tests for the new close_range() syscall.
> - test that no invalid flags can be passed
> - test that a range of file descriptors is correctly closed
> - test that a range of file descriptors is correctly closed if there there
> are already closed file descriptors in the range
> - test that max_fd is correctly capped to the current fdtable maximum
>
> Signed-off-by: Christian Brauner <[email protected]>
> Cc: Arnd Bergmann <[email protected]>
> Cc: Jann Horn <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Dmitry V. Levin <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Florian Weimer <[email protected]>
> Cc: [email protected]
> ---
> v1: unchanged
> v2:
> - Christian Brauner <[email protected]>:
> - verify that close_range() correctly closes a single file descriptor
> ---
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/core/.gitignore | 1 +
> tools/testing/selftests/core/Makefile | 6 +
> .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
> 4 files changed, 150 insertions(+)
> create mode 100644 tools/testing/selftests/core/.gitignore
> create mode 100644 tools/testing/selftests/core/Makefile
> create mode 100644 tools/testing/selftests/core/close_range_test.c
>
> diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
> new file mode 100644
> index 000000000000..6e6712ce5817
> --- /dev/null
> +++ b/tools/testing/selftests/core/.gitignore
> @@ -0,0 +1 @@
> +close_range_test
> diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
> new file mode 100644
> index 000000000000..de3ae68aa345
> --- /dev/null
> +++ b/tools/testing/selftests/core/Makefile
> @@ -0,0 +1,6 @@
> +CFLAGS += -g -I../../../../usr/include/ -I../../../../include

Your second -I pulls the unexported kernel headers in, userspace
programs shouldn't include unexported kernel headers.

It breaks the build on powerpc with eg:

powerpc64le-linux-gnu-gcc -g -I../../../../usr/include/ -I../../../../include close_range_test.c -o /output/kselftest/core/close_range_test
In file included from /usr/powerpc64le-linux-gnu/include/bits/fcntl-linux.h:346,
from /usr/powerpc64le-linux-gnu/include/bits/fcntl.h:62,
from /usr/powerpc64le-linux-gnu/include/fcntl.h:35,
from close_range_test.c:5:
../../../../include/linux/falloc.h:13:2: error: unknown type name '__s16'
__s16 l_type;
^~~~~


Did you do that on purpose or just copy it from one of the other
Makefiles? :)

If you're just wanting to get the syscall number when the headers
haven't been exported, I think the best solution is to do eg:

diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
index d6e6079d3d53..34c6f02f25de 100644
--- a/tools/testing/selftests/core/close_range_test.c
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -14,6 +14,10 @@

#include "../kselftest.h"

+#ifndef __NR_close_range
+#define __NR_close_range 435
+#endif
+
static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
unsigned int flags)
{


cheers

2019-05-28 11:29:23

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] tests: add close_range() tests

On Tue, May 28, 2019 at 12:33:41PM +1000, Michael Ellerman wrote:
> Christian Brauner <[email protected]> writes:
> > This adds basic tests for the new close_range() syscall.
> > - test that no invalid flags can be passed
> > - test that a range of file descriptors is correctly closed
> > - test that a range of file descriptors is correctly closed if there there
> > are already closed file descriptors in the range
> > - test that max_fd is correctly capped to the current fdtable maximum
> >
> > Signed-off-by: Christian Brauner <[email protected]>
> > Cc: Arnd Bergmann <[email protected]>
> > Cc: Jann Horn <[email protected]>
> > Cc: David Howells <[email protected]>
> > Cc: Dmitry V. Levin <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Linus Torvalds <[email protected]>
> > Cc: Florian Weimer <[email protected]>
> > Cc: [email protected]
> > ---
> > v1: unchanged
> > v2:
> > - Christian Brauner <[email protected]>:
> > - verify that close_range() correctly closes a single file descriptor
> > ---
> > tools/testing/selftests/Makefile | 1 +
> > tools/testing/selftests/core/.gitignore | 1 +
> > tools/testing/selftests/core/Makefile | 6 +
> > .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
> > 4 files changed, 150 insertions(+)
> > create mode 100644 tools/testing/selftests/core/.gitignore
> > create mode 100644 tools/testing/selftests/core/Makefile
> > create mode 100644 tools/testing/selftests/core/close_range_test.c
> >
> > diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
> > new file mode 100644
> > index 000000000000..6e6712ce5817
> > --- /dev/null
> > +++ b/tools/testing/selftests/core/.gitignore
> > @@ -0,0 +1 @@
> > +close_range_test
> > diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
> > new file mode 100644
> > index 000000000000..de3ae68aa345
> > --- /dev/null
> > +++ b/tools/testing/selftests/core/Makefile
> > @@ -0,0 +1,6 @@
> > +CFLAGS += -g -I../../../../usr/include/ -I../../../../include
>
> Your second -I pulls the unexported kernel headers in, userspace
> programs shouldn't include unexported kernel headers.
>
> It breaks the build on powerpc with eg:
>
> powerpc64le-linux-gnu-gcc -g -I../../../../usr/include/ -I../../../../include close_range_test.c -o /output/kselftest/core/close_range_test
> In file included from /usr/powerpc64le-linux-gnu/include/bits/fcntl-linux.h:346,
> from /usr/powerpc64le-linux-gnu/include/bits/fcntl.h:62,
> from /usr/powerpc64le-linux-gnu/include/fcntl.h:35,
> from close_range_test.c:5:
> ../../../../include/linux/falloc.h:13:2: error: unknown type name '__s16'
> __s16 l_type;
> ^~~~~
>
>
> Did you do that on purpose or just copy it from one of the other
> Makefiles? :)

I originally did that on purpose because checkpatch was yammering on
about me not having used ARRAY_SIZE(). But that include can go, you are
right.

Christian