2020-04-22 05:20:45

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v5 0/3] Support userspace-selected fds

5.8 material, not intended for 5.7. Now includes a patch for man-pages,
attached to this cover letter.

Inspired by the X protocol's handling of XIDs, allow userspace to select
the file descriptor opened by a call like openat2, so that it can use
the resulting file descriptor in subsequent system calls without waiting
for the response to the initial openat2 syscall.

The first patch is independent of the other two; it allows reserving
file descriptors below a certain minimum for userspace-selected fd
allocation only.

The second patch implements userspace-selected fd allocation for
openat2, introducing a new O_SPECIFIC_FD flag and an fd field in struct
open_how. In io_uring, this allows sequences like openat2/read/close
without waiting for the openat2 to complete. Multiple such sequences can
overlap, as long as each uses a distinct file descriptor.

The third patch adds userspace-selected fd allocation to pipe2 as well.
I did this partly as a demonstration of how simple it is to wire up
O_SPECIFIC_FD support for any fd-allocating system call, and partly in
the hopes that this may make it more useful to wire up io_uring support
for pipe2 in the future.

v5:

Rename padding field to __padding.
Add tests for non-zero __padding.
Include patch for man-pages.

v4:

Changed fd field to __u32.
Expanded and consolidated checks that return -EINVAL for invalid arguments.
Simplified and commented build_open_how.
Add documentation comment for fd field.
Add kselftests.

Thanks to Aleksa Sarai for feedback.

v3:

This new version has an API to atomically increase the minimum fd and
return the previous minimum, rather than just getting and setting the
minimum; this makes it easier to allocate a range. (A library that might
initialize after the program has already opened other file descriptors
may need to check for existing open fds in the range after reserving it,
and reserve more fds if needed; this can be done entirely in userspace,
and we can't really do anything simpler in the kernel due to limitations
on file-descriptor semantics, so this patch series avoids introducing
any extra complexity in the kernel.)

This new version also supports a __get_specific_unused_fd_flags call
which accepts the limit for RLIMIT_NOFILE as an argument, analogous to
__get_unused_fd_flags, since io_uring needs that to correctly handle
RLIMIT_NOFILE.

Thanks to Jens Axboe for review and feedback.

v2:

Version 2 was a version incorporated into a larger patch series from Jens Axboe
on io_uring.

Josh Triplett (3):
fs: Support setting a minimum fd for "lowest available fd" allocation
fs: openat2: Extend open_how to allow userspace-selected fds
fs: pipe2: Support O_SPECIFIC_FD

fs/fcntl.c | 2 +-
fs/file.c | 62 +++++++++++++++++--
fs/io_uring.c | 3 +-
fs/open.c | 8 ++-
fs/pipe.c | 16 +++--
include/linux/fcntl.h | 5 +-
include/linux/fdtable.h | 1 +
include/linux/file.h | 4 ++
include/uapi/asm-generic/fcntl.h | 4 ++
include/uapi/linux/openat2.h | 3 +
include/uapi/linux/prctl.h | 3 +
kernel/sys.c | 5 ++
tools/testing/selftests/openat2/helpers.c | 2 +-
tools/testing/selftests/openat2/helpers.h | 21 +++++--
.../testing/selftests/openat2/openat2_test.c | 35 ++++++++++-
15 files changed, 150 insertions(+), 24 deletions(-)

--
2.26.2


Attachments:
(No filename) (3.62 kB)
0001-openat2.2-pipe.2-prctl.2-Document-O_SPECIFIC_FD-and-.patch (5.79 kB)
Download all attachments

2020-04-22 05:21:47

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

Inspired by the X protocol's handling of XIDs, allow userspace to select
the file descriptor opened by openat2, so that it can use the resulting
file descriptor in subsequent system calls without waiting for the
response to openat2.

In io_uring, this allows sequences like openat2/read/close without
waiting for the openat2 to complete. Multiple such sequences can
overlap, as long as each uses a distinct file descriptor.

Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
by openat2 for now (ignored by open/openat like all unknown flags). Add
an fd field to struct open_how (along with appropriate padding, and
verify that the padding is 0 to allow replacing the padding with a field
in the future).

The file table has a corresponding new function
get_specific_unused_fd_flags, which gets the specified file descriptor
if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
to get_unused_fd_flags, to simplify callers.

The specified file descriptor must not already be open; if it is,
get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
userspace errors.

When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
specified file descriptor rather than finding the lowest available one.

Test program:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
struct open_how how = {
.flags = O_RDONLY | O_SPECIFIC_FD,
.fd = 42
};
int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
if (fd < 0)
err(1, "openat2");
printf("fd=%d\n", fd); // prints fd=42
return 0;
}

Signed-off-by: Josh Triplett <[email protected]>
---
fs/fcntl.c | 2 +-
fs/file.c | 39 +++++++++++++++++++
fs/io_uring.c | 3 +-
fs/open.c | 8 +++-
include/linux/fcntl.h | 5 ++-
include/linux/file.h | 3 ++
include/uapi/asm-generic/fcntl.h | 4 ++
include/uapi/linux/openat2.h | 3 ++
tools/testing/selftests/openat2/helpers.c | 2 +-
tools/testing/selftests/openat2/helpers.h | 21 +++++++---
.../testing/selftests/openat2/openat2_test.c | 35 ++++++++++++++++-
11 files changed, 111 insertions(+), 14 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..0357ad667563 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index ba06140d89af..0674c3a2d3a5 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)

EXPORT_SYMBOL(put_unused_fd);

+int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
+ unsigned long nofile)
+{
+ int ret;
+ struct fdtable *fdt;
+ struct files_struct *files = current->files;
+
+ if (!(flags & O_SPECIFIC_FD) || fd == UINT_MAX)
+ return __get_unused_fd_flags(flags, nofile);
+
+ if (fd >= nofile)
+ return -EBADF;
+
+ spin_lock(&files->file_lock);
+ ret = expand_files(files, fd);
+ if (unlikely(ret < 0))
+ goto out_unlock;
+ fdt = files_fdtable(files);
+ if (fdt->fd[fd]) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ __set_open_fd(fd, fdt);
+ if (flags & O_CLOEXEC)
+ __set_close_on_exec(fd, fdt);
+ else
+ __clear_close_on_exec(fd, fdt);
+ ret = fd;
+
+out_unlock:
+ spin_unlock(&files->file_lock);
+ return ret;
+}
+
+int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
+{
+ return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
+}
+
/*
* Install a file pointer in the fd array.
*
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 358f97be9c7b..4a69e1daf3fe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
if (ret)
goto err;

- ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
+ ret = __get_specific_unused_fd_flags(req->open.how.fd,
+ req->open.how.flags, req->open.nofile);
if (ret < 0)
goto err;

diff --git a/fs/open.c b/fs/open.c
index 719b320ede52..54b2782dfa7a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -962,6 +962,8 @@ inline struct open_how build_open_how(int flags, umode_t mode)
.mode = mode & S_IALLUGO,
};

+ /* O_SPECIFIC_FD doesn't work in open calls that use build_open_how. */
+ how.flags &= ~O_SPECIFIC_FD;
/* O_PATH beats everything else. */
if (how.flags & O_PATH)
how.flags &= O_PATH_FLAGS;
@@ -989,6 +991,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
return -EINVAL;
if (how->resolve & ~VALID_RESOLVE_FLAGS)
return -EINVAL;
+ if (how->__padding != 0)
+ return -EINVAL;
+ if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
+ return -EINVAL;

/* Deal with the mode. */
if (WILL_CREATE(flags)) {
@@ -1143,7 +1149,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
if (IS_ERR(tmp))
return PTR_ERR(tmp);

- fd = get_unused_fd_flags(how->flags);
+ fd = get_specific_unused_fd_flags(how->fd, how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..728849bcd8fa 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)

/* List of all valid flags for the how->upgrade_mask argument: */
#define VALID_UPGRADE_FLAGS \
@@ -23,7 +23,8 @@

/* List of all open_how "versions". */
#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
-#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1

#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/linux/file.h b/include/linux/file.h
index b67986f818d2..a63301864a36 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
extern bool get_close_on_exec(unsigned int fd);
extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
extern int get_unused_fd_flags(unsigned flags);
+extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
+ unsigned long nofile);
+extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
extern void put_unused_fd(unsigned int fd);
extern unsigned int increase_min_fd(unsigned int num);

diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..d3de5b8b3955 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
#define __O_TMPFILE 020000000
#endif

+#ifndef O_SPECIFIC_FD
+#define O_SPECIFIC_FD 01000000000 /* open as specified fd */
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index 58b1eb711360..dcbf9dc333b5 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -15,11 +15,14 @@
* @flags: O_* flags.
* @mode: O_CREAT/O_TMPFILE file mode.
* @resolve: RESOLVE_* flags.
+ * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
*/
struct open_how {
__u64 flags;
__u64 mode;
__u64 resolve;
+ __u32 fd;
+ __u32 __padding; /* Must be zeroed */
};

/* how->resolve flags for openat2(2). */
diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
index 5074681ffdc9..b6533f0b1124 100644
--- a/tools/testing/selftests/openat2/helpers.c
+++ b/tools/testing/selftests/openat2/helpers.c
@@ -98,7 +98,7 @@ void __attribute__((constructor)) init(void)
struct open_how how = {};
int fd;

- BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
+ BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER1);

/* Check openat2(2) support. */
fd = sys_openat2(AT_FDCWD, ".", &how);
diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
index a6ea27344db2..d38818033b45 100644
--- a/tools/testing/selftests/openat2/helpers.h
+++ b/tools/testing/selftests/openat2/helpers.h
@@ -24,28 +24,37 @@
#endif /* SYS_openat2 */

/*
- * Arguments for how openat2(2) should open the target path. If @resolve is
- * zero, then openat2(2) operates very similarly to openat(2).
+ * Arguments for how openat2(2) should open the target path. If only @flags and
+ * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
*
- * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
- * than being silently ignored. @mode must be zero unless one of {O_CREAT,
- * O_TMPFILE} are set.
+ * However, unlike openat(2), unknown or invalid bits in @flags result in
+ * -EINVAL rather than being silently ignored. @mode must be zero unless one of
+ * {O_CREAT, O_TMPFILE} are set.
*
* @flags: O_* flags.
* @mode: O_CREAT/O_TMPFILE file mode.
* @resolve: RESOLVE_* flags.
+ * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
*/
struct open_how {
__u64 flags;
__u64 mode;
__u64 resolve;
+ __u32 fd;
+ __u32 __padding; /* Must be zeroed */
};

+/* List of all open_how "versions". */
#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
-#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1

bool needs_openat2(const struct open_how *how);

+#ifndef O_SPECIFIC_FD
+#define O_SPECIFIC_FD 01000000000
+#endif
+
#ifndef RESOLVE_IN_ROOT
/* how->resolve flags for openat2(2). */
#define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
index b386367c606b..f999b4bb0dc2 100644
--- a/tools/testing/selftests/openat2/openat2_test.c
+++ b/tools/testing/selftests/openat2/openat2_test.c
@@ -40,7 +40,7 @@ struct struct_test {
int err;
};

-#define NUM_OPENAT2_STRUCT_TESTS 7
+#define NUM_OPENAT2_STRUCT_TESTS 8
#define NUM_OPENAT2_STRUCT_VARIATIONS 13

void test_openat2_struct(void)
@@ -52,6 +52,9 @@ void test_openat2_struct(void)
{ .name = "normal struct",
.arg.inner.flags = O_RDONLY,
.size = sizeof(struct open_how) },
+ { .name = "v0 struct",
+ .arg.inner.flags = O_RDONLY,
+ .size = OPEN_HOW_SIZE_VER0 },
/* Bigger struct, with zeroed out end. */
{ .name = "bigger struct (zeroed out)",
.arg.inner.flags = O_RDONLY,
@@ -155,7 +158,7 @@ struct flag_test {
int err;
};

-#define NUM_OPENAT2_FLAG_TESTS 23
+#define NUM_OPENAT2_FLAG_TESTS 31

void test_openat2_flags(void)
{
@@ -223,6 +226,30 @@ void test_openat2_flags(void)
{ .name = "invalid how.resolve and O_PATH",
.how.flags = O_PATH,
.how.resolve = 0x1337, .err = -EINVAL },
+
+ /* O_SPECIFIC_FD tests */
+ { .name = "O_SPECIFIC_FD",
+ .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42 },
+ { .name = "O_SPECIFIC_FD if fd exists",
+ .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 2,
+ .err = -EBUSY },
+ { .name = "O_SPECIFIC_FD with fd -1",
+ .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = -1 },
+ { .name = "fd without O_SPECIFIC_FD",
+ .how.flags = O_RDONLY, .how.fd = 42,
+ .err = -EINVAL },
+ { .name = "fd -1 without O_SPECIFIC_FD",
+ .how.flags = O_RDONLY, .how.fd = -1,
+ .err = -EINVAL },
+ { .name = "existing fd without O_SPECIFIC_FD",
+ .how.flags = O_RDONLY, .how.fd = 2,
+ .err = -EINVAL },
+ { .name = "Non-zero padding with O_SPECIFIC_FD",
+ .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42,
+ .how.__padding = 42, .err = -EINVAL },
+ { .name = "Non-zero padding without O_SPECIFIC_FD",
+ .how.flags = O_RDONLY,
+ .how.__padding = 42, .err = -EINVAL },
};

BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
@@ -268,6 +295,10 @@ void test_openat2_flags(void)
if (!(test->how.flags & O_LARGEFILE))
fdflags &= ~O_LARGEFILE;
failed |= (fdflags != test->how.flags);
+
+ if (test->how.flags & O_SPECIFIC_FD
+ && test->how.fd != -1)
+ failed |= (fd != test->how.fd);
}

if (failed) {
--
2.26.2

2020-04-22 05:23:23

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v5 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

Some applications want to prevent the usual "lowest available fd"
allocation from allocating certain file descriptors. For instance, they
may want to prevent allocation of a closed fd 0, 1, or 2 other than via
dup2/dup3, or reserve some low file descriptors for other purposes.

Add a prctl to increase the minimum fd and return the previous minimum.

System calls that allocate a specific file descriptor, such as
dup2/dup3, ignore this minimum.

exec resets the minimum fd, to prevent one program from interfering with
another program's expectations about fd allocation.

Test program:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
if (prctl(PR_INCREASE_MIN_FD, 100, 0, 0, 0) < 0)
err(1, "prctl");
int fd = open("/dev/null", O_RDONLY);
if (fd < 0)
err(1, "open");
printf("%d\n", fd); // prints 100
return 0;
}

Signed-off-by: Josh Triplett <[email protected]>
---
fs/file.c | 23 +++++++++++++++++------
include/linux/fdtable.h | 1 +
include/linux/file.h | 1 +
include/uapi/linux/prctl.h | 3 +++
kernel/sys.c | 5 +++++
5 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..ba06140d89af 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -286,7 +286,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
spin_lock_init(&newf->file_lock);
newf->resize_in_progress = false;
init_waitqueue_head(&newf->resize_wait);
- newf->next_fd = 0;
new_fdt = &newf->fdtab;
new_fdt->max_fds = NR_OPEN_DEFAULT;
new_fdt->close_on_exec = newf->close_on_exec_init;
@@ -295,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
new_fdt->fd = &newf->fd_array[0];

spin_lock(&oldf->file_lock);
+ newf->next_fd = newf->min_fd = oldf->min_fd;
old_fdt = files_fdtable(oldf);
open_files = count_open_files(old_fdt);

@@ -487,9 +487,7 @@ int __alloc_fd(struct files_struct *files,
spin_lock(&files->file_lock);
repeat:
fdt = files_fdtable(files);
- fd = start;
- if (fd < files->next_fd)
- fd = files->next_fd;
+ fd = max3(start, files->min_fd, files->next_fd);

if (fd < fdt->max_fds)
fd = find_next_fd(fdt, fd);
@@ -514,7 +512,7 @@ int __alloc_fd(struct files_struct *files,
goto repeat;

if (start <= files->next_fd)
- files->next_fd = fd + 1;
+ files->next_fd = max(fd + 1, files->min_fd);

__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
@@ -555,7 +553,7 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = files_fdtable(files);
__clear_open_fd(fd, fdt);
- if (fd < files->next_fd)
+ if (fd < files->next_fd && fd >= files->min_fd)
files->next_fd = fd;
}

@@ -684,6 +682,7 @@ void do_close_on_exec(struct files_struct *files)

/* exec unshares first */
spin_lock(&files->file_lock);
+ files->min_fd = 0;
for (i = 0; ; i++) {
unsigned long set;
unsigned fd = i * BITS_PER_LONG;
@@ -865,6 +864,18 @@ bool get_close_on_exec(unsigned int fd)
return res;
}

+unsigned int increase_min_fd(unsigned int num)
+{
+ struct files_struct *files = current->files;
+ unsigned int old_min_fd;
+
+ spin_lock(&files->file_lock);
+ old_min_fd = files->min_fd;
+ files->min_fd += num;
+ spin_unlock(&files->file_lock);
+ return old_min_fd;
+}
+
static int do_dup2(struct files_struct *files,
struct file *file, unsigned fd, unsigned flags)
__releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..d1980443d8b3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -60,6 +60,7 @@ struct files_struct {
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd;
+ unsigned int min_fd; /* min for "lowest available fd" allocation */
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..b67986f818d2 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -88,6 +88,7 @@ extern bool get_close_on_exec(unsigned int fd);
extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
extern int get_unused_fd_flags(unsigned flags);
extern void put_unused_fd(unsigned int fd);
+extern unsigned int increase_min_fd(unsigned int num);

extern void fd_install(unsigned int fd, struct file *file);

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..916327272d21 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Increase minimum file descriptor for "lowest available fd" allocation */
+#define PR_INCREASE_MIN_FD 59
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..daa0ce43cecc 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2514,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+ case PR_INCREASE_MIN_FD:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = increase_min_fd((unsigned int)arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.26.2

2020-04-22 05:24:04

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v5 3/3] fs: pipe2: Support O_SPECIFIC_FD

This allows the caller of pipe2 to specify one or both file descriptors
rather than having them automatically use the lowest available file
descriptor. The caller can specify either file descriptor as -1 to
allow that file descriptor to use the lowest available.

Signed-off-by: Josh Triplett <[email protected]>
---
fs/pipe.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 16fb72e9abf7..4681a0d1d587 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -936,19 +936,19 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
int error;
int fdw, fdr;

- if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT))
+ if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_SPECIFIC_FD))
return -EINVAL;

error = create_pipe_files(files, flags);
if (error)
return error;

- error = get_unused_fd_flags(flags);
+ error = get_specific_unused_fd_flags(fd[0], flags);
if (error < 0)
goto err_read_pipe;
fdr = error;

- error = get_unused_fd_flags(flags);
+ error = get_specific_unused_fd_flags(fd[1], flags);
if (error < 0)
goto err_fdr;
fdw = error;
@@ -969,7 +969,11 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
int do_pipe_flags(int *fd, int flags)
{
struct file *files[2];
- int error = __do_pipe_flags(fd, files, flags);
+ int error;
+
+ if (flags & O_SPECIFIC_FD)
+ return -EINVAL;
+ error = __do_pipe_flags(fd, files, flags);
if (!error) {
fd_install(fd[0], files[0]);
fd_install(fd[1], files[1]);
@@ -987,6 +991,10 @@ static int do_pipe2(int __user *fildes, int flags)
int fd[2];
int error;

+ if (flags & O_SPECIFIC_FD)
+ if (copy_from_user(fd, fildes, sizeof(fd)))
+ return -EFAULT;
+
error = __do_pipe_flags(fd, files, flags);
if (!error) {
if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
--
2.26.2

Subject: Re: [PATCH v5 0/3] Support userspace-selected fds

[CC += linux-api]

On Wed, 22 Apr 2020 at 07:19, Josh Triplett <[email protected]> wrote:
>
> 5.8 material, not intended for 5.7. Now includes a patch for man-pages,
> attached to this cover letter.
>
> Inspired by the X protocol's handling of XIDs, allow userspace to select
> the file descriptor opened by a call like openat2, so that it can use
> the resulting file descriptor in subsequent system calls without waiting
> for the response to the initial openat2 syscall.
>
> The first patch is independent of the other two; it allows reserving
> file descriptors below a certain minimum for userspace-selected fd
> allocation only.
>
> The second patch implements userspace-selected fd allocation for
> openat2, introducing a new O_SPECIFIC_FD flag and an fd field in struct
> open_how. In io_uring, this allows sequences like openat2/read/close
> without waiting for the openat2 to complete. Multiple such sequences can
> overlap, as long as each uses a distinct file descriptor.
>
> The third patch adds userspace-selected fd allocation to pipe2 as well.
> I did this partly as a demonstration of how simple it is to wire up
> O_SPECIFIC_FD support for any fd-allocating system call, and partly in
> the hopes that this may make it more useful to wire up io_uring support
> for pipe2 in the future.
>
> v5:
>
> Rename padding field to __padding.
> Add tests for non-zero __padding.
> Include patch for man-pages.
>
> v4:
>
> Changed fd field to __u32.
> Expanded and consolidated checks that return -EINVAL for invalid arguments.
> Simplified and commented build_open_how.
> Add documentation comment for fd field.
> Add kselftests.
>
> Thanks to Aleksa Sarai for feedback.
>
> v3:
>
> This new version has an API to atomically increase the minimum fd and
> return the previous minimum, rather than just getting and setting the
> minimum; this makes it easier to allocate a range. (A library that might
> initialize after the program has already opened other file descriptors
> may need to check for existing open fds in the range after reserving it,
> and reserve more fds if needed; this can be done entirely in userspace,
> and we can't really do anything simpler in the kernel due to limitations
> on file-descriptor semantics, so this patch series avoids introducing
> any extra complexity in the kernel.)
>
> This new version also supports a __get_specific_unused_fd_flags call
> which accepts the limit for RLIMIT_NOFILE as an argument, analogous to
> __get_unused_fd_flags, since io_uring needs that to correctly handle
> RLIMIT_NOFILE.
>
> Thanks to Jens Axboe for review and feedback.
>
> v2:
>
> Version 2 was a version incorporated into a larger patch series from Jens Axboe
> on io_uring.
>
> Josh Triplett (3):
> fs: Support setting a minimum fd for "lowest available fd" allocation
> fs: openat2: Extend open_how to allow userspace-selected fds
> fs: pipe2: Support O_SPECIFIC_FD
>
> fs/fcntl.c | 2 +-
> fs/file.c | 62 +++++++++++++++++--
> fs/io_uring.c | 3 +-
> fs/open.c | 8 ++-
> fs/pipe.c | 16 +++--
> include/linux/fcntl.h | 5 +-
> include/linux/fdtable.h | 1 +
> include/linux/file.h | 4 ++
> include/uapi/asm-generic/fcntl.h | 4 ++
> include/uapi/linux/openat2.h | 3 +
> include/uapi/linux/prctl.h | 3 +
> kernel/sys.c | 5 ++
> tools/testing/selftests/openat2/helpers.c | 2 +-
> tools/testing/selftests/openat2/helpers.h | 21 +++++--
> .../testing/selftests/openat2/openat2_test.c | 35 ++++++++++-
> 15 files changed, 150 insertions(+), 24 deletions(-)
>
> --
> 2.26.2
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH v5 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

[CC += linux-api]

On Wed, 22 Apr 2020 at 07:19, Josh Triplett <[email protected]> wrote:
>
> Some applications want to prevent the usual "lowest available fd"
> allocation from allocating certain file descriptors. For instance, they
> may want to prevent allocation of a closed fd 0, 1, or 2 other than via
> dup2/dup3, or reserve some low file descriptors for other purposes.
>
> Add a prctl to increase the minimum fd and return the previous minimum.
>
> System calls that allocate a specific file descriptor, such as
> dup2/dup3, ignore this minimum.
>
> exec resets the minimum fd, to prevent one program from interfering with
> another program's expectations about fd allocation.
>
> Test program:
>
> #include <err.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/prctl.h>
>
> int main(int argc, char *argv[])
> {
> if (prctl(PR_INCREASE_MIN_FD, 100, 0, 0, 0) < 0)
> err(1, "prctl");
> int fd = open("/dev/null", O_RDONLY);
> if (fd < 0)
> err(1, "open");
> printf("%d\n", fd); // prints 100
> return 0;
> }
>
> Signed-off-by: Josh Triplett <[email protected]>
> ---
> fs/file.c | 23 +++++++++++++++++------
> include/linux/fdtable.h | 1 +
> include/linux/file.h | 1 +
> include/uapi/linux/prctl.h | 3 +++
> kernel/sys.c | 5 +++++
> 5 files changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index c8a4e4c86e55..ba06140d89af 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -286,7 +286,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
> spin_lock_init(&newf->file_lock);
> newf->resize_in_progress = false;
> init_waitqueue_head(&newf->resize_wait);
> - newf->next_fd = 0;
> new_fdt = &newf->fdtab;
> new_fdt->max_fds = NR_OPEN_DEFAULT;
> new_fdt->close_on_exec = newf->close_on_exec_init;
> @@ -295,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
> new_fdt->fd = &newf->fd_array[0];
>
> spin_lock(&oldf->file_lock);
> + newf->next_fd = newf->min_fd = oldf->min_fd;
> old_fdt = files_fdtable(oldf);
> open_files = count_open_files(old_fdt);
>
> @@ -487,9 +487,7 @@ int __alloc_fd(struct files_struct *files,
> spin_lock(&files->file_lock);
> repeat:
> fdt = files_fdtable(files);
> - fd = start;
> - if (fd < files->next_fd)
> - fd = files->next_fd;
> + fd = max3(start, files->min_fd, files->next_fd);
>
> if (fd < fdt->max_fds)
> fd = find_next_fd(fdt, fd);
> @@ -514,7 +512,7 @@ int __alloc_fd(struct files_struct *files,
> goto repeat;
>
> if (start <= files->next_fd)
> - files->next_fd = fd + 1;
> + files->next_fd = max(fd + 1, files->min_fd);
>
> __set_open_fd(fd, fdt);
> if (flags & O_CLOEXEC)
> @@ -555,7 +553,7 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> {
> struct fdtable *fdt = files_fdtable(files);
> __clear_open_fd(fd, fdt);
> - if (fd < files->next_fd)
> + if (fd < files->next_fd && fd >= files->min_fd)
> files->next_fd = fd;
> }
>
> @@ -684,6 +682,7 @@ void do_close_on_exec(struct files_struct *files)
>
> /* exec unshares first */
> spin_lock(&files->file_lock);
> + files->min_fd = 0;
> for (i = 0; ; i++) {
> unsigned long set;
> unsigned fd = i * BITS_PER_LONG;
> @@ -865,6 +864,18 @@ bool get_close_on_exec(unsigned int fd)
> return res;
> }
>
> +unsigned int increase_min_fd(unsigned int num)
> +{
> + struct files_struct *files = current->files;
> + unsigned int old_min_fd;
> +
> + spin_lock(&files->file_lock);
> + old_min_fd = files->min_fd;
> + files->min_fd += num;
> + spin_unlock(&files->file_lock);
> + return old_min_fd;
> +}
> +
> static int do_dup2(struct files_struct *files,
> struct file *file, unsigned fd, unsigned flags)
> __releases(&files->file_lock)
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index f07c55ea0c22..d1980443d8b3 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -60,6 +60,7 @@ struct files_struct {
> */
> spinlock_t file_lock ____cacheline_aligned_in_smp;
> unsigned int next_fd;
> + unsigned int min_fd; /* min for "lowest available fd" allocation */
> unsigned long close_on_exec_init[1];
> unsigned long open_fds_init[1];
> unsigned long full_fds_bits_init[1];
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 142d102f285e..b67986f818d2 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -88,6 +88,7 @@ extern bool get_close_on_exec(unsigned int fd);
> extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
> extern int get_unused_fd_flags(unsigned flags);
> extern void put_unused_fd(unsigned int fd);
> +extern unsigned int increase_min_fd(unsigned int num);
>
> extern void fd_install(unsigned int fd, struct file *file);
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..916327272d21 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,7 @@ struct prctl_mm_map {
> #define PR_SET_IO_FLUSHER 57
> #define PR_GET_IO_FLUSHER 58
>
> +/* Increase minimum file descriptor for "lowest available fd" allocation */
> +#define PR_INCREASE_MIN_FD 59
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index d325f3ab624a..daa0ce43cecc 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2514,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> break;
> + case PR_INCREASE_MIN_FD:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = increase_min_fd((unsigned int)arg2);
> + break;
> default:
> error = -EINVAL;
> break;
> --
> 2.26.2
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

[CC += linux-api]

On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
>
> Inspired by the X protocol's handling of XIDs, allow userspace to select
> the file descriptor opened by openat2, so that it can use the resulting
> file descriptor in subsequent system calls without waiting for the
> response to openat2.
>
> In io_uring, this allows sequences like openat2/read/close without
> waiting for the openat2 to complete. Multiple such sequences can
> overlap, as long as each uses a distinct file descriptor.
>
> Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
> by openat2 for now (ignored by open/openat like all unknown flags). Add
> an fd field to struct open_how (along with appropriate padding, and
> verify that the padding is 0 to allow replacing the padding with a field
> in the future).
>
> The file table has a corresponding new function
> get_specific_unused_fd_flags, which gets the specified file descriptor
> if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
> to get_unused_fd_flags, to simplify callers.
>
> The specified file descriptor must not already be open; if it is,
> get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
> userspace errors.
>
> When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
> specified file descriptor rather than finding the lowest available one.
>
> Test program:
>
> #include <err.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
> struct open_how how = {
> .flags = O_RDONLY | O_SPECIFIC_FD,
> .fd = 42
> };
> int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
> if (fd < 0)
> err(1, "openat2");
> printf("fd=%d\n", fd); // prints fd=42
> return 0;
> }
>
> Signed-off-by: Josh Triplett <[email protected]>
> ---
> fs/fcntl.c | 2 +-
> fs/file.c | 39 +++++++++++++++++++
> fs/io_uring.c | 3 +-
> fs/open.c | 8 +++-
> include/linux/fcntl.h | 5 ++-
> include/linux/file.h | 3 ++
> include/uapi/asm-generic/fcntl.h | 4 ++
> include/uapi/linux/openat2.h | 3 ++
> tools/testing/selftests/openat2/helpers.c | 2 +-
> tools/testing/selftests/openat2/helpers.h | 21 +++++++---
> .../testing/selftests/openat2/openat2_test.c | 35 ++++++++++++++++-
> 11 files changed, 111 insertions(+), 14 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 2e4c0fa2074b..0357ad667563 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> __FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/file.c b/fs/file.c
> index ba06140d89af..0674c3a2d3a5 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)
>
> EXPORT_SYMBOL(put_unused_fd);
>
> +int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> + unsigned long nofile)
> +{
> + int ret;
> + struct fdtable *fdt;
> + struct files_struct *files = current->files;
> +
> + if (!(flags & O_SPECIFIC_FD) || fd == UINT_MAX)
> + return __get_unused_fd_flags(flags, nofile);
> +
> + if (fd >= nofile)
> + return -EBADF;
> +
> + spin_lock(&files->file_lock);
> + ret = expand_files(files, fd);
> + if (unlikely(ret < 0))
> + goto out_unlock;
> + fdt = files_fdtable(files);
> + if (fdt->fd[fd]) {
> + ret = -EBUSY;
> + goto out_unlock;
> + }
> + __set_open_fd(fd, fdt);
> + if (flags & O_CLOEXEC)
> + __set_close_on_exec(fd, fdt);
> + else
> + __clear_close_on_exec(fd, fdt);
> + ret = fd;
> +
> +out_unlock:
> + spin_unlock(&files->file_lock);
> + return ret;
> +}
> +
> +int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
> +{
> + return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
> +}
> +
> /*
> * Install a file pointer in the fd array.
> *
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 358f97be9c7b..4a69e1daf3fe 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
> if (ret)
> goto err;
>
> - ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
> + ret = __get_specific_unused_fd_flags(req->open.how.fd,
> + req->open.how.flags, req->open.nofile);
> if (ret < 0)
> goto err;
>
> diff --git a/fs/open.c b/fs/open.c
> index 719b320ede52..54b2782dfa7a 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -962,6 +962,8 @@ inline struct open_how build_open_how(int flags, umode_t mode)
> .mode = mode & S_IALLUGO,
> };
>
> + /* O_SPECIFIC_FD doesn't work in open calls that use build_open_how. */
> + how.flags &= ~O_SPECIFIC_FD;
> /* O_PATH beats everything else. */
> if (how.flags & O_PATH)
> how.flags &= O_PATH_FLAGS;
> @@ -989,6 +991,10 @@ inline int build_open_flags(const struct open_how *how, struct open_flags *op)
> return -EINVAL;
> if (how->resolve & ~VALID_RESOLVE_FLAGS)
> return -EINVAL;
> + if (how->__padding != 0)
> + return -EINVAL;
> + if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
> + return -EINVAL;
>
> /* Deal with the mode. */
> if (WILL_CREATE(flags)) {
> @@ -1143,7 +1149,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
> if (IS_ERR(tmp))
> return PTR_ERR(tmp);
>
> - fd = get_unused_fd_flags(how->flags);
> + fd = get_specific_unused_fd_flags(how->fd, how->flags);
> if (fd >= 0) {
> struct file *f = do_filp_open(dfd, tmp, &op);
> if (IS_ERR(f)) {
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 7bcdcf4f6ab2..728849bcd8fa 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -10,7 +10,7 @@
> (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)
>
> /* List of all valid flags for the how->upgrade_mask argument: */
> #define VALID_UPGRADE_FLAGS \
> @@ -23,7 +23,8 @@
>
> /* List of all open_how "versions". */
> #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
> -#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
> +#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
> +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1
>
> #ifndef force_o_largefile
> #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> diff --git a/include/linux/file.h b/include/linux/file.h
> index b67986f818d2..a63301864a36 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
> extern bool get_close_on_exec(unsigned int fd);
> extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
> extern int get_unused_fd_flags(unsigned flags);
> +extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> + unsigned long nofile);
> +extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
> extern void put_unused_fd(unsigned int fd);
> extern unsigned int increase_min_fd(unsigned int num);
>
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..d3de5b8b3955 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -89,6 +89,10 @@
> #define __O_TMPFILE 020000000
> #endif
>
> +#ifndef O_SPECIFIC_FD
> +#define O_SPECIFIC_FD 01000000000 /* open as specified fd */
> +#endif
> +
> /* a horrid kludge trying to make sure that this will fail on old kernels */
> #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
> #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> index 58b1eb711360..dcbf9dc333b5 100644
> --- a/include/uapi/linux/openat2.h
> +++ b/include/uapi/linux/openat2.h
> @@ -15,11 +15,14 @@
> * @flags: O_* flags.
> * @mode: O_CREAT/O_TMPFILE file mode.
> * @resolve: RESOLVE_* flags.
> + * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
> */
> struct open_how {
> __u64 flags;
> __u64 mode;
> __u64 resolve;
> + __u32 fd;
> + __u32 __padding; /* Must be zeroed */
> };
>
> /* how->resolve flags for openat2(2). */
> diff --git a/tools/testing/selftests/openat2/helpers.c b/tools/testing/selftests/openat2/helpers.c
> index 5074681ffdc9..b6533f0b1124 100644
> --- a/tools/testing/selftests/openat2/helpers.c
> +++ b/tools/testing/selftests/openat2/helpers.c
> @@ -98,7 +98,7 @@ void __attribute__((constructor)) init(void)
> struct open_how how = {};
> int fd;
>
> - BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER0);
> + BUILD_BUG_ON(sizeof(struct open_how) != OPEN_HOW_SIZE_VER1);
>
> /* Check openat2(2) support. */
> fd = sys_openat2(AT_FDCWD, ".", &how);
> diff --git a/tools/testing/selftests/openat2/helpers.h b/tools/testing/selftests/openat2/helpers.h
> index a6ea27344db2..d38818033b45 100644
> --- a/tools/testing/selftests/openat2/helpers.h
> +++ b/tools/testing/selftests/openat2/helpers.h
> @@ -24,28 +24,37 @@
> #endif /* SYS_openat2 */
>
> /*
> - * Arguments for how openat2(2) should open the target path. If @resolve is
> - * zero, then openat2(2) operates very similarly to openat(2).
> + * Arguments for how openat2(2) should open the target path. If only @flags and
> + * @mode are non-zero, then openat2(2) operates very similarly to openat(2).
> *
> - * However, unlike openat(2), unknown bits in @flags result in -EINVAL rather
> - * than being silently ignored. @mode must be zero unless one of {O_CREAT,
> - * O_TMPFILE} are set.
> + * However, unlike openat(2), unknown or invalid bits in @flags result in
> + * -EINVAL rather than being silently ignored. @mode must be zero unless one of
> + * {O_CREAT, O_TMPFILE} are set.
> *
> * @flags: O_* flags.
> * @mode: O_CREAT/O_TMPFILE file mode.
> * @resolve: RESOLVE_* flags.
> + * @fd: Specific file descriptor to use, for O_SPECIFIC_FD.
> */
> struct open_how {
> __u64 flags;
> __u64 mode;
> __u64 resolve;
> + __u32 fd;
> + __u32 __padding; /* Must be zeroed */
> };
>
> +/* List of all open_how "versions". */
> #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
> -#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
> +#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
> +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1
>
> bool needs_openat2(const struct open_how *how);
>
> +#ifndef O_SPECIFIC_FD
> +#define O_SPECIFIC_FD 01000000000
> +#endif
> +
> #ifndef RESOLVE_IN_ROOT
> /* how->resolve flags for openat2(2). */
> #define RESOLVE_NO_XDEV 0x01 /* Block mount-point crossings
> diff --git a/tools/testing/selftests/openat2/openat2_test.c b/tools/testing/selftests/openat2/openat2_test.c
> index b386367c606b..f999b4bb0dc2 100644
> --- a/tools/testing/selftests/openat2/openat2_test.c
> +++ b/tools/testing/selftests/openat2/openat2_test.c
> @@ -40,7 +40,7 @@ struct struct_test {
> int err;
> };
>
> -#define NUM_OPENAT2_STRUCT_TESTS 7
> +#define NUM_OPENAT2_STRUCT_TESTS 8
> #define NUM_OPENAT2_STRUCT_VARIATIONS 13
>
> void test_openat2_struct(void)
> @@ -52,6 +52,9 @@ void test_openat2_struct(void)
> { .name = "normal struct",
> .arg.inner.flags = O_RDONLY,
> .size = sizeof(struct open_how) },
> + { .name = "v0 struct",
> + .arg.inner.flags = O_RDONLY,
> + .size = OPEN_HOW_SIZE_VER0 },
> /* Bigger struct, with zeroed out end. */
> { .name = "bigger struct (zeroed out)",
> .arg.inner.flags = O_RDONLY,
> @@ -155,7 +158,7 @@ struct flag_test {
> int err;
> };
>
> -#define NUM_OPENAT2_FLAG_TESTS 23
> +#define NUM_OPENAT2_FLAG_TESTS 31
>
> void test_openat2_flags(void)
> {
> @@ -223,6 +226,30 @@ void test_openat2_flags(void)
> { .name = "invalid how.resolve and O_PATH",
> .how.flags = O_PATH,
> .how.resolve = 0x1337, .err = -EINVAL },
> +
> + /* O_SPECIFIC_FD tests */
> + { .name = "O_SPECIFIC_FD",
> + .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42 },
> + { .name = "O_SPECIFIC_FD if fd exists",
> + .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 2,
> + .err = -EBUSY },
> + { .name = "O_SPECIFIC_FD with fd -1",
> + .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = -1 },
> + { .name = "fd without O_SPECIFIC_FD",
> + .how.flags = O_RDONLY, .how.fd = 42,
> + .err = -EINVAL },
> + { .name = "fd -1 without O_SPECIFIC_FD",
> + .how.flags = O_RDONLY, .how.fd = -1,
> + .err = -EINVAL },
> + { .name = "existing fd without O_SPECIFIC_FD",
> + .how.flags = O_RDONLY, .how.fd = 2,
> + .err = -EINVAL },
> + { .name = "Non-zero padding with O_SPECIFIC_FD",
> + .how.flags = O_RDONLY | O_SPECIFIC_FD, .how.fd = 42,
> + .how.__padding = 42, .err = -EINVAL },
> + { .name = "Non-zero padding without O_SPECIFIC_FD",
> + .how.flags = O_RDONLY,
> + .how.__padding = 42, .err = -EINVAL },
> };
>
> BUILD_BUG_ON(ARRAY_LEN(tests) != NUM_OPENAT2_FLAG_TESTS);
> @@ -268,6 +295,10 @@ void test_openat2_flags(void)
> if (!(test->how.flags & O_LARGEFILE))
> fdflags &= ~O_LARGEFILE;
> failed |= (fdflags != test->how.flags);
> +
> + if (test->how.flags & O_SPECIFIC_FD
> + && test->how.fd != -1)
> + failed |= (fd != test->how.fd);
> }
>
> if (failed) {
> --
> 2.26.2
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

Subject: Re: [PATCH v5 3/3] fs: pipe2: Support O_SPECIFIC_FD

[CC += linux-api]

On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
>
> This allows the caller of pipe2 to specify one or both file descriptors
> rather than having them automatically use the lowest available file
> descriptor. The caller can specify either file descriptor as -1 to
> allow that file descriptor to use the lowest available.
>
> Signed-off-by: Josh Triplett <[email protected]>
> ---
> fs/pipe.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 16fb72e9abf7..4681a0d1d587 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -936,19 +936,19 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> int error;
> int fdw, fdr;
>
> - if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT))
> + if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_SPECIFIC_FD))
> return -EINVAL;
>
> error = create_pipe_files(files, flags);
> if (error)
> return error;
>
> - error = get_unused_fd_flags(flags);
> + error = get_specific_unused_fd_flags(fd[0], flags);
> if (error < 0)
> goto err_read_pipe;
> fdr = error;
>
> - error = get_unused_fd_flags(flags);
> + error = get_specific_unused_fd_flags(fd[1], flags);
> if (error < 0)
> goto err_fdr;
> fdw = error;
> @@ -969,7 +969,11 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> int do_pipe_flags(int *fd, int flags)
> {
> struct file *files[2];
> - int error = __do_pipe_flags(fd, files, flags);
> + int error;
> +
> + if (flags & O_SPECIFIC_FD)
> + return -EINVAL;
> + error = __do_pipe_flags(fd, files, flags);
> if (!error) {
> fd_install(fd[0], files[0]);
> fd_install(fd[1], files[1]);
> @@ -987,6 +991,10 @@ static int do_pipe2(int __user *fildes, int flags)
> int fd[2];
> int error;
>
> + if (flags & O_SPECIFIC_FD)
> + if (copy_from_user(fd, fildes, sizeof(fd)))
> + return -EFAULT;
> +
> error = __do_pipe_flags(fd, files, flags);
> if (!error) {
> if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
> --
> 2.26.2
>


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2020-04-22 07:59:41

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Wed, Apr 22, 2020 at 8:06 AM Michael Kerrisk (man-pages)
<[email protected]> wrote:
>
> [CC += linux-api]
>
> On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
> >
> > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > the file descriptor opened by openat2, so that it can use the resulting
> > file descriptor in subsequent system calls without waiting for the
> > response to openat2.
> >
> > In io_uring, this allows sequences like openat2/read/close without
> > waiting for the openat2 to complete. Multiple such sequences can
> > overlap, as long as each uses a distinct file descriptor.

If this is primarily an io_uring feature, then why burden the normal
openat2 API with this?

Add this flag to the io_uring API, by all means.

This would also allow Implementing a private fd table for io_uring.
I.e. add a flag interpreted by file ops (IORING_PRIVATE_FD), including
openat2 and freely use the private fd space without having to worry
about interactions with other parts of the system.

Thanks,
Miklos

2020-04-22 15:46:58

by Florian Weimer

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fs: pipe2: Support O_SPECIFIC_FD

* Josh Triplett:

> This allows the caller of pipe2 to specify one or both file descriptors
> rather than having them automatically use the lowest available file
> descriptor. The caller can specify either file descriptor as -1 to
> allow that file descriptor to use the lowest available.
>
> Signed-off-by: Josh Triplett <[email protected]>
> ---
> fs/pipe.c | 16 ++++++++++++----
> 1 file changed, 12 insertions(+), 4 deletions(-)
>
> diff --git a/fs/pipe.c b/fs/pipe.c
> index 16fb72e9abf7..4681a0d1d587 100644
> --- a/fs/pipe.c
> +++ b/fs/pipe.c
> @@ -936,19 +936,19 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> int error;
> int fdw, fdr;
>
> - if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT))
> + if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_SPECIFIC_FD))
> return -EINVAL;
>
> error = create_pipe_files(files, flags);
> if (error)
> return error;
>
> - error = get_unused_fd_flags(flags);
> + error = get_specific_unused_fd_flags(fd[0], flags);
> if (error < 0)
> goto err_read_pipe;
> fdr = error;
>
> - error = get_unused_fd_flags(flags);
> + error = get_specific_unused_fd_flags(fd[1], flags);
> if (error < 0)
> goto err_fdr;
> fdw = error;
> @@ -969,7 +969,11 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> int do_pipe_flags(int *fd, int flags)
> {
> struct file *files[2];
> - int error = __do_pipe_flags(fd, files, flags);
> + int error;
> +
> + if (flags & O_SPECIFIC_FD)
> + return -EINVAL;
> + error = __do_pipe_flags(fd, files, flags);
> if (!error) {
> fd_install(fd[0], files[0]);
> fd_install(fd[1], files[1]);
> @@ -987,6 +991,10 @@ static int do_pipe2(int __user *fildes, int flags)
> int fd[2];
> int error;
>
> + if (flags & O_SPECIFIC_FD)
> + if (copy_from_user(fd, fildes, sizeof(fd)))
> + return -EFAULT;
> +
> error = __do_pipe_flags(fd, files, flags);
> if (!error) {
> if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {

Mark, I think this will need (or at least benefit from) some valgrind
changes.

2020-04-23 00:47:52

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v5 3/3] fs: pipe2: Support O_SPECIFIC_FD

On Wed, Apr 22, 2020 at 05:44:38PM +0200, Florian Weimer wrote:
> * Josh Triplett:
> > This allows the caller of pipe2 to specify one or both file descriptors
> > rather than having them automatically use the lowest available file
> > descriptor. The caller can specify either file descriptor as -1 to
> > allow that file descriptor to use the lowest available.
> >
> > Signed-off-by: Josh Triplett <[email protected]>
> > ---
> > fs/pipe.c | 16 ++++++++++++----
> > 1 file changed, 12 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/pipe.c b/fs/pipe.c
> > index 16fb72e9abf7..4681a0d1d587 100644
> > --- a/fs/pipe.c
> > +++ b/fs/pipe.c
> > @@ -936,19 +936,19 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> > int error;
> > int fdw, fdr;
> >
> > - if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT))
> > + if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_SPECIFIC_FD))
> > return -EINVAL;
> >
> > error = create_pipe_files(files, flags);
> > if (error)
> > return error;
> >
> > - error = get_unused_fd_flags(flags);
> > + error = get_specific_unused_fd_flags(fd[0], flags);
> > if (error < 0)
> > goto err_read_pipe;
> > fdr = error;
> >
> > - error = get_unused_fd_flags(flags);
> > + error = get_specific_unused_fd_flags(fd[1], flags);
> > if (error < 0)
> > goto err_fdr;
> > fdw = error;
> > @@ -969,7 +969,11 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
> > int do_pipe_flags(int *fd, int flags)
> > {
> > struct file *files[2];
> > - int error = __do_pipe_flags(fd, files, flags);
> > + int error;
> > +
> > + if (flags & O_SPECIFIC_FD)
> > + return -EINVAL;
> > + error = __do_pipe_flags(fd, files, flags);
> > if (!error) {
> > fd_install(fd[0], files[0]);
> > fd_install(fd[1], files[1]);
> > @@ -987,6 +991,10 @@ static int do_pipe2(int __user *fildes, int flags)
> > int fd[2];
> > int error;
> >
> > + if (flags & O_SPECIFIC_FD)
> > + if (copy_from_user(fd, fildes, sizeof(fd)))
> > + return -EFAULT;
> > +
> > error = __do_pipe_flags(fd, files, flags);
> > if (!error) {
> > if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
>
> Mark, I think this will need (or at least benefit from) some valgrind
> changes.

Yes, this makes pipe2 read the memory of its first argument from
userspace, if and only if its second argument contains the O_SPECIFIC_FD
flag.

- Josh Triplett

2020-04-23 00:49:38

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Wed, Apr 22, 2020 at 09:55:56AM +0200, Miklos Szeredi wrote:
> On Wed, Apr 22, 2020 at 8:06 AM Michael Kerrisk (man-pages)
> <[email protected]> wrote:
> >
> > [CC += linux-api]
> >
> > On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
> > >
> > > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > > the file descriptor opened by openat2, so that it can use the resulting
> > > file descriptor in subsequent system calls without waiting for the
> > > response to openat2.
> > >
> > > In io_uring, this allows sequences like openat2/read/close without
> > > waiting for the openat2 to complete. Multiple such sequences can
> > > overlap, as long as each uses a distinct file descriptor.
>
> If this is primarily an io_uring feature, then why burden the normal
> openat2 API with this?

This feature was inspired by io_uring; it isn't exclusively of value
with io_uring. (And io_uring doesn't normally change the semantics of
syscalls.)

> This would also allow Implementing a private fd table for io_uring.
> I.e. add a flag interpreted by file ops (IORING_PRIVATE_FD), including
> openat2 and freely use the private fd space without having to worry
> about interactions with other parts of the system.

I definitely don't want to add a special kind of file descriptor that
doesn't work in normal syscalls taking file descriptors. A file
descriptor allocated via O_SPECIFIC_FD is an entirely normal file
descriptor, and works anywhere a file descriptor normally works.

2020-04-23 01:15:01

by Dmitry V. Levin

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

On Tue, Apr 21, 2020 at 10:19:49PM -0700, Josh Triplett wrote:
> Some applications want to prevent the usual "lowest available fd"
> allocation from allocating certain file descriptors. For instance, they
> may want to prevent allocation of a closed fd 0, 1, or 2 other than via
> dup2/dup3, or reserve some low file descriptors for other purposes.
>
> Add a prctl to increase the minimum fd and return the previous minimum.
>
> System calls that allocate a specific file descriptor, such as
> dup2/dup3, ignore this minimum.
>
> exec resets the minimum fd, to prevent one program from interfering with
> another program's expectations about fd allocation.

Please make this aspect properly documented in "Effect on process
attributes" section of execve(2) manual page.

[...]
> +unsigned int increase_min_fd(unsigned int num)
> +{
> + struct files_struct *files = current->files;
> + unsigned int old_min_fd;
> +
> + spin_lock(&files->file_lock);
> + old_min_fd = files->min_fd;
> + files->min_fd += num;
> + spin_unlock(&files->file_lock);
> + return old_min_fd;
> +}

If it's "increase", there should be an overflow check.
Otherwise it's "assign" rather than "increase".


--
ldv

2020-04-23 04:31:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 2:48 AM Josh Triplett <[email protected]> wrote:
>
> On Wed, Apr 22, 2020 at 09:55:56AM +0200, Miklos Szeredi wrote:
> > On Wed, Apr 22, 2020 at 8:06 AM Michael Kerrisk (man-pages)
> > <[email protected]> wrote:
> > >
> > > [CC += linux-api]
> > >
> > > On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
> > > >
> > > > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > > > the file descriptor opened by openat2, so that it can use the resulting
> > > > file descriptor in subsequent system calls without waiting for the
> > > > response to openat2.
> > > >
> > > > In io_uring, this allows sequences like openat2/read/close without
> > > > waiting for the openat2 to complete. Multiple such sequences can
> > > > overlap, as long as each uses a distinct file descriptor.
> >
> > If this is primarily an io_uring feature, then why burden the normal
> > openat2 API with this?
>
> This feature was inspired by io_uring; it isn't exclusively of value
> with io_uring. (And io_uring doesn't normally change the semantics of
> syscalls.)

What's the use case of O_SPECIFIC_FD beyond io_uring?

>
> > This would also allow Implementing a private fd table for io_uring.
> > I.e. add a flag interpreted by file ops (IORING_PRIVATE_FD), including
> > openat2 and freely use the private fd space without having to worry
> > about interactions with other parts of the system.
>
> I definitely don't want to add a special kind of file descriptor that
> doesn't work in normal syscalls taking file descriptors. A file
> descriptor allocated via O_SPECIFIC_FD is an entirely normal file
> descriptor, and works anywhere a file descriptor normally works.

What's the use case of allocating a file descriptor within io_uring
and using it outside of io_uring?

Thanks,
Miklos

2020-04-23 04:46:40

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 06:24:14AM +0200, Miklos Szeredi wrote:
> On Thu, Apr 23, 2020 at 2:48 AM Josh Triplett <[email protected]> wrote:
> > On Wed, Apr 22, 2020 at 09:55:56AM +0200, Miklos Szeredi wrote:
> > > On Wed, Apr 22, 2020 at 8:06 AM Michael Kerrisk (man-pages)
> > > <[email protected]> wrote:
> > > >
> > > > [CC += linux-api]
> > > >
> > > > On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
> > > > >
> > > > > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > > > > the file descriptor opened by openat2, so that it can use the resulting
> > > > > file descriptor in subsequent system calls without waiting for the
> > > > > response to openat2.
> > > > >
> > > > > In io_uring, this allows sequences like openat2/read/close without
> > > > > waiting for the openat2 to complete. Multiple such sequences can
> > > > > overlap, as long as each uses a distinct file descriptor.
> > >
> > > If this is primarily an io_uring feature, then why burden the normal
> > > openat2 API with this?
> >
> > This feature was inspired by io_uring; it isn't exclusively of value
> > with io_uring. (And io_uring doesn't normally change the semantics of
> > syscalls.)
>
> What's the use case of O_SPECIFIC_FD beyond io_uring?

Avoiding a call to dup2 and close, if you need something as a specific
file descriptor, such as when setting up to exec something, or when
debugging a program.

I don't expect it to be as widely used as with io_uring, but I also
don't want io_uring versions of syscalls to diverge from the underlying
syscalls, and this would be a heavy divergence.

> > > This would also allow Implementing a private fd table for io_uring.
> > > I.e. add a flag interpreted by file ops (IORING_PRIVATE_FD), including
> > > openat2 and freely use the private fd space without having to worry
> > > about interactions with other parts of the system.
> >
> > I definitely don't want to add a special kind of file descriptor that
> > doesn't work in normal syscalls taking file descriptors. A file
> > descriptor allocated via O_SPECIFIC_FD is an entirely normal file
> > descriptor, and works anywhere a file descriptor normally works.
>
> What's the use case of allocating a file descriptor within io_uring
> and using it outside of io_uring?

Calling a syscall not provided via io_uring. Calling a library that
doesn't use io_uring. Passing the file descriptor via UNIX socket to
another program. Passing the file descriptor via exec to another
program. Userspace is modular, and file descriptors are widely used.

2020-04-23 04:53:26

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

On Thu, Apr 23, 2020 at 04:12:53AM +0300, Dmitry V. Levin wrote:
> On Tue, Apr 21, 2020 at 10:19:49PM -0700, Josh Triplett wrote:
> > Some applications want to prevent the usual "lowest available fd"
> > allocation from allocating certain file descriptors. For instance, they
> > may want to prevent allocation of a closed fd 0, 1, or 2 other than via
> > dup2/dup3, or reserve some low file descriptors for other purposes.
> >
> > Add a prctl to increase the minimum fd and return the previous minimum.
> >
> > System calls that allocate a specific file descriptor, such as
> > dup2/dup3, ignore this minimum.
> >
> > exec resets the minimum fd, to prevent one program from interfering with
> > another program's expectations about fd allocation.
>
> Please make this aspect properly documented in "Effect on process
> attributes" section of execve(2) manual page.

Done. I'll include updated manpage patches in v6.

> > +unsigned int increase_min_fd(unsigned int num)
> > +{
> > + struct files_struct *files = current->files;
> > + unsigned int old_min_fd;
> > +
> > + spin_lock(&files->file_lock);
> > + old_min_fd = files->min_fd;
> > + files->min_fd += num;
> > + spin_unlock(&files->file_lock);
> > + return old_min_fd;
> > +}
>
> If it's "increase", there should be an overflow check.
> Otherwise it's "assign" rather than "increase".

I'll add a check in v6, to make sure that the value cannot overflow into
the errno range. (Note that this is not security-sensitive, it's just
providing a footgun-resistant interface. It should absolutely check,
though.)

- Josh Triplett

2020-04-23 06:06:24

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 6:42 AM Josh Triplett <[email protected]> wrote:
>
> On Thu, Apr 23, 2020 at 06:24:14AM +0200, Miklos Szeredi wrote:
> > On Thu, Apr 23, 2020 at 2:48 AM Josh Triplett <[email protected]> wrote:
> > > On Wed, Apr 22, 2020 at 09:55:56AM +0200, Miklos Szeredi wrote:
> > > > On Wed, Apr 22, 2020 at 8:06 AM Michael Kerrisk (man-pages)
> > > > <[email protected]> wrote:
> > > > >
> > > > > [CC += linux-api]
> > > > >
> > > > > On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
> > > > > >
> > > > > > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > > > > > the file descriptor opened by openat2, so that it can use the resulting
> > > > > > file descriptor in subsequent system calls without waiting for the
> > > > > > response to openat2.
> > > > > >
> > > > > > In io_uring, this allows sequences like openat2/read/close without
> > > > > > waiting for the openat2 to complete. Multiple such sequences can
> > > > > > overlap, as long as each uses a distinct file descriptor.
> > > >
> > > > If this is primarily an io_uring feature, then why burden the normal
> > > > openat2 API with this?
> > >
> > > This feature was inspired by io_uring; it isn't exclusively of value
> > > with io_uring. (And io_uring doesn't normally change the semantics of
> > > syscalls.)
> >
> > What's the use case of O_SPECIFIC_FD beyond io_uring?
>
> Avoiding a call to dup2 and close, if you need something as a specific
> file descriptor, such as when setting up to exec something, or when
> debugging a program.
>
> I don't expect it to be as widely used as with io_uring, but I also
> don't want io_uring versions of syscalls to diverge from the underlying
> syscalls, and this would be a heavy divergence.

What are the plans for those syscalls that don't easily lend
themselves to this modification (such as accept(2))? Do we want to
introduce another variant of these? Is that really worth it? If not,
we are faced with the same divergence.

Compared to that, having a common flag for file ops to enable the use
of fixed and private file descriptors is a clean and well contained
interface.

> > > > This would also allow Implementing a private fd table for io_uring.
> > > > I.e. add a flag interpreted by file ops (IORING_PRIVATE_FD), including
> > > > openat2 and freely use the private fd space without having to worry
> > > > about interactions with other parts of the system.
> > >
> > > I definitely don't want to add a special kind of file descriptor that
> > > doesn't work in normal syscalls taking file descriptors. A file
> > > descriptor allocated via O_SPECIFIC_FD is an entirely normal file
> > > descriptor, and works anywhere a file descriptor normally works.
> >
> > What's the use case of allocating a file descriptor within io_uring
> > and using it outside of io_uring?
>
> Calling a syscall not provided via io_uring. Calling a library that
> doesn't use io_uring. Passing the file descriptor via UNIX socket to
> another program. Passing the file descriptor via exec to another
> program. Userspace is modular, and file descriptors are widely used.

I mean, you could open the file descriptor outside of io_uring in such
cases, no? The point of O_SPECIFIC_FD is to be able to perform short
sequences of open/dosomething/close without having to block and having
to issue separate syscalls. If you're going to issue separate
syscalls anyway, then I see no point in doing the open within
io_uring. Or?

Thanks,
Miklos

2020-04-23 07:35:51

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 08:04:25AM +0200, Miklos Szeredi wrote:
> On Thu, Apr 23, 2020 at 6:42 AM Josh Triplett <[email protected]> wrote:
> >
> > On Thu, Apr 23, 2020 at 06:24:14AM +0200, Miklos Szeredi wrote:
> > > On Thu, Apr 23, 2020 at 2:48 AM Josh Triplett <[email protected]> wrote:
> > > > On Wed, Apr 22, 2020 at 09:55:56AM +0200, Miklos Szeredi wrote:
> > > > > On Wed, Apr 22, 2020 at 8:06 AM Michael Kerrisk (man-pages)
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > [CC += linux-api]
> > > > > >
> > > > > > On Wed, 22 Apr 2020 at 07:20, Josh Triplett <[email protected]> wrote:
> > > > > > >
> > > > > > > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > > > > > > the file descriptor opened by openat2, so that it can use the resulting
> > > > > > > file descriptor in subsequent system calls without waiting for the
> > > > > > > response to openat2.
> > > > > > >
> > > > > > > In io_uring, this allows sequences like openat2/read/close without
> > > > > > > waiting for the openat2 to complete. Multiple such sequences can
> > > > > > > overlap, as long as each uses a distinct file descriptor.
> > > > >
> > > > > If this is primarily an io_uring feature, then why burden the normal
> > > > > openat2 API with this?
> > > >
> > > > This feature was inspired by io_uring; it isn't exclusively of value
> > > > with io_uring. (And io_uring doesn't normally change the semantics of
> > > > syscalls.)
> > >
> > > What's the use case of O_SPECIFIC_FD beyond io_uring?
> >
> > Avoiding a call to dup2 and close, if you need something as a specific
> > file descriptor, such as when setting up to exec something, or when
> > debugging a program.
> >
> > I don't expect it to be as widely used as with io_uring, but I also
> > don't want io_uring versions of syscalls to diverge from the underlying
> > syscalls, and this would be a heavy divergence.
>
> What are the plans for those syscalls that don't easily lend
> themselves to this modification (such as accept(2))?

accept4 has a flags argument with more flags available, so it'd be
entirely possible to cleanly extend it further without introducing a new
version. The same goes for other fd-producing syscalls that still have
flag space available.

This may or may not provide sufficient motivation on its own to
introduce a new syscall variant of a syscall that isn't currently
extensible.

> Compared to that, having a common flag for file ops to enable the use
> of fixed and private file descriptors is a clean and well contained
> interface.

"private" is not a desirable property here. See below. Even if the
userspace-specified fd mechanism were to become something only
accessible via io_uring (which I'd prefer to avoid), that's not a reason
to avoid generating real file descriptors that work anywhere a file
descriptor works.

> > > > > This would also allow Implementing a private fd table for io_uring.
> > > > > I.e. add a flag interpreted by file ops (IORING_PRIVATE_FD), including
> > > > > openat2 and freely use the private fd space without having to worry
> > > > > about interactions with other parts of the system.
> > > >
> > > > I definitely don't want to add a special kind of file descriptor that
> > > > doesn't work in normal syscalls taking file descriptors. A file
> > > > descriptor allocated via O_SPECIFIC_FD is an entirely normal file
> > > > descriptor, and works anywhere a file descriptor normally works.
> > >
> > > What's the use case of allocating a file descriptor within io_uring
> > > and using it outside of io_uring?
> >
> > Calling a syscall not provided via io_uring. Calling a library that
> > doesn't use io_uring. Passing the file descriptor via UNIX socket to
> > another program. Passing the file descriptor via exec to another
> > program. Userspace is modular, and file descriptors are widely used.
>
> I mean, you could open the file descriptor outside of io_uring in such
> cases, no?

I would prefer to not introduce that limitation in the first place, and
instead open normal file descriptors.

> The point of O_SPECIFIC_FD is to be able to perform short
> sequences of open/dosomething/close without having to block and having
> to issue separate syscalls.

"close" is not a required component. It's entirely possible to use
io_uring to open a file descriptor, do various things with it, and then
leave it open for subsequent usage via either other io_uring chains or
standalone syscalls.

> If you're going to issue separate
> syscalls anyway, then I see no point in doing the open within
> io_uring. Or?

io_uring is not an all-or-nothing proposition. There's value in using
io_uring for some operations without converting an entire program (and
every library it might potentially use on a file descriptor) entirely to
io_uring. Userspace is modular, and file descriptors are a common
element used by many different bits of userspace.

2020-04-23 07:48:46

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 9:33 AM Josh Triplett <[email protected]> wrote:

> > What are the plans for those syscalls that don't easily lend
> > themselves to this modification (such as accept(2))?
>
> accept4 has a flags argument with more flags available, so it'd be
> entirely possible to cleanly extend it further without introducing a new
> version.

Variable argument syscalls, you are thinking?

> > I mean, you could open the file descriptor outside of io_uring in such
> > cases, no?
>
> I would prefer to not introduce that limitation in the first place, and
> instead open normal file descriptors.
>
> > The point of O_SPECIFIC_FD is to be able to perform short
> > sequences of open/dosomething/close without having to block and having
> > to issue separate syscalls.
>
> "close" is not a required component. It's entirely possible to use
> io_uring to open a file descriptor, do various things with it, and then
> leave it open for subsequent usage via either other io_uring chains or
> standalone syscalls.

If this use case arraises, we could add an op to dup/move a private
descriptor to a public one. io_uring can return values, right?

Still not convinced...

Thanks,
Miklos

2020-04-23 07:59:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 9:45 AM Miklos Szeredi <[email protected]> wrote:

> > I would prefer to not introduce that limitation in the first place, and
> > instead open normal file descriptors.
> >
> > > The point of O_SPECIFIC_FD is to be able to perform short
> > > sequences of open/dosomething/close without having to block and having
> > > to issue separate syscalls.
> >
> > "close" is not a required component. It's entirely possible to use
> > io_uring to open a file descriptor, do various things with it, and then
> > leave it open for subsequent usage via either other io_uring chains or
> > standalone syscalls.
>
> If this use case arraises, we could add an op to dup/move a private
> descriptor to a public one. io_uring can return values, right?
>
> Still not convinced...

Oh, and we haven't even touched on the biggest advantage of a private
fd table: not having to dirty a cacheline on fdget/fdput due to the
possibility of concurrent close() in a MT application.

I believe this is a sticking point in some big enterprise apps and it
may even be a driving force for io_uring.

Thanks,
Miklos

2020-04-23 08:10:50

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 09:45:45AM +0200, Miklos Szeredi wrote:
> On Thu, Apr 23, 2020 at 9:33 AM Josh Triplett <[email protected]> wrote:
> > > What are the plans for those syscalls that don't easily lend
> > > themselves to this modification (such as accept(2))?
> >
> > accept4 has a flags argument with more flags available, so it'd be
> > entirely possible to cleanly extend it further without introducing a new
> > version.
>
> Variable argument syscalls, you are thinking?

That or repurposing an existing pointer-sized argument as an
open_how-style struct, yes. But in any case, I'm not proposing that; I'm
proposing changes to the existing highly extensible openat2 syscall.

> > > I mean, you could open the file descriptor outside of io_uring in such
> > > cases, no?
> >
> > I would prefer to not introduce that limitation in the first place, and
> > instead open normal file descriptors.
> >
> > > The point of O_SPECIFIC_FD is to be able to perform short
> > > sequences of open/dosomething/close without having to block and having
> > > to issue separate syscalls.
> >
> > "close" is not a required component. It's entirely possible to use
> > io_uring to open a file descriptor, do various things with it, and then
> > leave it open for subsequent usage via either other io_uring chains or
> > standalone syscalls.
>
> If this use case arraises,

This wasn't a hypothetical "someone might want this". I'm stating that
this is a requirement I'm seeking to meet with this patch series, and
one I intend to use. The primary use case is interoperability with
other code using file descriptors and not using io_uring.

2020-04-23 09:24:15

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 9:57 AM Miklos Szeredi <[email protected]> wrote:
>
> On Thu, Apr 23, 2020 at 9:45 AM Miklos Szeredi <[email protected]> wrote:
>
> > > I would prefer to not introduce that limitation in the first place, and
> > > instead open normal file descriptors.
> > >
> > > > The point of O_SPECIFIC_FD is to be able to perform short
> > > > sequences of open/dosomething/close without having to block and having
> > > > to issue separate syscalls.
> > >
> > > "close" is not a required component. It's entirely possible to use
> > > io_uring to open a file descriptor, do various things with it, and then
> > > leave it open for subsequent usage via either other io_uring chains or
> > > standalone syscalls.
> >
> > If this use case arraises, we could add an op to dup/move a private
> > descriptor to a public one. io_uring can return values, right?
> >
> > Still not convinced...
>
> Oh, and we haven't even touched on the biggest advantage of a private
> fd table: not having to dirty a cacheline on fdget/fdput due to the
> possibility of concurrent close() in a MT application.
>
> I believe this is a sticking point in some big enterprise apps and it
> may even be a driving force for io_uring.

https://lwn.net/Articles/787473/

And an interesting (very old) article referenced from above, that
gives yet a new angle on fd allocation issues:

https://lwn.net/Articles/236843/

A private fd space would be perfect for libraries such as glibc.

Thanks,
Miklos

2020-04-23 09:27:13

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

On Wed, Apr 22, 2020 at 7:19 AM Josh Triplett <[email protected]> wrote:
>
> Some applications want to prevent the usual "lowest available fd"
> allocation from allocating certain file descriptors. For instance, they
> may want to prevent allocation of a closed fd 0, 1, or 2 other than via
> dup2/dup3, or reserve some low file descriptors for other purposes.
>
> Add a prctl to increase the minimum fd and return the previous minimum.
>
> System calls that allocate a specific file descriptor, such as
> dup2/dup3, ignore this minimum.
>
> exec resets the minimum fd, to prevent one program from interfering with
> another program's expectations about fd allocation.

Have you considered making this a separate system call rather than
a part of prctl()?

At the moment, there are certain classes of things controlled by prctl,
e.g. capabilities, floating point handling and timer behavior, but nothing
that relates to file descriptors as such, so it's not an obvious decision.

Another option would be prlimit(), as it already controls the maximum
file descriptor number with RLIMIT_NOFILE, and adding a minimum
there would let you set min/max atomically.

Arnd

2020-04-23 09:49:31

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v5 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Thu, Apr 23, 2020 at 11:20 AM Miklos Szeredi <[email protected]> wrote:
>
> On Thu, Apr 23, 2020 at 9:57 AM Miklos Szeredi <[email protected]> wrote:
> >
> > On Thu, Apr 23, 2020 at 9:45 AM Miklos Szeredi <[email protected]> wrote:
> >
> > > > I would prefer to not introduce that limitation in the first place, and
> > > > instead open normal file descriptors.
> > > >
> > > > > The point of O_SPECIFIC_FD is to be able to perform short
> > > > > sequences of open/dosomething/close without having to block and having
> > > > > to issue separate syscalls.
> > > >
> > > > "close" is not a required component. It's entirely possible to use
> > > > io_uring to open a file descriptor, do various things with it, and then
> > > > leave it open for subsequent usage via either other io_uring chains or
> > > > standalone syscalls.
> > >
> > > If this use case arraises, we could add an op to dup/move a private
> > > descriptor to a public one. io_uring can return values, right?
> > >
> > > Still not convinced...
> >
> > Oh, and we haven't even touched on the biggest advantage of a private
> > fd table: not having to dirty a cacheline on fdget/fdput due to the
> > possibility of concurrent close() in a MT application.
> >
> > I believe this is a sticking point in some big enterprise apps and it
> > may even be a driving force for io_uring.
>
> https://lwn.net/Articles/787473/
>
> And an interesting (very old) article referenced from above, that
> gives yet a new angle on fd allocation issues:
>
> https://lwn.net/Articles/236843/
>
> A private fd space would be perfect for libraries such as glibc.

Ah, io_uring already implements a fixed private fd table via
io_uring_register(IORING_REGISTER_FILES,...), we just need a way to
wire up open, socket, accept, etc. to fill a slot in that table
instead of, or in addition to allocating a slot in the fd_table.

Thanks,
Miklos