2020-04-08 06:58:14

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v3 0/3] Support userspace-selected fds

(Note: numbering this updated version v3, to avoid confusion with Jens'
v2 that built on my v1. Jens, if you like this approach, please feel
free to stack your additional patches from the io_uring-fd-select branch
atop this series. 5.8 material, not intended for the current merge window.)

Inspired by the X protocol's handling of XIDs, allow userspace to select
the file descriptor opened by a call like openat2, so that it can use
the resulting file descriptor in subsequent system calls without waiting
for the response to the initial openat2 syscall.

The first patch is independent of the other two; it allows reserving
file descriptors below a certain minimum for userspace-selected fd
allocation only.

The second patch implements userspace-selected fd allocation for
openat2, introducing a new O_SPECIFIC_FD flag and an fd field in struct
open_how. In io_uring, this allows sequences like openat2/read/close
without waiting for the openat2 to complete. Multiple such sequences can
overlap, as long as each uses a distinct file descriptor.

The third patch adds userspace-selected fd allocation to pipe2 as well.
I did this partly as a demonstration of how simple it is to wire up
O_SPECIFIC_FD support for any fd-allocating system call, and partly in
the hopes that this may make it more useful to wire up io_uring support
for pipe2 in the future.

If this gets accepted, I'm happy to also write corresponding manpage
patches.

v3:
This new version has an API to atomically increase the minimum fd and
return the previous minimum, rather than just getting and setting the
minimum; this makes it easier to allocate a range. (A library that might
initialize after the program has already opened other file descriptors
may need to check for existing open fds in the range after reserving it,
and reserve more fds if needed; this can be done entirely in userspace,
and we can't really do anything simpler in the kernel due to limitations
on file-descriptor semantics, so this patch series avoids introducing
any extra complexity in the kernel.)

This new version also supports a __get_specific_unused_fd_flags call
which accepts the limit for RLIMIT_NOFILE as an argument, analogous to
__get_unused_fd_flags, since io_uring needs that to correctly handle
RLIMIT_NOFILE.

Josh Triplett (3):
fs: Support setting a minimum fd for "lowest available fd" allocation
fs: openat2: Extend open_how to allow userspace-selected fds
fs: pipe2: Support O_SPECIFIC_FD

fs/fcntl.c | 2 +-
fs/file.c | 62 ++++++++++++++++++++++++++++----
fs/io_uring.c | 3 +-
fs/open.c | 6 ++--
fs/pipe.c | 16 ++++++---
include/linux/fcntl.h | 5 +--
include/linux/fdtable.h | 1 +
include/linux/file.h | 4 +++
include/uapi/asm-generic/fcntl.h | 4 +++
include/uapi/linux/openat2.h | 2 ++
include/uapi/linux/prctl.h | 3 ++
kernel/sys.c | 5 +++
12 files changed, 97 insertions(+), 16 deletions(-)

--
2.26.0


2020-04-08 06:58:27

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v3 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

Some applications want to prevent the usual "lowest available fd"
allocation from allocating certain file descriptors. For instance, they
may want to prevent allocation of a closed fd 0, 1, or 2 other than via
dup2/dup3, or reserve some low file descriptors for other purposes.

Add a prctl to increase the minimum fd and return the previous minimum.

System calls that allocate a specific file descriptor, such as
dup2/dup3, ignore this minimum.

exec resets the minimum fd, to prevent one program from interfering with
another program's expectations about fd allocation.

Test program:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/prctl.h>

int main(int argc, char *argv[])
{
if (prctl(PR_INCREASE_MIN_FD, 100, 0, 0, 0) < 0)
err(1, "prctl");
int fd = open("/dev/null", O_RDONLY);
if (fd < 0)
err(1, "open");
printf("%d\n", fd); // prints 100
return 0;
}

Signed-off-by: Josh Triplett <[email protected]>
---
fs/file.c | 23 +++++++++++++++++------
include/linux/fdtable.h | 1 +
include/linux/file.h | 1 +
include/uapi/linux/prctl.h | 3 +++
kernel/sys.c | 5 +++++
5 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..ba06140d89af 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -286,7 +286,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
spin_lock_init(&newf->file_lock);
newf->resize_in_progress = false;
init_waitqueue_head(&newf->resize_wait);
- newf->next_fd = 0;
new_fdt = &newf->fdtab;
new_fdt->max_fds = NR_OPEN_DEFAULT;
new_fdt->close_on_exec = newf->close_on_exec_init;
@@ -295,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
new_fdt->fd = &newf->fd_array[0];

spin_lock(&oldf->file_lock);
+ newf->next_fd = newf->min_fd = oldf->min_fd;
old_fdt = files_fdtable(oldf);
open_files = count_open_files(old_fdt);

@@ -487,9 +487,7 @@ int __alloc_fd(struct files_struct *files,
spin_lock(&files->file_lock);
repeat:
fdt = files_fdtable(files);
- fd = start;
- if (fd < files->next_fd)
- fd = files->next_fd;
+ fd = max3(start, files->min_fd, files->next_fd);

if (fd < fdt->max_fds)
fd = find_next_fd(fdt, fd);
@@ -514,7 +512,7 @@ int __alloc_fd(struct files_struct *files,
goto repeat;

if (start <= files->next_fd)
- files->next_fd = fd + 1;
+ files->next_fd = max(fd + 1, files->min_fd);

__set_open_fd(fd, fdt);
if (flags & O_CLOEXEC)
@@ -555,7 +553,7 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
{
struct fdtable *fdt = files_fdtable(files);
__clear_open_fd(fd, fdt);
- if (fd < files->next_fd)
+ if (fd < files->next_fd && fd >= files->min_fd)
files->next_fd = fd;
}

@@ -684,6 +682,7 @@ void do_close_on_exec(struct files_struct *files)

/* exec unshares first */
spin_lock(&files->file_lock);
+ files->min_fd = 0;
for (i = 0; ; i++) {
unsigned long set;
unsigned fd = i * BITS_PER_LONG;
@@ -865,6 +864,18 @@ bool get_close_on_exec(unsigned int fd)
return res;
}

+unsigned int increase_min_fd(unsigned int num)
+{
+ struct files_struct *files = current->files;
+ unsigned int old_min_fd;
+
+ spin_lock(&files->file_lock);
+ old_min_fd = files->min_fd;
+ files->min_fd += num;
+ spin_unlock(&files->file_lock);
+ return old_min_fd;
+}
+
static int do_dup2(struct files_struct *files,
struct file *file, unsigned fd, unsigned flags)
__releases(&files->file_lock)
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..d1980443d8b3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -60,6 +60,7 @@ struct files_struct {
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd;
+ unsigned int min_fd; /* min for "lowest available fd" allocation */
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..b67986f818d2 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -88,6 +88,7 @@ extern bool get_close_on_exec(unsigned int fd);
extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
extern int get_unused_fd_flags(unsigned flags);
extern void put_unused_fd(unsigned int fd);
+extern unsigned int increase_min_fd(unsigned int num);

extern void fd_install(unsigned int fd, struct file *file);

diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..916327272d21 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58

+/* Increase minimum file descriptor for "lowest available fd" allocation */
+#define PR_INCREASE_MIN_FD 59
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..daa0ce43cecc 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2514,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,

error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
break;
+ case PR_INCREASE_MIN_FD:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = increase_min_fd((unsigned int)arg2);
+ break;
default:
error = -EINVAL;
break;
--
2.26.0

2020-04-08 06:59:05

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v3 3/3] fs: pipe2: Support O_SPECIFIC_FD

This allows the caller of pipe2 to specify one or both file descriptors
rather than having them automatically use the lowest available file
descriptor. The caller can specify either file descriptor as -1 to
allow that file descriptor to use the lowest available.

Signed-off-by: Josh Triplett <[email protected]>
---
fs/pipe.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/pipe.c b/fs/pipe.c
index 16fb72e9abf7..4681a0d1d587 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -936,19 +936,19 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
int error;
int fdw, fdr;

- if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT))
+ if (flags & ~(O_CLOEXEC | O_NONBLOCK | O_DIRECT | O_SPECIFIC_FD))
return -EINVAL;

error = create_pipe_files(files, flags);
if (error)
return error;

- error = get_unused_fd_flags(flags);
+ error = get_specific_unused_fd_flags(fd[0], flags);
if (error < 0)
goto err_read_pipe;
fdr = error;

- error = get_unused_fd_flags(flags);
+ error = get_specific_unused_fd_flags(fd[1], flags);
if (error < 0)
goto err_fdr;
fdw = error;
@@ -969,7 +969,11 @@ static int __do_pipe_flags(int *fd, struct file **files, int flags)
int do_pipe_flags(int *fd, int flags)
{
struct file *files[2];
- int error = __do_pipe_flags(fd, files, flags);
+ int error;
+
+ if (flags & O_SPECIFIC_FD)
+ return -EINVAL;
+ error = __do_pipe_flags(fd, files, flags);
if (!error) {
fd_install(fd[0], files[0]);
fd_install(fd[1], files[1]);
@@ -987,6 +991,10 @@ static int do_pipe2(int __user *fildes, int flags)
int fd[2];
int error;

+ if (flags & O_SPECIFIC_FD)
+ if (copy_from_user(fd, fildes, sizeof(fd)))
+ return -EFAULT;
+
error = __do_pipe_flags(fd, files, flags);
if (!error) {
if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
--
2.26.0

2020-04-08 06:59:56

by Josh Triplett

[permalink] [raw]
Subject: [PATCH v3 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

Inspired by the X protocol's handling of XIDs, allow userspace to select
the file descriptor opened by openat2, so that it can use the resulting
file descriptor in subsequent system calls without waiting for the
response to openat2.

In io_uring, this allows sequences like openat2/read/close without
waiting for the openat2 to complete. Multiple such sequences can
overlap, as long as each uses a distinct file descriptor.

Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
by openat2 for now (ignored by open/openat like all unknown flags). Add
an fd field to struct open_how (along with appropriate padding, and
verify that the padding is 0 to allow replacing the padding with a field
in the future).

The file table has a corresponding new function
get_specific_unused_fd_flags, which gets the specified file descriptor
if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
to get_unused_fd_flags, to simplify callers.

The specified file descriptor must not already be open; if it is,
get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
userspace errors.

When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
specified file descriptor rather than finding the lowest available one.

Test program:

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
struct open_how how = {
.flags = O_RDONLY | O_SPECIFIC_FD,
.fd = 42
};
int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
if (fd < 0)
err(1, "openat2");
printf("fd=%d\n", fd); // prints fd=42
return 0;
}

Signed-off-by: Josh Triplett <[email protected]>
Signed-off-by: Jens Axboe <[email protected]>
---
fs/fcntl.c | 2 +-
fs/file.c | 39 ++++++++++++++++++++++++++++++++
fs/io_uring.c | 3 ++-
fs/open.c | 6 +++--
include/linux/fcntl.h | 5 ++--
include/linux/file.h | 3 +++
include/uapi/asm-generic/fcntl.h | 4 ++++
include/uapi/linux/openat2.h | 2 ++
8 files changed, 58 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..0357ad667563 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index ba06140d89af..0c34206314dc 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)

EXPORT_SYMBOL(put_unused_fd);

+int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
+ unsigned long nofile)
+{
+ int ret;
+ struct fdtable *fdt;
+ struct files_struct *files = current->files;
+
+ if (!(flags & O_SPECIFIC_FD) || fd == -1)
+ return __get_unused_fd_flags(flags, nofile);
+
+ if (fd >= nofile)
+ return -EBADF;
+
+ spin_lock(&files->file_lock);
+ ret = expand_files(files, fd);
+ if (unlikely(ret < 0))
+ goto out_unlock;
+ fdt = files_fdtable(files);
+ if (fdt->fd[fd]) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ __set_open_fd(fd, fdt);
+ if (flags & O_CLOEXEC)
+ __set_close_on_exec(fd, fdt);
+ else
+ __clear_close_on_exec(fd, fdt);
+ ret = fd;
+
+out_unlock:
+ spin_unlock(&files->file_lock);
+ return ret;
+}
+
+int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
+{
+ return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
+}
+
/*
* Install a file pointer in the fd array.
*
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 358f97be9c7b..4a69e1daf3fe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
if (ret)
goto err;

- ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
+ ret = __get_specific_unused_fd_flags(req->open.how.fd,
+ req->open.how.flags, req->open.nofile);
if (ret < 0)
goto err;

diff --git a/fs/open.c b/fs/open.c
index 719b320ede52..d4225e6f9d4c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -958,7 +958,7 @@ EXPORT_SYMBOL(open_with_fake_path);
inline struct open_how build_open_how(int flags, umode_t mode)
{
struct open_how how = {
- .flags = flags & VALID_OPEN_FLAGS,
+ .flags = flags & VALID_OPEN_FLAGS & ~O_SPECIFIC_FD,
.mode = mode & S_IALLUGO,
};

@@ -1143,7 +1143,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
if (IS_ERR(tmp))
return PTR_ERR(tmp);

- fd = get_unused_fd_flags(how->flags);
+ fd = get_specific_unused_fd_flags(how->fd, how->flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
@@ -1193,6 +1193,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
if (err)
return err;
+ if (tmp.pad != 0)
+ return -EINVAL;

/* O_LARGEFILE is only allowed for non-O_PATH. */
if (!(tmp.flags & O_PATH) && force_o_largefile())
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..728849bcd8fa 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)

/* List of all valid flags for the how->upgrade_mask argument: */
#define VALID_UPGRADE_FLAGS \
@@ -23,7 +23,8 @@

/* List of all open_how "versions". */
#define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
-#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
+#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
+#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1

#ifndef force_o_largefile
#define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
diff --git a/include/linux/file.h b/include/linux/file.h
index b67986f818d2..a63301864a36 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
extern bool get_close_on_exec(unsigned int fd);
extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
extern int get_unused_fd_flags(unsigned flags);
+extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
+ unsigned long nofile);
+extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
extern void put_unused_fd(unsigned int fd);
extern unsigned int increase_min_fd(unsigned int num);

diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..d3de5b8b3955 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
#define __O_TMPFILE 020000000
#endif

+#ifndef O_SPECIFIC_FD
+#define O_SPECIFIC_FD 01000000000 /* open as specified fd */
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
index 58b1eb711360..50d1206b64c2 100644
--- a/include/uapi/linux/openat2.h
+++ b/include/uapi/linux/openat2.h
@@ -20,6 +20,8 @@ struct open_how {
__u64 flags;
__u64 mode;
__u64 resolve;
+ __s32 fd;
+ __u32 pad; /* Must be 0 in the current version */
};

/* how->resolve flags for openat2(2). */
--
2.26.0

2020-04-08 12:04:41

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

On 2020-04-07, Josh Triplett <[email protected]> wrote:
> Some applications want to prevent the usual "lowest available fd"
> allocation from allocating certain file descriptors. For instance, they
> may want to prevent allocation of a closed fd 0, 1, or 2 other than via
> dup2/dup3, or reserve some low file descriptors for other purposes.
>
> Add a prctl to increase the minimum fd and return the previous minimum.
>
> System calls that allocate a specific file descriptor, such as
> dup2/dup3, ignore this minimum.
>
> exec resets the minimum fd, to prevent one program from interfering with
> another program's expectations about fd allocation.

Why is it implemented as an "increase the value" interface? It feels
like this is meant to avoid some kind of security trap (with a library
reducing the value) but it means that if you want to temporarily raise
the minimum fd number it's not possible (without re-exec()ing yourself,
which is hardly a fun thing to do).

Then again, this might've been discussed before and I missed it...

> Test program:
>
> #include <err.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/prctl.h>
>
> int main(int argc, char *argv[])
> {
> if (prctl(PR_INCREASE_MIN_FD, 100, 0, 0, 0) < 0)
> err(1, "prctl");
> int fd = open("/dev/null", O_RDONLY);
> if (fd < 0)
> err(1, "open");
> printf("%d\n", fd); // prints 100
> return 0;
> }
>
> Signed-off-by: Josh Triplett <[email protected]>
> ---
> fs/file.c | 23 +++++++++++++++++------
> include/linux/fdtable.h | 1 +
> include/linux/file.h | 1 +
> include/uapi/linux/prctl.h | 3 +++
> kernel/sys.c | 5 +++++
> 5 files changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index c8a4e4c86e55..ba06140d89af 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -286,7 +286,6 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
> spin_lock_init(&newf->file_lock);
> newf->resize_in_progress = false;
> init_waitqueue_head(&newf->resize_wait);
> - newf->next_fd = 0;
> new_fdt = &newf->fdtab;
> new_fdt->max_fds = NR_OPEN_DEFAULT;
> new_fdt->close_on_exec = newf->close_on_exec_init;
> @@ -295,6 +294,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
> new_fdt->fd = &newf->fd_array[0];
>
> spin_lock(&oldf->file_lock);
> + newf->next_fd = newf->min_fd = oldf->min_fd;
> old_fdt = files_fdtable(oldf);
> open_files = count_open_files(old_fdt);
>
> @@ -487,9 +487,7 @@ int __alloc_fd(struct files_struct *files,
> spin_lock(&files->file_lock);
> repeat:
> fdt = files_fdtable(files);
> - fd = start;
> - if (fd < files->next_fd)
> - fd = files->next_fd;
> + fd = max3(start, files->min_fd, files->next_fd);
>
> if (fd < fdt->max_fds)
> fd = find_next_fd(fdt, fd);
> @@ -514,7 +512,7 @@ int __alloc_fd(struct files_struct *files,
> goto repeat;
>
> if (start <= files->next_fd)
> - files->next_fd = fd + 1;
> + files->next_fd = max(fd + 1, files->min_fd);
>
> __set_open_fd(fd, fdt);
> if (flags & O_CLOEXEC)
> @@ -555,7 +553,7 @@ static void __put_unused_fd(struct files_struct *files, unsigned int fd)
> {
> struct fdtable *fdt = files_fdtable(files);
> __clear_open_fd(fd, fdt);
> - if (fd < files->next_fd)
> + if (fd < files->next_fd && fd >= files->min_fd)
> files->next_fd = fd;
> }
>
> @@ -684,6 +682,7 @@ void do_close_on_exec(struct files_struct *files)
>
> /* exec unshares first */
> spin_lock(&files->file_lock);
> + files->min_fd = 0;
> for (i = 0; ; i++) {
> unsigned long set;
> unsigned fd = i * BITS_PER_LONG;
> @@ -865,6 +864,18 @@ bool get_close_on_exec(unsigned int fd)
> return res;
> }
>
> +unsigned int increase_min_fd(unsigned int num)
> +{
> + struct files_struct *files = current->files;
> + unsigned int old_min_fd;
> +
> + spin_lock(&files->file_lock);
> + old_min_fd = files->min_fd;
> + files->min_fd += num;
> + spin_unlock(&files->file_lock);
> + return old_min_fd;
> +}
> +
> static int do_dup2(struct files_struct *files,
> struct file *file, unsigned fd, unsigned flags)
> __releases(&files->file_lock)
> diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
> index f07c55ea0c22..d1980443d8b3 100644
> --- a/include/linux/fdtable.h
> +++ b/include/linux/fdtable.h
> @@ -60,6 +60,7 @@ struct files_struct {
> */
> spinlock_t file_lock ____cacheline_aligned_in_smp;
> unsigned int next_fd;
> + unsigned int min_fd; /* min for "lowest available fd" allocation */
> unsigned long close_on_exec_init[1];
> unsigned long open_fds_init[1];
> unsigned long full_fds_bits_init[1];
> diff --git a/include/linux/file.h b/include/linux/file.h
> index 142d102f285e..b67986f818d2 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -88,6 +88,7 @@ extern bool get_close_on_exec(unsigned int fd);
> extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
> extern int get_unused_fd_flags(unsigned flags);
> extern void put_unused_fd(unsigned int fd);
> +extern unsigned int increase_min_fd(unsigned int num);
>
> extern void fd_install(unsigned int fd, struct file *file);
>
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 07b4f8131e36..916327272d21 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -238,4 +238,7 @@ struct prctl_mm_map {
> #define PR_SET_IO_FLUSHER 57
> #define PR_GET_IO_FLUSHER 58
>
> +/* Increase minimum file descriptor for "lowest available fd" allocation */
> +#define PR_INCREASE_MIN_FD 59
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index d325f3ab624a..daa0ce43cecc 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2514,6 +2514,11 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>
> error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> break;
> + case PR_INCREASE_MIN_FD:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = increase_min_fd((unsigned int)arg2);
> + break;
> default:
> error = -EINVAL;
> break;
> --
> 2.26.0
>


--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (6.35 kB)
signature.asc (235.00 B)
Download all attachments

2020-04-08 13:48:25

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On 2020-04-07, Josh Triplett <[email protected]> wrote:
> Inspired by the X protocol's handling of XIDs, allow userspace to select
> the file descriptor opened by openat2, so that it can use the resulting
> file descriptor in subsequent system calls without waiting for the
> response to openat2.
>
> In io_uring, this allows sequences like openat2/read/close without
> waiting for the openat2 to complete. Multiple such sequences can
> overlap, as long as each uses a distinct file descriptor.
>
> Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
> by openat2 for now (ignored by open/openat like all unknown flags). Add
> an fd field to struct open_how (along with appropriate padding, and
> verify that the padding is 0 to allow replacing the padding with a field
> in the future).
>
> The file table has a corresponding new function
> get_specific_unused_fd_flags, which gets the specified file descriptor
> if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
> to get_unused_fd_flags, to simplify callers.

>
> The specified file descriptor must not already be open; if it is,
> get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
> userspace errors.
>
> When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
> specified file descriptor rather than finding the lowest available one.
>
> Test program:
>
> #include <err.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
> struct open_how how = {
> .flags = O_RDONLY | O_SPECIFIC_FD,
> .fd = 42
> };
> int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
> if (fd < 0)
> err(1, "openat2");
> printf("fd=%d\n", fd); // prints fd=42
> return 0;
> }
>
> Signed-off-by: Josh Triplett <[email protected]>
> Signed-off-by: Jens Axboe <[email protected]>

Maybe I'm misunderstanding something, but this Signed-off-by looks
strange -- was it Co-developed-by Jens?

> ---
> fs/fcntl.c | 2 +-
> fs/file.c | 39 ++++++++++++++++++++++++++++++++
> fs/io_uring.c | 3 ++-
> fs/open.c | 6 +++--
> include/linux/fcntl.h | 5 ++--
> include/linux/file.h | 3 +++
> include/uapi/asm-generic/fcntl.h | 4 ++++
> include/uapi/linux/openat2.h | 2 ++
> 8 files changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 2e4c0fa2074b..0357ad667563 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
> * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> * is defined as O_NONBLOCK on some platforms and not on others.
> */
> - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
> HWEIGHT32(
> (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> __FMODE_EXEC | __FMODE_NONOTIFY));
> diff --git a/fs/file.c b/fs/file.c
> index ba06140d89af..0c34206314dc 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)
>
> EXPORT_SYMBOL(put_unused_fd);
>
> +int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> + unsigned long nofile)
> +{
> + int ret;
> + struct fdtable *fdt;
> + struct files_struct *files = current->files;
> +
> + if (!(flags & O_SPECIFIC_FD) || fd == -1)
> + return __get_unused_fd_flags(flags, nofile);

This check should just be (flags & O_SPECIFIC_FD) -- see my comment
below about ->fd being negative.

> +
> + if (fd >= nofile)
> + return -EBADF;
> +
> + spin_lock(&files->file_lock);
> + ret = expand_files(files, fd);
> + if (unlikely(ret < 0))
> + goto out_unlock;
> + fdt = files_fdtable(files);
> + if (fdt->fd[fd]) {
> + ret = -EBUSY;
> + goto out_unlock;

It would be remiss of me to not mention that this is inconsistent with
the other way of explicitly picking a file descriptor on Unix -- the dup
family closes newfd if it's already used.

But that being said, I do actually prefer this behaviour since it means
that two threads trying to open a file with the same specific file
descriptor won't see the file descriptor change underneath them (leading
to who knows how much head-scratching).

> + }
> + __set_open_fd(fd, fdt);
> + if (flags & O_CLOEXEC)
> + __set_close_on_exec(fd, fdt);
> + else
> + __clear_close_on_exec(fd, fdt);
> + ret = fd;
> +
> +out_unlock:
> + spin_unlock(&files->file_lock);
> + return ret;
> +}
> +
> +int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
> +{
> + return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
> +}
> +
> /*
> * Install a file pointer in the fd array.
> *
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 358f97be9c7b..4a69e1daf3fe 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
> if (ret)
> goto err;
>
> - ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
> + ret = __get_specific_unused_fd_flags(req->open.how.fd,
> + req->open.how.flags, req->open.nofile);
> if (ret < 0)
> goto err;
>
> diff --git a/fs/open.c b/fs/open.c
> index 719b320ede52..d4225e6f9d4c 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -958,7 +958,7 @@ EXPORT_SYMBOL(open_with_fake_path);
> inline struct open_how build_open_how(int flags, umode_t mode)
> {
> struct open_how how = {
> - .flags = flags & VALID_OPEN_FLAGS,
> + .flags = flags & VALID_OPEN_FLAGS & ~O_SPECIFIC_FD,

This is getting a little ugly, maybe filter out O_SPECIFIC_FD later on
in build_open_how() -- where we handle O_PATH.

> .mode = mode & S_IALLUGO,
> };
>
> @@ -1143,7 +1143,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
> if (IS_ERR(tmp))
> return PTR_ERR(tmp);
>
> - fd = get_unused_fd_flags(how->flags);
> + fd = get_specific_unused_fd_flags(how->fd, how->flags);
> if (fd >= 0) {
> struct file *f = do_filp_open(dfd, tmp, &op);
> if (IS_ERR(f)) {
> @@ -1193,6 +1193,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
> err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
> if (err)
> return err;
> + if (tmp.pad != 0)
> + return -EINVAL;

This check should be done in build_open_flags(), where the other sanity
checks are done. In addition, there must be an additional check like

if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
return -EINVAL;

Since we must not allow garbage values to be passed and ignored by us in
openat2().

> /* O_LARGEFILE is only allowed for non-O_PATH. */
> if (!(tmp.flags & O_PATH) && force_o_largefile())
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index 7bcdcf4f6ab2..728849bcd8fa 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -10,7 +10,7 @@
> (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)
>
> /* List of all valid flags for the how->upgrade_mask argument: */
> #define VALID_UPGRADE_FLAGS \
> @@ -23,7 +23,8 @@
>
> /* List of all open_how "versions". */
> #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
> -#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
> +#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
> +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1
>
> #ifndef force_o_largefile
> #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> diff --git a/include/linux/file.h b/include/linux/file.h
> index b67986f818d2..a63301864a36 100644
> --- a/include/linux/file.h
> +++ b/include/linux/file.h
> @@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
> extern bool get_close_on_exec(unsigned int fd);
> extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
> extern int get_unused_fd_flags(unsigned flags);
> +extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> + unsigned long nofile);
> +extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
> extern void put_unused_fd(unsigned int fd);
> extern unsigned int increase_min_fd(unsigned int num);
>
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 9dc0bf0c5a6e..d3de5b8b3955 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -89,6 +89,10 @@
> #define __O_TMPFILE 020000000
> #endif
>
> +#ifndef O_SPECIFIC_FD
> +#define O_SPECIFIC_FD 01000000000 /* open as specified fd */
> +#endif

Maybe you've already done this (since you skipped several bits in the
O_* flag space), but I would double-check that there is no conflict on
other architectures. I faintly recall that FMODE_NOTIFY has a different
value on sparc, and there was some oddness on alpha too... But as long
as fcntl.c builds on all the arches then it's fine.

> +
> /* a horrid kludge trying to make sure that this will fail on old kernels */
> #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
> #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
> diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> index 58b1eb711360..50d1206b64c2 100644
> --- a/include/uapi/linux/openat2.h
> +++ b/include/uapi/linux/openat2.h
> @@ -20,6 +20,8 @@ struct open_how {
> __u64 flags;
> __u64 mode;
> __u64 resolve;
> + __s32 fd;
> + __u32 pad; /* Must be 0 in the current version */

I'm not sure I see why the new field is an s32 -- there should be no
situation where a user specifies O_SPECIFIC_FD and a negative file
descriptor (if we keep it as an s32, you should get an -EINVAL in that
case). If you don't want O_SPECIFIC_FD, don't specify it.

But I think this should be a u32. I'm tempted to argue this should
actually be a u64, but nothing supports 64-bit file descriptor numbers
(including the return type of openat2).

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (10.35 kB)
signature.asc (235.00 B)
Download all attachments

2020-04-08 14:31:33

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Support userspace-selected fds

On 2020-04-07, Josh Triplett <[email protected]> wrote:
> (Note: numbering this updated version v3, to avoid confusion with Jens'
> v2 that built on my v1. Jens, if you like this approach, please feel
> free to stack your additional patches from the io_uring-fd-select branch
> atop this series. 5.8 material, not intended for the current merge window.)
>
> Inspired by the X protocol's handling of XIDs, allow userspace to select
> the file descriptor opened by a call like openat2, so that it can use
> the resulting file descriptor in subsequent system calls without waiting
> for the response to the initial openat2 syscall.
>
> The first patch is independent of the other two; it allows reserving
> file descriptors below a certain minimum for userspace-selected fd
> allocation only.
>
> The second patch implements userspace-selected fd allocation for
> openat2, introducing a new O_SPECIFIC_FD flag and an fd field in struct
> open_how. In io_uring, this allows sequences like openat2/read/close
> without waiting for the openat2 to complete. Multiple such sequences can
> overlap, as long as each uses a distinct file descriptor.
>
> The third patch adds userspace-selected fd allocation to pipe2 as well.
> I did this partly as a demonstration of how simple it is to wire up
> O_SPECIFIC_FD support for any fd-allocating system call, and partly in
> the hopes that this may make it more useful to wire up io_uring support
> for pipe2 in the future.
>
> If this gets accepted, I'm happy to also write corresponding manpage
> patches.
>
> v3:
> This new version has an API to atomically increase the minimum fd and
> return the previous minimum, rather than just getting and setting the
> minimum; this makes it easier to allocate a range. (A library that might
> initialize after the program has already opened other file descriptors
> may need to check for existing open fds in the range after reserving it,
> and reserve more fds if needed; this can be done entirely in userspace,
> and we can't really do anything simpler in the kernel due to limitations
> on file-descriptor semantics, so this patch series avoids introducing
> any extra complexity in the kernel.)
>
> This new version also supports a __get_specific_unused_fd_flags call
> which accepts the limit for RLIMIT_NOFILE as an argument, analogous to
> __get_unused_fd_flags, since io_uring needs that to correctly handle
> RLIMIT_NOFILE.
>
> Josh Triplett (3):
> fs: Support setting a minimum fd for "lowest available fd" allocation
> fs: openat2: Extend open_how to allow userspace-selected fds
> fs: pipe2: Support O_SPECIFIC_FD

Aside from my specific comments and questions, the changes to openat2
deserve at least one or two selftests.

> fs/fcntl.c | 2 +-
> fs/file.c | 62 ++++++++++++++++++++++++++++----
> fs/io_uring.c | 3 +-
> fs/open.c | 6 ++--
> fs/pipe.c | 16 ++++++---
> include/linux/fcntl.h | 5 +--
> include/linux/fdtable.h | 1 +
> include/linux/file.h | 4 +++
> include/uapi/asm-generic/fcntl.h | 4 +++
> include/uapi/linux/openat2.h | 2 ++
> include/uapi/linux/prctl.h | 3 ++
> kernel/sys.c | 5 +++
> 12 files changed, 97 insertions(+), 16 deletions(-)

--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (3.45 kB)
signature.asc (235.00 B)
Download all attachments

2020-04-09 03:18:06

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v3 1/3] fs: Support setting a minimum fd for "lowest available fd" allocation

On Wed, Apr 08, 2020 at 10:00:40PM +1000, Aleksa Sarai wrote:
> On 2020-04-07, Josh Triplett <[email protected]> wrote:
> > Some applications want to prevent the usual "lowest available fd"
> > allocation from allocating certain file descriptors. For instance, they
> > may want to prevent allocation of a closed fd 0, 1, or 2 other than via
> > dup2/dup3, or reserve some low file descriptors for other purposes.
> >
> > Add a prctl to increase the minimum fd and return the previous minimum.
> >
> > System calls that allocate a specific file descriptor, such as
> > dup2/dup3, ignore this minimum.
> >
> > exec resets the minimum fd, to prevent one program from interfering with
> > another program's expectations about fd allocation.
>
> Why is it implemented as an "increase the value" interface? It feels
> like this is meant to avoid some kind of security trap (with a library
> reducing the value) but it means that if you want to temporarily raise
> the minimum fd number it's not possible (without re-exec()ing yourself,
> which is hardly a fun thing to do).
>
> Then again, this might've been discussed before and I missed it...

It was: the previous version was a "get" and "set" interface. That
interface didn't allow for the possibility that something else in the
process had already set a minimum. This new atomic increase interface
(which also serves as a "get" interface if you pass 0) makes it possible
for a userspace library to reserve a range. (You have no guarantee about
previously allocated descriptors in that range, but you know that no
*new* automatically allocated descriptors will appear in that range,
which suffices; userspace can do the rest.)

- Josh Triplett

2020-04-09 03:20:13

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] Support userspace-selected fds

On Wed, Apr 08, 2020 at 10:26:01PM +1000, Aleksa Sarai wrote:
> On 2020-04-07, Josh Triplett <[email protected]> wrote:
> > (Note: numbering this updated version v3, to avoid confusion with Jens'
> > v2 that built on my v1. Jens, if you like this approach, please feel
> > free to stack your additional patches from the io_uring-fd-select branch
> > atop this series. 5.8 material, not intended for the current merge window.)
> >
> > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > the file descriptor opened by a call like openat2, so that it can use
> > the resulting file descriptor in subsequent system calls without waiting
> > for the response to the initial openat2 syscall.
> >
> > The first patch is independent of the other two; it allows reserving
> > file descriptors below a certain minimum for userspace-selected fd
> > allocation only.
> >
> > The second patch implements userspace-selected fd allocation for
> > openat2, introducing a new O_SPECIFIC_FD flag and an fd field in struct
> > open_how. In io_uring, this allows sequences like openat2/read/close
> > without waiting for the openat2 to complete. Multiple such sequences can
> > overlap, as long as each uses a distinct file descriptor.
> >
> > The third patch adds userspace-selected fd allocation to pipe2 as well.
> > I did this partly as a demonstration of how simple it is to wire up
> > O_SPECIFIC_FD support for any fd-allocating system call, and partly in
> > the hopes that this may make it more useful to wire up io_uring support
> > for pipe2 in the future.
> >
> > If this gets accepted, I'm happy to also write corresponding manpage
> > patches.
> >
> > v3:
> > This new version has an API to atomically increase the minimum fd and
> > return the previous minimum, rather than just getting and setting the
> > minimum; this makes it easier to allocate a range. (A library that might
> > initialize after the program has already opened other file descriptors
> > may need to check for existing open fds in the range after reserving it,
> > and reserve more fds if needed; this can be done entirely in userspace,
> > and we can't really do anything simpler in the kernel due to limitations
> > on file-descriptor semantics, so this patch series avoids introducing
> > any extra complexity in the kernel.)
> >
> > This new version also supports a __get_specific_unused_fd_flags call
> > which accepts the limit for RLIMIT_NOFILE as an argument, analogous to
> > __get_unused_fd_flags, since io_uring needs that to correctly handle
> > RLIMIT_NOFILE.
> >
> > Josh Triplett (3):
> > fs: Support setting a minimum fd for "lowest available fd" allocation
> > fs: openat2: Extend open_how to allow userspace-selected fds
> > fs: pipe2: Support O_SPECIFIC_FD
>
> Aside from my specific comments and questions, the changes to openat2
> deserve at least one or two selftests.

Agreed. I don't expect this to get merged until it has tests and
manpage patches.

- Josh Triplett

2020-04-09 05:01:54

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On Wed, Apr 08, 2020 at 10:23:30PM +1000, Aleksa Sarai wrote:
> On 2020-04-07, Josh Triplett <[email protected]> wrote:
> > Signed-off-by: Josh Triplett <[email protected]>
> > Signed-off-by: Jens Axboe <[email protected]>
>
> Maybe I'm misunderstanding something, but this Signed-off-by looks
> strange -- was it Co-developed-by Jens?

No. I wrote v1, and Jens incorporated some of those patches into a v2
series that added signoffs. So when I took patch 2 from that series and
rebased it into this new series, I kept the signoff.

> > ---
> > fs/fcntl.c | 2 +-
> > fs/file.c | 39 ++++++++++++++++++++++++++++++++
> > fs/io_uring.c | 3 ++-
> > fs/open.c | 6 +++--
> > include/linux/fcntl.h | 5 ++--
> > include/linux/file.h | 3 +++
> > include/uapi/asm-generic/fcntl.h | 4 ++++
> > include/uapi/linux/openat2.h | 2 ++
> > 8 files changed, 58 insertions(+), 6 deletions(-)

> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)
> >
> > EXPORT_SYMBOL(put_unused_fd);
> >
> > +int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> > + unsigned long nofile)
> > +{
> > + int ret;
> > + struct fdtable *fdt;
> > + struct files_struct *files = current->files;
> > +
> > + if (!(flags & O_SPECIFIC_FD) || fd == -1)
> > + return __get_unused_fd_flags(flags, nofile);
>
> This check should just be (flags & O_SPECIFIC_FD) -- see my comment
> below about ->fd being negative.

This was intentional, and I think it makes sense to keep. I'd like to
maintain the same consistent behavior in any call that uses
O_SPECIFIC_FD: an fd of -1 means "go ahead and assign the lowest
available fd". This allows for calls like pipe2, or in the future
socketpair, where you might want to explicitly assign one end, but allow
the kernel to allocate the other end. -1 can never be a valid file
descriptor, since it would be interpreted as an errno instead. mmap
similarly treats -1 as an invalid file descriptor.

As an example use case: in one io_uring batch, you want to allocate a
pipe, send one end over a UNIX socket to another process, and close that
end in your process. So you need *that* end to be a userspace-allocated
file descriptor, so you know what fd to use in the subsequent sendmsg
and close calls. But for the other end that you plan to hold onto, you
don't want to use up one of your small number of preallocated file
descriptors (of which you only need enough for the operations in any one
io_uring batch).

However, in response to your comment below, I've changed this from -1 to
UINT_MAX.

> > +
> > + if (fd >= nofile)
> > + return -EBADF;
> > +
> > + spin_lock(&files->file_lock);
> > + ret = expand_files(files, fd);
> > + if (unlikely(ret < 0))
> > + goto out_unlock;
> > + fdt = files_fdtable(files);
> > + if (fdt->fd[fd]) {
> > + ret = -EBUSY;
> > + goto out_unlock;
>
> It would be remiss of me to not mention that this is inconsistent with
> the other way of explicitly picking a file descriptor on Unix -- the dup
> family closes newfd if it's already used.
>
> But that being said, I do actually prefer this behaviour since it means
> that two threads trying to open a file with the same specific file
> descriptor won't see the file descriptor change underneath them (leading
> to who knows how much head-scratching).

Exactly. I documented that difference explicitly in the commit message:

> The specified file descriptor must not already be open; if it is,
> get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
> userspace errors.

I also intend for that to appear in the manpage.

> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -958,7 +958,7 @@ EXPORT_SYMBOL(open_with_fake_path);
> > inline struct open_how build_open_how(int flags, umode_t mode)
> > {
> > struct open_how how = {
> > - .flags = flags & VALID_OPEN_FLAGS,
> > + .flags = flags & VALID_OPEN_FLAGS & ~O_SPECIFIC_FD,
>
> This is getting a little ugly, maybe filter out O_SPECIFIC_FD later on
> in build_open_how() -- where we handle O_PATH.

Sure; I'll move it down and add a comment.

> > .mode = mode & S_IALLUGO,
> > };
> >
> > @@ -1143,7 +1143,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
> > if (IS_ERR(tmp))
> > return PTR_ERR(tmp);
> >
> > - fd = get_unused_fd_flags(how->flags);
> > + fd = get_specific_unused_fd_flags(how->fd, how->flags);
> > if (fd >= 0) {
> > struct file *f = do_filp_open(dfd, tmp, &op);
> > if (IS_ERR(f)) {
> > @@ -1193,6 +1193,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
> > err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
> > if (err)
> > return err;
> > + if (tmp.pad != 0)
> > + return -EINVAL;
>
> This check should be done in build_open_flags(), where the other sanity
> checks are done.

Good idea; done.

> In addition, there must be an additional check like
>
> if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
> return -EINVAL;
>
> Since we must not allow garbage values to be passed and ignored by us in
> openat2().

Good catch! Added.

> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -89,6 +89,10 @@
> > #define __O_TMPFILE 020000000
> > #endif
> >
> > +#ifndef O_SPECIFIC_FD
> > +#define O_SPECIFIC_FD 01000000000 /* open as specified fd */
> > +#endif
>
> Maybe you've already done this (since you skipped several bits in the
> O_* flag space), but I would double-check that there is no conflict on
> other architectures. I faintly recall that FMODE_NOTIFY has a different
> value on sparc, and there was some oddness on alpha too... But as long
> as fcntl.c builds on all the arches then it's fine.

I checked other architectures carefully, and yes, I picked this value
because some of the intermediate values conflicted on specific
architectures.

> > --- a/include/uapi/linux/openat2.h
> > +++ b/include/uapi/linux/openat2.h
> > @@ -20,6 +20,8 @@ struct open_how {
> > __u64 flags;
> > __u64 mode;
> > __u64 resolve;
> > + __s32 fd;
> > + __u32 pad; /* Must be 0 in the current version */
>
> I'm not sure I see why the new field is an s32 -- there should be no
> situation where a user specifies O_SPECIFIC_FD and a negative file
> descriptor (if we keep it as an s32, you should get an -EINVAL in that
> case). If you don't want O_SPECIFIC_FD, don't specify it.
>
> But I think this should be a u32.

See my explanation above for why I do want to allow `-1` here.

However, there's no reason I need to make the field signed just because
of that one reserved value. I'll change the field to a __u32.

- Josh Triplett

2020-04-09 08:12:16

by Aleksa Sarai

[permalink] [raw]
Subject: Re: [PATCH v3 2/3] fs: openat2: Extend open_how to allow userspace-selected fds

On 2020-04-08, Aleksa Sarai <[email protected]> wrote:
> On 2020-04-07, Josh Triplett <[email protected]> wrote:
> > Inspired by the X protocol's handling of XIDs, allow userspace to select
> > the file descriptor opened by openat2, so that it can use the resulting
> > file descriptor in subsequent system calls without waiting for the
> > response to openat2.
> >
> > In io_uring, this allows sequences like openat2/read/close without
> > waiting for the openat2 to complete. Multiple such sequences can
> > overlap, as long as each uses a distinct file descriptor.
> >
> > Add a new O_SPECIFIC_FD open flag to enable this behavior, only accepted
> > by openat2 for now (ignored by open/openat like all unknown flags). Add
> > an fd field to struct open_how (along with appropriate padding, and
> > verify that the padding is 0 to allow replacing the padding with a field
> > in the future).
> >
> > The file table has a corresponding new function
> > get_specific_unused_fd_flags, which gets the specified file descriptor
> > if O_SPECIFIC_FD is set (and the fd isn't -1); otherwise it falls back
> > to get_unused_fd_flags, to simplify callers.
>
> >
> > The specified file descriptor must not already be open; if it is,
> > get_specific_unused_fd_flags will fail with -EBUSY. This helps catch
> > userspace errors.
> >
> > When O_SPECIFIC_FD is set, and fd is not -1, openat2 will use the
> > specified file descriptor rather than finding the lowest available one.
> >
> > Test program:
> >
> > #include <err.h>
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > int main(void)
> > {
> > struct open_how how = {
> > .flags = O_RDONLY | O_SPECIFIC_FD,
> > .fd = 42
> > };
> > int fd = openat2(AT_FDCWD, "/dev/null", &how, sizeof(how));
> > if (fd < 0)
> > err(1, "openat2");
> > printf("fd=%d\n", fd); // prints fd=42
> > return 0;
> > }
> >
> > Signed-off-by: Josh Triplett <[email protected]>
> > Signed-off-by: Jens Axboe <[email protected]>
>
> Maybe I'm misunderstanding something, but this Signed-off-by looks
> strange -- was it Co-developed-by Jens?
>
> > ---
> > fs/fcntl.c | 2 +-
> > fs/file.c | 39 ++++++++++++++++++++++++++++++++
> > fs/io_uring.c | 3 ++-
> > fs/open.c | 6 +++--
> > include/linux/fcntl.h | 5 ++--
> > include/linux/file.h | 3 +++
> > include/uapi/asm-generic/fcntl.h | 4 ++++
> > include/uapi/linux/openat2.h | 2 ++
> > 8 files changed, 58 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index 2e4c0fa2074b..0357ad667563 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
> > * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
> > * is defined as O_NONBLOCK on some platforms and not on others.
> > */
> > - BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
> > + BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
> > HWEIGHT32(
> > (VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
> > __FMODE_EXEC | __FMODE_NONOTIFY));
> > diff --git a/fs/file.c b/fs/file.c
> > index ba06140d89af..0c34206314dc 100644
> > --- a/fs/file.c
> > +++ b/fs/file.c
> > @@ -567,6 +567,45 @@ void put_unused_fd(unsigned int fd)
> >
> > EXPORT_SYMBOL(put_unused_fd);
> >
> > +int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> > + unsigned long nofile)
> > +{
> > + int ret;
> > + struct fdtable *fdt;
> > + struct files_struct *files = current->files;
> > +
> > + if (!(flags & O_SPECIFIC_FD) || fd == -1)
> > + return __get_unused_fd_flags(flags, nofile);
>
> This check should just be (flags & O_SPECIFIC_FD) -- see my comment
> below about ->fd being negative.

Ah, I missed that the pipe2 patch allows you to choose which fd is being
specifically chosen by setting it to -1. So this check is fine but my
comment for openat2 still applies.

> > +
> > + if (fd >= nofile)
> > + return -EBADF;
> > +
> > + spin_lock(&files->file_lock);
> > + ret = expand_files(files, fd);
> > + if (unlikely(ret < 0))
> > + goto out_unlock;
> > + fdt = files_fdtable(files);
> > + if (fdt->fd[fd]) {
> > + ret = -EBUSY;
> > + goto out_unlock;
>
> It would be remiss of me to not mention that this is inconsistent with
> the other way of explicitly picking a file descriptor on Unix -- the dup
> family closes newfd if it's already used.
>
> But that being said, I do actually prefer this behaviour since it means
> that two threads trying to open a file with the same specific file
> descriptor won't see the file descriptor change underneath them (leading
> to who knows how much head-scratching).
>
> > + }
> > + __set_open_fd(fd, fdt);
> > + if (flags & O_CLOEXEC)
> > + __set_close_on_exec(fd, fdt);
> > + else
> > + __clear_close_on_exec(fd, fdt);
> > + ret = fd;
> > +
> > +out_unlock:
> > + spin_unlock(&files->file_lock);
> > + return ret;
> > +}
> > +
> > +int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags)
> > +{
> > + return __get_specific_unused_fd_flags(fd, flags, rlimit(RLIMIT_NOFILE));
> > +}
> > +
> > /*
> > * Install a file pointer in the fd array.
> > *
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 358f97be9c7b..4a69e1daf3fe 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -2997,7 +2997,8 @@ static int io_openat2(struct io_kiocb *req, bool force_nonblock)
> > if (ret)
> > goto err;
> >
> > - ret = __get_unused_fd_flags(req->open.how.flags, req->open.nofile);
> > + ret = __get_specific_unused_fd_flags(req->open.how.fd,
> > + req->open.how.flags, req->open.nofile);
> > if (ret < 0)
> > goto err;
> >
> > diff --git a/fs/open.c b/fs/open.c
> > index 719b320ede52..d4225e6f9d4c 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -958,7 +958,7 @@ EXPORT_SYMBOL(open_with_fake_path);
> > inline struct open_how build_open_how(int flags, umode_t mode)
> > {
> > struct open_how how = {
> > - .flags = flags & VALID_OPEN_FLAGS,
> > + .flags = flags & VALID_OPEN_FLAGS & ~O_SPECIFIC_FD,
>
> This is getting a little ugly, maybe filter out O_SPECIFIC_FD later on
> in build_open_how() -- where we handle O_PATH.
>
> > .mode = mode & S_IALLUGO,
> > };
> >
> > @@ -1143,7 +1143,7 @@ static long do_sys_openat2(int dfd, const char __user *filename,
> > if (IS_ERR(tmp))
> > return PTR_ERR(tmp);
> >
> > - fd = get_unused_fd_flags(how->flags);
> > + fd = get_specific_unused_fd_flags(how->fd, how->flags);
> > if (fd >= 0) {
> > struct file *f = do_filp_open(dfd, tmp, &op);
> > if (IS_ERR(f)) {
> > @@ -1193,6 +1193,8 @@ SYSCALL_DEFINE4(openat2, int, dfd, const char __user *, filename,
> > err = copy_struct_from_user(&tmp, sizeof(tmp), how, usize);
> > if (err)
> > return err;
> > + if (tmp.pad != 0)
> > + return -EINVAL;
>
> This check should be done in build_open_flags(), where the other sanity
> checks are done. In addition, there must be an additional check like
>
> if (!(flags & O_SPECIFIC_FD) && how->fd != 0)
> return -EINVAL;
>
> Since we must not allow garbage values to be passed and ignored by us in
> openat2().
>
> > /* O_LARGEFILE is only allowed for non-O_PATH. */
> > if (!(tmp.flags & O_PATH) && force_o_largefile())
> > diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> > index 7bcdcf4f6ab2..728849bcd8fa 100644
> > --- a/include/linux/fcntl.h
> > +++ b/include/linux/fcntl.h
> > @@ -10,7 +10,7 @@
> > (O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
> > O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
> > FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> > - O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> > + O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_SPECIFIC_FD)
> >
> > /* List of all valid flags for the how->upgrade_mask argument: */
> > #define VALID_UPGRADE_FLAGS \
> > @@ -23,7 +23,8 @@
> >
> > /* List of all open_how "versions". */
> > #define OPEN_HOW_SIZE_VER0 24 /* sizeof first published struct */
> > -#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER0
> > +#define OPEN_HOW_SIZE_VER1 32 /* added fd and pad */
> > +#define OPEN_HOW_SIZE_LATEST OPEN_HOW_SIZE_VER1
> >
> > #ifndef force_o_largefile
> > #define force_o_largefile() (!IS_ENABLED(CONFIG_ARCH_32BIT_OFF_T))
> > diff --git a/include/linux/file.h b/include/linux/file.h
> > index b67986f818d2..a63301864a36 100644
> > --- a/include/linux/file.h
> > +++ b/include/linux/file.h
> > @@ -87,6 +87,9 @@ extern void set_close_on_exec(unsigned int fd, int flag);
> > extern bool get_close_on_exec(unsigned int fd);
> > extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
> > extern int get_unused_fd_flags(unsigned flags);
> > +extern int __get_specific_unused_fd_flags(unsigned int fd, unsigned int flags,
> > + unsigned long nofile);
> > +extern int get_specific_unused_fd_flags(unsigned int fd, unsigned int flags);
> > extern void put_unused_fd(unsigned int fd);
> > extern unsigned int increase_min_fd(unsigned int num);
> >
> > diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> > index 9dc0bf0c5a6e..d3de5b8b3955 100644
> > --- a/include/uapi/asm-generic/fcntl.h
> > +++ b/include/uapi/asm-generic/fcntl.h
> > @@ -89,6 +89,10 @@
> > #define __O_TMPFILE 020000000
> > #endif
> >
> > +#ifndef O_SPECIFIC_FD
> > +#define O_SPECIFIC_FD 01000000000 /* open as specified fd */
> > +#endif
>
> Maybe you've already done this (since you skipped several bits in the
> O_* flag space), but I would double-check that there is no conflict on
> other architectures. I faintly recall that FMODE_NOTIFY has a different
> value on sparc, and there was some oddness on alpha too... But as long
> as fcntl.c builds on all the arches then it's fine.
>
> > +
> > /* a horrid kludge trying to make sure that this will fail on old kernels */
> > #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
> > #define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
> > diff --git a/include/uapi/linux/openat2.h b/include/uapi/linux/openat2.h
> > index 58b1eb711360..50d1206b64c2 100644
> > --- a/include/uapi/linux/openat2.h
> > +++ b/include/uapi/linux/openat2.h
> > @@ -20,6 +20,8 @@ struct open_how {
> > __u64 flags;
> > __u64 mode;
> > __u64 resolve;
> > + __s32 fd;
> > + __u32 pad; /* Must be 0 in the current version */
>
> I'm not sure I see why the new field is an s32 -- there should be no
> situation where a user specifies O_SPECIFIC_FD and a negative file
> descriptor (if we keep it as an s32, you should get an -EINVAL in that
> case). If you don't want O_SPECIFIC_FD, don't specify it.
>
> But I think this should be a u32. I'm tempted to argue this should
> actually be a u64, but nothing supports 64-bit file descriptor numbers
> (including the return type of openat2).
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>




--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>


Attachments:
(No filename) (11.25 kB)
signature.asc (235.00 B)
Download all attachments