Series of 4 patches to implement close-on-fork. Tests have been
published to https://github.com/nkarstens/ltp/tree/close-on-fork
and cover close-on-fork functionality in the following syscalls:
* accept(4)
* dup3(2)
* fcntl(2)
* open(2)
* socket(2)
* socketpair(2)
* unshare(2)
Addresses underlying issue in that there is no way to prevent
a fork() from duplicating a file descriptor. The existing
close-on-exec flag partially-addresses this by allowing the
parent process to mark a file descriptor as exclusive to itself,
but there is still a period of time the failure can occur
because the auto-close only occurs during the exec().
One manifestation of this is a race conditions in system(), which
(depending on the implementation) is non-atomic in that it first
calls a fork() and then an exec().
This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).
---
This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
for the original work.
Thanks to everyone who provided comments on the first series of
patches. Here are replies to specific comments:
> I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> together, so that we do not have to dirty two separate cache lines.
I could be mistaken, but I don't think this would improve efficiency.
The close-on-fork and close-on-exec flags are read at different
times. If you assume separate syscalls for fork and exec then
there are several switches between when the two flags are read.
In addition, the close-on-fork flags in the new process must be
cleared, which will be much harder if the flags are interleaved.
> Also the F_GETFD/F_SETFD implementation must use a single function call,
> to not acquire the spinlock twice.
Good point, done.
> How about only allocating the 'close on fork' bitmap the first time
> a process sets a bit in it?
I looked into it and there are side effects I dont't think we want.
For example, if fcntl is used to set the close-on-fork flag, then
there is a chance that it cannot allocate memory, and so we'd have
to return ENOMEM. Seems cleaner to allocate memory up front so that
we know the file has all of the memory it needs.
> You should be able to use the same 'close the fds in this bitmap'
> function for both cases.
I looked into this and I think it is more efficient to prevent the
new process from having a reference to the open file than it is to
temporarily give the new process a reference and then close it later.
> I'm not sure dup_fd() is the best place to check the close-on-fork flag.
> For example, the ksys_unshare() > unshare_fd() > dup_fd() execution path
> seems suspect.
I have a better understanding of clone(2)/unshare(2) now and believe
that dup_fd() is the appropriate place to handle this. clone(2) with
CLONE_FILES set intentionally shares the file descriptor table, so
close-on-fork should not impact that. However, if unshare(2) is later
used to unshare the file descriptor table then the process calling
unshare(2) should automatically close its copy of any file descriptor
with close-on-fork set.
> If the close-on-fork flag is set, then __clear_open_fd() should be
> called instead of just __clear_bit(). This will ensure that
> fdt->full_fds_bits() is updated.
Done. It falls through to the case where the file had not finished
opening yet and leverages its call to __clear_open_fd().
> Need to investigate if the close-on-fork (or close-on-exec) flags
> need to be cleared when the file is closed as part of the
> close-on-fork execution path.
Done. The new file descriptor table starts with all close-on-fork
flags being cleared and dup_fd() gets the close-on-fork flag from
the old file descriptor table.
Add the O_CLOFORK flag to open(2) and dup3(2) to automatically
set the close-on-fork flag in the new file descriptor, saving
a separate call to fcntl(2).
Co-developed-by: Changli Gao <[email protected]>
Signed-off-by: Changli Gao <[email protected]>
Signed-off-by: Nate Karstens <[email protected]>
---
arch/alpha/include/uapi/asm/fcntl.h | 2 ++
arch/parisc/include/uapi/asm/fcntl.h | 39 +++++++++++++-------------
arch/sparc/include/uapi/asm/fcntl.h | 1 +
fs/fcntl.c | 2 +-
fs/file.c | 10 ++++++-
include/linux/fcntl.h | 2 +-
include/uapi/asm-generic/fcntl.h | 4 +++
tools/include/uapi/asm-generic/fcntl.h | 4 +++
8 files changed, 42 insertions(+), 22 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 50bdc8e8a271..fbab69b15f7f 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -35,6 +35,8 @@
#define O_PATH 040000000
#define __O_TMPFILE 0100000000
+#define O_CLOFORK 0200000000 /* set close_on_fork */
+
#define F_GETLK 7
#define F_SETLK 8
#define F_SETLKW 9
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 03ce20e5ad7d..8f5989e75b05 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -2,26 +2,27 @@
#ifndef _PARISC_FCNTL_H
#define _PARISC_FCNTL_H
-#define O_APPEND 000000010
-#define O_BLKSEEK 000000100 /* HPUX only */
-#define O_CREAT 000000400 /* not fcntl */
-#define O_EXCL 000002000 /* not fcntl */
-#define O_LARGEFILE 000004000
-#define __O_SYNC 000100000
+#define O_APPEND 0000000010
+#define O_BLKSEEK 0000000100 /* HPUX only */
+#define O_CREAT 0000000400 /* not fcntl */
+#define O_EXCL 0000002000 /* not fcntl */
+#define O_LARGEFILE 0000004000
+#define __O_SYNC 0000100000
#define O_SYNC (__O_SYNC|O_DSYNC)
-#define O_NONBLOCK 000200004 /* HPUX has separate NDELAY & NONBLOCK */
-#define O_NOCTTY 000400000 /* not fcntl */
-#define O_DSYNC 001000000 /* HPUX only */
-#define O_RSYNC 002000000 /* HPUX only */
-#define O_NOATIME 004000000
-#define O_CLOEXEC 010000000 /* set close_on_exec */
-
-#define O_DIRECTORY 000010000 /* must be a directory */
-#define O_NOFOLLOW 000000200 /* don't follow links */
-#define O_INVISIBLE 004000000 /* invisible I/O, for DMAPI/XDSM */
-
-#define O_PATH 020000000
-#define __O_TMPFILE 040000000
+#define O_NONBLOCK 0000200004 /* HPUX has separate NDELAY & NONBLOCK */
+#define O_NOCTTY 0000400000 /* not fcntl */
+#define O_DSYNC 0001000000 /* HPUX only */
+#define O_RSYNC 0002000000 /* HPUX only */
+#define O_NOATIME 0004000000
+#define O_CLOEXEC 0010000000 /* set close_on_exec */
+
+#define O_DIRECTORY 0000010000 /* must be a directory */
+#define O_NOFOLLOW 0000000200 /* don't follow links */
+#define O_INVISIBLE 0004000000 /* invisible I/O, for DMAPI/XDSM */
+
+#define O_PATH 0020000000
+#define __O_TMPFILE 0040000000
+#define O_CLOFORK 0100000000
#define F_GETLK64 8
#define F_SETLK64 9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 67dae75e5274..d631ea13bac3 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -37,6 +37,7 @@
#define O_PATH 0x1000000
#define __O_TMPFILE 0x2000000
+#define O_CLOFORK 0x4000000
#define F_GETOWN 5 /* for sockets. */
#define F_SETOWN 6 /* for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 913b0cb70804..40af2a48702b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1033,7 +1033,7 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
+ BUILD_BUG_ON(22 - 1 /* for O_RDONLY being 0 */ !=
HWEIGHT32(
(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
__FMODE_EXEC | __FMODE_NONOTIFY));
diff --git a/fs/file.c b/fs/file.c
index 81194349e980..4a1b84eecec6 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -547,6 +547,10 @@ int __alloc_fd(struct files_struct *files,
__set_close_on_exec(fd, fdt);
else
__clear_close_on_exec(fd, fdt);
+ if (flags & O_CLOFORK)
+ __set_close_on_fork(fd, fdt);
+ else
+ __clear_close_on_fork(fd, fdt);
error = fd;
#if 1
/* Sanity check */
@@ -953,6 +957,10 @@ __releases(&files->file_lock)
__set_close_on_exec(fd, fdt);
else
__clear_close_on_exec(fd, fdt);
+ if (flags & O_CLOFORK)
+ __set_close_on_fork(fd, fdt);
+ else
+ __clear_close_on_fork(fd, fdt);
spin_unlock(&files->file_lock);
if (tofree)
@@ -993,7 +1001,7 @@ static int ksys_dup3(unsigned int oldfd, unsigned int newfd, int flags)
struct file *file;
struct files_struct *files = current->files;
- if ((flags & ~O_CLOEXEC) != 0)
+ if ((flags & ~(O_CLOEXEC | O_CLOFORK)) != 0)
return -EINVAL;
if (unlikely(oldfd == newfd))
diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
index 7bcdcf4f6ab2..cd4c625647db 100644
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,7 +10,7 @@
(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
O_APPEND | O_NDELAY | O_NONBLOCK | O_NDELAY | __O_SYNC | O_DSYNC | \
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
- O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
+ O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_CLOFORK)
/* List of all valid flags for the how->upgrade_mask argument: */
#define VALID_UPGRADE_FLAGS \
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 0cb7199a7743..165a0736a3aa 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -89,6 +89,10 @@
#define __O_TMPFILE 020000000
#endif
+#ifndef O_CLOFORK
+#define O_CLOFORK 040000000 /* set close_on_fork */
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index e04a00fecb4a..69d8a000ec65 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -88,6 +88,10 @@
#define __O_TMPFILE 020000000
#endif
+#ifndef O_CLOFORK
+#define O_CLOFORK 040000000 /* set close_on_fork */
+#endif
+
/* a horrid kludge trying to make sure that this will fail on old kernels */
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)
--
2.26.1
Implements a new socket flag that automatically sets the
close-on-fork flag for sockets created using socket(2),
socketpair(2), and accept4(2).
Signed-off-by: Nate Karstens <[email protected]>
---
include/linux/net.h | 3 ++-
net/socket.c | 14 ++++++++------
2 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/include/linux/net.h b/include/linux/net.h
index 6451425e828f..57663c9dc8c4 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -17,7 +17,7 @@
#include <linux/stringify.h>
#include <linux/random.h>
#include <linux/wait.h>
-#include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */
+#include <linux/fcntl.h> /* For O_CLOEXEC, O_CLOFORK, and O_NONBLOCK */
#include <linux/rcupdate.h>
#include <linux/once.h>
#include <linux/fs.h>
@@ -73,6 +73,7 @@ enum sock_type {
/* Flags for socket, socketpair, accept4 */
#define SOCK_CLOEXEC O_CLOEXEC
+#define SOCK_CLOFORK O_CLOFORK
#ifndef SOCK_NONBLOCK
#define SOCK_NONBLOCK O_NONBLOCK
#endif
diff --git a/net/socket.c b/net/socket.c
index 2eecf1517f76..ba6e971c7e78 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1511,12 +1511,14 @@ int __sys_socket(int family, int type, int protocol)
/* Check the SOCK_* constants for consistency. */
BUILD_BUG_ON(SOCK_CLOEXEC != O_CLOEXEC);
+ BUILD_BUG_ON(SOCK_CLOFORK != O_CLOFORK);
BUILD_BUG_ON((SOCK_MAX | SOCK_TYPE_MASK) != SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_CLOEXEC & SOCK_TYPE_MASK);
+ BUILD_BUG_ON(SOCK_CLOFORK & SOCK_TYPE_MASK);
BUILD_BUG_ON(SOCK_NONBLOCK & SOCK_TYPE_MASK);
flags = type & ~SOCK_TYPE_MASK;
- if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+ if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
return -EINVAL;
type &= SOCK_TYPE_MASK;
@@ -1527,7 +1529,7 @@ int __sys_socket(int family, int type, int protocol)
if (retval < 0)
return retval;
- return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
+ return sock_map_fd(sock, flags & (O_CLOEXEC | O_CLOFORK | O_NONBLOCK));
}
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
@@ -1547,7 +1549,7 @@ int __sys_socketpair(int family, int type, int protocol, int __user *usockvec)
int flags;
flags = type & ~SOCK_TYPE_MASK;
- if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+ if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
return -EINVAL;
type &= SOCK_TYPE_MASK;
@@ -1715,7 +1717,7 @@ int __sys_accept4_file(struct file *file, unsigned file_flags,
int err, len, newfd;
struct sockaddr_storage address;
- if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
+ if (flags & ~(SOCK_CLOEXEC | SOCK_CLOFORK | SOCK_NONBLOCK))
return -EINVAL;
if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
@@ -3628,8 +3630,8 @@ EXPORT_SYMBOL(kernel_listen);
* @newsock: new connected socket
* @flags: flags
*
- * @flags must be SOCK_CLOEXEC, SOCK_NONBLOCK or 0.
- * If it fails, @newsock is guaranteed to be %NULL.
+ * @flags must be SOCK_CLOEXEC, SOCK_CLOFORK, SOCK_NONBLOCK,
+ * or 0. If it fails, @newsock is guaranteed to be %NULL.
* Returns 0 or an error.
*/
--
2.26.1
Implement functionality for duplicating a file descriptor
and having the close-on-fork flag automatically set in the
new file descriptor.
Signed-off-by: Nate Karstens <[email protected]>
---
fs/fcntl.c | 4 ++++
include/uapi/linux/fcntl.h | 3 +++
tools/include/uapi/linux/fcntl.h | 3 +++
3 files changed, 10 insertions(+)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 40af2a48702b..e15cdd77b1df 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -333,6 +333,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
case F_DUPFD_CLOEXEC:
err = f_dupfd(arg, filp, O_CLOEXEC);
break;
+ case F_DUPFD_CLOFORK:
+ err = f_dupfd(arg, filp, O_CLOFORK);
+ break;
case F_GETFD:
err = f_getfd(fd);
break;
@@ -437,6 +440,7 @@ static int check_fcntl_cmd(unsigned cmd)
switch (cmd) {
case F_DUPFD:
case F_DUPFD_CLOEXEC:
+ case F_DUPFD_CLOFORK:
case F_GETFD:
case F_SETFD:
case F_GETFL:
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
#define F_GET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 13)
#define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14)
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK (F_LINUX_SPECIFIC_BASE + 15)
+
/*
* Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
* used to clear any hints previously set.
diff --git a/tools/include/uapi/linux/fcntl.h b/tools/include/uapi/linux/fcntl.h
index ca88b7bce553..9e1069ff3a22 100644
--- a/tools/include/uapi/linux/fcntl.h
+++ b/tools/include/uapi/linux/fcntl.h
@@ -55,6 +55,9 @@
#define F_GET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 13)
#define F_SET_FILE_RW_HINT (F_LINUX_SPECIFIC_BASE + 14)
+/* Create a file descriptor with FD_CLOFORK set. */
+#define F_DUPFD_CLOFORK (F_LINUX_SPECIFIC_BASE + 15)
+
/*
* Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
* used to clear any hints previously set.
--
2.26.1
The close-on-fork flag causes the file descriptor to be closed
atomically in the child process before the child process returns
from fork(). Implement this feature and provide a method to
get/set the close-on-fork flag using fcntl(2).
This functionality was approved by the Austin Common Standards
Revision Group for inclusion in the next revision of the POSIX
standard (see issue 1318 in the Austin Group Defect Tracker).
If clone(2) is used to create a child process and the CLONE_FILES
flag is set, then both processes will share the table of file
descriptors and the state of the close-on-fork flag for any
individual file descriptor. If unshare(2) is later used to stop
sharing the file descriptor table, then any file descriptor with
the close-on-fork flag set will be closed in the process that
calls unshare(2).
execve(2) also causes the file descriptor table to be unshared,
so any file descriptor with the close-on-fork flag set will be
closed in the process that calls execve(2).
Co-developed-by: Changli Gao <[email protected]>
Signed-off-by: Changli Gao <[email protected]>
Signed-off-by: Nate Karstens <[email protected]>
---
fs/fcntl.c | 4 +-
fs/file.c | 64 ++++++++++++++++++++++++--
include/linux/fdtable.h | 7 +++
include/linux/file.h | 2 +
include/uapi/asm-generic/fcntl.h | 5 +-
tools/include/uapi/asm-generic/fcntl.h | 5 +-
6 files changed, 77 insertions(+), 10 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 2e4c0fa2074b..913b0cb70804 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -334,11 +334,11 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
err = f_dupfd(arg, filp, O_CLOEXEC);
break;
case F_GETFD:
- err = get_close_on_exec(fd) ? FD_CLOEXEC : 0;
+ err = f_getfd(fd);
break;
case F_SETFD:
err = 0;
- set_close_on_exec(fd, arg & FD_CLOEXEC);
+ f_setfd(fd, arg);
break;
case F_GETFL:
err = filp->f_flags;
diff --git a/fs/file.c b/fs/file.c
index c8a4e4c86e55..81194349e980 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -47,7 +47,7 @@ static void free_fdtable_rcu(struct rcu_head *rcu)
* spinlock held for write.
*/
static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
- unsigned int count)
+ unsigned int count, bool copy_cof)
{
unsigned int cpy, set;
@@ -58,6 +58,13 @@ static void copy_fd_bitmaps(struct fdtable *nfdt, struct fdtable *ofdt,
memcpy(nfdt->close_on_exec, ofdt->close_on_exec, cpy);
memset((char *)nfdt->close_on_exec + cpy, 0, set);
+ if (copy_cof) {
+ memcpy(nfdt->close_on_fork, ofdt->close_on_fork, cpy);
+ memset((char *)nfdt->close_on_fork + cpy, 0, set);
+ } else {
+ memset((char *)nfdt->close_on_fork, 0, cpy + set);
+ }
+
cpy = BITBIT_SIZE(count);
set = BITBIT_SIZE(nfdt->max_fds) - cpy;
memcpy(nfdt->full_fds_bits, ofdt->full_fds_bits, cpy);
@@ -79,7 +86,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
memcpy(nfdt->fd, ofdt->fd, cpy);
memset((char *)nfdt->fd + cpy, 0, set);
- copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+ copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds, true);
}
static struct fdtable * alloc_fdtable(unsigned int nr)
@@ -118,7 +125,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
fdt->fd = data;
data = kvmalloc(max_t(size_t,
- 2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
+ 3 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr), L1_CACHE_BYTES),
GFP_KERNEL_ACCOUNT);
if (!data)
goto out_arr;
@@ -126,6 +133,8 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
data += nr / BITS_PER_BYTE;
fdt->close_on_exec = data;
data += nr / BITS_PER_BYTE;
+ fdt->close_on_fork = data;
+ data += nr / BITS_PER_BYTE;
fdt->full_fds_bits = data;
return fdt;
@@ -236,6 +245,17 @@ static inline void __clear_close_on_exec(unsigned int fd, struct fdtable *fdt)
__clear_bit(fd, fdt->close_on_exec);
}
+static inline void __set_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+ __set_bit(fd, fdt->close_on_fork);
+}
+
+static inline void __clear_close_on_fork(unsigned int fd, struct fdtable *fdt)
+{
+ if (test_bit(fd, fdt->close_on_fork))
+ __clear_bit(fd, fdt->close_on_fork);
+}
+
static inline void __set_open_fd(unsigned int fd, struct fdtable *fdt)
{
__set_bit(fd, fdt->open_fds);
@@ -290,6 +310,7 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
new_fdt = &newf->fdtab;
new_fdt->max_fds = NR_OPEN_DEFAULT;
new_fdt->close_on_exec = newf->close_on_exec_init;
+ new_fdt->close_on_fork = newf->close_on_fork_init;
new_fdt->open_fds = newf->open_fds_init;
new_fdt->full_fds_bits = newf->full_fds_bits_init;
new_fdt->fd = &newf->fd_array[0];
@@ -330,13 +351,17 @@ struct files_struct *dup_fd(struct files_struct *oldf, int *errorp)
open_files = count_open_files(old_fdt);
}
- copy_fd_bitmaps(new_fdt, old_fdt, open_files);
+ copy_fd_bitmaps(new_fdt, old_fdt, open_files, false);
old_fds = old_fdt->fd;
new_fds = new_fdt->fd;
for (i = open_files; i != 0; i--) {
struct file *f = *old_fds++;
+
+ if (close_on_fork(open_files - i, old_fdt))
+ f = NULL;
+
if (f) {
get_file(f);
} else {
@@ -453,6 +478,7 @@ struct files_struct init_files = {
.max_fds = NR_OPEN_DEFAULT,
.fd = &init_files.fd_array[0],
.close_on_exec = init_files.close_on_exec_init,
+ .close_on_fork = init_files.close_on_fork_init,
.open_fds = init_files.open_fds_init,
.full_fds_bits = init_files.full_fds_bits_init,
},
@@ -840,6 +866,36 @@ void __f_unlock_pos(struct file *f)
* file count (done either by fdget() or by fork()).
*/
+void f_setfd(unsigned int fd, int flags)
+{
+ struct files_struct *files = current->files;
+ struct fdtable *fdt;
+ spin_lock(&files->file_lock);
+ fdt = files_fdtable(files);
+ if (flags & FD_CLOEXEC)
+ __set_close_on_exec(fd, fdt);
+ else
+ __clear_close_on_exec(fd, fdt);
+ if (flags & FD_CLOFORK)
+ __set_close_on_fork(fd, fdt);
+ else
+ __clear_close_on_fork(fd, fdt);
+ spin_unlock(&files->file_lock);
+}
+
+int f_getfd(unsigned int fd)
+{
+ struct files_struct *files = current->files;
+ struct fdtable *fdt;
+ int flags;
+ rcu_read_lock();
+ fdt = files_fdtable(files);
+ flags = (close_on_exec(fd, fdt) ? FD_CLOEXEC : 0) |
+ (close_on_fork(fd, fdt) ? FD_CLOFORK : 0);
+ rcu_read_unlock();
+ return flags;
+}
+
void set_close_on_exec(unsigned int fd, int flag)
{
struct files_struct *files = current->files;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..61c551947fa3 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -27,6 +27,7 @@ struct fdtable {
unsigned int max_fds;
struct file __rcu **fd; /* current fd array */
unsigned long *close_on_exec;
+ unsigned long *close_on_fork;
unsigned long *open_fds;
unsigned long *full_fds_bits;
struct rcu_head rcu;
@@ -37,6 +38,11 @@ static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)
return test_bit(fd, fdt->close_on_exec);
}
+static inline bool close_on_fork(unsigned int fd, const struct fdtable *fdt)
+{
+ return test_bit(fd, fdt->close_on_fork);
+}
+
static inline bool fd_is_open(unsigned int fd, const struct fdtable *fdt)
{
return test_bit(fd, fdt->open_fds);
@@ -61,6 +67,7 @@ struct files_struct {
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd;
unsigned long close_on_exec_init[1];
+ unsigned long close_on_fork_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
diff --git a/include/linux/file.h b/include/linux/file.h
index 142d102f285e..0ee15ee24010 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -83,6 +83,8 @@ static inline void fdput_pos(struct fd f)
extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
+extern int f_getfd(unsigned int fd);
+extern void f_setfd(unsigned int fd, int flags);
extern void set_close_on_exec(unsigned int fd, int flag);
extern bool get_close_on_exec(unsigned int fd);
extern int __get_unused_fd_flags(unsigned flags, unsigned long nofile);
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 9dc0bf0c5a6e..0cb7199a7743 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -98,8 +98,8 @@
#endif
#define F_DUPFD 0 /* dup */
-#define F_GETFD 1 /* get close_on_exec */
-#define F_SETFD 2 /* set/clear close_on_exec */
+#define F_GETFD 1 /* get close_on_exec & close_on_fork */
+#define F_SETFD 2 /* set/clear close_on_exec & close_on_fork */
#define F_GETFL 3 /* get file->f_flags */
#define F_SETFL 4 /* set file->f_flags */
#ifndef F_GETLK
@@ -160,6 +160,7 @@ struct f_owner_ex {
/* for F_[GET|SET]FL */
#define FD_CLOEXEC 1 /* actually anything with low bit set goes */
+#define FD_CLOFORK 2
/* for posix fcntl() and lockf() */
#ifndef F_RDLCK
diff --git a/tools/include/uapi/asm-generic/fcntl.h b/tools/include/uapi/asm-generic/fcntl.h
index ac190958c981..e04a00fecb4a 100644
--- a/tools/include/uapi/asm-generic/fcntl.h
+++ b/tools/include/uapi/asm-generic/fcntl.h
@@ -97,8 +97,8 @@
#endif
#define F_DUPFD 0 /* dup */
-#define F_GETFD 1 /* get close_on_exec */
-#define F_SETFD 2 /* set/clear close_on_exec */
+#define F_GETFD 1 /* get close_on_exec & close_on_fork */
+#define F_SETFD 2 /* set/clear close_on_exec & close_on_fork */
#define F_GETFL 3 /* get file->f_flags */
#define F_SETFL 4 /* set file->f_flags */
#ifndef F_GETLK
@@ -159,6 +159,7 @@ struct f_owner_ex {
/* for F_[GET|SET]FL */
#define FD_CLOEXEC 1 /* actually anything with low bit set goes */
+#define FD_CLOFORK 2
/* for posix fcntl() and lockf() */
#ifndef F_RDLCK
--
2.26.1
On Fri, May 15, 2020 at 8:23 AM Nate Karstens <[email protected]> wrote:
>
>
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:
>
> * accept(4)
> * dup3(2)
> * fcntl(2)
> * open(2)
> * socket(2)
> * socketpair(2)
> * unshare(2)
>
> Addresses underlying issue in that there is no way to prevent
> a fork() from duplicating a file descriptor. The existing
> close-on-exec flag partially-addresses this by allowing the
> parent process to mark a file descriptor as exclusive to itself,
> but there is still a period of time the failure can occur
> because the auto-close only occurs during the exec().
>
> One manifestation of this is a race conditions in system(), which
> (depending on the implementation) is non-atomic in that it first
> calls a fork() and then an exec().
>
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
>
> ---
>
> This is v2 of the change. See https://lkml.org/lkml/2020/4/20/113
> for the original work.
>
> Thanks to everyone who provided comments on the first series of
> patches. Here are replies to specific comments:
>
> > I suggest we group the two bits of a file (close_on_exec, close_on_fork)
> > together, so that we do not have to dirty two separate cache lines.
>
> I could be mistaken, but I don't think this would improve efficiency.
> The close-on-fork and close-on-exec flags are read at different
> times. If you assume separate syscalls for fork and exec then
> there are several switches between when the two flags are read.
> In addition, the close-on-fork flags in the new process must be
> cleared, which will be much harder if the flags are interleaved.
:/
Fast path in big and performance sensitive applications is not fork()
and/or exec().
This is open()/close() and others (socket(), accept(), ...)
We do not want them to access extra cache lines for this new feature.
Sorry, I will say no to these patches in their current form.
On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:
[...]
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
NAK to this patch series, and the entire concept.
Is there a way to persuade POSIX that they made a bad decision by
standardising this mess?
From: Eric Dumazet
> Sent: 15 May 2020 16:31
...
> Fast path in big and performance sensitive applications is not fork()
> and/or exec().
>
> This is open()/close() and others (socket(), accept(), ...)
>
> We do not want them to access extra cache lines for this new feature.
>
> Sorry, I will say no to these patches in their current form.
Is it worth completely removing the bitmaps and just remembering
the lowest fd that has had each bit set (don't worry about clears).
Then leverage the close_all() code that closes all fd above
a specified number to close only those with the 'close on exec'
or 'close on fork' flag set.
After all an application is currently very likely to have set
'close on exec' on all open fd above 2.
So the number of fd that don't need closing is small.
This puts all the expensive code in the already slow fork/exec
paths.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
It penalizes every call of fork() in the system (as well as adds
an extra dirtied cacheline on each socket()/open()/etc.), adds
memory footprint and complicates the API. All of that - to deal
with rather uncommon problem that already has a portable solution.
As for the Austin Group, the only authority it has ever had derives
from consensus between existing Unices. "Solaris does it, Linux and
*BSD do not" translates into "Austin Group is welcome to take a hike".
BTW, contrary to the lovely bit of misrepresentation in that
thread of theirs ("<LWN URL> suggests that" != "someone's comment
under LWN article says it _appears_ that"), none of *BSD do it.
IMO it's a bad idea.
NAKed-by: Al Viro <[email protected]>
Matthew,
What alternative would you suggest?
From an earlier email:
> ...nothing else addresses the underlying issue: there is no way to
> prevent a fork() from duplicating the resource. The close-on-exec
> flag partially-addresses this by allowing the parent process to
> mark a file descriptor as exclusive to itself, but there is still
> a period of time the failure can occur because the auto-close only
> occurs during the exec(). Perhaps this would not be an issue with
> a different process/threading model, but that is another discussion
> entirely.
Do you disagree there is an issue?
-----Original Message-----
From: Matthew Wilcox <[email protected]>
Sent: Friday, May 15, 2020 10:58
To: Karstens, Nate <[email protected]>
Cc: Alexander Viro <[email protected]>; Jeff Layton <[email protected]>; J. Bruce Fields <[email protected]>; Arnd Bergmann <[email protected]>; Richard Henderson <[email protected]>; Ivan Kokshaysky <[email protected]>; Matt Turner <[email protected]>; James E.J. Bottomley <[email protected]>; Helge Deller <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Eric Dumazet <[email protected]>; David Laight <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Changli Gao <[email protected]>; [email protected]
Subject: Re: [PATCH v2] Implement close-on-fork
CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.
On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> Series of 4 patches to implement close-on-fork. Tests have been
> published to https://github.com/nkarstens/ltp/tree/close-on-fork
> and cover close-on-fork functionality in the following syscalls:
[...]
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
NAK to this patch series, and the entire concept.
Is there a way to persuade POSIX that they made a bad decision by standardising this mess?
________________________________
CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.
On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
>
> What alternative would you suggest?
>
> From an earlier email:
>
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
>
> Do you disagree there is an issue?
Oh good grief that's a leading question: When I write bad code and it
crashes, most people would agree there is an issue; very few would
agree the kernel should be changed to fix it. Several of us have
already said the problem seems to be with the way your application is
written. You didn't even answer emails like this speculating about the
cause being the way your application counts resources:
https://lore.kernel.org/linux-fsdevel/[email protected]/
The bottom line is that we think you could rewrite this one application
not to have the problem you're complaining about rather than introduce
a new kernel API to "fix" it.
James
On Fri, May 15, 2020 at 04:07:33PM +0000, Karstens, Nate wrote:
> Matthew,
>
> What alternative would you suggest?
>
> >From an earlier email:
>
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to
> > mark a file descriptor as exclusive to itself, but there is still
> > a period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with
> > a different process/threading model, but that is another discussion
> > entirely.
>
> Do you disagree there is an issue?
Yes. system() is defined as being unsafe for a threaded application
to call. I pointed this out in the last thread.
Al,
The reference to the POSIX change was included more for a reference to that discussion, not to say "POSIX did this and so Linux must do it to". In any case, the documentation presented to the Austin Group was focused more around the issue we ran into and some alternative solutions. In reviewing the notes from the meeting I didn't get the impression that they added this to POSIX simply because Solaris and *BSD already had it (reference https://austingroupbugs.net/view.php?id=1317, the first solution we suggested).
> It penalizes every call of fork() in the system
Is the performance hit really that drastic? fork() does a lot of stuff and this really seems like a drop in the bucket...
> adds an extra dirtied cacheline on each socket()/open()/etc.
It sounds like we can work to improve that, though.
> already has a portable solution
What is the solution?
Thanks,
Nate
-----Original Message-----
From: Al Viro <[email protected]> On Behalf Of Al Viro
Sent: Friday, May 15, 2020 11:04
To: Karstens, Nate <[email protected]>
Cc: Jeff Layton <[email protected]>; J. Bruce Fields <[email protected]>; Arnd Bergmann <[email protected]>; Richard Henderson <[email protected]>; Ivan Kokshaysky <[email protected]>; Matt Turner <[email protected]>; James E.J. Bottomley <[email protected]>; Helge Deller <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Eric Dumazet <[email protected]>; David Laight <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Changli Gao <[email protected]>
Subject: Re: [PATCH v2] Implement close-on-fork
CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.
On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> This functionality was approved by the Austin Common Standards
> Revision Group for inclusion in the next revision of the POSIX
> standard (see issue 1318 in the Austin Group Defect Tracker).
It penalizes every call of fork() in the system (as well as adds an extra dirtied cacheline on each socket()/open()/etc.), adds memory footprint and complicates the API. All of that - to deal with rather uncommon problem that already has a portable solution.
As for the Austin Group, the only authority it has ever had derives from consensus between existing Unices. "Solaris does it, Linux and *BSD do not" translates into "Austin Group is welcome to take a hike".
BTW, contrary to the lovely bit of misrepresentation in that thread of theirs ("<LWN URL> suggests that" != "someone's comment under LWN article says it _appears_ that"), none of *BSD do it.
IMO it's a bad idea.
NAKed-by: Al Viro <[email protected]>
________________________________
CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.
Karstens, Nate <[email protected]> wrote:
> > already has a portable solution
>
> What is the solution?
sys_spawn(const char *path, const char **argv, const char **envv,
unsigned long clone_flags, unsigned int nfds, int *fds);
maybe?
David
James,
Sorry, perhaps I was indirect, but I thought I had responded to that in https://lore.kernel.org/linux-fsdevel/[email protected]/.
I really hope I do not come off as complaining about this issue. We identified what seemed to be something that was overlooked with the various APIs around creating child processes. Rather than fixing it ourselves and moving on we chose to invest more time and effort into it by engaging the community (first POSIX, and now this one) in a discussion. I humbly and sincerely ask if you would help me understand, if we could turn back the clock, how our application could have been written to avoid this issue:
*A parent process forks a child. Another thread in the parent process closes and attempts to reopen a socket, file, or other resource it needs exclusive access to. This fails because the -operating system- still has a reference to that resource that it is keeping on behalf of the child. The child eventually calls exec and the resource is closed because the close-on-exec flag is set.*
Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.
Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".
Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.
We can also take steps to change or remove system() from our standard library. It fixes our issue, but still leaves the community with an API that is broken/flawed/poorly-documented (depending on how one looks at it).
If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?
Thanks for your help with this.
Nate
-----Original Message-----
From: James Bottomley <[email protected]>
Sent: Friday, May 15, 2020 11:26
To: Karstens, Nate <[email protected]>; Matthew Wilcox <[email protected]>
Cc: Alexander Viro <[email protected]>; Jeff Layton <[email protected]>; J. Bruce Fields <[email protected]>; Arnd Bergmann <[email protected]>; Richard Henderson <[email protected]>; Ivan Kokshaysky <[email protected]>; Matt Turner <[email protected]>; Helge Deller <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Eric Dumazet <[email protected]>; David Laight <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Changli Gao <[email protected]>; [email protected]
Subject: Re: [PATCH v2] Implement close-on-fork
CAUTION - EXTERNAL EMAIL: Do not click any links or open any attachments unless you trust the sender and know the content is safe.
On Fri, 2020-05-15 at 16:07 +0000, Karstens, Nate wrote:
> Matthew,
>
> What alternative would you suggest?
>
> From an earlier email:
>
> > ...nothing else addresses the underlying issue: there is no way to
> > prevent a fork() from duplicating the resource. The close-on-exec
> > flag partially-addresses this by allowing the parent process to mark
> > a file descriptor as exclusive to itself, but there is still a
> > period of time the failure can occur because the auto-close only
> > occurs during the exec(). Perhaps this would not be an issue with a
> > different process/threading model, but that is another discussion
> > entirely.
>
> Do you disagree there is an issue?
Oh good grief that's a leading question: When I write bad code and it crashes, most people would agree there is an issue; very few would agree the kernel should be changed to fix it. Several of us have already said the problem seems to be with the way your application is written. You didn't even answer emails like this speculating about the cause being the way your application counts resources:
https://lore.kernel.org/linux-fsdevel/[email protected]/
The bottom line is that we think you could rewrite this one application not to have the problem you're complaining about rather than introduce a new kernel API to "fix" it.
James
________________________________
CONFIDENTIALITY NOTICE: This email and any attachments are for the sole use of the intended recipient(s) and contain information that may be Garmin confidential and/or Garmin legally privileged. If you have received this email in error, please notify the sender by reply email and delete the message. Any disclosure, copying, distribution or use of this communication (including attachments) by someone other than the intended recipient is prohibited. Thank you.
On Fri, May 15, 2020 at 06:28:20PM +0000, Karstens, Nate wrote:
> Our first attempt, which was to use the pthread_atfork() handlers, failed because system() is not required to call the handlers.
>
> Most of the feedback we're getting on this seems to say "don't use system(), it is unsafe for threaded applications". Is that documented anywhere? The man page says it is "MT-Safe".
https://pubs.opengroup.org/onlinepubs/9699919799/functions/system.html
> Aside from that, even if we remove all uses of system() from our application (which we already have), then our application, like many other applications, needs to use third-party shared libraries. There is nothing that prevents those libraries from using system(). We can audit those libraries and go back with the vendor with a request to replace system() with a standard fork/exec, but they will also want documentation supporting that.
They might also be using any number of other interfaces which aren't
thread-safe.
> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?
Yes, absolutely.
I think you're mischaracterising our feedback. It's not "We don't want
to fix this interface". It's "We don't want to slow down everything else
to fix this interface (and by the way there are lots of other problems
with this interface that you aren't even trying to address)".
On Fri, May 15, 2020 at 08:57:30AM -0700, Matthew Wilcox wrote:
> On Fri, May 15, 2020 at 10:23:17AM -0500, Nate Karstens wrote:
> > Series of 4 patches to implement close-on-fork. Tests have been
> > published to https://github.com/nkarstens/ltp/tree/close-on-fork
> > and cover close-on-fork functionality in the following syscalls:
>
> [...]
>
> > This functionality was approved by the Austin Common Standards
> > Revision Group for inclusion in the next revision of the POSIX
> > standard (see issue 1318 in the Austin Group Defect Tracker).
>
> NAK to this patch series, and the entire concept.
Yeah.
But also, stuff like this should really be on linux-api.
Christian
Hi!
>
> If the feedback from the community is truly and finally that system() should not be used in these applications, then is there support for updating the man page to better communicate that?
>
Clarifying documenation might be the best way forward. Note you'd have
to do that anyway, since people would not know about O_CLOFORK without
pointers in documentation.
Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Hi Nate,
> One manifestation of this is a race conditions in system(), which
> (depending on the implementation) is non-atomic in that it first calls
> a fork() and then an exec().
The need for O_CLOFORK might be made more clear by looking at a
long-standing Go issue, i.e. unrelated to system(3), which was started
in 2017 by Russ Cox when he summed up the current race-condition
behaviour of trying to execve(2) a newly created file:
https://github.com/golang/go/issues/22315. I raised it on linux-kernel
in 2017, https://marc.info/?l=linux-kernel&m=150834137201488, and linked
to a proposed patch from 2011, ‘[PATCH] fs: add FD_CLOFORK and
O_CLOFORK’ by Changli Gao. As I said, long-standing.
The Go issue is worth a read. Russ wondered ‘What would Java do’ only
to find that Java already had an issue open for the same problem since
2014.
I think the kernel is the place to fix the problem, just as with
FD_CLOEXEC/O_CLOEXEC. Ian Lance Taylor says on the Go issue that it
looks like ‘Solaris and macOS and OpenBSD have O_CLOFORK already.
Hopefully it will catch on further’.
--
Cheers, Ralph.
On Sat, Jun 18, 2022 at 12:41:11PM +0100, Ralph Corderoy wrote:
> Hi Nate,
>
> > One manifestation of this is a race conditions in system(), which
> > (depending on the implementation) is non-atomic in that it first calls
> > a fork() and then an exec().
>
> The need for O_CLOFORK might be made more clear by looking at a
> long-standing Go issue, i.e. unrelated to system(3), which was started
> in 2017 by Russ Cox when he summed up the current race-condition
> behaviour of trying to execve(2) a newly created file:
> https://github.com/golang/go/issues/22315. I raised it on linux-kernel
> in 2017, https://marc.info/?l=linux-kernel&m=150834137201488, and linked
> to a proposed patch from 2011, ‘[PATCH] fs: add FD_CLOFORK and
> O_CLOFORK’ by Changli Gao. As I said, long-standing.
The problem is that people advocating for O_CLOFORK understand its
value, but not its cost. Other google employees have a system which has
literally millions of file descriptors in a single process. Having to
maintain this extra state per-fd is a cost they don't want to pay
(and have been quite vocal about earlier in this thread).
Fundamentally, fork()+exec() is a terrible model. Mind you, so is
spawn(). I haven't seen a good model yet.
Hi Matthew, thanks for replying.
> > The need for O_CLOFORK might be made more clear by looking at a
> > long-standing Go issue, i.e. unrelated to system(3), which was started
> > in 2017 by Russ Cox when he summed up the current race-condition
> > behaviour of trying to execve(2) a newly created file:
> > https://github.com/golang/go/issues/22315.
>
> The problem is that people advocating for O_CLOFORK understand its
> value, but not its cost. Other google employees have a system which
> has literally millions of file descriptors in a single process.
> Having to maintain this extra state per-fd is a cost they don't want
> to pay (and have been quite vocal about earlier in this thread).
So do you agree the userspace issue is best solved by *_CLOFORK and the
problem is how to implement *_CLOFORK at an acceptable cost?
OTOH David Laight was making suggestions on moving the load to the
fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
‘portable solution’, though that could have been to a specific issue
rather than the more general case.
How would you recommend approaching an acceptable cost is progressed?
Iterate on patch versions? Open a bugzilla.kernel.org for central
tracking and linking from the other projects? ..?
--
Cheers, Ralph.
On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> Hi Matthew, thanks for replying.
>
> > > The need for O_CLOFORK might be made more clear by looking at a
> > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > in 2017 by Russ Cox when he summed up the current race-condition
> > > behaviour of trying to execve(2) a newly created file:
> > > https://github.com/golang/go/issues/22315.
> >
> > The problem is that people advocating for O_CLOFORK understand its
> > value, but not its cost. Other google employees have a system which
> > has literally millions of file descriptors in a single process.
> > Having to maintain this extra state per-fd is a cost they don't want
> > to pay (and have been quite vocal about earlier in this thread).
>
> So do you agree the userspace issue is best solved by *_CLOFORK and the
> problem is how to implement *_CLOFORK at an acceptable cost?
>
> OTOH David Laight was making suggestions on moving the load to the
> fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> ‘portable solution’, though that could have been to a specific issue
> rather than the more general case.
>
> How would you recommend approaching an acceptable cost is progressed?
> Iterate on patch versions? Open a bugzilla.kernel.org for central
> tracking and linking from the other projects? ..?
Quoting from that go thread
"If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure would help.)"
So why can't this be solved with:
close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
e.g.
close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
https://man7.org/linux/man-pages/man2/close_range.2.html
From: Christian Brauner
> Sent: 28 June 2022 14:13
>
> On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> > Hi Matthew, thanks for replying.
> >
> > > > The need for O_CLOFORK might be made more clear by looking at a
> > > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > > in 2017 by Russ Cox when he summed up the current race-condition
> > > > behaviour of trying to execve(2) a newly created file:
> > > > https://github.com/golang/go/issues/22315.
> > >
> > > The problem is that people advocating for O_CLOFORK understand its
> > > value, but not its cost. Other google employees have a system which
> > > has literally millions of file descriptors in a single process.
> > > Having to maintain this extra state per-fd is a cost they don't want
> > > to pay (and have been quite vocal about earlier in this thread).
> >
> > So do you agree the userspace issue is best solved by *_CLOFORK and the
> > problem is how to implement *_CLOFORK at an acceptable cost?
> >
> > OTOH David Laight was making suggestions on moving the load to the
> > fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> > ‘portable solution’, though that could have been to a specific issue
> > rather than the more general case.
> >
> > How would you recommend approaching an acceptable cost is progressed?
> > Iterate on patch versions? Open a bugzilla.kernel.org for central
> > tracking and linking from the other projects? ..?
>
> Quoting from that go thread
>
> "If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure
> would help.)"
>
> So why can't this be solved with:
> close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> e.g.
> close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
That is a relatively recent linux system call.
Although it can be (mostly) emulated by reading /proc/fd
- but that may not be mounted.
In any case another thread can open an fd between the close_range()
and fork() calls.
(I can't remember what I said before :-)
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Jun 28, 2022 at 01:38:07PM +0000, David Laight wrote:
> From: Christian Brauner
> > Sent: 28 June 2022 14:13
> >
> > On Sun, Jun 19, 2022 at 11:42:28AM +0100, Ralph Corderoy wrote:
> > > Hi Matthew, thanks for replying.
> > >
> > > > > The need for O_CLOFORK might be made more clear by looking at a
> > > > > long-standing Go issue, i.e. unrelated to system(3), which was started
> > > > > in 2017 by Russ Cox when he summed up the current race-condition
> > > > > behaviour of trying to execve(2) a newly created file:
> > > > > https://github.com/golang/go/issues/22315.
> > > >
> > > > The problem is that people advocating for O_CLOFORK understand its
> > > > value, but not its cost. Other google employees have a system which
> > > > has literally millions of file descriptors in a single process.
> > > > Having to maintain this extra state per-fd is a cost they don't want
> > > > to pay (and have been quite vocal about earlier in this thread).
> > >
> > > So do you agree the userspace issue is best solved by *_CLOFORK and the
> > > problem is how to implement *_CLOFORK at an acceptable cost?
> > >
> > > OTOH David Laight was making suggestions on moving the load to the
> > > fork/exec path earlier in the thread, but OTOH Al Viro mentioned a
> > > ‘portable solution’, though that could have been to a specific issue
> > > rather than the more general case.
> > >
> > > How would you recommend approaching an acceptable cost is progressed?
> > > Iterate on patch versions? Open a bugzilla.kernel.org for central
> > > tracking and linking from the other projects? ..?
> >
> > Quoting from that go thread
> >
> > "If the OS had a "close all fds above x", we could use that. (I don't know of any that do, but it sure
> > would help.)"
> >
> > So why can't this be solved with:
> > close_range(fd_first, fd_last, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
> > e.g.
> > close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?
>
> That is a relatively recent linux system call.
> Although it can be (mostly) emulated by reading /proc/fd
> - but that may not be mounted.
>
> In any case another thread can open an fd between the close_range()
> and fork() calls.
The CLOSE_RANGE_UNSHARE gives the calling thread a private file
descriptor table before marking fs close-on-exec.
close_range(100, ~0U, CLOSE_RANGE_CLOEXEC | CLOSE_RANGE_UNSHARE)?