Resending, adding cc:linux-api.
Also, it may help to add a little more background -- this patch is
needed as a (small) part of implementing Capsicum in the Linux kernel.
Capsicum is a security framework that has been present in FreeBSD since
version 9.0 (Jan 2012), and is based on concepts from object-capability
security [1].
One of the features of Capsicum is capability mode, which locks down
access to global namespaces such as the filesystem hierarchy. In
capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
work -- hence the need for a kernel-space alternative.
[1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
------
This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.
Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.
Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
Changes since Meredydd's v3 patch:
- Added a selftest.
- Added a man page.
- Left open_exec() signature untouched to reduce patch impact
elsewhere (as suggested by Al Viro).
- Filled in bprm->filename with d_path() into a buffer, to avoid use
of potentially-ephemeral dentry->d_name.
- Patch against v3.14 (455c6fdbd21916).
David Drysdale (2):
syscalls,x86: implement execveat() system call
syscalls,x86: add selftest for execveat(2)
arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 ++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 153 ++++++++++++++++---
include/linux/compat.h | 3 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 32 ++++
tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
18 files changed, 476 insertions(+), 23 deletions(-)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c
--
1.9.1.423.g4596e3a
Signed-off-by: David Drysdale <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 32 ++++
tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
4 files changed, 290 insertions(+)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed18354..26b7e6410fcd 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -10,6 +10,7 @@ TARGETS += timers
TARGETS += vm
TARGETS += powerpc
TARGETS += user
+TARGETS += exec
all:
for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
new file mode 100644
index 000000000000..349a786899a7
--- /dev/null
+++ b/tools/testing/selftests/exec/.gitignore
@@ -0,0 +1,6 @@
+subdir*
+script*
+execveat
+execveat.symlink
+execveat.moved
+execveat.ephemeral
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
new file mode 100644
index 000000000000..55771ed05dda
--- /dev/null
+++ b/tools/testing/selftests/exec/Makefile
@@ -0,0 +1,32 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = execveat
+DEPS = execveat.symlink script execveat.ephemeral script.ephemeral subdir subdir.ephemeral subdir.ephemeral/script
+all: $(BINARIES) $(DEPS)
+
+subdir:
+ mkdir -p $@
+subdir.ephemeral:
+ mkdir -p $@
+subdir.ephemeral/script: | subdir.ephemeral
+ echo '#!/bin/sh' > $@
+ echo 'exit $$*' >> $@
+ chmod +x $@
+script:
+ echo '#!/bin/sh' > $@
+ echo 'exit $$*' >> $@
+ chmod +x $@
+script.ephemeral: script
+ cp $< $@
+execveat.symlink: execveat
+ ln -s -f $< $@
+execveat.ephemeral: execveat
+ cp $< $@
+%: %.c
+ $(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+ ./execveat
+
+clean:
+ rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved
diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
new file mode 100644
index 000000000000..f67862fa3f38
--- /dev/null
+++ b/tools/testing/selftests/exec/execveat.c
@@ -0,0 +1,251 @@
+/*
+ * Copyright (c) 2014 Google, Inc.
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftests for execveat(2).
+ */
+
+#define _GNU_SOURCE /* to get O_PATH */
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <limits.h>
+#include <stdlib.h>
+
+static char *envp[] = { "IN_TEST=yes", NULL };
+static char *argv[] = { "execveat", "99", NULL };
+
+static int execveat_(int fd, const char *path, char **argv, char **envp,
+ int flags)
+{
+#ifdef __NR_execveat
+ return syscall(__NR_execveat, fd, path, argv, envp, flags);
+#else
+ errno = -ENOSYS;
+ return -1;
+#endif
+}
+
+#define check_execveat_fail(fd, path, flags, errno) \
+ _check_execveat_fail(fd, path, flags, errno, #errno)
+static int _check_execveat_fail(int fd, const char *path, int flags,
+ int expected_errno, const char *errno_str)
+{
+ errno = 0;
+ printf("Check failure of execveat(%d, '%s', %d) with %s... ",
+ fd, path?:"(null)", flags, errno_str);
+ int rc = execveat_(fd, path, argv, envp, flags);
+ if (rc > 0) {
+ printf("[FAIL] (unexpected success from execveat(2))\n");
+ return 1;
+ }
+ if (errno != expected_errno) {
+ printf("[FAIL] (expected errno %d (%s) not %d (%s)\n",
+ expected_errno, strerror(expected_errno),
+ errno, strerror(errno));
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+static int check_execveat_invoked_rc(int fd, const char *path, int flags,
+ int expected_rc)
+{
+ int status;
+ int rc;
+ pid_t child;
+ printf("Check success of execveat(%d, '%s', %d)... ",
+ fd, path?:"(null)", flags);
+ child = fork();
+ if (child < 0) {
+ printf("[FAIL] (fork() failed)\n");
+ return 1;
+ }
+ if (child == 0) {
+ /* Child: do execveat(). */
+ rc = execveat_(fd, path, argv, envp, flags);
+ printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
+ rc, errno, strerror(errno));
+ exit(1); /* should not reach here */
+ }
+ /* Parent: wait for & check child's exit status. */
+ rc = waitpid(child, &status, 0);
+ if (rc != child) {
+ printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
+ return 1;
+ }
+ if (!WIFEXITED(status)) {
+ printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
+ child, status);
+ return 1;
+ }
+ if (WEXITSTATUS(status) != expected_rc) {
+ printf("[FAIL] (child %d exited with %d not %d)\n",
+ child, WEXITSTATUS(status), expected_rc);
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+static int check_execveat(int fd, const char *path, int flags)
+{
+ return check_execveat_invoked_rc(fd, path, flags, 99);
+}
+
+static char *concat(const char *left, const char *right)
+{
+ char *result = malloc(strlen(left) + strlen(right) + 1);
+ strcpy(result, left);
+ strcat(result, right);
+ return result;
+}
+
+static int open_or_die(const char *filename, int flags)
+{
+ int fd = open(filename, flags);
+ if (fd < 0) {
+ printf("Failed to open '%s'; "
+ "check prerequisites are available\n", filename);
+ exit(1);
+ }
+ return fd;
+}
+
+static int run_tests(void)
+{
+ int fail = 0;
+ char *fullname = realpath("execveat", NULL);
+ char *fullname_script = realpath("script", NULL);
+ char *fullname_symlink = concat(fullname, ".symlink");
+ int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
+ int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
+ O_DIRECTORY|O_RDONLY);
+ int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+ int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
+ int fd = open_or_die("execveat", O_RDONLY);
+ int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
+ int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
+ int fd_script = open_or_die("script", O_RDONLY);
+ int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
+ int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
+
+ /* Normal executable file: */
+ /* dfd + path */
+ fail |= check_execveat(subdir_dfd, "../execveat", 0);
+ fail |= check_execveat(dot_dfd, "execveat", 0);
+ fail |= check_execveat(dot_dfd_path, "execveat", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname, 0);
+ /* absolute path with nonsense dfd */
+ fail |= check_execveat(99, fullname, 0);
+ /* fd + no path */
+ fail |= check_execveat(fd, NULL, 0);
+
+ /* Mess with executable file that's already open: */
+ /* fd + no path to a file that's been renamed */
+ rename("execveat.ephemeral", "execveat.moved");
+ fail |= check_execveat(fd_ephemeral, NULL, 0);
+ /* fd + no path to a file that's been deleted */
+ unlink("execveat.moved"); /* remove the file now fd open */
+ fail |= check_execveat(fd_ephemeral, NULL, 0);
+
+ /* Symlink to executable file: */
+ /* dfd + path */
+ fail |= check_execveat(dot_dfd, "execveat.symlink", 0);
+ fail |= check_execveat(dot_dfd_path, "execveat.symlink", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname_symlink, 0);
+ /* fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
+ fail |= check_execveat(fd_symlink, NULL, 0);
+ fail |= check_execveat(fd_symlink, NULL, AT_SYMLINK_NOFOLLOW);
+
+ /* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
+ /* dfd + path */
+ fail |= check_execveat_fail(dot_dfd, "execveat.symlink",
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+ fail |= check_execveat_fail(dot_dfd_path, "execveat.symlink",
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+ /* absolute path */
+ fail |= check_execveat_fail(AT_FDCWD, fullname_symlink,
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+
+ /* Shell script wrapping executable file: */
+ /* dfd + path */
+ fail |= check_execveat(subdir_dfd, "../script", 0);
+ fail |= check_execveat(dot_dfd, "script", 0);
+ fail |= check_execveat(dot_dfd_path, "script", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname_script, 0);
+ /* fd + no path */
+ fail |= check_execveat(fd_script, NULL, 0);
+ fail |= check_execveat(fd_script, NULL, AT_SYMLINK_NOFOLLOW);
+
+ /* Mess with script file that's already open: */
+ /* fd + no path to a file that's been renamed */
+ rename("script.ephemeral", "script.moved");
+ fail |= check_execveat(fd_script_ephemeral, NULL, 0);
+ /* fd + no path to a file that's been deleted */
+ unlink("script.moved"); /* remove the file while fd open */
+ /* Shell attempts to load the deleted file but fails => rc=127 */
+ fail |= check_execveat_invoked_rc(fd_script_ephemeral, NULL, 0, 127);
+
+ /* Rename a subdirectory in the path: */
+ rename("subdir.ephemeral", "subdir.moved");
+ fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+ fail |= check_execveat(subdir_dfd_ephemeral, "script", 0);
+ /* Remove the subdir and its contents */
+ unlink("subdir.moved/script");
+ unlink("subdir.moved");
+ /* Shell loads via deleted subdir OK because name starts with .. */
+ fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+ fail |= check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
+
+ /* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */
+ fail |= check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
+ /* Invalid path => ENOENT */
+ fail |= check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
+ fail |= check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
+ fail |= check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
+ /* Attempt to execute directory => EACCES */
+ fail |= check_execveat_fail(dot_dfd, NULL, 0, EACCES);
+ /* Attempt to execute non-executable => EACCES */
+ fail |= check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
+ /* Attempt to execute file opened with O_PATH => EBADF */
+ fail |= check_execveat_fail(dot_dfd_path, NULL, 0, EBADF);
+ fail |= check_execveat_fail(fd_path, NULL, 0, EBADF);
+ /* Attempt to execute nonsense FD => EBADF */
+ fail |= check_execveat_fail(99, NULL, 0, EBADF);
+ fail |= check_execveat_fail(99, "execveat", 0, EBADF);
+ /* Attempt to execute relative to non-directory => ENOTDIR */
+ fail |= check_execveat_fail(fd, "execveat", 0, ENOTDIR);
+
+ return fail ? -1 : 0;
+}
+
+int main(int argc, char **argv)
+{
+ if (argc >= 2) {
+ /* If we are invoked with an argument, exit immediately. */
+ /* Check expected environment transferred. */
+ if (strcmp(getenv("IN_TEST"), "yes") != 0) {
+ printf("[FAIL] (no IN_TEST=yes in env)\n");
+ return 1;
+ }
+
+ /* Use the argument as an exit code. */
+ int rc = atoi(argv[1]);
+ fflush(stdout);
+ return rc;
+ } else {
+ return run_tests();
+ }
+}
--
1.9.1.423.g4596e3a
Signed-off-by: David Drysdale <[email protected]>
---
man2/execveat.2 | 132 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 132 insertions(+)
create mode 100644 man2/execveat.2
diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..fbbb9f824290
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,132 @@
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execve(int " fd ", const char *" pathname ","
+.br
+.BI " char *const " argv "[], char *const " envp "[],"
+.br
+.BI " int " flags);
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is NULL, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flag:
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR EINVAL .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
--
1.9.1.423.g4596e3a
This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.
Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.
Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).
Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.
This patch in particular (of 3 = 2 kernel + 1 man-pages):
Adds the new system execveat(2) syscall. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.
In addition, if the filename is NULL, execveat() executes the file
to which the file descriptor refers. This replicates the functionality
of fexecve(), which is a system call in other UNIXen, but in Linux
glibc it depends on /proc being mounted.
Only x86-64, i386 and x32 ABIs are supported in this patch.
Based on patches by Meredydd Luff <[email protected]>
Signed-off-by: David Drysdale <[email protected]>
---
arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 +++++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 153 ++++++++++++++++++++++++++++++++------
include/linux/compat.h | 3 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
14 files changed, 186 insertions(+), 23 deletions(-)
diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
index 5d7b381da692..2eccc8932ae6 100644
--- a/arch/x86/ia32/audit.c
+++ b/arch/x86/ia32/audit.c
@@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
case __NR_socketcall:
return 4;
case __NR_execve:
+ case __NR_execveat:
return 5;
default:
return 1;
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..2516c09743e0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -464,6 +464,7 @@ GLOBAL(\label)
PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
PTREGSCALL stub32_sigreturn, sys32_sigreturn
PTREGSCALL stub32_execve, compat_sys_execve
+ PTREGSCALL stub32_execveat, compat_sys_execveat
PTREGSCALL stub32_fork, sys_fork
PTREGSCALL stub32_vfork, sys_vfork
diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..f3672508b249 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
case __NR_openat:
return 3;
case __NR_execve:
+ case __NR_execveat:
return 5;
default:
return 0;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..f9a6c2fdda15 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -872,6 +872,20 @@ ENTRY(stub_execve)
CFI_ENDPROC
END(stub_execve)
+ENTRY(stub_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_execveat)
+
/*
* sigreturn is special because it needs to restore all registers on return.
* This cannot be done with SYSRET, so use the IRET return path instead.
@@ -917,6 +931,20 @@ ENTRY(stub_x32_execve)
CFI_ENDPROC
END(stub_x32_execve)
+ENTRY(stub_x32_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call compat_sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_x32_execveat)
+
#endif
/*
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506ac6de..2ab0712a0e7c 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
350 i386 finit_module sys_finit_module
351 i386 sched_setattr sys_sched_setattr
352 i386 sched_getattr sys_sched_getattr
+353 i386 execveat sys_execveat stub32_execveat
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc7ccea..2e4058c14b4f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
313 common finit_module sys_finit_module
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
+316 64 execveat stub_execveat
#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -358,3 +359,4 @@
540 x32 process_vm_writev compat_sys_process_vm_writev
541 x32 setsockopt compat_sys_setsockopt
542 x32 getsockopt compat_sys_getsockopt
+543 x32 execveat stub_x32_execveat
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723070ca..20c3649d0691 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -31,6 +31,7 @@
#define stub_fork sys_fork
#define stub_vfork sys_vfork
#define stub_execve sys_execve
+#define stub_execveat sys_execveat
#define stub_rt_sigreturn sys_rt_sigreturn
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
diff --git a/fs/exec.c b/fs/exec.c
index 3d78fccdd723..a8676ce571ce 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -748,7 +748,22 @@ EXPORT_SYMBOL(setup_arg_pages);
#endif /* CONFIG_MMU */
-static struct file *do_open_exec(struct filename *name)
+/*
+ * Perform the extra checks that open_exec() needs over and above a normal
+ * open.
+ */
+static int check_exec_and_deny_write(struct file *file)
+{
+ if (!S_ISREG(file_inode(file)->i_mode))
+ return -EACCES;
+
+ if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
+ return -EACCES;
+
+ return deny_write_access(file);
+}
+
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
{
struct file *file;
int err;
@@ -758,24 +773,42 @@ static struct file *do_open_exec(struct filename *name)
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
};
+ static const struct open_flags open_exec_nofollow_flags = {
+ .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .acc_mode = MAY_EXEC | MAY_OPEN,
+ .intent = LOOKUP_OPEN,
+ .lookup_flags = 0,
+ };
- file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
- if (IS_ERR(file))
- goto out;
-
- err = -EACCES;
- if (!S_ISREG(file_inode(file)->i_mode))
- goto exit;
+ if (flags & ~AT_SYMLINK_NOFOLLOW)
+ return ERR_PTR(-EINVAL);
- if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
- goto exit;
+ if (name) {
+ const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
+ ? &open_exec_nofollow_flags
+ : &open_exec_flags);
- fsnotify_open(file);
+ file = do_filp_open(fd, name, oflags);
+ if (IS_ERR(file))
+ goto out;
+ } else {
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+
+ err = inode_permission(file->f_path.dentry->d_inode,
+ open_exec_flags.acc_mode);
+ if (err)
+ goto exit;
+ }
- err = deny_write_access(file);
+ err = check_exec_and_deny_write(file);
if (err)
goto exit;
+ if (name)
+ fsnotify_open(file);
+
out:
return file;
@@ -787,7 +820,7 @@ exit:
struct file *open_exec(const char *name)
{
struct filename tmp = { .name = name };
- return do_open_exec(&tmp);
+ return do_open_execat(AT_FDCWD, &tmp, 0);
}
EXPORT_SYMBOL(open_exec);
@@ -1437,10 +1470,12 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execve_common(struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
{
+ char *pathbuf = NULL;
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
@@ -1481,7 +1516,7 @@ static int do_execve_common(struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
- file = do_open_exec(filename);
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
sched_exec();
bprm->file = file;
- bprm->filename = bprm->interp = filename->name;
+ if (filename && fd == AT_FDCWD) {
+ bprm->filename = filename->name;
+ } else {
+ pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
+ if (!pathbuf) {
+ retval = -ENOMEM;
+ goto out_unmark;
+ }
+ bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
+ if (IS_ERR(bprm->filename)) {
+ retval = PTR_ERR(bprm->filename);
+ goto out_unmark;
+ }
+ }
+ bprm->interp = bprm->filename;
retval = bprm_mm_init(bprm);
if (retval)
@@ -1530,7 +1579,8 @@ static int do_execve_common(struct filename *filename,
acct_update_integrals(current);
task_numa_free(current);
free_bprm(bprm);
- putname(filename);
+ if (filename)
+ putname(filename);
if (displaced)
put_files_struct(displaced);
return retval;
@@ -1547,12 +1597,14 @@ out_unmark:
out_free:
free_bprm(bprm);
+ kfree(pathbuf);
out_files:
if (displaced)
reset_files_struct(displaced);
out_ret:
- putname(filename);
+ if (filename)
+ putname(filename);
return retval;
}
@@ -1562,7 +1614,17 @@ int do_execve(struct filename *filename,
{
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+ const char __user *const __user *__argv,
+ const char __user *const __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}
#ifdef CONFIG_COMPAT
@@ -1578,7 +1640,23 @@ static int compat_do_execve(struct filename *filename,
.is_compat = true,
.ptr.compat = __envp,
};
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+ const compat_uptr_t __user *__argv,
+ const compat_uptr_t __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = {
+ .is_compat = true,
+ .ptr.compat = __argv,
+ };
+ struct user_arg_ptr envp = {
+ .is_compat = true,
+ .ptr.compat = __envp,
+ };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}
#endif
@@ -1618,6 +1696,22 @@ SYSCALL_DEFINE3(execve,
{
return do_execve(getname(filename), argv, envp);
}
+
+SYSCALL_DEFINE5(execveat,
+ int, fd, const char __user *, filename,
+ const char __user *const __user *, argv,
+ const char __user *const __user *, envp,
+ int, flags)
+{
+ struct filename *path = NULL;
+ if (filename) {
+ path = getname(filename);
+ if (IS_ERR(path))
+ return PTR_ERR(path);
+ }
+ return do_execveat(fd, path, argv, envp, flags);
+}
+
#ifdef CONFIG_COMPAT
asmlinkage long compat_sys_execve(const char __user * filename,
const compat_uptr_t __user * argv,
@@ -1625,4 +1719,19 @@ asmlinkage long compat_sys_execve(const char __user * filename,
{
return compat_do_execve(getname(filename), argv, envp);
}
+
+asmlinkage long compat_sys_execveat(int fd,
+ const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp,
+ int flags)
+{
+ struct filename *path = NULL;
+ if (filename) {
+ path = getname(filename);
+ if (IS_ERR(path))
+ return PTR_ERR(path);
+ }
+ return compat_do_execveat(fd, path, argv, envp, flags);
+}
#endif
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 3f448c65511b..e875a5d97e08 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -341,6 +341,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp, int flags);
asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a781dec1cd0b..92601818b4fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2315,6 +2315,10 @@ extern int disallow_signal(int);
extern int do_execve(struct filename *,
const char __user * const __user *,
const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+ const char __user * const __user *,
+ const char __user * const __user *,
+ int);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77ea584..309a810cf39d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -855,4 +855,8 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+ const char __user *const __user *argv,
+ const char __user *const __user *envp, int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index dde8041f40d2..4231bca3f95e 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -696,9 +696,11 @@ __SYSCALL(__NR_finit_module, sys_finit_module)
__SYSCALL(__NR_sched_setattr, sys_sched_setattr)
#define __NR_sched_getattr 275
__SYSCALL(__NR_sched_getattr, sys_sched_getattr)
+#define __NR_execveat 276
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
#undef __NR_syscalls
-#define __NR_syscalls 276
+#define __NR_syscalls 277
/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052284fd..5ea7b8ab9e63 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -209,3 +209,6 @@ cond_syscall(compat_sys_open_by_handle_at);
/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 76bbed4a20e5..712456ed5960 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -48,6 +48,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
case __NR_socketcall:
return 4;
#endif
+#ifdef __NR_execveat
+ case __NR_execveat:
+#endif
case __NR_execve:
return 5;
default:
--
1.9.1.423.g4596e3a
On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
> Resending, adding cc:linux-api.
>
> Also, it may help to add a little more background -- this patch is
> needed as a (small) part of implementing Capsicum in the Linux kernel.
>
> Capsicum is a security framework that has been present in FreeBSD since
> version 9.0 (Jan 2012), and is based on concepts from object-capability
> security [1].
>
> One of the features of Capsicum is capability mode, which locks down
> access to global namespaces such as the filesystem hierarchy. In
> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
> work -- hence the need for a kernel-space alternative.
>
> [1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
Thanks for reposting!
I think it'd be quite helpful to have this available for very tightly
confined sandboxes. And in a larger sense, Capsicum itself is an
interesting way to do programmatic isolation.
-Kees
--
Kees Cook
Chrome OS Security
On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
> This patch set adds execveat(2) for x86, and is derived from Meredydd
> Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
>
> The primary aim of adding an execveat syscall is to allow an
> implementation of fexecve(3) that does not rely on the /proc
> filesystem. The current glibc version of fexecve(3) is implemented
> via /proc, which causes problems in sandboxed or otherwise restricted
> environments.
>
> Given the desire for a /proc-free fexecve() implementation, HPA
> suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
> syscall would be an appropriate generalization.
>
> Also, having a new syscall means that it can take a flags argument
> without back-compatibility concerns. The current implementation just
> defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
> in future -- for example, flags for new namespaces (as suggested at
> https://lkml.org/lkml/2006/7/11/474).
>
> Related history:
> - https://lkml.org/lkml/2006/12/27/123 is an example of someone
> realizing that fexecve() is likely to fail in a chroot environment.
> - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
> documenting the /proc requirement of fexecve(3) in its manpage, to
> "prevent other people from wasting their time".
> - https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
> it's not possible to fexecve() a file descriptor for a script with
> close-on-exec set (which is possible with the implementation here).
> - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
> problem where a process that did setuid() could not fexecve()
> because it no longer had access to /proc/self/fd; this has since
> been fixed.
>
> This patch in particular (of 3 = 2 kernel + 1 man-pages):
>
> Adds the new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>
> In addition, if the filename is NULL, execveat() executes the file
> to which the file descriptor refers. This replicates the functionality
> of fexecve(), which is a system call in other UNIXen, but in Linux
> glibc it depends on /proc being mounted.
>
> Only x86-64, i386 and x32 ABIs are supported in this patch.
>
> Based on patches by Meredydd Luff <[email protected]>
>
> Signed-off-by: David Drysdale <[email protected]>
Hi Al,
Any thoughts on this? I think it would be quite handy.
-Kees
> ---
> arch/x86/ia32/audit.c | 1 +
> arch/x86/ia32/ia32entry.S | 1 +
> arch/x86/kernel/audit_64.c | 1 +
> arch/x86/kernel/entry_64.S | 28 +++++++
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/x86/um/sys_call_table_64.c | 1 +
> fs/exec.c | 153 ++++++++++++++++++++++++++++++++------
> include/linux/compat.h | 3 +
> include/linux/sched.h | 4 +
> include/linux/syscalls.h | 4 +
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 3 +
> lib/audit.c | 3 +
> 14 files changed, 186 insertions(+), 23 deletions(-)
>
> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
> index 5d7b381da692..2eccc8932ae6 100644
> --- a/arch/x86/ia32/audit.c
> +++ b/arch/x86/ia32/audit.c
> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
> case __NR_socketcall:
> return 4;
> case __NR_execve:
> + case __NR_execveat:
> return 5;
> default:
> return 1;
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 4299eb05023c..2516c09743e0 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -464,6 +464,7 @@ GLOBAL(\label)
> PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
> PTREGSCALL stub32_sigreturn, sys32_sigreturn
> PTREGSCALL stub32_execve, compat_sys_execve
> + PTREGSCALL stub32_execveat, compat_sys_execveat
> PTREGSCALL stub32_fork, sys_fork
> PTREGSCALL stub32_vfork, sys_vfork
>
> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
> index 06d3e5a14d9d..f3672508b249 100644
> --- a/arch/x86/kernel/audit_64.c
> +++ b/arch/x86/kernel/audit_64.c
> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
> case __NR_openat:
> return 3;
> case __NR_execve:
> + case __NR_execveat:
> return 5;
> default:
> return 0;
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1e96c3628bf2..f9a6c2fdda15 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -872,6 +872,20 @@ ENTRY(stub_execve)
> CFI_ENDPROC
> END(stub_execve)
>
> +ENTRY(stub_execveat)
> + CFI_STARTPROC
> + addq $8, %rsp
> + PARTIAL_FRAME 0
> + SAVE_REST
> + FIXUP_TOP_OF_STACK %r11
> + call sys_execveat
> + RESTORE_TOP_OF_STACK %r11
> + movq %rax,RAX(%rsp)
> + RESTORE_REST
> + jmp int_ret_from_sys_call
> + CFI_ENDPROC
> +END(stub_execveat)
> +
> /*
> * sigreturn is special because it needs to restore all registers on return.
> * This cannot be done with SYSRET, so use the IRET return path instead.
> @@ -917,6 +931,20 @@ ENTRY(stub_x32_execve)
> CFI_ENDPROC
> END(stub_x32_execve)
>
> +ENTRY(stub_x32_execveat)
> + CFI_STARTPROC
> + addq $8, %rsp
> + PARTIAL_FRAME 0
> + SAVE_REST
> + FIXUP_TOP_OF_STACK %r11
> + call compat_sys_execveat
> + RESTORE_TOP_OF_STACK %r11
> + movq %rax,RAX(%rsp)
> + RESTORE_REST
> + jmp int_ret_from_sys_call
> + CFI_ENDPROC
> +END(stub_x32_execveat)
> +
> #endif
>
> /*
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 96bc506ac6de..2ab0712a0e7c 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -359,3 +359,4 @@
> 350 i386 finit_module sys_finit_module
> 351 i386 sched_setattr sys_sched_setattr
> 352 i386 sched_getattr sys_sched_getattr
> +353 i386 execveat sys_execveat stub32_execveat
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index a12bddc7ccea..2e4058c14b4f 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -322,6 +322,7 @@
> 313 common finit_module sys_finit_module
> 314 common sched_setattr sys_sched_setattr
> 315 common sched_getattr sys_sched_getattr
> +316 64 execveat stub_execveat
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -358,3 +359,4 @@
> 540 x32 process_vm_writev compat_sys_process_vm_writev
> 541 x32 setsockopt compat_sys_setsockopt
> 542 x32 getsockopt compat_sys_getsockopt
> +543 x32 execveat stub_x32_execveat
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index f2f0723070ca..20c3649d0691 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -31,6 +31,7 @@
> #define stub_fork sys_fork
> #define stub_vfork sys_vfork
> #define stub_execve sys_execve
> +#define stub_execveat sys_execveat
> #define stub_rt_sigreturn sys_rt_sigreturn
>
> #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
> diff --git a/fs/exec.c b/fs/exec.c
> index 3d78fccdd723..a8676ce571ce 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -748,7 +748,22 @@ EXPORT_SYMBOL(setup_arg_pages);
>
> #endif /* CONFIG_MMU */
>
> -static struct file *do_open_exec(struct filename *name)
> +/*
> + * Perform the extra checks that open_exec() needs over and above a normal
> + * open.
> + */
> +static int check_exec_and_deny_write(struct file *file)
> +{
> + if (!S_ISREG(file_inode(file)->i_mode))
> + return -EACCES;
> +
> + if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
> + return -EACCES;
> +
> + return deny_write_access(file);
> +}
> +
> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
> {
> struct file *file;
> int err;
> @@ -758,24 +773,42 @@ static struct file *do_open_exec(struct filename *name)
> .intent = LOOKUP_OPEN,
> .lookup_flags = LOOKUP_FOLLOW,
> };
> + static const struct open_flags open_exec_nofollow_flags = {
> + .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> + .acc_mode = MAY_EXEC | MAY_OPEN,
> + .intent = LOOKUP_OPEN,
> + .lookup_flags = 0,
> + };
>
> - file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
> - if (IS_ERR(file))
> - goto out;
> -
> - err = -EACCES;
> - if (!S_ISREG(file_inode(file)->i_mode))
> - goto exit;
> + if (flags & ~AT_SYMLINK_NOFOLLOW)
> + return ERR_PTR(-EINVAL);
>
> - if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
> - goto exit;
> + if (name) {
> + const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
> + ? &open_exec_nofollow_flags
> + : &open_exec_flags);
>
> - fsnotify_open(file);
> + file = do_filp_open(fd, name, oflags);
> + if (IS_ERR(file))
> + goto out;
> + } else {
> + file = fget(fd);
> + if (!file)
> + return ERR_PTR(-EBADF);
> +
> + err = inode_permission(file->f_path.dentry->d_inode,
> + open_exec_flags.acc_mode);
> + if (err)
> + goto exit;
> + }
>
> - err = deny_write_access(file);
> + err = check_exec_and_deny_write(file);
> if (err)
> goto exit;
>
> + if (name)
> + fsnotify_open(file);
> +
> out:
> return file;
>
> @@ -787,7 +820,7 @@ exit:
> struct file *open_exec(const char *name)
> {
> struct filename tmp = { .name = name };
> - return do_open_exec(&tmp);
> + return do_open_execat(AT_FDCWD, &tmp, 0);
> }
> EXPORT_SYMBOL(open_exec);
>
> @@ -1437,10 +1470,12 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execve_common(struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp)
> +static int do_execveat_common(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> {
> + char *pathbuf = NULL;
> struct linux_binprm *bprm;
> struct file *file;
> struct files_struct *displaced;
> @@ -1481,7 +1516,7 @@ static int do_execve_common(struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_exec(filename);
> + file = do_open_execat(fd, filename, flags);
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - bprm->filename = bprm->interp = filename->name;
> + if (filename && fd == AT_FDCWD) {
> + bprm->filename = filename->name;
> + } else {
> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> + if (!pathbuf) {
> + retval = -ENOMEM;
> + goto out_unmark;
> + }
> + bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
> + if (IS_ERR(bprm->filename)) {
> + retval = PTR_ERR(bprm->filename);
> + goto out_unmark;
> + }
> + }
> + bprm->interp = bprm->filename;
>
> retval = bprm_mm_init(bprm);
> if (retval)
> @@ -1530,7 +1579,8 @@ static int do_execve_common(struct filename *filename,
> acct_update_integrals(current);
> task_numa_free(current);
> free_bprm(bprm);
> - putname(filename);
> + if (filename)
> + putname(filename);
> if (displaced)
> put_files_struct(displaced);
> return retval;
> @@ -1547,12 +1597,14 @@ out_unmark:
>
> out_free:
> free_bprm(bprm);
> + kfree(pathbuf);
>
> out_files:
> if (displaced)
> reset_files_struct(displaced);
> out_ret:
> - putname(filename);
> + if (filename)
> + putname(filename);
> return retval;
> }
>
> @@ -1562,7 +1614,17 @@ int do_execve(struct filename *filename,
> {
> struct user_arg_ptr argv = { .ptr.native = __argv };
> struct user_arg_ptr envp = { .ptr.native = __envp };
> - return do_execve_common(filename, argv, envp);
> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +int do_execveat(int fd, struct filename *filename,
> + const char __user *const __user *__argv,
> + const char __user *const __user *__envp,
> + int flags)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> + return do_execveat_common(fd, filename, argv, envp, flags);
> }
>
> #ifdef CONFIG_COMPAT
> @@ -1578,7 +1640,23 @@ static int compat_do_execve(struct filename *filename,
> .is_compat = true,
> .ptr.compat = __envp,
> };
> - return do_execve_common(filename, argv, envp);
> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +static int compat_do_execveat(int fd, struct filename *filename,
> + const compat_uptr_t __user *__argv,
> + const compat_uptr_t __user *__envp,
> + int flags)
> +{
> + struct user_arg_ptr argv = {
> + .is_compat = true,
> + .ptr.compat = __argv,
> + };
> + struct user_arg_ptr envp = {
> + .is_compat = true,
> + .ptr.compat = __envp,
> + };
> + return do_execveat_common(fd, filename, argv, envp, flags);
> }
> #endif
>
> @@ -1618,6 +1696,22 @@ SYSCALL_DEFINE3(execve,
> {
> return do_execve(getname(filename), argv, envp);
> }
> +
> +SYSCALL_DEFINE5(execveat,
> + int, fd, const char __user *, filename,
> + const char __user *const __user *, argv,
> + const char __user *const __user *, envp,
> + int, flags)
> +{
> + struct filename *path = NULL;
> + if (filename) {
> + path = getname(filename);
> + if (IS_ERR(path))
> + return PTR_ERR(path);
> + }
> + return do_execveat(fd, path, argv, envp, flags);
> +}
> +
> #ifdef CONFIG_COMPAT
> asmlinkage long compat_sys_execve(const char __user * filename,
> const compat_uptr_t __user * argv,
> @@ -1625,4 +1719,19 @@ asmlinkage long compat_sys_execve(const char __user * filename,
> {
> return compat_do_execve(getname(filename), argv, envp);
> }
> +
> +asmlinkage long compat_sys_execveat(int fd,
> + const char __user *filename,
> + const compat_uptr_t __user *argv,
> + const compat_uptr_t __user *envp,
> + int flags)
> +{
> + struct filename *path = NULL;
> + if (filename) {
> + path = getname(filename);
> + if (IS_ERR(path))
> + return PTR_ERR(path);
> + }
> + return compat_do_execveat(fd, path, argv, envp, flags);
> +}
> #endif
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index 3f448c65511b..e875a5d97e08 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -341,6 +341,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>
> asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
> const compat_uptr_t __user *envp);
> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
> + const compat_uptr_t __user *argv,
> + const compat_uptr_t __user *envp, int flags);
>
> asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
> compat_ulong_t __user *outp, compat_ulong_t __user *exp,
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a781dec1cd0b..92601818b4fb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2315,6 +2315,10 @@ extern int disallow_signal(int);
> extern int do_execve(struct filename *,
> const char __user * const __user *,
> const char __user * const __user *);
> +extern int do_execveat(int, struct filename *,
> + const char __user * const __user *,
> + const char __user * const __user *,
> + int);
> extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> struct task_struct *fork_idle(int);
> extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a747a77ea584..309a810cf39d 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -855,4 +855,8 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
> asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
> unsigned long idx1, unsigned long idx2);
> asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
> + const char __user *const __user *argv,
> + const char __user *const __user *envp, int flags);
> +
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index dde8041f40d2..4231bca3f95e 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -696,9 +696,11 @@ __SYSCALL(__NR_finit_module, sys_finit_module)
> __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
> #define __NR_sched_getattr 275
> __SYSCALL(__NR_sched_getattr, sys_sched_getattr)
> +#define __NR_execveat 276
> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 276
> +#define __NR_syscalls 277
>
> /*
> * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 7078052284fd..5ea7b8ab9e63 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -209,3 +209,6 @@ cond_syscall(compat_sys_open_by_handle_at);
>
> /* compare kernel pointers */
> cond_syscall(sys_kcmp);
> +
> +/* execveat */
> +cond_syscall(sys_execveat);
> diff --git a/lib/audit.c b/lib/audit.c
> index 76bbed4a20e5..712456ed5960 100644
> --- a/lib/audit.c
> +++ b/lib/audit.c
> @@ -48,6 +48,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
> case __NR_socketcall:
> return 4;
> #endif
> +#ifdef __NR_execveat
> + case __NR_execveat:
> +#endif
> case __NR_execve:
> return 5;
> default:
> --
> 1.9.1.423.g4596e3a
>
--
Kees Cook
Chrome OS Security
[Added Eric Biederman, since I think your tree might be a reasonable
route forward for these patches.]
On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
> Resending, adding cc:linux-api.
>
> Also, it may help to add a little more background -- this patch is
> needed as a (small) part of implementing Capsicum in the Linux kernel.
>
> Capsicum is a security framework that has been present in FreeBSD since
> version 9.0 (Jan 2012), and is based on concepts from object-capability
> security [1].
>
> One of the features of Capsicum is capability mode, which locks down
> access to global namespaces such as the filesystem hierarchy. In
> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
> work -- hence the need for a kernel-space
I just found myself wanting this syscall for another reason: injecting
programs into sandboxes or otherwise heavily locked-down namespaces.
For example, I want to be able to reliably do something like nsenter
--namespace-flags-here toybox sh. Toybox's shell is unusual in that
it is more or less fully functional, so this should Just Work (tm),
except that the toybox binary might not exist in the namespace being
entered. If execveat were available, I could rig nsenter or a similar
tool to open it with O_CLOEXEC, enter the namespace, and then call
execveat.
Is there any reason that these patches can't be merged more or less as
is for 3.19?
--Andy
>
> [1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
>
> ------
>
> This patch set adds execveat(2) for x86, and is derived from Meredydd
> Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
>
> The primary aim of adding an execveat syscall is to allow an
> implementation of fexecve(3) that does not rely on the /proc
> filesystem. The current glibc version of fexecve(3) is implemented
> via /proc, which causes problems in sandboxed or otherwise restricted
> environments.
>
> Given the desire for a /proc-free fexecve() implementation, HPA
> suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
> syscall would be an appropriate generalization.
>
> Also, having a new syscall means that it can take a flags argument
> without back-compatibility concerns. The current implementation just
> defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
> in future -- for example, flags for new namespaces (as suggested at
> https://lkml.org/lkml/2006/7/11/474).
>
> Related history:
> - https://lkml.org/lkml/2006/12/27/123 is an example of someone
> realizing that fexecve() is likely to fail in a chroot environment.
> - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
> documenting the /proc requirement of fexecve(3) in its manpage, to
> "prevent other people from wasting their time".
> - https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
> it's not possible to fexecve() a file descriptor for a script with
> close-on-exec set (which is possible with the implementation here).
> - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
> problem where a process that did setuid() could not fexecve()
> because it no longer had access to /proc/self/fd; this has since
> been fixed.
>
>
> Changes since Meredydd's v3 patch:
> - Added a selftest.
> - Added a man page.
> - Left open_exec() signature untouched to reduce patch impact
> elsewhere (as suggested by Al Viro).
> - Filled in bprm->filename with d_path() into a buffer, to avoid use
> of potentially-ephemeral dentry->d_name.
> - Patch against v3.14 (455c6fdbd21916).
>
>
> David Drysdale (2):
> syscalls,x86: implement execveat() system call
> syscalls,x86: add selftest for execveat(2)
>
> arch/x86/ia32/audit.c | 1 +
> arch/x86/ia32/ia32entry.S | 1 +
> arch/x86/kernel/audit_64.c | 1 +
> arch/x86/kernel/entry_64.S | 28 ++++
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/x86/um/sys_call_table_64.c | 1 +
> fs/exec.c | 153 ++++++++++++++++---
> include/linux/compat.h | 3 +
> include/linux/sched.h | 4 +
> include/linux/syscalls.h | 4 +
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 3 +
> lib/audit.c | 3 +
> tools/testing/selftests/Makefile | 1 +
> tools/testing/selftests/exec/.gitignore | 6 +
> tools/testing/selftests/exec/Makefile | 32 ++++
> tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
> 18 files changed, 476 insertions(+), 23 deletions(-)
> create mode 100644 tools/testing/selftests/exec/.gitignore
> create mode 100644 tools/testing/selftests/exec/Makefile
> create mode 100644 tools/testing/selftests/exec/execveat.c
>
> --
> 1.9.1.423.g4596e3a
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Andy Lutomirski
AMA Capital Management, LLC
Andy Lutomirski <[email protected]> writes:
> [Added Eric Biederman, since I think your tree might be a reasonable
> route forward for these patches.]
>
> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
>> Resending, adding cc:linux-api.
>>
>> Also, it may help to add a little more background -- this patch is
>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>
>> Capsicum is a security framework that has been present in FreeBSD since
>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>> security [1].
>>
>> One of the features of Capsicum is capability mode, which locks down
>> access to global namespaces such as the filesystem hierarchy. In
>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>> work -- hence the need for a kernel-space
>
> I just found myself wanting this syscall for another reason: injecting
> programs into sandboxes or otherwise heavily locked-down namespaces.
>
> For example, I want to be able to reliably do something like nsenter
> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> it is more or less fully functional, so this should Just Work (tm),
> except that the toybox binary might not exist in the namespace being
> entered. If execveat were available, I could rig nsenter or a similar
> tool to open it with O_CLOEXEC, enter the namespace, and then call
> execveat.
>
> Is there any reason that these patches can't be merged more or less as
> is for 3.19?
Yes. There is a silliness in how it implements fexecve. The fexecve
case should be use the empty string "" not a NULL pointer to indication
that. That change will then harmonize execveat with the other ...at
system calls and simplify the code and remove a special case. I believe
using the empty string "" requires implementing the AT_EMPTY_PATH flag.
For sandboxes execveat seems to make a great deal of sense. I can
get the same functionality by passing in a directory file descriptor
calling fchdir and execve so this should not introduce any new security
holes. And using the final file descriptor removes a race.
AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
for exec I don't know what problems it can solve.
Until I am done moving I won't have time to pick this up, and the code
clearly needs another revision but I will be happy to work to see that
we get a sane execveat implemented.
Eric
p.s. I don't believe there are any namespaces issues where doing
something with execveat flags make sense.
On Oct 18, 2014 5:21 PM, "Eric W. Biederman" <[email protected]> wrote:
>
> Andy Lutomirski <[email protected]> writes:
>
> > [Added Eric Biederman, since I think your tree might be a reasonable
> > route forward for these patches.]
> >
> > On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
> >> Resending, adding cc:linux-api.
> >>
> >> Also, it may help to add a little more background -- this patch is
> >> needed as a (small) part of implementing Capsicum in the Linux kernel.
> >>
> >> Capsicum is a security framework that has been present in FreeBSD since
> >> version 9.0 (Jan 2012), and is based on concepts from object-capability
> >> security [1].
> >>
> >> One of the features of Capsicum is capability mode, which locks down
> >> access to global namespaces such as the filesystem hierarchy. In
> >> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
> >> work -- hence the need for a kernel-space
> >
> > I just found myself wanting this syscall for another reason: injecting
> > programs into sandboxes or otherwise heavily locked-down namespaces.
> >
> > For example, I want to be able to reliably do something like nsenter
> > --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> > it is more or less fully functional, so this should Just Work (tm),
> > except that the toybox binary might not exist in the namespace being
> > entered. If execveat were available, I could rig nsenter or a similar
> > tool to open it with O_CLOEXEC, enter the namespace, and then call
> > execveat.
> >
> > Is there any reason that these patches can't be merged more or less as
> > is for 3.19?
>
> Yes. There is a silliness in how it implements fexecve. The fexecve
> case should be use the empty string "" not a NULL pointer to indication
> that. That change will then harmonize execveat with the other ...at
> system calls and simplify the code and remove a special case. I believe
> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>
Sounds reasonable.
> For sandboxes execveat seems to make a great deal of sense. I can
> get the same functionality by passing in a directory file descriptor
> calling fchdir and execve so this should not introduce any new security
> holes. And using the final file descriptor removes a race.
The problem with that approach is that the execed program now has its
current directory outside the sandbox, which could be problematic if
you don't trust that program.
>
> AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
> for exec I don't know what problems it can solve.
It can always be added later
>
> Until I am done moving I won't have time to pick this up, and the code
> clearly needs another revision but I will be happy to work to see that
> we get a sane execveat implemented.
Do you have an ETA? If it's likely to miss 3.19, but if you'll have
time to review before then, I can try to do it.
>
> Eric
>
> p.s. I don't believe there are any namespaces issues where doing
> something with execveat flags make sense.
>
OK, I'll bite. How feasible would it be to have a flag that activated
pid_ns_for_children? That would reduce a lot of the ugliness in tools
like nsenter that need to fork to enter a pid ns.
I always assume that the reason for the active vs. for_children
distinction was because a lot of userspace libraries, including glibc,
would malfunction if getpid(2) started returning a different value.
But, for exec, this doesn't matter.
--Andy
On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
> For example, I want to be able to reliably do something like nsenter
> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> it is more or less fully functional, so this should Just Work (tm),
> except that the toybox binary might not exist in the namespace being
> entered. If execveat were available, I could rig nsenter or a similar
> tool to open it with O_CLOEXEC, enter the namespace, and then call
> execveat.
The question I hadn't seen really answered through all of that was how to
deal with #!... "Just use d_path()" isn't particulary appealing - if that
file has a pathname reachable for you, you could bloody well use _that_
from the very beginning.
Frankly, I wonder if it would make sense to provide something like
dupfs. We can't mount it by default on /dev/fd (more's the pity), but
it might be a good thing to have.
What it is, for those who are not familiar with Plan 9: a filesystem with
one directory and a bunch of files in it. Directory contents depends on
who's looking; for each opened descriptor in your descriptor table, you'll
see two files there. One series is 0, 1, ... - opening one of those gives
dup(). IOW, it's *not* giving you a new struct file; it gives you a new
reference to existing one, complete with sharing IO position, etc. Another
is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
It's actually a better match for what one would expect at /dev/fd than what
we do. Example:
; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
; (echo 'line 1';echo 'line 2') >a
; cat a|sh a.sh
line 2
The first line was line 1
; sh a.sh <a
line 1
line 2
The first line was line 1
;
See what's going on? Opening /dev/fd/0 (aka /dev/stdin) does a fresh open
of whatever your stdin is; if it's a pipe - fine, you've just added yourself
as additional reader. But if it's a regular file, you've got yourself
a brand-new opened file, with IO position of its own. Sitting at the
beginning of the file.
Moreover, try that with stdin being a socket and you'll see cat(1) failing
to open that sucker.
We _can't_ blindly replace /dev/fd with it - it has to be a sysadmin choice;
semantics is different. However, there's no reason why it can't be mounted
in environments where you want to avoid procfs - it's certainly exposing less
than procfs would.
And these days we can implement relatively cheaply. It's a window that will
close after a while, but right now we can change ->atomic_open() calling
conventions. Instead of having it return 0 or error, let's switch to returning
NULL, ERR_PTR(error) *or* an extra reference to preexisting struct file.
Same as we did for ->lookup(), and for similar reason.
Right now we have 8 instances of ->atomic_open() and one place calling that
method. Changing the API like that would be trivial (and it's a trivial
conversion - replace return ret; with return ERR_PTR(ret); through all
instances, so any out-of-tree filesystems could follow easily). We certainly
can't do anything of that sort with ->open() - there would be thousands
instances to convert. ->atomic_open(), OTOH, is still new enough for
that to be feasible.
What we get from that conversion is an ability to do dup-style semantics
easily.
* give root directory an ->atomic_open() instance that would be
handling opens.
* make lookups in there fail with ENOENT if you don't have such a
descriptor at the moment. Otherwise bind all of them to the same inode.
The only method it needs is ->getattr(), and that would look into your
descriptor table for descriptor with number derived from dentry (stashed
in ->d_fsdata at lookup time) and do what fstat() would.
* have those dentries always fail ->d_revalidate(), to force
everything towards ->atomic_open().
* for ...ctl names, ->atomic_open() would act in normal fashion;
again, only one inode is needed. ->read() would pick descriptor number
from ->d_fsdata and report on whatever you have with that number at the
time.
I'll try to put a prototype of that together; I think it's at least
interesting to try. And that ought to be safe to mount even in very
restricted environments, making arguments along the lines of "but we can't
get the path by opened file without the big bad wol^Wprocfs and we can't
have that in our environment" much weaker...
Comments?
On Sun, Oct 19, 2014 at 1:20 PM, Al Viro <[email protected]> wrote:
> On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
>
>> For example, I want to be able to reliably do something like nsenter
>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>> it is more or less fully functional, so this should Just Work (tm),
>> except that the toybox binary might not exist in the namespace being
>> entered. If execveat were available, I could rig nsenter or a similar
>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>> execveat.
>
> The question I hadn't seen really answered through all of that was how to
> deal with #!... "Just use d_path()" isn't particulary appealing - if that
> file has a pathname reachable for you, you could bloody well use _that_
> from the very beginning.
Does this matter for absolute paths after #! (or for absolute paths to
ELF interpreters)? Does anyone use relative paths there?
Does execve("/proc/self/fd/N", ...) not work correctly now?
Presumably relative paths should be relative to the execed program, or
maybe there should be a flag to execveat that disallows that behavior
entirely, or maybe it should never work, even through /proc. I don't
really like the idea that an fd pointing at a *file* should allow
access to its directory.
>
> Frankly, I wonder if it would make sense to provide something like
> dupfs. We can't mount it by default on /dev/fd (more's the pity), but
> it might be a good thing to have.
>
> What it is, for those who are not familiar with Plan 9: a filesystem with
> one directory and a bunch of files in it. Directory contents depends on
> who's looking; for each opened descriptor in your descriptor table, you'll
> see two files there. One series is 0, 1, ... - opening one of those gives
> dup(). IOW, it's *not* giving you a new struct file; it gives you a new
> reference to existing one, complete with sharing IO position, etc. Another
> is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
> much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
>
> It's actually a better match for what one would expect at /dev/fd than what
> we do. Example:
>
> ; echo 'read i; cat /dev/fd/0; echo "The first line was $i"' >a.sh
> ; (echo 'line 1';echo 'line 2') >a
> ; cat a|sh a.sh
> line 2
> The first line was line 1
> ; sh a.sh <a
> line 1
> line 2
> The first line was line 1
> ;
>
> See what's going on? Opening /dev/fd/0 (aka /dev/stdin) does a fresh open
> of whatever your stdin is; if it's a pipe - fine, you've just added yourself
> as additional reader. But if it's a regular file, you've got yourself
> a brand-new opened file, with IO position of its own. Sitting at the
> beginning of the file.
>
> Moreover, try that with stdin being a socket and you'll see cat(1) failing
> to open that sucker.
>
> We _can't_ blindly replace /dev/fd with it - it has to be a sysadmin choice;
> semantics is different. However, there's no reason why it can't be mounted
> in environments where you want to avoid procfs - it's certainly exposing less
> than procfs would.
>
> And these days we can implement relatively cheaply. It's a window that will
> close after a while, but right now we can change ->atomic_open() calling
> conventions. Instead of having it return 0 or error, let's switch to returning
> NULL, ERR_PTR(error) *or* an extra reference to preexisting struct file.
> Same as we did for ->lookup(), and for similar reason.
>
> Right now we have 8 instances of ->atomic_open() and one place calling that
> method. Changing the API like that would be trivial (and it's a trivial
> conversion - replace return ret; with return ERR_PTR(ret); through all
> instances, so any out-of-tree filesystems could follow easily). We certainly
> can't do anything of that sort with ->open() - there would be thousands
> instances to convert. ->atomic_open(), OTOH, is still new enough for
> that to be feasible.
>
> What we get from that conversion is an ability to do dup-style semantics
> easily.
> * give root directory an ->atomic_open() instance that would be
> handling opens.
> * make lookups in there fail with ENOENT if you don't have such a
> descriptor at the moment. Otherwise bind all of them to the same inode.
> The only method it needs is ->getattr(), and that would look into your
> descriptor table for descriptor with number derived from dentry (stashed
> in ->d_fsdata at lookup time) and do what fstat() would.
> * have those dentries always fail ->d_revalidate(), to force
> everything towards ->atomic_open().
> * for ...ctl names, ->atomic_open() would act in normal fashion;
> again, only one inode is needed. ->read() would pick descriptor number
> from ->d_fsdata and report on whatever you have with that number at the
> time.
>
> I'll try to put a prototype of that together; I think it's at least
> interesting to try. And that ought to be safe to mount even in very
> restricted environments, making arguments along the lines of "but we can't
> get the path by opened file without the big bad wol^Wprocfs and we can't
> have that in our environment" much weaker...
>
> Comments?
I'm not convinced that these semantics are better than /proc/self/fd's
in many contexts. I don't really like the idea that catting some file
can *change* the position of one of my open file descriptors. Also,
for execveat in particular, I want to be able to setns into a
completely unknown namespace and exec something, so a new fs won't
help if it's not mounted.
An alternative solution to proc-lite would be to have a heavily
stripped-down variant of /proc. It could self as a real directory
(not a symlink), with the normal semantics for /proc/self/fd, and it
could have very little else (possibly nothing at all; possibly exe,
root, and cwd).
--Andy
\
On 10/19/2014 01:20 PM, Al Viro wrote:
> On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
>
>> For example, I want to be able to reliably do something like nsenter
>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>> it is more or less fully functional, so this should Just Work (tm),
>> except that the toybox binary might not exist in the namespace being
>> entered. If execveat were available, I could rig nsenter or a similar
>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>> execveat.
>
> The question I hadn't seen really answered through all of that was how to
> deal with #!... "Just use d_path()" isn't particulary appealing - if that
> file has a pathname reachable for you, you could bloody well use _that_
> from the very beginning.
>
> Frankly, I wonder if it would make sense to provide something like
> dupfs. We can't mount it by default on /dev/fd (more's the pity), but
> it might be a good thing to have.
>
> What it is, for those who are not familiar with Plan 9: a filesystem with
> one directory and a bunch of files in it. Directory contents depends on
> who's looking; for each opened descriptor in your descriptor table, you'll
> see two files there. One series is 0, 1, ... - opening one of those gives
> dup(). IOW, it's *not* giving you a new struct file; it gives you a new
> reference to existing one, complete with sharing IO position, etc. Another
> is 0ctl, 1ctl, ... - those are read-only and reading from them gives pretty
> much a combination of our /proc/self/fdinfo/n with readlink of /proc/self/fd/n.
>
> It's actually a better match for what one would expect at /dev/fd than what
> we do. Example:
>
Yes, it is really unfortunate that /proc/self/fd/* had the wrong
semantics for the start, due to then-existing implementation issues.
-hpa
On Sun, Oct 19, 2014 at 01:37:54PM -0700, Andy Lutomirski wrote:
> > The question I hadn't seen really answered through all of that was how to
> > deal with #!... "Just use d_path()" isn't particulary appealing - if that
> > file has a pathname reachable for you, you could bloody well use _that_
> > from the very beginning.
>
> Does this matter for absolute paths after #! (or for absolute paths to
> ELF interpreters)? Does anyone use relative paths there?
It's not about what's after #!; it's what we *append* to what's after #!
that is interesting. Recall how #! works - we turn execve() of something
that starts with e.g. "#!/usr/bin/make -f\n" into execve of /usr/bin/make,
with (/usr/bin/make, -f, name of that file, argv[1]..argv[argc]) as
arguments list. With make(1) doing opening and reading the file, as it
would for any makefile. Or /bin/sh opening and reading the script, etc.
Pathname of interpreter is a non-issue (and ELF ones don't go anywhere
near that path anyway).
> Does execve("/proc/self/fd/N", ...) not work correctly now?
Yes, it does. And if procfs is there, this syscall is completely pointless.
The main argument in favour of adding it to the kernel (rather than doing
in userland) has been "but what of the people who don't want procfs mounted
in there?".
> Presumably relative paths should be relative to the execed program, or
> maybe there should be a flag to execveat that disallows that behavior
> entirely, or maybe it should never work, even through /proc. I don't
> really like the idea that an fd pointing at a *file* should allow
> access to its directory.
Huh? What are you talking about? And who the hell cares whether it's
absolute or relative?
> I'm not convinced that these semantics are better than /proc/self/fd's
> in many contexts. I don't really like the idea that catting some file
> can *change* the position of one of my open file descriptors.
Er... You do realize that if descriptor refers to the same file,
cat <&<that descriptor> will change position of your other descriptor,
right? BTW, on *BSD /dev/stdin *does* have a dup()-style semantics.
They do it at the price of very convoluted code in their VFS, but that's
how it works there. And in Plan 9, and (AFAIK) in Solaris...
On Sun, Oct 19, 2014 at 2:29 PM, Al Viro <[email protected]> wrote:
> On Sun, Oct 19, 2014 at 01:37:54PM -0700, Andy Lutomirski wrote:
>> > The question I hadn't seen really answered through all of that was how to
>> > deal with #!... "Just use d_path()" isn't particulary appealing - if that
>> > file has a pathname reachable for you, you could bloody well use _that_
>> > from the very beginning.
>>
>> Does this matter for absolute paths after #! (or for absolute paths to
>> ELF interpreters)? Does anyone use relative paths there?
>
> It's not about what's after #!; it's what we *append* to what's after #!
> that is interesting. Recall how #! works - we turn execve() of something
> that starts with e.g. "#!/usr/bin/make -f\n" into execve of /usr/bin/make,
> with (/usr/bin/make, -f, name of that file, argv[1]..argv[argc]) as
> arguments list. With make(1) doing opening and reading the file, as it
> would for any makefile. Or /bin/sh opening and reading the script, etc.
>
> Pathname of interpreter is a non-issue (and ELF ones don't go anywhere
> near that path anyway).
>
Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
/dev/fd/3? That could be interesting, although I can imagine it
breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
early in boot.
Why does this use case need the dup-style semantics? Merely opening
the same file should work, right?
[deleted my questions based on my misunderstanding of what you were
talking about]
>> I'm not convinced that these semantics are better than /proc/self/fd's
>> in many contexts. I don't really like the idea that catting some file
>> can *change* the position of one of my open file descriptors.
>
> Er... You do realize that if descriptor refers to the same file,
> cat <&<that descriptor> will change position of your other descriptor,
> right? BTW, on *BSD /dev/stdin *does* have a dup()-style semantics.
> They do it at the price of very convoluted code in their VFS, but that's
> how it works there. And in Plan 9, and (AFAIK) in Solaris...
But cat <&3 is shell magic; it isn't actually an invocation of
/bin/cat with some special argument that causes it to poke at some fd.
And bash already supports this kind of use case.
I still don't userstand why the BSD behavior is better here. Also,
this isn't just file *position* -- anything that opens dupfs/N and
uses fcntl with F_SETFL seems likely to cause failures, crashes, or
outright corruption. And there are file locks, which suddenly have
rather odd semantics. (Even with the new OFD locks or with BSD locks,
I wonder how many little programs open argv[1], lock the file, do
something, and then exit. Anything that does that is now leaking a
lock.)
Aside from the general scariness of allowing one process to actually
dup another process's fds, I feel like this is asking for trouble wrt
the various types of file locks.
--Andy
On Sun, Oct 19, 2014 at 03:16:03PM -0700, Andy Lutomirski wrote:
> Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
> /dev/fd/3? That could be interesting, although I can imagine it
> breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
> early in boot.
Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
into argv when it execs the interpreter of #! file. Simply because the
interpreter (which can be anything whatsoever) has no fscking idea what
to do with some descriptor it has before execve(). Hell, it doesn't have
any idea *which* descriptor had it been.
You need to put some pathname that would yield your script upon open(2).
If you bothered to read those patches, you'd see that they do supply
one, generating it with d_path(). Which isn't particulary reliable.
I'm not sure there's any point putting any of that in the kernel - if
you *do* have that pathname, you can just pass it.
> Aside from the general scariness of allowing one process to actually
> dup another process's fds, I feel like this is asking for trouble wrt
> the various types of file locks.
Who said anything about another process's fds? That, indeed, would be
a recipe for serious trouble. It's a filesystem with one directory,
not with one directory for each process...
FWIW, they (Plan 9) do have procfs and there they have /proc/<pid>/fd.
Which is a regular file, with contents consisting of \n-terminated
lines (one per descriptor in <pid>'s descriptor table>) in the same
format as in *ctl (they put descriptor number as the first field in
those).
Unlike our solution, they do not allow to get to any process' files via
procfs. They do allow /dev/stdin-style access to your own files via
dupfs. And yes, for /dev/stdin and friends dup-style semantics is better -
you get consistent behaviours for pipes and redirects from file that way.
See the example I've posted upthread. Besides, for things like sockets
our semantics simply fails - they really depend on having only one
struct file for given socket; it's dup or nothing there. The same goes
for a lot of things like eventfd, etc.
On Sun, Oct 19, 2014 at 3:42 PM, Al Viro <[email protected]> wrote:
> On Sun, Oct 19, 2014 at 03:16:03PM -0700, Andy Lutomirski wrote:
>
>> Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
>> /dev/fd/3? That could be interesting, although I can imagine it
>> breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
>> early in boot.
>
> Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
> into argv when it execs the interpreter of #! file. Simply because the
> interpreter (which can be anything whatsoever) has no fscking idea what
> to do with some descriptor it has before execve(). Hell, it doesn't have
> any idea *which* descriptor had it been.
>
> You need to put some pathname that would yield your script upon open(2).
> If you bothered to read those patches, you'd see that they do supply
> one, generating it with d_path(). Which isn't particulary reliable.
>
> I'm not sure there's any point putting any of that in the kernel - if
> you *do* have that pathname, you can just pass it.
Hmm.
This issue certainly makes fexecve or execveat less attractive, at
least in cases where d_path won't work.
On the other hand, if you want to run a static binary on a cloexec fd
(or, for that matter, a dynamic binary if you trust the interpreter to
close the extra copy of the fd it gets) in a namespace or chroot where
the binary is invisible, then you need kernel help.
It's too bad that script interpreters don't have a mechanism to open
their scripts by fd.
>
>> Aside from the general scariness of allowing one process to actually
>> dup another process's fds, I feel like this is asking for trouble wrt
>> the various types of file locks.
>
> Who said anything about another process's fds? That, indeed, would be
> a recipe for serious trouble. It's a filesystem with one directory,
> not with one directory for each process...
>
This still has issues with locks if you pass an fd to a child process,
but I guess that you get what you ask for if you do that.
> FWIW, they (Plan 9) do have procfs and there they have /proc/<pid>/fd.
> Which is a regular file, with contents consisting of \n-terminated
> lines (one per descriptor in <pid>'s descriptor table>) in the same
> format as in *ctl (they put descriptor number as the first field in
> those).
>
> Unlike our solution, they do not allow to get to any process' files via
> procfs. They do allow /dev/stdin-style access to your own files via
> dupfs. And yes, for /dev/stdin and friends dup-style semantics is better -
> you get consistent behaviours for pipes and redirects from file that way.
> See the example I've posted upthread. Besides, for things like sockets
> our semantics simply fails - they really depend on having only one
> struct file for given socket; it's dup or nothing there. The same goes
> for a lot of things like eventfd, etc.
Fair enough.
--Andy
On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> [Added Eric Biederman, since I think your tree might be a reasonable
>> route forward for these patches.]
>>
>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
>>> Resending, adding cc:linux-api.
>>>
>>> Also, it may help to add a little more background -- this patch is
>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>
>>> Capsicum is a security framework that has been present in FreeBSD since
>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>> security [1].
>>>
>>> One of the features of Capsicum is capability mode, which locks down
>>> access to global namespaces such as the filesystem hierarchy. In
>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>> work -- hence the need for a kernel-space
>>
>> I just found myself wanting this syscall for another reason: injecting
>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>
>> For example, I want to be able to reliably do something like nsenter
>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>> it is more or less fully functional, so this should Just Work (tm),
>> except that the toybox binary might not exist in the namespace being
>> entered. If execveat were available, I could rig nsenter or a similar
>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>> execveat.
>>
>> Is there any reason that these patches can't be merged more or less as
>> is for 3.19?
>
> Yes. There is a silliness in how it implements fexecve. The fexecve
> case should be use the empty string "" not a NULL pointer to indication
> that. That change will then harmonize execveat with the other ...at
> system calls and simplify the code and remove a special case. I believe
> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
Good point -- I'll shift to "" + AT_EMPTY_PATH.
> For sandboxes execveat seems to make a great deal of sense. I can
> get the same functionality by passing in a directory file descriptor
> calling fchdir and execve so this should not introduce any new security
> holes. And using the final file descriptor removes a race.
>
> AT_SYMLINK_NOFOLLOW seems to have some limited utility as well, although
> for exec I don't know what problems it can solve.
>
> Until I am done moving I won't have time to pick this up, and the code
> clearly needs another revision but I will be happy to work to see that
> we get a sane execveat implemented.
If it helps, I can push out another revision in the next couple of days.
> Eric
>
> p.s. I don't believe there are any namespaces issues where doing
> something with execveat flags make sense.
>
>
On Mon, Oct 20, 2014 at 6:48 AM, David Drysdale <[email protected]> wrote:
> On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> [Added Eric Biederman, since I think your tree might be a reasonable
>>> route forward for these patches.]
>>>
>>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
>>>> Resending, adding cc:linux-api.
>>>>
>>>> Also, it may help to add a little more background -- this patch is
>>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>>
>>>> Capsicum is a security framework that has been present in FreeBSD since
>>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>>> security [1].
>>>>
>>>> One of the features of Capsicum is capability mode, which locks down
>>>> access to global namespaces such as the filesystem hierarchy. In
>>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>>> work -- hence the need for a kernel-space
>>>
>>> I just found myself wanting this syscall for another reason: injecting
>>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>>
>>> For example, I want to be able to reliably do something like nsenter
>>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>>> it is more or less fully functional, so this should Just Work (tm),
>>> except that the toybox binary might not exist in the namespace being
>>> entered. If execveat were available, I could rig nsenter or a similar
>>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>>> execveat.
>>>
>>> Is there any reason that these patches can't be merged more or less as
>>> is for 3.19?
>>
>> Yes. There is a silliness in how it implements fexecve. The fexecve
>> case should be use the empty string "" not a NULL pointer to indication
>> that. That change will then harmonize execveat with the other ...at
>> system calls and simplify the code and remove a special case. I believe
>> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>
> Good point -- I'll shift to "" + AT_EMPTY_PATH.
Pending a better idea, I would also see if the patches can be changed
to return an error if d_path ends up with an "(unreachable)" thing
rather than failing inexplicably later on.
--Andy
Andy Lutomirski <[email protected]> writes:
> On Mon, Oct 20, 2014 at 6:48 AM, David Drysdale <[email protected]> wrote:
>> On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>
>>>> [Added Eric Biederman, since I think your tree might be a reasonable
>>>> route forward for these patches.]
>>>>
>>>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
>>>>> Resending, adding cc:linux-api.
>>>>>
>>>>> Also, it may help to add a little more background -- this patch is
>>>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>>>
>>>>> Capsicum is a security framework that has been present in FreeBSD since
>>>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>>>> security [1].
>>>>>
>>>>> One of the features of Capsicum is capability mode, which locks down
>>>>> access to global namespaces such as the filesystem hierarchy. In
>>>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>>>> work -- hence the need for a kernel-space
>>>>
>>>> I just found myself wanting this syscall for another reason: injecting
>>>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>>>
>>>> For example, I want to be able to reliably do something like nsenter
>>>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>>>> it is more or less fully functional, so this should Just Work (tm),
>>>> except that the toybox binary might not exist in the namespace being
>>>> entered. If execveat were available, I could rig nsenter or a similar
>>>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>>>> execveat.
>>>>
>>>> Is there any reason that these patches can't be merged more or less as
>>>> is for 3.19?
>>>
>>> Yes. There is a silliness in how it implements fexecve. The fexecve
>>> case should be use the empty string "" not a NULL pointer to indication
>>> that. That change will then harmonize execveat with the other ...at
>>> system calls and simplify the code and remove a special case. I believe
>>> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>>
>> Good point -- I'll shift to "" + AT_EMPTY_PATH.
>
> Pending a better idea, I would also see if the patches can be changed
> to return an error if d_path ends up with an "(unreachable)" thing
> rather than failing inexplicably later on.
For my reference we are talking about
> @@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - bprm->filename = bprm->interp = filename->name;
> + if (filename && fd == AT_FDCWD) {
> + bprm->filename = filename->name;
> + } else {
> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> + if (!pathbuf) {
> + retval = -ENOMEM;
> + goto out_unmark;
> + }
> + bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
> + if (IS_ERR(bprm->filename)) {
> + retval = PTR_ERR(bprm->filename);
> + goto out_unmark;
> + }
> + }
> + bprm->interp = bprm->filename;
>
> retval = bprm_mm_init(bprm);
> if (retval)
The interesting case for fexecve is when we either don't know what files
are present or we don't want to depend on which files are present.
As Al pointed out d_path really isn't the right solution. It fails when
printing /proc/self/fd/${fd}/${filename->name} would work, and the
"(deleted)" or "(unreachable)" strings are wrong.
The test for today's cases should be:
if ((filename->name[0] == '/') || fd == AT_FDCWD) {
bprm->filename = filename->name;
}
To handle the case where the file descriptor is relevant.
For the case where the file descriptor is relevant let me suggest
setting bprm->filename and bprm->interp to:
/dev/fd/${fd}/${filename->name}
It is more a description of what we have done but as a magic string it
is descriptive. Documetation/devices.txt documents that /dev/fd/ should
exist, making it an unambiguous path. Further these days the kernel
sets the device naming policy in dev, so I think we are strongly safe in
using that path in any event.
I think execveat is interesting in the kernel because the motivating
cases are the cases where anything except a static executable is
uninteresting.
Now it has been suggested creating a dupfs or a mini-proc. I think that
sounds like a nice companion, to the concept of a locked down root.
But I don't think it removes the need for execveat (because we still
have the case where we don't want to care what is mounted, and are happy
to use static executables).
Eric
On Tue, Oct 21, 2014 at 5:29 AM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Mon, Oct 20, 2014 at 6:48 AM, David Drysdale <[email protected]> wrote:
>>> On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
>>> <[email protected]> wrote:
>>>> Andy Lutomirski <[email protected]> writes:
>>>>
>>>>> [Added Eric Biederman, since I think your tree might be a reasonable
>>>>> route forward for these patches.]
>>>>>
>>>>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
>>>>>> Resending, adding cc:linux-api.
>>>>>>
>>>>>> Also, it may help to add a little more background -- this patch is
>>>>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>>>>
>>>>>> Capsicum is a security framework that has been present in FreeBSD since
>>>>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>>>>> security [1].
>>>>>>
>>>>>> One of the features of Capsicum is capability mode, which locks down
>>>>>> access to global namespaces such as the filesystem hierarchy. In
>>>>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>>>>> work -- hence the need for a kernel-space
>>>>>
>>>>> I just found myself wanting this syscall for another reason: injecting
>>>>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>>>>
>>>>> For example, I want to be able to reliably do something like nsenter
>>>>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>>>>> it is more or less fully functional, so this should Just Work (tm),
>>>>> except that the toybox binary might not exist in the namespace being
>>>>> entered. If execveat were available, I could rig nsenter or a similar
>>>>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>>>>> execveat.
>>>>>
>>>>> Is there any reason that these patches can't be merged more or less as
>>>>> is for 3.19?
>>>>
>>>> Yes. There is a silliness in how it implements fexecve. The fexecve
>>>> case should be use the empty string "" not a NULL pointer to indication
>>>> that. That change will then harmonize execveat with the other ...at
>>>> system calls and simplify the code and remove a special case. I believe
>>>> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>>>
>>> Good point -- I'll shift to "" + AT_EMPTY_PATH.
>>
>> Pending a better idea, I would also see if the patches can be changed
>> to return an error if d_path ends up with an "(unreachable)" thing
>> rather than failing inexplicably later on.
>
> For my reference we are talking about
>
>> @@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
>> sched_exec();
>>
>> bprm->file = file;
>> - bprm->filename = bprm->interp = filename->name;
>> + if (filename && fd == AT_FDCWD) {
>> + bprm->filename = filename->name;
>> + } else {
>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> + if (!pathbuf) {
>> + retval = -ENOMEM;
>> + goto out_unmark;
>> + }
>> + bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
>> + if (IS_ERR(bprm->filename)) {
>> + retval = PTR_ERR(bprm->filename);
>> + goto out_unmark;
>> + }
>> + }
>> + bprm->interp = bprm->filename;
>>
>> retval = bprm_mm_init(bprm);
>> if (retval)
>
> The interesting case for fexecve is when we either don't know what files
> are present or we don't want to depend on which files are present.
>
> As Al pointed out d_path really isn't the right solution. It fails when
> printing /proc/self/fd/${fd}/${filename->name} would work, and the
> "(deleted)" or "(unreachable)" strings are wrong.
>
> The test for today's cases should be:
> if ((filename->name[0] == '/') || fd == AT_FDCWD) {
> bprm->filename = filename->name;
> }
>
> To handle the case where the file descriptor is relevant.
(s/relevant/irrelevant)
Yep, good spot.
> For the case where the file descriptor is relevant let me suggest
> setting bprm->filename and bprm->interp to:
>
> /dev/fd/${fd}/${filename->name}
I'll send out an updated patchset with this approach, but I have a slight
reservation. Given that /dev/fd is a symlink to /proc/self/fd, this approach
means that script invocations will always fail on a /proc-less system,
where the previous iteration might have worked.
(As it happens, this isn't a restriction that affects the things I'm
working on, as Capsicum wouldn't allow script invocation anyway.
However, scenarios without /proc were nominally one of the motivating
factors for execveat in the first place...)
> It is more a description of what we have done but as a magic string it
> is descriptive. Documetation/devices.txt documents that /dev/fd/ should
> exist, making it an unambiguous path. Further these days the kernel
> sets the device naming policy in dev, so I think we are strongly safe in
> using that path in any event.
>
> I think execveat is interesting in the kernel because the motivating
> cases are the cases where anything except a static executable is
> uninteresting.
FYI, there is potential in the future for something other than static
executables -- the FreeBSD Capsicum implementation includes changes
to the dynamic linker to get its search path as a list of pre-opened dfds
(in LD_LIBRARY_PATH_FDS) rather than paths.
> Now it has been suggested creating a dupfs or a mini-proc. I think that
> sounds like a nice companion, to the concept of a locked down root.
> But I don't think it removes the need for execveat (because we still
> have the case where we don't want to care what is mounted, and are happy
> to use static executables).
>
> Eric
>
[adding Rich Felker to the Cc list, who has been very interested in a
O_SEARCH implementation for which this would be an important building
block]
On Fri, Oct 17, 2014 at 02:45:03PM -0700, Andy Lutomirski wrote:
> [Added Eric Biederman, since I think your tree might be a reasonable
> route forward for these patches.]
>
> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
> > Resending, adding cc:linux-api.
> >
> > Also, it may help to add a little more background -- this patch is
> > needed as a (small) part of implementing Capsicum in the Linux kernel.
> >
> > Capsicum is a security framework that has been present in FreeBSD since
> > version 9.0 (Jan 2012), and is based on concepts from object-capability
> > security [1].
> >
> > One of the features of Capsicum is capability mode, which locks down
> > access to global namespaces such as the filesystem hierarchy. In
> > capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
> > work -- hence the need for a kernel-space
>
> I just found myself wanting this syscall for another reason: injecting
> programs into sandboxes or otherwise heavily locked-down namespaces.
>
> For example, I want to be able to reliably do something like nsenter
> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
> it is more or less fully functional, so this should Just Work (tm),
> except that the toybox binary might not exist in the namespace being
> entered. If execveat were available, I could rig nsenter or a similar
> tool to open it with O_CLOEXEC, enter the namespace, and then call
> execveat.
>
> Is there any reason that these patches can't be merged more or less as
> is for 3.19?
>
> --Andy
>
> >
> > [1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
> >
> > ------
> >
> > This patch set adds execveat(2) for x86, and is derived from Meredydd
> > Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
> >
> > The primary aim of adding an execveat syscall is to allow an
> > implementation of fexecve(3) that does not rely on the /proc
> > filesystem. The current glibc version of fexecve(3) is implemented
> > via /proc, which causes problems in sandboxed or otherwise restricted
> > environments.
> >
> > Given the desire for a /proc-free fexecve() implementation, HPA
> > suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
> > syscall would be an appropriate generalization.
> >
> > Also, having a new syscall means that it can take a flags argument
> > without back-compatibility concerns. The current implementation just
> > defines the AT_SYMLINK_NOFOLLOW flag, but other flags could be added
> > in future -- for example, flags for new namespaces (as suggested at
> > https://lkml.org/lkml/2006/7/11/474).
> >
> > Related history:
> > - https://lkml.org/lkml/2006/12/27/123 is an example of someone
> > realizing that fexecve() is likely to fail in a chroot environment.
> > - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
> > documenting the /proc requirement of fexecve(3) in its manpage, to
> > "prevent other people from wasting their time".
> > - https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
> > it's not possible to fexecve() a file descriptor for a script with
> > close-on-exec set (which is possible with the implementation here).
> > - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
> > problem where a process that did setuid() could not fexecve()
> > because it no longer had access to /proc/self/fd; this has since
> > been fixed.
> >
> >
> > Changes since Meredydd's v3 patch:
> > - Added a selftest.
> > - Added a man page.
> > - Left open_exec() signature untouched to reduce patch impact
> > elsewhere (as suggested by Al Viro).
> > - Filled in bprm->filename with d_path() into a buffer, to avoid use
> > of potentially-ephemeral dentry->d_name.
> > - Patch against v3.14 (455c6fdbd21916).
> >
> >
> > David Drysdale (2):
> > syscalls,x86: implement execveat() system call
> > syscalls,x86: add selftest for execveat(2)
> >
> > arch/x86/ia32/audit.c | 1 +
> > arch/x86/ia32/ia32entry.S | 1 +
> > arch/x86/kernel/audit_64.c | 1 +
> > arch/x86/kernel/entry_64.S | 28 ++++
> > arch/x86/syscalls/syscall_32.tbl | 1 +
> > arch/x86/syscalls/syscall_64.tbl | 2 +
> > arch/x86/um/sys_call_table_64.c | 1 +
> > fs/exec.c | 153 ++++++++++++++++---
> > include/linux/compat.h | 3 +
> > include/linux/sched.h | 4 +
> > include/linux/syscalls.h | 4 +
> > include/uapi/asm-generic/unistd.h | 4 +-
> > kernel/sys_ni.c | 3 +
> > lib/audit.c | 3 +
> > tools/testing/selftests/Makefile | 1 +
> > tools/testing/selftests/exec/.gitignore | 6 +
> > tools/testing/selftests/exec/Makefile | 32 ++++
> > tools/testing/selftests/exec/execveat.c | 251 ++++++++++++++++++++++++++++++++
> > 18 files changed, 476 insertions(+), 23 deletions(-)
> > create mode 100644 tools/testing/selftests/exec/.gitignore
> > create mode 100644 tools/testing/selftests/exec/Makefile
> > create mode 100644 tools/testing/selftests/exec/execveat.c
> >
> > --
> > 1.9.1.423.g4596e3a
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-api" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
---end quoted text---
On Wed, Oct 22, 2014 at 04:54:05AM -0700, Christoph Hellwig wrote:
> [adding Rich Felker to the Cc list, who has been very interested in a
> O_SEARCH implementation for which this would be an important building
> block]
s/O_SEARCH/O_EXEC/, sorry.
David Drysdale <[email protected]> writes:
> On Tue, Oct 21, 2014 at 5:29 AM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> On Mon, Oct 20, 2014 at 6:48 AM, David Drysdale <[email protected]> wrote:
>>>> On Sun, Oct 19, 2014 at 1:20 AM, Eric W. Biederman
>>>> <[email protected]> wrote:
>>>>> Andy Lutomirski <[email protected]> writes:
>>>>>
>>>>>> [Added Eric Biederman, since I think your tree might be a reasonable
>>>>>> route forward for these patches.]
>>>>>>
>>>>>> On Thu, Jun 5, 2014 at 6:40 AM, David Drysdale <[email protected]> wrote:
>>>>>>> Resending, adding cc:linux-api.
>>>>>>>
>>>>>>> Also, it may help to add a little more background -- this patch is
>>>>>>> needed as a (small) part of implementing Capsicum in the Linux kernel.
>>>>>>>
>>>>>>> Capsicum is a security framework that has been present in FreeBSD since
>>>>>>> version 9.0 (Jan 2012), and is based on concepts from object-capability
>>>>>>> security [1].
>>>>>>>
>>>>>>> One of the features of Capsicum is capability mode, which locks down
>>>>>>> access to global namespaces such as the filesystem hierarchy. In
>>>>>>> capability mode, /proc is thus inaccessible and so fexecve(3) doesn't
>>>>>>> work -- hence the need for a kernel-space
>>>>>>
>>>>>> I just found myself wanting this syscall for another reason: injecting
>>>>>> programs into sandboxes or otherwise heavily locked-down namespaces.
>>>>>>
>>>>>> For example, I want to be able to reliably do something like nsenter
>>>>>> --namespace-flags-here toybox sh. Toybox's shell is unusual in that
>>>>>> it is more or less fully functional, so this should Just Work (tm),
>>>>>> except that the toybox binary might not exist in the namespace being
>>>>>> entered. If execveat were available, I could rig nsenter or a similar
>>>>>> tool to open it with O_CLOEXEC, enter the namespace, and then call
>>>>>> execveat.
>>>>>>
>>>>>> Is there any reason that these patches can't be merged more or less as
>>>>>> is for 3.19?
>>>>>
>>>>> Yes. There is a silliness in how it implements fexecve. The fexecve
>>>>> case should be use the empty string "" not a NULL pointer to indication
>>>>> that. That change will then harmonize execveat with the other ...at
>>>>> system calls and simplify the code and remove a special case. I believe
>>>>> using the empty string "" requires implementing the AT_EMPTY_PATH flag.
>>>>
>>>> Good point -- I'll shift to "" + AT_EMPTY_PATH.
>>>
>>> Pending a better idea, I would also see if the patches can be changed
>>> to return an error if d_path ends up with an "(unreachable)" thing
>>> rather than failing inexplicably later on.
>>
>> For my reference we are talking about
>>
>>> @@ -1489,7 +1524,21 @@ static int do_execve_common(struct filename *filename,
>>> sched_exec();
>>>
>>> bprm->file = file;
>>> - bprm->filename = bprm->interp = filename->name;
>>> + if (filename && fd == AT_FDCWD) {
>>> + bprm->filename = filename->name;
>>> + } else {
>>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>> + if (!pathbuf) {
>>> + retval = -ENOMEM;
>>> + goto out_unmark;
>>> + }
>>> + bprm->filename = d_path(&file->f_path, pathbuf, PATH_MAX);
>>> + if (IS_ERR(bprm->filename)) {
>>> + retval = PTR_ERR(bprm->filename);
>>> + goto out_unmark;
>>> + }
>>> + }
>>> + bprm->interp = bprm->filename;
>>>
>>> retval = bprm_mm_init(bprm);
>>> if (retval)
>>
>> The interesting case for fexecve is when we either don't know what files
>> are present or we don't want to depend on which files are present.
>>
>> As Al pointed out d_path really isn't the right solution. It fails when
>> printing /proc/self/fd/${fd}/${filename->name} would work, and the
>> "(deleted)" or "(unreachable)" strings are wrong.
>>
>> The test for today's cases should be:
>> if ((filename->name[0] == '/') || fd == AT_FDCWD) {
>> bprm->filename = filename->name;
>> }
>>
>> To handle the case where the file descriptor is relevant.
> (s/relevant/irrelevant)
>
> Yep, good spot.
>
>> For the case where the file descriptor is relevant let me suggest
>> setting bprm->filename and bprm->interp to:
>>
>> /dev/fd/${fd}/${filename->name}
>
> I'll send out an updated patchset with this approach, but I have a slight
> reservation. Given that /dev/fd is a symlink to /proc/self/fd, this approach
> means that script invocations will always fail on a /proc-less system,
> where the previous iteration might have worked.
>
> (As it happens, this isn't a restriction that affects the things I'm
> working on, as Capsicum wouldn't allow script invocation anyway.
> However, scenarios without /proc were nominally one of the motivating
> factors for execveat in the first place...)
Which is where's Al Viro's and Peter Anvin's conversation about a
minimal filesystem that can serve the needs of /proc/self/fd comes in.
There are uses for execveat with static executables, so I think execveat
is justified. But having a dupfs that we could potentially mount on
/dev/fd would be interesting. As it is much less of a security
concern than /proc with all of the interfaces it provides.
>> It is more a description of what we have done but as a magic string it
>> is descriptive. Documetation/devices.txt documents that /dev/fd/ should
>> exist, making it an unambiguous path. Further these days the kernel
>> sets the device naming policy in dev, so I think we are strongly safe in
>> using that path in any event.
>>
>> I think execveat is interesting in the kernel because the motivating
>> cases are the cases where anything except a static executable is
>> uninteresting.
>
> FYI, there is potential in the future for something other than static
> executables -- the FreeBSD Capsicum implementation includes changes
> to the dynamic linker to get its search path as a list of pre-opened dfds
> (in LD_LIBRARY_PATH_FDS) rather than paths.
Which still leaves open the question how do you find the dynamic
linker. Is that also a pre-opened dfd?
Using /dev/fd/$N is also the kind of thing that a shell or a script
interpret could special case instead relying on a filesystem node
to exist.
Eric
Hi!
> >> Oh, you mean that #!/usr/bin/make -f would turn into /usr/bin/make
> >> /dev/fd/3? That could be interesting, although I can imagine it
> >> breaking things, especially if /dev/fd/3 isn't set up like that, e.g.
> >> early in boot.
> >
> > Sigh... What I mean is that fexecve(fd, ...) would have to put _something_
> > into argv when it execs the interpreter of #! file. Simply because the
> > interpreter (which can be anything whatsoever) has no fscking idea what
> > to do with some descriptor it has before execve(). Hell, it doesn't have
> > any idea *which* descriptor had it been.
> >
> > You need to put some pathname that would yield your script upon open(2).
> > If you bothered to read those patches, you'd see that they do supply
> > one, generating it with d_path(). Which isn't particulary reliable.
> >
> > I'm not sure there's any point putting any of that in the kernel - if
> > you *do* have that pathname, you can just pass it.
>
> Hmm.
>
> This issue certainly makes fexecve or execveat less attractive, at
> least in cases where d_path won't work.
>
> On the other hand, if you want to run a static binary on a cloexec fd
> (or, for that matter, a dynamic binary if you trust the interpreter to
> close the extra copy of the fd it gets) in a namespace or chroot where
> the binary is invisible, then you need kernel help.
>
> It's too bad that script interpreters don't have a mechanism to open
> their scripts by fd.
Every interpretter depends on /dev/zero... so what about having
/dev/script_im_running? Standard character device, contains whatever
script it should contain...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
On Wed, Oct 22, 2014 at 6:40 PM, Eric W. Biederman
<[email protected]> wrote:
> David Drysdale <[email protected]> writes:
>> On Tue, Oct 21, 2014 at 5:29 AM, Eric W. Biederman
>> <[email protected]> wrote:
>>> It is more a description of what we have done but as a magic string it
>>> is descriptive. Documetation/devices.txt documents that /dev/fd/ should
>>> exist, making it an unambiguous path. Further these days the kernel
>>> sets the device naming policy in dev, so I think we are strongly safe in
>>> using that path in any event.
>>>
>>> I think execveat is interesting in the kernel because the motivating
>>> cases are the cases where anything except a static executable is
>>> uninteresting.
>>
>> FYI, there is potential in the future for something other than static
>> executables -- the FreeBSD Capsicum implementation includes changes
>> to the dynamic linker to get its search path as a list of pre-opened dfds
>> (in LD_LIBRARY_PATH_FDS) rather than paths.
>
> Which still leaves open the question how do you find the dynamic
> linker. Is that also a pre-opened dfd?
I *think* it's still effectively a path lookup of the INTERP header in
the kernel (but the pathname is restricted to be one of a set of
standard pre-registered interpreters).
> Using /dev/fd/$N is also the kind of thing that a shell or a script
> interpret could special case instead relying on a filesystem node
> to exist.
True.