2014-10-22 12:12:06

by David Drysdale

[permalink] [raw]
Subject: [PATCHv5 0/3] syscalls,x86: Add execveat() system call

This patch set adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc
filesystem. The current glibc version of fexecve(3) is implemented
via /proc, which causes problems in sandboxed or otherwise restricted
environments.

Given the desire for a /proc-free fexecve() implementation, HPA
suggested (https://lkml.org/lkml/2006/7/11/556) that an execveat(2)
syscall would be an appropriate generalization.

Also, having a new syscall means that it can take a flags argument
without back-compatibility concerns. The current implementation just
defines the AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other
flags could be added in future -- for example, flags for new namespaces
(as suggested at https://lkml.org/lkml/2006/7/11/474).

Related history:
- https://lkml.org/lkml/2006/12/27/123 is an example of someone
realizing that fexecve() is likely to fail in a chroot environment.
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
documenting the /proc requirement of fexecve(3) in its manpage, to
"prevent other people from wasting their time".
- https://bugzilla.kernel.org/show_bug.cgi?id=74481 documented that
it's not possible to fexecve() a file descriptor for a script with
close-on-exec set (which is possible with the implementation here).
- https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
problem where a process that did setuid() could not fexecve()
because it no longer had access to /proc/self/fd; this has since
been fixed.


Changes since v4, suggested by Eric W. Biederman:
- Use empty filename with AT_EMPTY_PATH flag rather than NULL
pathname to request fexecve-like behaviour.
- Build pathname as "/dev/fd/<fd>/<filename>" (or "/dev/fd/<fd>")
rather than using d_path().
- Patch against v3.17 (bfe01a5ba249)

Changes since Meredydd's v3 patch:
- Added a selftest.
- Added a man page.
- Left open_exec() signature untouched to reduce patch impact
elsewhere (as suggested by Al Viro).
- Filled in bprm->filename with d_path() into a buffer, to avoid use
of potentially-ephemeral dentry->d_name.
- Patch against v3.14 (455c6fdbd21916).


David Drysdale (2):
syscalls,x86: implement execveat() system call
syscalls,x86: add selftest for execveat(2)

arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 +++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 130 ++++++++++++--
fs/namei.c | 2 +-
include/linux/compat.h | 3 +
include/linux/fs.h | 1 +
include/linux/sched.h | 4 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 22 +++
tools/testing/selftests/exec/execveat.c | 295 ++++++++++++++++++++++++++++++++
20 files changed, 497 insertions(+), 16 deletions(-)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c

--
2.1.0.rc2.206.gedb03e5


2014-10-22 11:51:33

by David Drysdale

[permalink] [raw]
Subject: [PATCHv5 2/3] syscalls,x86: add selftest for execveat(2)

Signed-off-by: David Drysdale <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/exec/.gitignore | 6 +
tools/testing/selftests/exec/Makefile | 22 +++
tools/testing/selftests/exec/execveat.c | 295 ++++++++++++++++++++++++++++++++
4 files changed, 324 insertions(+)
create mode 100644 tools/testing/selftests/exec/.gitignore
create mode 100644 tools/testing/selftests/exec/Makefile
create mode 100644 tools/testing/selftests/exec/execveat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 36ff2e4c7b6f..210cf68b3511 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@ TARGETS += powerpc
TARGETS += user
TARGETS += sysctl
TARGETS += firmware
+TARGETS += exec

TARGETS_HOTPLUG = cpu-hotplug
TARGETS_HOTPLUG += memory-hotplug
diff --git a/tools/testing/selftests/exec/.gitignore b/tools/testing/selftests/exec/.gitignore
new file mode 100644
index 000000000000..349a786899a7
--- /dev/null
+++ b/tools/testing/selftests/exec/.gitignore
@@ -0,0 +1,6 @@
+subdir*
+script*
+execveat
+execveat.symlink
+execveat.moved
+execveat.ephemeral
diff --git a/tools/testing/selftests/exec/Makefile b/tools/testing/selftests/exec/Makefile
new file mode 100644
index 000000000000..428cb1d2a06a
--- /dev/null
+++ b/tools/testing/selftests/exec/Makefile
@@ -0,0 +1,22 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = execveat
+DEPS = execveat.symlink script subdir
+all: $(BINARIES) $(DEPS)
+
+subdir:
+ mkdir -p $@
+script:
+ echo '#!/bin/sh' > $@
+ echo 'exit $$*' >> $@
+ chmod +x $@
+execveat.symlink: execveat
+ ln -s -f $< $@
+%: %.c
+ $(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+ ./execveat
+
+clean:
+ rm -rf $(BINARIES) $(DEPS) subdir.moved execveat.moved
diff --git a/tools/testing/selftests/exec/execveat.c b/tools/testing/selftests/exec/execveat.c
new file mode 100644
index 000000000000..1f2722ed3847
--- /dev/null
+++ b/tools/testing/selftests/exec/execveat.c
@@ -0,0 +1,295 @@
+/*
+ * Copyright (c) 2014 Google, Inc.
+ *
+ * Licensed under the terms of the GNU GPL License version 2
+ *
+ * Selftests for execveat(2).
+ */
+
+#define _GNU_SOURCE /* to get O_PATH, AT_EMPTY_PATH */
+#include <sys/sendfile.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static char *envp[] = { "IN_TEST=yes", NULL };
+static char *argv[] = { "execveat", "99", NULL };
+
+static int execveat_(int fd, const char *path, char **argv, char **envp,
+ int flags)
+{
+#ifdef __NR_execveat
+ return syscall(__NR_execveat, fd, path, argv, envp, flags);
+#else
+ errno = -ENOSYS;
+ return -1;
+#endif
+}
+
+#define check_execveat_fail(fd, path, flags, errno) \
+ _check_execveat_fail(fd, path, flags, errno, #errno)
+static int _check_execveat_fail(int fd, const char *path, int flags,
+ int expected_errno, const char *errno_str)
+{
+ int rc;
+
+ errno = 0;
+ printf("Check failure of execveat(%d, '%s', %d) with %s... ",
+ fd, path?:"(null)", flags, errno_str);
+ rc = execveat_(fd, path, argv, envp, flags);
+
+ if (rc > 0) {
+ printf("[FAIL] (unexpected success from execveat(2))\n");
+ return 1;
+ }
+ if (errno != expected_errno) {
+ printf("[FAIL] (expected errno %d (%s) not %d (%s)\n",
+ expected_errno, strerror(expected_errno),
+ errno, strerror(errno));
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+static int check_execveat_invoked_rc(int fd, const char *path, int flags,
+ int expected_rc)
+{
+ int status;
+ int rc;
+ pid_t child;
+
+ printf("Check success of execveat(%d, '%s', %d)... ",
+ fd, path?:"(null)", flags);
+ child = fork();
+ if (child < 0) {
+ printf("[FAIL] (fork() failed)\n");
+ return 1;
+ }
+ if (child == 0) {
+ /* Child: do execveat(). */
+ rc = execveat_(fd, path, argv, envp, flags);
+ printf("[FAIL]: execveat() failed, rc=%d errno=%d (%s)\n",
+ rc, errno, strerror(errno));
+ exit(1); /* should not reach here */
+ }
+ /* Parent: wait for & check child's exit status. */
+ rc = waitpid(child, &status, 0);
+ if (rc != child) {
+ printf("[FAIL] (waitpid(%d,...) returned %d)\n", child, rc);
+ return 1;
+ }
+ if (!WIFEXITED(status)) {
+ printf("[FAIL] (child %d did not exit cleanly, status=%08x)\n",
+ child, status);
+ return 1;
+ }
+ if (WEXITSTATUS(status) != expected_rc) {
+ printf("[FAIL] (child %d exited with %d not %d)\n",
+ child, WEXITSTATUS(status), expected_rc);
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+static int check_execveat(int fd, const char *path, int flags)
+{
+ return check_execveat_invoked_rc(fd, path, flags, 99);
+}
+
+static char *concat(const char *left, const char *right)
+{
+ char *result = malloc(strlen(left) + strlen(right) + 1);
+
+ strcpy(result, left);
+ strcat(result, right);
+ return result;
+}
+
+static int open_or_die(const char *filename, int flags)
+{
+ int fd = open(filename, flags);
+
+ if (fd < 0) {
+ printf("Failed to open '%s'; "
+ "check prerequisites are available\n", filename);
+ exit(1);
+ }
+ return fd;
+}
+
+static int run_tests(void)
+{
+ int fail = 0;
+ char *fullname = realpath("execveat", NULL);
+ char *fullname_script = realpath("script", NULL);
+ char *fullname_symlink = concat(fullname, ".symlink");
+ int subdir_dfd = open_or_die("subdir", O_DIRECTORY|O_RDONLY);
+ int subdir_dfd_ephemeral = open_or_die("subdir.ephemeral",
+ O_DIRECTORY|O_RDONLY);
+ int dot_dfd = open_or_die(".", O_DIRECTORY|O_RDONLY);
+ int dot_dfd_path = open_or_die(".", O_DIRECTORY|O_RDONLY|O_PATH);
+ int fd = open_or_die("execveat", O_RDONLY);
+ int fd_path = open_or_die("execveat", O_RDONLY|O_PATH);
+ int fd_symlink = open_or_die("execveat.symlink", O_RDONLY);
+ int fd_script = open_or_die("script", O_RDONLY);
+ int fd_ephemeral = open_or_die("execveat.ephemeral", O_RDONLY);
+ int fd_script_ephemeral = open_or_die("script.ephemeral", O_RDONLY);
+
+ /* Normal executable file: */
+ /* dfd + path */
+ fail |= check_execveat(subdir_dfd, "../execveat", 0);
+ fail |= check_execveat(dot_dfd, "execveat", 0);
+ fail |= check_execveat(dot_dfd_path, "execveat", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname, 0);
+ /* absolute path with nonsense dfd */
+ fail |= check_execveat(99, fullname, 0);
+ /* fd + no path */
+ fail |= check_execveat(fd, "", AT_EMPTY_PATH);
+
+ /* Mess with executable file that's already open: */
+ /* fd + no path to a file that's been renamed */
+ rename("execveat.ephemeral", "execveat.moved");
+ fail |= check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+ /* fd + no path to a file that's been deleted */
+ unlink("execveat.moved"); /* remove the file now fd open */
+ fail |= check_execveat(fd_ephemeral, "", AT_EMPTY_PATH);
+
+ /* Invalid argument failures */
+ fail |= check_execveat_fail(fd, "", 0, ENOENT);
+ fail |= check_execveat_fail(fd, NULL, AT_EMPTY_PATH, EFAULT);
+
+ /* Symlink to executable file: */
+ /* dfd + path */
+ fail |= check_execveat(dot_dfd, "execveat.symlink", 0);
+ fail |= check_execveat(dot_dfd_path, "execveat.symlink", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname_symlink, 0);
+ /* fd + no path, even with AT_SYMLINK_NOFOLLOW (already followed) */
+ fail |= check_execveat(fd_symlink, "", AT_EMPTY_PATH);
+ fail |= check_execveat(fd_symlink, "",
+ AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+
+ /* Symlink fails when AT_SYMLINK_NOFOLLOW set: */
+ /* dfd + path */
+ fail |= check_execveat_fail(dot_dfd, "execveat.symlink",
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+ fail |= check_execveat_fail(dot_dfd_path, "execveat.symlink",
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+ /* absolute path */
+ fail |= check_execveat_fail(AT_FDCWD, fullname_symlink,
+ AT_SYMLINK_NOFOLLOW, ELOOP);
+
+ /* Shell script wrapping executable file: */
+ /* dfd + path */
+ fail |= check_execveat(subdir_dfd, "../script", 0);
+ fail |= check_execveat(dot_dfd, "script", 0);
+ fail |= check_execveat(dot_dfd_path, "script", 0);
+ /* absolute path */
+ fail |= check_execveat(AT_FDCWD, fullname_script, 0);
+ /* fd + no path */
+ fail |= check_execveat(fd_script, "", AT_EMPTY_PATH);
+ fail |= check_execveat(fd_script, "",
+ AT_EMPTY_PATH|AT_SYMLINK_NOFOLLOW);
+
+ /* Mess with script file that's already open: */
+ /* fd + no path to a file that's been renamed */
+ rename("script.ephemeral", "script.moved");
+ fail |= check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+ /* fd + no path to a file that's been deleted */
+ unlink("script.moved"); /* remove the file while fd open */
+ fail |= check_execveat(fd_script_ephemeral, "", AT_EMPTY_PATH);
+
+ /* Rename a subdirectory in the path: */
+ rename("subdir.ephemeral", "subdir.moved");
+ fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+ fail |= check_execveat(subdir_dfd_ephemeral, "script", 0);
+ /* Remove the subdir and its contents */
+ unlink("subdir.moved/script");
+ unlink("subdir.moved");
+ /* Shell loads via deleted subdir OK because name starts with .. */
+ fail |= check_execveat(subdir_dfd_ephemeral, "../script", 0);
+ fail |= check_execveat_fail(subdir_dfd_ephemeral, "script", 0, ENOENT);
+
+ /* Flag values other than AT_SYMLINK_NOFOLLOW => EINVAL */
+ fail |= check_execveat_fail(dot_dfd, "execveat", 0xFFFF, EINVAL);
+ /* Invalid path => ENOENT */
+ fail |= check_execveat_fail(dot_dfd, "no-such-file", 0, ENOENT);
+ fail |= check_execveat_fail(dot_dfd_path, "no-such-file", 0, ENOENT);
+ fail |= check_execveat_fail(AT_FDCWD, "no-such-file", 0, ENOENT);
+ /* Attempt to execute directory => EACCES */
+ fail |= check_execveat_fail(dot_dfd, "", AT_EMPTY_PATH, EACCES);
+ /* Attempt to execute non-executable => EACCES */
+ fail |= check_execveat_fail(dot_dfd, "Makefile", 0, EACCES);
+ /* Attempt to execute file opened with O_PATH => EBADF */
+ fail |= check_execveat_fail(dot_dfd_path, "", AT_EMPTY_PATH, EBADF);
+ fail |= check_execveat_fail(fd_path, "", AT_EMPTY_PATH, EBADF);
+ /* Attempt to execute nonsense FD => EBADF */
+ fail |= check_execveat_fail(99, "", AT_EMPTY_PATH, EBADF);
+ fail |= check_execveat_fail(99, "execveat", 0, EBADF);
+ /* Attempt to execute relative to non-directory => ENOTDIR */
+ fail |= check_execveat_fail(fd, "execveat", 0, ENOTDIR);
+
+ return fail ? -1 : 0;
+}
+
+void exe_cp(const char *src, const char *dest)
+{
+ int in_fd = open_or_die(src, O_RDONLY);
+ int out_fd = open(dest, O_RDWR|O_CREAT|O_TRUNC, 0755);
+ struct stat info;
+
+ fstat(in_fd, &info);
+ sendfile(out_fd, in_fd, NULL, info.st_size);
+ close(in_fd);
+ close(out_fd);
+}
+
+void prerequisites(void)
+{
+ int fd;
+ const char *script = "#!/bin/sh\nexit $*\n";
+
+ /* Create ephemeral copies of files */
+ exe_cp("execveat", "execveat.ephemeral");
+ exe_cp("script", "script.ephemeral");
+ mkdir("subdir.ephemeral", 0755);
+
+ fd = open("subdir.ephemeral/script", O_RDWR|O_CREAT|O_TRUNC, 0755);
+ write(fd, script, strlen(script));
+ close(fd);
+}
+
+int main(int argc, char **argv)
+{
+ int rc;
+
+ if (argc >= 2) {
+ /* If we are invoked with an argument, exit immediately. */
+ /* Check expected environment transferred. */
+ const char *in_test = getenv("IN_TEST");
+
+ if (!in_test || strcmp(in_test, "yes") != 0) {
+ printf("[FAIL] (no IN_TEST=yes in env)\n");
+ return 1;
+ }
+
+ /* Use the final argument as an exit code. */
+ rc = atoi(argv[argc - 1]);
+ fflush(stdout);
+ } else {
+ prerequisites();
+ rc = run_tests();
+ }
+ return rc;
+}
--
2.1.0.rc2.206.gedb03e5

2014-10-22 11:51:58

by David Drysdale

[permalink] [raw]
Subject: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

Add a new system execveat(2) syscall. execveat() is to execve() as
openat() is to open(): it takes a file descriptor that refers to a
directory, and resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers. This
replicates the functionality of fexecve(), which is a system call in
other UNIXen, but in Linux glibc it depends on opening
"/proc/self/fd/<fd>" (and so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found. This does however mean that
execution of a script in a /proc-less environment won't work.

Only x86-64, i386 and x32 ABIs are supported in this patch.

Based on patches by Meredydd Luff <[email protected]>

Signed-off-by: David Drysdale <[email protected]>
---
arch/x86/ia32/audit.c | 1 +
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/kernel/audit_64.c | 1 +
arch/x86/kernel/entry_64.S | 28 ++++++++
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
arch/x86/um/sys_call_table_64.c | 1 +
fs/exec.c | 130 ++++++++++++++++++++++++++++++++++----
fs/namei.c | 2 +-
include/linux/compat.h | 3 +
include/linux/fs.h | 1 +
include/linux/sched.h | 4 ++
include/linux/syscalls.h | 4 ++
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 3 +
lib/audit.c | 3 +
16 files changed, 173 insertions(+), 16 deletions(-)

diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
index 5d7b381da692..2eccc8932ae6 100644
--- a/arch/x86/ia32/audit.c
+++ b/arch/x86/ia32/audit.c
@@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
case __NR_socketcall:
return 4;
case __NR_execve:
+ case __NR_execveat:
return 5;
default:
return 1;
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 4299eb05023c..2516c09743e0 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -464,6 +464,7 @@ GLOBAL(\label)
PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
PTREGSCALL stub32_sigreturn, sys32_sigreturn
PTREGSCALL stub32_execve, compat_sys_execve
+ PTREGSCALL stub32_execveat, compat_sys_execveat
PTREGSCALL stub32_fork, sys_fork
PTREGSCALL stub32_vfork, sys_vfork

diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
index 06d3e5a14d9d..f3672508b249 100644
--- a/arch/x86/kernel/audit_64.c
+++ b/arch/x86/kernel/audit_64.c
@@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
case __NR_openat:
return 3;
case __NR_execve:
+ case __NR_execveat:
return 5;
default:
return 0;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2fac1343a90b..00c4526e6ffe 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -665,6 +665,20 @@ ENTRY(stub_execve)
CFI_ENDPROC
END(stub_execve)

+ENTRY(stub_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_execveat)
+
/*
* sigreturn is special because it needs to restore all registers on return.
* This cannot be done with SYSRET, so use the IRET return path instead.
@@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
CFI_ENDPROC
END(stub_x32_execve)

+ENTRY(stub_x32_execveat)
+ CFI_STARTPROC
+ addq $8, %rsp
+ PARTIAL_FRAME 0
+ SAVE_REST
+ FIXUP_TOP_OF_STACK %r11
+ call compat_sys_execveat
+ RESTORE_TOP_OF_STACK %r11
+ movq %rax,RAX(%rsp)
+ RESTORE_REST
+ jmp int_ret_from_sys_call
+ CFI_ENDPROC
+END(stub_x32_execveat)
+
#endif

/*
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 028b78168d85..2633e3195455 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -363,3 +363,4 @@
354 i386 seccomp sys_seccomp
355 i386 getrandom sys_getrandom
356 i386 memfd_create sys_memfd_create
+357 i386 execveat sys_execveat stub32_execveat
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 35dd922727b9..1af5badd159c 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -327,6 +327,7 @@
318 common getrandom sys_getrandom
319 common memfd_create sys_memfd_create
320 common kexec_file_load sys_kexec_file_load
+321 64 execveat stub_execveat

#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -365,3 +366,4 @@
542 x32 getsockopt compat_sys_getsockopt
543 x32 io_setup compat_sys_io_setup
544 x32 io_submit compat_sys_io_submit
+545 x32 execveat stub_x32_execveat
diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
index f2f0723070ca..20c3649d0691 100644
--- a/arch/x86/um/sys_call_table_64.c
+++ b/arch/x86/um/sys_call_table_64.c
@@ -31,6 +31,7 @@
#define stub_fork sys_fork
#define stub_vfork sys_vfork
#define stub_execve sys_execve
+#define stub_execveat sys_execveat
#define stub_rt_sigreturn sys_rt_sigreturn

#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
diff --git a/fs/exec.c b/fs/exec.c
index a2b42a98c743..92a6e14f096a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);

#endif /* CONFIG_MMU */

-static struct file *do_open_exec(struct filename *name)
+static struct file *do_open_execat(int fd, struct filename *name, int flags)
{
struct file *file;
int err;
@@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
.intent = LOOKUP_OPEN,
.lookup_flags = LOOKUP_FOLLOW,
};
+ static const struct open_flags open_exec_nofollow_flags = {
+ .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
+ .acc_mode = MAY_EXEC | MAY_OPEN,
+ .intent = LOOKUP_OPEN,
+ .lookup_flags = 0,
+ };

- file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
- if (IS_ERR(file))
- goto out;
+ if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
+ return ERR_PTR(-EINVAL);
+
+ if (name->name[0] != '\0') {
+ const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
+ ? &open_exec_nofollow_flags
+ : &open_exec_flags);
+
+ file = do_filp_open(fd, name, oflags);
+ if (IS_ERR(file))
+ goto out;
+ } else {
+ file = fget(fd);
+ if (!file)
+ return ERR_PTR(-EBADF);
+
+ err = inode_permission(file->f_path.dentry->d_inode,
+ open_exec_flags.acc_mode);
+ if (err)
+ goto exit;
+ }

err = -EACCES;
if (!S_ISREG(file_inode(file)->i_mode))
@@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
goto exit;

- fsnotify_open(file);
-
err = deny_write_access(file);
if (err)
goto exit;

+ if (name->name[0] != '\0')
+ fsnotify_open(file);
+
out:
return file;

@@ -786,7 +811,7 @@ exit:
struct file *open_exec(const char *name)
{
struct filename tmp = { .name = name };
- return do_open_exec(&tmp);
+ return do_open_execat(AT_FDCWD, &tmp, 0);
}
EXPORT_SYMBOL(open_exec);

@@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
/*
* sys_execve() executes a new program.
*/
-static int do_execve_common(struct filename *filename,
- struct user_arg_ptr argv,
- struct user_arg_ptr envp)
+static int do_execveat_common(int fd, struct filename *filename,
+ struct user_arg_ptr argv,
+ struct user_arg_ptr envp,
+ int flags)
{
+ char *pathbuf = NULL;
struct linux_binprm *bprm;
struct file *file;
struct files_struct *displaced;
@@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;

- file = do_open_exec(filename);
+ file = do_open_execat(fd, filename, flags);
retval = PTR_ERR(file);
if (IS_ERR(file))
goto out_unmark;
@@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
sched_exec();

bprm->file = file;
- bprm->filename = bprm->interp = filename->name;
+ if (fd == AT_FDCWD || filename->name[0] == '/') {
+ bprm->filename = filename->name;
+ } else {
+ /*
+ * Build a pathname that reflects how we got to the file,
+ * either "/dev/fd/<fd>" (for an empty filename) or
+ * "/dev/fd/<fd>/<filename>".
+ */
+ pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
+ if (!pathbuf) {
+ retval = -ENOMEM;
+ goto out_unmark;
+ }
+ bprm->filename = pathbuf;
+ if (filename->name[0] == '\0')
+ sprintf(pathbuf, "/dev/fd/%d", fd);
+ else
+ snprintf(pathbuf, PATH_MAX,
+ "/dev/fd/%d/%s", fd, filename->name);
+ }
+ bprm->interp = bprm->filename;

retval = bprm_mm_init(bprm);
if (retval)
@@ -1532,6 +1579,7 @@ out_unmark:

out_free:
free_bprm(bprm);
+ kfree(pathbuf);

out_files:
if (displaced)
@@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
{
struct user_arg_ptr argv = { .ptr.native = __argv };
struct user_arg_ptr envp = { .ptr.native = __envp };
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+int do_execveat(int fd, struct filename *filename,
+ const char __user *const __user *__argv,
+ const char __user *const __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = { .ptr.native = __argv };
+ struct user_arg_ptr envp = { .ptr.native = __envp };
+
+ return do_execveat_common(fd, filename, argv, envp, flags);
}

#ifdef CONFIG_COMPAT
@@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
.is_compat = true,
.ptr.compat = __envp,
};
- return do_execve_common(filename, argv, envp);
+ return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
+}
+
+static int compat_do_execveat(int fd, struct filename *filename,
+ const compat_uptr_t __user *__argv,
+ const compat_uptr_t __user *__envp,
+ int flags)
+{
+ struct user_arg_ptr argv = {
+ .is_compat = true,
+ .ptr.compat = __argv,
+ };
+ struct user_arg_ptr envp = {
+ .is_compat = true,
+ .ptr.compat = __envp,
+ };
+ return do_execveat_common(fd, filename, argv, envp, flags);
}
#endif

@@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
{
return do_execve(getname(filename), argv, envp);
}
+
+SYSCALL_DEFINE5(execveat,
+ int, fd, const char __user *, filename,
+ const char __user *const __user *, argv,
+ const char __user *const __user *, envp,
+ int, flags)
+{
+ int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+ return do_execveat(fd,
+ getname_flags(filename, lookup_flags, NULL),
+ argv, envp, flags);
+}
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
const compat_uptr_t __user *, argv,
@@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
{
return compat_do_execve(getname(filename), argv, envp);
}
+
+COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
+ const char __user *, filename,
+ const compat_uptr_t __user *, argv,
+ const compat_uptr_t __user *, envp,
+ int, flags)
+{
+ int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
+
+ return compat_do_execveat(fd,
+ getname_flags(filename, lookup_flags, NULL),
+ argv, envp, flags);
+}
#endif
diff --git a/fs/namei.c b/fs/namei.c
index a7b05bf82d31..553c84d3e0cc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -130,7 +130,7 @@ void final_putname(struct filename *name)

#define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename))

-static struct filename *
+struct filename *
getname_flags(const char __user *filename, int flags, int *empty)
{
struct filename *result, *err;
diff --git a/include/linux/compat.h b/include/linux/compat.h
index e6494261eaff..7450ca2ac1fc 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);

asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
const compat_uptr_t __user *envp);
+asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
+ const compat_uptr_t __user *argv,
+ const compat_uptr_t __user *envp, int flags);

asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
compat_ulong_t __user *outp, compat_ulong_t __user *exp,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 94187721ad41..e9818574d738 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
extern struct file * dentry_open(const struct path *, int, const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);

+extern struct filename *getname_flags(const char __user *, int, int *);
extern struct filename *getname(const char __user *);
extern struct filename *getname_kernel(const char *);

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b867a4dab38a..33e056da7d33 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
extern int do_execve(struct filename *,
const char __user * const __user *,
const char __user * const __user *);
+extern int do_execveat(int, struct filename *,
+ const char __user * const __user *,
+ const char __user * const __user *,
+ int);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 0f86d85a9ce4..df5422294deb 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
asmlinkage long sys_getrandom(char __user *buf, size_t count,
unsigned int flags);

+asmlinkage long sys_execveat(int dfd, const char __user *filename,
+ const char __user *const __user *argv,
+ const char __user *const __user *envp, int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 11d11bc5c78f..feef07d29663 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
__SYSCALL(__NR_getrandom, sys_getrandom)
#define __NR_memfd_create 279
__SYSCALL(__NR_memfd_create, sys_memfd_create)
+#define __NR_execveat 280
+__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)

#undef __NR_syscalls
-#define __NR_syscalls 280
+#define __NR_syscalls 281

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 391d4ddb6f4b..efb06058ad3e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);

/* operate on Secure Computing state */
cond_syscall(sys_seccomp);
+
+/* execveat */
+cond_syscall(sys_execveat);
diff --git a/lib/audit.c b/lib/audit.c
index 1d726a22565b..b8fb5ee81e26 100644
--- a/lib/audit.c
+++ b/lib/audit.c
@@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
case __NR_socketcall:
return 4;
#endif
+#ifdef __NR_execveat
+ case __NR_execveat:
+#endif
case __NR_execve:
return 5;
default:
--
2.1.0.rc2.206.gedb03e5

2014-10-22 12:09:06

by David Drysdale

[permalink] [raw]
Subject: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)

Signed-off-by: David Drysdale <[email protected]>
---
man2/execveat.2 | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 144 insertions(+)
create mode 100644 man2/execveat.2

diff --git a/man2/execveat.2 b/man2/execveat.2
new file mode 100644
index 000000000000..d19571a3eb9d
--- /dev/null
+++ b/man2/execveat.2
@@ -0,0 +1,144 @@
+.\" Copyright (c) 2014 Google, Inc.
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
+.SH NAME
+execveat \- execute program relative to a directory file descriptor
+.SH SYNOPSIS
+.B #include <unistd.h>
+.sp
+.BI "int execve(int " fd ", const char *" pathname ","
+.br
+.BI " char *const " argv "[], char *const " envp "[],"
+.br
+.BI " int " flags);
+.SH DESCRIPTION
+The
+.BR execveat ()
+system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
+The
+.BR execveat ()
+system call operates in exactly the same way as
+.BR execve (2),
+except for the differences described in this manual page.
+
+If the pathname given in
+.I pathname
+is relative, then it is interpreted relative to the directory
+referred to by the file descriptor
+.I fd
+(rather than relative to the current working directory of
+the calling process, as is done by
+.BR execve (2)
+for a relative pathname).
+
+If
+.I pathname
+is relative and
+.I fd
+is the special value
+.BR AT_FDCWD ,
+then
+.I pathname
+is interpreted relative to the current working
+directory of the calling process (like
+.BR execve (2)).
+
+If
+.I pathname
+is absolute, then
+.I fd
+is ignored.
+
+If
+.I pathname
+is an empty string and the
+.BR AT_EMPTY_PATH
+flag is specified, then the file descriptor
+.I fd
+specifies the file to be executed.
+
+.I flags
+can either be 0, or include the following flags:
+.TP
+.BR AT_EMPTY_PATH
+If
+.I pathname
+is an empty string, operate on the file referred to by
+.IR fd
+(which may have been obtained using the
+.BR open (2)
+.B O_PATH
+flag).
+.TP
+.B AT_SYMLINK_NOFOLLOW
+If the file identified by
+.I fd
+and a non-NULL
+.I pathname
+is a symbolic link, then the call fails with the error
+.BR EINVAL .
+.SH "RETURN VALUE"
+On success,
+.BR execveat ()
+does not return. On error \-1 is returned, and
+.I errno
+is set appropriately.
+.SH ERRORS
+The same errors that occur for
+.BR execve (2)
+can also occur for
+.BR execveat ().
+The following additional errors can occur for
+.BR execveat ():
+.TP
+.B EBADF
+.I fd
+is not a valid file descriptor.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ENOTDIR
+.I pathname
+is relative and
+.I fd
+is a file descriptor referring to a file other than a directory.
+.SH VERSIONS
+.BR execveat ()
+was added to Linux in kernel 3.???.
+.SH NOTES
+In addition to the reasons explained in
+.BR openat (2),
+the
+.BR execveat ()
+system call is also needed to allow
+.BR fexecve (3)
+to be implemented on systems that do not have the
+.I /proc
+filesystem mounted.
+.SH SEE ALSO
+.BR execve (2),
+.BR fexecve (3)
--
2.1.0.rc2.206.gedb03e5

2014-10-22 17:56:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)

David Drysdale <[email protected]> writes:

> Signed-off-by: David Drysdale <[email protected]>
> ---
> man2/execveat.2 | 144 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 144 insertions(+)
> create mode 100644 man2/execveat.2
>
> diff --git a/man2/execveat.2 b/man2/execveat.2
> new file mode 100644
> index 000000000000..d19571a3eb9d
> --- /dev/null
> +++ b/man2/execveat.2
> @@ -0,0 +1,144 @@
> +.\" Copyright (c) 2014 Google, Inc.
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of this
> +.\" manual under the conditions for verbatim copying, provided that the
> +.\" entire resulting derived work is distributed under the terms of a
> +.\" permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume no
> +.\" responsibility for errors or omissions, or for damages resulting from
> +.\" the use of the information contained herein. The author(s) may not
> +.\" have taken the same level of care in the production of this manual,
> +.\" which is licensed free of charge, as they might when working
> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +execveat \- execute program relative to a directory file descriptor
> +.SH SYNOPSIS
> +.B #include <unistd.h>
> +.sp
> +.BI "int execve(int " fd ", const char *" pathname ","
^^^^^^ That needs to be execveat
> +.br
> +.BI " char *const " argv "[], char *const " envp "[],"
> +.br
> +.BI " int " flags);
> +.SH DESCRIPTION
> +The
> +.BR execveat ()
> +system call executes the program pointed to by the combination of \fIfd\fP and \fIpathname\fP.
> +The
> +.BR execveat ()
> +system call operates in exactly the same way as
> +.BR execve (2),
> +except for the differences described in this manual page.
> +
> +If the pathname given in
> +.I pathname
> +is relative, then it is interpreted relative to the directory
> +referred to by the file descriptor
> +.I fd
> +(rather than relative to the current working directory of
> +the calling process, as is done by
> +.BR execve (2)
> +for a relative pathname).
> +
> +If
> +.I pathname
> +is relative and
> +.I fd
> +is the special value
> +.BR AT_FDCWD ,
> +then
> +.I pathname
> +is interpreted relative to the current working
> +directory of the calling process (like
> +.BR execve (2)).
> +
> +If
> +.I pathname
> +is absolute, then
> +.I fd
> +is ignored.
> +
> +If
> +.I pathname
> +is an empty string and the
> +.BR AT_EMPTY_PATH
> +flag is specified, then the file descriptor
> +.I fd
> +specifies the file to be executed.
> +
> +.I flags
> +can either be 0, or include the following flags:
> +.TP
> +.BR AT_EMPTY_PATH
> +If
> +.I pathname
> +is an empty string, operate on the file referred to by
> +.IR fd
> +(which may have been obtained using the
> +.BR open (2)
> +.B O_PATH
> +flag).
> +.TP
> +.B AT_SYMLINK_NOFOLLOW
> +If the file identified by
> +.I fd
> +and a non-NULL
> +.I pathname
> +is a symbolic link, then the call fails with the error
> +.BR EINVAL .
> +.SH "RETURN VALUE"
> +On success,
> +.BR execveat ()
> +does not return. On error \-1 is returned, and
> +.I errno
> +is set appropriately.
> +.SH ERRORS
> +The same errors that occur for
> +.BR execve (2)
> +can also occur for
> +.BR execveat ().
> +The following additional errors can occur for
> +.BR execveat ():
> +.TP
> +.B EBADF
> +.I fd
> +is not a valid file descriptor.
> +.TP
> +.B EINVAL
> +Invalid flag specified in
> +.IR flags .
> +.TP
> +.B ENOTDIR
> +.I pathname
> +is relative and
> +.I fd
> +is a file descriptor referring to a file other than a directory.
> +.SH VERSIONS
> +.BR execveat ()
> +was added to Linux in kernel 3.???.
> +.SH NOTES
> +In addition to the reasons explained in
> +.BR openat (2),
> +the
> +.BR execveat ()
> +system call is also needed to allow
> +.BR fexecve (3)
> +to be implemented on systems that do not have the
> +.I /proc
> +filesystem mounted.
> +.SH SEE ALSO
> +.BR execve (2),
> +.BR fexecve (3)

2014-10-22 18:08:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

David Drysdale <[email protected]> writes:

> Add a new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>
> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
> execveat() executes the file to which the file descriptor refers. This
> replicates the functionality of fexecve(), which is a system call in
> other UNIXen, but in Linux glibc it depends on opening
> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>
> The filename fed to the executed program as argv[0] (or the name of the
> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
> reflecting how the executable was found. This does however mean that
> execution of a script in a /proc-less environment won't work.
>
> Only x86-64, i386 and x32 ABIs are supported in this patch.
>
> Based on patches by Meredydd Luff <[email protected]>
>
> Signed-off-by: David Drysdale <[email protected]>
> ---
> arch/x86/ia32/audit.c | 1 +
> arch/x86/ia32/ia32entry.S | 1 +
> arch/x86/kernel/audit_64.c | 1 +
> arch/x86/kernel/entry_64.S | 28 ++++++++
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 2 +
> arch/x86/um/sys_call_table_64.c | 1 +
> fs/exec.c | 130 ++++++++++++++++++++++++++++++++++----
> fs/namei.c | 2 +-
> include/linux/compat.h | 3 +
> include/linux/fs.h | 1 +
> include/linux/sched.h | 4 ++
> include/linux/syscalls.h | 4 ++
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 3 +
> lib/audit.c | 3 +
> 16 files changed, 173 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
> index 5d7b381da692..2eccc8932ae6 100644
> --- a/arch/x86/ia32/audit.c
> +++ b/arch/x86/ia32/audit.c
> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
> case __NR_socketcall:
> return 4;
> case __NR_execve:
> + case __NR_execveat:
> return 5;
> default:
> return 1;
> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
> index 4299eb05023c..2516c09743e0 100644
> --- a/arch/x86/ia32/ia32entry.S
> +++ b/arch/x86/ia32/ia32entry.S
> @@ -464,6 +464,7 @@ GLOBAL(\label)
> PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
> PTREGSCALL stub32_sigreturn, sys32_sigreturn
> PTREGSCALL stub32_execve, compat_sys_execve
> + PTREGSCALL stub32_execveat, compat_sys_execveat
> PTREGSCALL stub32_fork, sys_fork
> PTREGSCALL stub32_vfork, sys_vfork
>
> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
> index 06d3e5a14d9d..f3672508b249 100644
> --- a/arch/x86/kernel/audit_64.c
> +++ b/arch/x86/kernel/audit_64.c
> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
> case __NR_openat:
> return 3;
> case __NR_execve:
> + case __NR_execveat:
> return 5;
> default:
> return 0;
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 2fac1343a90b..00c4526e6ffe 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
> CFI_ENDPROC
> END(stub_execve)
>
> +ENTRY(stub_execveat)
> + CFI_STARTPROC
> + addq $8, %rsp
> + PARTIAL_FRAME 0
> + SAVE_REST
> + FIXUP_TOP_OF_STACK %r11
> + call sys_execveat
> + RESTORE_TOP_OF_STACK %r11
> + movq %rax,RAX(%rsp)
> + RESTORE_REST
> + jmp int_ret_from_sys_call
> + CFI_ENDPROC
> +END(stub_execveat)
> +
> /*
> * sigreturn is special because it needs to restore all registers on return.
> * This cannot be done with SYSRET, so use the IRET return path instead.
> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
> CFI_ENDPROC
> END(stub_x32_execve)
>
> +ENTRY(stub_x32_execveat)
> + CFI_STARTPROC
> + addq $8, %rsp
> + PARTIAL_FRAME 0
> + SAVE_REST
> + FIXUP_TOP_OF_STACK %r11
> + call compat_sys_execveat
> + RESTORE_TOP_OF_STACK %r11
> + movq %rax,RAX(%rsp)
> + RESTORE_REST
> + jmp int_ret_from_sys_call
> + CFI_ENDPROC
> +END(stub_x32_execveat)
> +
> #endif
>
> /*
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 028b78168d85..2633e3195455 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -363,3 +363,4 @@
> 354 i386 seccomp sys_seccomp
> 355 i386 getrandom sys_getrandom
> 356 i386 memfd_create sys_memfd_create
> +357 i386 execveat sys_execveat stub32_execveat
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 35dd922727b9..1af5badd159c 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -327,6 +327,7 @@
> 318 common getrandom sys_getrandom
> 319 common memfd_create sys_memfd_create
> 320 common kexec_file_load sys_kexec_file_load
> +321 64 execveat stub_execveat
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> @@ -365,3 +366,4 @@
> 542 x32 getsockopt compat_sys_getsockopt
> 543 x32 io_setup compat_sys_io_setup
> 544 x32 io_submit compat_sys_io_submit
> +545 x32 execveat stub_x32_execveat
> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
> index f2f0723070ca..20c3649d0691 100644
> --- a/arch/x86/um/sys_call_table_64.c
> +++ b/arch/x86/um/sys_call_table_64.c
> @@ -31,6 +31,7 @@
> #define stub_fork sys_fork
> #define stub_vfork sys_vfork
> #define stub_execve sys_execve
> +#define stub_execveat sys_execveat
> #define stub_rt_sigreturn sys_rt_sigreturn
>
> #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
> diff --git a/fs/exec.c b/fs/exec.c
> index a2b42a98c743..92a6e14f096a 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
>
> #endif /* CONFIG_MMU */
>
> -static struct file *do_open_exec(struct filename *name)
> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
> {
> struct file *file;
> int err;
> @@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
> .intent = LOOKUP_OPEN,
> .lookup_flags = LOOKUP_FOLLOW,
> };
> + static const struct open_flags open_exec_nofollow_flags = {
> + .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
> + .acc_mode = MAY_EXEC | MAY_OPEN,
> + .intent = LOOKUP_OPEN,
> + .lookup_flags = 0,
> + };
>
> - file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
> - if (IS_ERR(file))
> - goto out;
> + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
> + return ERR_PTR(-EINVAL);
> +
> + if (name->name[0] != '\0') {

Is it really necessary to special case AT_EMPTY_PATH here. I would
have thought the existing logic in namei.c would have been fine
assuning we passed LOOKUP_EMPTY.

> + const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
> + ? &open_exec_nofollow_flags
> + : &open_exec_flags);
> +
> + file = do_filp_open(fd, name, oflags);
> + if (IS_ERR(file))
> + goto out;
> + } else {
> + file = fget(fd);
> + if (!file)
> + return ERR_PTR(-EBADF);
> +
> + err = inode_permission(file->f_path.dentry->d_inode,
> + open_exec_flags.acc_mode);
> + if (err)
> + goto exit;
> + }
>
> err = -EACCES;
> if (!S_ISREG(file_inode(file)->i_mode))
> @@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
> if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
> goto exit;
>
> - fsnotify_open(file);
> -
> err = deny_write_access(file);
> if (err)
> goto exit;
>
> + if (name->name[0] != '\0')
> + fsnotify_open(file);
> +
> out:
> return file;
>
> @@ -786,7 +811,7 @@ exit:
> struct file *open_exec(const char *name)
> {
> struct filename tmp = { .name = name };
> - return do_open_exec(&tmp);
> + return do_open_execat(AT_FDCWD, &tmp, 0);
> }
> EXPORT_SYMBOL(open_exec);
>
> @@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execve_common(struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp)
> +static int do_execveat_common(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> {
> + char *pathbuf = NULL;
> struct linux_binprm *bprm;
> struct file *file;
> struct files_struct *displaced;
> @@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_exec(filename);
> + file = do_open_execat(fd, filename, flags);
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - bprm->filename = bprm->interp = filename->name;
> + if (fd == AT_FDCWD || filename->name[0] == '/') {
> + bprm->filename = filename->name;
> + } else {
> + /*
> + * Build a pathname that reflects how we got to the file,
> + * either "/dev/fd/<fd>" (for an empty filename) or
> + * "/dev/fd/<fd>/<filename>".
> + */
> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> + if (!pathbuf) {
> + retval = -ENOMEM;
> + goto out_unmark;
> + }
> + bprm->filename = pathbuf;
> + if (filename->name[0] == '\0')
> + sprintf(pathbuf, "/dev/fd/%d", fd);
> + else
> + snprintf(pathbuf, PATH_MAX,
> + "/dev/fd/%d/%s", fd, filename->name);
> + }
> + bprm->interp = bprm->filename;
>
> retval = bprm_mm_init(bprm);
> if (retval)
> @@ -1532,6 +1579,7 @@ out_unmark:
>
> out_free:
> free_bprm(bprm);
> + kfree(pathbuf);
>
> out_files:
> if (displaced)
> @@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
> {
> struct user_arg_ptr argv = { .ptr.native = __argv };
> struct user_arg_ptr envp = { .ptr.native = __envp };
> - return do_execve_common(filename, argv, envp);
> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +int do_execveat(int fd, struct filename *filename,
> + const char __user *const __user *__argv,
> + const char __user *const __user *__envp,
> + int flags)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> + return do_execveat_common(fd, filename, argv, envp, flags);
> }
>
> #ifdef CONFIG_COMPAT
> @@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
> .is_compat = true,
> .ptr.compat = __envp,
> };
> - return do_execve_common(filename, argv, envp);
> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
> +}
> +
> +static int compat_do_execveat(int fd, struct filename *filename,
> + const compat_uptr_t __user *__argv,
> + const compat_uptr_t __user *__envp,
> + int flags)
> +{
> + struct user_arg_ptr argv = {
> + .is_compat = true,
> + .ptr.compat = __argv,
> + };
> + struct user_arg_ptr envp = {
> + .is_compat = true,
> + .ptr.compat = __envp,
> + };
> + return do_execveat_common(fd, filename, argv, envp, flags);
> }
> #endif
>
> @@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
> {
> return do_execve(getname(filename), argv, envp);
> }
> +
> +SYSCALL_DEFINE5(execveat,
> + int, fd, const char __user *, filename,
> + const char __user *const __user *, argv,
> + const char __user *const __user *, envp,
> + int, flags)
> +{
> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> + return do_execveat(fd,
> + getname_flags(filename, lookup_flags, NULL),
> + argv, envp, flags);
> +}
> +
> #ifdef CONFIG_COMPAT
> COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
> const compat_uptr_t __user *, argv,
> @@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
> {
> return compat_do_execve(getname(filename), argv, envp);
> }
> +
> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
> + const char __user *, filename,
> + const compat_uptr_t __user *, argv,
> + const compat_uptr_t __user *, envp,
> + int, flags)
> +{
> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
> +
> + return compat_do_execveat(fd,
> + getname_flags(filename, lookup_flags, NULL),
> + argv, envp, flags);
> +}
> #endif
> diff --git a/fs/namei.c b/fs/namei.c
> index a7b05bf82d31..553c84d3e0cc 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>
> #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename))
>
> -static struct filename *
> +struct filename *
> getname_flags(const char __user *filename, int flags, int *empty)
> {
> struct filename *result, *err;
> diff --git a/include/linux/compat.h b/include/linux/compat.h
> index e6494261eaff..7450ca2ac1fc 100644
> --- a/include/linux/compat.h
> +++ b/include/linux/compat.h
> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>
> asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
> const compat_uptr_t __user *envp);
> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
> + const compat_uptr_t __user *argv,
> + const compat_uptr_t __user *envp, int flags);
>
> asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
> compat_ulong_t __user *outp, compat_ulong_t __user *exp,
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 94187721ad41..e9818574d738 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
> extern struct file * dentry_open(const struct path *, int, const struct cred *);
> extern int filp_close(struct file *, fl_owner_t id);
>
> +extern struct filename *getname_flags(const char __user *, int, int *);
> extern struct filename *getname(const char __user *);
> extern struct filename *getname_kernel(const char *);
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b867a4dab38a..33e056da7d33 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
> extern int do_execve(struct filename *,
> const char __user * const __user *,
> const char __user * const __user *);
> +extern int do_execveat(int, struct filename *,
> + const char __user * const __user *,
> + const char __user * const __user *,
> + int);
> extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
> struct task_struct *fork_idle(int);
> extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 0f86d85a9ce4..df5422294deb 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
> asmlinkage long sys_getrandom(char __user *buf, size_t count,
> unsigned int flags);
>
> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
> + const char __user *const __user *argv,
> + const char __user *const __user *envp, int flags);
> +
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 11d11bc5c78f..feef07d29663 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
> __SYSCALL(__NR_getrandom, sys_getrandom)
> #define __NR_memfd_create 279
> __SYSCALL(__NR_memfd_create, sys_memfd_create)
> +#define __NR_execveat 280
> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 280
> +#define __NR_syscalls 281
>
> /*
> * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 391d4ddb6f4b..efb06058ad3e 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>
> /* operate on Secure Computing state */
> cond_syscall(sys_seccomp);
> +
> +/* execveat */
> +cond_syscall(sys_execveat);
> diff --git a/lib/audit.c b/lib/audit.c
> index 1d726a22565b..b8fb5ee81e26 100644
> --- a/lib/audit.c
> +++ b/lib/audit.c
> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
> case __NR_socketcall:
> return 4;
> #endif
> +#ifdef __NR_execveat
> + case __NR_execveat:
> +#endif
> case __NR_execve:
> return 5;
> default:

2014-10-22 18:44:53

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <[email protected]> wrote:
> Add a new system execveat(2) syscall. execveat() is to execve() as
> openat() is to open(): it takes a file descriptor that refers to a
> directory, and resolves the filename relative to that.
>

> bprm->file = file;
> - bprm->filename = bprm->interp = filename->name;
> + if (fd == AT_FDCWD || filename->name[0] == '/') {
> + bprm->filename = filename->name;
> + } else {
> + /*
> + * Build a pathname that reflects how we got to the file,
> + * either "/dev/fd/<fd>" (for an empty filename) or
> + * "/dev/fd/<fd>/<filename>".
> + */
> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
> + if (!pathbuf) {
> + retval = -ENOMEM;
> + goto out_unmark;
> + }
> + bprm->filename = pathbuf;
> + if (filename->name[0] == '\0')
> + sprintf(pathbuf, "/dev/fd/%d", fd);

If the fd is O_CLOEXEC, then this will result in a confused child
process. Should we fail exec attempts like that for non-static
programs? (E.g. set filename to "" or something and fix up the binfmt
drivers to handle that?)

> + else
> + snprintf(pathbuf, PATH_MAX,
> + "/dev/fd/%d/%s", fd, filename->name);

Does this need to handle the case where the result exceeds PATH_MAX?

--Andy

2014-10-22 20:14:03

by David Drysdale

[permalink] [raw]
Subject: Re: [PATCHv5 man-pages 3/3] execveat.2: initial man page for execveat(2)

On Wed, Oct 22, 2014 at 6:55 PM, Eric W. Biederman
<[email protected]> wrote:
> David Drysdale <[email protected]> writes:
>> +.TH EXECVEAT 2 2014-04-02 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +execveat \- execute program relative to a directory file descriptor
>> +.SH SYNOPSIS
>> +.B #include <unistd.h>
>> +.sp
>> +.BI "int execve(int " fd ", const char *" pathname ","
> ^^^^^^ That needs to be execveat

Ah yes, so it does -- thanks for spotting.

2014-10-23 06:41:10

by David Drysdale

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

On Wed, Oct 22, 2014 at 7:07 PM, Eric W. Biederman
<[email protected]> wrote:
> David Drysdale <[email protected]> writes:
>
>> Add a new system execveat(2) syscall. execveat() is to execve() as
>> openat() is to open(): it takes a file descriptor that refers to a
>> directory, and resolves the filename relative to that.
>>
>> In addition, if the filename is empty and AT_EMPTY_PATH is specified,
>> execveat() executes the file to which the file descriptor refers. This
>> replicates the functionality of fexecve(), which is a system call in
>> other UNIXen, but in Linux glibc it depends on opening
>> "/proc/self/fd/<fd>" (and so relies on /proc being mounted).
>>
>> The filename fed to the executed program as argv[0] (or the name of the
>> script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
>> (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
>> reflecting how the executable was found. This does however mean that
>> execution of a script in a /proc-less environment won't work.
>>
>> Only x86-64, i386 and x32 ABIs are supported in this patch.
>>
>> Based on patches by Meredydd Luff <[email protected]>
>>
>> Signed-off-by: David Drysdale <[email protected]>
>> ---
>> arch/x86/ia32/audit.c | 1 +
>> arch/x86/ia32/ia32entry.S | 1 +
>> arch/x86/kernel/audit_64.c | 1 +
>> arch/x86/kernel/entry_64.S | 28 ++++++++
>> arch/x86/syscalls/syscall_32.tbl | 1 +
>> arch/x86/syscalls/syscall_64.tbl | 2 +
>> arch/x86/um/sys_call_table_64.c | 1 +
>> fs/exec.c | 130 ++++++++++++++++++++++++++++++++++----
>> fs/namei.c | 2 +-
>> include/linux/compat.h | 3 +
>> include/linux/fs.h | 1 +
>> include/linux/sched.h | 4 ++
>> include/linux/syscalls.h | 4 ++
>> include/uapi/asm-generic/unistd.h | 4 +-
>> kernel/sys_ni.c | 3 +
>> lib/audit.c | 3 +
>> 16 files changed, 173 insertions(+), 16 deletions(-)
>>
>> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
>> index 5d7b381da692..2eccc8932ae6 100644
>> --- a/arch/x86/ia32/audit.c
>> +++ b/arch/x86/ia32/audit.c
>> @@ -35,6 +35,7 @@ int ia32_classify_syscall(unsigned syscall)
>> case __NR_socketcall:
>> return 4;
>> case __NR_execve:
>> + case __NR_execveat:
>> return 5;
>> default:
>> return 1;
>> diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
>> index 4299eb05023c..2516c09743e0 100644
>> --- a/arch/x86/ia32/ia32entry.S
>> +++ b/arch/x86/ia32/ia32entry.S
>> @@ -464,6 +464,7 @@ GLOBAL(\label)
>> PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn
>> PTREGSCALL stub32_sigreturn, sys32_sigreturn
>> PTREGSCALL stub32_execve, compat_sys_execve
>> + PTREGSCALL stub32_execveat, compat_sys_execveat
>> PTREGSCALL stub32_fork, sys_fork
>> PTREGSCALL stub32_vfork, sys_vfork
>>
>> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
>> index 06d3e5a14d9d..f3672508b249 100644
>> --- a/arch/x86/kernel/audit_64.c
>> +++ b/arch/x86/kernel/audit_64.c
>> @@ -50,6 +50,7 @@ int audit_classify_syscall(int abi, unsigned syscall)
>> case __NR_openat:
>> return 3;
>> case __NR_execve:
>> + case __NR_execveat:
>> return 5;
>> default:
>> return 0;
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 2fac1343a90b..00c4526e6ffe 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -665,6 +665,20 @@ ENTRY(stub_execve)
>> CFI_ENDPROC
>> END(stub_execve)
>>
>> +ENTRY(stub_execveat)
>> + CFI_STARTPROC
>> + addq $8, %rsp
>> + PARTIAL_FRAME 0
>> + SAVE_REST
>> + FIXUP_TOP_OF_STACK %r11
>> + call sys_execveat
>> + RESTORE_TOP_OF_STACK %r11
>> + movq %rax,RAX(%rsp)
>> + RESTORE_REST
>> + jmp int_ret_from_sys_call
>> + CFI_ENDPROC
>> +END(stub_execveat)
>> +
>> /*
>> * sigreturn is special because it needs to restore all registers on return.
>> * This cannot be done with SYSRET, so use the IRET return path instead.
>> @@ -710,6 +724,20 @@ ENTRY(stub_x32_execve)
>> CFI_ENDPROC
>> END(stub_x32_execve)
>>
>> +ENTRY(stub_x32_execveat)
>> + CFI_STARTPROC
>> + addq $8, %rsp
>> + PARTIAL_FRAME 0
>> + SAVE_REST
>> + FIXUP_TOP_OF_STACK %r11
>> + call compat_sys_execveat
>> + RESTORE_TOP_OF_STACK %r11
>> + movq %rax,RAX(%rsp)
>> + RESTORE_REST
>> + jmp int_ret_from_sys_call
>> + CFI_ENDPROC
>> +END(stub_x32_execveat)
>> +
>> #endif
>>
>> /*
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index 028b78168d85..2633e3195455 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -363,3 +363,4 @@
>> 354 i386 seccomp sys_seccomp
>> 355 i386 getrandom sys_getrandom
>> 356 i386 memfd_create sys_memfd_create
>> +357 i386 execveat sys_execveat stub32_execveat
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index 35dd922727b9..1af5badd159c 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -327,6 +327,7 @@
>> 318 common getrandom sys_getrandom
>> 319 common memfd_create sys_memfd_create
>> 320 common kexec_file_load sys_kexec_file_load
>> +321 64 execveat stub_execveat
>>
>> #
>> # x32-specific system call numbers start at 512 to avoid cache impact
>> @@ -365,3 +366,4 @@
>> 542 x32 getsockopt compat_sys_getsockopt
>> 543 x32 io_setup compat_sys_io_setup
>> 544 x32 io_submit compat_sys_io_submit
>> +545 x32 execveat stub_x32_execveat
>> diff --git a/arch/x86/um/sys_call_table_64.c b/arch/x86/um/sys_call_table_64.c
>> index f2f0723070ca..20c3649d0691 100644
>> --- a/arch/x86/um/sys_call_table_64.c
>> +++ b/arch/x86/um/sys_call_table_64.c
>> @@ -31,6 +31,7 @@
>> #define stub_fork sys_fork
>> #define stub_vfork sys_vfork
>> #define stub_execve sys_execve
>> +#define stub_execveat sys_execveat
>> #define stub_rt_sigreturn sys_rt_sigreturn
>>
>> #define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
>> diff --git a/fs/exec.c b/fs/exec.c
>> index a2b42a98c743..92a6e14f096a 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -747,7 +747,7 @@ EXPORT_SYMBOL(setup_arg_pages);
>>
>> #endif /* CONFIG_MMU */
>>
>> -static struct file *do_open_exec(struct filename *name)
>> +static struct file *do_open_execat(int fd, struct filename *name, int flags)
>> {
>> struct file *file;
>> int err;
>> @@ -757,10 +757,34 @@ static struct file *do_open_exec(struct filename *name)
>> .intent = LOOKUP_OPEN,
>> .lookup_flags = LOOKUP_FOLLOW,
>> };
>> + static const struct open_flags open_exec_nofollow_flags = {
>> + .open_flag = O_LARGEFILE | O_RDONLY | __FMODE_EXEC,
>> + .acc_mode = MAY_EXEC | MAY_OPEN,
>> + .intent = LOOKUP_OPEN,
>> + .lookup_flags = 0,
>> + };
>>
>> - file = do_filp_open(AT_FDCWD, name, &open_exec_flags);
>> - if (IS_ERR(file))
>> - goto out;
>> + if ((flags & ~(AT_SYMLINK_NOFOLLOW | AT_EMPTY_PATH)) != 0)
>> + return ERR_PTR(-EINVAL);
>> +
>> + if (name->name[0] != '\0') {
>
> Is it really necessary to special case AT_EMPTY_PATH here. I would
> have thought the existing logic in namei.c would have been fine
> assuning we passed LOOKUP_EMPTY.

Just using do_filp_open() throughout looks mostly plausible on a quick
experiment, but my initial version appears to make O_PATH fds unexpectedly
fexecve()-able (I'm glad I had a test case for that).

I'll look for a way around that, hopefully without an explicit special case.

>> + const struct open_flags *oflags = ((flags & AT_SYMLINK_NOFOLLOW)
>> + ? &open_exec_nofollow_flags
>> + : &open_exec_flags);
>> +
>> + file = do_filp_open(fd, name, oflags);
>> + if (IS_ERR(file))
>> + goto out;
>> + } else {
>> + file = fget(fd);
>> + if (!file)
>> + return ERR_PTR(-EBADF);
>> +
>> + err = inode_permission(file->f_path.dentry->d_inode,
>> + open_exec_flags.acc_mode);
>> + if (err)
>> + goto exit;
>> + }
>>
>> err = -EACCES;
>> if (!S_ISREG(file_inode(file)->i_mode))
>> @@ -769,12 +793,13 @@ static struct file *do_open_exec(struct filename *name)
>> if (file->f_path.mnt->mnt_flags & MNT_NOEXEC)
>> goto exit;
>>
>> - fsnotify_open(file);
>> -
>> err = deny_write_access(file);
>> if (err)
>> goto exit;
>>
>> + if (name->name[0] != '\0')
>> + fsnotify_open(file);
>> +
>> out:
>> return file;
>>
>> @@ -786,7 +811,7 @@ exit:
>> struct file *open_exec(const char *name)
>> {
>> struct filename tmp = { .name = name };
>> - return do_open_exec(&tmp);
>> + return do_open_execat(AT_FDCWD, &tmp, 0);
>> }
>> EXPORT_SYMBOL(open_exec);
>>
>> @@ -1422,10 +1447,12 @@ static int exec_binprm(struct linux_binprm *bprm)
>> /*
>> * sys_execve() executes a new program.
>> */
>> -static int do_execve_common(struct filename *filename,
>> - struct user_arg_ptr argv,
>> - struct user_arg_ptr envp)
>> +static int do_execveat_common(int fd, struct filename *filename,
>> + struct user_arg_ptr argv,
>> + struct user_arg_ptr envp,
>> + int flags)
>> {
>> + char *pathbuf = NULL;
>> struct linux_binprm *bprm;
>> struct file *file;
>> struct files_struct *displaced;
>> @@ -1466,7 +1493,7 @@ static int do_execve_common(struct filename *filename,
>> check_unsafe_exec(bprm);
>> current->in_execve = 1;
>>
>> - file = do_open_exec(filename);
>> + file = do_open_execat(fd, filename, flags);
>> retval = PTR_ERR(file);
>> if (IS_ERR(file))
>> goto out_unmark;
>> @@ -1474,7 +1501,27 @@ static int do_execve_common(struct filename *filename,
>> sched_exec();
>>
>> bprm->file = file;
>> - bprm->filename = bprm->interp = filename->name;
>> + if (fd == AT_FDCWD || filename->name[0] == '/') {
>> + bprm->filename = filename->name;
>> + } else {
>> + /*
>> + * Build a pathname that reflects how we got to the file,
>> + * either "/dev/fd/<fd>" (for an empty filename) or
>> + * "/dev/fd/<fd>/<filename>".
>> + */
>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> + if (!pathbuf) {
>> + retval = -ENOMEM;
>> + goto out_unmark;
>> + }
>> + bprm->filename = pathbuf;
>> + if (filename->name[0] == '\0')
>> + sprintf(pathbuf, "/dev/fd/%d", fd);
>> + else
>> + snprintf(pathbuf, PATH_MAX,
>> + "/dev/fd/%d/%s", fd, filename->name);
>> + }
>> + bprm->interp = bprm->filename;
>>
>> retval = bprm_mm_init(bprm);
>> if (retval)
>> @@ -1532,6 +1579,7 @@ out_unmark:
>>
>> out_free:
>> free_bprm(bprm);
>> + kfree(pathbuf);
>>
>> out_files:
>> if (displaced)
>> @@ -1547,7 +1595,18 @@ int do_execve(struct filename *filename,
>> {
>> struct user_arg_ptr argv = { .ptr.native = __argv };
>> struct user_arg_ptr envp = { .ptr.native = __envp };
>> - return do_execve_common(filename, argv, envp);
>> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
>> +}
>> +
>> +int do_execveat(int fd, struct filename *filename,
>> + const char __user *const __user *__argv,
>> + const char __user *const __user *__envp,
>> + int flags)
>> +{
>> + struct user_arg_ptr argv = { .ptr.native = __argv };
>> + struct user_arg_ptr envp = { .ptr.native = __envp };
>> +
>> + return do_execveat_common(fd, filename, argv, envp, flags);
>> }
>>
>> #ifdef CONFIG_COMPAT
>> @@ -1563,7 +1622,23 @@ static int compat_do_execve(struct filename *filename,
>> .is_compat = true,
>> .ptr.compat = __envp,
>> };
>> - return do_execve_common(filename, argv, envp);
>> + return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
>> +}
>> +
>> +static int compat_do_execveat(int fd, struct filename *filename,
>> + const compat_uptr_t __user *__argv,
>> + const compat_uptr_t __user *__envp,
>> + int flags)
>> +{
>> + struct user_arg_ptr argv = {
>> + .is_compat = true,
>> + .ptr.compat = __argv,
>> + };
>> + struct user_arg_ptr envp = {
>> + .is_compat = true,
>> + .ptr.compat = __envp,
>> + };
>> + return do_execveat_common(fd, filename, argv, envp, flags);
>> }
>> #endif
>>
>> @@ -1603,6 +1678,20 @@ SYSCALL_DEFINE3(execve,
>> {
>> return do_execve(getname(filename), argv, envp);
>> }
>> +
>> +SYSCALL_DEFINE5(execveat,
>> + int, fd, const char __user *, filename,
>> + const char __user *const __user *, argv,
>> + const char __user *const __user *, envp,
>> + int, flags)
>> +{
>> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
>> +
>> + return do_execveat(fd,
>> + getname_flags(filename, lookup_flags, NULL),
>> + argv, envp, flags);
>> +}
>> +
>> #ifdef CONFIG_COMPAT
>> COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>> const compat_uptr_t __user *, argv,
>> @@ -1610,4 +1699,17 @@ COMPAT_SYSCALL_DEFINE3(execve, const char __user *, filename,
>> {
>> return compat_do_execve(getname(filename), argv, envp);
>> }
>> +
>> +COMPAT_SYSCALL_DEFINE5(execveat, int, fd,
>> + const char __user *, filename,
>> + const compat_uptr_t __user *, argv,
>> + const compat_uptr_t __user *, envp,
>> + int, flags)
>> +{
>> + int lookup_flags = (flags & AT_EMPTY_PATH) ? LOOKUP_EMPTY : 0;
>> +
>> + return compat_do_execveat(fd,
>> + getname_flags(filename, lookup_flags, NULL),
>> + argv, envp, flags);
>> +}
>> #endif
>> diff --git a/fs/namei.c b/fs/namei.c
>> index a7b05bf82d31..553c84d3e0cc 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -130,7 +130,7 @@ void final_putname(struct filename *name)
>>
>> #define EMBEDDED_NAME_MAX (PATH_MAX - sizeof(struct filename))
>>
>> -static struct filename *
>> +struct filename *
>> getname_flags(const char __user *filename, int flags, int *empty)
>> {
>> struct filename *result, *err;
>> diff --git a/include/linux/compat.h b/include/linux/compat.h
>> index e6494261eaff..7450ca2ac1fc 100644
>> --- a/include/linux/compat.h
>> +++ b/include/linux/compat.h
>> @@ -357,6 +357,9 @@ asmlinkage long compat_sys_lseek(unsigned int, compat_off_t, unsigned int);
>>
>> asmlinkage long compat_sys_execve(const char __user *filename, const compat_uptr_t __user *argv,
>> const compat_uptr_t __user *envp);
>> +asmlinkage long compat_sys_execveat(int dfd, const char __user *filename,
>> + const compat_uptr_t __user *argv,
>> + const compat_uptr_t __user *envp, int flags);
>>
>> asmlinkage long compat_sys_select(int n, compat_ulong_t __user *inp,
>> compat_ulong_t __user *outp, compat_ulong_t __user *exp,
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 94187721ad41..e9818574d738 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -2060,6 +2060,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
>> extern struct file * dentry_open(const struct path *, int, const struct cred *);
>> extern int filp_close(struct file *, fl_owner_t id);
>>
>> +extern struct filename *getname_flags(const char __user *, int, int *);
>> extern struct filename *getname(const char __user *);
>> extern struct filename *getname_kernel(const char *);
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index b867a4dab38a..33e056da7d33 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -2430,6 +2430,10 @@ extern void do_group_exit(int);
>> extern int do_execve(struct filename *,
>> const char __user * const __user *,
>> const char __user * const __user *);
>> +extern int do_execveat(int, struct filename *,
>> + const char __user * const __user *,
>> + const char __user * const __user *,
>> + int);
>> extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
>> struct task_struct *fork_idle(int);
>> extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags);
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index 0f86d85a9ce4..df5422294deb 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -876,4 +876,8 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>> asmlinkage long sys_getrandom(char __user *buf, size_t count,
>> unsigned int flags);
>>
>> +asmlinkage long sys_execveat(int dfd, const char __user *filename,
>> + const char __user *const __user *argv,
>> + const char __user *const __user *envp, int flags);
>> +
>> #endif
>> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
>> index 11d11bc5c78f..feef07d29663 100644
>> --- a/include/uapi/asm-generic/unistd.h
>> +++ b/include/uapi/asm-generic/unistd.h
>> @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp)
>> __SYSCALL(__NR_getrandom, sys_getrandom)
>> #define __NR_memfd_create 279
>> __SYSCALL(__NR_memfd_create, sys_memfd_create)
>> +#define __NR_execveat 280
>> +__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
>>
>> #undef __NR_syscalls
>> -#define __NR_syscalls 280
>> +#define __NR_syscalls 281
>>
>> /*
>> * All syscalls below here should go away really,
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index 391d4ddb6f4b..efb06058ad3e 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp);
>>
>> /* operate on Secure Computing state */
>> cond_syscall(sys_seccomp);
>> +
>> +/* execveat */
>> +cond_syscall(sys_execveat);
>> diff --git a/lib/audit.c b/lib/audit.c
>> index 1d726a22565b..b8fb5ee81e26 100644
>> --- a/lib/audit.c
>> +++ b/lib/audit.c
>> @@ -54,6 +54,9 @@ int audit_classify_syscall(int abi, unsigned syscall)
>> case __NR_socketcall:
>> return 4;
>> #endif
>> +#ifdef __NR_execveat
>> + case __NR_execveat:
>> +#endif
>> case __NR_execve:
>> return 5;
>> default:

2014-10-27 18:04:11

by David Drysdale

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <[email protected]> wrote:
>> Add a new system execveat(2) syscall. execveat() is to execve() as
>> openat() is to open(): it takes a file descriptor that refers to a
>> directory, and resolves the filename relative to that.
>>
>
>> bprm->file = file;
>> - bprm->filename = bprm->interp = filename->name;
>> + if (fd == AT_FDCWD || filename->name[0] == '/') {
>> + bprm->filename = filename->name;
>> + } else {
>> + /*
>> + * Build a pathname that reflects how we got to the file,
>> + * either "/dev/fd/<fd>" (for an empty filename) or
>> + * "/dev/fd/<fd>/<filename>".
>> + */
>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>> + if (!pathbuf) {
>> + retval = -ENOMEM;
>> + goto out_unmark;
>> + }
>> + bprm->filename = pathbuf;
>> + if (filename->name[0] == '\0')
>> + sprintf(pathbuf, "/dev/fd/%d", fd);
>
> If the fd is O_CLOEXEC, then this will result in a confused child
> process. Should we fail exec attempts like that for non-static
> programs? (E.g. set filename to "" or something and fix up the binfmt
> drivers to handle that?)

Isn't it just scripts that get confused here (as normal executables don't
get to see brpm->filename)?

Given that we don't know which we have at this point, I'd suggest
carrying on regardless. Or we could fall back to use the previous
best-effort d_path() code for O_CLOEXEC fds. Thoughts?

>> + else
>> + snprintf(pathbuf, PATH_MAX,
>> + "/dev/fd/%d/%s", fd, filename->name);
>
> Does this need to handle the case where the result exceeds PATH_MAX?

I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
possibility of failure, but that just defers the inevitable -- the interpreter
won't be able to open the script file anyway. But it would at least then
generate the appropriate error (ENAMETOOLONG rather than ENOENT).

2014-10-27 18:47:52

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

On Mon, Oct 27, 2014 at 11:03 AM, David Drysdale <[email protected]> wrote:
> On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <[email protected]> wrote:
>> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <[email protected]> wrote:
>>> Add a new system execveat(2) syscall. execveat() is to execve() as
>>> openat() is to open(): it takes a file descriptor that refers to a
>>> directory, and resolves the filename relative to that.
>>>
>>
>>> bprm->file = file;
>>> - bprm->filename = bprm->interp = filename->name;
>>> + if (fd == AT_FDCWD || filename->name[0] == '/') {
>>> + bprm->filename = filename->name;
>>> + } else {
>>> + /*
>>> + * Build a pathname that reflects how we got to the file,
>>> + * either "/dev/fd/<fd>" (for an empty filename) or
>>> + * "/dev/fd/<fd>/<filename>".
>>> + */
>>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>> + if (!pathbuf) {
>>> + retval = -ENOMEM;
>>> + goto out_unmark;
>>> + }
>>> + bprm->filename = pathbuf;
>>> + if (filename->name[0] == '\0')
>>> + sprintf(pathbuf, "/dev/fd/%d", fd);
>>
>> If the fd is O_CLOEXEC, then this will result in a confused child
>> process. Should we fail exec attempts like that for non-static
>> programs? (E.g. set filename to "" or something and fix up the binfmt
>> drivers to handle that?)
>
> Isn't it just scripts that get confused here (as normal executables don't
> get to see brpm->filename)?
>
> Given that we don't know which we have at this point, I'd suggest
> carrying on regardless. Or we could fall back to use the previous
> best-effort d_path() code for O_CLOEXEC fds. Thoughts?

How hard would it be to mark the bprm as not having a path for the
binary? Then we could fail later on if and when we actually need the
path.

I don't really have a strong opinion here, though. I do prefer
actually failing the execveat call over succeeding but invoking a
script interpreter than can't possibly work.

>
>>> + else
>>> + snprintf(pathbuf, PATH_MAX,
>>> + "/dev/fd/%d/%s", fd, filename->name);
>>
>> Does this need to handle the case where the result exceeds PATH_MAX?
>
> I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
> possibility of failure, but that just defers the inevitable -- the interpreter
> won't be able to open the script file anyway. But it would at least then
> generate the appropriate error (ENAMETOOLONG rather than ENOENT).

Depends whether anyone cares about bprm->filename. But I think the
code should either return an error or allocate enough space.


--
Andy Lutomirski
AMA Capital Management, LLC

2014-10-28 17:30:33

by David Drysdale

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

[Oops, re-send remembering to turn on plaintext mode -- sorry]

On Mon, Oct 27, 2014 at 6:47 PM, Andy Lutomirski <[email protected]> wrote:
> On Mon, Oct 27, 2014 at 11:03 AM, David Drysdale <[email protected]> wrote:
>> On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <[email protected]> wrote:
>>> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <[email protected]> wrote:
>>>> Add a new system execveat(2) syscall. execveat() is to execve() as
>>>> openat() is to open(): it takes a file descriptor that refers to a
>>>> directory, and resolves the filename relative to that.
>>>>
>>>
>>>> bprm->file = file;
>>>> - bprm->filename = bprm->interp = filename->name;
>>>> + if (fd == AT_FDCWD || filename->name[0] == '/') {
>>>> + bprm->filename = filename->name;
>>>> + } else {
>>>> + /*
>>>> + * Build a pathname that reflects how we got to the file,
>>>> + * either "/dev/fd/<fd>" (for an empty filename) or
>>>> + * "/dev/fd/<fd>/<filename>".
>>>> + */
>>>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>>> + if (!pathbuf) {
>>>> + retval = -ENOMEM;
>>>> + goto out_unmark;
>>>> + }
>>>> + bprm->filename = pathbuf;
>>>> + if (filename->name[0] == '\0')
>>>> + sprintf(pathbuf, "/dev/fd/%d", fd);
>>>
>>> If the fd is O_CLOEXEC, then this will result in a confused child
>>> process. Should we fail exec attempts like that for non-static
>>> programs? (E.g. set filename to "" or something and fix up the binfmt
>>> drivers to handle that?)
>>
>> Isn't it just scripts that get confused here (as normal executables don't
>> get to see brpm->filename)?
>>
>> Given that we don't know which we have at this point, I'd suggest
>> carrying on regardless. Or we could fall back to use the previous
>> best-effort d_path() code for O_CLOEXEC fds. Thoughts?
>
> How hard would it be to mark the bprm as not having a path for the
> binary? Then we could fail later on if and when we actually need the
> path.

Adding a flag to bprm->interp_flags to indicate that the bprm->filename
will be inaccessible after exec is straightforward. But I'm not sure who
should/could make use of the flag...

> I don't really have a strong opinion here, though. I do prefer
> actually failing the execveat call over succeeding but invoking a
> script interpreter than can't possibly work.

Yeah, but that involves the kernel code (e.g. fs/binfmt_script.c) making
an assumption about what the interpreter is going to do -- specifically
that it's going to try to open its argv[1]. Admittedly, that's a very likely
assumption, but I'm not sure it's one the kernel should make -- a script
like "#!/bin/echo" wouldn't be very useful, but fexecve()ing it would still
work OK on a name like "/dev/fd/7" after fd 7 is closed.

(Also, we need some kind of non-empty name in bprm->filename,
even if it's going to be inaccessible later, so that any LSM processing
off of the bprm_set_creds()/bprm_check_security() hooks has something
to work with; those hooks are pre-exec so the "/dev/fd/<fd>" part should
still be OK at that point.)

So I guess I lean towards keeping "/dev/fd/<fd>/<path>" regardless.

>>
>>>> + else
>>>> + snprintf(pathbuf, PATH_MAX,
>>>> + "/dev/fd/%d/%s", fd, filename->name);
>>>
>>> Does this need to handle the case where the result exceeds PATH_MAX?
>>
>> I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
>> possibility of failure, but that just defers the inevitable -- the interpreter
>> won't be able to open the script file anyway. But it would at least then
>> generate the appropriate error (ENAMETOOLONG rather than ENOENT).
>
> Depends whether anyone cares about bprm->filename. But I think the
> code should either return an error or allocate enough space.

I'll allocate enough space.

>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

2014-10-28 17:48:27

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHv5 1/3] syscalls,x86: implement execveat() system call

On Tue, Oct 28, 2014 at 10:30 AM, David Drysdale <[email protected]> wrote:
> [Oops, re-send remembering to turn on plaintext mode -- sorry]
>
> On Mon, Oct 27, 2014 at 6:47 PM, Andy Lutomirski <[email protected]> wrote:
>> On Mon, Oct 27, 2014 at 11:03 AM, David Drysdale <[email protected]> wrote:
>>> On Wed, Oct 22, 2014 at 7:44 PM, Andy Lutomirski <[email protected]> wrote:
>>>> On Wed, Oct 22, 2014 at 4:44 AM, David Drysdale <[email protected]> wrote:
>>>>> Add a new system execveat(2) syscall. execveat() is to execve() as
>>>>> openat() is to open(): it takes a file descriptor that refers to a
>>>>> directory, and resolves the filename relative to that.
>>>>>
>>>>
>>>>> bprm->file = file;
>>>>> - bprm->filename = bprm->interp = filename->name;
>>>>> + if (fd == AT_FDCWD || filename->name[0] == '/') {
>>>>> + bprm->filename = filename->name;
>>>>> + } else {
>>>>> + /*
>>>>> + * Build a pathname that reflects how we got to the file,
>>>>> + * either "/dev/fd/<fd>" (for an empty filename) or
>>>>> + * "/dev/fd/<fd>/<filename>".
>>>>> + */
>>>>> + pathbuf = kmalloc(PATH_MAX, GFP_TEMPORARY);
>>>>> + if (!pathbuf) {
>>>>> + retval = -ENOMEM;
>>>>> + goto out_unmark;
>>>>> + }
>>>>> + bprm->filename = pathbuf;
>>>>> + if (filename->name[0] == '\0')
>>>>> + sprintf(pathbuf, "/dev/fd/%d", fd);
>>>>
>>>> If the fd is O_CLOEXEC, then this will result in a confused child
>>>> process. Should we fail exec attempts like that for non-static
>>>> programs? (E.g. set filename to "" or something and fix up the binfmt
>>>> drivers to handle that?)
>>>
>>> Isn't it just scripts that get confused here (as normal executables don't
>>> get to see brpm->filename)?
>>>
>>> Given that we don't know which we have at this point, I'd suggest
>>> carrying on regardless. Or we could fall back to use the previous
>>> best-effort d_path() code for O_CLOEXEC fds. Thoughts?
>>
>> How hard would it be to mark the bprm as not having a path for the
>> binary? Then we could fail later on if and when we actually need the
>> path.
>
> Adding a flag to bprm->interp_flags to indicate that the bprm->filename
> will be inaccessible after exec is straightforward. But I'm not sure who
> should/could make use of the flag...
>
>> I don't really have a strong opinion here, though. I do prefer
>> actually failing the execveat call over succeeding but invoking a
>> script interpreter than can't possibly work.
>
> Yeah, but that involves the kernel code (e.g. fs/binfmt_script.c) making
> an assumption about what the interpreter is going to do -- specifically
> that it's going to try to open its argv[1]. Admittedly, that's a very likely
> assumption, but I'm not sure it's one the kernel should make -- a script
> like "#!/bin/echo" wouldn't be very useful, but fexecve()ing it would still
> work OK on a name like "/dev/fd/7" after fd 7 is closed.

Hmm. I'm unconvinced. If an important part of executing the script
is passing it an argv[0] that can be opened, then I think we shouldn't
allow a known-bad argv[0].

>
> (Also, we need some kind of non-empty name in bprm->filename,
> even if it's going to be inaccessible later, so that any LSM processing
> off of the bprm_set_creds()/bprm_check_security() hooks has something
> to work with; those hooks are pre-exec so the "/dev/fd/<fd>" part should
> still be OK at that point.)
>
> So I guess I lean towards keeping "/dev/fd/<fd>/<path>" regardless.
>
>>>
>>>>> + else
>>>>> + snprintf(pathbuf, PATH_MAX,
>>>>> + "/dev/fd/%d/%s", fd, filename->name);
>>>>
>>>> Does this need to handle the case where the result exceeds PATH_MAX?
>>>
>>> I guess we could kmalloc(strlen(filename->name) + 19) to avoid the
>>> possibility of failure, but that just defers the inevitable -- the interpreter
>>> won't be able to open the script file anyway. But it would at least then
>>> generate the appropriate error (ENAMETOOLONG rather than ENOENT).
>>
>> Depends whether anyone cares about bprm->filename. But I think the
>> code should either return an error or allocate enough space.
>
> I'll allocate enough space.
>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC



--
Andy Lutomirski
AMA Capital Management, LLC