This patchset introduces a mechanism (the pidfd_getfd syscall) to get file
descriptors from other processes via a pidfd. Although this can already be
achieved using SCM_RIGHTS and parasitic code injection, this offers a more
straightforward mechanism, with less overhead and complexity. The fd in the
process under manipulation remains valid and is not modified by the
copy operation.
It introduces a flags field. The flags field is reserved at the moment,
but the intent is to extend it with the following capabilities:
* Close the remote FD when copying it
* Drop the cgroup data when copying an fd that points to a socket
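As a rough userspace illustration of the proposed interface (a sketch, not
part of the patchset): grab_fd() is a hypothetical helper name, and the
fallback syscall numbers are the asm-generic values used in this series, so
they are an assumption on architectures with their own numbering (alpha, for
example). It also assumes the caller passes the ptrace access check on the
target.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers; asm-generic numbering is assumed here. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd 438
#endif

/*
 * Hypothetical helper: duplicate fd number `remote_fd` of process `pid`
 * into the calling process. Returns the new fd, or -1 with errno set.
 */
static int grab_fd(pid_t pid, int remote_fd)
{
	int pidfd, fd, saved;

	/* A non-positive pid is a programming error, not a runtime condition. */
	assert(pid > 0);

	pidfd = (int)syscall(__NR_pidfd_open, pid, 0U);
	if (pidfd < 0)
		return -1;

	/* flags is reserved for now and has to be 0, otherwise EINVAL */
	fd = (int)syscall(__NR_pidfd_getfd, pidfd, remote_fd, 0U);

	/* Preserve the errno of pidfd_getfd across close() */
	saved = errno;
	close(pidfd);
	errno = saved;

	return fd;
}
```

Called on the caller's own pid this should behave much like dup(); against
another process it is expected to fail with EPERM unless the caller would be
allowed to ptrace-attach to it, and with ENOSYS on kernels without the
syscall.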
The syscall numbers were chosen to be one greater than openat2.
Summary of history:
This started as a ptrace command. Because it did not require the process
to be stopped, it was an awkward fit for ptrace. After that, it moved to
an ioctl on the pidfd. Given the core functionality, it made sense to make
it a standalone syscall, which likewise does not require the process to be
stopped.
Previous versions:
V8: https://lore.kernel.org/lkml/[email protected]/
V7: https://lore.kernel.org/lkml/[email protected]/
V6: https://lore.kernel.org/lkml/[email protected]/
V5: https://lore.kernel.org/lkml/[email protected]/
V4: https://lore.kernel.org/lkml/[email protected]/
V3: https://lore.kernel.org/lkml/[email protected]/
V2: https://lore.kernel.org/lkml/[email protected]/
RFC V1: https://lore.kernel.org/lkml/[email protected]/
Changes since v8:
* Cleanup / comments on tests
* Split out implementation of syscall vs. arch wiring
Changes since v7:
* No longer put security_file_recv at the end, and align with other
usages of putting it at the end of the file_recv.
* Rewrite self-tests in kselftest harness.
* Minor refactoring
Changes since v6:
* Proper attribution of get_task_file helper
* Move all types for syscall to int to represent fd
Changes since v5:
* Drop pidfd_getfd_options struct and replace with a flags field
Changes since v4:
* Turn into a syscall
* Move to PTRACE_MODE_ATTACH_REALCREDS from PTRACE_MODE_READ_REALCREDS
* Remove the sample code. This will come in another patchset, as the
new self-tests cover all the functionality.
Changes since v3:
* Add self-test
* Move to ioctl passing fd directly, versus args struct
* Shuffle around include files
Changes since v2:
* Move to ioctl on pidfd instead of ptrace function
* Add security check before moving file descriptor
Changes since the RFC v1:
* Introduce a new helper to fs/file.c to fetch a file descriptor from
any process. It largely uses the code suggested by Oleg, with a few
changes to fix locking
* It uses an extensible options struct to supply the FD, and option.
* I added a sample, using the code from the user-ptrace sample
Sargun Dhillon (4):
vfs, fdtable: Add fget_task helper
pid: Implement pidfd_getfd syscall
arch: wire up pidfd_getfd syscall
test: Add test for pidfd getfd
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 22 +-
include/linux/file.h | 2 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/pid.c | 90 +++++++
tools/testing/selftests/pidfd/.gitignore | 1 +
tools/testing/selftests/pidfd/Makefile | 2 +-
tools/testing/selftests/pidfd/pidfd.h | 9 +
.../selftests/pidfd/pidfd_getfd_test.c | 249 ++++++++++++++++++
27 files changed, 395 insertions(+), 5 deletions(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_getfd_test.c
--
2.20.1
This introduces a function which can be used to fetch a file, given an
arbitrary task. It is safe to call as long as the caller holds a reference
(refcnt) to the task_struct. It returns NULL on failure, or a pointer to
the file with a reference taken on it.
This patch is based on Oleg Nesterov's (cf. [1]) patch from September
2018.
[1]: Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Sargun Dhillon <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
Acked-by: Christian Brauner <[email protected]>
---
fs/file.c | 22 ++++++++++++++++++++--
include/linux/file.h | 2 ++
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index 2f4fcf985079..2fc5eeef54a4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -706,9 +706,9 @@ void do_close_on_exec(struct files_struct *files)
spin_unlock(&files->file_lock);
}
-static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
+static struct file *__fget_files(struct files_struct *files, unsigned int fd,
+ fmode_t mask, unsigned int refs)
{
- struct files_struct *files = current->files;
struct file *file;
rcu_read_lock();
@@ -729,6 +729,12 @@ static struct file *__fget(unsigned int fd, fmode_t mask, unsigned int refs)
return file;
}
+static inline struct file *__fget(unsigned int fd, fmode_t mask,
+ unsigned int refs)
+{
+ return __fget_files(current->files, fd, mask, refs);
+}
+
struct file *fget_many(unsigned int fd, unsigned int refs)
{
return __fget(fd, FMODE_PATH, refs);
@@ -746,6 +752,18 @@ struct file *fget_raw(unsigned int fd)
}
EXPORT_SYMBOL(fget_raw);
+struct file *fget_task(struct task_struct *task, unsigned int fd)
+{
+ struct file *file = NULL;
+
+ task_lock(task);
+ if (task->files)
+ file = __fget_files(task->files, fd, 0, 1);
+ task_unlock(task);
+
+ return file;
+}
+
/*
* Lightweight file lookup - no refcnt increment if fd table isn't shared.
*
diff --git a/include/linux/file.h b/include/linux/file.h
index 3fcddff56bc4..c6c7b24ea9f7 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -16,6 +16,7 @@ extern void fput(struct file *);
extern void fput_many(struct file *, unsigned int);
struct file_operations;
+struct task_struct;
struct vfsmount;
struct dentry;
struct inode;
@@ -47,6 +48,7 @@ static inline void fdput(struct fd fd)
extern struct file *fget(unsigned int fd);
extern struct file *fget_many(unsigned int fd, unsigned int refs);
extern struct file *fget_raw(unsigned int fd);
+extern struct file *fget_task(struct task_struct *task, unsigned int fd);
extern unsigned long __fdget(unsigned int fd);
extern unsigned long __fdget_raw(unsigned int fd);
extern unsigned long __fdget_pos(unsigned int fd);
--
2.20.1
This wires up the pidfd_getfd syscall for all architectures.
Signed-off-by: Sargun Dhillon <[email protected]>
Acked-by: Christian Brauner <[email protected]>
---
arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd.h | 2 +-
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +++-
20 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 8e13b0b2928d..82301080f5e7 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
543 common fspick sys_fspick
544 common pidfd_open sys_pidfd_open
# 545 reserved for clone3
+548 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..ba045e2f3a60 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
#define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5)
#define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800)
-#define __NR_compat_syscalls 436
+#define __NR_compat_syscalls 439
#endif
#define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..a8da97a2de41 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
/*
* Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..2b11adfc860c 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..44e879e98459 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..7afa00125cc4 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index e7c5ab38e403..856d5ba34461 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -374,3 +374,4 @@
433 n32 fspick sys_fspick
434 n32 pidfd_open sys_pidfd_open
435 n32 clone3 __sys_clone3
+438 n32 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 13cd66581f3b..2db6075352f3 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -350,3 +350,4 @@
433 n64 fspick sys_fspick
434 n64 pidfd_open sys_pidfd_open
435 n64 clone3 __sys_clone3
+438 n64 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 353539ea4140..e9f9d4a9b105 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -423,3 +423,4 @@
433 o32 fspick sys_fspick
434 o32 pidfd_open sys_pidfd_open
435 o32 clone3 __sys_clone3
+438 o32 pidfd_getfd sys_pidfd_getfd
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index 285ff516150c..c58c7eb144ca 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3_wrapper
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 43f736ed47f2..707609bfe3ea 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -517,3 +517,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 nospu clone3 ppc_clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 3054e9c035a3..185cd624face 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick sys_fspick
434 common pidfd_open sys_pidfd_open sys_pidfd_open
435 common clone3 sys_clone3 sys_clone3
+438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index b5ed26c4c005..88f90895aad8 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 8c8cc7537fb2..218df6a2326e 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
# 435 reserved for clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 15908eb9b17e..9c3101b65e0f 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -440,3 +440,4 @@
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
435 i386 clone3 sys_clone3 __ia32_sys_clone3
+438 i386 pidfd_getfd sys_pidfd_getfd __ia32_sys_pidfd_getfd
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index c29976eca4a8..cef85db75a62 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -357,6 +357,7 @@
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
435 common clone3 __x64_sys_clone3/ptregs
+438 common pidfd_getfd __x64_sys_pidfd_getfd
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 25f4de729a6d..ae15183def12 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -406,3 +406,4 @@
433 common fspick sys_fspick
434 common pidfd_open sys_pidfd_open
435 common clone3 sys_clone3
+438 common pidfd_getfd sys_pidfd_getfd
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 2960dedcfde8..5edbc31af51f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1000,6 +1000,7 @@ asmlinkage long sys_fspick(int dfd, const char __user *path, unsigned int flags)
asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
siginfo_t __user *info,
unsigned int flags);
+asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
/*
* Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 1fc8faa6e973..d36ec3d645bd 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -850,9 +850,11 @@ __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
#define __NR_clone3 435
__SYSCALL(__NR_clone3, sys_clone3)
#endif
+#define __NR_pidfd_getfd 438
+__SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
#undef __NR_syscalls
-#define __NR_syscalls 436
+#define __NR_syscalls 439
/*
* 32 bit systems traditionally used different
--
2.20.1
This syscall allows for the retrieval of file descriptors from other
processes, based on their pidfd. This is already possible using ptrace
and parasitic code injection, where the injected code uses SCM_RIGHTS
to move file descriptors between a tracee and a tracer. Unfortunately,
ptrace comes at the high cost of requiring the process to be stopped,
and it breaks debuggers. pidfd_getfd does not require stopping the
process under manipulation.
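For contrast, here is a minimal sketch of the SCM_RIGHTS transfer mentioned
above, in its cooperative form where both ends agree to the hand-off. With
parasitic injection, the target has to be forced to run the sending half.
send_fd() and recv_fd() are hypothetical helper names, and `sk` is assumed to
be a connected AF_UNIX socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for one int-sized SCM_RIGHTS payload. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send `fd` over the AF_UNIX socket `sk`. Returns 0 on success, -1 on error. */
static int send_fd(int sk, int fd)
{
	char byte = 'F';	/* at least one byte of real data has to be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sk, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sk`. Returns the new fd, or -1 on error. */
static int recv_fd(int sk)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sk, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

Both halves need code running inside the respective process, which is what
makes this path awkward when the target is not cooperating.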
One reason to use this is to allow sandboxers to take actions on file
descriptors on behalf of another process. For example, this can be
combined with seccomp-bpf's user notification to do on-demand fd
extraction and take privileged actions. One such privileged action
is binding a socket to a privileged port.
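A sketch of the supervisor side of that use case, under several assumptions:
bind_on_behalf() is a hypothetical helper, the seccomp user-notification
plumbing that would supply `pid`, `remote_fd` and `port` is omitted, the
socket is assumed to be AF_INET, and the fallback syscall numbers are the
asm-generic values from this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers; asm-generic numbering is assumed here. */
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_pidfd_getfd
#define __NR_pidfd_getfd 438
#endif

/*
 * Hypothetical supervisor helper: bind the AF_INET socket that process `pid`
 * holds as fd number `remote_fd` to `port` on the wildcard address.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int bind_on_behalf(pid_t pid, int remote_fd, uint16_t port)
{
	struct sockaddr_in addr;
	int pidfd, fd, ret, saved;

	/* A non-positive pid is a programming error, not a runtime condition. */
	assert(pid > 0);

	pidfd = (int)syscall(__NR_pidfd_open, pid, 0U);
	if (pidfd < 0)
		return -1;

	/* Duplicate the target's socket into this process (flags must be 0). */
	fd = (int)syscall(__NR_pidfd_getfd, pidfd, remote_fd, 0U);
	saved = errno;
	close(pidfd);
	if (fd < 0) {
		errno = saved;
		return -1;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(port);

	/*
	 * Both fds refer to the same open file, so the bind performed with the
	 * supervisor's privileges is visible through the target's fd as well.
	 */
	ret = bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	/* Closing only drops the duplicate; the target's fd stays open. */
	saved = errno;
	close(fd);
	errno = saved;

	return ret;
}
```

In a real sandboxer the supervisor would first validate the request taken
from the notification, and would need CAP_NET_BIND_SERVICE (or equivalent)
for ports below 1024.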
/* prototype */
/* flags is currently reserved and should be set to 0 */
int sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
/* testing */
Ran self-test suite on x86_64
Signed-off-by: Sargun Dhillon <[email protected]>
Cc: Christian Brauner <[email protected]>
Acked-by: Christian Brauner <[email protected]>
---
kernel/pid.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 90 insertions(+)
diff --git a/kernel/pid.c b/kernel/pid.c
index 2278e249141d..0f4ecb57214c 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -578,3 +578,93 @@ void __init pid_idr_init(void)
init_pid_ns.pid_cachep = KMEM_CACHE(pid,
SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
}
+
+static struct file *__pidfd_fget(struct task_struct *task, int fd)
+{
+ struct file *file;
+ int ret;
+
+ ret = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ if (ret)
+ return ERR_PTR(ret);
+
+ if (ptrace_may_access(task, PTRACE_MODE_ATTACH_REALCREDS))
+ file = fget_task(task, fd);
+ else
+ file = ERR_PTR(-EPERM);
+
+ mutex_unlock(&task->signal->cred_guard_mutex);
+
+ return file ?: ERR_PTR(-EBADF);
+}
+
+static int pidfd_getfd(struct pid *pid, int fd)
+{
+ struct task_struct *task;
+ struct file *file;
+ int ret;
+
+ task = get_pid_task(pid, PIDTYPE_PID);
+ if (!task)
+ return -ESRCH;
+
+ file = __pidfd_fget(task, fd);
+ put_task_struct(task);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+
+ ret = security_file_receive(file);
+ if (ret) {
+ fput(file);
+ return ret;
+ }
+
+ ret = get_unused_fd_flags(O_CLOEXEC);
+ if (ret < 0)
+ fput(file);
+ else
+ fd_install(ret, file);
+
+ return ret;
+}
+
+/**
+ * sys_pidfd_getfd() - Get a file descriptor from another process
+ *
+ * @pidfd: the pidfd file descriptor of the process
+ * @fd: the file descriptor number to get
+ * @flags: flags on how to get the fd (reserved)
+ *
+ * This syscall gets a copy of a file descriptor from another process
+ * based on the pidfd, and file descriptor number. It requires that
+ * the calling process has the ability to ptrace the process represented
+ * by the pidfd. The process which is having its file descriptor copied
+ * is otherwise unaffected.
+ *
+ * Return: On success, a cloexec file descriptor is returned.
+ * On error, a negative errno number will be returned.
+ */
+SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
+ unsigned int, flags)
+{
+ struct pid *pid;
+ struct fd f;
+ int ret;
+
+ /* flags is currently unused - make sure it's unset */
+ if (flags)
+ return -EINVAL;
+
+ f = fdget(pidfd);
+ if (!f.file)
+ return -EBADF;
+
+ pid = pidfd_pid(f.file);
+ if (IS_ERR(pid))
+ ret = PTR_ERR(pid);
+ else
+ ret = pidfd_getfd(pid, fd);
+
+ fdput(f);
+ return ret;
+}
--
2.20.1
This adds the following tests:
* Fetch an FD, and then compare it with the original via kcmp
* Make sure getfd can be blocked by blocking ptrace_may_access
* Make sure fetching bad FDs fails
* Make sure setting flags to a non-zero value results in EINVAL
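The kcmp comparison used by the first test can be sketched on its own as
follows. same_open_file() is a hypothetical helper name; it assumes a kernel
with kcmp support and sufficient access to both processes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if fd_a in pid_a and fd_b in pid_b refer to
 * the same open file, 0 if they do not, and -1 (errno set) if kcmp failed.
 */
static int same_open_file(pid_t pid_a, int fd_a, pid_t pid_b, int fd_b)
{
	long ret;

	/* Negative fd numbers are a programming error in this sketch. */
	assert(fd_a >= 0 && fd_b >= 0);

	ret = syscall(__NR_kcmp, pid_a, pid_b, KCMP_FILE,
		      (unsigned long)fd_a, (unsigned long)fd_b);
	if (ret < 0)
		return -1;

	/* kcmp returns 0 only when both fds point at the same struct file */
	return ret == 0;
}
```

For an fd obtained through pidfd_getfd, comparing it against the original fd
number in the target process is expected to yield 1, which is the property
the fetch_fd test checks.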
Signed-off-by: Sargun Dhillon <[email protected]>
---
tools/testing/selftests/pidfd/.gitignore | 1 +
tools/testing/selftests/pidfd/Makefile | 2 +-
tools/testing/selftests/pidfd/pidfd.h | 9 +
.../selftests/pidfd/pidfd_getfd_test.c | 249 ++++++++++++++++++
4 files changed, 260 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/pidfd/pidfd_getfd_test.c
diff --git a/tools/testing/selftests/pidfd/.gitignore b/tools/testing/selftests/pidfd/.gitignore
index 8d069490e17b..3a779c084d96 100644
--- a/tools/testing/selftests/pidfd/.gitignore
+++ b/tools/testing/selftests/pidfd/.gitignore
@@ -2,3 +2,4 @@ pidfd_open_test
pidfd_poll_test
pidfd_test
pidfd_wait
+pidfd_getfd_test
diff --git a/tools/testing/selftests/pidfd/Makefile b/tools/testing/selftests/pidfd/Makefile
index 43db1b98e845..75a545861375 100644
--- a/tools/testing/selftests/pidfd/Makefile
+++ b/tools/testing/selftests/pidfd/Makefile
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: GPL-2.0-only
CFLAGS += -g -I../../../../usr/include/ -pthread
-TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait
+TEST_GEN_PROGS := pidfd_test pidfd_fdinfo_test pidfd_open_test pidfd_poll_test pidfd_wait pidfd_getfd_test
include ../lib.mk
diff --git a/tools/testing/selftests/pidfd/pidfd.h b/tools/testing/selftests/pidfd/pidfd.h
index c6bc68329f4b..d482515604db 100644
--- a/tools/testing/selftests/pidfd/pidfd.h
+++ b/tools/testing/selftests/pidfd/pidfd.h
@@ -36,6 +36,10 @@
#define __NR_clone3 -1
#endif
+#ifndef __NR_pidfd_getfd
+#define __NR_pidfd_getfd -1
+#endif
+
/*
* The kernel reserves 300 pids via RESERVED_PIDS in kernel/pid.c
* That means, when it wraps around any pid < 300 will be skipped.
@@ -84,4 +88,9 @@ static inline int sys_pidfd_send_signal(int pidfd, int sig, siginfo_t *info,
return syscall(__NR_pidfd_send_signal, pidfd, sig, info, flags);
}
+static inline int sys_pidfd_getfd(int pidfd, int fd, int flags)
+{
+ return syscall(__NR_pidfd_getfd, pidfd, fd, flags);
+}
+
#endif /* __PIDFD_H */
diff --git a/tools/testing/selftests/pidfd/pidfd_getfd_test.c b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
new file mode 100644
index 000000000000..401a7c1d0312
--- /dev/null
+++ b/tools/testing/selftests/pidfd/pidfd_getfd_test.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <linux/types.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <sys/socket.h>
+#include <linux/kcmp.h>
+
+#include "pidfd.h"
+#include "../kselftest.h"
+#include "../kselftest_harness.h"
+
+/*
+ * UNKNOWN_FD is an fd number that should never exist in the child, as it is
+ * used to check the negative case.
+ */
+#define UNKNOWN_FD 111
+#define UID_NOBODY 65535
+
+static int sys_kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1,
+ unsigned long idx2)
+{
+ return syscall(__NR_kcmp, pid1, pid2, type, idx1, idx2);
+}
+
+static int sys_memfd_create(const char *name, unsigned int flags)
+{
+ return syscall(__NR_memfd_create, name, flags);
+}
+
+static int __child(int sk, int memfd)
+{
+ int ret;
+ char buf;
+
+ /*
+ * Ensure we don't leave around a bunch of orphaned children if our
+ * tests fail.
+ */
+ ret = prctl(PR_SET_PDEATHSIG, SIGKILL);
+ if (ret) {
+ fprintf(stderr, "%s: Child could not set DEATHSIG\n",
+ strerror(errno));
+ return -1;
+ }
+
+ ret = send(sk, &memfd, sizeof(memfd), 0);
+ if (ret != sizeof(memfd)) {
+ fprintf(stderr, "%s: Child failed to send fd number\n",
+ strerror(errno));
+ return -1;
+ }
+
+ /*
+ * The fixture setup is completed at this point. The tests will run.
+ *
+ * This blocking recv enables the parent to message the child.
+ * Either we will read 'P' off of the sk, indicating that we need
+ * to disable ptrace, or we will read a 0, indicating that the other
+ * side has closed the sk. This occurs during fixture teardown time,
+ * indicating that the child should exit.
+ */
+ while ((ret = recv(sk, &buf, sizeof(buf), 0)) > 0) {
+ if (buf == 'P') {
+ ret = prctl(PR_SET_DUMPABLE, 0);
+ if (ret < 0) {
+ fprintf(stderr,
+ "%s: Child failed to disable ptrace\n",
+ strerror(errno));
+ return -1;
+ }
+ } else {
+ fprintf(stderr, "Child received unknown command %c\n",
+ buf);
+ return -1;
+ }
+ ret = send(sk, &buf, sizeof(buf), 0);
+ if (ret != 1) {
+ fprintf(stderr, "%s: Child failed to ack\n",
+ strerror(errno));
+ return -1;
+ }
+ }
+ if (ret < 0) {
+ fprintf(stderr, "%s: Child failed to read from socket\n",
+ strerror(errno));
+ return -1;
+ }
+
+ return 0;
+}
+
+static int child(int sk)
+{
+ int memfd, ret;
+
+ memfd = sys_memfd_create("test", 0);
+ if (memfd < 0) {
+ fprintf(stderr, "%s: Child could not create memfd\n",
+ strerror(errno));
+ ret = -1;
+ } else {
+ ret = __child(sk, memfd);
+ close(memfd);
+ }
+
+ close(sk);
+ return ret;
+}
+
+FIXTURE(child)
+{
+ /*
+ * remote_fd is the number of the FD which we are trying to retrieve
+ * from the child.
+ */
+ int remote_fd;
+ /* pid points to the child which we are fetching FDs from */
+ pid_t pid;
+ /* pidfd is the pidfd of the child */
+ int pidfd;
+ /*
+ * sk is our side of the socketpair used to communicate with the child.
+ * When it is closed, the child will exit.
+ */
+ int sk;
+};
+
+FIXTURE_SETUP(child)
+{
+ int ret, sk_pair[2];
+
+ ASSERT_EQ(0, socketpair(PF_LOCAL, SOCK_SEQPACKET, 0, sk_pair)) {
+ TH_LOG("%s: failed to create socketpair", strerror(errno));
+ }
+ self->sk = sk_pair[0];
+
+ self->pid = fork();
+ ASSERT_GE(self->pid, 0);
+
+ if (self->pid == 0) {
+ close(sk_pair[0]);
+ if (child(sk_pair[1]))
+ _exit(EXIT_FAILURE);
+ _exit(EXIT_SUCCESS);
+ }
+
+ close(sk_pair[1]);
+
+ self->pidfd = sys_pidfd_open(self->pid, 0);
+ ASSERT_GE(self->pidfd, 0);
+
+ /*
+ * Wait for the child to complete setup. It'll send the remote memfd's
+ * number when ready.
+ */
+ ret = recv(sk_pair[0], &self->remote_fd, sizeof(self->remote_fd), 0);
+ ASSERT_EQ(sizeof(self->remote_fd), ret);
+}
+
+FIXTURE_TEARDOWN(child)
+{
+ EXPECT_EQ(0, close(self->pidfd));
+ EXPECT_EQ(0, close(self->sk));
+
+ EXPECT_EQ(0, wait_for_pid(self->pid));
+}
+
+TEST_F(child, disable_ptrace)
+{
+ int uid, fd;
+ char c;
+
+ /*
+ * Turn into nobody if we're root, to avoid CAP_SYS_PTRACE
+ *
+ * The tests should run in their own process, so even if this test
+ * fails, it shouldn't result in subsequent tests failing.
+ */
+ uid = getuid();
+ if (uid == 0)
+ ASSERT_EQ(0, seteuid(UID_NOBODY));
+
+ ASSERT_EQ(1, send(self->sk, "P", 1, 0));
+ ASSERT_EQ(1, recv(self->sk, &c, 1, 0));
+
+ fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
+ EXPECT_EQ(-1, fd);
+ EXPECT_EQ(EPERM, errno);
+
+ if (uid == 0)
+ ASSERT_EQ(0, seteuid(0));
+}
+
+TEST_F(child, fetch_fd)
+{
+ int fd, ret;
+
+ fd = sys_pidfd_getfd(self->pidfd, self->remote_fd, 0);
+ ASSERT_GE(fd, 0);
+
+ EXPECT_EQ(0, sys_kcmp(getpid(), self->pid, KCMP_FILE, fd, self->remote_fd));
+
+ ret = fcntl(fd, F_GETFD);
+ ASSERT_GE(ret, 0);
+ EXPECT_NE(0, ret & FD_CLOEXEC);
+
+ close(fd);
+}
+
+TEST_F(child, test_unknown_fd)
+{
+ int fd;
+
+ fd = sys_pidfd_getfd(self->pidfd, UNKNOWN_FD, 0);
+ EXPECT_EQ(-1, fd) {
+ TH_LOG("getfd succeeded while fetching unknown fd");
+ };
+ EXPECT_EQ(EBADF, errno) {
+ TH_LOG("%s: getfd did not get EBADF", strerror(errno));
+ }
+}
+
+TEST(flags_set)
+{
+ ASSERT_EQ(-1, sys_pidfd_getfd(0, 0, 1));
+ EXPECT_EQ(errno, EINVAL);
+}
+
+#if __NR_pidfd_getfd == -1
+int main(void)
+{
+ fprintf(stderr, "__NR_pidfd_getfd undefined. The pidfd_getfd syscall is unavailable. Test aborting\n");
+ return KSFT_SKIP;
+}
+#else
+TEST_HARNESS_MAIN
+#endif
--
2.20.1
On Tue, Jan 07, 2020 at 09:59:23AM -0800, Sargun Dhillon wrote:
> This patchset introduces a mechanism (pidfd_getfd syscall) to get file
> descriptors from other processes via pidfd. Although this can be achieved
> using SCM_RIGHTS, and parasitic code injection, this offers a more
> straightforward mechanism, with less overhead and complexity. The process
> under manipulation's fd still remains valid, and unmodified by the
> copy operation.
>
> It introduces a flags field. The flags field is reserved a the moment,
> but the intent is to extend it with the following capabilities:
> * Close the remote FD when copying it
> * Drop the cgroup data if it's a fd pointing a socket when copying it
>
> The syscall numbers were chosen to be one greater than openat2.
>
> Summary of history:
> This initially started as a ptrace command. It did not require the process
> to be stopped, and felt like kind of an awkward fit for ptrace. After that,
> it moved to an ioctl on the pidfd. Given the core functionality, it made
> sense to make it a syscall which did not require the process to be stopped.
>
> Previous versions:
> V8: https://lore.kernel.org/lkml/[email protected]/
> V7: https://lore.kernel.org/lkml/[email protected]/
> V6: https://lore.kernel.org/lkml/[email protected]/
> V5: https://lore.kernel.org/lkml/[email protected]/
> V4: https://lore.kernel.org/lkml/[email protected]/
> V3: https://lore.kernel.org/lkml/[email protected]/
> V2: https://lore.kernel.org/lkml/[email protected]/
> RFC V1: https://lore.kernel.org/lkml/[email protected]/
I don't see anything wrong with this series anymore:
Acked-by: Christian Brauner <[email protected]>
Other Acked-bys/Reviewed-bys and reviews of course strongly encouraged!
Christian
On Tue, Jan 07, 2020 at 09:54:49PM +0100, Christian Brauner wrote:
> On Tue, Jan 07, 2020 at 09:59:23AM -0800, Sargun Dhillon wrote:
> > This patchset introduces a mechanism (pidfd_getfd syscall) to get file
> > descriptors from other processes via pidfd. Although this can be achieved
> > using SCM_RIGHTS, and parasitic code injection, this offers a more
> > straightforward mechanism, with less overhead and complexity. The process
> > under manipulation's fd still remains valid, and unmodified by the
> > copy operation.
> >
> > It introduces a flags field. The flags field is reserved a the moment,
> > but the intent is to extend it with the following capabilities:
> > * Close the remote FD when copying it
> > * Drop the cgroup data if it's a fd pointing a socket when copying it
> >
> > The syscall numbers were chosen to be one greater than openat2.
> >
> > Summary of history:
> > This initially started as a ptrace command. It did not require the process
> > to be stopped, and felt like kind of an awkward fit for ptrace. After that,
> > it moved to an ioctl on the pidfd. Given the core functionality, it made
> > sense to make it a syscall which did not require the process to be stopped.
> >
> > Previous versions:
> > V8: https://lore.kernel.org/lkml/[email protected]/
> > V7: https://lore.kernel.org/lkml/[email protected]/
> > V6: https://lore.kernel.org/lkml/[email protected]/
> > V5: https://lore.kernel.org/lkml/[email protected]/
> > V4: https://lore.kernel.org/lkml/[email protected]/
> > V3: https://lore.kernel.org/lkml/[email protected]/
> > V2: https://lore.kernel.org/lkml/[email protected]/
> > RFC V1: https://lore.kernel.org/lkml/[email protected]/
>
> I don't see anything wrong with this series anymore:
>
> Acked-by: Christian Brauner <[email protected]>
>
> Other Acked-bys/Reviewed-bys and reviews of course strongly encouraged!
> Christian
Fyi, I'm waiting a few days on a reply from Al.
Depending on his input the intent rn is to move this into my for-next
early next week.
Christian
On Tue, Jan 7, 2020 at 6:59 PM Sargun Dhillon <[email protected]> wrote:
> +/**
> + * sys_pidfd_getfd() - Get a file descriptor from another process
> + *
> + * @pidfd: the pidfd file descriptor of the process
> + * @fd: the file descriptor number to get
> + * @flags: flags on how to get the fd (reserved)
> + *
> + * This syscall gets a copy of a file descriptor from another process
> + * based on the pidfd, and file descriptor number. It requires that
> + * the calling process has the ability to ptrace the process represented
> + * by the pidfd. The process which is having its file descriptor copied
> + * is otherwise unaffected.
> + *
> + * Return: On success, a cloexec file descriptor is returned.
> + * On error, a negative errno number will be returned.
> + */
> +SYSCALL_DEFINE3(pidfd_getfd, int, pidfd, int, fd,
> + unsigned int, flags)
This is the most sensible definition I can see. I can not tell
whether we should or want to have it, but if everyone thinks
this is a good idea, then this ABI makes sense.
Reviewed-by: Arnd Bergmann <[email protected]>
On Tue, Jan 7, 2020 at 6:59 PM Sargun Dhillon <[email protected]> wrote:
>
> This wires up the pidfd_getfd syscall for all architectures.
>
> Signed-off-by: Sargun Dhillon <[email protected]>
> Acked-by: Christian Brauner <[email protected]>
This all looks correct,
Reviewed-by: Arnd Bergmann <[email protected]>
On Tue, Jan 7, 2020 at 6:59 PM Sargun Dhillon <[email protected]> wrote:
>
> This introduces a function which can be used to fetch a file, given an
> arbitrary task. As long as the user holds a reference (refcnt) to the
> task_struct it is safe to call, and will either return NULL on failure,
> or a pointer to the file, with a refcnt.
>
> This patch is based on Oleg Nesterov's (cf. [1]) patch from September
> 2018.
>
> [1]: Link: https://lore.kernel.org/r/[email protected]
>
> Signed-off-by: Sargun Dhillon <[email protected]>
> Suggested-by: Oleg Nesterov <[email protected]>
> Acked-by: Christian Brauner <[email protected]>
Reviewed-by: Arnd Bergmann <[email protected]>
Hello Sargun et al.
I've taken a shot at writing a manual page for pidfd_getfd().
I would be happy to receive comments, suggestions for
improvements, etc. The text is as follows (the groff source
is at the foot of this mail):
NAME
pidfd_getfd - obtain a duplicate of another process's file
descriptor
SYNOPSIS
int pidfd_getfd(int pidfd, int targetfd, unsigned int flags);
DESCRIPTION
The pidfd_getfd() system call allocates a new file descriptor in
the calling process. This new file descriptor is a duplicate of
an existing file descriptor, targetfd, in the process referred to
by the PID file descriptor pidfd.
The duplicate file descriptor refers to the same open file
description (see open(2)) as the original file descriptor in the
process referred to by pidfd. The two file descriptors thus share
file status flags and file offset. Furthermore, operations on the
underlying file object (for example, assigning an address to a
socket object using bind(2)) can equally be performed via the
duplicate file descriptor.
The close-on-exec flag (FD_CLOEXEC; see fcntl(2)) is set on the
file descriptor returned by pidfd_getfd().
The flags argument is reserved for future use. Currently, it must
be specified as 0.
Permission to duplicate another process's file descriptor is
governed by a ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check
(see ptrace(2)).
RETURN VALUE
On success, pidfd_getfd() returns a nonnegative file descriptor.
On error, -1 is returned and errno is set to indicate the cause of
the error.
ERRORS
EBADF pidfd is not a valid PID file descriptor.
EBADF targetfd is not an open file descriptor in the process
referred to by pidfd.
EINVAL flags is not 0.
EMFILE The per-process limit on the number of open file
descriptors has been reached (see the description of RLIMIT_NOFILE
in getrlimit(2)).
ENFILE The system-wide limit on the total number of open files has
been reached.
ESRCH The process referred to by pidfd does not exist (i.e., it
has terminated and been waited on).
VERSIONS
pidfd_getfd() first appeared in Linux 5.6.
CONFORMING TO
pidfd_getfd() is Linux-specific.
NOTES
Currently, there is no glibc wrapper for this system call; call it
using syscall(2).
For a description of PID file descriptors, see pidfd_open(2).
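A minimal sketch of the syscall(2) route mentioned in NOTES, offered as
an illustration and not as part of the proposed page text. The names
my_pidfd_open(), my_pidfd_getfd() and copy_fd_from_pid() are
hypothetical, and the fallback syscall numbers (434 and 438) are an
assumption that holds for x86-64 and most other architectures; prefer
the values from <sys/syscall.h> when your headers define them.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for headers that predate these syscalls. Assumption: these
 * numbers match x86-64 and most other architectures, but not all. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Hypothetical wrapper: obtain a PID file descriptor for pid. */
static inline int my_pidfd_open(pid_t pid, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_open, pid, flags);
}

/* Hypothetical wrapper: duplicate targetfd from the process referred
 * to by pidfd. flags must currently be 0. */
static inline int my_pidfd_getfd(int pidfd, int targetfd, unsigned int flags)
{
    return (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, flags);
}

/* Hypothetical helper: duplicate targetfd from process pid into the
 * caller. Returns the new descriptor, or -1 with errno set. */
static inline int copy_fd_from_pid(pid_t pid, int targetfd)
{
    int pidfd = my_pidfd_open(pid, 0);
    if (pidfd == -1)
        return -1;

    int fd = my_pidfd_getfd(pidfd, targetfd, 0);

    /* Preserve errno from pidfd_getfd() across close(). */
    int saved_errno = errno;
    close(pidfd);
    errno = saved_errno;

    return fd;
}
```

For instance, copy_fd_from_pid(pid, 1) would return a descriptor that
refers to the same open file description as the target's standard
output, provided the caller passes the PTRACE_MODE_ATTACH_REALCREDS
check described above.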
SEE ALSO
clone3(2), kcmp(2), pidfd_open(2)
Cheers,
Michael
.\" Copyright (c) 2020 by Michael Kerrisk <[email protected]>
.\"
.\" %%%LICENSE_START(VERBATIM)
.\" Permission is granted to make and distribute verbatim copies of this
.\" manual provided the copyright notice and this permission notice are
.\" preserved on all copies.
.\"
.\" Permission is granted to copy and distribute modified versions of this
.\" manual under the conditions for verbatim copying, provided that the
.\" entire resulting derived work is distributed under the terms of a
.\" permission notice identical to this one.
.\"
.\" Since the Linux kernel and libraries are constantly changing, this
.\" manual page may be incorrect or out-of-date. The author(s) assume no
.\" responsibility for errors or omissions, or for damages resulting from
.\" the use of the information contained herein. The author(s) may not
.\" have taken the same level of care in the production of this manual,
.\" which is licensed free of charge, as they might when working
.\" professionally.
.\"
.\" Formatted or processed versions of this manual, if unaccompanied by
.\" the source, must acknowledge the copyright and authors of this work.
.\" %%%LICENSE_END
.\"
.TH PIDFD_GETFD 2 2020-03-31 "Linux" "Linux Programmer's Manual"
.SH NAME
pidfd_getfd \- obtain a duplicate of another process's file descriptor
.SH SYNOPSIS
.nf
.BI "int pidfd_getfd(int " pidfd ", int " targetfd ", unsigned int " flags );
.fi
.SH DESCRIPTION
The
.BR pidfd_getfd ()
system call allocates a new file descriptor in the calling process.
This new file descriptor is a duplicate of an existing file descriptor,
.IR targetfd ,
in the process referred to by the PID file descriptor
.IR pidfd .
.PP
The duplicate file descriptor refers to the same open file description (see
.BR open (2))
as the original file descriptor in the process referred to by
.IR pidfd .
The two file descriptors thus share file status flags and file offset.
Furthermore, operations on the underlying file object
(for example, assigning an address to a socket object using
.BR bind (2))
can equally be performed via the duplicate file descriptor.
.PP
The close-on-exec flag
.RB ( FD_CLOEXEC ;
see
.BR fcntl (2))
is set on the file descriptor returned by
.BR pidfd_getfd ().
.PP
The
.I flags
argument is reserved for future use.
Currently, it must be specified as 0.
.PP
Permission to duplicate another process's file descriptor
is governed by a ptrace access mode
.B PTRACE_MODE_ATTACH_REALCREDS
check (see
.BR ptrace (2)).
.SH RETURN VALUE
On success,
.BR pidfd_getfd ()
returns a nonnegative file descriptor.
On error, \-1 is returned and
.I errno
is set to indicate the cause of the error.
.SH ERRORS
.TP
.B EBADF
.I pidfd
is not a valid PID file descriptor.
.TP
.B EBADF
.I targetfd
is not an open file descriptor in the process referred to by
.IR pidfd .
.TP
.B EINVAL
.I flags
is not 0.
.TP
.B EMFILE
The per-process limit on the number of open file descriptors has been reached
(see the description of
.B RLIMIT_NOFILE
in
.BR getrlimit (2)).
.TP
.B ENFILE
The system-wide limit on the total number of open files has been reached.
.TP
.B ESRCH
The process referred to by
.I pidfd
does not exist
(i.e., it has terminated and been waited on).
.SH VERSIONS
.BR pidfd_getfd ()
first appeared in Linux 5.6.
.\" commit 8649c322f75c96e7ced2fec201e123b2b073bf09
.SH CONFORMING TO
.BR pidfd_getfd ()
is Linux-specific.
.SH NOTES
Currently, there is no glibc wrapper for this system call; call it using
.BR syscall (2).
.PP
For a description of PID file descriptors, see
.BR pidfd_open (2).
.SH SEE ALSO
.BR clone3 (2),
.BR kcmp (2),
.BR pidfd_open (2)
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/