2014-12-02 04:35:31

by Alex Dubov

Subject: Minimal effort/low overhead file descriptor duplication over Posix.1b s

A common requirement in parallel processing applications (relied upon by
popular network servers, databases and various other applications) is to
pass open file descriptors between processes. Historically, several mechanisms
have existed to support this requirement, such as the "cmsg" facility
of Unix domain sockets or special operations on named pipes (on Android this
can also be achieved using the "binder" facility).

Unfortunately, using facilities like Unix domain sockets merely to pass file
descriptors between "worker" processes is unnecessarily difficult, due to
the following common considerations:

1. Domain sockets and named pipes are persistent objects. Applications must
manage their lifetime and devise unambiguous access schemes in case multiple
application instances are to be run within the same OS instance. Usually, they
would also require a writable file system to be mounted.

2. Interaction with domain sockets and named pipes requires sizable,
non-trivial and error-prone code on the application side, especially in
cases where multiple worker types started by multiple application instances
must coexist within the same OS instance.

3. Domain sockets and pipes require complex kernel-side set-up, yet in many
cases the only information the application ever passes over those channels
is file descriptors (it is usual for the major part of the
application's shared state to be established through other mechanisms,
like shared memory). In some cases, applications are forced to send meaningless
rubbish over the domain socket merely to "push" the associated "cmsg" carrying
the file descriptor through.
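
For reference, the existing "cmsg" approach criticised above can be sketched
roughly as follows. This is a minimal illustration, not code from the patch:
the helper names send_fd() and recv_fd() are made up for this example, and it
assumes an already connected Unix domain stream socket. Note the single dummy
byte that has to travel with the control message.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized for exactly one descriptor; the union forces
 * the alignment that the CMSG_* macros expect. */
union fd_cmsg_buf {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: send `fd` over the connected socket `sock`.
 * Returns 0 on success, -1 on error (errno set by sendmsg()). */
int send_fd(int sock, int fd)
{
    char dummy = 0; /* payload byte that only exists to carry the cmsg */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg_buf u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != NULL); /* the buffer above always fits one header */
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor from `sock`.
 * Returns the new descriptor, or -1 if none arrived. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg_buf u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

Both ends additionally have to create, name and tear down the socket itself,
which is the lifetime management burden described in point 1.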

The present patch introduces an exceptionally easy-to-use, low-latency and
low-overhead mechanism for transferring file descriptors between cooperating
processes:

int sendfd(pid_t pid, int sig, int fd)

Given a target process pid, the sendfd() syscall will create a duplicate
file descriptor in the target task's (referred to by pid) file table, pointing
to the file referenced by descriptor fd. Then, it will attempt to notify the
target task by issuing a Posix.1b real-time signal (sig), carrying the new
file descriptor as integer payload. If the real-time signal cannot be enqueued
at the destination signal queue, the newly created file descriptor will be
promptly closed.
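
From user space, the two sides of this exchange might look roughly like the
sketch below. Everything here is illustrative: sendfd() exists only as a
proposal in this thread, so the wrapper falls back to ENOSYS when the headers
do not define __NR_sendfd, and the helper names block_rt_signal(),
wait_for_fd() and notify_with_int() are invented for this example. The last
helper uses sigqueue() as a stand-in for the kernel-side notification, so that
the receiving path can be exercised on an unpatched system.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper for the proposed syscall. Without a patched
 * kernel and headers, __NR_sendfd is undefined and this is a stub. */
int sendfd(pid_t pid, int sig, int fd)
{
#ifdef __NR_sendfd
    return (int)syscall(__NR_sendfd, pid, sig, fd);
#else
    (void)pid;
    (void)sig;
    (void)fd;
    errno = ENOSYS;
    return -1;
#endif
}

/* Receiver set-up: block `sig` so that instances are queued and can be
 * collected synchronously instead of interrupting the program. */
int block_rt_signal(int sig)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, sig);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Receiver: wait for one queued instance of `sig` and return its
 * integer payload, which under the proposal would be the number of the
 * freshly installed descriptor. Returns -1 on error. */
int wait_for_fd(int sig)
{
    sigset_t set;
    siginfo_t info;

    sigemptyset(&set);
    sigaddset(&set, sig);
    if (sigwaitinfo(&set, &info) != sig)
        return -1;

    assert(info.si_signo == sig);
    return info.si_value.sival_int;
}

/* Stand-in for the kernel side: queue a real-time signal carrying an
 * integer. Mirrors the range check at the top of the syscall. */
int notify_with_int(pid_t pid, int sig, int value)
{
    union sigval sv;

    if (sig < SIGRTMIN || sig > SIGRTMAX) {
        errno = EINVAL;
        return -1;
    }
    sv.sival_int = value;
    return sigqueue(pid, sig, sv);
}
```

A receiver would call block_rt_signal(SIGRTMIN) once at start-up and then
wait_for_fd(SIGRTMIN) whenever it expects a descriptor; the sender's whole
job reduces to a single sendfd(pid, SIGRTMIN, fd) call.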

It is believed that the proposed sendfd() syscall, together with the recently
accepted "memfd" facility, may greatly simplify development of parallel
processing applications by eliminating the need to rely on tricky and
possibly insecure approaches involving domain sockets and such.


2014-12-02 04:35:36

by Alex Dubov

Subject: [PATCH 1/2] fs: introduce sendfd() syscall

The present patch introduces an exceptionally easy-to-use, low-latency and
low-overhead mechanism for transferring file descriptors between cooperating
processes:

int sendfd(pid_t pid, int sig, int fd)

Given a target process pid, the sendfd() syscall will create a duplicate
file descriptor in the target task's (referred to by pid) file table, pointing
to the file referenced by descriptor fd. Then, it will attempt to notify the
target task by issuing a Posix.1b real-time signal (sig), carrying the new
file descriptor as integer payload. If the real-time signal cannot be enqueued
at the destination signal queue, the newly created file descriptor will be
promptly closed.

Signed-off-by: Alex Dubov <[email protected]>
---
fs/Makefile | 1 +
fs/sendfd.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
init/Kconfig | 11 ++++++++
3 files changed, 94 insertions(+)
create mode 100644 fs/sendfd.c

diff --git a/fs/Makefile b/fs/Makefile
index da0bbb4..bed05a8 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -27,6 +27,7 @@ obj-$(CONFIG_ANON_INODES) += anon_inodes.o
obj-$(CONFIG_SIGNALFD) += signalfd.o
obj-$(CONFIG_TIMERFD) += timerfd.o
obj-$(CONFIG_EVENTFD) += eventfd.o
+obj-$(CONFIG_SENDFD) += sendfd.o
obj-$(CONFIG_AIO) += aio.o
obj-$(CONFIG_FILE_LOCKING) += locks.o
obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o
diff --git a/fs/sendfd.c b/fs/sendfd.c
new file mode 100644
index 0000000..1e85484
--- /dev/null
+++ b/fs/sendfd.c
@@ -0,0 +1,82 @@
+/*
+ * fs/sendfd.c
+ *
+ * Copyright (C) 2014 Alex Dubov <[email protected]>
+ *
+ */
+
+#include <linux/file.h>
+#include <linux/fdtable.h>
+#include <linux/syscalls.h>
+
+SYSCALL_DEFINE3(sendfd, pid_t, pid, int, sig, int, fd)
+{
+ struct siginfo s_info = {
+ .si_signo = sig,
+ .si_errno = 0,
+ .si_code = __SI_RT
+ };
+ struct file *src_file = NULL;
+ struct task_struct *dst_task = NULL;
+ struct files_struct *dst_files = NULL;
+ unsigned long rlim = 0;
+ unsigned long flags = 0;
+ int rc = 0;
+
+ if ((sig < SIGRTMIN) || (sig > SIGRTMAX))
+ return -EINVAL;
+
+ s_info.si_pid = task_pid_vnr(current);
+ s_info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
+ s_info.si_int = -1;
+
+ src_file = fget(fd);
+ if (!src_file)
+ return -EBADF;
+
+ rcu_read_lock();
+ dst_task = find_task_by_vpid(pid);
+ if (dst_task)
+ get_task_struct(dst_task);
+ rcu_read_unlock();
+ if (!dst_task) {
+ rc = -ESRCH;
+ goto out_put_src_file;
+ }
+
+ dst_files = get_files_struct(dst_task);
+ if (!dst_files) {
+ rc = -EMFILE;
+ goto out_put_dst_task;
+ }
+
+ if (!lock_task_sighand(dst_task, &flags)) {
+ rc = -EMFILE;
+ goto out_put_dst_files;
+ }
+
+ rlim = task_rlimit(dst_task, RLIMIT_NOFILE);
+
+ unlock_task_sighand(dst_task, &flags);
+
+ rc = __alloc_fd(dst_files, 0, rlim, O_CLOEXEC);
+ if (rc < 0)
+ goto out_put_dst_files;
+
+ s_info.si_int = rc;
+
+ get_file(src_file);
+ __fd_install(dst_files, rc, src_file);
+ rc = kill_pid_info(sig, &s_info, task_pid(dst_task));
+
+ if (rc < 0)
+ __close_fd(dst_files, s_info.si_int);
+
+out_put_dst_files:
+ put_files_struct(dst_files);
+out_put_dst_task:
+ put_task_struct(dst_task);
+out_put_src_file:
+ fput(src_file);
+ return rc;
+}
diff --git a/init/Kconfig b/init/Kconfig
index 2081a4d..dfe8b6f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1525,6 +1525,17 @@ config EVENTFD

If unsure, say Y.

+config SENDFD
+ bool "Enable sendfd() system call" if EXPERT
+ default y
+ help
+ Enable the sendfd() system call that allows rapid duplication
+ of a file descriptor across process boundaries. The target process
+ will receive a duplicate file descriptor delivered with one of
+ the Posix.1b real-time signals.
+
+ If unsure, say Y.
+
# syscall, maps, verifier
config BPF_SYSCALL
bool "Enable bpf() system call" if EXPERT
--
1.8.3.2

2014-12-02 04:35:39

by Alex Dubov

Subject: [PATCH 2/2] fs: Wire up sendfd() syscall (all architectures)

Signed-off-by: Alex Dubov <[email protected]>
---
arch/arm/include/uapi/asm/unistd.h | 1 +
arch/arm/kernel/calls.S | 1 +
arch/arm64/include/asm/unistd32.h | 2 ++
arch/ia64/include/uapi/asm/unistd.h | 1 +
arch/ia64/kernel/entry.S | 1 +
arch/m68k/include/uapi/asm/unistd.h | 1 +
arch/m68k/kernel/syscalltable.S | 1 +
arch/microblaze/include/uapi/asm/unistd.h | 1 +
arch/microblaze/kernel/syscall_table.S | 1 +
arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
arch/parisc/include/uapi/asm/unistd.h | 3 ++-
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/uapi/asm/unistd.h | 1 +
arch/s390/include/uapi/asm/unistd.h | 3 ++-
arch/s390/kernel/compat_wrapper.c | 1 +
arch/s390/kernel/syscalls.S | 1 +
arch/sparc/include/uapi/asm/unistd.h | 3 ++-
arch/sparc/kernel/systbls_32.S | 2 +-
arch/sparc/kernel/systbls_64.S | 4 ++--
arch/x86/syscalls/syscall_32.tbl | 1 +
arch/x86/syscalls/syscall_64.tbl | 1 +
arch/xtensa/include/uapi/asm/unistd.h | 5 +++--
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 4 +++-
kernel/sys_ni.c | 3 +++
29 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 705bb76..6428823 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -413,6 +413,7 @@
#define __NR_getrandom (__NR_SYSCALL_BASE+384)
#define __NR_memfd_create (__NR_SYSCALL_BASE+385)
#define __NR_bpf (__NR_SYSCALL_BASE+386)
+#define __NR_sendfd (__NR_SYSCALL_BASE+387)

/*
* The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index e51833f..30bdeb5 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -396,6 +396,7 @@
CALL(sys_getrandom)
/* 385 */ CALL(sys_memfd_create)
CALL(sys_bpf)
+ CALL(sys_sendfd)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 9dfdac4..7f19595 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -794,3 +794,5 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
__SYSCALL(__NR_memfd_create, sys_memfd_create)
#define __NR_bpf 386
__SYSCALL(__NR_bpf, sys_bpf)
+#define __NR_sendfd 387
+__SYSCALL(__NR_sendfd, sys_sendfd)
diff --git a/arch/ia64/include/uapi/asm/unistd.h b/arch/ia64/include/uapi/asm/unistd.h
index 4c2240c..55be68c 100644
--- a/arch/ia64/include/uapi/asm/unistd.h
+++ b/arch/ia64/include/uapi/asm/unistd.h
@@ -331,5 +331,6 @@
#define __NR_getrandom 1339
#define __NR_memfd_create 1340
#define __NR_bpf 1341
+#define __NR_sendfd 1342

#endif /* _UAPI_ASM_IA64_UNISTD_H */
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index f5e96df..97596a3 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1779,6 +1779,7 @@ sys_call_table:
data8 sys_getrandom
data8 sys_memfd_create // 1340
data8 sys_bpf
+ data8 sys_sendfd

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
#endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
diff --git a/arch/m68k/include/uapi/asm/unistd.h b/arch/m68k/include/uapi/asm/unistd.h
index 2c1bec9..77e7098 100644
--- a/arch/m68k/include/uapi/asm/unistd.h
+++ b/arch/m68k/include/uapi/asm/unistd.h
@@ -360,5 +360,6 @@
#define __NR_getrandom 352
#define __NR_memfd_create 353
#define __NR_bpf 354
+#define __NR_sendfd 355

#endif /* _UAPI_ASM_M68K_UNISTD_H_ */
diff --git a/arch/m68k/kernel/syscalltable.S b/arch/m68k/kernel/syscalltable.S
index 2ca219e..3ea20d4 100644
--- a/arch/m68k/kernel/syscalltable.S
+++ b/arch/m68k/kernel/syscalltable.S
@@ -375,4 +375,5 @@ ENTRY(sys_call_table)
.long sys_getrandom
.long sys_memfd_create
.long sys_bpf
+ .long sys_sendfd

diff --git a/arch/microblaze/include/uapi/asm/unistd.h b/arch/microblaze/include/uapi/asm/unistd.h
index c712677..f69e30a 100644
--- a/arch/microblaze/include/uapi/asm/unistd.h
+++ b/arch/microblaze/include/uapi/asm/unistd.h
@@ -403,5 +403,6 @@
#define __NR_getrandom 385
#define __NR_memfd_create 386
#define __NR_bpf 387
+#define __NR_sendfd 388

#endif /* _UAPI_ASM_MICROBLAZE_UNISTD_H */
diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
index 0166e89..1550f45 100644
--- a/arch/microblaze/kernel/syscall_table.S
+++ b/arch/microblaze/kernel/syscall_table.S
@@ -388,3 +388,4 @@ ENTRY(sys_call_table)
.long sys_getrandom /* 385 */
.long sys_memfd_create
.long sys_bpf
+ .long sys_sendfd
diff --git a/arch/mips/include/uapi/asm/unistd.h b/arch/mips/include/uapi/asm/unistd.h
index d001bb1..24109dc 100644
--- a/arch/mips/include/uapi/asm/unistd.h
+++ b/arch/mips/include/uapi/asm/unistd.h
@@ -376,16 +376,17 @@
#define __NR_getrandom (__NR_Linux + 353)
#define __NR_memfd_create (__NR_Linux + 354)
#define __NR_bpf (__NR_Linux + 355)
+#define __NR_sendfd (__NR_Linux + 356)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 355
+#define __NR_Linux_syscalls 356

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

#define __NR_O32_Linux 4000
-#define __NR_O32_Linux_syscalls 355
+#define __NR_O32_Linux_syscalls 356

#if _MIPS_SIM == _MIPS_SIM_ABI64

@@ -709,16 +710,17 @@
#define __NR_getrandom (__NR_Linux + 313)
#define __NR_memfd_create (__NR_Linux + 314)
#define __NR_bpf (__NR_Linux + 315)
+#define __NR_sendfd (__NR_Linux + 316)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 315
+#define __NR_Linux_syscalls 316

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

#define __NR_64_Linux 5000
-#define __NR_64_Linux_syscalls 315
+#define __NR_64_Linux_syscalls 316

#if _MIPS_SIM == _MIPS_SIM_NABI32

@@ -1046,15 +1048,16 @@
#define __NR_getrandom (__NR_Linux + 317)
#define __NR_memfd_create (__NR_Linux + 318)
#define __NR_bpf (__NR_Linux + 319)
+#define __NR_sendfd (__NR_Linux + 320)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 319
+#define __NR_Linux_syscalls 320

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

#define __NR_N32_Linux 6000
-#define __NR_N32_Linux_syscalls 319
+#define __NR_N32_Linux_syscalls 320

#endif /* _UAPI_ASM_UNISTD_H */
diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 00cad10..94a7014 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -580,3 +580,4 @@ EXPORT(sys_call_table)
PTR sys_getrandom
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
+ PTR sys_sendfd
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index 5251565..cc2440d 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -435,4 +435,5 @@ EXPORT(sys_call_table)
PTR sys_getrandom
PTR sys_memfd_create
PTR sys_bpf /* 5315 */
+ PTR sys_sendfd
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index 77e7439..ff1de3a 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -428,4 +428,5 @@ EXPORT(sysn32_call_table)
PTR sys_getrandom
PTR sys_memfd_create
PTR sys_bpf
+ PTR sys_sendfd
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 6f8db9f..87d3a33 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -565,4 +565,5 @@ EXPORT(sys32_call_table)
PTR sys_getrandom
PTR sys_memfd_create
PTR sys_bpf /* 4355 */
+ PTR sys_sendfd
.size sys32_call_table,.-sys32_call_table
diff --git a/arch/parisc/include/uapi/asm/unistd.h b/arch/parisc/include/uapi/asm/unistd.h
index 5f5c037..f182787 100644
--- a/arch/parisc/include/uapi/asm/unistd.h
+++ b/arch/parisc/include/uapi/asm/unistd.h
@@ -834,8 +834,9 @@
#define __NR_getrandom (__NR_Linux + 339)
#define __NR_memfd_create (__NR_Linux + 340)
#define __NR_bpf (__NR_Linux + 341)
+#define __NR_sendfd (__NR_Linux + 342)

-#define __NR_Linux_syscalls (__NR_bpf + 1)
+#define __NR_Linux_syscalls (__NR_sendfd + 1)


#define __IGNORE_select /* newselect */
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index ce9577d..4aa6c22 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -366,3 +366,4 @@ SYSCALL_SPU(seccomp)
SYSCALL_SPU(getrandom)
SYSCALL_SPU(memfd_create)
SYSCALL_SPU(bpf)
+SYSCALL_SPU(sendfd)
diff --git a/arch/powerpc/include/uapi/asm/unistd.h b/arch/powerpc/include/uapi/asm/unistd.h
index f55351f..2d55338 100644
--- a/arch/powerpc/include/uapi/asm/unistd.h
+++ b/arch/powerpc/include/uapi/asm/unistd.h
@@ -384,5 +384,6 @@
#define __NR_getrandom 359
#define __NR_memfd_create 360
#define __NR_bpf 361
+#define __NR_sendfd 362

#endif /* _UAPI_ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/s390/include/uapi/asm/unistd.h b/arch/s390/include/uapi/asm/unistd.h
index 4197c89..7248c4a 100644
--- a/arch/s390/include/uapi/asm/unistd.h
+++ b/arch/s390/include/uapi/asm/unistd.h
@@ -287,7 +287,8 @@
#define __NR_getrandom 349
#define __NR_memfd_create 350
#define __NR_bpf 351
-#define NR_syscalls 352
+#define __NR_sendfd 352
+#define NR_syscalls 353

/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.c b/arch/s390/kernel/compat_wrapper.c
index c4f7a3d..d931326 100644
--- a/arch/s390/kernel/compat_wrapper.c
+++ b/arch/s390/kernel/compat_wrapper.c
@@ -218,3 +218,4 @@ COMPAT_SYSCALL_WRAP3(seccomp, unsigned int, op, unsigned int, flags, const char
COMPAT_SYSCALL_WRAP3(getrandom, char __user *, buf, size_t, count, unsigned int, flags)
COMPAT_SYSCALL_WRAP2(memfd_create, const char __user *, uname, unsigned int, flags)
COMPAT_SYSCALL_WRAP3(bpf, int, cmd, union bpf_attr *, attr, unsigned int, size);
+COMPAT_SYSCALL_WRAP3(sendfd, pid_t, pid, int, sig, int, fd);
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 9f7087f..b1beaf1 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -360,3 +360,4 @@ SYSCALL(sys_seccomp,sys_seccomp,compat_sys_seccomp)
SYSCALL(sys_getrandom,sys_getrandom,compat_sys_getrandom)
SYSCALL(sys_memfd_create,sys_memfd_create,compat_sys_memfd_create) /* 350 */
SYSCALL(sys_bpf,sys_bpf,compat_sys_bpf)
+SYSCALL(sys_sendfd,sys_sendfd,compat_sys_sendfd)
diff --git a/arch/sparc/include/uapi/asm/unistd.h b/arch/sparc/include/uapi/asm/unistd.h
index 46d8384..a43637a 100644
--- a/arch/sparc/include/uapi/asm/unistd.h
+++ b/arch/sparc/include/uapi/asm/unistd.h
@@ -415,8 +415,9 @@
#define __NR_getrandom 347
#define __NR_memfd_create 348
#define __NR_bpf 349
+#define __NR_sendfd 350

-#define NR_syscalls 350
+#define NR_syscalls 351

/* Bitmask values returned from kern_features system call. */
#define KERN_FEATURE_MIXED_MODE_STACK 0x00000001
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index ad0cdf4..1b3ff92 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -86,4 +86,4 @@ sys_call_table:
/*330*/ .long sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, sys_open_by_handle_at, sys_clock_adjtime
/*335*/ .long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
/*340*/ .long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
-/*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+/*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf, sys_sendfd
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 580cde9..ebbafb1 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -87,7 +87,7 @@ sys_call_table32:
/*330*/ .word compat_sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, compat_sys_open_by_handle_at, compat_sys_clock_adjtime
.word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
/*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
- .word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+ .word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf, sys_sendfd

#endif /* CONFIG_COMPAT */

@@ -166,4 +166,4 @@ sys_call_table:
/*330*/ .word sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, sys_open_by_handle_at, sys_clock_adjtime
.word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
/*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
- .word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
+ .word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf, sys_sendfd
diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 9fe1b5d..dfe91f7 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -364,3 +364,4 @@
355 i386 getrandom sys_getrandom
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
+358 i386 sendfd sys_sendfd
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 281150b..4d6b55d 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -328,6 +328,7 @@
319 common memfd_create sys_memfd_create
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
+322 common sendfd sys_sendfd

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/include/uapi/asm/unistd.h b/arch/xtensa/include/uapi/asm/unistd.h
index db5bb72..3705d28 100644
--- a/arch/xtensa/include/uapi/asm/unistd.h
+++ b/arch/xtensa/include/uapi/asm/unistd.h
@@ -749,8 +749,9 @@ __SYSCALL(337, sys_seccomp, 3)
__SYSCALL(338, sys_getrandom, 3)
#define __NR_memfd_create 339
__SYSCALL(339, sys_memfd_create, 2)
-
-#define __NR_syscall_count 340
+#define __NR_sendfd 340
+__SYSCALL(340, sys_sendfd, 3)
+#define __NR_syscall_count 341

/*
* sysxtensa syscall handler
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index bda9b81..1871b72f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -877,4 +877,5 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
asmlinkage long sys_getrandom(char __user *buf, size_t count,
unsigned int flags);
asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
+asmlinkage long sys_sendfd(pid_t pid, int sig, int fd);
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 22749c1..270aa02 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -707,9 +707,11 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
__SYSCALL(__NR_memfd_create, sys_memfd_create)
#define __NR_bpf 280
__SYSCALL(__NR_bpf, sys_bpf)
+#define __NR_sendfd 281
+__SYSCALL(__NR_sendfd, sys_sendfd)

#undef __NR_syscalls
-#define __NR_syscalls 281
+#define __NR_syscalls 282

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 02aa418..353cddb 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -224,3 +224,6 @@ cond_syscall(sys_seccomp);

/* access BPF programs and maps */
cond_syscall(sys_bpf);
+
+/* send file descriptor to another process */
+cond_syscall(sys_sendfd);
--
1.8.3.2

2014-12-02 08:01:07

by Geert Uytterhoeven

Subject: Re: [PATCH 2/2] fs: Wire up sendfd() syscall (all architectures)

This really needs a CC to linux-arch (added).

On Tue, Dec 2, 2014 at 5:35 AM, Alex Dubov <[email protected]> wrote:
> Signed-off-by: Alex Dubov <[email protected]>
> ---
> arch/arm/include/uapi/asm/unistd.h | 1 +
> arch/arm/kernel/calls.S | 1 +
> arch/arm64/include/asm/unistd32.h | 2 ++
> arch/ia64/include/uapi/asm/unistd.h | 1 +
> arch/ia64/kernel/entry.S | 1 +
> arch/m68k/include/uapi/asm/unistd.h | 1 +
> arch/m68k/kernel/syscalltable.S | 1 +

You forgot to update NR_syscalls in arch/m68k/include/asm/unistd.h.

> arch/microblaze/include/uapi/asm/unistd.h | 1 +
> arch/microblaze/kernel/syscall_table.S | 1 +
> arch/mips/include/uapi/asm/unistd.h | 15 +++++++++------
> arch/mips/kernel/scall32-o32.S | 1 +
> arch/mips/kernel/scall64-64.S | 1 +
> arch/mips/kernel/scall64-n32.S | 1 +
> arch/mips/kernel/scall64-o32.S | 1 +
> arch/parisc/include/uapi/asm/unistd.h | 3 ++-
> arch/powerpc/include/asm/systbl.h | 1 +
> arch/powerpc/include/uapi/asm/unistd.h | 1 +
> arch/s390/include/uapi/asm/unistd.h | 3 ++-
> arch/s390/kernel/compat_wrapper.c | 1 +
> arch/s390/kernel/syscalls.S | 1 +
> arch/sparc/include/uapi/asm/unistd.h | 3 ++-
> arch/sparc/kernel/systbls_32.S | 2 +-
> arch/sparc/kernel/systbls_64.S | 4 ++--
> arch/x86/syscalls/syscall_32.tbl | 1 +
> arch/x86/syscalls/syscall_64.tbl | 1 +
> arch/xtensa/include/uapi/asm/unistd.h | 5 +++--
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 4 +++-
> kernel/sys_ni.c | 3 +++
> 29 files changed, 48 insertions(+), 15 deletions(-)
>
> [...]
> diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
> index ad0cdf4..1b3ff92 100644
> --- a/arch/sparc/kernel/systbls_32.S
> +++ b/arch/sparc/kernel/systbls_32.S
> @@ -86,4 +86,4 @@ sys_call_table:
> /*330*/ .long sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, sys_open_by_handle_at, sys_clock_adjtime
> /*335*/ .long sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
> /*340*/ .long sys_ni_syscall, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
> -/*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
> +/*345*/ .long sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf, sys_sendfd
> diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
> index 580cde9..ebbafb1 100644
> --- a/arch/sparc/kernel/systbls_64.S
> +++ b/arch/sparc/kernel/systbls_64.S
> @@ -87,7 +87,7 @@ sys_call_table32:
> /*330*/ .word compat_sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, compat_sys_open_by_handle_at, compat_sys_clock_adjtime
> .word sys_syncfs, compat_sys_sendmmsg, sys_setns, compat_sys_process_vm_readv, compat_sys_process_vm_writev
> /*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
> - .word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
> + .word sys32_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf, sys_sendfd
>
> #endif /* CONFIG_COMPAT */
>
> @@ -166,4 +166,4 @@ sys_call_table:
> /*330*/ .word sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, sys_open_by_handle_at, sys_clock_adjtime
> .word sys_syncfs, sys_sendmmsg, sys_setns, sys_process_vm_readv, sys_process_vm_writev
> /*340*/ .word sys_kern_features, sys_kcmp, sys_finit_module, sys_sched_setattr, sys_sched_getattr
> - .word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf
> + .word sys_renameat2, sys_seccomp, sys_getrandom, sys_memfd_create, sys_bpf, sys_sendfd
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 9fe1b5d..dfe91f7 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -364,3 +364,4 @@
> 355 i386 getrandom sys_getrandom
> 356 i386 memfd_create sys_memfd_create
> 357 i386 bpf sys_bpf
> +358 i386 sendfd sys_sendfd
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 281150b..4d6b55d 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -328,6 +328,7 @@
> 319 common memfd_create sys_memfd_create
> 320 common kexec_file_load sys_kexec_file_load
> 321 common bpf sys_bpf
> +322 common sendfd sys_sendfd
>
> #
> # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/arch/xtensa/include/uapi/asm/unistd.h b/arch/xtensa/include/uapi/asm/unistd.h
> index db5bb72..3705d28 100644
> --- a/arch/xtensa/include/uapi/asm/unistd.h
> +++ b/arch/xtensa/include/uapi/asm/unistd.h
> @@ -749,8 +749,9 @@ __SYSCALL(337, sys_seccomp, 3)
> __SYSCALL(338, sys_getrandom, 3)
> #define __NR_memfd_create 339
> __SYSCALL(339, sys_memfd_create, 2)
> -
> -#define __NR_syscall_count 340
> +#define __NR_sendfd 340
> +__SYSCALL(340, sys_sendfd, 3)
> +#define __NR_syscall_count 341
>
> /*
> * sysxtensa syscall handler
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index bda9b81..1871b72f 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -877,4 +877,5 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
> asmlinkage long sys_getrandom(char __user *buf, size_t count,
> unsigned int flags);
> asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size);
> +asmlinkage long sys_sendfd(pid_t pid, int sig, int fd);
> #endif
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 22749c1..270aa02 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -707,9 +707,11 @@ __SYSCALL(__NR_getrandom, sys_getrandom)
> __SYSCALL(__NR_memfd_create, sys_memfd_create)
> #define __NR_bpf 280
> __SYSCALL(__NR_bpf, sys_bpf)
> +#define __NR_sendfd 281
> +__SYSCALL(__NR_sendfd, sys_sendfd)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 281
> +#define __NR_syscalls 282
>
> /*
> * All syscalls below here should go away really,
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 02aa418..353cddb 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -224,3 +224,6 @@ cond_syscall(sys_seccomp);
>
> /* access BPF programs and maps */
> cond_syscall(sys_bpf);
> +
> +/* send file descriptor to another process */
> +cond_syscall(sys_sendfd);
> --
> 1.8.3.2
>

--
Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2014-12-02 08:31:28

by Alex Dubov

Subject: Re: [PATCH 2/2] fs: Wire up sendfd() syscall (all architectures)

On Tue, Dec 2, 2014 at 7:01 PM, Geert Uytterhoeven <[email protected]> wrote:
> This really needs a CC to linux-arch (added).
>
>
> You forgot to update NR_syscalls in arch/m68k/include/asm/unistd.h.
>

Noted. I would assume that other architectures may have similar
problems (I only tested my submission on x86_64).

Will try to fix those when/if there's progress toward accepting the
proposed feature.

2014-12-02 11:42:19

by Michal Simek

Subject: Re: [PATCH 2/2] fs: Wire up sendfd() syscall (all architectures)

On 12/02/2014 09:01 AM, Geert Uytterhoeven wrote:
> This really needs a CC to linux-arch (added).
>
> On Tue, Dec 2, 2014 at 5:35 AM, Alex Dubov <[email protected]> wrote:
>> Signed-off-by: Alex Dubov <[email protected]>
>> ---
>> arch/arm/include/uapi/asm/unistd.h | 1 +
>> arch/arm/kernel/calls.S | 1 +
>> arch/arm64/include/asm/unistd32.h | 2 ++
>> arch/ia64/include/uapi/asm/unistd.h | 1 +
>> arch/ia64/kernel/entry.S | 1 +
>> arch/m68k/include/uapi/asm/unistd.h | 1 +
>> arch/m68k/kernel/syscalltable.S | 1 +
>
> You forgot to update NR_syscalls in arch/m68k/include/asm/unistd.h.
>
>> arch/microblaze/include/uapi/asm/unistd.h | 1 +
>> arch/microblaze/kernel/syscall_table.S | 1 +


The same for microblaze here
arch/microblaze/include/asm/unistd.h

Thanks,
Michal

--
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: http://www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Microblaze cpu - http://www.monstr.eu/fdt/
Maintainer of Linux kernel - Xilinx Zynq ARM architecture
Microblaze U-BOOT custodian and responsible for u-boot arm zynq platform




2014-12-02 12:50:11

by Eric Dumazet

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Tue, 2014-12-02 at 15:35 +1100, Alex Dubov wrote:
> Present patch introduces exceptionally easy to use, low latency and low
> overhead mechanism for transferring file descriptors between cooperating
> processes:
>
> int sendfd(pid_t pid, int sig, int fd)
>
> Given a target process pid, the sendfd() syscall will create a duplicate
> file descriptor in a target task's (referred by pid) file table pointing to
> the file references by descriptor fd. Then, it will attempt to notify the
> target task by issuing a Posix.1b real-time signal (sig), carrying the new
> file descriptor as integer payload. If real-time signal can not be enqueued
> at the destination signal queue, the newly created file descriptor will be
> promptly closed.
>
> Signed-off-by: Alex Dubov <[email protected]>
> ---

User A can send fd(s) to processes belonging to user B, even if user B
does (probably) not want this to happen ?

Also, relying on signals seems quite old-fashioned these days. How about
multi-threaded programs wanting separate channels to receive fds ?

Ability to flood fds and fill target file descriptors table looks very
dangerous to me. Some programs could break as they expect they control
fd allocations.

I like the idea of not having to use AF_UNIX and stick to a well defined
interface, but I do not like this asynchronous model.

Thanks.

2014-12-02 14:31:58

by Alex Dubov

Subject: Re: [PATCH 2/2] fs: Wire up sendfd() syscall (all architectures)

On Tue, Dec 2, 2014 at 10:42 PM, Michal Simek <[email protected]> wrote:
> On 12/02/2014 09:01 AM, Geert Uytterhoeven wrote:
>> This really needs a CC to linux-arch (added).
>>
>> On Tue, Dec 2, 2014 at 5:35 AM, Alex Dubov <[email protected]> wrote:
>>> Signed-off-by: Alex Dubov <[email protected]>
>>> ---

>
> The same for microblaze here
> arch/microblaze/include/asm/unistd.h
>

This invites the question as to why the __NR_syscalls macro is not
defined in uapi/asm/unistd.h on those architectures, where it will be
easier to spot? After all, asm/unistd.h includes uapi/asm/unistd.h
unconditionally.

2014-12-02 14:38:27

by Geert Uytterhoeven

Subject: Re: [PATCH 2/2] fs: Wire up sendfd() syscall (all architectures)

Hi Alex,

On Tue, Dec 2, 2014 at 3:31 PM, Alex Dubov <[email protected]> wrote:
> On Tue, Dec 2, 2014 at 10:42 PM, Michal Simek <[email protected]> wrote:
>> On 12/02/2014 09:01 AM, Geert Uytterhoeven wrote:
>>> This really needs a CC to linux-arch (added).
>>>
>>> On Tue, Dec 2, 2014 at 5:35 AM, Alex Dubov <[email protected]> wrote:
>>>> Signed-off-by: Alex Dubov <[email protected]>
>>>> ---
>
>> The same for microblaze here
>> arch/microblaze/include/asm/unistd.h
>
> This invites the question as to why the __NR_syscalls macro is not
> defined in uapi/asm/unistd.h on those architectures, where it will be
> easier to spot? After all, asm/unistd.h includes uapi/asm/unistd.h
> unconditionally.

Because it's not part of the ABI?

There may be multiple ABIs, with multiple syscall ranges.
Userspace only needs to know if a syscall is available, not what the
valid syscall number range is.
The kernel does need to know the size of the full syscall table.
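
To illustrate the availability-versus-range point: a userspace program never
consults NR_syscalls; it invokes a number and checks for ENOSYS. A minimal
sketch, assuming Linux with the C library's syscall() wrapper. The helper name
syscall_is_wired() is made up for illustration, and probing this way is only
safe for numbers whose argument-less invocation is harmless.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Returns 0 if the kernel answers ENOSYS for syscall number `nr`
 * (nothing is wired up at that slot), 1 otherwise.
 * Hypothetical helper: it passes no arguments, so only probe numbers
 * for which that is harmless (SYS_getpid, for instance).
 */
int syscall_is_wired(long nr)
{
    errno = 0;
    if (syscall(nr) == -1 && errno == ENOSYS)
        return 0;
    return 1;
}
```

Calling syscall_is_wired(SYS_getpid) should report 1 on any Linux system; the
size of the kernel's table never enters the picture on the userspace side.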

Gr{oetje,eeting}s,

Geert


2014-12-02 14:47:16

by Alex Dubov

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

> User A can send fd(s) to processes belonging to user B, even if user B
> does (probably) not want this to happen ?

1. Process A must have sufficient permissions to signal process B.
This will only happen if process A belongs to the same user as process
B or has elevated capabilities, which can not appear by themselves
(and if root on some machine can not be trusted, then all is lost
anyway).

2. If process B has not specified explicitly how it wants the
particular signal to be handled, it will be killed by the default
handler. End of story, nothing else is going to happen.

I suppose, I can add an extra permissions check prior to creating the
new file descriptor in the first place.

> Also, relying on signals seems quite old fashion these days. How about
> multi-threaded programs wanting separate channels to receive fds ?

Most multi-threaded programs share the same file table between all
threads (unless some fancy clone() magic is involved), so the issue is
rather mundane. At any rate, each thread has its own pid and the usual
signal routing applies.

At a more generic level, Posix real-time signals are anything but
old-fashioned. The sigqueue()/signalfd() pair provides a very convenient,
low overhead micro-messaging facility with ordered, reliable delivery.
I fail to see what's wrong with making a worthy use of it.

2014-12-02 15:26:47

by Jonathan Corbet

Subject: Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

On Tue, 2 Dec 2014 15:35:17 +1100
Alex Dubov <[email protected]> wrote:

> int sendfd(pid_t pid, int sig, int fd)
>
> Given a target process pid, the sendfd() syscall will create a duplicate
> file descriptor in a target task's (referred by pid) file table pointing to
> the file references by descriptor fd. Then, it will attempt to notify the
> target task by issuing a Posix.1b real-time signal (sig), carrying the new
> file descriptor as integer payload. If real-time signal can not be enqueued
> at the destination signal queue, the newly created file descriptor will be
> promptly closed.

[ CC += linux-api ]

So I'm not a syscall API design expert, but this one raises a few
questions with me.

- Messing with another process's file descriptor table without its
knowledge looks like a possible source of all kinds of problems. Might
there be race conditions with close()/dup() code, for example? And
remember that users can be root in a user namespace; maybe there's no
potential for mischief there, but it needs to be considered.

- Forcing the use of realtime signals seems strange; this isn't a
realtime operation by any stretch.

- How might the sending process communicate to the recipient what the fd
is for? Even if a process only expects one type of file descriptor,
the ability to communicate information other than its number seems
like it would often be useful.

Some of these concerns might be addressable by requiring the recipient to
call acceptfd() (or some such) with the ability to use poll(). As an
alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
would you still need this feature?

Thanks,

jon

2014-12-02 15:33:05

by Eric Dumazet

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, 2014-12-03 at 01:47 +1100, Alex Dubov wrote:
> > User A can send fd(s) to processes belonging to user B, even if user B
> > does (probably) not want this to happen ?
>
> 1. Process A must have sufficient permissions to signal process B.
> This will only happen if process A belongs to the same user as process
> B or has elevated capabilities, which can not appear by themselves
> (and if root on some machine can not be trusted, then all is lost
> anyway).
>

I do not see this enforced in your patch.

Allowing a process to hold many times the lock protecting my file
descriptor table is very scary.

Reserving a slot, then undo this if the signal failed is a nice way to
slow down critical programs and eventually block them from doing
progress when using file descriptors (most system calls afaik)


> 2. If process B has not specified explicitly how it wants the
> particular signal to be handled, it will be killed by the default
> handler. End of story, nothing else is going to happen.

So it seems possible for an arbitrary program to send fds to innocent
programs, that will likely fill their fd table and won't be able to open
a new file.

This opens interesting security issues and attack vectors.

2014-12-02 16:15:22

by Alex Dubov

Subject: Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

On Wed, Dec 3, 2014 at 2:26 AM, Jonathan Corbet <[email protected]> wrote:
> On Tue, 2 Dec 2014 15:35:17 +1100
> Alex Dubov <[email protected]> wrote:
>
>
> - Messing with another process's file descriptor table without its
> knowledge looks like a possible source of all kinds problems. Might
> there be race conditions with close()/dup() code, for example? And
> remember that users can be root in a user namespace; maybe there's no
> potential for mischief there, but it needs to be considered.

If process A has sufficient permissions to signal process B, it can
already do arbitrary mischief, no news there (SIGKILL and SIGSTOP will
definitely cause more havoc :-).

I don't believe there can be any race conditions as this is not
different to what happens when dup() is invoked from one of the
threads in multi-threaded application, whereupon other threads go on
with their usual file operations. Descriptor duplication happens prior
to any signal handling activities.

> - Forcing the use of realtime signals seems strange; this isn't a
> realtime operation by any stretch.

"Real time signals" are merely a misleading name for Posix.1b
micro-messaging facility. To the best of my knowledge they do not
affect scheduling any more than SIGIO or SIGALRM would.

As Posix.1b signals are best handled by signalfd() facility anyway, no
impact on scheduling compared to any other approach (including the
existing domain socket approach) is expected at all.

>
> - How might the sending process communicate to the recipient what the fd
> is for? Even if a process only expects one type of file descriptor,
> the ability to communicate information other than its number seems
> like it would often be useful.

There are 32 "real time" signals defined by default in kernel; this
range can be increased at will with kernel recompilation and glibc
will pick up the correct range automatically (this is Posix mandated
behavior and it actually works like that).

I have not seen an app yet that relied on more than half a dozen
distinct signal numbers. Thus any application can conveniently define
more than two dozen different fd varieties out of the box, delivered
to it with dedicated signal ids, whereupon in most practical
applications only 1 or 2 varieties of file descriptors are ever passed
around.
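
The scheme described here, one dedicated signal number per descriptor
variety, might look like the sketch below. The fd_kind categories and both
helper names are invented for illustration; only SIGRTMIN and SIGRTMAX come
from <signal.h>. Offsets are computed from SIGRTMIN rather than hard-coded,
since the C library may keep some real-time signals for its own use.

```c
#include <signal.h>

/* Hypothetical descriptor varieties an application might distinguish. */
enum fd_kind {
    FD_KIND_LISTENER,
    FD_KIND_CLIENT,
    FD_KIND_EVENT,
    FD_KIND_COUNT
};

/* Real-time signal assigned to `kind`, or -1 if it cannot be mapped. */
int signal_for_kind(enum fd_kind kind)
{
    int k = (int)kind;

    if (k < 0 || k >= FD_KIND_COUNT || SIGRTMIN + k > SIGRTMAX)
        return -1;
    return SIGRTMIN + k;
}

/* Inverse mapping used by the receiver; -1 for unrelated signals. */
int kind_for_signal(int sig)
{
    int k = sig - SIGRTMIN;

    if (k < 0 || k >= FD_KIND_COUNT)
        return -1;
    return k;
}
```

A receiver would put every signal_for_kind() value into the mask it gives to
signalfd() and call kind_for_signal() on the ssi_signo field of each record
it reads.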

>
> Some of these concerns might be addressable by requiring the recipient to
> call acceptfd() (or some such) with the ability to use poll(). As an
> alternative, I believe kdbus has fd-passing abilities; if kdbus goes in,
> would you still need this feature?

Any process willing to handle Posix.1b signals must explicitly
manipulate the signal masks - otherwise it will be killed the moment
signal is received. Thus, no special "acceptfd()" call is necessary on
the receiver side - applications usually don't modify their signal
masks unless they expect some particular signal to arrive.

kdbus has something like it and binder on android has it as well. The
problem with both of them are the same as with unix domain sockets
(which implement a whole, rather convoluted, cmsg facility to be ever
used for that single purpose): they try to solve big problems with
fancy functionality, whereupon fd passing is a nice side feature
(which then gets used the most).

To my understanding, commonly used functionality deserves to have its
own quick, low overhead path:

1. We've got eventfd() which is neat and all, but to use it we need an
easy way to pass its fd around.

2. We've got memfd_create() which is also neat, but to use it..

3. We've got fairly complex (and consequently buggy) functionality
like SO_REUSEPORT, but I can't avoid a feeling that if there was a low
overhead transport available to pass fds around (like the one
proposed), the old school approach of having one process running
tightly around accept() and sending sockets to workers may still rival
it (pity I don't have google's setup around to test it).

4. Most importantly, when network appliances are concerned (and those
represent a huge percentage of linux install base), it is desirable to
have the leanest possible code paths both in kernel and in the user
space (no functionality - no vulnerabilities to fish for) and still be
able to rely on multi-process applications (as multi-process
applications are considerably more reliable than multi-threaded ones,
for all the obvious reasons). A compact, easily traceable facility
comprising few hundred LOCs in the kernel, end to end, and very simple
application code (sigqueue() -> signalfd()) pose a distinct advantage
in this regard over largish subsystems which may provide similar
feature (invariably at the expense of unnecessary costs, like
persistent file system objects, specialized user-space libraries, etc).
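
For comparison, the existing AF_UNIX path that this thread keeps referring to
looks roughly like the sketch below: an SCM_RIGHTS control message sent with
sendmsg() and picked up with recvmsg(). It assumes Linux and uses an
anonymous socketpair() so that everything runs in one process; send_fd(),
recv_fd() and demo_pass_pipe() are illustrative names, not an established
API. Note the dummy data byte that has to travel with the control message.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send `fd` over the connected AF_UNIX socket `sock`; 0 on success. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';   /* stream sockets need a data byte to carry the cmsg */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from `sock`; returns the new fd or -1. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Pass a pipe's read end through a socketpair, then read via the copy. */
int demo_pass_pipe(void)
{
    int sv[2], p[2], copy, rc = -1;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (pipe(p) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    if (send_fd(sv[0], p[0]) == 0 && (copy = recv_fd(sv[1])) >= 0) {
        if (write(p[1], "x", 1) == 1 && read(copy, &c, 1) == 1 && c == 'x')
            rc = 0;
        close(copy);
    }

    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

The received descriptor is a fresh number in the receiver's table that refers
to the same open file, which is the effect sendfd() aims for with a single
call on the sending side.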

2014-12-02 16:23:09

by Alex Dubov

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 3, 2014 at 2:33 AM, Eric Dumazet <[email protected]> wrote:
> On Wed, 2014-12-03 at 01:47 +1100, Alex Dubov wrote:
>> > User A can send fd(s) to processes belonging to user B, even if user B
>> > does (probably) not want this to happen ?
>>
>> 1. Process A must have sufficient permissions to signal process B.
>> This will only happen if process A belongs to the same user as process
>> B or has elevated capabilities, which can not appear by themselves
>> (and if root on some machine can not be trusted, then all is lost
>> anyway).
>>
>
> I do not see this enforced in your patch.
>
> Allowing a process to hold many times the lock protecting my file
> descriptor table is very scary.
>
> Reserving a slot, then undo this if the signal failed is a nice way to
> slow down critical programs and eventually block them from doing
> progress when using file descriptors (most system calls afaik)

Yes, this is an omission. I already promised to tighten the security
in my last post. :)

>> 2. If process B has not specified explicitly how it wants the
>> particular signal to be handled, it will be killed by the default
>> handler. End of story, nothing else is going to happen.
>
> So it seems possible for an arbitrary program to send fds to innocent
> programs, that will likely fill their fd table and wont be able to open
> a new file.
>
> This opens interesting security issues and attack vectors.

Same as SIGKILL. And yet, our machines are still working fine.

If process A has sufficient capability to send signals to process B,
then process B is already at its mercy, fds or not fds.

2014-12-02 16:42:58

by Eric Dumazet

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, 2014-12-03 at 03:23 +1100, Alex Dubov wrote:

> Same as SIGKILL. And yet, our machines are still working fine.
>
> If process A has sufficient capability to send signals to process B,
> then process B is already at its mercy, fds or not fds.

Tell me how a 128 threads program can use this new mechanism in a
scalable way.

One signal per thread ?

I guess we'll keep AF_UNIX then, thank you.

2014-12-02 17:00:41

by Al Viro

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Tue, Dec 02, 2014 at 03:35:18PM +1100, Alex Dubov wrote:
> + dst_files = get_files_struct(dst_task);
> + if (!dst_files) {
> + rc = -EMFILE;
> + goto out_put_dst_task;
> + }
> +
> + if (!lock_task_sighand(dst_task, &flags)) {
> + rc = -EMFILE;
> + goto out_put_dst_files;
> + }
> +
> + rlim = task_rlimit(dst_task, RLIMIT_NOFILE);
> +
> + unlock_task_sighand(dst_task, &flags);
> +
> + rc = __alloc_fd(dst_task->files, 0, rlim, O_CLOEXEC);
> + if (rc < 0)
> + goto out_put_dst_files;
> +
> + s_info.si_int = rc;
> +
> + get_file(src_file);
> + __fd_install(dst_files, rc, src_file);
> + rc = kill_pid_info(sig, &s_info, task_pid(dst_task));
> +
> + if (rc < 0)
> + __close_fd(dst_files, s_info.si_int);

Oh, lovely... And we are guaranteed that it is still the same file, because...?

Not to mention anything else, this stuff violates the assumption used in a lot
of places - that the *only* way for a process to modify a descriptor table is
to have a reference to it obtained by something that had it as its current
descriptor table and not dropped since then. The way you do it might actually
turn out to be OK, but there's no way I'll take that without detailed analysis;
start with refcounting of struct file, for one thing - it does rely on the
assumption above in non-trivial ways.

Binder, shite as it is, satisfies that assumption. Your "simpler" variant
does not. Which means that you get to prove that you won't open any races
around fs/file.c.

And that's aside of the points other folks had brought up.

2014-12-03 02:11:30

by Alex Dubov

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 3, 2014 at 3:42 AM, Eric Dumazet <[email protected]> wrote:
> On Wed, 2014-12-03 at 03:23 +1100, Alex Dubov wrote:
>
> Tell me how a 128 threads program can use this new mechanism in a
> scalable way.
>
> One signal per thread ?

What for?

Kernel will deliver the signal only to the thread/threads which has
the relevant signal unblocked (they are blocked by default).

>
> I guess we'll keep AF_UNIX then, thank you.

Kindly enlighten me: how are you going to use any file descriptor in a
128-thread program in a scalable way (socket and all)? How will this
approach be any different when using signalfd()?

And no, I'm not proposing to take your favorite toys away.

2014-12-03 02:22:36

by Alex Dubov

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 3, 2014 at 4:00 AM, Al Viro <[email protected]> wrote:
> On Tue, Dec 02, 2014 at 03:35:18PM +1100, Alex Dubov wrote:
>> +
>> + if (rc < 0)
>> + __close_fd(dst_files, s_info.si_int);
>
> Oh, lovely... And we are guaranteed that it still the same file, because...?
>
> Not to mention anything else, this stuff violates the assumption used in a lot
> of places - that the *only* way for a process to modify a descriptor table is
> to have a reference to it obtained by something that had it as its current
> descriptor table and not dropped since then. The way you do it might actually
> turn out to be OK, but there's no way I'll take that without detailed analysis;
> start with refcounting of struct file, for one thing - it does rely on the
> assumption above in non-trivial ways.

Ok, I see the problem here. This indeed requires further thought.

> And that's aside of the points other folks had brought up.

Yours is the first insightful message in this thread. Some of the
other commenters exhibited an unfortunate lack of understanding,
regarding what signals are and what they can be useful for.

Unless, of course, I have missed something important.

On a less related note, I hope you will agree that the simpler
mechanism for this very in-demand feature is long overdue on Linux
(every man and his dog are passing fds around these days).

2014-12-03 03:40:23

by Al Viro

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 03, 2014 at 01:22:33PM +1100, Alex Dubov wrote:

> On a less related note, I hope you will agree that the simpler
> mechanism for this very in-demand feature is long overdue on Linux
> (every man and his dog are passing fds around these days).

... and I'm less than sure that it's a good thing. If nothing else,
once the pieces of your program are passing descriptors around freely,
you have created a barfball that will be impossible to split between
several boxen if you run into scalability issues. Descriptor-passing
is limited to a single system; you *can't* do that between e.g. components
of a cluster. So it's not an unmixed blessing, just as overuse of
shared memory segments, etc. They do have their uses, but that needs
to be carefully considered every time, or you'll create a major headache
a few years down the road.

2014-12-03 04:14:55

by Alex Dubov

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 3, 2014 at 2:40 PM, Al Viro <[email protected]> wrote:
> On Wed, Dec 03, 2014 at 01:22:33PM +1100, Alex Dubov wrote:
>
>> On a less related note, I hope you will agree that the simpler
>> mechanism for this very in-demand feature is long overdue on Linux
>> (every man and his dog are passing fds around these days).
>
> ... and I'm less than sure that it's a good thing. If nothing else,
> once the pieces of your program are passing descriptors around freely,
> you have created a barfball that will be impossible to split between
> several boxen if you run into scalability issues. Descriptor-passing
> is limited to a single system; you *can't* do that between e.g. components
> of a cluster. So it's not an unmixed blessing, just as overuse of
> shared memory segments, etc. They do have their uses, but that needs
> to be carefully considered every time, or you'll create a major headache
> a few years down the road.

Well, if you try hard enough, you can pass fds around the components
of the cluster - Mosix was doing just that some 10 years ago.
Conceptually, it's even easier than doing distributed shared memory,
as long as mmap is not concerned. :-)

I was, however, looking at it from a different standpoint. Abundance
of cores in the contemporary CPUs calls for locally parallel
applications (and those are still the majority - clearly 90% of the
applications and their workloads fit just fine on a single node).

Thus, any modern application developer faces the usual dilemma:

1. Go multi-threaded - easy inter-thread IPC, lousy reliability with
minor errors in secondary tasks crashing the whole application.

2. Go multi-process - circus hoop jumping when IPC is concerned, great
reliability through OS provided fault isolation (so even really broken
stuff, like PHP plugin for apache manages to perform most of the time
:-)

Memfd (on its own) and eventfd are great steps in the right direction,
as managing persistent shmem and sem objects was always pain in the
arse. If there was an alternative to AF_UNIX fd passing, with its
arcane API, fs persistence and mind boggling fd recursion bugs, then
option 2 would become much more attractive for developers, leading to
an overall increase in application reliability and security.

2014-12-03 06:48:30

by Eric Dumazet

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, 2014-12-03 at 13:11 +1100, Alex Dubov wrote:

> Kindly enlighten me, how are you going to use any file descriptor in a
> 128 threads program in a scalable way (socket and all)? How this
> approach will be different when using signalfd()?

That's the point: use one different channel (AF_UNIX socket, or AF_INET
listener...) per thread.

Each thread uses epoll() on a private epoll fd, and a dedicated channel
to get fds from other processes.
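A minimal sketch of that per-thread arrangement, assuming Linux and using a socketpair() as a stand-in for the dedicated channel (a bound AF_UNIX socket or an AF_INET listener would be registered the same way). The struct and function names are hypothetical and appear nowhere in the thread.

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical per-thread state: one private epoll fd, one dedicated channel. */
struct worker_channel {
    int epfd; /* private epoll instance, never shared with other threads */
    int rx;   /* receiving end, polled only by the owning thread */
    int tx;   /* sending end, handed to whoever feeds this thread */
};

/* Create the channel and register it with a private epoll fd.
 * Returns 0 on success, -1 on error. */
int worker_channel_init(struct worker_channel *wc)
{
    int sv[2];
    struct epoll_event ev = { .events = EPOLLIN };

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;
    wc->rx = sv[0];
    wc->tx = sv[1];

    wc->epfd = epoll_create1(EPOLL_CLOEXEC);
    if (wc->epfd < 0)
        goto fail;

    ev.data.fd = wc->rx;
    if (epoll_ctl(wc->epfd, EPOLL_CTL_ADD, wc->rx, &ev) < 0) {
        close(wc->epfd);
        goto fail;
    }
    return 0;

fail:
    close(sv[0]);
    close(sv[1]);
    return -1;
}

/* Wait up to timeout_ms for traffic on this thread's own channel.
 * Returns 1 if the channel is readable, 0 on timeout, -1 on error. */
int worker_channel_wait(struct worker_channel *wc, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(wc->epfd, &ev, 1, timeout_ms);

    if (n <= 0)
        return n;
    return ev.data.fd == wc->rx;
}
```

Each thread would call worker_channel_init() once and then loop on worker_channel_wait(), so no two threads ever contend for the same epoll set or the same socket.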

Sharing a signalfd() would be terrible, like using accept() on a single
listener socket :(

Your proposed interface, being tied to legacy signal(s), does not allow
for multiple channels.

Sorry, but using signals is simply a no go for me.

2014-12-03 06:50:50

by Eric Dumazet

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, 2014-12-03 at 13:22 +1100, Alex Dubov wrote:

> Yours is the first insightful message in this thread. Some of the
> other commenters exhibited an unfortunate lack of understanding,
> regarding what signals are and what they can be useful for.

Oh nice.

I think I will ignore your future mails.

2014-12-03 08:08:08

by Richard Cochran

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Tue, Dec 02, 2014 at 10:50:46PM -0800, Eric Dumazet wrote:
> I think I will ignore your future mails.

And I won't have time to read them either, because I will be too busy
passing fds to my two collies.

Cheers,
Richard

2014-12-03 08:17:39

by Richard Weinberger

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 3, 2014 at 9:08 AM, Richard Cochran wrote:
> On Tue, Dec 02, 2014 at 10:50:46PM -0800, Eric Dumazet wrote:
>> I think I will ignore your future mails.
>
> And I won't have time to read them either, because I will be too busy
> passing fds to my two collies.

Come on guys, get a cup of coffee and relax a bit...

--
Thanks,
//richard

2014-12-03 10:41:51

by Richard Cochran

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 03, 2014 at 09:17:37AM +0100, Richard Weinberger wrote:
> Come on guys, get a cup of coffee and relax a bit...

I am relaxed, especially after I had a good laugh reading this:

    On a less related note, I hope you will agree that the simpler
    mechanism for this very in-demand feature is long overdue on Linux
    (every man and his dog are passing fds around these days).

Really, in years and years of unix programming, I have not yet felt
the need to pass a file descriptor. That goes double for my dogs.

In any case, I find it hard to believe that the traditional method is
really so bad. The explanation of why this new way is needed boils
down to: "unix programming is so hard to get right."

Thanks,
Richard

2014-12-03 14:08:47

by Alex Dubov

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, Dec 3, 2014 at 9:41 PM, Richard Cochran wrote:
> In any case, I find it hard to believe that the traditional method is
> really so bad. The explanation of why this new way is needed boils
> down to: "unix programming is so hard to get right."


Surely, this can be said about any new feature proposed. Why do we
need this new thing called wheel? We lived 50k years without it just
fine! It all boils down to: "walking with legs is so hard to get
right". :-)

2014-12-05 13:37:58

by Alan Cox

Subject: Re: [PATCH 1/2] fs: introduce sendfd() syscall

On Wed, 3 Dec 2014 11:41:44 +0100
Richard Cochran wrote:

> On Wed, Dec 03, 2014 at 09:17:37AM +0100, Richard Weinberger wrote:
> > Come on guys, get a cup of coffee and relax a bit...
>
> I am relaxed, especially after I had a good laugh reading this:
>
> On a less related note, I hope you will agree that the simpler
> mechanism for this very in-demand feature is long overdue on Linux
> (every man and his dog are passing fds around these days).
>
> Really, in years and years of unix programming, I have not yet felt
> the need to pass a file descriptor. That goes double for my dogs.

It's underused in part because you need a pointy hat to do it in Unix, but
it's a very common model elsewhere.

Whether you need the syscall or just to write sendfd() and acceptfd() in
terms of AF_UNIX sockets in a library and bury the icky bits is another
question. I think the reality is you'd probably end up doing the library
*anyway* to deal with the fact it'll be 5 or more years before sendfd
percolates everywhere, even if it were merged today.
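A rough sketch of what such library wrappers might look like, built on SCM_RIGHTS ancillary messages over an already connected AF_UNIX socket. The names ux_sendfd() and ux_acceptfd() and their signatures are hypothetical; they are not the proposed syscall and not an existing library API.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Properly aligned buffer for one cmsg carrying a single int. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical wrapper: pass fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
int ux_sendfd(int sock, int fd)
{
    char dummy = 0; /* one payload byte for the cmsg to ride on */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical wrapper: receive one descriptor sent by ux_sendfd().
 * Returns the new descriptor, or -1 on error. */
int ux_acceptfd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The caller still has to establish the socket somehow (socketpair() for related processes, a bound AF_UNIX socket otherwise); the wrappers only hide the cmsg bookkeeping.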

Alan

2014-12-17 12:19:14

by Kevin Easton

Subject: Re: Minimal effort/low overhead file descriptor duplication over Posix.1b s

On Tue, Dec 02, 2014 at 03:35:17PM +1100, Alex Dubov wrote:
> Unfortunately, using facilities like Unix domain sockets to merely pass file
> descriptors between "worker" processes is unnecessarily difficult, due to
> the following common consideration:
>
> 1. Domain sockets and named pipes are persistent objects. Applications must
> manage their lifetime and devise unambiguous access schemes in case multiple
> application instances are to be run within the same OS instance. Usually, they
> would also require a writable file system to be mounted.

I believe this particular issue has long been addressed in Linux, with
the "abstract namespace" domain sockets.

These aren't persistent - they go away when the bound socket is closed -
and they don't need a writable filesystem.
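As an illustration of the above, here is a small Linux-only sketch that binds a datagram socket in the abstract namespace. The helper name bind_abstract() and the "<app>.<pid>" naming pattern are assumptions made for this example only.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: bind a datagram socket to the abstract name
 * "<app>.<pid>". Returns the socket, or -1 on error. */
int bind_abstract(const char *app, pid_t pid)
{
    struct sockaddr_un addr;
    socklen_t len;
    int n, sock;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;

    /* sun_path[0] stays '\0': the leading NUL byte selects the abstract
     * namespace, so nothing is ever created in the filesystem. */
    n = snprintf(addr.sun_path + 1, sizeof(addr.sun_path) - 1,
                 "%s.%ld", app, (long)pid);
    if (n < 0 || (size_t)n >= sizeof(addr.sun_path) - 1)
        return -1;

    /* Abstract names are length-delimited, not NUL-terminated. */
    len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + (size_t)n);

    sock = socket(AF_UNIX, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;
    if (bind(sock, (struct sockaddr *)&addr, len) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```

While the socket is open, a second bind to the same name fails; once it is closed the name is free again, with no file left behind to unlink.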

If you derived the name in the abstract namespace from your PID (or better,
application identifier and PID) then you would have exactly the same
"unambiguous access" scheme as your proposal.

> int sendfd(pid_t pid, int sig, int fd)

PIDs tend to be regarded as a bit of an iffy way to refer to another
process, because they tend to be racy. If the process you think you're
talking to dies, and has its PID reused by another unrelated sendfd()-aware
process, you've just sent your open file to somewhere unexpected.

You can avoid that if the process is a child of yours, but in that case
you could have set up a no-fuss domain socket connection with socketpair()
too.

- Kevin