2017-11-27 07:21:21

by Mike Rapoport

Subject: [PATCH v4 0/4] vm: add a syscall to map a process memory into a pipe

Hi,

This patch set introduces a new process_vmsplice system call that combines
the functionality of process_vm_readv and vmsplice.

It allows mapping the memory of another process into a pipe, similar to
what vmsplice does for the caller's own address space.
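
For comparison, here is a minimal sketch of the existing vmsplice(2) behaviour
that the new call extends to another process. It only uses the long-standing
vmsplice() wrapper on the caller's own memory; the helper name
vmsplice_roundtrip() is made up for illustration and is not part of this
series.

```c
#define _GNU_SOURCE
#include <fcntl.h>      /* vmsplice() */
#include <sys/types.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes at `src` into a fresh pipe with
 * vmsplice(2), then read them back into `dst`.
 * Returns the number of bytes read, or -1 on error.
 * Keep `len` well below the pipe capacity (64 KiB by default).
 */
static ssize_t vmsplice_roundtrip(const void *src, size_t len, void *dst)
{
	struct iovec iov = { .iov_base = (void *)src, .iov_len = len };
	int p[2];
	ssize_t spliced, got = -1;

	if (pipe(p))
		return -1;

	/* Attach the caller's own pages to the write end of the pipe. */
	spliced = vmsplice(p[1], &iov, 1, 0);
	if (spliced > 0)
		got = read(p[0], dst, (size_t)spliced);

	close(p[0]);
	close(p[1]);
	return got;
}
```

Calling vmsplice_roundtrip(src, 5, dst) on a five-byte buffer should hand the
same five bytes back through the pipe. process_vmsplice is meant to do the
same with the pages taken from the process identified by pid.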

Patch 2/4 ("vm: add a syscall to map a process memory into a pipe")
adds the new system call and describes it in detail.

The patchset is against the -mm tree.

v4: skip test when process_vmsplice syscall is not available
v3: minor refactoring to reduce code duplication
v2: move this syscall under CONFIG_CROSS_MEMORY_ATTACH
    give correct flags to get_user_pages_remote()

Andrei Vagin (3):
vm: add a syscall to map a process memory into a pipe
x86: wire up the process_vmsplice syscall
test: add a test for the process_vmsplice syscall

Mike Rapoport (1):
fs/splice: introduce pages_to_pipe helper

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/splice.c | 262 +++++++++++++++++++--
include/linux/compat.h | 3 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 2 +
tools/testing/selftests/process_vmsplice/Makefile | 5 +
.../process_vmsplice/process_vmsplice_test.c | 196 +++++++++++++++
9 files changed, 458 insertions(+), 22 deletions(-)
create mode 100644 tools/testing/selftests/process_vmsplice/Makefile
create mode 100644 tools/testing/selftests/process_vmsplice/process_vmsplice_test.c

--
2.7.4




2017-11-27 07:37:01

by Mike Rapoport

Subject: [PATCH v4 3/4] x86: wire up the process_vmsplice syscall

From: Andrei Vagin <[email protected]>

Signed-off-by: Andrei Vagin <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 ++
2 files changed, 3 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac21..dc64bf5 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
382 i386 pkey_free sys_pkey_free
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
+385 i386 process_vmsplice sys_process_vmsplice compat_sys_process_vmsplice
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183..d2f916c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
330 common pkey_alloc sys_pkey_alloc
331 common pkey_free sys_pkey_free
332 common statx sys_statx
+333 64 process_vmsplice sys_process_vmsplice

#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -380,3 +381,4 @@
545 x32 execveat compat_sys_execveat/ptregs
546 x32 preadv2 compat_sys_preadv64v2
547 x32 pwritev2 compat_sys_pwritev64v2
+548 x32 process_vmsplice compat_sys_process_vmsplice
--
2.7.4



2017-11-27 07:23:16

by Mike Rapoport

Subject: [PATCH] process_vmsplice.2: New page describing process_vmsplice(2) system call.

Signed-off-by: Mike Rapoport <[email protected]>
---
man2/process_vmsplice.2 | 188 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 188 insertions(+)
create mode 100644 man2/process_vmsplice.2

diff --git a/man2/process_vmsplice.2 b/man2/process_vmsplice.2
new file mode 100644
index 0000000..b99c06b
--- /dev/null
+++ b/man2/process_vmsplice.2
@@ -0,0 +1,188 @@
+.\" Copyright (c) 2017, IBM Corporation.
+.\" Written by Mike Rapoport <[email protected]>
+.\" Based on vmsplice(2) by Jens Axboe and
+.\" process_vm_readv(2) by Christopher Yeoh, Mike Frysinger and Michael Kerrisk
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of this
+.\" manual under the conditions for verbatim copying, provided that the
+.\" entire resulting derived work is distributed under the terms of a
+.\" permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume no
+.\" responsibility for errors or omissions, or for damages resulting from
+.\" the use of the information contained herein. The author(s) may not
+.\" have taken the same level of care in the production of this manual,
+.\" which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH PROCESS_VMSPLICE 2 2017-11-23 "Linux" "Linux Programmer's Manual"
+.SH NAME
+process_vmsplice \- splice user pages from a specific process
+address space into a pipe
+.SH SYNOPSIS
+.nf
+.BR "#define _GNU_SOURCE" " /* See feature_test_macros(7) */"
+.B #include <unistd.h>
+.B #include <sys/uio.h>
+.PP
+.BI "ssize_t process_vmsplice(pid_t " pid ", int " fd ,
+.BI " const struct iovec *" iov ,
+.BI " unsigned long " nr_segs ,
+.BI " unsigned int " flags );
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+The
+.BR process_vmsplice ()
+system call maps
+.I nr_segs
+ranges of user memory described by
+.I iov
+from address space of the process identified by
+.I pid
+into a pipe.
+The file descriptor
+.I fd
+must refer to a pipe.
+.PP
+The pointer
+.I iov
+points to an array of
+.I iovec
+structures as defined in
+.IR <sys/uio.h> :
+.PP
+.in +4n
+.EX
+struct iovec {
+ void *iov_base; /* Starting address */
+ size_t iov_len; /* Number of bytes */
+};
+.EE
+.in
+.PP
+The
+.I flags
+argument is a bit mask that is composed by ORing together
+zero or more of the following values:
+.RS
+.TP 1.9i
+.B SPLICE_F_MOVE
+Unused for
+.BR process_vmsplice ();
+see
+.BR splice (2).
+.TP
+.B SPLICE_F_NONBLOCK
+Do not block on I/O; see
+.BR splice (2)
+for further details.
+.TP
+.B SPLICE_F_MORE
+Currently has no effect for
+.BR process_vmsplice ()
+.TP
+.B SPLICE_F_GIFT
+The user pages are a gift to the kernel.
+See
+.BR vmsplice (2)
+for further details.
+.RE
+.PP
+Buffers pointed to by the
+.I iov
+parameter are processed in array order.
+This means that
+.BR process_vmsplice ()
+completely consumes
+.I iov[0]
+before proceeding to
+.IR iov[1] ,
+and so on.
+.PP
+The
+.BR process_vmsplice ()
+system call does not check the memory regions in the process
+until just before remapping those regions into the pipe.
+Consequently, a partial read may result if one of the
+.I iov
+elements points to an invalid memory region in the process.
+No further reads will be attempted beyond that point.
+.PP
+Permission to read from or write to another process
+is governed by a ptrace access mode
+.B PTRACE_MODE_ATTACH_REALCREDS
+check; see
+.BR ptrace (2).
+.SH RETURN VALUE
+Upon successful completion,
+.BR process_vmsplice ()
+returns the number of bytes transferred to the pipe.
+On error,
+.BR process_vmsplice ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EAGAIN
+.B SPLICE_F_NONBLOCK
+was specified in
+.IR flags ,
+and the operation would block.
+.TP
+.B EBADF
+.I fd
+is either not valid, or doesn't refer to a pipe.
+.TP
+.B EINVAL
+.I nr_segs
+is greater than
+.BR IOV_MAX ;
+or memory is not aligned when
+.B SPLICE_F_GIFT
+is set.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ESRCH
+No process with ID
+.I pid
+exists.
+.SH VERSIONS
+The
+.BR process_vmsplice ()
+system call first appeared in Linux 4.15.
+.SH CONFORMING TO
+This system call is Linux-specific.
+.SH NOTES
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+.BR process_vmsplice ()
+follows the other vectorized read/write type functions when it comes to
+limitations on the number of segments being passed in.
+This limit is
+.B IOV_MAX
+as defined in
+.IR <limits.h> .
+Currently,
+.\" UIO_MAXIOV in kernel source
+this limit is 1024.
+.SH SEE ALSO
+.BR process_vm_readv (2),
+.BR ptrace (2),
+.BR splice (2),
+.BR pipe (7)
--
2.7.4
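
The return-value rules described in the page above (segments handled in array
order, a partial result when a later segment is unreadable, the IOV_MAX bound
on nr_segs) can be summed up in a small user-space model. This is only a toy
sketch: struct model_seg and model_process_vmsplice() are invented names, the
readable flag stands in for the kernel's page lookup, pipe capacity is
ignored, and the EFAULT case is taken from the selftest in this series rather
than from the ERRORS list.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one iovec plus whether its target is readable. */
struct model_seg {
	size_t len;
	int readable;
};

/*
 * Toy model of the documented return value: walk the segments in array
 * order, stop at the first unreadable one, and report the bytes gathered
 * so far. Sets errno and returns -1 when nothing could be transferred.
 */
static ssize_t model_process_vmsplice(const struct model_seg *segs,
				      unsigned long nr_segs)
{
	long iov_max = sysconf(_SC_IOV_MAX);
	size_t total = 0;
	unsigned long i;

	/* More segments than IOV_MAX is rejected up front with EINVAL. */
	if (iov_max > 0 && nr_segs > (unsigned long)iov_max) {
		errno = EINVAL;
		return -1;
	}

	for (i = 0; i < nr_segs; i++) {
		if (!segs[i].readable) {
			if (total == 0) {
				errno = EFAULT;
				return -1;
			}
			break;	/* partial result: earlier segments count */
		}
		total += segs[i].len;
	}
	return (ssize_t)total;
}
```

With a readable one-byte segment followed by an unreadable one the model
returns 1, and with the unreadable segment alone it fails with EFAULT, which
matches the expectations encoded in the selftest.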



2017-11-27 07:21:37

by Mike Rapoport

Subject: [PATCH v4 4/4] test: add a test for the process_vmsplice syscall

From: Andrei Vagin <[email protected]>

This test checks that process_vmsplice() can splice pages from a remote
process and that it returns EFAULT when it tries to splice pages
from an inaccessible address.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
tools/testing/selftests/process_vmsplice/Makefile | 5 +
.../process_vmsplice/process_vmsplice_test.c | 196 +++++++++++++++++++++
2 files changed, 201 insertions(+)
create mode 100644 tools/testing/selftests/process_vmsplice/Makefile
create mode 100644 tools/testing/selftests/process_vmsplice/process_vmsplice_test.c

diff --git a/tools/testing/selftests/process_vmsplice/Makefile b/tools/testing/selftests/process_vmsplice/Makefile
new file mode 100644
index 0000000..246d5a7
--- /dev/null
+++ b/tools/testing/selftests/process_vmsplice/Makefile
@@ -0,0 +1,5 @@
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := process_vmsplice_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/process_vmsplice/process_vmsplice_test.c b/tools/testing/selftests/process_vmsplice/process_vmsplice_test.c
new file mode 100644
index 0000000..1682bdb
--- /dev/null
+++ b/tools/testing/selftests/process_vmsplice/process_vmsplice_test.c
@@ -0,0 +1,196 @@
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <fcntl.h>
+#include <sys/uio.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+
+#ifndef __NR_process_vmsplice
+#define __NR_process_vmsplice 333
+#endif
+
+#define pr_err(fmt, ...) \
+ ({ \
+ fprintf(stderr, "%s:%d:" fmt, \
+ __func__, __LINE__, ##__VA_ARGS__); \
+ KSFT_FAIL; \
+ })
+#define pr_perror(fmt, ...) pr_err(fmt ": %m\n", ##__VA_ARGS__)
+#define fail(fmt, ...) pr_err("FAIL:" fmt, ##__VA_ARGS__)
+
+static ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
+ unsigned long nr_segs, unsigned int flags)
+{
+ return syscall(__NR_process_vmsplice, pid, fd, iov, nr_segs, flags);
+
+}
+
+#define MEM_SIZE (4096 * 100)
+#define MEM_WRONLY_SIZE (4096 * 10)
+
+int main(int argc, char **argv)
+{
+ char *addr, *addr_wronly;
+ int p[2];
+ struct iovec iov[2];
+ char buf[4096];
+ int status, ret;
+ pid_t pid;
+
+ ksft_print_header();
+
+ if (process_vmsplice(0, 0, 0, 0, 0)) {
+ if (errno == ENOSYS) {
+ ksft_exit_skip("process_vmsplice is not supported\n");
+ return 0;
+ }
+ return pr_perror("Zero-length process_vmsplice failed");
+ }
+
+ addr = mmap(0, MEM_SIZE, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (addr == MAP_FAILED)
+ return pr_perror("Unable to create a mapping");
+
+ addr_wronly = mmap(0, MEM_WRONLY_SIZE, PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (addr_wronly == MAP_FAILED)
+ return pr_perror("Unable to create a write-only mapping");
+
+ if (pipe(p))
+ return pr_perror("Unable to create a pipe");
+
+ pid = fork();
+ if (pid < 0)
+ return pr_perror("Unable to fork");
+
+ if (pid == 0) {
+ addr[0] = 'C';
+ addr[4096 + 128] = 'A';
+ addr[4096 + 128 + 4096 - 1] = 'B';
+
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL))
+ return pr_perror("Unable to set PR_SET_PDEATHSIG");
+ if (write(p[1], "c", 1) != 1)
+ return pr_perror("Unable to write data into pipe");
+
+ while (1)
+ sleep(1);
+ return 1;
+ }
+ if (read(p[0], buf, 1) != 1) {
+ pr_perror("Unable to read data from pipe");
+ kill(pid, SIGKILL);
+ wait(&status);
+ return 1;
+ }
+
+ munmap(addr, MEM_SIZE);
+ munmap(addr_wronly, MEM_WRONLY_SIZE);
+
+ iov[0].iov_base = addr;
+ iov[0].iov_len = 1;
+
+ iov[1].iov_base = addr + 4096 + 128;
+ iov[1].iov_len = 4096;
+
+ /* check one iovec */
+ if (process_vmsplice(pid, p[1], iov, 1, SPLICE_F_GIFT) != 1)
+ return pr_perror("Unable to splice pages");
+
+ if (read(p[0], buf, 1) != 1)
+ return pr_perror("Unable to read from pipe");
+
+ if (buf[0] != 'C')
+ ksft_test_result_fail("Get wrong data\n");
+ else
+ ksft_test_result_pass("Check process_vmsplice with one vec\n");
+
+ /* check two iovec-s */
+ if (process_vmsplice(pid, p[1], iov, 2, SPLICE_F_GIFT) != 4097)
+ return pr_perror("Unable to splice pages");
+
+ if (read(p[0], buf, 1) != 1)
+ return pr_perror("Unable to read from pipe");
+
+ if (buf[0] != 'C')
+ ksft_test_result_fail("Get wrong data\n");
+
+ if (read(p[0], buf, 4096) != 4096)
+ return pr_perror("Unable to read from pipe");
+
+ if (buf[0] != 'A' || buf[4095] != 'B')
+ ksft_test_result_fail("Get wrong data\n");
+ else
+ ksft_test_result_pass("check process_vmsplice with two vecs\n");
+
+ /* check how an unreadable region in a second vec is handled */
+ iov[0].iov_base = addr;
+ iov[0].iov_len = 1;
+
+ iov[1].iov_base = addr_wronly + 5;
+ iov[1].iov_len = 1;
+
+ if (process_vmsplice(pid, p[1], iov, 2, SPLICE_F_GIFT) != 1)
+ return pr_perror("Unable to splice data");
+
+ if (read(p[0], buf, 1) != 1)
+ return pr_perror("Unable to read from pipe");
+
+ if (buf[0] != 'C')
+ ksft_test_result_fail("Get wrong data\n");
+ else
+ ksft_test_result_pass("unreadable region in a second vec\n");
+
+ /* check how an unreadable region in a first vec is handled */
+ errno = 0;
+ if (process_vmsplice(pid, p[1], iov + 1, 1, SPLICE_F_GIFT) != -1 ||
+ errno != EFAULT)
+ ksft_test_result_fail("Got unexpected errno %d\n", errno);
+ else
+ ksft_test_result_pass("unreadable region in a first vec\n");
+
+ iov[0].iov_base = addr;
+ iov[0].iov_len = 1;
+
+ iov[1].iov_base = addr;
+ iov[1].iov_len = MEM_SIZE;
+
+ /* splice as much as possible */
+ ret = process_vmsplice(pid, p[1], iov, 2,
+ SPLICE_F_GIFT | SPLICE_F_NONBLOCK);
+ if (ret != 4096 * 15 + 1) /* by default a pipe can fit 16 pages */
+ return pr_perror("Unable to splice pages");
+
+ while (ret > 0) {
+ int len;
+
+ len = read(p[0], buf, 4096);
+ if (len < 0)
+ return pr_perror("Unable to read data");
+ if (len > ret)
+ return pr_err("Read more than expected\n");
+ ret -= len;
+ }
+ ksft_test_result_pass("splice as much as possible\n");
+
+ if (kill(pid, SIGTERM))
+ return pr_perror("Unable to kill a child process");
+ status = -1;
+ if (wait(&status) < 0)
+ return pr_perror("Unable to wait for a child process");
+ if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGTERM)
+ return pr_err("The child exited with an unexpected code %d\n",
+ status);
+
+ if (ksft_get_fail_cnt())
+ return ksft_exit_fail();
+ return ksft_exit_pass();
+}
--
2.7.4
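
The skip logic at the top of the test (a zero-argument call that is treated as
"not supported" when errno is ENOSYS) can be written as a generic probe. The
sketch below is an illustration only; syscall_is_missing() is a made-up name.
It deliberately takes the number as a parameter, because a number taken from
an unmerged patch may be assigned to a different call on other kernels.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: return 1 if the kernel rejects syscall number `nr`
 * with ENOSYS, 0 otherwise. All arguments are passed as zero, so this is
 * only appropriate for calls where an all-zero invocation is harmless.
 */
static int syscall_is_missing(long nr)
{
	errno = 0;
	if (syscall(nr, 0, 0, 0, 0, 0) == -1 && errno == ENOSYS)
		return 1;
	return 0;
}
```

A test would call this once with the number it intends to use and skip,
rather than fail, when the result is 1.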



2017-11-27 07:21:50

by Mike Rapoport

Subject: [PATCH v4 1/4] fs/splice: introduce pages_to_pipe helper

Signed-off-by: Mike Rapoport <[email protected]>
---
fs/splice.c | 57 ++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 39e2dc0..7f1ffc5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1185,6 +1185,36 @@ static long do_splice(struct file *in, loff_t __user *off_in,
return -EINVAL;
}

+static int pages_to_pipe(struct page **pages, struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf, size_t *total,
+ ssize_t copied, size_t start)
+{
+ bool failed = false;
+ size_t len = 0;
+ int ret = 0;
+ int n;
+
+ for (n = 0; copied; n++, start = 0) {
+ int size = min_t(int, copied, PAGE_SIZE - start);
+ if (!failed) {
+ buf->page = pages[n];
+ buf->offset = start;
+ buf->len = size;
+ ret = add_to_pipe(pipe, buf);
+ if (unlikely(ret < 0))
+ failed = true;
+ else
+ len += ret;
+ } else {
+ put_page(pages[n]);
+ }
+ copied -= size;
+ }
+
+ *total += len;
+ return failed ? ret : len;
+}
+
static int iter_to_pipe(struct iov_iter *from,
struct pipe_inode_info *pipe,
unsigned flags)
@@ -1195,13 +1225,11 @@ static int iter_to_pipe(struct iov_iter *from,
};
size_t total = 0;
int ret = 0;
- bool failed = false;

- while (iov_iter_count(from) && !failed) {
+ while (iov_iter_count(from)) {
struct page *pages[16];
ssize_t copied;
size_t start;
- int n;

copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
if (copied <= 0) {
@@ -1209,24 +1237,11 @@ static int iter_to_pipe(struct iov_iter *from,
break;
}

- for (n = 0; copied; n++, start = 0) {
- int size = min_t(int, copied, PAGE_SIZE - start);
- if (!failed) {
- buf.page = pages[n];
- buf.offset = start;
- buf.len = size;
- ret = add_to_pipe(pipe, &buf);
- if (unlikely(ret < 0)) {
- failed = true;
- } else {
- iov_iter_advance(from, ret);
- total += ret;
- }
- } else {
- put_page(pages[n]);
- }
- copied -= size;
- }
+ ret = pages_to_pipe(pages, pipe, &buf, &total, copied, start);
+ if (unlikely(ret < 0))
+ break;
+
+ iov_iter_advance(from, ret);
}
return total ? total : ret;
}
--
2.7.4
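
The sizing arithmetic in the pages_to_pipe() loop above (the first buffer
starts at offset start inside its page, every later one at offset 0, and each
is capped at the end of its page) can be tried out in user space. This is a
sketch under stated assumptions: a fixed 4 KiB page size, start smaller than
the page size, and the invented names struct chunk and
split_into_page_chunks(); it does not touch pages or pipes.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Hypothetical record of one pipe buffer: offset within its page, length. */
struct chunk {
	size_t offset;
	size_t len;
};

/*
 * Split `copied` bytes that begin `start` bytes into the first page into
 * per-page chunks, the way the loop in pages_to_pipe() sizes each buffer.
 * Writes at most `max` chunks to `out` and returns the number produced.
 */
static size_t split_into_page_chunks(size_t start, size_t copied,
				     struct chunk *out, size_t max)
{
	size_t n;

	for (n = 0; copied && n < max; n++, start = 0) {
		size_t room = SKETCH_PAGE_SIZE - start;
		size_t size = copied < room ? copied : room;

		out[n].offset = start;
		out[n].len = size;
		copied -= size;
	}
	return n;
}
```

For the 4096-byte segment at page offset 128 used in the selftest, this yields
two chunks: 3968 bytes at offset 128, then 128 bytes at offset 0.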

