2018-01-09 06:31:08

by Mike Rapoport

Subject: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

Hi,

This patch series introduces a new process_vmsplice system call that combines
the functionality of process_vm_readv and vmsplice.

It allows mapping the memory of another process into a pipe, similarly to
what vmsplice does for the calling process's own address space.

Patch 2/4 ("vm: add a syscall to map a process memory into a pipe")
adds the new system call and provides a detailed description of it.
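
For reference, the proposed interface (described in detail in that patch) is:

  ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
                           unsigned long nr_segs, unsigned int flags);

There is no glibc wrapper yet, so userspace invokes it through syscall(2), as
the selftest in patch 4/4 does.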

The patchset is against the -mm tree.

v5: update changelog with more elaborate usecase description
v4: skip test when process_vmsplice syscall is not available
v3: minor refactoring to reduce code duplication
v2: move this syscall under CONFIG_CROSS_MEMORY_ATTACH
give correct flags to get_user_pages_remote()


Andrei Vagin (3):
vm: add a syscall to map a process memory into a pipe
x86: wire up the process_vmsplice syscall
test: add a test for the process_vmsplice syscall

Mike Rapoport (1):
fs/splice: introduce pages_to_pipe helper

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 +
fs/splice.c | 262 +++++++++++++++++++--
include/linux/compat.h | 3 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 2 +
tools/testing/selftests/process_vmsplice/Makefile | 5 +
.../process_vmsplice/process_vmsplice_test.c | 196 +++++++++++++++
9 files changed, 458 insertions(+), 22 deletions(-)
create mode 100644 tools/testing/selftests/process_vmsplice/Makefile
create mode 100644 tools/testing/selftests/process_vmsplice/process_vmsplice_test.c

--
2.7.4


2018-01-09 06:31:14

by Mike Rapoport

Subject: [PATCH v5 1/4] fs/splice: introduce pages_to_pipe helper
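
Factor the loop that adds pages to a pipe out of iter_to_pipe() into a new
pages_to_pipe() helper, so that the process_vmsplice() implementation added
later in this series can reuse it.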

Signed-off-by: Mike Rapoport <[email protected]>
---
fs/splice.c | 57 ++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 39e2dc01ac12..7f1ffc50ff1d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1185,6 +1185,36 @@ static long do_splice(struct file *in, loff_t __user *off_in,
return -EINVAL;
}

+static int pages_to_pipe(struct page **pages, struct pipe_inode_info *pipe,
+ struct pipe_buffer *buf, size_t *total,
+ ssize_t copied, size_t start)
+{
+ bool failed = false;
+ size_t len = 0;
+ int ret = 0;
+ int n;
+
+ for (n = 0; copied; n++, start = 0) {
+ int size = min_t(int, copied, PAGE_SIZE - start);
+ if (!failed) {
+ buf->page = pages[n];
+ buf->offset = start;
+ buf->len = size;
+ ret = add_to_pipe(pipe, buf);
+ if (unlikely(ret < 0))
+ failed = true;
+ else
+ len += ret;
+ } else {
+ put_page(pages[n]);
+ }
+ copied -= size;
+ }
+
+ *total += len;
+ return failed ? ret : len;
+}
+
static int iter_to_pipe(struct iov_iter *from,
struct pipe_inode_info *pipe,
unsigned flags)
@@ -1195,13 +1225,11 @@ static int iter_to_pipe(struct iov_iter *from,
};
size_t total = 0;
int ret = 0;
- bool failed = false;

- while (iov_iter_count(from) && !failed) {
+ while (iov_iter_count(from)) {
struct page *pages[16];
ssize_t copied;
size_t start;
- int n;

copied = iov_iter_get_pages(from, pages, ~0UL, 16, &start);
if (copied <= 0) {
@@ -1209,24 +1237,11 @@ static int iter_to_pipe(struct iov_iter *from,
break;
}

- for (n = 0; copied; n++, start = 0) {
- int size = min_t(int, copied, PAGE_SIZE - start);
- if (!failed) {
- buf.page = pages[n];
- buf.offset = start;
- buf.len = size;
- ret = add_to_pipe(pipe, &buf);
- if (unlikely(ret < 0)) {
- failed = true;
- } else {
- iov_iter_advance(from, ret);
- total += ret;
- }
- } else {
- put_page(pages[n]);
- }
- copied -= size;
- }
+ ret = pages_to_pipe(pages, pipe, &buf, &total, copied, start);
+ if (unlikely(ret < 0))
+ break;
+
+ iov_iter_advance(from, ret);
}
return total ? total : ret;
}
--
2.7.4

2018-01-09 06:31:19

by Mike Rapoport

Subject: [PATCH v5 3/4] x86: wire up the process_vmsplice syscall

From: Andrei Vagin <[email protected]>

Signed-off-by: Andrei Vagin <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 2 ++
2 files changed, 3 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 448ac2161112..dc64bf577b17 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -391,3 +391,4 @@
382 i386 pkey_free sys_pkey_free
383 i386 statx sys_statx
384 i386 arch_prctl sys_arch_prctl compat_sys_arch_prctl
+385 i386 process_vmsplice sys_process_vmsplice compat_sys_process_vmsplice
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 5aef183e2f85..d2f916c0309a 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -339,6 +339,7 @@
330 common pkey_alloc sys_pkey_alloc
331 common pkey_free sys_pkey_free
332 common statx sys_statx
+333 64 process_vmsplice sys_process_vmsplice

#
# x32-specific system call numbers start at 512 to avoid cache impact
@@ -380,3 +381,4 @@
545 x32 execveat compat_sys_execveat/ptregs
546 x32 preadv2 compat_sys_preadv64v2
547 x32 pwritev2 compat_sys_pwritev64v2
+548 x32 process_vmsplice compat_sys_process_vmsplice
--
2.7.4

2018-01-09 06:31:36

by Mike Rapoport

Subject: [PATCH v5 4/4] test: add a test for the process_vmsplice syscall

From: Andrei Vagin <[email protected]>

This test checks that process_vmsplice() can splice pages from a remote
process and that it returns EFAULT if process_vmsplice() tries to splice
pages at an inaccessible address.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
tools/testing/selftests/process_vmsplice/Makefile | 5 +
.../process_vmsplice/process_vmsplice_test.c | 196 +++++++++++++++++++++
2 files changed, 201 insertions(+)
create mode 100644 tools/testing/selftests/process_vmsplice/Makefile
create mode 100644 tools/testing/selftests/process_vmsplice/process_vmsplice_test.c

diff --git a/tools/testing/selftests/process_vmsplice/Makefile b/tools/testing/selftests/process_vmsplice/Makefile
new file mode 100644
index 000000000000..246d5a7dfed6
--- /dev/null
+++ b/tools/testing/selftests/process_vmsplice/Makefile
@@ -0,0 +1,5 @@
+CFLAGS += -I../../../../usr/include/
+
+TEST_GEN_PROGS := process_vmsplice_test
+
+include ../lib.mk
diff --git a/tools/testing/selftests/process_vmsplice/process_vmsplice_test.c b/tools/testing/selftests/process_vmsplice/process_vmsplice_test.c
new file mode 100644
index 000000000000..1682bdb32de3
--- /dev/null
+++ b/tools/testing/selftests/process_vmsplice/process_vmsplice_test.c
@@ -0,0 +1,196 @@
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <fcntl.h>
+#include <sys/uio.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+
+#include "../kselftest.h"
+
+#ifndef __NR_process_vmsplice
+#define __NR_process_vmsplice 333
+#endif
+
+#define pr_err(fmt, ...) \
+ ({ \
+ fprintf(stderr, "%s:%d:" fmt, \
+ __func__, __LINE__, ##__VA_ARGS__); \
+ KSFT_FAIL; \
+ })
+#define pr_perror(fmt, ...) pr_err(fmt ": %m\n", ##__VA_ARGS__)
+#define fail(fmt, ...) pr_err("FAIL:" fmt, ##__VA_ARGS__)
+
+static ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
+ unsigned long nr_segs, unsigned int flags)
+{
+ return syscall(__NR_process_vmsplice, pid, fd, iov, nr_segs, flags);
+
+}
+
+#define MEM_SIZE (4096 * 100)
+#define MEM_WRONLY_SIZE (4096 * 10)
+
+int main(int argc, char **argv)
+{
+ char *addr, *addr_wronly;
+ int p[2];
+ struct iovec iov[2];
+ char buf[4096];
+ int status, ret;
+ pid_t pid;
+
+ ksft_print_header();
+
+ if (process_vmsplice(0, 0, 0, 0, 0)) {
+ if (errno == ENOSYS) {
+ ksft_exit_skip("process_vmsplice is not supported\n");
+ return 0;
+ }
+ return pr_perror("Zero-length process_vmsplice failed");
+ }
+
+ addr = mmap(0, MEM_SIZE, PROT_READ | PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (addr == MAP_FAILED)
+ return pr_perror("Unable to create a mapping");
+
+ addr_wronly = mmap(0, MEM_WRONLY_SIZE, PROT_WRITE,
+ MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+ if (addr_wronly == MAP_FAILED)
+ return pr_perror("Unable to create a write-only mapping");
+
+ if (pipe(p))
+ return pr_perror("Unable to create a pipe");
+
+ pid = fork();
+ if (pid < 0)
+ return pr_perror("Unable to fork");
+
+ if (pid == 0) {
+ addr[0] = 'C';
+ addr[4096 + 128] = 'A';
+ addr[4096 + 128 + 4096 - 1] = 'B';
+
+ if (prctl(PR_SET_PDEATHSIG, SIGKILL))
+ return pr_perror("Unable to set PR_SET_PDEATHSIG");
+ if (write(p[1], "c", 1) != 1)
+ return pr_perror("Unable to write data into pipe");
+
+ while (1)
+ sleep(1);
+ return 1;
+ }
+ if (read(p[0], buf, 1) != 1) {
+ pr_perror("Unable to read data from pipe");
+ kill(pid, SIGKILL);
+ wait(&status);
+ return 1;
+ }
+
+ munmap(addr, MEM_SIZE);
+ munmap(addr_wronly, MEM_WRONLY_SIZE);
+
+ iov[0].iov_base = addr;
+ iov[0].iov_len = 1;
+
+ iov[1].iov_base = addr + 4096 + 128;
+ iov[1].iov_len = 4096;
+
+ /* check one iovec */
+ if (process_vmsplice(pid, p[1], iov, 1, SPLICE_F_GIFT) != 1)
+ return pr_perror("Unable to splice pages");
+
+ if (read(p[0], buf, 1) != 1)
+ return pr_perror("Unable to read from pipe");
+
+ if (buf[0] != 'C')
+ ksft_test_result_fail("Get wrong data\n");
+ else
+ ksft_test_result_pass("Check process_vmsplice with one vec\n");
+
+ /* check two iovec-s */
+ if (process_vmsplice(pid, p[1], iov, 2, SPLICE_F_GIFT) != 4097)
+ return pr_perror("Unable to spice pages\n");
+
+ if (read(p[0], buf, 1) != 1)
+ return pr_perror("Unable to read from pipe\n");
+
+ if (buf[0] != 'C')
+ ksft_test_result_fail("Get wrong data\n");
+
+ if (read(p[0], buf, 4096) != 4096)
+ return pr_perror("Unable to read from pipe\n");
+
+ if (buf[0] != 'A' || buf[4095] != 'B')
+ ksft_test_result_fail("Get wrong data\n");
+ else
+ ksft_test_result_pass("check process_vmsplice with two vecs\n");
+
+ /* check how an unreadable region in a second vec is handled */
+ iov[0].iov_base = addr;
+ iov[0].iov_len = 1;
+
+ iov[1].iov_base = addr_wronly + 5;
+ iov[1].iov_len = 1;
+
+ if (process_vmsplice(pid, p[1], iov, 2, SPLICE_F_GIFT) != 1)
+ return pr_perror("Unable to splice data");
+
+ if (read(p[0], buf, 1) != 1)
+ return pr_perror("Unable to read form pipe");
+
+ if (buf[0] != 'C')
+ ksft_test_result_fail("Get wrong data\n");
+ else
+ ksft_test_result_pass("unreadable region in a second vec\n");
+
+ /* check how an unreadable region in a first vec is handled */
+ errno = 0;
+ if (process_vmsplice(pid, p[1], iov + 1, 1, SPLICE_F_GIFT) != -1 ||
+ errno != EFAULT)
+ ksft_test_result_fail("Got anexpected errno %d\n", errno);
+ else
+ ksft_test_result_pass("splice as much as possible\n");
+
+ iov[0].iov_base = addr;
+ iov[0].iov_len = 1;
+
+ iov[1].iov_base = addr;
+ iov[1].iov_len = MEM_SIZE;
+
+ /* splice as much as possible */
+ ret = process_vmsplice(pid, p[1], iov, 2,
+ SPLICE_F_GIFT | SPLICE_F_NONBLOCK);
+ if (ret != 4096 * 15 + 1) /* by default a pipe can fit 16 pages */
+ return pr_perror("Unable to splice pages");
+
+ while (ret > 0) {
+ int len;
+
+ len = read(p[0], buf, 4096);
+ if (len < 0)
+ return pr_perror("Unable to read data");
+ if (len > ret)
+ return pr_err("Read more than expected\n");
+ ret -= len;
+ }
+ ksft_test_result_pass("splice as much as possible\n");
+
+ if (kill(pid, SIGTERM))
+ return pr_perror("Unable to kill a child process");
+ status = -1;
+ if (wait(&status) < 0)
+ return pr_perror("Unable to wait a child process");
+ if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGTERM)
+ return pr_err("The child exited with an unexpected code %d\n",
+ status);
+
+ if (ksft_get_fail_cnt())
+ return ksft_exit_fail();
+ return ksft_exit_pass();
+}
--
2.7.4

2018-01-09 06:32:00

by Mike Rapoport

Subject: [PATCH v5 2/4] vm: add a syscall to map a process memory into a pipe

From: Andrei Vagin <[email protected]>

It is a hybrid of process_vm_readv() and vmsplice().

vmsplice() can map memory from the current address space into a pipe.
process_vm_readv() can read the memory of another process.

The new system call maps the memory of another process into a pipe.

ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
unsigned long nr_segs, unsigned int flags)

All arguments are identical to those of vmsplice, except for pid, which
specifies the target process.

Currently, if we want to dump process memory to a file or a socket, we can
use process_vm_readv() + write(), but this works slowly because the data is
copied into a temporary user-space buffer.

A second way is to use vmsplice() + splice(). It is more efficient because
the data is not copied into a temporary buffer, but there is another problem:
vmsplice works with the current address space, so it can be used only if we
inject our code into the target process.

The second way suffers from a few other issues:
* the process has to be stopped to run the parasite code
* the number of pipes is limited, so it may be impossible to dump all
memory in one iteration, and we have to stop the process and inject our
code several times.
* pages in pipes are unreclaimable, so it isn't good to hold a lot of
memory in pipes.

The introduced syscall allows using the second approach without injecting
any code into the target process.

My experiments show that process_vmsplice() + splice() works about twice as
fast as process_vm_readv() + write().
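
To illustrate the intended usage, below is a minimal sketch (not part of the
patch): it splices one region of a target process into a pipe and then moves
it to a file descriptor. The raw syscall wrapper assumes the x86_64 syscall
number from patch 3/4, error handling is trimmed, and the region is assumed
to fit into the pipe (16 pages by default); larger regions would alternate
process_vmsplice() and splice() calls.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/syscall.h>

#ifndef __NR_process_vmsplice
#define __NR_process_vmsplice 333  /* x86_64 number from patch 3/4 */
#endif

static ssize_t process_vmsplice(pid_t pid, int fd, const struct iovec *iov,
                                unsigned long nr_segs, unsigned int flags)
{
        return syscall(__NR_process_vmsplice, pid, fd, iov, nr_segs, flags);
}

/* Dump 'len' bytes at 'addr' in process 'pid' into 'out_fd' without
 * staging the data in a user-space buffer. */
static int dump_remote(pid_t pid, void *addr, size_t len, int out_fd)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int p[2];
        ssize_t n;

        if (pipe(p))
                return -1;

        /* attach the remote pages to the pipe */
        n = process_vmsplice(pid, p[1], &iov, 1, 0);
        if (n < 0)
                return -1;

        /* move the pages from the pipe to the file, still without
         * copying the data through user space */
        while (n > 0) {
                ssize_t m = splice(p[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE);

                if (m <= 0)
                        return -1;
                n -= m;
        }

        close(p[0]);
        close(p[1]);
        return 0;
}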

It is particularly useful for iterative migration. In that case we enable a
memory tracker, and then we dump the process memory while the process
continues to run. On the first iteration we dump all the memory, and then we
dump only the memory that was modified since the previous iteration. After a
few pre-dump operations, the process is stopped and dumped for the last time.
The pre-dump operations significantly decrease the process downtime when a
process is migrated to another host.

The primary disadvantage of the vmsplice() + splice() approach for the
iterative migration use case is the need to stop the processes with, e.g.,
ptrace, and to inject the parasite code. And, since the number of pipes and
their size are limited, we either have to limit the dump size to about 1.5GB
or keep the target processes stopped. The process_vmsplice system call
allows splicing memory without the parasite code and even without stopping
the target processes.

This system call may be useful for debuggers. For example, a debugger can
use it to generate a core file: it can splice the memory of a process into a
pipe and then splice it from the pipe to a file. This method works much
faster than using the PTRACE_PEEK* commands.

Debuggers may also use process_vmsplice() to observe changes in a process
page. process_vmsplice() attaches the real process page to a pipe, so we can
splice it once and observe how it changes over time.

The process_vmsplice syscall might also be of interest for users of
process_vm_readv() who read memory in order to send it somewhere else.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
---
fs/splice.c | 205 ++++++++++++++++++++++++++++++++++++++
include/linux/compat.h | 3 +
include/linux/syscalls.h | 4 +
include/uapi/asm-generic/unistd.h | 5 +-
kernel/sys_ni.c | 2 +
5 files changed, 218 insertions(+), 1 deletion(-)

diff --git a/fs/splice.c b/fs/splice.c
index 7f1ffc50ff1d..72397d2a59b9 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -34,6 +34,7 @@
#include <linux/socket.h>
#include <linux/compat.h>
#include <linux/sched/signal.h>
+#include <linux/sched/mm.h>

#include "internal.h"

@@ -1373,6 +1374,210 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
return error;
}

+#ifdef CONFIG_CROSS_MEMORY_ATTACH
+/*
+ * Map pages from a specified task into a pipe
+ */
+static int remote_single_vec_to_pipe(struct task_struct *task,
+ struct mm_struct *mm,
+ const struct iovec *rvec,
+ struct pipe_inode_info *pipe,
+ unsigned int flags,
+ size_t *total)
+{
+ struct pipe_buffer buf = {
+ .ops = &user_page_pipe_buf_ops,
+ .flags = flags
+ };
+ unsigned long addr = (unsigned long) rvec->iov_base;
+ unsigned long pa = addr & PAGE_MASK;
+ unsigned long start_offset = addr - pa;
+ unsigned long nr_pages;
+ ssize_t len = rvec->iov_len;
+ struct page *process_pages[16];
+ bool failed = false;
+ int ret = 0;
+
+ nr_pages = (addr + len - 1) / PAGE_SIZE - addr / PAGE_SIZE + 1;
+ while (nr_pages) {
+ long pages = min(nr_pages, 16UL);
+ int locked = 1;
+ ssize_t copied;
+
+ /*
+ * Get the pages we're interested in. We must
+ * access remotely because task/mm might not be
+ * current/current->mm
+ */
+ down_read(&mm->mmap_sem);
+ pages = get_user_pages_remote(task, mm, pa, pages, 0,
+ process_pages, NULL, &locked);
+ if (locked)
+ up_read(&mm->mmap_sem);
+ if (pages <= 0) {
+ failed = true;
+ ret = -EFAULT;
+ break;
+ }
+
+ copied = pages * PAGE_SIZE - start_offset;
+ if (copied > len)
+ copied = len;
+ len -= copied;
+
+ ret = pages_to_pipe(process_pages, pipe, &buf, total, copied,
+ start_offset);
+ if (unlikely(ret < 0))
+ break;
+
+ start_offset = 0;
+ nr_pages -= pages;
+ pa += pages * PAGE_SIZE;
+ }
+ return ret < 0 ? ret : 0;
+}
+
+static ssize_t remote_iovec_to_pipe(struct task_struct *task,
+ struct mm_struct *mm,
+ const struct iovec *rvec,
+ unsigned long riovcnt,
+ struct pipe_inode_info *pipe,
+ unsigned int flags)
+{
+ size_t total = 0;
+ int ret = 0, i;
+
+ for (i = 0; i < riovcnt; i++) {
+ /* Work out address and page range required */
+ if (rvec[i].iov_len == 0)
+ continue;
+
+ ret = remote_single_vec_to_pipe(
+ task, mm, &rvec[i], pipe, flags, &total);
+ if (ret < 0)
+ break;
+ }
+ return total ? total : ret;
+}
+
+static long process_vmsplice_to_pipe(struct task_struct *task,
+ struct mm_struct *mm, struct file *file,
+ const struct iovec __user *uiov,
+ unsigned long nr_segs, unsigned int flags)
+{
+ struct pipe_inode_info *pipe;
+ struct iovec iovstack[UIO_FASTIOV];
+ struct iovec *iov = iovstack;
+ unsigned int buf_flag = 0;
+ long ret;
+
+ if (flags & SPLICE_F_GIFT)
+ buf_flag = PIPE_BUF_FLAG_GIFT;
+
+ pipe = get_pipe_info(file);
+ if (!pipe)
+ return -EBADF;
+
+ ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uiov, nr_segs,
+ UIO_FASTIOV, iovstack, &iov);
+ if (ret < 0)
+ return ret;
+
+ pipe_lock(pipe);
+ ret = wait_for_space(pipe, flags);
+ if (!ret)
+ ret = remote_iovec_to_pipe(task, mm, iov,
+ nr_segs, pipe, buf_flag);
+ pipe_unlock(pipe);
+ if (ret > 0)
+ wakeup_pipe_readers(pipe);
+
+ if (iov != iovstack)
+ kfree(iov);
+ return ret;
+}
+
+/* process_vmsplice splices a process address range into a pipe. */
+SYSCALL_DEFINE5(process_vmsplice, int, pid, int, fd,
+ const struct iovec __user *, iov,
+ unsigned long, nr_segs, unsigned int, flags)
+{
+ struct task_struct *task;
+ struct mm_struct *mm;
+ struct fd f;
+ long ret;
+
+ if (unlikely(flags & ~SPLICE_F_ALL))
+ return -EINVAL;
+ if (unlikely(nr_segs > UIO_MAXIOV))
+ return -EINVAL;
+ else if (unlikely(!nr_segs))
+ return 0;
+
+ f = fdget(fd);
+ if (!f.file)
+ return -EBADF;
+
+ /* Get process information */
+ task = find_get_task_by_vpid(pid);
+ if (!task) {
+ ret = -ESRCH;
+ goto out_fput;
+ }
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_REALCREDS);
+ if (!mm || IS_ERR(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ /*
+ * Explicitly map EACCES to EPERM as EPERM is a more
+ * appropriate error code for process_vm_readv/writev
+ */
+ if (ret == -EACCES)
+ ret = -EPERM;
+ goto put_task_struct;
+ }
+
+ ret = -EBADF;
+ if (f.file->f_mode & FMODE_WRITE)
+ ret = process_vmsplice_to_pipe(task, mm, f.file,
+ iov, nr_segs, flags);
+ mmput(mm);
+
+put_task_struct:
+ put_task_struct(task);
+
+out_fput:
+ fdput(f);
+
+ return ret;
+}
+
+#ifdef CONFIG_COMPAT
+COMPAT_SYSCALL_DEFINE5(process_vmsplice, pid_t, pid, int, fd,
+ const struct compat_iovec __user *, iov32,
+ unsigned int, nr_segs, unsigned int, flags)
+{
+ struct iovec __user *iov;
+ unsigned int i;
+
+ if (nr_segs > UIO_MAXIOV)
+ return -EINVAL;
+
+ iov = compat_alloc_user_space(nr_segs * sizeof(struct iovec));
+ for (i = 0; i < nr_segs; i++) {
+ struct compat_iovec v;
+
+ if (get_user(v.iov_base, &iov32[i].iov_base) ||
+ get_user(v.iov_len, &iov32[i].iov_len) ||
+ put_user(compat_ptr(v.iov_base), &iov[i].iov_base) ||
+ put_user(v.iov_len, &iov[i].iov_len))
+ return -EFAULT;
+ }
+ return sys_process_vmsplice(pid, fd, iov, nr_segs, flags);
+}
+#endif
+#endif /* CONFIG_CROSS_MEMORY_ATTACH */
+
#ifdef CONFIG_COMPAT
COMPAT_SYSCALL_DEFINE4(vmsplice, int, fd, const struct compat_iovec __user *, iov32,
unsigned int, nr_segs, unsigned int, flags)
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 0fc36406f32c..11b375319289 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -550,6 +550,9 @@ asmlinkage long compat_sys_getdents(unsigned int fd,
unsigned int count);
asmlinkage long compat_sys_vmsplice(int fd, const struct compat_iovec __user *,
unsigned int nr_segs, unsigned int flags);
+asmlinkage long compat_sys_process_vmsplice(pid_t pid, int fd,
+ const struct compat_iovec __user *,
+ unsigned int nr_segs, unsigned int flags);
asmlinkage long compat_sys_open(const char __user *filename, int flags,
umode_t mode);
asmlinkage long compat_sys_openat(int dfd, const char __user *filename,
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a78186d826d7..4ba93336bf05 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -941,4 +941,8 @@ asmlinkage long sys_pkey_free(int pkey);
asmlinkage long sys_statx(int dfd, const char __user *path, unsigned flags,
unsigned mask, struct statx __user *buffer);

+asmlinkage long sys_process_vmsplice(pid_t pid,
+ int fd, const struct iovec __user *iov,
+ unsigned long nr_segs, unsigned int flags);
+
#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 8b87de067bc7..37f18320ae94 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -732,9 +732,12 @@ __SYSCALL(__NR_pkey_alloc, sys_pkey_alloc)
__SYSCALL(__NR_pkey_free, sys_pkey_free)
#define __NR_statx 291
__SYSCALL(__NR_statx, sys_statx)
+#define __NR_process_vmsplice 292
+__SC_COMP(__NR_process_vmsplice, sys_process_vmsplice,
+ compat_sys_process_vmsplice)

#undef __NR_syscalls
-#define __NR_syscalls 292
+#define __NR_syscalls 293

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index b5189762d275..a939fbb92d9e 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -158,8 +158,10 @@ cond_syscall(sys_sysfs);
cond_syscall(sys_syslog);
cond_syscall(sys_process_vm_readv);
cond_syscall(sys_process_vm_writev);
+cond_syscall(sys_process_vmsplice);
cond_syscall(compat_sys_process_vm_readv);
cond_syscall(compat_sys_process_vm_writev);
+cond_syscall(compat_sys_process_vmsplice);
cond_syscall(sys_uselib);
cond_syscall(sys_fadvise64);
cond_syscall(sys_fadvise64_64);
--
2.7.4

2018-01-11 17:11:38

by kernel test robot

Subject: Re: [PATCH v5 3/4] x86: wire up the process_vmsplice syscall

Hi Andrei,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.15-rc7 next-20180111]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Mike-Rapoport/vm-add-a-syscall-to-map-a-process-memory-into-a-pipe/20180111-220440
config: xtensa-allyesconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 7.2.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa

All warnings (new ones prefixed by >>):

>> <stdin>:1332:2: warning: #warning syscall process_vmsplice not implemented [-Wcpp]

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
.config.gz (51.43 kB)

2018-02-21 00:45:08

by Andrew Morton

Subject: Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:

> This patches introduces new process_vmsplice system call that combines
> functionality of process_vm_read and vmsplice.

All seems fairly straightforward. The big question is: do we know that
people will actually use this, and get sufficient value from it to
justify its addition?



2018-02-26 09:03:59

by Pavel Emelianov

Subject: Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On 02/21/2018 03:44 AM, Andrew Morton wrote:
> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
>
>> This patches introduces new process_vmsplice system call that combines
>> functionality of process_vm_read and vmsplice.
>
> All seems fairly strightforward. The big question is: do we know that
> people will actually use this, and get sufficient value from it to
> justify its addition?

Yes, that's what bothers us a lot too :) I've tried to start by finding out whether anyone
actually uses the process_vm_readv/writev() calls, but failed :( Does anybody know how
popular these syscalls are? If their users operate on large amounts of memory, they could
benefit from the proposed splice extension.

-- Pavel

2018-02-26 17:41:18

by Nathan Hjelm

Subject: Re: [OMPI devel] [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

All MPI implementations have support for using CMA to transfer data between local processes. The performance is fairly good (not as good as XPMEM) but the interface limits what we can do with remote process memory (no atomics). I have not heard about this new proposal. What is the benefit of the proposed calls over the existing calls?

-Nathan

> On Feb 26, 2018, at 2:02 AM, Pavel Emelyanov <[email protected]> wrote:
>
> On 02/21/2018 03:44 AM, Andrew Morton wrote:
>> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
>>
>>> This patches introduces new process_vmsplice system call that combines
>>> functionality of process_vm_read and vmsplice.
>>
>> All seems fairly strightforward. The big question is: do we know that
>> people will actually use this, and get sufficient value from it to
>> justify its addition?
>
> Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
> used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
> these syscalls are? If its users operate on big amount of memory, they could benefit from
> the proposed splice extension.
>
> -- Pavel
> _______________________________________________
> devel mailing list
> [email protected]
> https://lists.open-mpi.org/mailman/listinfo/devel



2018-02-27 02:19:18

by Dmitry V. Levin

Subject: Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
> On 02/21/2018 03:44 AM, Andrew Morton wrote:
> > On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
> >
> >> This patches introduces new process_vmsplice system call that combines
> >> functionality of process_vm_read and vmsplice.
> >
> > All seems fairly strightforward. The big question is: do we know that
> > people will actually use this, and get sufficient value from it to
> > justify its addition?
>
> Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
> used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
> these syscalls are?

Well, process_vm_readv itself is quite popular, it's used by debuggers nowadays,
see e.g.
$ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null


--
ldv



2018-02-27 07:45:27

by Mike Rapoport

Subject: Re: [OMPI devel] [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On Mon, Feb 26, 2018 at 09:38:19AM -0700, Nathan Hjelm wrote:
> All MPI implementations have support for using CMA to transfer data
> between local processes. The performance is fairly good (not as good as
> XPMEM) but the interface limits what we can do with to remote process
> memory (no atomics). I have not heard about this new proposal. What is
> the benefit of the proposed calls over the existing calls?

The proposed system call combines the functionality of process_vm_readv() and
vmsplice() [1]; it's particularly useful when one needs to read remote process
memory and then write it to a file descriptor. In that case a sequence of
process_vm_readv() + write() calls, which involves two copies of the data, can
be replaced with process_vmsplice() + splice(), which involves no copying at
all.

[1] https://lkml.org/lkml/2018/1/9/32

> -Nathan
>
> > On Feb 26, 2018, at 2:02 AM, Pavel Emelyanov <[email protected]> wrote:
> >
> > On 02/21/2018 03:44 AM, Andrew Morton wrote:
> >> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
> >>
> >>> This patches introduces new process_vmsplice system call that combines
> >>> functionality of process_vm_read and vmsplice.
> >>
> >> All seems fairly strightforward. The big question is: do we know that
> >> people will actually use this, and get sufficient value from it to
> >> justify its addition?
> >
> > Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
> > used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
> > these syscalls are? If its users operate on big amount of memory, they could benefit from
> > the proposed splice extension.
> >
> > -- Pavel

--
Sincerely yours,
Mike.


2018-02-28 06:12:24

by Andrei Vagin

Subject: Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On Tue, Feb 27, 2018 at 05:18:18AM +0300, Dmitry V. Levin wrote:
> On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
> > On 02/21/2018 03:44 AM, Andrew Morton wrote:
> > > On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
> > >
> > >> This patches introduces new process_vmsplice system call that combines
> > >> functionality of process_vm_read and vmsplice.
> > >
> > > All seems fairly strightforward. The big question is: do we know that
> > > people will actually use this, and get sufficient value from it to
> > > justify its addition?
> >
> > Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
> > used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
> > these syscalls are?
>
> Well, process_vm_readv itself is quite popular, it's used by debuggers nowadays,
> see e.g.
> $ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null

In this case, there is no advantage from process_vmsplice().

But it can significantly optimize the generation of a core file.
In this case, we need to read a process's memory and save the content into a
file. process_vmsplice() allows doing this more optimally than
process_vm_readv(), because it doesn't copy data into userspace.

Here is a part of an strace log showing how gdb saves memory content into a core file:

10593 open("/proc/10193/mem", O_RDONLY|O_CLOEXEC) = 17
10593 pread64(17, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1048576, 140009356111872) = 1048576
10593 close(17) = 0
10593 write(16, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 4096) = 4096
10593 write(16, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1044480) = 1044480
10593 open("/proc/10193/mem", O_RDONLY|O_CLOEXEC) = 17
10593 pread64(17, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1048576, 140009357160448) = 1048576
10593 close(17) = 0
10593 write(16, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 4096) = 4096
10593 write(16, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1044480) = 1044480
10593 open("/proc/10193/mem", O_RDONLY|O_CLOEXEC) = 17
10593 pread64(17, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1048576, 140009358209024) = 1048576
10593 close(17) = 0
10593 write(16, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 4096) = 4096
10593 write(16, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1044480) = 1044480
10593 open("/proc/10193/mem", O_RDONLY|O_CLOEXEC) = 17
10593 pread64(17, "zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz"..., 1048576, 140009359257600) = 1048576
10593 close(17)

It is strange that process_vm_readv() isn't used and that
/proc/10193/mem is opened many times.

BTW: "strace -fo strace-gdb.log gdb -p PID" doesn't work properly.

Thanks,
Andrei

>
>
> --
> ldv



2018-02-28 07:14:51

by Pavel Emelianov

Subject: Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On 02/27/2018 05:18 AM, Dmitry V. Levin wrote:
> On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
>> On 02/21/2018 03:44 AM, Andrew Morton wrote:
>>> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
>>>
>>>> This patches introduces new process_vmsplice system call that combines
>>>> functionality of process_vm_read and vmsplice.
>>>
>>> All seems fairly strightforward. The big question is: do we know that
>>> people will actually use this, and get sufficient value from it to
>>> justify its addition?
>>
>> Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
>> used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
>> these syscalls are?
>
> Well, process_vm_readv itself is quite popular, it's used by debuggers nowadays,
> see e.g.
> $ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null

I see. Well, yes, this use case will not benefit much from remote splice. How about more
interactive debugging by, say, gdb? It may attach, then splice all the memory, then analyze
the victim's code/data without copying it into its own address space?

-- Pavel

2018-02-28 17:51:52

by Andrei Vagin

Subject: Re: [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

On Wed, Feb 28, 2018 at 10:12:55AM +0300, Pavel Emelyanov wrote:
> On 02/27/2018 05:18 AM, Dmitry V. Levin wrote:
> > On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
> >> On 02/21/2018 03:44 AM, Andrew Morton wrote:
> >>> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
> >>>
> >>>> This patches introduces new process_vmsplice system call that combines
> >>>> functionality of process_vm_read and vmsplice.
> >>>
> >>> All seems fairly strightforward. The big question is: do we know that
> >>> people will actually use this, and get sufficient value from it to
> >>> justify its addition?
> >>
> >> Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
> >> used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
> >> these syscalls are?
> >
> > Well, process_vm_readv itself is quite popular, it's used by debuggers nowadays,
> > see e.g.
> > $ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null
>
> I see. Well, yes, this use-case will not benefit much from remote splice. How about more
> interactive debug by, say, gdb? It may attach, then splice all the memory, then analyze
> the victim code/data w/o copying it to its address space?

Hmm, in this case you will probably want to be able to map pipe pages
into memory.

>
> -- Pavel

2018-02-28 23:22:45

by Atchley, Scott

Subject: Re: [OMPI devel] [PATCH v5 0/4] vm: add a syscall to map a process memory into a pipe

> On Feb 28, 2018, at 2:12 AM, Pavel Emelyanov <[email protected]> wrote:
>
> On 02/27/2018 05:18 AM, Dmitry V. Levin wrote:
>> On Mon, Feb 26, 2018 at 12:02:25PM +0300, Pavel Emelyanov wrote:
>>> On 02/21/2018 03:44 AM, Andrew Morton wrote:
>>>> On Tue, 9 Jan 2018 08:30:49 +0200 Mike Rapoport <[email protected]> wrote:
>>>>
>>>>> This patches introduces new process_vmsplice system call that combines
>>>>> functionality of process_vm_read and vmsplice.
>>>>
>>>> All seems fairly strightforward. The big question is: do we know that
>>>> people will actually use this, and get sufficient value from it to
>>>> justify its addition?
>>>
>>> Yes, that's what bothers us a lot too :) I've tried to start with finding out if anyone
>>> used the sys_read/write_process_vm() calls, but failed :( Does anybody know how popular
>>> these syscalls are?
>>
>> Well, process_vm_readv itself is quite popular, it's used by debuggers nowadays,
>> see e.g.
>> $ strace -qq -esignal=none -eprocess_vm_readv strace -qq -o/dev/null cat /dev/null
>
> I see. Well, yes, this use-case will not benefit much from remote splice. How about more
> interactive debug by, say, gdb? It may attach, then splice all the memory, then analyze
> the victim code/data w/o copying it to its address space?
>
> -- Pavel

I may be completely off base, but could a FUSE daemon use this to read memory from the client and dump it to a file descriptor without copying the data into the kernel?