v10: move refactor code into patch ahead of change
add compat support
documentation change
v9: tglx feedback
whitespace
documentation of ptrace struct
shorten struct name
helper function for set_syscall_user_dispatch
use task variant of set/clear_syscall_work
use task variant of test_syscall_work in getter
selftest
v8: fix include issue Reported-by: kernel test robot <[email protected]>
summary:
+++ b/kernel/entry/syscall_user_dispatch.c
+ #include <linux/ptrace.h>
v7: drop ptrace suspend flag, not required
hanging unreferenced variable
whitespace
v6: drop fs/proc/array update, it's not needed
drop on_dispatch field exposure in config structure, it's not
checkpoint relevant.
(Thank you for the reviews Oleg and Andrei)
v5: automated test for !defined(GENERIC_ENTRY) failed, fix fs/proc
use ifdef for GENERIC_ENTRY || TIF_SYSCALL_USER_DISPATCH
note: syscall user dispatch is not presently supported for
non-generic entry, but could be implemented. question is
whether the TIF_ define should be carved out now or then
v4: Whitespace
s/CHECKPOINT_RESTART/CHECKPOINT_RESUME
check test_syscall_work(SYSCALL_USER_DISPATCH) to determine if it's
turned on or not in fs/proc/array and getter interface
v3: Kernel test robot static function fix
Whitespace nitpicks
v2: Implements the getter/setter interface in ptrace rather than prctl
Syscall user dispatch makes it possible to cleanly intercept system
calls from user-land. However, most transparent checkpoint software
presently leverages some combination of ptrace and system call
injection to place software in a ready-to-checkpoint state.
If Syscall User Dispatch is enabled at the time of being quiesced,
injected system calls will subsequently be interposed upon and
dispatched to the task's signal handler.
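To make the interposition problem concrete, here is a simplified user-space model of the dispatch decision. It is only a sketch: the names (sud_model, sud_decide, the SUD_* constants) are hypothetical and chosen for illustration, the selector values are assumed to mirror SYSCALL_DISPATCH_FILTER_ALLOW/BLOCK, and the real logic lives in syscall_user_dispatch() in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed to mirror SYSCALL_DISPATCH_FILTER_ALLOW / _BLOCK. */
#define SUD_FILTER_ALLOW 0
#define SUD_FILTER_BLOCK 1

/* Hypothetical model of the per-task state; not the kernel struct. */
struct sud_model {
	unsigned long offset;   /* start of the exclusion area */
	unsigned long len;      /* length of the exclusion area */
	const char *selector;   /* optional user-space selector byte */
};

enum sud_outcome {
	SUD_RUN_SYSCALL,        /* syscall executes normally */
	SUD_DISPATCH_SIGSYS,    /* syscall is handed to the SIGSYS handler */
	SUD_KILL,               /* invalid selector value */
};

/* Decide what happens to a syscall issued from instruction pointer ip. */
static enum sud_outcome sud_decide(const struct sud_model *sd, unsigned long ip)
{
	/* Unsigned wrap makes this a single range check on [offset, offset+len). */
	if (ip - sd->offset < sd->len)
		return SUD_RUN_SYSCALL;

	if (sd->selector) {
		char state = *sd->selector;

		if (state == SUD_FILTER_ALLOW)
			return SUD_RUN_SYSCALL;
		if (state != SUD_FILTER_BLOCK)
			return SUD_KILL;
	}
	return SUD_DISPATCH_SIGSYS;
}
```

In this model, a syscall injected by a checkpoint tool at an address outside the exclusion area ends up in SUD_DISPATCH_SIGSYS whenever the selector is absent or set to block, which is the behavior the paragraph above describes.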
Patch summary:
- Refactor configuration setting interface to operate on a task
rather than current, so the set and error paths can be consolidated
- Implement a getter interface for Syscall User Dispatch config info.
To resume successfully, the checkpoint/resume software has to
save and restore this information. Presently this configuration
is write-only, with no way for C/R software to save it.
This was done in ptrace because syscall user dispatch is not part of
uapi. The syscall_user_dispatch_config structure was added to the
ptrace exports.
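For illustration, a tracer-side sketch of the save/restore calls might look roughly like the following. The sketch_* names and the local struct are hypothetical; the struct only mirrors the layout in this posting and the request numbers are the ones proposed here, so neither should be read as a settled ABI. As in the selftest, the structure size is passed in addr and the buffer in data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Request numbers as proposed in this posting, not from installed headers. */
#define SKETCH_PTRACE_SET_SUD_CONFIG 0x4210
#define SKETCH_PTRACE_GET_SUD_CONFIG 0x4211

/* Hypothetical local copy of the struct layout in this posting. */
struct sketch_sud_config {
	unsigned long mode;
	signed char *selector;
	unsigned long offset;
	unsigned long len;
};

/*
 * Read the SUD configuration of an attached, stopped tracee.
 * Returns 0 on success, -1 with errno set on failure.
 */
static long sketch_sud_save(pid_t tracee, struct sketch_sud_config *cfg)
{
	return syscall(SYS_ptrace, SKETCH_PTRACE_GET_SUD_CONFIG, tracee,
		       (void *)sizeof(*cfg), cfg);
}

/* Write a previously saved configuration back to the tracee. */
static long sketch_sud_restore(pid_t tracee,
			       const struct sketch_sud_config *cfg)
{
	return syscall(SYS_ptrace, SKETCH_PTRACE_SET_SUD_CONFIG, tracee,
		       (void *)sizeof(*cfg), cfg);
}
```

A checkpoint tool would presumably call sketch_sud_save() once the task is quiesced, write a config with mode PR_SYS_DISPATCH_OFF and all other fields zero before injecting its own syscalls, and call sketch_sud_restore() with the saved copy before letting the task run again.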
Gregory Price (2):
syscall_user_dispatch: helper function to operate on given task
ptrace,syscall_user_dispatch: checkpoint/restore support for SUD
.../admin-guide/syscall-user-dispatch.rst | 4 +
include/linux/compat.h | 7 ++
include/linux/syscall_user_dispatch.h | 18 +++
include/uapi/linux/ptrace.h | 29 +++++
kernel/entry/syscall_user_dispatch.c | 106 ++++++++++++++++--
kernel/ptrace.c | 9 ++
tools/testing/selftests/ptrace/.gitignore | 1 +
tools/testing/selftests/ptrace/Makefile | 2 +-
tools/testing/selftests/ptrace/get_set_sud.c | 77 +++++++++++++
9 files changed, 244 insertions(+), 9 deletions(-)
create mode 100644 tools/testing/selftests/ptrace/get_set_sud.c
--
2.39.1
Preparatory patch ahead of the set/get interfaces, which will allow a
ptracer to get/set the syscall user dispatch configuration of a task.
This simplifies the set interface and consolidates the error paths.
Signed-off-by: Gregory Price <[email protected]>
---
kernel/entry/syscall_user_dispatch.c | 23 +++++++++++++++--------
1 file changed, 15 insertions(+), 8 deletions(-)
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index 0b6379adff6b..22396b234854 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -68,8 +68,9 @@ bool syscall_user_dispatch(struct pt_regs *regs)
return true;
}
-int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
- unsigned long len, char __user *selector)
+static int task_set_syscall_user_dispatch(struct task_struct *task, unsigned long mode,
+ unsigned long offset, unsigned long len,
+ char __user *selector)
{
switch (mode) {
case PR_SYS_DISPATCH_OFF:
@@ -94,15 +95,21 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
return -EINVAL;
}
- current->syscall_dispatch.selector = selector;
- current->syscall_dispatch.offset = offset;
- current->syscall_dispatch.len = len;
- current->syscall_dispatch.on_dispatch = false;
+ task->syscall_dispatch.selector = selector;
+ task->syscall_dispatch.offset = offset;
+ task->syscall_dispatch.len = len;
+ task->syscall_dispatch.on_dispatch = false;
if (mode == PR_SYS_DISPATCH_ON)
- set_syscall_work(SYSCALL_USER_DISPATCH);
+ set_task_syscall_work(task, SYSCALL_USER_DISPATCH);
else
- clear_syscall_work(SYSCALL_USER_DISPATCH);
+ clear_task_syscall_work(task, SYSCALL_USER_DISPATCH);
return 0;
}
+
+int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
+ unsigned long len, char __user *selector)
+{
+ return task_set_syscall_user_dispatch(current, mode, offset, len, selector);
+}
--
2.39.1
Implement ptrace getter/setter interface for syscall user dispatch.
These prctl settings are presently write-only, making it impossible to
implement transparent checkpoint/restore via software like CRIU.
The 'on_dispatch' field is not exposed because it is a kernel-internal
field that cannot be 'true' when returning to userland.
Signed-off-by: Gregory Price <[email protected]>
---
.../admin-guide/syscall-user-dispatch.rst | 4 +
include/linux/compat.h | 7 ++
include/linux/syscall_user_dispatch.h | 18 ++++
include/uapi/linux/ptrace.h | 29 +++++++
kernel/entry/syscall_user_dispatch.c | 83 +++++++++++++++++++
kernel/ptrace.c | 9 ++
tools/testing/selftests/ptrace/.gitignore | 1 +
tools/testing/selftests/ptrace/Makefile | 2 +-
tools/testing/selftests/ptrace/get_set_sud.c | 77 +++++++++++++++++
9 files changed, 229 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/ptrace/get_set_sud.c
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
index 60314953c728..f7648c08297e 100644
--- a/Documentation/admin-guide/syscall-user-dispatch.rst
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -73,6 +73,10 @@ thread-wide, without the need to invoke the kernel directly. selector
can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK.
Any other value should terminate the program with a SIGSYS.
+Additionally, a task's syscall user dispatch configuration can be peeked
+and poked via the PTRACE_(GET|SET)_SYSCALL_USER_DISPATCH_CONFIG ptrace
+requests. This is useful for checkpoint/restart software.
+
Security Notes
--------------
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 44b1736c95b5..56f06a0184e0 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -252,6 +252,13 @@ typedef struct compat_siginfo {
} _sifields;
} compat_siginfo_t;
+struct compat_ptrace_sud_config {
+ compat_ulong_t mode;
+ compat_uptr_t selector;
+ compat_ulong_t offset;
+ compat_ulong_t len;
+};
+
struct compat_rlimit {
compat_ulong_t rlim_cur;
compat_ulong_t rlim_max;
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
index a0ae443fb7df..641ca8880995 100644
--- a/include/linux/syscall_user_dispatch.h
+++ b/include/linux/syscall_user_dispatch.h
@@ -22,6 +22,12 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
#define clear_syscall_work_syscall_user_dispatch(tsk) \
clear_task_syscall_work(tsk, SYSCALL_USER_DISPATCH)
+int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+ void __user *data);
+
+int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+ void __user *data);
+
#else
struct syscall_user_dispatch {};
@@ -35,6 +41,18 @@ static inline void clear_syscall_work_syscall_user_dispatch(struct task_struct *
{
}
+static inline int syscall_user_dispatch_get_config(struct task_struct *task,
+ unsigned long size, void __user *data)
+{
+ return -EINVAL;
+}
+
+static inline int syscall_user_dispatch_set_config(struct task_struct *task,
+ unsigned long size, void __user *data)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_GENERIC_ENTRY */
#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 195ae64a8c87..49cbb165a219 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -112,6 +112,35 @@ struct ptrace_rseq_configuration {
__u32 pad;
};
+#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
+#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4211
+
+/*
+ * struct ptrace_sud_config - Per-task configuration for SUD
+ * @mode: One of PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF
+ * @selector: Tracee's user virtual address of SUD selector
+ * @offset: SUD exclusion area (virtual address)
+ * @len: Length of SUD exclusion area
+ *
+ * Used to get/set the syscall user dispatch configuration for the tracee
+ * process. Selector is optional (may be NULL), and if invalid will produce
+ * a SIGSEGV in the tracee upon first access.
+ *
+ * If mode is PR_SYS_DISPATCH_ON, syscall dispatch will be enabled. If
+ * PR_SYS_DISPATCH_OFF, syscall dispatch will be disabled and all other
+ * parameters must be 0. The value in *selector (if not NULL) also determines
+ * whether syscall dispatch will occur.
+ *
+ * The SUD Exclusion area described by offset/len is the virtual address space
+ * from which syscalls will not produce a user dispatch.
+ */
+struct ptrace_sud_config {
+ unsigned long mode;
+ __s8 *selector;
+ unsigned long offset;
+ unsigned long len;
+};
+
/*
* These values are stored in task->ptrace_message
* by ptrace_stop to describe the current syscall-stop.
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index 22396b234854..a47795b8d6f0 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -4,10 +4,12 @@
*/
#include <linux/sched.h>
#include <linux/prctl.h>
+#include <linux/ptrace.h>
#include <linux/syscall_user_dispatch.h>
#include <linux/uaccess.h>
#include <linux/signal.h>
#include <linux/elf.h>
+#include <linux/compat.h>
#include <linux/sched/signal.h>
#include <linux/sched/task_stack.h>
@@ -113,3 +115,84 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
{
return task_set_syscall_user_dispatch(current, mode, offset, len, selector);
}
+
+int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+ void __user *data)
+{
+ struct syscall_user_dispatch *sd = &task->syscall_dispatch;
+#ifdef CONFIG_COMPAT
+ if (unlikely(in_compat_syscall())) {
+ struct compat_ptrace_sud_config cfg32;
+
+ if (size != sizeof(struct compat_ptrace_sud_config))
+ return -EINVAL;
+
+ if (test_task_syscall_work(task, SYSCALL_USER_DISPATCH))
+ cfg32.mode = PR_SYS_DISPATCH_ON;
+ else
+ cfg32.mode = PR_SYS_DISPATCH_OFF;
+
+ cfg32.selector = ptr_to_compat(sd->selector);
+ cfg32.offset = (__u32)sd->offset;
+ cfg32.len = (__u32)sd->len;
+
+ if (copy_to_user(data, &cfg32, sizeof(cfg32))) {
+ return -EFAULT;
+ }
+ } else
+#endif
+ {
+ struct ptrace_sud_config config;
+ if (size != sizeof(struct ptrace_sud_config))
+ return -EINVAL;
+
+ if (test_task_syscall_work(task, SYSCALL_USER_DISPATCH))
+ config.mode = PR_SYS_DISPATCH_ON;
+ else
+ config.mode = PR_SYS_DISPATCH_OFF;
+
+ config.offset = sd->offset;
+ config.len = sd->len;
+ config.selector = sd->selector;
+
+ if (copy_to_user(data, &config, sizeof(config))) {
+ return -EFAULT;
+ }
+ }
+ return 0;
+}
+
+int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+ void __user *data)
+{
+ int rc;
+#ifdef CONFIG_COMPAT
+ if (unlikely(in_compat_syscall())) {
+ struct compat_ptrace_sud_config cfg32;
+ char __user *selector;
+
+ if (size != sizeof(struct compat_ptrace_sud_config))
+ return -EINVAL;
+
+ if (copy_from_user(&cfg32, data, sizeof(cfg32))) {
+ return -EFAULT;
+ }
+
+ selector = compat_ptr(cfg32.selector);
+ rc = task_set_syscall_user_dispatch(task, cfg32.mode, cfg32.offset,
+ cfg32.len, selector);
+ } else
+#endif
+ {
+ struct ptrace_sud_config cfg;
+ if (size != sizeof(struct ptrace_sud_config))
+ return -EINVAL;
+
+ if (copy_from_user(&cfg, data, sizeof(cfg))) {
+ return -EFAULT;
+ }
+ rc = task_set_syscall_user_dispatch(task, cfg.mode, cfg.offset,
+ cfg.len, cfg.selector);
+ }
+ return rc;
+}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 54482193e1ed..d99376532b56 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -32,6 +32,7 @@
#include <linux/compat.h>
#include <linux/sched/signal.h>
#include <linux/minmax.h>
+#include <linux/syscall_user_dispatch.h>
#include <asm/syscall.h> /* for syscall_get_* */
@@ -1259,6 +1260,14 @@ int ptrace_request(struct task_struct *child, long request,
break;
#endif
+ case PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG:
+ ret = syscall_user_dispatch_set_config(child, addr, datavp);
+ break;
+
+ case PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG:
+ ret = syscall_user_dispatch_get_config(child, addr, datavp);
+ break;
+
default:
break;
}
diff --git a/tools/testing/selftests/ptrace/.gitignore b/tools/testing/selftests/ptrace/.gitignore
index 792318aaa30c..b7dde152e75a 100644
--- a/tools/testing/selftests/ptrace/.gitignore
+++ b/tools/testing/selftests/ptrace/.gitignore
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0-only
get_syscall_info
+get_set_sud
peeksiginfo
vmaccess
diff --git a/tools/testing/selftests/ptrace/Makefile b/tools/testing/selftests/ptrace/Makefile
index 2f1f532c39db..33a36b73bcb9 100644
--- a/tools/testing/selftests/ptrace/Makefile
+++ b/tools/testing/selftests/ptrace/Makefile
@@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
CFLAGS += -std=c99 -pthread -iquote../../../../include/uapi -Wall
-TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess
+TEST_GEN_PROGS := get_syscall_info peeksiginfo vmaccess get_set_sud
include ../lib.mk
diff --git a/tools/testing/selftests/ptrace/get_set_sud.c b/tools/testing/selftests/ptrace/get_set_sud.c
new file mode 100644
index 000000000000..0dca681e6f1e
--- /dev/null
+++ b/tools/testing/selftests/ptrace/get_set_sud.c
@@ -0,0 +1,77 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include "../kselftest_harness.h"
+#include <stdio.h>
+#include <string.h>
+#include <errno.h>
+#include <sys/wait.h>
+#include <sys/syscall.h>
+#include <sys/prctl.h>
+
+#include "linux/ptrace.h"
+
+static int sys_ptrace(int request, pid_t pid, void *addr, void *data)
+{
+ return syscall(SYS_ptrace, request, pid, addr, data);
+}
+
+TEST(get_set_sud)
+{
+ struct ptrace_sud_config config;
+ pid_t child;
+ int ret = 0;
+ int status;
+
+ child = fork();
+ ASSERT_GE(child, 0);
+ if (child == 0) {
+ ASSERT_EQ(0, sys_ptrace(PTRACE_TRACEME, 0, 0, 0)) {
+ TH_LOG("PTRACE_TRACEME: %m");
+ }
+ kill(getpid(), SIGSTOP);
+ _exit(1);
+ }
+
+ waitpid(child, &status, 0);
+
+ memset(&config, 0xff, sizeof(config));
+ config.mode = PR_SYS_DISPATCH_ON;
+
+ ret = sys_ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG, child,
+ (void*)sizeof(config), &config);
+ if (ret < 0) {
+ ASSERT_EQ(errno, EIO);
+ goto leave;
+ }
+
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(config.mode, PR_SYS_DISPATCH_OFF);
+ ASSERT_EQ(config.selector, (void*)NULL);
+ ASSERT_EQ(config.offset, 0);
+ ASSERT_EQ(config.len, 0);
+
+ config.mode = PR_SYS_DISPATCH_ON;
+ config.selector = (void*)NULL;
+ config.offset = 0x400000;
+ config.len = 0x1000;
+
+ ret = sys_ptrace(PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG, child,
+ (void*)sizeof(config), &config);
+
+ ASSERT_EQ(ret, 0);
+
+ memset(&config, 1, sizeof(config));
+ ret = sys_ptrace(PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG, child,
+ (void*)sizeof(config), &config);
+
+ ASSERT_EQ(ret, 0);
+ ASSERT_EQ(config.mode, PR_SYS_DISPATCH_ON);
+ ASSERT_EQ(config.selector, (void*)NULL);
+ ASSERT_EQ(config.offset, 0x400000);
+ ASSERT_EQ(config.len, 0x1000);
+
+leave:
+ kill(child, SIGKILL);
+}
+
+TEST_HARNESS_MAIN
--
2.39.1
On 02/14, Gregory Price wrote:
>
> +struct compat_ptrace_sud_config {
> + compat_ulong_t mode;
> + compat_uptr_t selector;
> + compat_ulong_t offset;
> + compat_ulong_t len;
> +};
...
> +int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
> + void __user *data)
> +{
> + struct syscall_user_dispatch *sd = &task->syscall_dispatch;
> +#ifdef CONFIG_COMPAT
> + if (unlikely(in_compat_syscall())) {
> + struct compat_ptrace_sud_config cfg32;
> +
> + if (size != sizeof(struct compat_ptrace_sud_config))
> + return -EINVAL;
> +
Horror ;) why?
See my reply to v9, just make
struct ptrace_sud_config {
__u8 mode;
__u64 selector;
__u64 offset;
__u64 len;
};
Oleg.
On Thu, Feb 16, 2023 at 02:57:38PM +0100, Oleg Nesterov wrote:
> On 02/14, Gregory Price wrote:
> >
> > +struct compat_ptrace_sud_config {
> > + compat_ulong_t mode;
> > + compat_uptr_t selector;
> > + compat_ulong_t offset;
> > + compat_ulong_t len;
> > +};
>
> ...
>
> > +int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
> > + void __user *data)
> > +{
> > + struct syscall_user_dispatch *sd = &task->syscall_dispatch;
> > +#ifdef CONFIG_COMPAT
> > + if (unlikely(in_compat_syscall())) {
> > + struct compat_ptrace_sud_config cfg32;
> > +
> > + if (size != sizeof(struct compat_ptrace_sud_config))
> > + return -EINVAL;
> > +
>
> Horror ;) why?
>
> See my reply to v9, just make
>
> struct ptrace_sud_config {
> __u8 mode;
> __u64 selector;
> __u64 offset;
> __u64 len;
> };
>
> Oleg.
>
It was unclear to me what the prior note was asking, and I followed the
pattern of other compat code in ptrace. For some reason I got it in my
head that u64 would compile down to u32 in compatibility mode and I went
full-stupid.
Will back out this compat code here and fix up the struct.
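As a rough sketch of what a fixed-width layout buys, the structure below has the same size and field offsets for 32-bit and 64-bit tracers, so no compat path is needed. The name sud_config_fixed and the helpers are hypothetical. It also uses a 64-bit mode field rather than the suggested __u8, on the assumption that a leading byte could be padded differently on ABIs that align 64-bit integers to 4 bytes inside structs; that choice is mine and is not something settled in this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed-width layout; every field is 64 bits wide. */
struct sud_config_fixed {
	uint64_t mode;
	uint64_t selector;  /* tracee address carried as an integer */
	uint64_t offset;
	uint64_t len;
};

/* Four 8-byte fields leave no room for ABI-dependent padding. */
_Static_assert(sizeof(struct sud_config_fixed) == 32,
	       "layout must not depend on the tracer's word size");

/* Convert between a native pointer and the 64-bit selector field. */
static inline uint64_t sud_selector_to_u64(const void *p)
{
	return (uint64_t)(uintptr_t)p;
}

static inline void *sud_selector_to_ptr(uint64_t v)
{
	return (void *)(uintptr_t)v;
}
```

With a layout like this, the size check in the getter and setter can compare against a single sizeof, and a 32-bit tracer simply zero-extends its pointers into the selector field.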