v6: drop fs/proc/array update, it's not needed
    drop on_dispatch field exposure in config structure, it's not
    checkpoint relevant.
    (Thank you for the reviews Oleg and Andrei)

v5: automated test for !defined(GENERIC_ENTRY) failed, fix fs/proc
    use ifdef for GENERIC_ENTRY || TIF_SYSCALL_USER_DISPATCH
    note: syscall user dispatch is not presently supported for
          non-generic entry, but could be implemented. question is
          whether the TIF_ define should be carved out now or then

v4: Whitespace
    s/CHECKPOINT_RESTART/CHECKPOINT_RESUME
    check test_syscall_work(SYSCALL_USER_DISPATCH) to determine if it's
    turned on or not in fs/proc/array and getter interface

v3: Kernel test robot static function fix
    Whitespace nitpicks

v2: Implements the getter/setter interface in ptrace rather than prctl

Syscall user dispatch makes it possible to cleanly intercept system
calls from user-land. However, most transparent checkpoint software
presently leverages some combination of ptrace and system call
injection to place software in a ready-to-checkpoint state.
If Syscall User Dispatch is enabled at the time of being quiesced,
injected system calls will subsequently be interposed upon and
dispatched to the task's signal handler.
This patch set implements 2 features to enable software such as CRIU
to cleanly interpose upon software leveraging syscall user dispatch.
- Implement PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH, akin to a similar
feature for SECCOMP. This allows a ptracer to temporarily disable
syscall user dispatch, making syscall injection possible.
- Implement a getter interface for Syscall User Dispatch config info.
To resume successfully, the checkpoint/resume software has to
save and restore this information. Presently this configuration
is write-only, with no way for C/R software to save it.
This was done in ptrace because syscall user dispatch is not part of
uapi. The syscall_user_dispatch_config structure was added to the
ptrace exports.
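
As a rough illustration of how a checkpoint tool might use the getter, here is a minimal userspace sketch. It assumes the request number and structure layout proposed in patch 2/2 of this series; every `sketch_` / `SKETCH_` name is hypothetical and not part of any header, and the tracee is assumed to be already attached and stopped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <unistd.h>

/* Request number as proposed in this series, renamed locally so it
 * cannot clash with whatever the installed headers define. */
#define SKETCH_GET_SUD_CONFIG 0x4211

/* Local mirror of the proposed struct syscall_user_dispatch_config. */
struct sketch_sud_config {
	uint64_t mode;
	signed char *selector;
	uint64_t offset;
	uint64_t len;
};

/* Fetch the SUD configuration of a stopped tracee.
 * Returns 0 on success, -1 with errno set by ptrace() on failure. */
int sketch_sud_get(pid_t pid, struct sketch_sud_config *cfg)
{
	assert(cfg != NULL);
	memset(cfg, 0, sizeof(*cfg));

	/* As in the patch: addr carries the structure size, data the buffer. */
	if (ptrace(SKETCH_GET_SUD_CONFIG, pid,
		   (void *)(uintptr_t)sizeof(*cfg), cfg) == -1)
		return -1;
	return 0;
}
```

A tool would call sketch_sud_get() at dump time and hand the saved structure back through the setter request at restore time. If the target is not a stopped tracee of the caller, or the kernel does not know the request, the call fails with -1 and the buffer stays zeroed.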
Gregory Price (2):
  ptrace,syscall_user_dispatch: Implement Syscall User Dispatch
    Suspension
  ptrace,syscall_user_dispatch: add a getter/setter for sud
    configuration

 .../admin-guide/syscall-user-dispatch.rst     |  5 ++-
 include/linux/ptrace.h                        |  2 +
 include/linux/syscall_user_dispatch.h         | 19 ++++++++
 include/uapi/linux/ptrace.h                   | 15 ++++++-
 kernel/entry/syscall_user_dispatch.c          | 45 +++++++++++++++++++
 kernel/ptrace.c                               | 13 ++++++
 6 files changed, 97 insertions(+), 2 deletions(-)
--
2.39.0
Add PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
modify Syscall User Dispatch to suspend interception when enabled.
This is modeled after the SUSPEND_SECCOMP feature, which suspends
SECCOMP interposition. Without doing this, software like CRIU will
inject system calls into a process and be intercepted by Syscall
User Dispatch, either causing a crash (due to blocked signals) or
the delivery of those signals to a ptracer (not the intended behavior).
Since Syscall User Dispatch is not a privileged feature, a check
for permissions is not required. However, attempting to set this
option when CONFIG_CHECKPOINT_RESTORE is not supported should be
disallowed, as its intended use is checkpoint/resume.
Signed-off-by: Gregory Price <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
---
 include/linux/ptrace.h               | 2 ++
 include/uapi/linux/ptrace.h          | 6 +++++-
 kernel/entry/syscall_user_dispatch.c | 5 +++++
 kernel/ptrace.c                      | 4 ++++
 4 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index eaaef3ffec22..461ae5c99d57 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -45,6 +45,8 @@ extern int ptrace_access_vm(struct task_struct *tsk, unsigned long addr,
#define PT_EXITKILL (PTRACE_O_EXITKILL << PT_OPT_FLAG_SHIFT)
#define PT_SUSPEND_SECCOMP (PTRACE_O_SUSPEND_SECCOMP << PT_OPT_FLAG_SHIFT)
+#define PT_SUSPEND_SYSCALL_USER_DISPATCH \
+ (PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH << PT_OPT_FLAG_SHIFT)
extern long arch_ptrace(struct task_struct *child, long request,
unsigned long addr, unsigned long data);
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index 195ae64a8c87..ba9e3f19a22c 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -146,9 +146,13 @@ struct ptrace_rseq_configuration {
/* eventless options */
#define PTRACE_O_EXITKILL (1 << 20)
#define PTRACE_O_SUSPEND_SECCOMP (1 << 21)
+#define PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH (1 << 22)
#define PTRACE_O_MASK (\
- 0x000000ff | PTRACE_O_EXITKILL | PTRACE_O_SUSPEND_SECCOMP)
+ 0x000000ff | \
+ PTRACE_O_EXITKILL | \
+ PTRACE_O_SUSPEND_SECCOMP | \
+ PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH)
#include <asm/ptrace.h>
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index 0b6379adff6b..b5ec75164805 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -8,6 +8,7 @@
#include <linux/uaccess.h>
#include <linux/signal.h>
#include <linux/elf.h>
+#include <linux/ptrace.h>
#include <linux/sched/signal.h>
#include <linux/sched/task_stack.h>
@@ -36,6 +37,10 @@ bool syscall_user_dispatch(struct pt_regs *regs)
struct syscall_user_dispatch *sd = &current->syscall_dispatch;
char state;
+ if (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) &&
+ unlikely(current->ptrace & PT_SUSPEND_SYSCALL_USER_DISPATCH))
+ return false;
+
if (likely(instruction_pointer(regs) - sd->offset < sd->len))
return false;
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 54482193e1ed..a348b68d07a2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -370,6 +370,10 @@ static int check_ptrace_options(unsigned long data)
if (data & ~(unsigned long)PTRACE_O_MASK)
return -EINVAL;
+ if (unlikely(data & PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH) &&
+ (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE)))
+ return -EINVAL;
+
if (unlikely(data & PTRACE_O_SUSPEND_SECCOMP)) {
if (!IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) ||
!IS_ENABLED(CONFIG_SECCOMP))
--
2.39.0
Implement ptrace getter/setter interface for syscall user dispatch.
These prctl settings are presently write-only, making it impossible to
implement transparent checkpoint via software like CRIU.
This is modeled after a similar interface for SECCOMP, which can have
its configuration dumped by ptrace.
Signed-off-by: Gregory Price <[email protected]>
---
 .../admin-guide/syscall-user-dispatch.rst |  5 ++-
 include/linux/syscall_user_dispatch.h     | 19 +++++++++
 include/uapi/linux/ptrace.h               |  9 +++++
 kernel/entry/syscall_user_dispatch.c      | 40 +++++++++++++++++++
 kernel/ptrace.c                           |  9 +++++
 5 files changed, 81 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst
index 60314953c728..a23ae21a1d5b 100644
--- a/Documentation/admin-guide/syscall-user-dispatch.rst
+++ b/Documentation/admin-guide/syscall-user-dispatch.rst
@@ -43,7 +43,10 @@ doesn't rely on any of the syscall ABI to make the filtering. It uses
only the syscall dispatcher address and the userspace key.
As the ABI of these intercepted syscalls is unknown to Linux, these
-syscalls are not instrumentable via ptrace or the syscall tracepoints.
+syscalls are not instrumentable via ptrace or the syscall tracepoints,
+however an interface to suspend, checkpoint, and restore syscall user
+dispatch configuration has been added to ptrace to assist userland
+checkpoint/restart software.
Interface
---------
diff --git a/include/linux/syscall_user_dispatch.h b/include/linux/syscall_user_dispatch.h
index a0ae443fb7df..9e1bd0d87c1e 100644
--- a/include/linux/syscall_user_dispatch.h
+++ b/include/linux/syscall_user_dispatch.h
@@ -22,6 +22,13 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
#define clear_syscall_work_syscall_user_dispatch(tsk) \
clear_task_syscall_work(tsk, SYSCALL_USER_DISPATCH)
+int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+ void __user *data);
+
+int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+ void __user *data);
+
+
#else
struct syscall_user_dispatch {};
@@ -35,6 +42,18 @@ static inline void clear_syscall_work_syscall_user_dispatch(struct task_struct *
{
}
+static inline int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+ void __user *data)
+{
+ return -EINVAL;
+}
+
+static inline int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+ void __user *data)
+{
+ return -EINVAL;
+}
+
#endif /* CONFIG_GENERIC_ENTRY */
#endif /* _SYSCALL_USER_DISPATCH_H */
diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h
index ba9e3f19a22c..53ef59134dcb 100644
--- a/include/uapi/linux/ptrace.h
+++ b/include/uapi/linux/ptrace.h
@@ -112,6 +112,15 @@ struct ptrace_rseq_configuration {
__u32 pad;
};
+#define PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG 0x4210
+#define PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG 0x4211
+struct syscall_user_dispatch_config {
+ __u64 mode;
+ __s8 *selector;
+ __u64 offset;
+ __u64 len;
+};
+
/*
* These values are stored in task->ptrace_message
* by ptrace_stop to describe the current syscall-stop.
diff --git a/kernel/entry/syscall_user_dispatch.c b/kernel/entry/syscall_user_dispatch.c
index b5ec75164805..ee02ce21f75e 100644
--- a/kernel/entry/syscall_user_dispatch.c
+++ b/kernel/entry/syscall_user_dispatch.c
@@ -111,3 +111,43 @@ int set_syscall_user_dispatch(unsigned long mode, unsigned long offset,
return 0;
}
+
+int syscall_user_dispatch_get_config(struct task_struct *task, unsigned long size,
+ void __user *data)
+{
+ struct syscall_user_dispatch *sd = &task->syscall_dispatch;
+ struct syscall_user_dispatch_config config;
+
+ if (size != sizeof(struct syscall_user_dispatch_config))
+ return -EINVAL;
+
+ if (test_syscall_work(SYSCALL_USER_DISPATCH))
+ config.mode = PR_SYS_DISPATCH_ON;
+ else
+ config.mode = PR_SYS_DISPATCH_OFF;
+
+ config.offset = sd->offset;
+ config.len = sd->len;
+ config.selector = sd->selector;
+
+ if (copy_to_user(data, &config, sizeof(config)))
+ return -EFAULT;
+
+ return 0;
+}
+
+int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
+ void __user *data)
+{
+ struct syscall_user_dispatch_config config;
+ int ret;
+
+ if (size != sizeof(struct syscall_user_dispatch_config))
+ return -EINVAL;
+
+ if (copy_from_user(&config, data, sizeof(config)))
+ return -EFAULT;
+
+ return set_syscall_user_dispatch(config.mode, config.offset, config.len,
+ config.selector);
+}
diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index a348b68d07a2..76de46e080e2 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -32,6 +32,7 @@
#include <linux/compat.h>
#include <linux/sched/signal.h>
#include <linux/minmax.h>
+#include <linux/syscall_user_dispatch.h>
#include <asm/syscall.h> /* for syscall_get_* */
@@ -1263,6 +1264,14 @@ int ptrace_request(struct task_struct *child, long request,
break;
#endif
+ case PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG:
+ ret = syscall_user_dispatch_set_config(child, addr, datavp);
+ break;
+
+ case PTRACE_GET_SYSCALL_USER_DISPATCH_CONFIG:
+ ret = syscall_user_dispatch_get_config(child, addr, datavp);
+ break;
+
default:
break;
}
--
2.39.0
Hi Gregory,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[cannot apply to tip/core/entry]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Gregory-Price/ptrace-syscall_user_dispatch-add-a-getter-setter-for-sud-configuration/20230125-115155
patch link: https://lore.kernel.org/r/20230125025126.787431-3-gregory.price%40memverge.com
patch subject: [PATCH v6 2/2] ptrace,syscall_user_dispatch: add a getter/setter for sud configuration
config: x86_64-allyesconfig (https://download.01.org/0day-ci/archive/20230125/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/581ad4d3309b94aafb967b0ca56607436c18127f
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Gregory-Price/ptrace-syscall_user_dispatch-add-a-getter-setter-for-sud-configuration/20230125-115155
git checkout 581ad4d3309b94aafb967b0ca56607436c18127f
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=x86_64 olddefconfig
make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash kernel/
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
All warnings (new ones prefixed by >>):
kernel/entry/syscall_user_dispatch.c: In function 'syscall_user_dispatch_set_config':
>> kernel/entry/syscall_user_dispatch.c:143:13: warning: unused variable 'ret' [-Wunused-variable]
143 | int ret;
| ^~~
vim +/ret +143 kernel/entry/syscall_user_dispatch.c
138
139 int syscall_user_dispatch_set_config(struct task_struct *task, unsigned long size,
140 void __user *data)
141 {
142 struct syscall_user_dispatch_config config;
> 143 int ret;
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On 01/24, Gregory Price wrote:
>
> Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> modify Syscall User Dispatch to suspend interception when enabled.
>
> This is modeled after the SUSPEND_SECCOMP feature, which suspends
> SECCOMP interposition. Without doing this, software like CRIU will
> inject system calls into a process and be intercepted by Syscall
> User Dispatch, either causing a crash (due to blocked signals) or
> the delivery of those signals to a ptracer (not the intended behavior).
Cough... Gregory, I am sorry ;)
but can't we drop this patch too?
CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
config->mode anyway as we discussed.
Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
with the same effect, no?
Oleg.
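
A sketch of what that suggestion could look like from the tracer side, assuming the selector address has already been obtained (for example through the proposed getter) and the tracee is stopped. The helper names are hypothetical; PTRACE_PEEKDATA and PTRACE_POKEDATA are the existing word-sized ptrace accessors, and 0 is the value of SYSCALL_DISPATCH_FILTER_ALLOW in linux/prctl.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Same value as SYSCALL_DISPATCH_FILTER_ALLOW in linux/prctl.h. */
#define SKETCH_FILTER_ALLOW 0

/* Return `word` with the byte at its lowest address replaced by `value`. */
long sketch_patch_first_byte(long word, unsigned char value)
{
	memcpy(&word, &value, 1);
	return word;
}

/* Overwrite the one-byte selector in a stopped tracee.
 * Returns 0 on success, -1 with errno set on failure. */
int sketch_set_selector(pid_t pid, uintptr_t selector_addr,
			unsigned char value)
{
	long word;

	/* A task may use SUD without a selector; nothing to write then. */
	if (selector_addr == 0) {
		errno = EINVAL;
		return -1;
	}

	/* PEEKDATA returns the word itself, so errno disambiguates -1. */
	errno = 0;
	word = ptrace(PTRACE_PEEKDATA, pid, (void *)selector_addr, NULL);
	if (word == -1 && errno != 0)
		return -1;

	word = sketch_patch_first_byte(word, value);
	if (ptrace(PTRACE_POKEDATA, pid, (void *)selector_addr,
		   (void *)word) == -1)
		return -1;
	return 0;
}
```

A tracer would call sketch_set_selector(pid, addr, SKETCH_FILTER_ALLOW) before injecting and write the saved value back afterwards. Note that the read-modify-write touches a whole word starting at the selector, so it can fail if the selector sits in the last bytes of a mapping, and it does nothing for a task that registered SUD without a selector.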
On Thu, Jan 26, 2023 at 01:30:08AM +0100, Oleg Nesterov wrote:
> On 01/24, Gregory Price wrote:
> >
> > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > modify Syscall User Dispatch to suspend interception when enabled.
> >
> > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > SECCOMP interposition. Without doing this, software like CRIU will
> > inject system calls into a process and be intercepted by Syscall
> > User Dispatch, either causing a crash (due to blocked signals) or
> > the delivery of those signals to a ptracer (not the intended behavior).
>
> Cough... Gregory, I am sorry ;)
>
> but can't we drop this patch to ?
>
> CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> config->mode anyway as we discussed.
>
> Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> with the same effect, no?
>
> Oleg.
>
The selector is optional, but the core idea seems reasonable.
Though I think this complicates the quiesce vs checkpoint phases a bit.
My best understanding of CRIU is there are (at least) two checkpoint
phases: quiesce and checkpoint. The intent of patch 1/2 is to aid the
quiesce phase, not the checkpoint phase.
In both phases the `compel` code is used to inject system calls, so
turning SUD off is required. That can obviously be achieved via saving
with get_config, and just clearing it entirely with set_config.
I'm NOT sure whether the `compel` code can save settings that the
`cr-check` code then saves to disc, or if `compel` is standalone. I will
go check this and report back.
The only other concern is one of how it's restored, and in what order
compared to SECCOMP - for the absolute insane case of someone running a
SUD task inside a locked down cgroup? Technically possible (TM)!
We may find that the suspend flag is "just easier" but not required.
I do think more-simple-is-more-better, though, so I will investigate.
~Gregory
On Thu, Jan 26, 2023 at 01:30:08AM +0100, Oleg Nesterov wrote:
> On 01/24, Gregory Price wrote:
> >
> > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > modify Syscall User Dispatch to suspend interception when enabled.
> >
> > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > SECCOMP interposition. Without doing this, software like CRIU will
> > inject system calls into a process and be intercepted by Syscall
> > User Dispatch, either causing a crash (due to blocked signals) or
> > the delivery of those signals to a ptracer (not the intended behavior).
>
> Cough... Gregory, I am sorry ;)
>
> but can't we drop this patch to ?
>
> CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> config->mode anyway as we discussed.
>
> Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> with the same effect, no?
>
> Oleg.
>
After further investigation, I believe we can drop 1/2, but for a
different reason: It's actually insane behavior during the quiesce
phase. Quiesce allows the program to run until a particular state,
which means we can't turn it off lest we interfere with intended
behavior - (cough cough prior review said this cough cough i'm dumb).
I'll drop patch 1/2 and resubmit (there's an unused variable warning i
need to clean up).
Thanks again for the reviews all
~Gregory
On Wed, Jan 25, 2023 at 08:54:48PM -0800, Andrei Vagin wrote:
> On Wed, Jan 25, 2023 at 4:30 PM Oleg Nesterov <[email protected]> wrote:
> >
> > On 01/24, Gregory Price wrote:
> > >
> > > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > > modify Syscall User Dispatch to suspend interception when enabled.
> > >
> > > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > > SECCOMP interposition. Without doing this, software like CRIU will
> > > inject system calls into a process and be intercepted by Syscall
> > > User Dispatch, either causing a crash (due to blocked signals) or
> > > the delivery of those signals to a ptracer (not the intended behavior).
> >
> > Cough... Gregory, I am sorry ;)
> >
> > but can't we drop this patch to ?
> >
> > CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> > config->mode anyway as we discussed.
> >
> > Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> > with the same effect, no?
>
> Oleg,
>
> PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is automatically cleared when
> a tracer detaches. It is critical when tracers detach due to unexpected
> reasons
> (crashes, killed by oom, etc). In such cases, we want to be sure that a
> tracee will continue
> running from the point where it has been trapped.
>
> Thanks,
> Andrei
There might be a better place for the full C/R discussion, but it's worth
the extra context to hash out the SUSPEND flag.
The relevant kernel code i'm concerned about:
static long syscall_trace_enter(struct pt_regs *regs, long syscall,
unsigned long work)
{
long ret = 0;
/* ... snip ... do syscall user dispatch */
if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
if (syscall_user_dispatch(regs))
return -1L;
}
/* Handle ptrace */
if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
ret = ptrace_report_syscall_entry(regs);
if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
return -1L;
}
/* Do seccomp after ptrace, to catch any tracer changes. */
if (work & SYSCALL_WORK_SECCOMP) {
ret = __secure_computing(NULL);
}
/* ... snip ... */
}
The problem i'm seeing with PTRACE_O_SUSPEND_SUD is that SUD comes before
ptrace, while Seccomp comes after.
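
To make the ordering concrete, here is a toy model of the excerpt above. It is not kernel code: the enum, the function name, and the three boolean inputs are made up for illustration, and each hook is reduced to a yes/no answer.

```c
#include <assert.h>
#include <stdbool.h>

/* Which stage ends up deciding the fate of a syscall in this toy model. */
enum sketch_stage {
	SKETCH_SUD,     /* dispatched to the SIGSYS handler */
	SKETCH_PTRACE,  /* the tracer saw it and asked to skip it */
	SKETCH_SECCOMP, /* the filter rejected it */
	SKETCH_EXECUTE  /* nothing objected, the syscall runs */
};

enum sketch_stage sketch_syscall_entry(bool sud_dispatches,
				       bool tracer_skips,
				       bool seccomp_denies)
{
	/* SUD is consulted first, so the tracer never sees a
	 * syscall that SUD decides to dispatch. */
	if (sud_dispatches)
		return SKETCH_SUD;

	if (tracer_skips)
		return SKETCH_PTRACE;

	/* seccomp runs last and so observes any tracer changes. */
	if (seccomp_denies)
		return SKETCH_SECCOMP;

	return SKETCH_EXECUTE;
}
```

With sud_dispatches set, the other two inputs are irrelevant, which is the asymmetry with SUSPEND_SECCOMP: a tracer can act before seccomp runs, but not before SUD does.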
CRIU seems to use a few different methods to quiesce:
* ptrace syscall entry traps
* breakpoints (on sigreturn it seems)
* masking everything but SIGSTOP and waiting for a STOP
SUD represents an issue in all three cases:
* syscall dispatch preempts ptrace traps (though syscalls may come
from the exclusion area, so it should hit eventually)
* sigreturn can be changed (glibc prevents this, but the raw sigaction
syscall will take whatever address you give it)
* masking SIGSYS crashes a program on next dispatch if SUD is enabled
Most of this can be worked around.
My concern is whether there is any injection occurring during the quiesce
phase. If there is no injection - this SUSPEND isn't needed during
quiesce (and in fact, it's dangerous). If there is injection, then
the current CRIU quiesce method is incompatible with SUD anyway and this
would take more investigation to determine a solution.
Once quiesced, however, SUD has to be disabled to allow injections. And
I think auto-clearing of the PTRACE Options on detach is a good reason
to keep it.
So basic questions are:
1) Andrei do you think any injection occurs during quiesce that can't be
worked around?
2) Oleg is the auto-clearing nature of the flag sufficient justification
for keeping SUSPEND?
~Gregory
On 01/25, Andrei Vagin wrote:
>
> On Wed, Jan 25, 2023 at 4:30 PM Oleg Nesterov <[email protected]> wrote:
> >
> > On 01/24, Gregory Price wrote:
> > >
> > > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > > modify Syscall User Dispatch to suspend interception when enabled.
> > >
> > > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > > SECCOMP interposition. Without doing this, software like CRIU will
> > > inject system calls into a process and be intercepted by Syscall
> > > User Dispatch, either causing a crash (due to blocked signals) or
> > > the delivery of those signals to a ptracer (not the intended behavior).
> >
> > Cough... Gregory, I am sorry ;)
> >
> > but can't we drop this patch to ?
> >
> > CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> > config->mode anyway as we discussed.
> >
> > Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> > with the same effect, no?
>
> Oleg,
>
> PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is automatically cleared when
> a tracer detaches. It is critical when tracers detach due to unexpected
> reasons
IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
code, and this also needs to change the state of the traced process. If
the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
will have other problems?
Oleg.
On 01/26, Gregory Price wrote:
>
> So basic questions are:
> 1) Andrei do you think any injection occurs during quiesce that can't be
> worked around?
>
> 2) Oleg is the auto-clearing nature of the flag sufficient justification
> for keeping SUSPEND?
I understand that SUSPEND is more convenient for CRIU and more "safe".
But. This kernel is already overbloated ;) IMO, if SUSPEND is not strictly
necessary, it should be dropped.
Oleg.
On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
>
> On 01/25, Andrei Vagin wrote:
> >
> > On Wed, Jan 25, 2023 at 4:30 PM Oleg Nesterov <[email protected]> wrote:
> > >
> > > On 01/24, Gregory Price wrote:
> > > >
> > > > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > > > modify Syscall User Dispatch to suspend interception when enabled.
> > > >
> > > > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > > > SECCOMP interposition. Without doing this, software like CRIU will
> > > > inject system calls into a process and be intercepted by Syscall
> > > > User Dispatch, either causing a crash (due to blocked signals) or
> > > > the delivery of those signals to a ptracer (not the intended behavior).
> > >
> > > Cough... Gregory, I am sorry ;)
> > >
> > > but can't we drop this patch to ?
> > >
> > > CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> > > config->mode anyway as we discussed.
> > >
> > > Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> > > with the same effect, no?
> >
> > Oleg,
> >
> > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is automatically cleared when
> > a tracer detaches. It is critical when tracers detach due to unexpected
> > reasons
>
> IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> code, and this also needs to change the state of the traced process. If
> the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> will have other problems?
Our injected code can recover by itself if something goes wrong. The
hack here is that we inject the code with a signal frame and it calls
rt_sigreturn to resume the process.
We want to have this functionality for most cases. I don't expect that
syscall user dispatch is used by many applications, so I don't strongly
insist on PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH. In addition, if we
know the user dispatch memory region, it can be enough to inject our
code outside of this region without disabling SUD.
Thanks,
Andrei
On Thu, Jan 26, 2023 at 09:45:39AM -0800, Andrei Vagin wrote:
> On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
> >
> > On 01/25, Andrei Vagin wrote:
> > >
> > > On Wed, Jan 25, 2023 at 4:30 PM Oleg Nesterov <[email protected]> wrote:
> > > >
> > > > On 01/24, Gregory Price wrote:
> > > > >
> > > > > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > > > > modify Syscall User Dispatch to suspend interception when enabled.
> > > > >
> > > > > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > > > > SECCOMP interposition. Without doing this, software like CRIU will
> > > > > inject system calls into a process and be intercepted by Syscall
> > > > > User Dispatch, either causing a crash (due to blocked signals) or
> > > > > the delivery of those signals to a ptracer (not the intended behavior).
> > > >
> > > > Cough... Gregory, I am sorry ;)
> > > >
> > > > but can't we drop this patch to ?
> > > >
> > > > CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> > > > config->mode anyway as we discussed.
> > > >
> > > > Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> > > > with the same effect, no?
> > >
> > > Oleg,
> > >
> > > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is automatically cleared when
> > > a tracer detaches. It is critical when tracers detach due to unexpected
> > > reasons
> >
> > IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> > code, and this also needs to change the state of the traced process. If
> > the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> > will have other problems?
>
> Our injected code can reheal itself if something goes wrong. The hack
> here is that we inject
> the code with a signal frame and it calls rt_sigreturn to resume the process.
>
> We want to have this functionality for most cases. I don't expect that
> the syscall user dispatch
> is used by many applications, so I don't strongly insist on
> PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH. In addition, if we know a user dispatch
> memory region, it can be enough to inject our code out of this region
> without disabling SUD.
>
> Thanks,
> Andrei
The region is exclusive: syscalls issued from inside [offset, offset+len)
are allowed through, and syscalls from *outside* it produce a dispatch.
That means you would have to inject into that region.
That's what's problematic for injection. Even rt_sigreturn itself
may/will be intercepted.
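
For reference, a small standalone sketch of that region test. It mirrors the comparison visible in the patch context above (`instruction_pointer(regs) - sd->offset < sd->len`) but ignores the selector byte; the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when a syscall issued at `ip` would be handed to the SIGSYS
 * handler, assuming SUD is on and no selector overrides the result. */
bool sketch_would_dispatch(uint64_t ip, uint64_t offset, uint64_t len)
{
	/* Unsigned subtraction wraps for ip < offset, so a single compare
	 * checks both bounds of the allowed range [offset, offset + len). */
	bool in_allowed_region = (ip - offset) < len;

	return !in_allowed_region;
}
```

So with offset 0x1000 and len 0x100, a syscall at 0x10ff goes straight to the kernel, while one at 0x1100 or at 0xfff is dispatched. Injected code therefore has to live inside the allowed range to avoid interception.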
On Wed, Jan 25, 2023 at 9:26 PM Gregory Price
<[email protected]> wrote:
>
> On Wed, Jan 25, 2023 at 08:54:48PM -0800, Andrei Vagin wrote:
> > On Wed, Jan 25, 2023 at 4:30 PM Oleg Nesterov <[email protected]> wrote:
> > >
> > > On 01/24, Gregory Price wrote:
> > > >
> > > > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > > > modify Syscall User Dispatch to suspend interception when enabled.
> > > >
> > > > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > > > SECCOMP interposition. Without doing this, software like CRIU will
> > > > inject system calls into a process and be intercepted by Syscall
> > > > User Dispatch, either causing a crash (due to blocked signals) or
> > > > the delivery of those signals to a ptracer (not the intended behavior).
> > >
> > > Cough... Gregory, I am sorry ;)
> > >
> > > but can't we drop this patch to ?
> > >
> > > CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> > > config->mode anyway as we discussed.
> > >
> > > Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> > > with the same effect, no?
> >
> > Oleg,
> >
> > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is automatically cleared when
> > a tracer detaches. It is critical when tracers detach due to unexpected
> > reasons
> > (crashes, killed by oom, etc). In such cases, we want to be sure that a
> > tracee will continue
> > running from the point where it has been trapped.
> >
> > Thanks,
> > Andrei
>
> There might be a better place for the full C/R discussion, but it's worth
> the extra context to hash out the SUSPEND flag.
>
> The relevant kernel code i'm concerned about:
>
> static long syscall_trace_enter(struct pt_regs *regs, long syscall,
> unsigned long work)
> {
> long ret = 0;
> /* ... snip ... do syscall user dispatch */
> if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
> if (syscall_user_dispatch(regs))
> return -1L;
> }
>
> /* Handle ptrace */
> if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
> ret = ptrace_report_syscall_entry(regs);
> if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
> return -1L;
> }
>
> /* Do seccomp after ptrace, to catch any tracer changes. */
> if (work & SYSCALL_WORK_SECCOMP) {
> ret = __secure_computing(NULL);
> }
> /* ... snip ... */
> }
>
> The problem i'm seeing with PTRACE_O_SUSPEND_SUD is that SUD comes before
> ptrace, while Seccomp comes after.
>
> CRIU seems to use a few different methods to quiesce:
> * ptrace syscall entry traps
> * breakpoints (on sigreturn it seems)
I don't understand why it matters here. If we have PTRACE_O_SUSPEND_SUD,
all these actions happen when the flag is set, so syscall_user_dispatch
always returns 0 and we reach the ptrace hooks.
If we don't have PTRACE_O_SUSPEND_SUD, we need to disable SUD after
attaching to a process, or we need to inject our code into the region
that is not under SUD.
> * masking everything but SIGSTOP and waiting for a STOP
I am not sure that I understand what part of criu you refer to here.
>
> SUD represent an issue in all three cases
> * syscall dispatch preempts ptrace traps (though syscalls may come
> from the exclusion area, so it should hit eventually)
> * sigreturn can be changed (glibc prevents this, but the raw sigaction
> syscall will take whatever address you give it
CRIU sets breakpoints in the parasite code, which is part of the criu
code base and is under our control. Second, the parasite code doesn't
use libc calls.
> * masking SIGSYS crashes a program on next dispatch if SUD is enabled
>
This is right and criu should not trigger dispatched syscalls ;)
Thanks,
Andrei
On Thu, Jan 26, 2023 at 9:52 AM Gregory Price
<[email protected]> wrote:
>
> On Thu, Jan 26, 2023 at 09:45:39AM -0800, Andrei Vagin wrote:
> > On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
> > >
> > > On 01/25, Andrei Vagin wrote:
> > > >
> > > > On Wed, Jan 25, 2023 at 4:30 PM Oleg Nesterov <[email protected]> wrote:
> > > > >
> > > > > On 01/24, Gregory Price wrote:
> > > > > >
> > > > > > Adds PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH to ptrace options, and
> > > > > > modify Syscall User Dispatch to suspend interception when enabled.
> > > > > >
> > > > > > This is modeled after the SUSPEND_SECCOMP feature, which suspends
> > > > > > SECCOMP interposition. Without doing this, software like CRIU will
> > > > > > inject system calls into a process and be intercepted by Syscall
> > > > > > User Dispatch, either causing a crash (due to blocked signals) or
> > > > > > the delivery of those signals to a ptracer (not the intended behavior).
> > > > >
> > > > > Cough... Gregory, I am sorry ;)
> > > > >
> > > > > but can't we drop this patch to ?
> > > > >
> > > > > CRIU needs to do PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG and check
> > > > > config->mode anyway as we discussed.
> > > > >
> > > > > Then it can simply set *config->selector = SYSCALL_DISPATCH_FILTER_ALLOW
> > > > > with the same effect, no?
> > > >
> > > > Oleg,
> > > >
> > > > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is automatically cleared when
> > > > a tracer detaches. That is critical when tracers detach for
> > > > unexpected reasons.
> > >
> > > IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> > > code, and this also needs to change the state of the traced process. If
> > > the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> > > will have other problems?
> >
> > Our injected code can heal itself if something goes wrong. The hack
> > here is that we inject the code with a signal frame and it calls
> > rt_sigreturn to resume the process.
> >
> > We want to have this functionality for most cases. I don't expect that
> > syscall user dispatch is used by many applications, so I don't strongly
> > insist on PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH. In addition, if we
> > know the user dispatch memory region, it can be enough to inject our
> > code outside of this region without disabling SUD.
> >
> > Thanks,
> > Andrei
>
> The region is exclusive, so syscalls *outside* [offset, offset+len]
> produce a dispatch. That means you would have to inject into that region.
You are right. I missed that. So it depends on how large the region is
and whether it has enough free space to inject our code.

Out of curiosity, do you know of any real app that uses SUD? I think it
was implemented for Wine, but they haven't started using it yet.
Thanks,
Andrei
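
A minimal sketch of the exclusion-area point above: syscalls issued from
inside the region bypass dispatch, so injected code would have to fit
inside it. The sketch assumes a half-open range [offset, offset+len)
tested with a single unsigned subtraction; struct sud_region,
sud_ip_is_excluded() and sud_blob_fits() are hypothetical names used
only for illustration, not kernel or CRIU interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the SUD exclusion area (assumption, not a kernel type). */
struct sud_region {
    uint64_t offset; /* start of the always-allowed area */
    uint64_t len;    /* length of that area in bytes */
};

/*
 * True if a syscall issued at instruction pointer `ip` comes from the
 * exclusion area and is therefore NOT dispatched. Unsigned wrap-around
 * makes the single comparison reject ip < offset as well.
 */
static bool sud_ip_is_excluded(const struct sud_region *r, uint64_t ip)
{
    return ip - r->offset < r->len;
}

/*
 * True if a blob of `size` bytes placed at `addr` lies entirely inside
 * the exclusion area, i.e. every syscall it issues would bypass dispatch.
 */
static bool sud_blob_fits(const struct sud_region *r, uint64_t addr,
                          uint64_t size)
{
    assert(size > 0);
    if (size > r->len)
        return false;
    return addr - r->offset <= r->len - size;
}
```

With offset = 0x1000 and len = 0x100, a 0x80-byte blob fits at 0x1080
but not at 0x1081, which is the "enough free space" question in
concrete terms.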
On 01/26, Andrei Vagin wrote:
>
> On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
> >
> > IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> > code, and this also needs to change the state of the traced process. If
> > the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> > will have other problems?
>
> Our injected code can reheal itself if something goes wrong. The hack
> here is that we inject
> the code with a signal frame and it calls rt_sigreturn to resume the process.
What will happen if CRIU dies and clears ->ptrace right before
syscall_user_dispatch() checks PT_SUSPEND_SYSCALL_USER_DISPATCH?

How will the tracee react to a SIGSYS with an unexpected .si_syscall?
> I don't expect that
> the syscall user dispatch
> is used by many applications,
Agreed, so the case when CRIU will need to do the additional
PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG twice to disable and then re-enable
syscall_user_dispatch is unlikely.
> so I don't strongly insist on
> PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH.
I won't argue too much either, but so far I do not feel there is enough
justification for this feature ...
Oleg.
On Thu, Jan 26, 2023 at 07:30:19PM +0100, Oleg Nesterov wrote:
> On 01/26, Andrei Vagin wrote:
> >
> > On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
> > >
> > > IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> > > code, and this also needs to change the state of the traced process. If
> > > the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> > > will have other problems?
> >
> > Our injected code can reheal itself if something goes wrong. The hack
> > here is that we inject
> > the code with a signal frame and it calls rt_sigreturn to resume the process.
>
> What will happen if CRIU dies and clears ->ptrace right before
> syscall_user_dispatch() checks PT_SUSPEND_SYSCALL_USER_DISPATCH ?
>
> How the tracee will react to SIGSYS with unexpected .si_syscall ?
>
> > I don't expect that
> > the syscall user dispatch
> > is used by many applications,
>
> Agreed, so the case when CRIU will need to do the additional
> PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG twice to disable and then re-enable
> syscall_user_dispatch is unlikely.
>
> > so I don't strongly insist on
> > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH.
>
> I too won't argue too much. but so far I do not feel there is enough
> justification for this feature ...
>
> Oleg.
>
I'm not married to the idea, just want to make sure I have the tools
needed to make checkpoint work. The option seems like the easiest way
given the exclusion area issue.
One idea is to overwrite a portion of the exclusion area, but this
obviously can increase the complexity of the checkpoint process.
Another idea is to disable/enable SUD via get/set, but this produces
potential detach issues.
@Andrei I'm happy to take this to IRC or somewhere else out of band to
discuss the checkpoint specifics, if that would be preferable and you
are interested.
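
To make the two alternatives above concrete, here is a small user-space
model of the dispatch decision and of the save/disable/restore sequence
a tracer would perform through a get/set interface. It is a sketch under
stated assumptions, not kernel code: struct sud_config and the sud_*()
helpers are hypothetical, the enum values are assumed to mirror
PR_SYS_DISPATCH_OFF/ON and SYSCALL_DISPATCH_FILTER_ALLOW/BLOCK, and the
selector is modeled as a plain pointer rather than a byte in tracee
memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sud_mode { SUD_OFF = 0, SUD_ON = 1 };        /* assumed PR_SYS_DISPATCH_* */
enum sud_filter { SUD_ALLOW = 0, SUD_BLOCK = 1 }; /* assumed SYSCALL_DISPATCH_FILTER_* */

/* Hypothetical model of the per-task SUD configuration. */
struct sud_config {
    uint64_t mode;
    uint64_t offset;
    uint64_t len;
    const uint8_t *selector; /* models the user-space selector byte */
};

/* Would a syscall issued at `ip` be turned into a SIGSYS dispatch? */
static bool sud_would_dispatch(const struct sud_config *c, uint64_t ip)
{
    if (c->mode != SUD_ON)
        return false;              /* SUD disabled */
    if (ip - c->offset < c->len)
        return false;              /* inside the exclusion area */
    if (c->selector && *c->selector == SUD_ALLOW)
        return false;              /* selector currently allows syscalls */
    return true;
}

/* Tracer side: remember the live config, then switch SUD off. */
static struct sud_config sud_suspend(struct sud_config *live)
{
    struct sud_config saved = *live;
    live->mode = SUD_OFF;
    return saved;
}

/* Tracer side: put the saved config back before detaching. */
static void sud_resume(struct sud_config *live, const struct sud_config *saved)
{
    assert(live && saved);
    *live = *saved;
}
```

The detach concern shows up directly in this model: if the tracer goes
away between sud_suspend() and sud_resume(), nothing restores the saved
configuration and the task keeps running with SUD off.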
On Thu, Jan 26, 2023 at 10:30 AM Oleg Nesterov <[email protected]> wrote:
>
> On 01/26, Andrei Vagin wrote:
> >
> > On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
> > >
> > > IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> > > code, and this also needs to change the state of the traced process. If
> > > the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> > > will have other problems?
> >
> > Our injected code can reheal itself if something goes wrong. The hack
> > here is that we inject
> > the code with a signal frame and it calls rt_sigreturn to resume the process.
>
> What will happen if CRIU dies and clears ->ptrace right before
> syscall_user_dispatch() checks PT_SUSPEND_SYSCALL_USER_DISPATCH ?
>
> How the tracee will react to SIGSYS with unexpected .si_syscall ?
I got it. PTRACE_O_SUSPEND_SUD doesn't help us here, because we rely on
a sigreturn that is called after ptrace_detach. Thanks.
>
> > I don't expect that
> > the syscall user dispatch
> > is used by many applications,
>
> Agreed, so the case when CRIU will need to do the additional
> PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG twice to disable and then re-enable
> syscall_user_dispatch is unlikely.
>
> > so I don't strongly insist on
> > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH.
>
> I too won't argue too much. but so far I do not feel there is enough
> justification for this feature ...
Agreed.
>
> Oleg.
>
On Thu, Jan 26, 2023 at 10:53:49AM -0800, Andrei Vagin wrote:
> On Thu, Jan 26, 2023 at 10:30 AM Oleg Nesterov <[email protected]> wrote:
> >
> > On 01/26, Andrei Vagin wrote:
> > >
> > > On Thu, Jan 26, 2023 at 7:07 AM Oleg Nesterov <[email protected]> wrote:
> > > >
> > > > IIUC, PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH is needed to run the injected
> > > > code, and this also needs to change the state of the traced process. If
> > > > the tracer (CRIU) dies while the tracee runs this code, I guess the tracee
> > > > will have other problems?
> > >
> > > Our injected code can reheal itself if something goes wrong. The hack
> > > here is that we inject
> > > the code with a signal frame and it calls rt_sigreturn to resume the process.
> >
> > What will happen if CRIU dies and clears ->ptrace right before
> > syscall_user_dispatch() checks PT_SUSPEND_SYSCALL_USER_DISPATCH ?
> >
> > How the tracee will react to SIGSYS with unexpected .si_syscall ?
>
> I got it. PTRACE_O_SUSPEND_SUD doesn't help us here, because we rely
> on sigreturn
> that is called after ptrace_detach. Thanks.
>
> >
> > > I don't expect that
> > > the syscall user dispatch
> > > is used by many applications,
> >
> > Agreed, so the case when CRIU will need to do the additional
> > PTRACE_SET_SYSCALL_USER_DISPATCH_CONFIG twice to disable and then re-enable
> > syscall_user_dispatch is unlikely.
> >
> > > so I don't strongly insist on
> > > PTRACE_O_SUSPEND_SYSCALL_USER_DISPATCH.
> >
> > I too won't argue too much. but so far I do not feel there is enough
> > justification for this feature ...
>
> Agree
>
> >
> > Oleg.
> >
Seems that's the consensus; I will drop it.