2018-10-13 00:43:46

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH] kernel/signal: Signal-based pre-coredump notification

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event POSIX Signal Connector-based
----------------------------------------------------------------------
ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_STOPPED

ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_CONTINUED

pre_coredump/ N/A proc_coredump_connector()
get_signal()

post_coredump/ do_notify_parent() proc_exit_connector()
do_exit() SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen <[email protected]>
---
arch/x86/kernel/signal_compat.c | 2 +-
include/linux/sched.h | 4 ++
include/linux/signal.h | 5 +++
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/prctl.h | 4 ++
kernel/fork.c | 1 +
kernel/signal.c | 51 +++++++++++++++++++++++++
kernel/sys.c | 77 ++++++++++++++++++++++++++++++++++++++
8 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGSEGV != 7);
BUILD_BUG_ON(NSIGBUS != 5);
BUILD_BUG_ON(NSIGTRAP != 5);
- BUILD_BUG_ON(NSIGCHLD != 6);
+ BUILD_BUG_ON(NSIGCHLD != 7);
BUILD_BUG_ON(NSIGSYS != 1);

/* This is part of the ABI and can never change in size: */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 09026ea..cfb9645 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
int exit_signal;
/* The signal sent when the parent dies: */
int pdeath_signal;
+
+ /* The signal sent prior to a child's coredump: */
+ int predump_signal;
+
/* JOBCTL_*, siglock protected: */
unsigned long jobctl;

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 706a499..7cb976d 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
return sig <= _NSIG ? 1 : 0;
}

+static inline int valid_predump_signal(int sig)
+{
+ return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
+}
+
struct timespec;
struct pt_regs;
enum pid_type;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c2..1a47cef 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -267,7 +267,8 @@ struct { \
#define CLD_TRAPPED 4 /* traced child has trapped */
#define CLD_STOPPED 5 /* child has stopped */
#define CLD_CONTINUED 6 /* stopped child has continued */
-#define NSIGCHLD 6
+#define CLD_PREDUMP 7 /* child is about to dump core */
+#define NSIGCHLD 7

/*
* SIGPOLL (or any other signal without signal specific si_codes) si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)

+/* Whether to receive signal prior to child's coredump */
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff..c296c11 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
p->dirty_paused_when = 0;

p->pdeath_signal = 0;
+ p->predump_signal = 0;
INIT_LIST_HEAD(&p->thread_group);
p->task_works = NULL;

diff --git a/kernel/signal.c b/kernel/signal.c
index 312b43e..eb4a483 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info)
return signr;
}

+/*
+ * Let the parent, if so desired, know about the imminent death of a child
+ * prior to its coredump.
+ *
+ * Locking logic is similar to do_notify_parent_cldstop().
+ */
+static void do_notify_parent_predump(struct task_struct *tsk)
+{
+ struct sighand_struct *sighand;
+ struct task_struct *parent;
+ struct kernel_siginfo info;
+ unsigned long flags;
+ int sig;
+
+ parent = tsk->real_parent;
+ sig = parent->predump_signal;
+
+ /* Check again with "tasklist_lock" locked by the caller */
+ if (!valid_predump_signal(sig))
+ return;
+
+ clear_siginfo(&info);
+ info.si_signo = sig;
+ if (sig == SIGCHLD)
+ info.si_code = CLD_PREDUMP;
+
+ rcu_read_lock();
+ info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
+ info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
+ task_uid(tsk));
+ rcu_read_unlock();
+
+ sighand = parent->sighand;
+ spin_lock_irqsave(&sighand->siglock, flags);
+ __group_send_sig_info(sig, &info, parent);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+}
+
bool get_signal(struct ksignal *ksig)
{
struct sighand_struct *sighand = current->sighand;
@@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
current->flags |= PF_SIGNALED;

if (sig_kernel_coredump(signr)) {
+ /*
+ * Notify the parent prior to the coredump if the
+ * parent is interested in such a notificaiton.
+ */
+ int p_sig = current->real_parent->predump_signal;
+
+ if (valid_predump_signal(p_sig)) {
+ read_lock(&tasklist_lock);
+ do_notify_parent_predump(current);
+ read_unlock(&tasklist_lock);
+ cond_resched();
+ }
+
if (print_fatal_signals)
print_fatal_signal(ksig->info.si_signo);
proc_coredump_connector(current);
diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..43eb250d 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}

+static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
+ int __user *addr)
+{
+ struct task_struct *p;
+ int error;
+
+ /* For the current task, the common case */
+ if (pid == 0) {
+ put_user(tsk->predump_signal, addr);
+ return 0;
+ }
+
+ error = -ESRCH;
+ rcu_read_lock();
+ p = find_task_by_vpid(pid);
+ if (p) {
+ error = 0;
+ put_user(p->predump_signal, addr);
+ }
+ rcu_read_unlock();
+ return error;
+}
+
+/*
+ * Returns true if current's euid is same as p's uid or euid,
+ * or has CAP_SYS_ADMIN.
+ *
+ * Called with rcu_read_lock, creds are safe.
+ *
+ * Adapted from set_one_prio_perm().
+ */
+static bool set_predump_signal_perm(struct task_struct *p)
+{
+ const struct cred *cred = current_cred(), *pcred = __task_cred(p);
+
+ return uid_eq(pcred->uid, cred->euid) ||
+ uid_eq(pcred->euid, cred->euid) ||
+ capable(CAP_SYS_ADMIN);
+}
+
+static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
+{
+ struct task_struct *p;
+ int error;
+
+ /* 0 is valid for disabling the feature */
+ if (sig && !valid_predump_signal(sig))
+ return -EINVAL;
+
+ /* For the current task, the common case */
+ if (pid == 0) {
+ tsk->predump_signal = sig;
+ return 0;
+ }
+
+ error = -ESRCH;
+ rcu_read_lock();
+ p = find_task_by_vpid(pid);
+ if (p) {
+ if (!set_predump_signal_perm(p))
+ error = -EPERM;
+ else {
+ error = 0;
+ p->predump_signal = sig;
+ }
+ }
+ rcu_read_unlock();
+ return error;
+}
+
SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
{
@@ -2476,6 +2546,13 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_PREDUMP_SIG:
+ error = prctl_set_predump_signal(me, (pid_t)arg2, (int)arg3);
+ break;
+ case PR_GET_PREDUMP_SIG:
+ error = prctl_get_predump_signal(me, (pid_t)arg2,
+ (int __user *)arg3);
+ break;
default:
error = -EINVAL;
break;
--
1.8.3.1


2018-10-13 06:42:22

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
>
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process Event POSIX Signal Connector-based
> ----------------------------------------------------------------------
> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_STOPPED
>
> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_CONTINUED
>
> pre_coredump/ N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/ do_notify_parent() proc_exit_connector()
> do_exit() SIGCHLD / exit_signal
> ----------------------------------------------------------------------
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen <[email protected]>
> ---
> arch/x86/kernel/signal_compat.c | 2 +-
> include/linux/sched.h | 4 ++
> include/linux/signal.h | 5 +++
> include/uapi/asm-generic/siginfo.h | 3 +-
> include/uapi/linux/prctl.h | 4 ++
> kernel/fork.c | 1 +
> kernel/signal.c | 51 +++++++++++++++++++++++++
> kernel/sys.c | 77 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 145 insertions(+), 2 deletions(-)

Shouldn't there also be a manpage update, and a kselftest added for this
new user/kernel api that is being created?

thanks,

greg k-h

2018-10-13 10:50:59

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
>
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process Event POSIX Signal Connector-based
> ----------------------------------------------------------------------
> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_STOPPED
>
> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_CONTINUED
>
> pre_coredump/ N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/ do_notify_parent() proc_exit_connector()
> do_exit() SIGCHLD / exit_signal
> ----------------------------------------------------------------------
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen <[email protected]>
> ---
> arch/x86/kernel/signal_compat.c | 2 +-
> include/linux/sched.h | 4 ++
> include/linux/signal.h | 5 +++
> include/uapi/asm-generic/siginfo.h | 3 +-
> include/uapi/linux/prctl.h | 4 ++
> kernel/fork.c | 1 +
> kernel/signal.c | 51 +++++++++++++++++++++++++
> kernel/sys.c | 77 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
> BUILD_BUG_ON(NSIGSEGV != 7);
> BUILD_BUG_ON(NSIGBUS != 5);
> BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
> BUILD_BUG_ON(NSIGSYS != 1);
>
> /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
> int exit_signal;
> /* The signal sent when the parent dies: */
> int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
> /* JOBCTL_*, siglock protected: */
> unsigned long jobctl;
>
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
> return sig <= _NSIG ? 1 : 0;
> }
>
> +static inline int valid_predump_signal(int sig)
> +{
> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
> +}
> +
> struct timespec;
> struct pt_regs;
> enum pid_type;
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index cb3d6c2..1a47cef 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -267,7 +267,8 @@ struct { \
> #define CLD_TRAPPED 4 /* traced child has trapped */
> #define CLD_STOPPED 5 /* child has stopped */
> #define CLD_CONTINUED 6 /* stopped child has continued */
> -#define NSIGCHLD 6
> +#define CLD_PREDUMP 7 /* child is about to dump core */
> +#define NSIGCHLD 7
>
> /*
> * SIGPOLL (or any other signal without signal specific si_codes) si_codes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index c0d7ea0..79f0a8a 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -219,4 +219,8 @@ struct prctl_mm_map {
> # define PR_SPEC_DISABLE (1UL << 2)
> # define PR_SPEC_FORCE_DISABLE (1UL << 3)
>
> +/* Whether to receive signal prior to child's coredump */
> +#define PR_SET_PREDUMP_SIG 54
> +#define PR_GET_PREDUMP_SIG 55
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 07cddff..c296c11 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
> p->dirty_paused_when = 0;
>
> p->pdeath_signal = 0;
> + p->predump_signal = 0;
> INIT_LIST_HEAD(&p->thread_group);
> p->task_works = NULL;
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 312b43e..eb4a483 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info)
> return signr;
> }
>
> +/*
> + * Let the parent, if so desired, know about the imminent death of a child
> + * prior to its coredump.
> + *
> + * Locking logic is similar to do_notify_parent_cldstop().
> + */
> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct task_struct *parent;
> + struct kernel_siginfo info;
> + unsigned long flags;
> + int sig;
> +
> + parent = tsk->real_parent;
> + sig = parent->predump_signal;
> +
> + /* Check again with "tasklist_lock" locked by the caller */
> + if (!valid_predump_signal(sig))
> + return;
> +
> + clear_siginfo(&info);
> + info.si_signo = sig;
> + if (sig == SIGCHLD)
> + info.si_code = CLD_PREDUMP;
> +
> + rcu_read_lock();
> + info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
> + info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
> + task_uid(tsk));
> + rcu_read_unlock();
> +
> + sighand = parent->sighand;
> + spin_lock_irqsave(&sighand->siglock, flags);
> + __group_send_sig_info(sig, &info, parent);
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> +}
> +
> bool get_signal(struct ksignal *ksig)
> {
> struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
> current->flags |= PF_SIGNALED;
>
> if (sig_kernel_coredump(signr)) {
> + /*
> + * Notify the parent prior to the coredump if the
> + * parent is interested in such a notificaiton.
> + */
> + int p_sig = current->real_parent->predump_signal;
> +
> + if (valid_predump_signal(p_sig)) {
> + read_lock(&tasklist_lock);
> + do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + cond_resched();
> + }
> +
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> proc_coredump_connector(current);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..43eb250d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> }
>
> +static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
> + int __user *addr)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + put_user(tsk->predump_signal, addr);
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + error = 0;
> + put_user(p->predump_signal, addr);
> + }
> + rcu_read_unlock();
> + return error;
> +}
> +
> +/*
> + * Returns true if current's euid is same as p's uid or euid,
> + * or has CAP_SYS_ADMIN.
> + *
> + * Called with rcu_read_lock, creds are safe.
> + *
> + * Adapted from set_one_prio_perm().
> + */
> +static bool set_predump_signal_perm(struct task_struct *p)
> +{
> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
> +
> + return uid_eq(pcred->uid, cred->euid) ||
> + uid_eq(pcred->euid, cred->euid) ||
> + capable(CAP_SYS_ADMIN);

So before proceeding I'd like to discuss at least two points:
- how does this interact with the dumpability of a process?
- do we need the capable(CAP_SYS_ADMIN) restriction to init_user_ns?
Seems we could make this work per-user-ns just like
PRCTL_SET_PDEATHSIG does?

> +}
> +
> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* 0 is valid for disabling the feature */
> + if (sig && !valid_predump_signal(sig))
> + return -EINVAL;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + tsk->predump_signal = sig;
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + if (!set_predump_signal_perm(p))
> + error = -EPERM;
> + else {
> + error = 0;
> + p->predump_signal = sig;
> + }
> + }
> + rcu_read_unlock();
> + return error;
> +}
> +
> SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> unsigned long, arg4, unsigned long, arg5)
> {
> @@ -2476,6 +2546,13 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> break;
> + case PR_SET_PREDUMP_SIG:
> + error = prctl_set_predump_signal(me, (pid_t)arg2, (int)arg3);
> + break;
> + case PR_GET_PREDUMP_SIG:
> + error = prctl_get_predump_signal(me, (pid_t)arg2,
> + (int __user *)arg3);
> + break;
> default:
> error = -EINVAL;
> break;
> --
> 1.8.3.1

2018-10-13 18:28:51

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Sat, Oct 13, 2018 at 2:33 AM Enke Chen <[email protected]> wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
some important differences:

- You don't reset the signal on setuid execution.
- You permit setting this not just on the current process, but also on others.

For both of these: Are these differences actually necessary, and if
so, can you provide a specific rationale? From a security perspective,
I would very much prefer it if this API had semantics closer to
PR_SET_PDEATHSIG.

[...]
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 312b43e..eb4a483 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info)
> return signr;
> }
>
> +/*
> + * Let the parent, if so desired, know about the imminent death of a child
> + * prior to its coredump.
> + *
> + * Locking logic is similar to do_notify_parent_cldstop().
> + */
> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct task_struct *parent;
> + struct kernel_siginfo info;
> + unsigned long flags;
> + int sig;
> +
> + parent = tsk->real_parent;
> + sig = parent->predump_signal;
> +
> + /* Check again with "tasklist_lock" locked by the caller */
> + if (!valid_predump_signal(sig))
> + return;
> +
> + clear_siginfo(&info);
> + info.si_signo = sig;
> + if (sig == SIGCHLD)
> + info.si_code = CLD_PREDUMP;
> +
> + rcu_read_lock();
> + info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
> + info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
> + task_uid(tsk));

You're sending a signal from the current namespaces, but with IDs that
have been mapped into the parent's namespaces? That looks wrong to me.

> + rcu_read_unlock();
> +
> + sighand = parent->sighand;
> + spin_lock_irqsave(&sighand->siglock, flags);
> + __group_send_sig_info(sig, &info, parent);
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> +}
> +
> bool get_signal(struct ksignal *ksig)
> {
> struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
> current->flags |= PF_SIGNALED;
>
> if (sig_kernel_coredump(signr)) {
> + /*
> + * Notify the parent prior to the coredump if the
> + * parent is interested in such a notificaiton.
> + */
> + int p_sig = current->real_parent->predump_signal;

current->real_parent is an __rcu member. I think if you run the sparse
checker against this patch, it's going to complain. Are you allowed to
access current->real_parent in this context?

> + if (valid_predump_signal(p_sig)) {
> + read_lock(&tasklist_lock);
> + do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + cond_resched();
> + }
> +
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> proc_coredump_connector(current);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..43eb250d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> }
>
> +static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
> + int __user *addr)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + put_user(tsk->predump_signal, addr);
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + error = 0;
> + put_user(p->predump_signal, addr);

This is wrong. You can't call put_user() while you're in an RCU
read-side critical section.

As below, I would like it if you could just get rid of this branch,
and restrict this prctl to operating on the current process.

> + }
> + rcu_read_unlock();
> + return error;
> +}
> +
> +/*
> + * Returns true if current's euid is same as p's uid or euid,
> + * or has CAP_SYS_ADMIN.
> + *
> + * Called with rcu_read_lock, creds are safe.
> + *
> + * Adapted from set_one_prio_perm().
> + */
> +static bool set_predump_signal_perm(struct task_struct *p)
> +{
> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
> +
> + return uid_eq(pcred->uid, cred->euid) ||
> + uid_eq(pcred->euid, cred->euid) ||
> + capable(CAP_SYS_ADMIN);
> +}

This permission check permits fiddling with other processes in
scenarios where kill() wouldn't let you send signals (specifically, if
you control the EUID of the target task). That worries me. Also,
kill() is subject to LSM checks (see check_kill_permission()), but
this interface isn't. I would really prefer it if you could amend this
so that you can only operate on the current task, and get rid of this
permission check.

[...]
> SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> unsigned long, arg4, unsigned long, arg5)
> {
> @@ -2476,6 +2546,13 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> break;
> + case PR_SET_PREDUMP_SIG:
> + error = prctl_set_predump_signal(me, (pid_t)arg2, (int)arg3);
> + break;
> + case PR_GET_PREDUMP_SIG:
> + error = prctl_get_predump_signal(me, (pid_t)arg2,
> + (int __user *)arg3);
> + break;

New prctl() calls should check that the unused arguments are zero, in
order to make it possible to safely add more flags in the future.

2018-10-15 12:07:40

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On 10/12, Enke Chen wrote:
>
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

To be honest, I can't say I like this new feature...

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
> int exit_signal;
> /* The signal sent when the parent dies: */
> int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +

At least, I think predump_signal should live in signal_struct, not
task_struct.

(pdeath_signal too, but it is too late to change (fix) this awkward API).

> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct task_struct *parent;
> + struct kernel_siginfo info;
> + unsigned long flags;
> + int sig;
> +
> + parent = tsk->real_parent;

So, debuggere won't be notified, only real_parent...

> + sig = parent->predump_signal;

probably ->predump_signal should be cleared on exec?

> + /* Check again with tasklist_lock" locked by the caller */
> + if (!valid_predump_signal(sig))
> + return;

I don't understand why we need valid_predump_signal() at all.

> bool get_signal(struct ksignal *ksig)
> {
> struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
> current->flags |= PF_SIGNALED;
>
> if (sig_kernel_coredump(signr)) {
> + /*
> + * Notify the parent prior to the coredump if the
> + * parent is interested in such a notificaiton.
> + */
> + int p_sig = current->real_parent->predump_signal;
> +
> + if (valid_predump_signal(p_sig)) {
> + read_lock(&tasklist_lock);
> + do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + cond_resched();

perhaps this should be called by do_coredump() after coredump_wait() kills
all the sub-threads?

> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* 0 is valid for disabling the feature */
> + if (sig && !valid_predump_signal(sig))
> + return -EINVAL;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + tsk->predump_signal = sig;
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + if (!set_predump_signal_perm(p))
> + error = -EPERM;
> + else {
> + error = 0;
> + p->predump_signal = sig;
> + }
> + }
> + rcu_read_unlock();
> + return error;
> +}

Why? I mean, why do we really want to support the pid != 0 case?

Oleg.


2018-10-15 18:17:22

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Greg:

> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
>

I will submit a patch for manpage update once the code is accepted.

Regarding the kselftest, I am not sure. Once the prctl() is limited to
self (which I will do), the logic would be pretty straightforward. Not
sure if the selftest would add much value.

Thanks. -- Enke

On 10/12/18 11:40 PM, Greg Kroah-Hartman wrote:
> On Fri, Oct 12, 2018 at 05:33:35PM -0700, Enke Chen wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>>
>> Currently there are two ways for a parent process to be notified of a
>> child process's state change. One is to use the POSIX signal, and
>> another is to use the kernel connector module. The specific events and
>> actions are summarized as follows:
>>
>> Process Event POSIX Signal Connector-based
>> ----------------------------------------------------------------------
>> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
>> SIGCHLD / CLD_STOPPED
>>
>> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
>> SIGCHLD / CLD_CONTINUED
>>
>> pre_coredump/ N/A proc_coredump_connector()
>> get_signal()
>>
>> post_coredump/ do_notify_parent() proc_exit_connector()
>> do_exit() SIGCHLD / exit_signal
>> ----------------------------------------------------------------------
>>
>> As shown in the table, the signal-based pre-coredump notification is not
>> currently available. In some cases using a connector-based notification
>> can be quite complicated (e.g., when a process manager is written in shell
>> scripts and thus is subject to certain inherent limitations), and a
>> signal-based notification would be simpler and better suited.
>>
>> Signed-off-by: Enke Chen <[email protected]>
>> ---
>> arch/x86/kernel/signal_compat.c | 2 +-
>> include/linux/sched.h | 4 ++
>> include/linux/signal.h | 5 +++
>> include/uapi/asm-generic/siginfo.h | 3 +-
>> include/uapi/linux/prctl.h | 4 ++
>> kernel/fork.c | 1 +
>> kernel/signal.c | 51 +++++++++++++++++++++++++
>> kernel/sys.c | 77 ++++++++++++++++++++++++++++++++++++++
>> 8 files changed, 145 insertions(+), 2 deletions(-)
>
> Shouldn't there also be a manpage update, and a kselftest added for this
> new user/kernel api that is being created?
>
> thanks,
>
> greg k-h
>

2018-10-15 18:41:35

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Christian:

As I replied to Jann, I will remove the code that does the setting on others
to make the code simpler and more secure.

Thanks. -- Enke

>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> + return uid_eq(pcred->uid, cred->euid) ||
>> + uid_eq(pcred->euid, cred->euid) ||
>> + capable(CAP_SYS_ADMIN);
>
> So before proceeding I'd like to discuss at least two points:
> - how does this interact with the dumpability of a process?
> - do we need the capable(CAP_SYS_ADMIN) restriction to init_user_ns?
> Seems we could make this work per-user-ns just like
> PRCTL_SET_PDEATHSIG does?
>
>> +}


2018-10-15 18:44:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
> Hi, Greg:
>
> > Shouldn't there also be a manpage update, and a kselftest added for this
> > new user/kernel api that is being created?
> >
>
> I will submit a patch for manpage update once the code is accepted.

Writing a manpage update is key to see if what you are describing
actually matches the code you have submitted. You should do both at the
same time so that they can be reviewed together.

> Regarding the kselftest, I am not sure. Once the prctl() is limited to
> self (which I will do), the logic would be pretty straightforward. Not
> sure if the selftest would add much value.

If you do not have a test for this feature, how do you know it even
works at all? How will you know if it breaks in a future kernel
release? Have you tested this? If so, how?

thanks,

greg k-h

2018-10-15 18:46:53

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Jann:

Thanks a lot for you detailed review. Please see my replied/comments inline.

On 10/13/18 11:27 AM, Jann Horn wrote:
> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen <[email protected]> wrote:
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> some important differences:
>
> - You don't reset the signal on setuid execution.
> - You permit setting this not just on the current process, but also on others.
>
> For both of these: Are these differences actually necessary, and if
> so, can you provide a specific rationale? From a security perspective,
> I would very much prefer it if this API had semantics closer to
> PR_SET_PDEATHSIG.

Regarding setting on others, I started with setting for self. But there is
a requirement for supporting the feature for a process manager written in
bash script. That's the reason for allowing the setting on others.

Given the feedback from you and others, I agree that it would be simpler and
more secure to remove the setting on others. We can submit a patch for bash
to support the setting natively.

Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
do with the application/process whether the signal handler is set for receiving
such a notification. If it is set, the "uid" should not matter.

>
> [...]
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 312b43e..eb4a483 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info)
>> return signr;
>> }
>>
>> +/*
>> + * Let the parent, if so desired, know about the imminent death of a child
>> + * prior to its coredump.
>> + *
>> + * Locking logic is similar to do_notify_parent_cldstop().
>> + */
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> + struct sighand_struct *sighand;
>> + struct task_struct *parent;
>> + struct kernel_siginfo info;
>> + unsigned long flags;
>> + int sig;
>> +
>> + parent = tsk->real_parent;
>> + sig = parent->predump_signal;
>> +
>> + /* Check again with "tasklist_lock" locked by the caller */
>> + if (!valid_predump_signal(sig))
>> + return;
>> +
>> + clear_siginfo(&info);
>> + info.si_signo = sig;
>> + if (sig == SIGCHLD)
>> + info.si_code = CLD_PREDUMP;
>> +
>> + rcu_read_lock();
>> + info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
>> + info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
>> + task_uid(tsk));
>
> You're sending a signal from the current namespaces, but with IDs that
> have been mapped into the parent's namespaces? That looks wrong to me.

I am following the example "do_notify_parent_cldstop()" called in the same
routine "get_signal()". If there is a better way, sure I will use it.

>
>> + rcu_read_unlock();
>> +
>> + sighand = parent->sighand;
>> + spin_lock_irqsave(&sighand->siglock, flags);
>> + __group_send_sig_info(sig, &info, parent);
>> + spin_unlock_irqrestore(&sighand->siglock, flags);
>> +}
>> +
>> bool get_signal(struct ksignal *ksig)
>> {
>> struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>> current->flags |= PF_SIGNALED;
>>
>> if (sig_kernel_coredump(signr)) {
>> + /*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> + int p_sig = current->real_parent->predump_signal;
>
> current->real_parent is an __rcu member. I think if you run the sparse
> checker against this patch, it's going to complain. Are you allowed to
> access current->real_parent in this context?

Let me check, and get back to you on this one.

>
>> + if (valid_predump_signal(p_sig)) {
>> + read_lock(&tasklist_lock);
>> + do_notify_parent_predump(current);
>> + read_unlock(&tasklist_lock);
>> + cond_resched();
>> + }
>> +
>> if (print_fatal_signals)
>> print_fatal_signal(ksig->info.si_signo);
>> proc_coredump_connector(current);
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 123bd73..43eb250d 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>> return -EINVAL;
>> }
>>
>> +static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
>> + int __user *addr)
>> +{
>> + struct task_struct *p;
>> + int error;
>> +
>> + /* For the current task, the common case */
>> + if (pid == 0) {
>> + put_user(tsk->predump_signal, addr);
>> + return 0;
>> + }
>> +
>> + error = -ESRCH;
>> + rcu_read_lock();
>> + p = find_task_by_vpid(pid);
>> + if (p) {
>> + error = 0;
>> + put_user(p->predump_signal, addr);
>
> This is wrong. You can't call put_user() while you're in an RCU
> read-side critical section.
>
> As below, I would like it if you could just get rid of this branch,
> and restrict this prctl to operating on the current process.

My bad. The code will be removed.

>
>> + }
>> + rcu_read_unlock();
>> + return error;
>> +}
>> +
>> +/*
>> + * Returns true if current's euid is same as p's uid or euid,
>> + * or has CAP_SYS_ADMIN.
>> + *
>> + * Called with rcu_read_lock, creds are safe.
>> + *
>> + * Adapted from set_one_prio_perm().
>> + */
>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> + return uid_eq(pcred->uid, cred->euid) ||
>> + uid_eq(pcred->euid, cred->euid) ||
>> + capable(CAP_SYS_ADMIN);
>> +}
>
> This permission check permits fiddling with other processes in
> scenarios where kill() wouldn't let you send signals (specifically, if
> you control the EUID of the target task). That worries me. Also,
> kill() is subject to LSM checks (see check_kill_permission()), but
> this interface isn't. I would really prefer it if you could amend this
> so that you can only operate on the current task, and get rid of this
> permission check.

It is gone.

>
> [...]
>> SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>> unsigned long, arg4, unsigned long, arg5)
>> {
>> @@ -2476,6 +2546,13 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>> return -EINVAL;
>> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
>> break;
>> + case PR_SET_PREDUMP_SIG:
>> + error = prctl_set_predump_signal(me, (pid_t)arg2, (int)arg3);
>> + break;
>> + case PR_GET_PREDUMP_SIG:
>> + error = prctl_get_predump_signal(me, (pid_t)arg2,
>> + (int __user *)arg3);
>> + break;
>
> New prctl() calls should check that the unused arguments are zero, in
> order to make it possible to safely add more flags in the future.

Will do.

Thanks again. -- Enke


2018-10-15 18:51:16

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Greg:

On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
>> Hi, Greg:
>>
>>> Shouldn't there also be a manpage update, and a kselftest added for this
>>> new user/kernel api that is being created?
>>>
>>
>> I will submit a patch for manpage update once the code is accepted.
>
> Writing a manpage update is key to see if what you are describing
> actually matches the code you have submitted. You should do both at the
> same time so that they can be reviewed together.

Ok, will do at the same time. But should I submit it as a separate patch?

>
>> Regarding the kselftest, I am not sure. Once the prctl() is limited to
>> self (which I will do), the logic would be pretty straightforward. Not
>> sure if the selftest would add much value.
>
> If you do not have a test for this feature, how do you know it even
> works at all? How will you know if it breaks in a future kernel
> release? Have you tested this? If so, how?

I have the test code. I am just not sure whether I should submit and check
it in to the kselftest?

Thanks. -- Enke


2018-10-15 18:55:47

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Mon, Oct 15, 2018 at 8:36 PM Enke Chen <[email protected]> wrote:
> On 10/13/18 11:27 AM, Jann Horn wrote:
> > On Sat, Oct 13, 2018 at 2:33 AM Enke Chen <[email protected]> wrote:
> >> For simplicity and consistency, this patch provides an implementation
> >> for signal-based fault notification prior to the coredump of a child
> >> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> >> be used by an application to express its interest and to specify the
> >> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> >> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
> > some important differences:
> >
> > - You don't reset the signal on setuid execution.
[...]
> >
> > For both of these: Are these differences actually necessary, and if
> > so, can you provide a specific rationale? From a security perspective,
> > I would very much prefer it if this API had semantics closer to
> > PR_SET_PDEATHSIG.
>
[...]
>
> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
> do with the application/process whether the signal handler is set for receiving
> such a notification. If it is set, the "uid" should not matter.

If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
off a child, then calls execve() on a setuid binary, the setuid binary
calls setuid(0), and the attacker-controlled child then crashes, the
privileged process will receive an unexpected signal that the attacker
wouldn't have been allowed to send otherwise. For similar reasons, the
parent death signal is reset when a setuid binary is executed:

void setup_new_exec(struct linux_binprm * bprm)
{
/*
* Once here, prepare_binrpm() will not be called any more, so
* the final state of setuid/setgid/fscaps can be merged into the
* secureexec flag.
*/
bprm->secureexec |= bprm->cap_elevated;

if (bprm->secureexec) {
/* Make sure parent cannot signal privileged process. */
current->pdeath_signal = 0;
[...]
}
[...]
}

int commit_creds(struct cred *new)
{
[...]
/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
!cred_cap_issubset(old, new)) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
task->pdeath_signal = 0;
smp_wmb();
}
[...]
}

AppArmor and SELinux also do related changes:

static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
{
[...]
/* bail out if unconfined or not changing profile */
if ((new_label->proxy == label->proxy) ||
(unconfined(new_label)))
return;

aa_inherit_files(bprm->cred, current->files);

current->pdeath_signal = 0;
[...]
}

static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
{
[...]
new_tsec = bprm->cred->security;
if (new_tsec->sid == new_tsec->osid)
return;

/* Close files for which the new task SID is not authorized. */
flush_unauthorized_files(bprm->cred, current->files);

/* Always clear parent death signal on SID transitions. */
current->pdeath_signal = 0;
[...]
}

You should probably reset the coredump signal in the same places - or
even better, add a new helper for resetting the parent death signal,
and then add code for resetting the coredump signal in there.

2018-10-15 19:00:31

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> To be honest, I can't say I like this new feature...

The requirement for predump notification is real. IMO signal notification
is simpler than "connector" or "signal + connector".

>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>> int exit_signal;
>> /* The signal sent when the parent dies: */
>> int pdeath_signal;
>> +
>> + /* The signal sent prior to a child's coredump: */
>> + int predump_signal;
>> +
>
> At least, I think predump_signal should live in signal_struct, not
> task_struct.

It makes sense as "signal handling" must be consistent in a process.
I was following the wrong example. I will make the change.

>
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
>
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> + struct sighand_struct *sighand;
>> + struct task_struct *parent;
>> + struct kernel_siginfo info;
>> + unsigned long flags;
>> + int sig;
>> +
>> + parent = tsk->real_parent;
>
> So, debuggere won't be notified, only real_parent...
>
>> + sig = parent->predump_signal;
>
> probably ->predump_signal should be cleared on exec?
>
>> + /* Check again with tasklist_lock" locked by the caller */
>> + if (!valid_predump_signal(sig))
>> + return;
>
> I don't understand why we need valid_predump_signal() at all.
>
>> bool get_signal(struct ksignal *ksig)
>> {
>> struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>> current->flags |= PF_SIGNALED;
>>
>> if (sig_kernel_coredump(signr)) {
>> + /*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> + int p_sig = current->real_parent->predump_signal;
>> +
>> + if (valid_predump_signal(p_sig)) {
>> + read_lock(&tasklist_lock);
>> + do_notify_parent_predump(current);
>> + read_unlock(&tasklist_lock);
>> + cond_resched();
>
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?
>
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>> +{
>> + struct task_struct *p;
>> + int error;
>> +
>> + /* 0 is valid for disabling the feature */
>> + if (sig && !valid_predump_signal(sig))
>> + return -EINVAL;
>> +
>> + /* For the current task, the common case */
>> + if (pid == 0) {
>> + tsk->predump_signal = sig;
>> + return 0;
>> + }
>> +
>> + error = -ESRCH;
>> + rcu_read_lock();
>> + p = find_task_by_vpid(pid);
>> + if (p) {
>> + if (!set_predump_signal_perm(p))
>> + error = -EPERM;
>> + else {
>> + error = 0;
>> + p->predump_signal = sig;
>> + }
>> + }
>> + rcu_read_unlock();
>> + return error;
>> +}
>
> Why? I mean, why do we really want to support the pid != 0 case?

I will remove it. Please see my reply to Jann.

Thanks. -- Enke


2018-10-15 19:00:43

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Mon, Oct 15, 2018 at 11:49:11AM -0700, Enke Chen wrote:
> Hi, Greg:
>
> On 10/15/18 11:43 AM, Greg Kroah-Hartman wrote:
> > On Mon, Oct 15, 2018 at 11:16:36AM -0700, Enke Chen wrote:
> >> Hi, Greg:
> >>
> >>> Shouldn't there also be a manpage update, and a kselftest added for this
> >>> new user/kernel api that is being created?
> >>>
> >>
> >> I will submit a patch for manpage update once the code is accepted.
> >
> > Writing a manpage update is key to see if what you are describing
> > actually matches the code you have submitted. You should do both at the
> > same time so that they can be reviewed together.
>
> Ok, will do at the same time. But should I submit it as a separate patch?

Yes please.

> >> Regarding the kselftest, I am not sure. Once the prctl() is limited to
> >> self (which I will do), the logic would be pretty straightforward. Not
> >> sure if the selftest would add much value.
> >
> > If you do not have a test for this feature, how do you know it even
> > works at all? How will you know if it breaks in a future kernel
> > release? Have you tested this? If so, how?
>
> I have the test code. I am just not sure whether I should submit and check
> it in to the kselftest?

Of course you should, why wouldn't you? Do you want to be the only
person in the world responsible for ensuring that this feature does not
break for the next 20+ years? :)

thanks,

greg k-h

2018-10-15 19:19:00

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

I missed some of your comments in my previous reply.

On 10/15/18 5:05 AM, Oleg Nesterov wrote:
> On 10/12, Enke Chen wrote:
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> To be honest, I can't say I like this new feature...
>
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -696,6 +696,10 @@ struct task_struct {
>> int exit_signal;
>> /* The signal sent when the parent dies: */
>> int pdeath_signal;
>> +
>> + /* The signal sent prior to a child's coredump: */
>> + int predump_signal;
>> +
>
> At least, I think predump_signal should live in signal_struct, not
> task_struct.
>
> (pdeath_signal too, but it is too late to change (fix) this awkward API).
>
>> +static void do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> + struct sighand_struct *sighand;
>> + struct task_struct *parent;
>> + struct kernel_siginfo info;
>> + unsigned long flags;
>> + int sig;
>> +
>> + parent = tsk->real_parent;
>
> So, debuggere won't be notified, only real_parent...
>
>> + sig = parent->predump_signal;
>
> probably ->predump_signal should be cleared on exec?


Is this not enough in "copy_process()"?

@@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
p->dirty_paused_when = 0;

p->pdeath_signal = 0;
+ p->predump_signal = 0;

>
>> + /* Check again with tasklist_lock" locked by the caller */
>> + if (!valid_predump_signal(sig))
>> + return;
>
> I don't understand why we need valid_predump_signal() at all.

Most of the signals have well-defined semantics, and would not be appropriate
for this purpose. That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

>
>> bool get_signal(struct ksignal *ksig)
>> {
>> struct sighand_struct *sighand = current->sighand;
>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>> current->flags |= PF_SIGNALED;
>>
>> if (sig_kernel_coredump(signr)) {
>> + /*
>> + * Notify the parent prior to the coredump if the
>> + * parent is interested in such a notificaiton.
>> + */
>> + int p_sig = current->real_parent->predump_signal;
>> +
>> + if (valid_predump_signal(p_sig)) {
>> + read_lock(&tasklist_lock);
>> + do_notify_parent_predump(current);
>> + read_unlock(&tasklist_lock);
>> + cond_resched();
>
> perhaps this should be called by do_coredump() after coredump_wait() kills
> all the sub-threads?

proc_coredump_connector(current) is located here, they should stay together.

Thanks. -- Enke

>
>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>> +{
>> + struct task_struct *p;
>> + int error;
>> +
>> + /* 0 is valid for disabling the feature */
>> + if (sig && !valid_predump_signal(sig))
>> + return -EINVAL;
>> +
>> + /* For the current task, the common case */
>> + if (pid == 0) {
>> + tsk->predump_signal = sig;
>> + return 0;
>> + }
>> +
>> + error = -ESRCH;
>> + rcu_read_lock();
>> + p = find_task_by_vpid(pid);
>> + if (p) {
>> + if (!set_predump_signal_perm(p))
>> + error = -EPERM;
>> + else {
>> + error = 0;
>> + p->predump_signal = sig;
>> + }
>> + }
>> + rcu_read_unlock();
>> + return error;
>> +}
>
> Why? I mean, why do we really want to support the pid != 0 case?
>
> Oleg.
>

2018-10-15 19:24:42

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Jann:

Thanks for your detail explanation. Will take care of it.

-- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen <[email protected]> wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen <[email protected]> wrote:
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>> - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for receiving
>> such a notification. If it is set, the "uid" should not matter.
>
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
>
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
> * Once here, prepare_binrpm() will not be called any more, so
> * the final state of setuid/setgid/fscaps can be merged into the
> * secureexec flag.
> */
> bprm->secureexec |= bprm->cap_elevated;
>
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
>
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
>
> AppArmor and SELinux also do related changes:
>
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
>
> aa_inherit_files(bprm->cred, current->files);
>
> current->pdeath_signal = 0;
> [...]
> }
>
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
>
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
>
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
>
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
>

2018-10-15 19:31:32

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Olge:

>> probably ->predump_signal should be cleared on exec?

As I replied to Jann, will do.

Thanks. -- Enke

On 10/15/18 12:17 PM, Enke Chen wrote:
> Hi, Oleg:
>
> I missed some of your comments in my previous reply.
>
> On 10/15/18 5:05 AM, Oleg Nesterov wrote:
>> On 10/12, Enke Chen wrote:
>>>
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> To be honest, I can't say I like this new feature...
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -696,6 +696,10 @@ struct task_struct {
>>> int exit_signal;
>>> /* The signal sent when the parent dies: */
>>> int pdeath_signal;
>>> +
>>> + /* The signal sent prior to a child's coredump: */
>>> + int predump_signal;
>>> +
>>
>> At least, I think predump_signal should live in signal_struct, not
>> task_struct.
>>
>> (pdeath_signal too, but it is too late to change (fix) this awkward API).
>>
>>> +static void do_notify_parent_predump(struct task_struct *tsk)
>>> +{
>>> + struct sighand_struct *sighand;
>>> + struct task_struct *parent;
>>> + struct kernel_siginfo info;
>>> + unsigned long flags;
>>> + int sig;
>>> +
>>> + parent = tsk->real_parent;
>>
>> So, debuggere won't be notified, only real_parent...
>>
>>> + sig = parent->predump_signal;
>>
>> probably ->predump_signal should be cleared on exec?
>
>
> Is this not enough in "copy_process()"?
>
> @@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
> p->dirty_paused_when = 0;
>
> p->pdeath_signal = 0;
> + p->predump_signal = 0;
>
>>
>>> + /* Check again with tasklist_lock" locked by the caller */
>>> + if (!valid_predump_signal(sig))
>>> + return;
>>
>> I don't understand why we need valid_predump_signal() at all.
>
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose. That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
>>
>>> bool get_signal(struct ksignal *ksig)
>>> {
>>> struct sighand_struct *sighand = current->sighand;
>>> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
>>> current->flags |= PF_SIGNALED;
>>>
>>> if (sig_kernel_coredump(signr)) {
>>> + /*
>>> + * Notify the parent prior to the coredump if the
>>> + * parent is interested in such a notificaiton.
>>> + */
>>> + int p_sig = current->real_parent->predump_signal;
>>> +
>>> + if (valid_predump_signal(p_sig)) {
>>> + read_lock(&tasklist_lock);
>>> + do_notify_parent_predump(current);
>>> + read_unlock(&tasklist_lock);
>>> + cond_resched();
>>
>> perhaps this should be called by do_coredump() after coredump_wait() kills
>> all the sub-threads?
>
> proc_coredump_connector(current) is located here, they should stay together.
>
> Thanks. -- Enke
>
>>
>>> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
>>> +{
>>> + struct task_struct *p;
>>> + int error;
>>> +
>>> + /* 0 is valid for disabling the feature */
>>> + if (sig && !valid_predump_signal(sig))
>>> + return -EINVAL;
>>> +
>>> + /* For the current task, the common case */
>>> + if (pid == 0) {
>>> + tsk->predump_signal = sig;
>>> + return 0;
>>> + }
>>> +
>>> + error = -ESRCH;
>>> + rcu_read_lock();
>>> + p = find_task_by_vpid(pid);
>>> + if (p) {
>>> + if (!set_predump_signal_perm(p))
>>> + error = -EPERM;
>>> + else {
>>> + error = 0;
>>> + p->predump_signal = sig;
>>> + }
>>> + }
>>> + rcu_read_unlock();
>>> + return error;
>>> +}
>>
>> Why? I mean, why do we really want to support the pid != 0 case?
>>
>> Oleg.
>>

2018-10-15 21:25:09

by Alan Cox

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

> +/*
> + * Returns true if current's euid is same as p's uid or euid,
> + * or has CAP_SYS_ADMIN.
> + *
> + * Called with rcu_read_lock, creds are safe.
> + *
> + * Adapted from set_one_prio_perm().
> + */
> +static bool set_predump_signal_perm(struct task_struct *p)
> +{
> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
> +
> + return uid_eq(pcred->uid, cred->euid) ||
> + uid_eq(pcred->euid, cred->euid) ||
> + capable(CAP_SYS_ADMIN);
> +}

This makes absolutely no security sense whatsoever. The uid and euid of
the parent and child can both change between the test and the signal
delivery.

There are reasons that the child signal control code is incredibly
careful about either the parent or child using execve or doing a
privilege change that might pose a risk.

Until this code gets the same protections I don't believe it's safe.

Alan

2018-10-15 21:37:53

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Alan:

As I replied earlier, I will remove the logic that allows setting on others.
This function "set_predump_signal_perm()" will be gone too.

Thanks. -- Enke

On 10/15/18 2:21 PM, Alan Cox wrote:
>> +/*
>> + * Returns true if current's euid is same as p's uid or euid,
>> + * or has CAP_SYS_ADMIN.
>> + *
>> + * Called with rcu_read_lock, creds are safe.
>> + *
>> + * Adapted from set_one_prio_perm().
>> + */
>> +static bool set_predump_signal_perm(struct task_struct *p)
>> +{
>> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
>> +
>> + return uid_eq(pcred->uid, cred->euid) ||
>> + uid_eq(pcred->euid, cred->euid) ||
>> + capable(CAP_SYS_ADMIN);
>> +}
>
> This makes absolutely no security sense whatsoever. The uid and euid of
> the parent and child can both change between the test and the signal
> delivery.
>
> There are reasons that the child signal control code is incredibly
> careful about either the parent or child using execve or doing a
> privilege change that might pose a risk.
>
> Until this code gets the same protections I don't believe it's safe.
>
> Alan
>

2018-10-15 23:30:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Enke Chen <[email protected]> writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.

You talk about time senstive and then you talk about bash scripts.
I don't think your definition of time-sensitive and my definition match.

With that said I think the best solution would be to figure out how to
allow the coredump to run in parallel with the usual exit signal, and
exit code reaping of the process.

That would solve the problem for everyone, and would not introduce any
new complicated APIs.

Short of that having the prctl in the process that receives the signals
they you are doing is the right way to go.

You are however calling do_notify_parent_predump from the wrong
function, and frankly with the wrong locking. There are multiple paths
to the do_coredump function so you really want this notification from
do_coredump.

But I still think it would be better to solve the root cause problem and
change the coredump logic to be able to run in parallel with the normal
exit notification and zombie reaping logic. Then the problem you are
trying to solve goes away and everyone's code gets simpler.

Eric

> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process Event POSIX Signal Connector-based
> ----------------------------------------------------------------------
> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_STOPPED
>
> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_CONTINUED
>
> pre_coredump/ N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/ do_notify_parent() proc_exit_connector()
> do_exit() SIGCHLD / exit_signal
> ----------------------------------------------------------------------
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen <[email protected]>
> ---
> arch/x86/kernel/signal_compat.c | 2 +-
> include/linux/sched.h | 4 ++
> include/linux/signal.h | 5 +++
> include/uapi/asm-generic/siginfo.h | 3 +-
> include/uapi/linux/prctl.h | 4 ++
> kernel/fork.c | 1 +
> kernel/signal.c | 51 +++++++++++++++++++++++++
> kernel/sys.c | 77 ++++++++++++++++++++++++++++++++++++++
> 8 files changed, 145 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
> BUILD_BUG_ON(NSIGSEGV != 7);
> BUILD_BUG_ON(NSIGBUS != 5);
> BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
> BUILD_BUG_ON(NSIGSYS != 1);
>
> /* This is part of the ABI and can never change in size: */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 09026ea..cfb9645 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
> int exit_signal;
> /* The signal sent when the parent dies: */
> int pdeath_signal;
> +
> + /* The signal sent prior to a child's coredump: */
> + int predump_signal;
> +
> /* JOBCTL_*, siglock protected: */
> unsigned long jobctl;
>
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index 706a499..7cb976d 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -256,6 +256,11 @@ static inline int valid_signal(unsigned long sig)
> return sig <= _NSIG ? 1 : 0;
> }
>
> +static inline int valid_predump_signal(int sig)
> +{
> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
> +}
> +
> struct timespec;
> struct pt_regs;
> enum pid_type;
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index cb3d6c2..1a47cef 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -267,7 +267,8 @@ struct { \
> #define CLD_TRAPPED 4 /* traced child has trapped */
> #define CLD_STOPPED 5 /* child has stopped */
> #define CLD_CONTINUED 6 /* stopped child has continued */
> -#define NSIGCHLD 6
> +#define CLD_PREDUMP 7 /* child is about to dump core */
> +#define NSIGCHLD 7
>
> /*
> * SIGPOLL (or any other signal without signal specific si_codes) si_codes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index c0d7ea0..79f0a8a 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -219,4 +219,8 @@ struct prctl_mm_map {
> # define PR_SPEC_DISABLE (1UL << 2)
> # define PR_SPEC_FORCE_DISABLE (1UL << 3)
>
> +/* Whether to receive signal prior to child's coredump */
> +#define PR_SET_PREDUMP_SIG 54
> +#define PR_GET_PREDUMP_SIG 55
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 07cddff..c296c11 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1985,6 +1985,7 @@ static __latent_entropy struct task_struct *copy_process(
> p->dirty_paused_when = 0;
>
> p->pdeath_signal = 0;
> + p->predump_signal = 0;
> INIT_LIST_HEAD(&p->thread_group);
> p->task_works = NULL;
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 312b43e..eb4a483 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2337,6 +2337,44 @@ static int ptrace_signal(int signr, kernel_siginfo_t *info)
> return signr;
> }
>
> +/*
> + * Let the parent, if so desired, know about the imminent death of a child
> + * prior to its coredump.
> + *
> + * Locking logic is similar to do_notify_parent_cldstop().
> + */
> +static void do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct task_struct *parent;
> + struct kernel_siginfo info;
> + unsigned long flags;
> + int sig;
> +
> + parent = tsk->real_parent;
> + sig = parent->predump_signal;
> +
> + /* Check again with "tasklist_lock" locked by the caller */
> + if (!valid_predump_signal(sig))
> + return;
> +
> + clear_siginfo(&info);
> + info.si_signo = sig;
> + if (sig == SIGCHLD)
> + info.si_code = CLD_PREDUMP;
> +
> + rcu_read_lock();
> + info.si_pid = task_pid_nr_ns(tsk, task_active_pid_ns(parent));
> + info.si_uid = from_kuid_munged(task_cred_xxx(parent, user_ns),
> + task_uid(tsk));
> + rcu_read_unlock();
> +
> + sighand = parent->sighand;
> + spin_lock_irqsave(&sighand->siglock, flags);
> + __group_send_sig_info(sig, &info, parent);
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> +}
> +
> bool get_signal(struct ksignal *ksig)
> {
> struct sighand_struct *sighand = current->sighand;
> @@ -2497,6 +2535,19 @@ bool get_signal(struct ksignal *ksig)
> current->flags |= PF_SIGNALED;
>
> if (sig_kernel_coredump(signr)) {
> + /*
> + * Notify the parent prior to the coredump if the
> + * parent is interested in such a notificaiton.
> + */
> + int p_sig = current->real_parent->predump_signal;
> +
> + if (valid_predump_signal(p_sig)) {
> + read_lock(&tasklist_lock);
> + do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + cond_resched();
> + }
> +
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> proc_coredump_connector(current);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..43eb250d 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2258,6 +2258,76 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> }
>
> +static int prctl_get_predump_signal(struct task_struct *tsk, pid_t pid,
> + int __user *addr)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + put_user(tsk->predump_signal, addr);
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + error = 0;
> + put_user(p->predump_signal, addr);
> + }
> + rcu_read_unlock();
> + return error;
> +}
> +
> +/*
> + * Returns true if current's euid is same as p's uid or euid,
> + * or has CAP_SYS_ADMIN.
> + *
> + * Called with rcu_read_lock, creds are safe.
> + *
> + * Adapted from set_one_prio_perm().
> + */
> +static bool set_predump_signal_perm(struct task_struct *p)
> +{
> + const struct cred *cred = current_cred(), *pcred = __task_cred(p);
> +
> + return uid_eq(pcred->uid, cred->euid) ||
> + uid_eq(pcred->euid, cred->euid) ||
> + capable(CAP_SYS_ADMIN);
> +}
> +
> +static int prctl_set_predump_signal(struct task_struct *tsk, pid_t pid, int sig)
> +{
> + struct task_struct *p;
> + int error;
> +
> + /* 0 is valid for disabling the feature */
> + if (sig && !valid_predump_signal(sig))
> + return -EINVAL;
> +
> + /* For the current task, the common case */
> + if (pid == 0) {
> + tsk->predump_signal = sig;
> + return 0;
> + }
> +
> + error = -ESRCH;
> + rcu_read_lock();
> + p = find_task_by_vpid(pid);
> + if (p) {
> + if (!set_predump_signal_perm(p))
> + error = -EPERM;
> + else {
> + error = 0;
> + p->predump_signal = sig;
> + }
> + }
> + rcu_read_unlock();
> + return error;
> +}
> +
> SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> unsigned long, arg4, unsigned long, arg5)
> {
> @@ -2476,6 +2546,13 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> break;
> + case PR_SET_PREDUMP_SIG:
> + error = prctl_set_predump_signal(me, (pid_t)arg2, (int)arg3);
> + break;
> + case PR_GET_PREDUMP_SIG:
> + error = prctl_get_predump_signal(me, (pid_t)arg2,
> + (int __user *)arg3);
> + break;
> default:
> error = -EINVAL;
> break;

2018-10-16 00:35:21

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Mon, 15 Oct 2018 18:28:03 -0500, Eric W. Biederman said:
> Enke Chen <[email protected]> writes:
>
> > For simplicity and consistency, this patch provides an implementation
> > for signal-based fault notification prior to the coredump of a child
> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> > be used by an application to express its interest and to specify the
> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Background:
> >
> > As the coredump of a process may take time, in certain time-sensitive
> > applications it is necessary for a parent process (e.g., a process
> > manager) to be notified of a child's imminent death before the coredump
> > so that the parent process can act sooner, such as re-spawning an
> > application process, or initiating a control-plane fail-over.
>
> You talk about time senstive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

When the process image is measured in hundreds of gigabytes, the corefile can
take a while even by /bin/bash standards. You want fun, watch an HPC process
manage to OOM a machine with 3T of RAM in a way that produces a full image
coredump.

To network storage.


Attachments:
(No filename) (497.00 B)

2018-10-16 00:55:25

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Eric:

On 10/15/18 4:28 PM, Eric W. Biederman wrote:
> Enke Chen <[email protected]> writes:
>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>
> You talk about time senstive and then you talk about bash scripts.
> I don't think your definition of time-sensitive and my definition match.

It's certainly not my preference to have a process manager (or one for each
application) written in bash scripts. But they do work, and are deployed.

>
> With that said I think the best solution would be to figure out how to
> allow the coredump to run in parallel with the usual exit signal, and
> exit code reaping of the process>
> That would solve the problem for everyone, and would not introduce any
> new complicated APIs.

That would certainly help. But given the huge deployment of Linux, I don't
think it would be feasible to change this fundamental behavior (signal post
coredump).

>
> Short of that having the prctl in the process that receives the signals
> they you are doing is the right way to go.

Thanks for for the encouragement.

>
> You are however calling do_notify_parent_predump from the wrong
> function, and frankly with the wrong locking. There are multiple paths
> to the do_coredump function so you really want this notification from
> do_coredump.

This makes two - Oleg also suggested doing it in do_coredump().
I will look into it, perhaps also relocated proc_coredump_connector().

>
> But I still think it would be better to solve the root cause problem and
> change the coredump logic to be able to run in parallel with the normal
> exit notification and zombie reaping logic. Then the problem you are
> trying to solve goes away and everyone's code gets simpler.
>
> Eric
>

Thanks. -- Enke


2018-10-16 14:15:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On 10/15, Enke Chen wrote:
>
> > I don't understand why we need valid_predump_signal() at all.
>
> Most of the signals have well-defined semantics, and would not be appropriate
> for this purpose.

you are going to change the rules anyway.

> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.

Which do not queue. So the parent won't get the 2nd signal if 2 children
crash at the same time.

> >> if (sig_kernel_coredump(signr)) {
> >> + /*
> >> + * Notify the parent prior to the coredump if the
> >> + * parent is interested in such a notificaiton.
> >> + */
> >> + int p_sig = current->real_parent->predump_signal;
> >> +
> >> + if (valid_predump_signal(p_sig)) {
> >> + read_lock(&tasklist_lock);
> >> + do_notify_parent_predump(current);
> >> + read_unlock(&tasklist_lock);
> >> + cond_resched();
> >
> > perhaps this should be called by do_coredump() after coredump_wait() kills
> > all the sub-threads?
>
> proc_coredump_connector(current) is located here, they should stay together.

Why?

Once again, other threads are still alive. So if the parent restarts the service
after it recieves -predump_signal, the new process can "race" with the old thread.

Oleg.


2018-10-16 15:11:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Oleg Nesterov <[email protected]> writes:

> On 10/15, Enke Chen wrote:
>>
>> > I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
>
> you are going to change the rules anyway.

I will just add that CLD_XXX is only valid with SIGCHLD as they are
signal specific si_codes. In conjunction with another signal like
SIGUSR it will have another meaning. I would really appreciate it
if new code does not further complicate siginfo_layout.

>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.

We do best effort queueing but we don't guarantee anything. So yes
this makes signals a very louzy interface for sending this kind of
information.

>> >> if (sig_kernel_coredump(signr)) {
>> >> + /*
>> >> + * Notify the parent prior to the coredump if the
>> >> + * parent is interested in such a notificaiton.
>> >> + */
>> >> + int p_sig = current->real_parent->predump_signal;
>> >> +
>> >> + if (valid_predump_signal(p_sig)) {
>> >> + read_lock(&tasklist_lock);
>> >> + do_notify_parent_predump(current);
>> >> + read_unlock(&tasklist_lock);
>> >> + cond_resched();
>> >
>> > perhaps this should be called by do_coredump() after coredump_wait() kills
>> > all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
>
> Why?
>
> Once again, other threads are still alive. So if the parent restarts the service
> after it recieves -predump_signal, the new process can "race" with the old thread.

Yes. It isn't until do_coredump calls coredump_wait that all of the
threads are killed.

Eric


2018-10-16 15:28:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Enke Chen <[email protected]> writes:

> Hi, Eric:
>
> On 10/15/18 4:28 PM, Eric W. Biederman wrote:

>> With that said I think the best solution would be to figure out how to
>> allow the coredump to run in parallel with the usual exit signal, and
>> exit code reaping of the process>
>> That would solve the problem for everyone, and would not introduce any
>> new complicated APIs.
>
> That would certainly help. But given the huge deployment of Linux, I don't
> think it would be feasible to change this fundamental behavior (signal post
> coredump).

Of course it will be feasible to change. Make it a sysctl and keep the
current default and no one will even notice. Waiting for something that
is happening asynchronously is not be difficult so having the wait
optional should not be a problem.

Right now the default in most distributions is to disable core dumps
entirely. Which means that you are going to have to find a very
specific situation in which people and applications care about core
dumps happening to break an existing setup.

Then all you have to do to get the non-blocking behavior is to just do:
echo 1 > /proc/sys/kernel_core_async

Then everything else works without modifications and everyone is happy.
Maybe I am wearing rose colored glasses but that looks like all that is
needed and it should be much easier to work with and maintain than
having to modify every manager process to listen for unreliable signals,
and then take action on those unreliable signals.

Eric

2018-10-17 00:40:46

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

On 10/16/18 7:14 AM, Oleg Nesterov wrote:
> On 10/15, Enke Chen wrote:
>>
>>> I don't understand why we need valid_predump_signal() at all.
>>
>> Most of the signals have well-defined semantics, and would not be appropriate
>> for this purpose.
>
> you are going to change the rules anyway.
>
>> That is why it is limited to only SIGCHLD, SIGUSR1, SIGUSR2.
>
> Which do not queue. So the parent won't get the 2nd signal if 2 children
> crash at the same time.
>
>>>> if (sig_kernel_coredump(signr)) {
>>>> + /*
>>>> + * Notify the parent prior to the coredump if the
>>>> + * parent is interested in such a notificaiton.
>>>> + */
>>>> + int p_sig = current->real_parent->predump_signal;
>>>> +
>>>> + if (valid_predump_signal(p_sig)) {
>>>> + read_lock(&tasklist_lock);
>>>> + do_notify_parent_predump(current);
>>>> + read_unlock(&tasklist_lock);
>>>> + cond_resched();
>>>
>>> perhaps this should be called by do_coredump() after coredump_wait() kills
>>> all the sub-threads?
>>
>> proc_coredump_connector(current) is located here, they should stay together.
>
> Why?
>
> Once again, other threads are still alive. So if the parent restarts the service
> after it recieves -predump_signal, the new process can "race" with the old thread.

Yes, it is a good idea to do the signal notification in do_coredump() after
coredump_wait(). Will make the change as suggested.

Thanks. -- Enke

2018-10-19 23:03:37

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Hi, Jann:

Regarding the security considerations, it seems simpler and more secure to
just clear the "pre-coredump signal" cross execve(2), and let the new program
decide for itself. What do you think?

---
Changes to prctl(2):

DESCRIPTION

PR_SET_PREDUMP_SIG (since Linux 4.20.x)
This allows the calling process to receive a signal (arg2,
if nonzero) from a child process prior to the coredump of
the child process. arg2 must be SIGUSR1, or SIGUSR2, or
SIGCHLD, or 0 (for clear).

When SIGCHLD is specified, the signal code is set to
CLD_PREDUMP in such an SIGCHLD signal.

The value of the pre-coredump signal is cleared across
execve(2), or for the child of a fork(2).

PR_GET_PREDUMP_SIG (since Linux 4.20.x)
Return the current value of the pre-coredump signal for the
calling process, in the location pointed to by (int *) arg2.
---

Thanks. -- Enke

On 10/15/18 11:54 AM, Jann Horn wrote:
> On Mon, Oct 15, 2018 at 8:36 PM Enke Chen <[email protected]> wrote:
>> On 10/13/18 11:27 AM, Jann Horn wrote:
>>> On Sat, Oct 13, 2018 at 2:33 AM Enke Chen <[email protected]> wrote:
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Your suggested API looks vaguely similar to PR_SET_PDEATHSIG, but with
>>> some important differences:
>>>
>>> - You don't reset the signal on setuid execution.
> [...]
>>>
>>> For both of these: Are these differences actually necessary, and if
>>> so, can you provide a specific rationale? From a security perspective,
>>> I would very much prefer it if this API had semantics closer to
>>> PR_SET_PDEATHSIG.
>>
> [...]
>>
>> Regarding the impact of "setuid", this property "PR_SET_PREDUMP_SIG" has to
>> do with the application/process whether the signal handler is set for receiving
>> such a notification. If it is set, the "uid" should not matter.
>
> If an attacker's process first calls PR_SET_PREDUMP_SIG, then forks
> off a child, then calls execve() on a setuid binary, the setuid binary
> calls setuid(0), and the attacker-controlled child then crashes, the
> privileged process will receive an unexpected signal that the attacker
> wouldn't have been allowed to send otherwise. For similar reasons, the
> parent death signal is reset when a setuid binary is executed:
>
> void setup_new_exec(struct linux_binprm * bprm)
> {
> /*
> * Once here, prepare_binrpm() will not be called any more, so
> * the final state of setuid/setgid/fscaps can be merged into the
> * secureexec flag.
> */
> bprm->secureexec |= bprm->cap_elevated;
>
> if (bprm->secureexec) {
> /* Make sure parent cannot signal privileged process. */
> current->pdeath_signal = 0;
> [...]
> }
> [...]
> }
>
> int commit_creds(struct cred *new)
> {
> [...]
> /* dumpability changes */
> if (!uid_eq(old->euid, new->euid) ||
> !gid_eq(old->egid, new->egid) ||
> !uid_eq(old->fsuid, new->fsuid) ||
> !gid_eq(old->fsgid, new->fsgid) ||
> !cred_cap_issubset(old, new)) {
> if (task->mm)
> set_dumpable(task->mm, suid_dumpable);
> task->pdeath_signal = 0;
> smp_wmb();
> }
> [...]
> }
>
> AppArmor and SELinux also do related changes:
>
> static void apparmor_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> /* bail out if unconfined or not changing profile */
> if ((new_label->proxy == label->proxy) ||
> (unconfined(new_label)))
> return;
>
> aa_inherit_files(bprm->cred, current->files);
>
> current->pdeath_signal = 0;
> [...]
> }
>
> static void selinux_bprm_committing_creds(struct linux_binprm *bprm)
> {
> [...]
> new_tsec = bprm->cred->security;
> if (new_tsec->sid == new_tsec->osid)
> return;
>
> /* Close files for which the new task SID is not authorized. */
> flush_unauthorized_files(bprm->cred, current->files);
>
> /* Always clear parent death signal on SID transitions. */
> current->pdeath_signal = 0;
> [...]
> }
>
> You should probably reset the coredump signal in the same places - or
> even better, add a new helper for resetting the parent death signal,
> and then add code for resetting the coredump signal in there.
>

2018-10-22 15:42:27

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

On Sat, Oct 20, 2018 at 1:01 AM Enke Chen <[email protected]> wrote:
> Regarding the security considerations, it seems simpler and more secure to
> just clear the "pre-coredump signal" cross execve(2), and let the new program
> decide for itself. What do you think?

I don't have a problem with these semantics.

I could imagine someone being unhappy about the theoretical race
window if they want to perform an in-place reexecution of a running
service, but I don't know whether anyone actually cares about that.

> Changes to prctl(2):
>
> DESCRIPTION
>
> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
> This allows the calling process to receive a signal (arg2,
> if nonzero) from a child process prior to the coredump of
> the child process. arg2 must be SIGUSR1, or SIGUSR2, or
> SIGCHLD, or 0 (for clear).
>
> When SIGCHLD is specified, the signal code is set to
> CLD_PREDUMP in such an SIGCHLD signal.
>
> The value of the pre-coredump signal is cleared across
> execve(2), or for the child of a fork(2).
>
> PR_GET_PREDUMP_SIG (since Linux 4.20.x)
> Return the current value of the pre-coredump signal for the
> calling process, in the location pointed to by (int *) arg2.

2018-10-22 20:49:46

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH] kernel/signal: Signal-based pre-coredump notification

Jann,

Thanks for the feedback. I will post a revised patch shortly.

On the related topic of "pdeath_signal", there are several inconsistencies
by preserving the flag across execve(2). The flag is cleared under several
conditions in different places. I will start a separate thread to see if
it can still be cleaned up.

PR_SET_PDEATHSIG (since Linux 2.1.57)
Set the parent death signal of the calling process to arg2
(either a signal value in the range 1..maxsig, or 0 to clear).
This is the signal that the calling process will get when its
parent dies. This value is cleared for the child of a fork(2)
and (since Linux 2.4.36 / 2.6.23) when executing a set-user-ID
or set-group-ID binary, or a binary that has associated
capabilities (see capabilities(7)). This value is preserved
across execve(2).

-- Enke

On 10/22/18 8:40 AM, Jann Horn wrote:
> On Sat, Oct 20, 2018 at 1:01 AM Enke Chen <[email protected]> wrote:
>> Regarding the security considerations, it seems simpler and more secure to
>> just clear the "pre-coredump signal" cross execve(2), and let the new program
>> decide for itself. What do you think?
>
> I don't have a problem with these semantics.
>
> I could imagine someone being unhappy about the theoretical race
> window if they want to perform an in-place reexecution of a running
> service, but I don't know whether anyone actually cares about that.
>
>> Changes to prctl(2):
>>
>> DESCRIPTION
>>
>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>> This allows the calling process to receive a signal (arg2,
>> if nonzero) from a child process prior to the coredump of
>> the child process. arg2 must be SIGUSR1, or SIGUSR2, or
>> SIGCHLD, or 0 (for clear).
>>
>> When SIGCHLD is specified, the signal code is set to
>> CLD_PREDUMP in such an SIGCHLD signal.
>>
>> The value of the pre-coredump signal is cleared across
>> execve(2), or for the child of a fork(2).
>>
>> PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>> Return the current value of the pre-coredump signal for the
>> calling process, in the location pointed to by (int *) arg2.

2018-10-22 23:30:12

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.

Changes to prctl(2):

PR_SET_PREDUMP_SIG (since Linux 4.20.x)
Set the child pre-coredump signal of the calling process to
arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
This is the signal that the calling process will get prior to
the coredump of a child process. This value is cleared across
execve(2), or for the child of a fork(2).

When SIGCHLD is specified, the signal code will be set to
CLD_PREDUMP in such an SIGCHLD signal.

PR_GET_PREDUMP_SIG (since Linux 4.20.x)
Return the current value of the child pre-coredump signal,
in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event POSIX Signal Connector-based
----------------------------------------------------------------------
ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_STOPPED

ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_CONTINUED

pre_coredump/ N/A proc_coredump_connector()
get_signal()

post_coredump/ do_notify_parent() proc_exit_connector()
do_exit() SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen <[email protected]>
---
v1 -> v2:

o remove the setting/gettting on others in prctl().
o move the notification from get_signal() to do_coredump().
o notify the "parent" instead of "real_parent".
o move the "predump_signal" from "task_struct" to "signal_struct".
o clear the signal setting across execve(2) as well.
o add validation for unused prctl() parameters.
o add selftests for the new prctl() API.

arch/x86/kernel/signal_compat.c | 2 +-
fs/coredump.c | 10 ++
fs/exec.c | 3 +
include/linux/sched/signal.h | 4 +
include/linux/signal.h | 5 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/prctl.h | 4 +
kernel/fork.c | 3 +
kernel/signal.c | 39 ++++++
kernel/sys.c | 15 ++
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/predump-sig-test.c | 171 +++++++++++++++++++++++
12 files changed, 258 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGSEGV != 7);
BUILD_BUG_ON(NSIGBUS != 5);
BUILD_BUG_ON(NSIGTRAP != 5);
- BUILD_BUG_ON(NSIGCHLD != 6);
+ BUILD_BUG_ON(NSIGCHLD != 7);
BUILD_BUG_ON(NSIGSYS != 1);

/* This is part of the ABI and can never change in size: */
diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..f11e31f 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
struct cred *cred;
int retval = 0;
int ispipe;
+ bool notify;
struct files_struct *displaced;
/* require nonrelative corefile path and be extra careful */
bool need_suid_safe = false;
@@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
if (retval < 0)
goto fail_creds;

+ /*
+ * Send the pre-coredump signal to the parent if requested.
+ */
+ read_lock(&tasklist_lock);
+ notify = do_notify_parent_predump(current);
+ read_unlock(&tasklist_lock);
+ if (notify)
+ cond_resched();
+
old_cred = override_creds(cred);

ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;

+ /* Clear the pre-coredump signal before loading a new binary */
+ sig->predump_signal = 0;
+
#ifdef CONFIG_POSIX_TIMERS
exit_itimers(sig);
flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..047829d 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@ struct signal_struct {
int group_stop_count;
unsigned int flags; /* see SIGNAL_* flags below */

+ /* The signal sent prior to a child's coredump */
+ int predump_signal;
+
/*
* PR_SET_CHILD_SUBREAPER marks a process, like a service
* manager, to re-parent orphan (double-forking) child processes
@@ -332,6 +335,7 @@ extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
extern int kill_pgrp(struct pid *pid, int sig, int priv);
extern int kill_pid(struct pid *pid, int sig, int priv);
extern __must_check bool do_notify_parent(struct task_struct *, int);
+extern bool do_notify_parent_predump(struct task_struct *p);
extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
extern void force_sig(int, struct task_struct *);
extern int send_sig(int, struct task_struct *, int);
diff --git a/include/linux/signal.h b/include/linux/signal.h
index f428e86..6e1b2c9 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -262,6 +262,11 @@ static inline int valid_signal(unsigned long sig)
return sig <= _NSIG ? 1 : 0;
}

+static inline int valid_predump_signal(int sig)
+{
+ return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
+}
+
struct timespec;
struct pt_regs;
enum pid_type;
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c2..1a47cef 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -267,7 +267,8 @@ struct { \
#define CLD_TRAPPED 4 /* traced child has trapped */
#define CLD_STOPPED 5 /* child has stopped */
#define CLD_CONTINUED 6 /* stopped child has continued */
-#define NSIGCHLD 6
+#define CLD_PREDUMP 7 /* child is about to dump core */
+#define NSIGCHLD 7

/*
* SIGPOLL (or any other signal without signal specific si_codes) si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)

+/* Whether to receive signal prior to child's coredump */
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff..a9c9988 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1553,6 +1553,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
tty_audit_fork(sig);
sched_autogroup_fork(sig);

+ /* Clear the pre-coredump signal for the child of fork(2) */
+ sig->predump_signal = 0;
+
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;

diff --git a/kernel/signal.c b/kernel/signal.c
index 9a32bc2..e4c154b 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1855,6 +1855,45 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
return autoreap;
}

+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ *
+ * The caller must hold at least the read lock on tasklist_lock.
+ */
+bool do_notify_parent_predump(struct task_struct *tsk)
+{
+ struct sighand_struct *sighand;
+ struct kernel_siginfo info;
+ struct task_struct *parent;
+ unsigned long flags;
+ pid_t pid;
+ int sig;
+
+ parent = tsk->parent;
+ sighand = parent->sighand;
+ pid = task_tgid_vnr(tsk);
+
+ spin_lock_irqsave(&sighand->siglock, flags);
+ sig = parent->signal->predump_signal;
+ if (!valid_predump_signal(sig)) {
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+ return false;
+ }
+
+ clear_siginfo(&info);
+ info.si_pid = pid;
+ info.si_signo = sig;
+ if (sig == SIGCHLD)
+ info.si_code = CLD_PREDUMP;
+
+ __group_send_sig_info(sig, &info, parent);
+ __wake_up_parent(tsk, parent);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+ return true;
+}
+
/**
* do_notify_parent_cldstop - notify parent of stopped/continued state change
* @tsk: task reporting the state change
diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..8ed9a63 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,21 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+
+ /* 0 is valid for disabling the feature */
+ if (arg2 && !valid_predump_signal((int)arg2))
+ return -EINVAL;
+ me->signal->predump_signal = (int)arg2;
+ break;
+ case PR_GET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->predump_signal,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index c7923b2..f8d60d5 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)

ifeq ($(ARCH),x86)
TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
- disable-tsc-test
+ disable-tsc-test predump-sig-test
all: $(TEST_PROGS)

include ../lib.mk
diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
new file mode 100644
index 0000000..57042f7
--- /dev/null
+++ b/tools/testing/selftests/prctl/predump-sig-test.c
@@ -0,0 +1,171 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
+ *
+ * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
+ *
+ * When set with prctl(), the specified signal is sent to the parent process
+ * prior to the coredump of a child process.
+ *
+ * Usage: ./predump-sig-test {SIGUSR1 | SIGUSR2 | SIGCHLD}
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <sys/signalfd.h>
+#include <errno.h>
+
+#ifndef PR_SET_PREDUMP_SIG
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+#endif
+
+#ifndef CLD_PREDUMP
+#define CLD_PREDUMP 7 /* child is about to dump core */
+#endif
+
+#define handle_error(msg) \
+ do { perror(msg); exit(EXIT_FAILURE); } while (0)
+
+static int test_prctl(int sig)
+{
+ int sig2, rc;
+
+ rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: setting");
+
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: getting");
+
+ if (sig2 != sig) {
+ printf("prctl: sig %d, post %d\n", sig, sig2);
+ return -1;
+ }
+ return 0;
+}
+
+static int sigfd;
+static int predump_signal;
+
+static int init_signalfd(void)
+{
+ sigset_t mask;
+ int sfd;
+
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGCHLD);
+ if (predump_signal && (predump_signal != SIGCHLD))
+ sigaddset(&mask, predump_signal);
+
+ /*
+ * Block signals so that they aren't handled according to their
+ * default dispositions.
+ */
+ if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
+ handle_error("sigprocmask");
+
+ sfd = signalfd(-1, &mask, SFD_CLOEXEC);
+ if (sfd == -1)
+ handle_error("signalfd");
+
+ return sfd;
+}
+
+static void parent_func(pid_t child_pid)
+{
+ struct signalfd_siginfo si;
+ int count = 0;
+ ssize_t s;
+
+ for (;;) {
+ s = read(sigfd, &si, sizeof(struct signalfd_siginfo));
+ if (s != sizeof(struct signalfd_siginfo))
+ handle_error("read");
+
+ count++;
+ printf("\nReceived signal: ssi_pid %ld, ssi_signo %d\n",
+ si.ssi_pid, si.ssi_signo);
+ printf("siginfo: ssi_errno %d, ssi_code %d, ssi_status %d\n",
+ si.ssi_errno, si.ssi_code, si.ssi_status);
+
+ if (si.ssi_signo == SIGCHLD) {
+ if (si.ssi_code == CLD_PREDUMP)
+ printf("predump signal\n");
+ else
+ break;
+ } else if (si.ssi_signo == predump_signal)
+ printf("predump signal\n");
+ }
+
+ printf("Test result: %s\n", (count == 2) ? "PASS" : "FAIL");
+ fflush(stdout);
+}
+
+static void child_func(void)
+{
+ int rc, sig;
+
+ printf("\nChild pid: %ld\n", (long)getpid());
+
+ /* Test: Child should not inherit the predump_signal */
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: child");
+
+ printf("child: predump_signal %d\n", sig);
+
+ /* Force coredump here */
+ printf("child: calling abort()\n");
+ fflush(stdout);
+ abort();
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t child_pid;
+ int rc;
+
+ if (argc != 2) {
+ printf("invalid number of arguments\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (strcmp(argv[1], "SIGUSR1") == 0)
+ predump_signal = SIGUSR1;
+ else if (strcmp(argv[1], "SIGUSR2") == 0)
+ predump_signal = SIGUSR2;
+ else if (strcmp(argv[1], "SIGCHLD") == 0)
+ predump_signal = SIGCHLD;
+ else {
+ printf("invalid argument for signal\n");
+ fflush(stdout);
+ exit(EXIT_FAILURE);
+ }
+
+ /* Test: prctl() setting */
+ rc = test_prctl(0);
+ printf("prctl: sig %d %s\n", 0, (rc == 0) ? "PASS" : "FAIL");
+ rc = test_prctl(predump_signal);
+ printf("prctl: sig %d %s\n",
+ predump_signal, (rc == 0) ? "PASS" : "FAIL");
+
+ /* Init signalfd */
+ sigfd = init_signalfd();
+
+ child_pid = fork();
+ if (child_pid == -1)
+ handle_error("fork");
+
+ if (child_pid == 0) { /* Code executed by child */
+ child_func();
+ } else { /* Code executed by parent */
+ parent_func(child_pid);
+ exit(EXIT_SUCCESS);
+ }
+}
--
1.8.3.1


2018-10-23 09:25:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

On 10/22, Enke Chen wrote:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.

Personally I still do not like this feature, but I won't argue.

> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> struct cred *cred;
> int retval = 0;
> int ispipe;
> + bool notify;
> struct files_struct *displaced;
> /* require nonrelative corefile path and be extra careful */
> bool need_suid_safe = false;
> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> if (retval < 0)
> goto fail_creds;
>
> + /*
> + * Send the pre-coredump signal to the parent if requested.
> + */
> + read_lock(&tasklist_lock);
> + notify = do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + if (notify)
> + cond_resched();

Hmm. I do not understand why do we need cond_resched(). And even if we need it,
why we can't call it unconditionally?

I'd also suggest to move read_lock/unlock(tasklist) into do_notify_parent_predump()
and remove the "task_struct *tsk" argument, tsk is always current.

Yes, do_notify_parent() and do_notify_parent_cldstop() are called with tasklist_lock
held, but there are good reasons for that.


> +static inline int valid_predump_signal(int sig)
> +{
> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
> +}

I still do not understand why do we need to restrict predump_signal.

PR_SET_PREDUMP_SIG can only change the caller's ->predump_signal, so to me
even PR_SET_PREDUMP_SIG(SIGKILL) is fine.

And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG
is pointless if you have 2 or more children.

> +bool do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct kernel_siginfo info;
> + struct task_struct *parent;
> + unsigned long flags;
> + pid_t pid;
> + int sig;
> +
> + parent = tsk->parent;
> + sighand = parent->sighand;
> + pid = task_tgid_vnr(tsk);
> +
> + spin_lock_irqsave(&sighand->siglock, flags);
> + sig = parent->signal->predump_signal;
> + if (!valid_predump_signal(sig)) {
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> + return false;
> + }

Why do we need to check parent->signal->predump_signal under ->siglock?
This complicates the code for no reason, afaics.

> + clear_siginfo(&info);
> + info.si_pid = pid;
> + info.si_signo = sig;
> + if (sig == SIGCHLD)
> + info.si_code = CLD_PREDUMP;
> +
> + __group_send_sig_info(sig, &info, parent);
> + __wake_up_parent(tsk, parent);

Why __wake_up_parent() ?

do_notify_parent() does this to wake up the parent sleeping in do_wait(), to
report the event. But predump_signal has nothing to do with wait().

Now. This version sends the signal to ->parent, not ->real_parent. OK, but this
means that real_parent won't be notified if its child is traced.


> + case PR_SET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> +
> + /* 0 is valid for disabling the feature */
> + if (arg2 && !valid_predump_signal((int)arg2))
> + return -EINVAL;
> + me->signal->predump_signal = (int)arg2;
> + break;

Again, I do not understand why do we need valid_predump_signal(). But even
if we need it, I don't understand why should we check it twice. IOW, why
do_notify_parent_predump() can't simply check ->predump_signal != 0?

Whatever we do, PR_SET_PREDUMP_SIG should validate arg2 anyway. Who else can
change ->predump_signal after that?

> + case PR_GET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = put_user(me->signal->predump_signal,
> + (int __user *)arg2);

To me it would be better to simply return ->predump_signal, iow

error = me->signal->predump_signal;
break;

but I won't insist, this is subjective and cosmetic.

Oleg.


2018-10-23 20:16:53

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

Thanks for your review. Please see my replies inline.

On 10/23/18 2:23 AM, Oleg Nesterov wrote:
> On 10/22, Enke Chen wrote:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>
> Personally I still do not like this feature, but I won't argue.
>
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>> struct cred *cred;
>> int retval = 0;
>> int ispipe;
>> + bool notify;
>> struct files_struct *displaced;
>> /* require nonrelative corefile path and be extra careful */
>> bool need_suid_safe = false;
>> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>> if (retval < 0)
>> goto fail_creds;
>>
>> + /*
>> + * Send the pre-coredump signal to the parent if requested.
>> + */
>> + read_lock(&tasklist_lock);
>> + notify = do_notify_parent_predump(current);
>> + read_unlock(&tasklist_lock);
>> + if (notify)
>> + cond_resched();
>
> Hmm. I do not understand why do we need cond_resched(). And even if we need it,
> why we can't call it unconditionally?

Remember the goal is to allow the parent (e.g., a process manager) to take early
action. The "yield" before doing coredump will help.

The yield is made conditional because the notification is conditional.
Is that ok?

>
> I'd also suggest to move read_lock/unlock(tasklist) into do_notify_parent_predump()
> and remove the "task_struct *tsk" argument, tsk is always current.
>
> Yes, do_notify_parent() and do_notify_parent_cldstop() are called with tasklist_lock
> held, but there are good reasons for that.

Sure I will make the suggested changes. This function is only called in one place.

>
>
>> +static inline int valid_predump_signal(int sig)
>> +{
>> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
>> +}
>
> I still do not understand why do we need to restrict predump_signal.
>
> PR_SET_PREDUMP_SIG can only change the caller's ->predump_signal, so to me
> even PR_SET_PREDUMP_SIG(SIGKILL) is fine.

I will remove it to reduce the code size and give more flexibility to the application.

>
> And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG
> is pointless if you have 2 or more children.

Hmm, could you point me to the code where SIGCHLD/SIGUSR is treated differently
w.r.t. queuing? That does not sound right to me.

>
>> +bool do_notify_parent_predump(struct task_struct *tsk)
>> +{
>> + struct sighand_struct *sighand;
>> + struct kernel_siginfo info;
>> + struct task_struct *parent;
>> + unsigned long flags;
>> + pid_t pid;
>> + int sig;
>> +
>> + parent = tsk->parent;
>> + sighand = parent->sighand;
>> + pid = task_tgid_vnr(tsk);
>> +
>> + spin_lock_irqsave(&sighand->siglock, flags);
>> + sig = parent->signal->predump_signal;
>> + if (!valid_predump_signal(sig)) {
>> + spin_unlock_irqrestore(&sighand->siglock, flags);
>> + return false;
>> + }
>
> Why do we need to check parent->signal->predump_signal under ->siglock?
> This complicates the code for no reason, afaics.
>
>> + clear_siginfo(&info);
>> + info.si_pid = pid;
>> + info.si_signo = sig;
>> + if (sig == SIGCHLD)
>> + info.si_code = CLD_PREDUMP;
>> +
>> + __group_send_sig_info(sig, &info, parent);
>> + __wake_up_parent(tsk, parent);
>
> Why __wake_up_parent() ?

not needed, and will remove.

>
> do_notify_parent() does this to wake up the parent sleeping in do_wait(), to
> report the event. But predump_signal has nothing to do with wait().
>
> Now. This version sends the signal to ->parent, not ->real_parent. OK, but this
> means that real_parent won't be notified if its child is traced.
> >
>> + case PR_SET_PREDUMP_SIG:
>> + if (arg3 || arg4 || arg5)
>> + return -EINVAL;
>> +
>> + /* 0 is valid for disabling the feature */
>> + if (arg2 && !valid_predump_signal((int)arg2))
>> + return -EINVAL;
>> + me->signal->predump_signal = (int)arg2;
>> + break;
>
> Again, I do not understand why do we need valid_predump_signal(). But even
> if we need it, I don't understand why should we check it twice. IOW, why
> do_notify_parent_predump() can't simply check ->predump_signal != 0?
>
> Whatever we do, PR_SET_PREDUMP_SIG should validate arg2 anyway. Who else can
> change ->predump_signal after that?

Ok, will relax.

>
>> + case PR_GET_PREDUMP_SIG:
>> + if (arg3 || arg4 || arg5)
>> + return -EINVAL;
>> + error = put_user(me->signal->predump_signal,
>> + (int __user *)arg2);
>
> To me it would be better to simply return ->predump_signal, iow
>
> error = me->signal->predump_signal;
> break;
>
> but I won't insist, this is subjective and cosmetic.

Vast majority of system calls returns 0 or -1. So does PR_GET_PDEATHSIG.
I would like to keep them consistent.

Thanks again.

-- Enke

2018-10-23 21:41:02

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

On 10/23/18 12:43 PM, Enke Chen wrote:

>>
>>> --- a/fs/coredump.c
>>> +++ b/fs/coredump.c
>>> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>> struct cred *cred;
>>> int retval = 0;
>>> int ispipe;
>>> + bool notify;
>>> struct files_struct *displaced;
>>> /* require nonrelative corefile path and be extra careful */
>>> bool need_suid_safe = false;
>>> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>>> if (retval < 0)
>>> goto fail_creds;
>>>
>>> + /*
>>> + * Send the pre-coredump signal to the parent if requested.
>>> + */
>>> + read_lock(&tasklist_lock);
>>> + notify = do_notify_parent_predump(current);
>>> + read_unlock(&tasklist_lock);
>>> + if (notify)
>>> + cond_resched();
>>
>> Hmm. I do not understand why do we need cond_resched(). And even if we need it,
>> why we can't call it unconditionally?
>
> Remember the goal is to allow the parent (e.g., a process manager) to take early
> action. The "yield" before doing coredump will help.
>
> The yield is made conditional because the notification is conditional.
> Is that ok?

Given this is in do_coredump(), it is ok to make it unconditional for simplicity.

>>
>>> +bool do_notify_parent_predump(struct task_struct *tsk)
>>> +{
>>> + struct sighand_struct *sighand;
>>> + struct kernel_siginfo info;
>>> + struct task_struct *parent;
>>> + unsigned long flags;
>>> + pid_t pid;
>>> + int sig;
>>> +
>>> + parent = tsk->parent;
>>> + sighand = parent->sighand;
>>> + pid = task_tgid_vnr(tsk);
>>> +
>>> + spin_lock_irqsave(&sighand->siglock, flags);
>>> + sig = parent->signal->predump_signal;
>>> + if (!valid_predump_signal(sig)) {
>>> + spin_unlock_irqrestore(&sighand->siglock, flags);
>>> + return false;
>>> + }
>>
>> Why do we need to check parent->signal->predump_signal under ->siglock?
>> This complicates the code for no reason, afaics.

Will simplify.

Thanks. -- Enke


2018-10-24 05:44:47

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH v3] kernel/signal: Signal-based pre-coredump notification

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification. A new signal code CLD_PREDUMP is also
defined for SIGCHLD.

Changes to prctl(2):

PR_SET_PREDUMP_SIG (since Linux 4.20.x)
Set the child pre-coredump signal of the calling process to
arg2 (either a signal value in the range 1..maxsig, or 0 to
clear). This is the signal that the calling process will get
prior to the coredump of a child process. This value is
cleared across execve(2), or for the child of a fork(2).

When SIGCHLD is specified, the signal code will be set to
CLD_PREDUMP in such an SIGCHLD signal.

PR_GET_PREDUMP_SIG (since Linux 4.20.x)
Return the current value of the child pre-coredump signal,
in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event POSIX Signal Connector-based
----------------------------------------------------------------------
ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_STOPPED

ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_CONTINUED

pre_coredump/ N/A proc_coredump_connector()
get_signal()

post_coredump/ do_notify_parent() proc_exit_connector()
do_exit() SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen <[email protected]>
---
v2 -> v3:

Addressed review comments from Oleg Nesterov, including:
o remove the restriction on signal for PR_SET_PREDUMP_SIG.
o code simplification

arch/x86/kernel/signal_compat.c | 2 +-
fs/coredump.c | 6 +
fs/exec.c | 3 +
include/linux/sched/signal.h | 4 +
include/uapi/asm-generic/siginfo.h | 3 +-
include/uapi/linux/prctl.h | 4 +
kernel/fork.c | 3 +
kernel/signal.c | 31 +++++
kernel/sys.c | 13 ++
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/predump-sig-test.c | 169 +++++++++++++++++++++++
11 files changed, 237 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
index 9ccbf05..a3deba8 100644
--- a/arch/x86/kernel/signal_compat.c
+++ b/arch/x86/kernel/signal_compat.c
@@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
BUILD_BUG_ON(NSIGSEGV != 7);
BUILD_BUG_ON(NSIGBUS != 5);
BUILD_BUG_ON(NSIGTRAP != 5);
- BUILD_BUG_ON(NSIGCHLD != 6);
+ BUILD_BUG_ON(NSIGCHLD != 7);
BUILD_BUG_ON(NSIGSYS != 1);

/* This is part of the ABI and can never change in size: */
diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..d6ca1a3 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -590,6 +590,12 @@ void do_coredump(const kernel_siginfo_t *siginfo)
if (retval < 0)
goto fail_creds;

+ /*
+ * Send the pre-coredump signal to the parent if requested.
+ */
+ do_notify_parent_predump();
+ cond_resched();
+
old_cred = override_creds(cred);

ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;

+ /* Clear the pre-coredump signal before loading a new binary */
+ sig->predump_signal = 0;
+
#ifdef CONFIG_POSIX_TIMERS
exit_itimers(sig);
flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..132ce08 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@ struct signal_struct {
int group_stop_count;
unsigned int flags; /* see SIGNAL_* flags below */

+ /* The signal sent prior to a child's coredump */
+ int predump_signal;
+
/*
* PR_SET_CHILD_SUBREAPER marks a process, like a service
* manager, to re-parent orphan (double-forking) child processes
@@ -332,6 +335,7 @@ extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
extern int kill_pgrp(struct pid *pid, int sig, int priv);
extern int kill_pid(struct pid *pid, int sig, int priv);
extern __must_check bool do_notify_parent(struct task_struct *, int);
+extern void do_notify_parent_predump(void);
extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
extern void force_sig(int, struct task_struct *);
extern int send_sig(int, struct task_struct *, int);
diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
index cb3d6c2..1a47cef 100644
--- a/include/uapi/asm-generic/siginfo.h
+++ b/include/uapi/asm-generic/siginfo.h
@@ -267,7 +267,8 @@ struct { \
#define CLD_TRAPPED 4 /* traced child has trapped */
#define CLD_STOPPED 5 /* child has stopped */
#define CLD_CONTINUED 6 /* stopped child has continued */
-#define NSIGCHLD 6
+#define CLD_PREDUMP 7 /* child is about to dump core */
+#define NSIGCHLD 7

/*
* SIGPOLL (or any other signal without signal specific si_codes) si_codes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)

+/* Whether to receive signal prior to child's coredump */
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cddff..8e30a00 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1553,6 +1553,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
tty_audit_fork(sig);
sched_autogroup_fork(sig);

+ /* Clear the pre-coredump signal for the child */
+ sig->predump_signal = 0;
+
sig->oom_score_adj = current->signal->oom_score_adj;
sig->oom_score_adj_min = current->signal->oom_score_adj_min;

diff --git a/kernel/signal.c b/kernel/signal.c
index 9a32bc2..904ad8a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1855,6 +1855,37 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
return autoreap;
}

+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+void do_notify_parent_predump(void)
+{
+ struct sighand_struct *sighand;
+ struct kernel_siginfo info;
+ struct task_struct *parent;
+ unsigned long flags;
+ int sig;
+
+ read_lock(&tasklist_lock);
+ parent = current->parent;
+ sig = parent->signal->predump_signal;
+ if (sig != 0) {
+ clear_siginfo(&info);
+ info.si_pid = task_tgid_vnr(current);
+ info.si_signo = sig;
+ if (sig == SIGCHLD)
+ info.si_code = CLD_PREDUMP;
+
+ sighand = parent->sighand;
+ spin_lock_irqsave(&sighand->siglock, flags);
+ __group_send_sig_info(sig, &info, parent);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+ }
+ read_unlock(&tasklist_lock);
+}
+
/**
* do_notify_parent_cldstop - notify parent of stopped/continued state change
* @tsk: task reporting the state change
diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..39aa3b8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (!valid_signal((int)arg2))
+ return -EINVAL;
+ me->signal->predump_signal = (int)arg2;
+ break;
+ case PR_GET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->predump_signal,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index c7923b2..f8d60d5 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)

ifeq ($(ARCH),x86)
TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
- disable-tsc-test
+ disable-tsc-test predump-sig-test
all: $(TEST_PROGS)

include ../lib.mk
diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
new file mode 100644
index 0000000..1b93521
--- /dev/null
+++ b/tools/testing/selftests/prctl/predump-sig-test.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
+ *
+ * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
+ *
+ * When set with prctl(), the specified signal is sent to the parent process
+ * prior to the coredump of a child process.
+ *
+ * Usage: ./predump-sig-test {SIGUSR1 | SIGCHLD}
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <sys/signalfd.h>
+#include <errno.h>
+
+#ifndef PR_SET_PREDUMP_SIG
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+#endif
+
+#ifndef CLD_PREDUMP
+#define CLD_PREDUMP 7 /* child is about to dump core */
+#endif
+
+#define handle_error(msg) \
+ do { perror(msg); exit(EXIT_FAILURE); } while (0)
+
+static int test_prctl(int sig)
+{
+ int sig2, rc;
+
+ rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: setting");
+
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: getting");
+
+ if (sig2 != sig) {
+ printf("prctl: sig %d, post %d\n", sig, sig2);
+ return -1;
+ }
+ return 0;
+}
+
+static int sigfd;
+static int predump_signal;
+
+static int init_signalfd(void)
+{
+ sigset_t mask;
+ int sfd;
+
+ sigemptyset(&mask);
+ sigaddset(&mask, SIGCHLD);
+ if (predump_signal && (predump_signal != SIGCHLD))
+ sigaddset(&mask, predump_signal);
+
+ /*
+ * Block signals so that they aren't handled according to their
+ * default dispositions.
+ */
+ if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
+ handle_error("sigprocmask");
+
+ sfd = signalfd(-1, &mask, SFD_CLOEXEC);
+ if (sfd == -1)
+ handle_error("signalfd");
+
+ return sfd;
+}
+
+static void parent_fn(pid_t child_pid)
+{
+ struct signalfd_siginfo si;
+ int count = 0;
+ ssize_t s;
+
+ for (;;) {
+ s = read(sigfd, &si, sizeof(struct signalfd_siginfo));
+ if (s != sizeof(struct signalfd_siginfo))
+ handle_error("read");
+
+ count++;
+ printf("\nReceived signal: ssi_pid %ld, ssi_signo %d\n",
+ si.ssi_pid, si.ssi_signo);
+ printf("siginfo: ssi_errno %d, ssi_code %d, ssi_status %d\n",
+ si.ssi_errno, si.ssi_code, si.ssi_status);
+
+ if (si.ssi_signo == SIGCHLD) {
+ if (si.ssi_code == CLD_PREDUMP)
+ printf("predump signal\n");
+ else
+ break;
+ } else if (si.ssi_signo == predump_signal)
+ printf("predump signal\n");
+ }
+
+ printf("Test result: %s\n", (count == 2) ? "PASS" : "FAIL");
+ fflush(stdout);
+}
+
+static void child_fn(void)
+{
+ int rc, sig;
+
+ printf("\nChild pid: %ld\n", (long)getpid());
+
+ /* Test: Child should not inherit the predump_signal */
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: child");
+
+ printf("child: predump_signal %d\n", sig);
+
+ /* Force coredump here */
+ printf("child: calling abort()\n");
+ fflush(stdout);
+ abort();
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t child_pid;
+ int rc;
+
+ if (argc != 2) {
+ printf("invalid number of arguments\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (strcmp(argv[1], "SIGUSR1") == 0)
+ predump_signal = SIGUSR1;
+ else if (strcmp(argv[1], "SIGCHLD") == 0)
+ predump_signal = SIGCHLD;
+ else {
+ printf("invalid argument for signal\n");
+ fflush(stdout);
+ exit(EXIT_FAILURE);
+ }
+
+ /* Test: prctl() setting */
+ rc = test_prctl(0);
+ printf("prctl: sig %d %s\n", 0, (rc == 0) ? "PASS" : "FAIL");
+ rc = test_prctl(predump_signal);
+ printf("prctl: sig %d %s\n",
+ predump_signal, (rc == 0) ? "PASS" : "FAIL");
+
+ /* Init signalfd */
+ sigfd = init_signalfd();
+
+ child_pid = fork();
+ if (child_pid == -1)
+ handle_error("fork");
+
+ if (child_pid == 0) { /* Code executed by child */
+ child_fn();
+ } else { /* Code executed by parent */
+ parent_fn(child_pid);
+ exit(EXIT_SUCCESS);
+ }
+}
--
1.8.3.1


2018-10-24 13:32:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Enke Chen <[email protected]> writes:

> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>
> Changes to prctl(2):
>
> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
> Set the child pre-coredump signal of the calling process to
> arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
> This is the signal that the calling process will get prior to
> the coredump of a child process. This value is cleared across
> execve(2), or for the child of a fork(2).
>
> When SIGCHLD is specified, the signal code will be set to
> CLD_PREDUMP in such an SIGCHLD signal.

Your signal handling is still not right. Please read and comprehend
siginfo_layout.

You have not filled in all of the required fields for the SIGCHLD case.
For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
very wrong. This is not a user generated signal.

Let me say this slowly. The pair si_signo si_code determines the union
member of struct siginfo. That needs to be handled consistently. You
aren't. I just finished fixing this up in the entire kernel and now you
are trying to add a usage that is worst than most of the bugs I have
fixed. I really don't appreciate having to deal with no bugs.



Further siginfo can be dropped. Multiple signals with the same signal
number can be consolidated. What is your plan for dealing with that?
Other code paths pair with wait to get the information out. There
is no equivalent of wait in your code.

Signals can be delayed by quite a bit, scheduling delays etc. They can
not provide any meaningful kind of real time notification.

So between delays and loss of information signals appear to be a very
poor fit for this usecase.

I am concerned about code that does not fit the usecase well because
such code winds up as code that no one cares about that must be
maintained indefinitely, because somewhere out there there is one use
that would break if the interface was removed. This does not feel like
an interface people will want to use and maintain in proper working
order forever.

Ugh. Your test case is even using signalfd. So you don't even want
this signal to be delivered as a signal.

You add an interface that takes a pointer and you don't add a compat
interface. See Oleg's point of just returning the signal number in the
return code.

Now I am wondering how well prctl works from a 32bit process on a 64bit
kernel. At first glance it looks like it probably does not work.

Consistency with PDEATHSIG is not a good argument for anything.
PDEATHSIG at the present time is unusable in the real world by most
applications that want something like it.

So far I see an interface that even you don't want to use as designed,
that is implemented incorrectly.

The concern is real and deserves to be addressed. I don't think signals
are the right way to handle it, and certainly not this patch as it
stands.

Eric


> diff --git a/arch/x86/kernel/signal_compat.c b/arch/x86/kernel/signal_compat.c
> index 9ccbf05..a3deba8 100644
> --- a/arch/x86/kernel/signal_compat.c
> +++ b/arch/x86/kernel/signal_compat.c
> @@ -30,7 +30,7 @@ static inline void signal_compat_build_tests(void)
> BUILD_BUG_ON(NSIGSEGV != 7);
> BUILD_BUG_ON(NSIGBUS != 5);
> BUILD_BUG_ON(NSIGTRAP != 5);
> - BUILD_BUG_ON(NSIGCHLD != 6);
> + BUILD_BUG_ON(NSIGCHLD != 7);
> BUILD_BUG_ON(NSIGSYS != 1);
>
> /* This is part of the ABI and can never change in size: */
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e42e17e..f11e31f 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -546,6 +546,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> struct cred *cred;
> int retval = 0;
> int ispipe;
> + bool notify;
> struct files_struct *displaced;
> /* require nonrelative corefile path and be extra careful */
> bool need_suid_safe = false;
> @@ -590,6 +591,15 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> if (retval < 0)
> goto fail_creds;
>
> + /*
> + * Send the pre-coredump signal to the parent if requested.
> + */
> + read_lock(&tasklist_lock);
> + notify = do_notify_parent_predump(current);
> + read_unlock(&tasklist_lock);
> + if (notify)
> + cond_resched();
> +
> old_cred = override_creds(cred);
>
> ispipe = format_corename(&cn, &cprm);
> diff --git a/fs/exec.c b/fs/exec.c
> index fc281b7..7714da7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
> /* we have changed execution domain */
> tsk->exit_signal = SIGCHLD;
>
> + /* Clear the pre-coredump signal before loading a new binary */
> + sig->predump_signal = 0;
> +
> #ifdef CONFIG_POSIX_TIMERS
> exit_itimers(sig);
> flush_itimer_signals();
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 13789d1..047829d 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -112,6 +112,9 @@ struct signal_struct {
> int group_stop_count;
> unsigned int flags; /* see SIGNAL_* flags below */
>
> + /* The signal sent prior to a child's coredump */
> + int predump_signal;
> +
> /*
> * PR_SET_CHILD_SUBREAPER marks a process, like a service
> * manager, to re-parent orphan (double-forking) child processes
> @@ -332,6 +335,7 @@ extern int kill_pid_info_as_cred(int, struct kernel_siginfo *, struct pid *,
> extern int kill_pgrp(struct pid *pid, int sig, int priv);
> extern int kill_pid(struct pid *pid, int sig, int priv);
> extern __must_check bool do_notify_parent(struct task_struct *, int);
> +extern bool do_notify_parent_predump(struct task_struct *p);
> extern void __wake_up_parent(struct task_struct *p, struct task_struct *parent);
> extern void force_sig(int, struct task_struct *);
> extern int send_sig(int, struct task_struct *, int);
> diff --git a/include/linux/signal.h b/include/linux/signal.h
> index f428e86..6e1b2c9 100644
> --- a/include/linux/signal.h
> +++ b/include/linux/signal.h
> @@ -262,6 +262,11 @@ static inline int valid_signal(unsigned long sig)
> return sig <= _NSIG ? 1 : 0;
> }
>
> +static inline int valid_predump_signal(int sig)
> +{
> + return (sig == SIGCHLD) || (sig == SIGUSR1) || (sig == SIGUSR2);
> +}
> +
> struct timespec;
> struct pt_regs;
> enum pid_type;
> diff --git a/include/uapi/asm-generic/siginfo.h b/include/uapi/asm-generic/siginfo.h
> index cb3d6c2..1a47cef 100644
> --- a/include/uapi/asm-generic/siginfo.h
> +++ b/include/uapi/asm-generic/siginfo.h
> @@ -267,7 +267,8 @@ struct { \
> #define CLD_TRAPPED 4 /* traced child has trapped */
> #define CLD_STOPPED 5 /* child has stopped */
> #define CLD_CONTINUED 6 /* stopped child has continued */
> -#define NSIGCHLD 6
> +#define CLD_PREDUMP 7 /* child is about to dump core */
> +#define NSIGCHLD 7
>
> /*
> * SIGPOLL (or any other signal without signal specific si_codes) si_codes
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9a32bc2..e4c154b 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1855,6 +1855,45 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
> return autoreap;
> }
>
> +/*
> + * While do_notify_parent() notifies the parent of a child's death post
> + * its coredump, this function lets the parent (if so desired) know about
> + * the imminent death of a child just prior to its coredump.
> + *
> + * The caller must hold at least the read lock on tasklist_lock.
> + */
> +bool do_notify_parent_predump(struct task_struct *tsk)
> +{
> + struct sighand_struct *sighand;
> + struct kernel_siginfo info;
> + struct task_struct *parent;
> + unsigned long flags;
> + pid_t pid;
> + int sig;
> +
> + parent = tsk->parent;
> + sighand = parent->sighand;
> + pid = task_tgid_vnr(tsk);
> +
> + spin_lock_irqsave(&sighand->siglock, flags);
> + sig = parent->signal->predump_signal;
> + if (!valid_predump_signal(sig)) {
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> + return false;
> + }
> +
> + clear_siginfo(&info);
> + info.si_pid = pid;
> + info.si_signo = sig;
> + if (sig == SIGCHLD)
> + info.si_code = CLD_PREDUMP;



> +
> + __group_send_sig_info(sig, &info, parent);
> + __wake_up_parent(tsk, parent);
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> + return true;
> +}
> +
> /**
> * do_notify_parent_cldstop - notify parent of stopped/continued state change
> * @tsk: task reporting the state change
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..8ed9a63 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2476,6 +2476,21 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> break;
> + case PR_SET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> +
> + /* 0 is valid for disabling the feature */
> + if (arg2 && !valid_predump_signal((int)arg2))
> + return -EINVAL;
> + me->signal->predump_signal = (int)arg2;
> + break;
> + case PR_GET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = put_user(me->signal->predump_signal,
> + (int __user *)arg2);
> + break;
> default:
> error = -EINVAL;
> break;
> diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
> index c7923b2..f8d60d5 100644
> --- a/tools/testing/selftests/prctl/Makefile
> +++ b/tools/testing/selftests/prctl/Makefile
> @@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)
>
> ifeq ($(ARCH),x86)
> TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
> - disable-tsc-test
> + disable-tsc-test predump-sig-test
> all: $(TEST_PROGS)
>
> include ../lib.mk
> diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
> new file mode 100644
> index 0000000..57042f7
> --- /dev/null
> +++ b/tools/testing/selftests/prctl/predump-sig-test.c
> @@ -0,0 +1,171 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
> + *
> + * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
> + *
> + * When set with prctl(), the specified signal is sent to the parent process
> + * prior to the coredump of a child process.
> + *
> + * Usage: ./predump-sig-test {SIGUSR1 | SIGUSR2 | SIGCHLD}
> + */
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <unistd.h>
> +#include <sys/prctl.h>
> +#include <signal.h>
> +#include <sys/signalfd.h>
> +#include <errno.h>
> +
> +#ifndef PR_SET_PREDUMP_SIG
> +#define PR_SET_PREDUMP_SIG 54
> +#define PR_GET_PREDUMP_SIG 55
> +#endif
> +
> +#ifndef CLD_PREDUMP
> +#define CLD_PREDUMP 7 /* child is about to dump core */
> +#endif
> +
> +#define handle_error(msg) \
> + do { perror(msg); exit(EXIT_FAILURE); } while (0)
> +
> +static int test_prctl(int sig)
> +{
> + int sig2, rc;
> +
> + rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
> + if (rc < 0)
> + handle_error("prctl: setting");
> +
> + rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
> + if (rc < 0)
> + handle_error("prctl: getting");
> +
> + if (sig2 != sig) {
> + printf("prctl: sig %d, post %d\n", sig, sig2);
> + return -1;
> + }
> + return 0;
> +}
> +
> +static int sigfd;
> +static int predump_signal;
> +
> +static int init_signalfd(void)
> +{
> + sigset_t mask;
> + int sfd;
> +
> + sigemptyset(&mask);
> + sigaddset(&mask, SIGCHLD);
> + if (predump_signal && (predump_signal != SIGCHLD))
> + sigaddset(&mask, predump_signal);
> +
> + /*
> + * Block signals so that they aren't handled according to their
> + * default dispositions.
> + */
> + if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1)
> + handle_error("sigprocmask");
> +
> + sfd = signalfd(-1, &mask, SFD_CLOEXEC);
> + if (sfd == -1)
> + handle_error("signalfd");
> +
> + return sfd;
> +}
> +
> +static void parent_func(pid_t child_pid)
> +{
> + struct signalfd_siginfo si;
> + int count = 0;
> + ssize_t s;
> +
> + for (;;) {
> + s = read(sigfd, &si, sizeof(struct signalfd_siginfo));
> + if (s != sizeof(struct signalfd_siginfo))
> + handle_error("read");
> +
> + count++;
> + printf("\nReceived signal: ssi_pid %ld, ssi_signo %d\n",
> + si.ssi_pid, si.ssi_signo);
> + printf("siginfo: ssi_errno %d, ssi_code %d, ssi_status %d\n",
> + si.ssi_errno, si.ssi_code, si.ssi_status);
> +
> + if (si.ssi_signo == SIGCHLD) {
> + if (si.ssi_code == CLD_PREDUMP)
> + printf("predump signal\n");
> + else
> + break;
> + } else if (si.ssi_signo == predump_signal)
> + printf("predump signal\n");
> + }
> +
> + printf("Test result: %s\n", (count == 2) ? "PASS" : "FAIL");
> + fflush(stdout);
> +}
> +
> +static void child_func(void)
> +{
> + int rc, sig;
> +
> + printf("\nChild pid: %ld\n", (long)getpid());
> +
> + /* Test: Child should not inherit the predump_signal */
> + rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
> + if (rc < 0)
> + handle_error("prctl: child");
> +
> + printf("child: predump_signal %d\n", sig);
> +
> + /* Force coredump here */
> + printf("child: calling abort()\n");
> + fflush(stdout);
> + abort();
> +}
> +
> +int main(int argc, char *argv[])
> +{
> + pid_t child_pid;
> + int rc;
> +
> + if (argc != 2) {
> + printf("invalid number of arguments\n");
> + exit(EXIT_FAILURE);
> + }
> +
> + if (strcmp(argv[1], "SIGUSR1") == 0)
> + predump_signal = SIGUSR1;
> + else if (strcmp(argv[1], "SIGUSR2") == 0)
> + predump_signal = SIGUSR2;
> + else if (strcmp(argv[1], "SIGCHLD") == 0)
> + predump_signal = SIGCHLD;
> + else {
> + printf("invalid argument for signal\n");
> + fflush(stdout);
> + exit(EXIT_FAILURE);
> + }
> +
> + /* Test: prctl() setting */
> + rc = test_prctl(0);
> + printf("prctl: sig %d %s\n", 0, (rc == 0) ? "PASS" : "FAIL");
> + rc = test_prctl(predump_signal);
> + printf("prctl: sig %d %s\n",
> + predump_signal, (rc == 0) ? "PASS" : "FAIL");
> +
> + /* Init signalfd */
> + sigfd = init_signalfd();
> +
> + child_pid = fork();
> + if (child_pid == -1)
> + handle_error("fork");
> +
> + if (child_pid == 0) { /* Code executed by child */
> + child_func();
> + } else { /* Code executed by parent */
> + parent_func(child_pid);
> + exit(EXIT_SUCCESS);
> + }
> +}

2018-10-24 13:54:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

On 10/23, Enke Chen wrote:
>
> >> + /*
> >> + * Send the pre-coredump signal to the parent if requested.
> >> + */
> >> + read_lock(&tasklist_lock);
> >> + notify = do_notify_parent_predump(current);
> >> + read_unlock(&tasklist_lock);
> >> + if (notify)
> >> + cond_resched();
> >
> > Hmm. I do not understand why do we need cond_resched(). And even if we need it,
> > why we can't call it unconditionally?
>
> Remember the goal is to allow the parent (e.g., a process manager) to take early
> action. The "yield" before doing coredump will help.

I don't see how can it actually help...

cond_resched() is nop if CONFIG_PREEMPT or should_resched() == 0.

and the coredumping thread will certainly need to sleep/wait anyway.

> > And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG
> > is pointless if you have 2 or more children.
>
> Hmm, could you point me to the code where SIGCHLD/SIGUSR is treated differently
> w.r.t. queuing? That does not sound right to me.

see the legacy_queue() check. Any signal < SIGRTMIN do not queue. IOW, if SIGCHLD
is already pending, then next SIGCHLD is simply ignored.

Oleg.


2018-10-24 14:03:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v3] kernel/signal: Signal-based pre-coredump notification

On 10/23, Enke Chen wrote:
>
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -590,6 +590,12 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> if (retval < 0)
> goto fail_creds;
>
> + /*
> + * Send the pre-coredump signal to the parent if requested.
> + */
> + do_notify_parent_predump();
> + cond_resched();

I am still not sure cond_resched() makes any sense here...

> @@ -1553,6 +1553,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
> tty_audit_fork(sig);
> sched_autogroup_fork(sig);
>
> + /* Clear the pre-coredump signal for the child */
> + sig->predump_signal = 0;

No need, copy_signal() does zalloc().


> +void do_notify_parent_predump(void)
> +{
> + struct sighand_struct *sighand;
> + struct kernel_siginfo info;
> + struct task_struct *parent;
> + unsigned long flags;
> + int sig;
> +
> + read_lock(&tasklist_lock);
> + parent = current->parent;
> + sig = parent->signal->predump_signal;
> + if (sig != 0) {
> + clear_siginfo(&info);
> + info.si_pid = task_tgid_vnr(current);
> + info.si_signo = sig;
> + if (sig == SIGCHLD)
> + info.si_code = CLD_PREDUMP;
> +
> + sighand = parent->sighand;
> + spin_lock_irqsave(&sighand->siglock, flags);
> + __group_send_sig_info(sig, &info, parent);
> + spin_unlock_irqrestore(&sighand->siglock, flags);

You can just use do_send_sig_info() and remove sighand/flags/spin_lock_irqsave.

Perhaps the "likely" predump_signal==0 check at the start makes sense to avoid
read_lock(tasklist).

And I'd suggest to move it into coredump.c and make it static. It won't have
another user.

Oleg.


2018-10-24 21:56:53

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Olge:

On 10/24/18 6:52 AM, Oleg Nesterov wrote:
> On 10/23, Enke Chen wrote:
>>
>>>> + /*
>>>> + * Send the pre-coredump signal to the parent if requested.
>>>> + */
>>>> + read_lock(&tasklist_lock);
>>>> + notify = do_notify_parent_predump(current);
>>>> + read_unlock(&tasklist_lock);
>>>> + if (notify)
>>>> + cond_resched();
>>>
>>> Hmm. I do not understand why do we need cond_resched(). And even if we need it,
>>> why we can't call it unconditionally?
>>
>> Remember the goal is to allow the parent (e.g., a process manager) to take early
>> action. The "yield" before doing coredump will help.
>
> I don't see how can it actually help...
>
> cond_resched() is nop if CONFIG_PREEMPT or should_resched() == 0.
>
> and the coredumping thread will certainly need to sleep/wait anyway.

I am really surprised by this - cond_resched() is used in many places and it actually
does not do anything w/o CONFIG_PREEMPT.

Will remove.

>
>>> And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG
>>> is pointless if you have 2 or more children.
>>
>> Hmm, could you point me to the code where SIGCHLD/SIGUSR is treated differently
>> w.r.t. queuing? That does not sound right to me.
>
> see the legacy_queue() check. Any signal < SIGRTMIN do not queue. IOW, if SIGCHLD
> is already pending, then next SIGCHLD is simply ignored.

Got it. This means that a distinct signal (in particular a RT signal) would be more
preferred. This is what it is done in our application. You earlier suggestion about
removing the signal limitation makes a lot sense to me now.

Given that a distinct signal is more preferred, I am wondering if I should just remove
CLD_PREDUMP from the patch.

Thanks. -- Enke





2018-10-24 22:04:34

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v3] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

On 10/24/18 7:02 AM, Oleg Nesterov wrote:
> On 10/23, Enke Chen wrote:
>>
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -590,6 +590,12 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>> if (retval < 0)
>> goto fail_creds;
>>
>> + /*
>> + * Send the pre-coredump signal to the parent if requested.
>> + */
>> + do_notify_parent_predump();
>> + cond_resched();
>
> I am still not sure cond_resched() makes any sense here...
>
>> @@ -1553,6 +1553,9 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
>> tty_audit_fork(sig);
>> sched_autogroup_fork(sig);
>>
>> + /* Clear the pre-coredump signal for the child */
>> + sig->predump_signal = 0;
>
> No need, copy_signal() does zalloc().
>

Removed.

>
>> +void do_notify_parent_predump(void)
>> +{
>> + struct sighand_struct *sighand;
>> + struct kernel_siginfo info;
>> + struct task_struct *parent;
>> + unsigned long flags;
>> + int sig;
>> +
>> + read_lock(&tasklist_lock);
>> + parent = current->parent;
>> + sig = parent->signal->predump_signal;
>> + if (sig != 0) {
>> + clear_siginfo(&info);
>> + info.si_pid = task_tgid_vnr(current);
>> + info.si_signo = sig;
>> + if (sig == SIGCHLD)
>> + info.si_code = CLD_PREDUMP;
>> +
>> + sighand = parent->sighand;
>> + spin_lock_irqsave(&sighand->siglock, flags);
>> + __group_send_sig_info(sig, &info, parent);
>> + spin_unlock_irqrestore(&sighand->siglock, flags);
>
> You can just use do_send_sig_info() and remove sighand/flags/spin_lock_irqsave.

Ok.

>
> Perhaps the "likely" predump_signal==0 check at the start makes sense to avoid
> read_lock(tasklist).

I am not sure if we should/need to deviate from the convention (locking before
access the parent). In any case it may not matter as the coredump is in the
exceptional code flow.

>
> And I'd suggest to move it into coredump.c and make it static. It won't have
> another user.

Ok.

Thanks. -- Enke


2018-10-24 23:52:37

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Eric:

Thanks for your comments. Please see my replies inline.

On 10/24/18 6:29 AM, Eric W. Biederman wrote:
> Enke Chen <[email protected]> writes:
>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>
>> Changes to prctl(2):
>>
>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>> Set the child pre-coredump signal of the calling process to
>> arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>> This is the signal that the calling process will get prior to
>> the coredump of a child process. This value is cleared across
>> execve(2), or for the child of a fork(2).
>>
>> When SIGCHLD is specified, the signal code will be set to
>> CLD_PREDUMP in such an SIGCHLD signal.
>
> Your signal handling is still not right. Please read and comprehend
> siginfo_layout.
>
> You have not filled in all of the required fields for the SIGCHLD case.
> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
> very wrong. This is not a user generated signal.
>
> Let me say this slowly. The pair si_signo si_code determines the union
> member of struct siginfo. That needs to be handled consistently. You
> aren't. I just finished fixing this up in the entire kernel and now you
> are trying to add a usage that is worst than most of the bugs I have
> fixed. I really don't appreciate having to deal with no bugs.
>

My apologies. I will investigate and make them consistent.

>
>
> Further siginfo can be dropped. Multiple signals with the same signal
> number can be consolidated. What is your plan for dealing with that?

The primary application for the early notification involves a process
manager which is responsible for re-spawning processes or initiating
the control-plane fail-over. There are two models:

One model is to have 1:1 relationship between a process manager and
application process. There can only be one predump-signal (say, SIGUSR1)
from the child to the parent, and will unlikely be dropped or consolidated.

Another model is to have 1:N where there is only one process manager with
multiple application processes. One of the RT signal can be used to help
make it more reliable.

> Other code paths pair with wait to get the information out. There
> is no equivalent of wait in your code.

I was not aware of that before. Let me investigate.

>
> Signals can be delayed by quite a bit, scheduling delays etc. They can
> not provide any meaningful kind of real time notification.
>

The timing requirement is about 50-100 msecs for BFD. Not sure if that
qualifies as "real time". This mechanism has worked well in deployment
over the years.

> So between delays and loss of information signals appear to be a very
> poor fit for this usecase.
>
> I am concerned about code that does not fit the usecase well because
> such code winds up as code that no one cares about that must be
> maintained indefinitely, because somewhere out there there is one use
> that would break if the interface was removed. This does not feel like
> an interface people will want to use and maintain in proper working
> order forever.
>
> Ugh. Your test case is even using signalfd. So you don't even want
> this signal to be delivered as a signal.

I actually tested sigaction()/waitpid() as well. If there is a preference,
I can check in the sigaction()/waitpid() version instead.

>
> You add an interface that takes a pointer and you don't add a compat
> interface. See Oleg's point of just returning the signal number in the
> return code.

This is what Oleg said "but I won't insist, this is subjective and cosmetic".

It is no big deal either way. It just seems less work if we do not keep
adding exceptions to the prctl(2) manpage:

prctl(2):

On success, PR_GET_DUMPABLE, PR_GET_KEEPCAPS, PR_GET_NO_NEW_PRIVS, PR_CAPBSET_READ, PR_GET_TIMING, PR_GET_SECUREBITS,
PR_MCE_KILL_GET, PR_CAP_AMBIENT+PR_CAP_AMBIENT_IS_SET, and (if it returns) PR_GET_SECCOMP return the nonnegative values described
above. All other option values return 0 on success. On error, -1 is returned, and errno is set appropriately.

>
> Now I am wondering how well prctl works from a 32bit process on a 64bit
> kernel. At first glance it looks like it probably does not work.
>

I am not sure which part would be problematic.

> Consistency with PDEATHSIG is not a good argument for anything.
> PDEATHSIG at the present time is unusable in the real world by most
> applications that want something like it.

Agreed, PDEATHSIG seems to have a few issues ...

>
> So far I see an interface that even you don't want to use as designed,
> that is implemented incorrectly.
>
> The concern is real and deserves to be addressed. I don't think signals
> are the right way to handle it, and certainly not this patch as it
> stands.

I will address your concerns on the patch. Regarding the requirement and the
overall solution, if there are specific questions that I have not answered,
please let me know.

Thanks. -- Enke

2018-10-25 12:25:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Enke Chen <[email protected]> writes:

> Hi, Eric:
>
> Thanks for your comments. Please see my replies inline.
>
> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>> Enke Chen <[email protected]> writes:
>>
>>> For simplicity and consistency, this patch provides an implementation
>>> for signal-based fault notification prior to the coredump of a child
>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>> be used by an application to express its interest and to specify the
>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>
>>> Changes to prctl(2):
>>>
>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>> Set the child pre-coredump signal of the calling process to
>>> arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>>> This is the signal that the calling process will get prior to
>>> the coredump of a child process. This value is cleared across
>>> execve(2), or for the child of a fork(2).
>>>
>>> When SIGCHLD is specified, the signal code will be set to
>>> CLD_PREDUMP in such an SIGCHLD signal.
>>
>> Your signal handling is still not right. Please read and comprehend
>> siginfo_layout.
>>
>> You have not filled in all of the required fields for the SIGCHLD case.
>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>> very wrong. This is not a user generated signal.
>>
>> Let me say this slowly. The pair si_signo si_code determines the union
>> member of struct siginfo. That needs to be handled consistently. You
>> aren't. I just finished fixing this up in the entire kernel and now you
>> are trying to add a usage that is worst than most of the bugs I have
>> fixed. I really don't appreciate having to deal with no bugs.
>>
>
> My apologies. I will investigate and make them consistent.
>
>>
>>
>> Further siginfo can be dropped. Multiple signals with the same signal
>> number can be consolidated. What is your plan for dealing with that?
>
> The primary application for the early notification involves a process
> manager which is responsible for re-spawning processes or initiating
> the control-plane fail-over. There are two models:
>
> One model is to have 1:1 relationship between a process manager and
> application process. There can only be one predump-signal (say, SIGUSR1)
> from the child to the parent, and will unlikely be dropped or consolidated.
>
> Another model is to have 1:N where there is only one process manager with
> multiple application processes. One of the RT signal can be used to help
> make it more reliable.

Which suggests you want one of the negative si_codes, and to use the _rt
siginfo member like sigqueue.

>> Other code paths pair with wait to get the information out. There
>> is no equivalent of wait in your code.
>
> I was not aware of that before. Let me investigate.
>
>>
>> Signals can be delayed by quite a bit, scheduling delays etc. They can
>> not provide any meaningful kind of real time notification.
>>
>
> The timing requirement is about 50-100 msecs for BFD. Not sure if that
> qualifies as "real time". This mechanism has worked well in deployment
> over the years.

It would help if those numbers were put into the patch description so
people can tell if the mechanism is quick enough.

>> So between delays and loss of information signals appear to be a very
>> poor fit for this usecase.
>>
>> I am concerned about code that does not fit the usecase well because
>> such code winds up as code that no one cares about that must be
>> maintained indefinitely, because somewhere out there there is one use
>> that would break if the interface was removed. This does not feel like
>> an interface people will want to use and maintain in proper working
>> order forever.
>>
>> Ugh. Your test case is even using signalfd. So you don't even want
>> this signal to be delivered as a signal.
>
> I actually tested sigaction()/waitpid() as well. If there is a preference,
> I can check in the sigaction()/waitpid() version instead.
>
>>
>> You add an interface that takes a pointer and you don't add a compat
>> interface. See Oleg's point of just returning the signal number in the
>> return code.
>
> This is what Oleg said "but I won't insist, this is subjective and cosmetic".
>
> It is no big deal either way. It just seems less work if we do not keep
> adding exceptions to the prctl(2) manpage:
>
> prctl(2):
>
> On success, PR_GET_DUMPABLE, PR_GET_KEEPCAPS, PR_GET_NO_NEW_PRIVS, PR_CAPBSET_READ, PR_GET_TIMING, PR_GET_SECUREBITS,
> PR_MCE_KILL_GET, PR_CAP_AMBIENT+PR_CAP_AMBIENT_IS_SET, and (if it returns) PR_GET_SECCOMP return the nonnegative values described
> above. All other option values return 0 on success. On error, -1 is returned, and errno is set appropriately.

More work in the man page versus less work in the kernel, and less code
to maintain. I will vote for more work in the man page.

>> Now I am wondering how well prctl works from a 32bit process on a 64bit
>> kernel. At first glance it looks like it probably does not work.
>>
>
> I am not sure which part would be problematic.

32bit pointers need to be translated into 64bit pointers. If the system
call does not zero extend them. Plus structure sizes.

I think prctl is just inside the line where problems happen but it is so
close to the line of structure size differences that it makes me
nervous. Typically pointers in structures are what cause system calls
to cross that line.

>> Consistency with PDEATHSIG is not a good argument for anything.
>> PDEATHSIG at the present time is unusable in the real world by most
>> applications that want something like it.
>
> Agreed, PDEATHSIG seems to have a few issues ...
>
>>
>> So far I see an interface that even you don't want to use as designed,
>> that is implemented incorrectly.
>>
>> The concern is real and deserves to be addressed. I don't think signals
>> are the right way to handle it, and certainly not this patch as it
>> stands.
>
> I will address your concerns on the patch. Regarding the requirement and the
> overall solution, if there are specific questions that I have not answered,
> please let me know.

So far so good.

Eric

2018-10-25 13:46:47

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

On Wed, Oct 24, 2018 at 3:30 PM Eric W. Biederman <[email protected]> wrote:
> Enke Chen <[email protected]> writes:
> > For simplicity and consistency, this patch provides an implementation
> > for signal-based fault notification prior to the coredump of a child
> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> > be used by an application to express its interest and to specify the
> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
> >
> > Changes to prctl(2):
> >
> > PR_SET_PREDUMP_SIG (since Linux 4.20.x)
> > Set the child pre-coredump signal of the calling process to
> > arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
> > This is the signal that the calling process will get prior to
> > the coredump of a child process. This value is cleared across
> > execve(2), or for the child of a fork(2).
> >
> > When SIGCHLD is specified, the signal code will be set to
> > CLD_PREDUMP in such an SIGCHLD signal.
[...]
> Ugh. Your test case is even using signalfd. So you don't even want
> this signal to be delivered as a signal.

Just to make sure everyone's on the same page: You're suggesting that
it might make sense to deliver the pre-dump notification via a new
type of file instead (along the lines of signalfd, timerfd, eventfd
and so on)?

2018-10-25 20:24:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Jann Horn <[email protected]> writes:

> On Wed, Oct 24, 2018 at 3:30 PM Eric W. Biederman <[email protected]> wrote:
>> Enke Chen <[email protected]> writes:
>> > For simplicity and consistency, this patch provides an implementation
>> > for signal-based fault notification prior to the coredump of a child
>> > process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> > be used by an application to express its interest and to specify the
>> > signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>> > signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>> >
>> > Changes to prctl(2):
>> >
>> > PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>> > Set the child pre-coredump signal of the calling process to
>> > arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>> > This is the signal that the calling process will get prior to
>> > the coredump of a child process. This value is cleared across
>> > execve(2), or for the child of a fork(2).
>> >
>> > When SIGCHLD is specified, the signal code will be set to
>> > CLD_PREDUMP in such an SIGCHLD signal.
> [...]
>> Ugh. Your test case is even using signalfd. So you don't even want
>> this signal to be delivered as a signal.
>
> Just to make sure everyone's on the same page: You're suggesting that
> it might make sense to deliver the pre-dump notification via a new
> type of file instead (along the lines of signalfd, timerfd, eventfd
> and so on)?

My real complaint was that the API was not being tested in the way it
is expected to be used. Which makes a test pretty much useless as some
aspect userspace could regress and the test would not notice because it
is testing something different.



I do think that a file descriptor based API might be a good alternative
to a signal based API. The proc connector and signals are not the only
API solution.

The common solution to this problem is that distributions defailt the
rlimit core file size to 0.

Eric

2018-10-25 20:46:33

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Eric:

It turns out that the definition of CLD_PREDUMP could well be considered as
another instance of "over specification", and is completely unnecessary. When
an application chooses a signal for pre-coredump notification, it is much simpler
and robust for the signal to be dedicated for that purpose (in the parent) and
not be mixed with other semantics. The "signo + pid" should be sufficient for
the parent process in both 1:1 and 1:N models.

So I will remove the CLD_PREDUMP and related definitions, and the code can then
be simplified as the following:

+static void do_notify_parent_predump(void)
+{
+ struct task_struct *parent;
+ int sig;
+
+ read_lock(&tasklist_lock);
+ parent = current->parent;
+ sig = parent->signal->predump_signal;
+ if (sig != 0)
+ do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+ read_unlock(&tasklist_lock);
+}

I will follow up with your other comments.

Thanks. -- Enke

On 10/25/18 5:23 AM, Eric W. Biederman wrote:
> Enke Chen <[email protected]> writes:
>
>> Hi, Eric:
>>
>> Thanks for your comments. Please see my replies inline.
>>
>> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>>> Enke Chen <[email protected]> writes:
>>>
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>> Set the child pre-coredump signal of the calling process to
>>>> arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>>>> This is the signal that the calling process will get prior to
>>>> the coredump of a child process. This value is cleared across
>>>> execve(2), or for the child of a fork(2).
>>>>
>>>> When SIGCHLD is specified, the signal code will be set to
>>>> CLD_PREDUMP in such an SIGCHLD signal.
>>>
>>> Your signal handling is still not right. Please read and comprehend
>>> siginfo_layout.
>>>
>>> You have not filled in all of the required fields for the SIGCHLD case.
>>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>>> very wrong. This is not a user generated signal.
>>>
>>> Let me say this slowly. The pair si_signo si_code determines the union
>>> member of struct siginfo. That needs to be handled consistently. You
>>> aren't. I just finished fixing this up in the entire kernel and now you
>>> are trying to add a usage that is worst than most of the bugs I have
>>> fixed. I really don't appreciate having to deal with no bugs.
>>>
>>
>> My apologies. I will investigate and make them consistent.
>>
>>>
>>>
>>> Further siginfo can be dropped. Multiple signals with the same signal
>>> number can be consolidated. What is your plan for dealing with that?
>>
>> The primary application for the early notification involves a process
>> manager which is responsible for re-spawning processes or initiating
>> the control-plane fail-over. There are two models:
>>
>> One model is to have 1:1 relationship between a process manager and
>> application process. There can only be one predump-signal (say, SIGUSR1)
>> from the child to the parent, and will unlikely be dropped or consolidated.
>>
>> Another model is to have 1:N where there is only one process manager with
>> multiple application processes. One of the RT signal can be used to help
>> make it more reliable.
>
> Which suggests you want one of the negative si_codes, and to use the _rt
> siginfo member like sigqueue.
>
>>> Other code paths pair with wait to get the information out. There
>>> is no equivalent of wait in your code.
>>
>> I was not aware of that before. Let me investigate.
>>
>>>
>>> Signals can be delayed by quite a bit, scheduling delays etc. They can
>>> not provide any meaningful kind of real time notification.
>>>
>>
>> The timing requirement is about 50-100 msecs for BFD. Not sure if that
>> qualifies as "real time". This mechanism has worked well in deployment
>> over the years.
>
> It would help if those numbers were put into the patch description so
> people can tell if the mechanism is quick enough.
>
>>> So between delays and loss of information signals appear to be a very
>>> poor fit for this usecase.
>>>
>>> I am concerned about code that does not fit the usecase well because
>>> such code winds up as code that no one cares about that must be
>>> maintained indefinitely, because somewhere out there there is one use
>>> that would break if the interface was removed. This does not feel like
>>> an interface people will want to use and maintain in proper working
>>> order forever.
>>>
>>> Ugh. Your test case is even using signalfd. So you don't even want
>>> this signal to be delivered as a signal.
>>
>> I actually tested sigaction()/waitpid() as well. If there is a preference,
>> I can check in the sigaction()/waitpid() version instead.
>>
>>>
>>> You add an interface that takes a pointer and you don't add a compat
>>> interface. See Oleg's point of just returning the signal number in the
>>> return code.
>>
>> This is what Oleg said "but I won't insist, this is subjective and cosmetic".
>>
>> It is no big deal either way. It just seems less work if we do not keep
>> adding exceptions to the prctl(2) manpage:
>>
>> prctl(2):
>>
>> On success, PR_GET_DUMPABLE, PR_GET_KEEPCAPS, PR_GET_NO_NEW_PRIVS, PR_CAPBSET_READ, PR_GET_TIMING, PR_GET_SECUREBITS,
>> PR_MCE_KILL_GET, PR_CAP_AMBIENT+PR_CAP_AMBIENT_IS_SET, and (if it returns) PR_GET_SECCOMP return the nonnegative values described
>> above. All other option values return 0 on success. On error, -1 is returned, and errno is set appropriately.
>
> More work in the man page versus less work in the kernel, and less code
> to maintain. I will vote for more work in the man page.
>
>>> Now I am wondering how well prctl works from a 32bit process on a 64bit
>>> kernel. At first glance it looks like it probably does not work.
>>>
>>
>> I am not sure which part would be problematic.
>
> 32bit pointers need to be translated into 64bit pointers. If the system
> call does not zero extend them. Plus structure sizes.
>
> I think prctl is just inside the line where problems happen but it is so
> close to the line of structure size differences that it makes me
> nervous. Typically pointers in structures are what cause system calls
> to cross that line.
>
>>> Consistency with PDEATHSIG is not a good argument for anything.
>>> PDEATHSIG at the present time is unusable in the real world by most
>>> applications that want something like it.
>>
>> Agreed, PDEATHSIG seems to have a few issues ...
>>
>>>
>>> So far I see an interface that even you don't want to use as designed,
>>> that is implemented incorrectly.
>>>
>>> The concern is real and deserves to be addressed. I don't think signals
>>> are the right way to handle it, and certainly not this patch as it
>>> stands.
>>
>> I will address your concerns on the patch. Regarding the requirement and the
>> overall solution, if there are specific questions that I have not answered,
>> please let me know.
>
> So far so good.
>
> Eric
>

2018-10-25 21:25:08

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Eric:

I have a couple comments inlined.

>> On Wed, Oct 24, 2018 at 3:30 PM Eric W. Biederman <[email protected]> wrote:
>>> Enke Chen <[email protected]> writes:
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>> Set the child pre-coredump signal of the calling process to
>>>> arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>>>> This is the signal that the calling process will get prior to
>>>> the coredump of a child process. This value is cleared across
>>>> execve(2), or for the child of a fork(2).
>>>>
>>>> When SIGCHLD is specified, the signal code will be set to
>>>> CLD_PREDUMP in such an SIGCHLD signal.
>> [...]
>>> Ugh. Your test case is even using signalfd. So you don't even want
>>> this signal to be delivered as a signal.
>>
>> Just to make sure everyone's on the same page: You're suggesting that
>> it might make sense to deliver the pre-dump notification via a new
>> type of file instead (along the lines of signalfd, timerfd, eventfd
>> and so on)?
>
> My real complaint was that the API was not being tested in the way it
> is expected to be used. Which makes a test pretty much useless as some
> aspect userspace could regress and the test would not notice because it
> is testing something different.
>
>

As I stated in a prior email, I have test code for both sigaction/waipid(),
and signefd(). As the sigaction/waitpid is more widely used and that is
what you prefer, I will change the selftest code to reflect that in the
next version. Actually I should separate out the selftest code.

>
> I do think that a file descriptor based API might be a good alternative
> to a signal based API. The proc connector and signals are not the only
> API solution.
>
> The common solution to this problem is that distributions defailt the
> rlimit core file size to 0.

We do need coredumps in order to have the bugs fixed.

Thanks. -- Enke

2018-10-25 21:57:03

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v2] kernel/signal: Signal-based pre-coredump notification

Hi, Eric:

Please see my replied inline.

On 10/25/18 5:23 AM, Eric W. Biederman wrote:
> Enke Chen <[email protected]> writes:
>
>> Hi, Eric:
>>
>> Thanks for your comments. Please see my replies inline.
>>
>> On 10/24/18 6:29 AM, Eric W. Biederman wrote:
>>> Enke Chen <[email protected]> writes:
>>>
>>>> For simplicity and consistency, this patch provides an implementation
>>>> for signal-based fault notification prior to the coredump of a child
>>>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>>>> be used by an application to express its interest and to specify the
>>>> signal (SIGCHLD or SIGUSR1 or SIGUSR2) for such a notification. A new
>>>> signal code (si_code), CLD_PREDUMP, is also defined for SIGCHLD.
>>>>
>>>> Changes to prctl(2):
>>>>
>>>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>>>> Set the child pre-coredump signal of the calling process to
>>>> arg2 (either SIGUSR1, or SIUSR2, or SIGCHLD, or 0 to clear).
>>>> This is the signal that the calling process will get prior to
>>>> the coredump of a child process. This value is cleared across
>>>> execve(2), or for the child of a fork(2).
>>>>
>>>> When SIGCHLD is specified, the signal code will be set to
>>>> CLD_PREDUMP in such an SIGCHLD signal.
>>>
>>> Your signal handling is still not right. Please read and comprehend
>>> siginfo_layout.
>>>
>>> You have not filled in all of the required fields for the SIGCHLD case.
>>> For the non SIGCHLD case you are using si_code == 0 == SI_USER which is
>>> very wrong. This is not a user generated signal.
>>>
>>> Let me say this slowly. The pair si_signo si_code determines the union
>>> member of struct siginfo. That needs to be handled consistently. You
>>> aren't. I just finished fixing this up in the entire kernel and now you
>>> are trying to add a usage that is worst than most of the bugs I have
>>> fixed. I really don't appreciate having to deal with no bugs.
>>>
>>
>> My apologies. I will investigate and make them consistent.
>>
>>>
>>>
>>> Further siginfo can be dropped. Multiple signals with the same signal
>>> number can be consolidated. What is your plan for dealing with that?
>>
>> The primary application for the early notification involves a process
>> manager which is responsible for re-spawning processes or initiating
>> the control-plane fail-over. There are two models:
>>
>> One model is to have 1:1 relationship between a process manager and
>> application process. There can only be one predump-signal (say, SIGUSR1)
>> from the child to the parent, and will unlikely be dropped or consolidated.
>>
>> Another model is to have 1:N where there is only one process manager with
>> multiple application processes. One of the RT signal can be used to help
>> make it more reliable.
>
> Which suggests you want one of the negative si_codes, and to use the _rt
> siginfo member like sigqueue.

It seems that we do not need to touch the si_codes. A dedicated signal
for the pre-coredump notification is simpler and more robust. There are
enough RT signal numbers available.

>
>>> Other code paths pair with wait to get the information out. There
>>> is no equivalent of wait in your code.
>>
>> I was not aware of that before. Let me investigate.
>>
>>>
>>> Signals can be delayed by quite a bit, scheduling delays etc. They can
>>> not provide any meaningful kind of real time notification.
>>>
>>
>> The timing requirement is about 50-100 msecs for BFD. Not sure if that
>> qualifies as "real time". This mechanism has worked well in deployment
>> over the years.
>
> It would help if those numbers were put into the patch description so
> people can tell if the mechanism is quick enough.

I will do as suggested, but at the risk of making the patch description
longer than the patch itself :-)

>
>>> So between delays and loss of information signals appear to be a very
>>> poor fit for this usecase.
>>>
>>> I am concerned about code that does not fit the usecase well because
>>> such code winds up as code that no one cares about that must be
>>> maintained indefinitely, because somewhere out there there is one use
>>> that would break if the interface was removed. This does not feel like
>>> an interface people will want to use and maintain in proper working
>>> order forever.
>>>
>>> Ugh. Your test case is even using signalfd. So you don't even want
>>> this signal to be delivered as a signal.
>>
>> I actually tested sigaction()/waitpid() as well. If there is a preference,
>> I can check in the sigaction()/waitpid() version instead.
>>
>>>
>>> You add an interface that takes a pointer and you don't add a compat
>>> interface. See Oleg's point of just returning the signal number in the
>>> return code.
>>
>> This is what Oleg said "but I won't insist, this is subjective and cosmetic".
>>
>> It is no big deal either way. It just seems less work if we do not keep
>> adding exceptions to the prctl(2) manpage:
>>
>> prctl(2):
>>
>> On success, PR_GET_DUMPABLE, PR_GET_KEEPCAPS, PR_GET_NO_NEW_PRIVS, PR_CAPBSET_READ, PR_GET_TIMING, PR_GET_SECUREBITS,
>> PR_MCE_KILL_GET, PR_CAP_AMBIENT+PR_CAP_AMBIENT_IS_SET, and (if it returns) PR_GET_SECCOMP return the nonnegative values described
>> above. All other option values return 0 on success. On error, -1 is returned, and errno is set appropriately.
>
> More work in the man page versus less work in the kernel, and less code
> to maintain. I will vote for more work in the man page.

Oleg has given me a pass on this one. It is one line. But I still
prefer not to change back unless there is strong opinion...

>
>>> Now I am wondering how well prctl works from a 32bit process on a 64bit
>>> kernel. At first glance it looks like it probably does not work.
>>>
>>
>> I am not sure which part would be problematic.
>
> 32bit pointers need to be translated into 64bit pointers. If the system
> call does not zero extend them. Plus structure sizes.
>
> I think prctl is just inside the line where problems happen but it is so
> close to the line of structure size differences that it makes me
> nervous. Typically pointers in structures are what cause system calls
> to cross that line.
>
>>> Consistency with PDEATHSIG is not a good argument for anything.
>>> PDEATHSIG at the present time is unusable in the real world by most
>>> applications that want something like it.
>>
>> Agreed, PDEATHSIG seems to have a few issues ...
>>
>>>
>>> So far I see an interface that even you don't want to use as designed,
>>> that is implemented incorrectly.
>>>
>>> The concern is real and deserves to be addressed. I don't think signals
>>> are the right way to handle it, and certainly not this patch as it
>>> stands.
>>
>> I will address your concerns on the patch. Regarding the requirement and the
>> overall solution, if there are specific questions that I have not answered,
>> please let me know.
>
> So far so good.
>

Thanks. Reviews from folks on the list have certainly made the code shorter,
simpler and cleaner.

-- Enke


2018-10-25 22:57:13

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification.

Changes to prctl(2):

PR_SET_PREDUMP_SIG (since Linux 4.20.x)
Set the child pre-coredump signal of the calling process to
arg2 (either a signal value in the range 1..maxsig, or 0 to
clear). This is the signal that the calling process will get
prior to the coredump of a child process. This value is
cleared across execve(2), or for the child of a fork(2).

PR_GET_PREDUMP_SIG (since Linux 4.20.x)
Return the current value of the child pre-coredump signal,
in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

One application is BFD. The early fault notification is a critical
component for maintaining BFD sessions (with a timeout value of
50 msec or 100 msec) across a control-plane failure.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event POSIX Signal Connector-based
----------------------------------------------------------------------
ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_STOPPED

ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_CONTINUED

pre_coredump/ N/A proc_coredump_connector()
get_signal()

post_coredump/ do_notify_parent() proc_exit_connector()
do_exit() SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen <[email protected]>
---
v3 -> v4:

Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
including:
o remove the definition CLD_PREDUMP.
o code simplification.
o split out the selftest code.

fs/coredump.c | 23 +++++++++++++++++++++++
fs/exec.c | 3 +++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/sys.c | 13 +++++++++++++
5 files changed, 46 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..22c40dc 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
return err;
}

+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+static void do_notify_parent_predump(void)
+{
+ struct task_struct *parent;
+ int sig;
+
+ read_lock(&tasklist_lock);
+ parent = current->parent;
+ sig = parent->signal->predump_signal;
+ if (sig != 0)
+ do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+ read_unlock(&tasklist_lock);
+}
+
void do_coredump(const kernel_siginfo_t *siginfo)
{
struct core_state core_state;
@@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
if (retval < 0)
goto fail_creds;

+ /*
+ * Send the pre-coredump signal to the parent if requested.
+ */
+ do_notify_parent_predump();
+
old_cred = override_creds(cred);

ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;

+ /* Clear the pre-coredump signal before loading a new binary */
+ sig->predump_signal = 0;
+
#ifdef CONFIG_POSIX_TIMERS
exit_itimers(sig);
flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..728ef68 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@ struct signal_struct {
int group_stop_count;
unsigned int flags; /* see SIGNAL_* flags below */

+ /* The signal sent prior to a child's coredump */
+ int predump_signal;
+
/*
* PR_SET_CHILD_SUBREAPER marks a process, like a service
* manager, to re-parent orphan (double-forking) child processes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)

+/* Whether to receive signal prior to child's coredump */
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..39aa3b8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (!valid_signal((int)arg2))
+ return -EINVAL;
+ me->signal->predump_signal = (int)arg2;
+ break;
+ case PR_GET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->predump_signal,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
--
1.8.3.1


2018-10-25 22:57:46

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH] selftests/prctl: selftest for pre-coredump signal notification


Dependency: [PATCH] kernel/signal: Signal-based pre-coredump notification

Signed-off-by: Enke Chen <[email protected]>
---
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/predump-sig-test.c | 160 +++++++++++++++++++++++
2 files changed, 161 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index c7923b2..f8d60d5 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)

ifeq ($(ARCH),x86)
TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
- disable-tsc-test
+ disable-tsc-test predump-sig-test
all: $(TEST_PROGS)

include ../lib.mk
diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
new file mode 100644
index 0000000..15d62691
--- /dev/null
+++ b/tools/testing/selftests/prctl/predump-sig-test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
+ *
+ * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
+ *
+ * When set with prctl(), the specified signal is sent to the parent process
+ * prior to the coredump of a child process.
+ *
+ * Usage: ./predump-sig-test {SIGUSR1 | SIGRT2}
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <errno.h>
+
+#ifndef PR_SET_PREDUMP_SIG
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+#endif
+
+#define SIGRT2 (SIGRTMIN + 1)
+
+#define handle_error(msg) \
+ do { perror(msg); exit(EXIT_FAILURE); } while (0)
+
+static sig_idx;
+static siginfo_t siginfo_rcv[2];
+
+static void sigaction_func(int sig, siginfo_t *siginfo, void *arg)
+{
+ memcpy(&siginfo_rcv[sig_idx], siginfo, sizeof(siginfo_t));
+ sig_idx++;
+}
+
+static int set_sigaction(int sig)
+{
+ struct sigaction new_action;
+ int rc;
+
+ memset(&new_action, 0, sizeof(struct sigaction));
+ new_action.sa_sigaction = sigaction_func;
+ new_action.sa_flags = SA_SIGINFO;
+ sigemptyset(&new_action.sa_mask);
+
+ return sigaction(sig, &new_action, NULL);
+}
+
+static int test_prctl(int sig)
+{
+ int sig2, rc;
+
+ rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: setting");
+
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: getting");
+
+ if (sig2 != sig) {
+ printf("prctl: sig %d, post %d\n", sig, sig2);
+ return -1;
+ }
+ return 0;
+}
+
+static void child_fn(void)
+{
+ int rc, sig;
+
+ printf("\nChild pid: %ld\n", (long)getpid());
+
+ /* Test: Child should not inherit the predump_signal */
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: child");
+
+ printf("child: predump_signal %d\n", sig);
+
+ /* Force coredump here */
+ printf("child: calling abort()\n");
+ fflush(stdout);
+ abort();
+}
+
+static int parent_fn(pid_t child_pid)
+{
+ int i, status, count;
+ siginfo_t *si;
+ pid_t w;
+
+ for (count = 0; count < 2; count++) {
+ w = waitpid(child_pid, &status, 0);
+ printf("\nwaitpid: %d\n", w);
+ if (w < 0)
+ perror("waitpid");
+
+ si = &siginfo_rcv[count];
+ printf("signal: si_signo %d, si_pid %ld, si_uid %d\n",
+ si->si_signo, si->si_pid, si->si_uid);
+ printf("siginfo: si_errno %d, si_code %d, si_status %d\n",
+ si->si_errno, si->si_code, si->si_status);
+ }
+ fflush(stdout);
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t child_pid;
+ int rc, signo;
+
+ if (argc != 2) {
+ printf("invalid number of arguments\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (strcmp(argv[1], "SIGUSR1") == 0)
+ signo = SIGUSR1;
+ else if (strcmp(argv[1], "SIGRT2") == 0)
+ signo = SIGRT2;
+ else {
+ printf("invalid argument for signal\n");
+ fflush(stdout);
+ exit(EXIT_FAILURE);
+ }
+
+ rc = set_sigaction(SIGCHLD);
+ if (rc < 0)
+ handle_error("set_sigaction: SIGCHLD");
+
+ if (signo != SIGCHLD) {
+ rc = set_sigaction(signo);
+ if (rc < 0)
+ handle_error("set_sigaction: SIGCHLD");
+ }
+
+ /* Test: prctl() setting */
+ rc = test_prctl(0);
+ printf("prctl: sig %d %s\n", 0, (rc == 0) ? "PASS" : "FAIL");
+
+ rc = test_prctl(signo);
+ printf("prctl: sig %d %s\n", signo, (rc == 0) ? "PASS" : "FAIL");
+
+ child_pid = fork();
+ if (child_pid == -1)
+ handle_error("fork");
+
+ if (child_pid == 0) { /* Code executed by child */
+ child_fn();
+ } else { /* Code executed by parent */
+ parent_fn(child_pid);
+ exit(EXIT_SUCCESS);
+ }
+}
--
1.8.3.1


2018-10-26 08:29:01

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

On 10/25, Enke Chen wrote:
>
> +static void do_notify_parent_predump(void)
> +{
> + struct task_struct *parent;
> + int sig;
> +
> + read_lock(&tasklist_lock);
> + parent = current->parent;
> + sig = parent->signal->predump_signal;
> + if (sig != 0)
> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
> + read_unlock(&tasklist_lock);

Ah. It is strange I didn't think about this before... Please, do not take
tasklist_lock, use rcu_read_lock() instead. do_send_sig_info() uses the
rcu-friendly lock_task_sighand(), so rcu_dereference(parent) should work
fine.

Oleg.


2018-10-26 22:24:30

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

Hi, Olge:

This is really a good idea given that "parent" is declared as RCU-protected.
Just a bit odd, though, that the "parent" has not been accessed this way in
the code base.

So just to confirm: the revised code would look like the following:

static void do_notify_parent_predump(void)
{
struct task_struct *parent;
int sig;

rcu_read_lock();
parent = rcu_dereference(current->parent);
sig = parent->signal->predump_signal;
if (sig != 0)
do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
rcu_read_unlock();
}

Thank you so much for your help during this review. I would like to ack your
contribution in the "Reviewed-by:" field.

-- Enke

On 10/26/18 1:28 AM, Oleg Nesterov wrote:
> On 10/25, Enke Chen wrote:
>>
>> +static void do_notify_parent_predump(void)
>> +{
>> + struct task_struct *parent;
>> + int sig;
>> +
>> + read_lock(&tasklist_lock);
>> + parent = current->parent;
>> + sig = parent->signal->predump_signal;
>> + if (sig != 0)
>> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
>> + read_unlock(&tasklist_lock);
>
> Ah. It is strange I didn't think about this before... Please, do not take
> tasklist_lock, use rcu_read_lock() instead. do_send_sig_info() uses the
> rcu-friendly lock_task_sighand(), so rcu_dereference(parent) should work
> fine.
>
> Oleg.
>

2018-10-29 11:18:56

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

Hi,

On 10/26, Enke Chen wrote:
>
> This is really a good idea given that "parent" is declared as RCU-protected.
> Just a bit odd, though, that the "parent" has not been accessed this way in
> the code base.

It is acccessed when possible,

> So just to confirm: the revised code would look like the following:
>
> static void do_notify_parent_predump(void)
> {
> struct task_struct *parent;
> int sig;
>
> rcu_read_lock();
> parent = rcu_dereference(current->parent);
> sig = parent->signal->predump_signal;
> if (sig != 0)
> do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
> rcu_read_unlock();
> }

Yes, this is what I meant.

But I still think do_notify_parent_predump() should notify ->real_parent,
not ->parent.

Oleg.


2018-10-29 21:09:36

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v4] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

Yes, it should be the "real_parent" that is more interested in the notification.
Will revert back.

+static void do_notify_parent_predump(void)
+{
+ struct task_struct *parent;
+ int sig;
+
+ rcu_read_lock();
+ parent = rcu_dereference(current->real_parent);
+ sig = parent->signal->predump_signal;
+ if (sig != 0)
+ do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+ rcu_read_unlock();
+}

Thanks. -- Enke

On 10/29/18 4:18 AM, Oleg Nesterov wrote:
> Hi,
>
> On 10/26, Enke Chen wrote:
>>
>> This is really a good idea given that "parent" is declared as RCU-protected.
>> Just a bit odd, though, that the "parent" has not been accessed this way in
>> the code base.
>
> It is acccessed when possible,
>
>> So just to confirm: the revised code would look like the following:
>>
>> static void do_notify_parent_predump(void)
>> {
>> struct task_struct *parent;
>> int sig;
>>
>> rcu_read_lock();
>> parent = rcu_dereference(current->parent);
>> sig = parent->signal->predump_signal;
>> if (sig != 0)
>> do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
>> rcu_read_unlock();
>> }
>
> Yes, this is what I meant.
>
> But I still think do_notify_parent_predump() should notify ->real_parent,
> not ->parent.
>
> Oleg.
>

2018-10-29 22:32:31

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification.

Changes to prctl(2):

PR_SET_PREDUMP_SIG (since Linux 4.20.x)
Set the child pre-coredump signal of the calling process to
arg2 (either a signal value in the range 1..maxsig, or 0 to
clear). This is the signal that the calling process will get
prior to the coredump of a child process. This value is
cleared across execve(2), or for the child of a fork(2).

PR_GET_PREDUMP_SIG (since Linux 4.20.x)
Return the current value of the child pre-coredump signal,
in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

One application is BFD. The early fault notification is a critical
component for maintaining BFD sessions (with a timeout value of
50 msec or 100 msec) across a control-plane failure.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event POSIX Signal Connector-based
----------------------------------------------------------------------
ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_STOPPED

ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_CONTINUED

pre_coredump/ N/A proc_coredump_connector()
get_signal()

post_coredump/ do_notify_parent() proc_exit_connector()
do_exit() SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
---
v4 -> v5:
Addressed review comments from Oleg Nesterov:
o use rcu_read_lock instead.
o revert back to notify the real_parent.

fs/coredump.c | 23 +++++++++++++++++++++++
fs/exec.c | 3 +++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/sys.c | 13 +++++++++++++
5 files changed, 46 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..740b1bb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
return err;
}

+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+static void do_notify_parent_predump(void)
+{
+ struct task_struct *parent;
+ int sig;
+
+ rcu_read_lock();
+ parent = rcu_dereference(current->real_parent);
+ sig = parent->signal->predump_signal;
+ if (sig != 0)
+ do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+ rcu_read_unlock();
+}
+
void do_coredump(const kernel_siginfo_t *siginfo)
{
struct core_state core_state;
@@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
if (retval < 0)
goto fail_creds;

+ /*
+ * Send the pre-coredump signal to the parent if requested.
+ */
+ do_notify_parent_predump();
+
old_cred = override_creds(cred);

ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;

+ /* Clear the pre-coredump signal before loading a new binary */
+ sig->predump_signal = 0;
+
#ifdef CONFIG_POSIX_TIMERS
exit_itimers(sig);
flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..728ef68 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@ struct signal_struct {
int group_stop_count;
unsigned int flags; /* see SIGNAL_* flags below */

+ /* The signal sent prior to a child's coredump */
+ int predump_signal;
+
/*
* PR_SET_CHILD_SUBREAPER marks a process, like a service
* manager, to re-parent orphan (double-forking) child processes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)

+/* Whether to receive signal prior to child's coredump */
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..39aa3b8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (!valid_signal((int)arg2))
+ return -EINVAL;
+ me->signal->predump_signal = (int)arg2;
+ break;
+ case PR_GET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->predump_signal,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
--
1.8.3.1


2018-10-30 16:47:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

On 10/29, Enke Chen wrote:
>
> Reviewed-by: Oleg Nesterov <[email protected]>

Hmm. I didn't say this ;)

But OK, feel free to keep this tag.

I do not like this feauture. But I see no technical problems in this version
and I never pretented I understand the user-space needs, so I won't argue.

Oleg.


2018-10-31 00:26:30

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

Hi, Oleg:

On 10/30/18 9:46 AM, Oleg Nesterov wrote:
> On 10/29, Enke Chen wrote:
>>
>> Reviewed-by: Oleg Nesterov <[email protected]>
>
> Hmm. I didn't say this ;)
>
> But OK, feel free to keep this tag.

Thanks.

>
> I do not like this feauture. But I see no technical problems in this version
> and I never pretented I understand the user-space needs, so I won't argue.

As I explained earlier, the primary application is in the area of network
high-availability / non-stop-forwarding where early fault notification and
early action can help maintain BFD sessions and thus avoid unnecessary
disruption to forwarding while the control-plane is recovering.

-- Enke


2018-11-12 23:22:51

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

Hi, Folks:

Could you please take care of this patch [PATCH 5]?

Thanks. -- Enke

On 10/29/18 3:31 PM, Enke Chen wrote:
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal for such a notification.
>
> Changes to prctl(2):
>
> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
> Set the child pre-coredump signal of the calling process to
> arg2 (either a signal value in the range 1..maxsig, or 0 to
> clear). This is the signal that the calling process will get
> prior to the coredump of a child process. This value is
> cleared across execve(2), or for the child of a fork(2).
>
> PR_GET_PREDUMP_SIG (since Linux 4.20.x)
> Return the current value of the child pre-coredump signal,
> in the location pointed to by (int *) arg2.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
>
> One application is BFD. The early fault notification is a critical
> component for maintaining BFD sessions (with a timeout value of
> 50 msec or 100 msec) across a control-plane failure.
>
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process Event POSIX Signal Connector-based
> ----------------------------------------------------------------------
> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_STOPPED
>
> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_CONTINUED
>
> pre_coredump/ N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/ do_notify_parent() proc_exit_connector()
> do_exit() SIGCHLD / exit_signal
> ----------------------------------------------------------------------
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.
>
> Signed-off-by: Enke Chen <[email protected]>
> Reviewed-by: Oleg Nesterov <[email protected]>
> ---
> v4 -> v5:
> Addressed review comments from Oleg Nesterov:
> o use rcu_read_lock instead.
> o revert back to notify the real_parent.
>
> fs/coredump.c | 23 +++++++++++++++++++++++
> fs/exec.c | 3 +++
> include/linux/sched/signal.h | 3 +++
> include/uapi/linux/prctl.h | 4 ++++
> kernel/sys.c | 13 +++++++++++++
> 5 files changed, 46 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e42e17e..740b1bb 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
> return err;
> }
>
> +/*
> + * While do_notify_parent() notifies the parent of a child's death post
> + * its coredump, this function lets the parent (if so desired) know about
> + * the imminent death of a child just prior to its coredump.
> + */
> +static void do_notify_parent_predump(void)
> +{
> + struct task_struct *parent;
> + int sig;
> +
> + rcu_read_lock();
> + parent = rcu_dereference(current->real_parent);
> + sig = parent->signal->predump_signal;
> + if (sig != 0)
> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
> + rcu_read_unlock();
> +}
> +
> void do_coredump(const kernel_siginfo_t *siginfo)
> {
> struct core_state core_state;
> @@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> if (retval < 0)
> goto fail_creds;
>
> + /*
> + * Send the pre-coredump signal to the parent if requested.
> + */
> + do_notify_parent_predump();
> +
> old_cred = override_creds(cred);
>
> ispipe = format_corename(&cn, &cprm);
> diff --git a/fs/exec.c b/fs/exec.c
> index fc281b7..7714da7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
> /* we have changed execution domain */
> tsk->exit_signal = SIGCHLD;
>
> + /* Clear the pre-coredump signal before loading a new binary */
> + sig->predump_signal = 0;
> +
> #ifdef CONFIG_POSIX_TIMERS
> exit_itimers(sig);
> flush_itimer_signals();
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 13789d1..728ef68 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -112,6 +112,9 @@ struct signal_struct {
> int group_stop_count;
> unsigned int flags; /* see SIGNAL_* flags below */
>
> + /* The signal sent prior to a child's coredump */
> + int predump_signal;
> +
> /*
> * PR_SET_CHILD_SUBREAPER marks a process, like a service
> * manager, to re-parent orphan (double-forking) child processes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index c0d7ea0..79f0a8a 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -219,4 +219,8 @@ struct prctl_mm_map {
> # define PR_SPEC_DISABLE (1UL << 2)
> # define PR_SPEC_FORCE_DISABLE (1UL << 3)
>
> +/* Whether to receive signal prior to child's coredump */
> +#define PR_SET_PREDUMP_SIG 54
> +#define PR_GET_PREDUMP_SIG 55
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..39aa3b8 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> break;
> + case PR_SET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + if (!valid_signal((int)arg2))
> + return -EINVAL;
> + me->signal->predump_signal = (int)arg2;
> + break;
> + case PR_GET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)
> + return -EINVAL;
> + error = put_user(me->signal->predump_signal,
> + (int __user *)arg2);
> + break;
> default:
> error = -EINVAL;
> break;
>

2018-11-22 11:42:46

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

Hi, Andrew:

On 11/21/18 5:09 PM, Enke Chen wrote:
> Hi, Andrew:
>
> On 11/21/18 4:37 PM, Andrew Morton wrote:

>> Do we have other linux-specific signal extensions which could piggyback onto that?
>
> No. There are enough existing signals that an application can choose for this
> purpose, such as SIGUSR1, SIGUSR1, and any of the RT signals.
>
> Thanks. -- Enke
>

Let me clarify: I meant this feature is complete by itself. But you seem to ask a
somewhat different question, for which I am not aware of.

Thanks. -- Enke


2018-11-22 12:17:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

On Tue, 30 Oct 2018 17:46:29 +0100 Oleg Nesterov <[email protected]> wrote:

> On 10/29, Enke Chen wrote:
> >
> > Reviewed-by: Oleg Nesterov <[email protected]>
>
> Hmm. I didn't say this ;)
>
> But OK, feel free to keep this tag.
>
> I do not like this feauture.

Why is that?

> But I see no technical problems in this version
> and I never pretented I understand the user-space needs, so I won't argue.

The changelog appears to spell this all out quite well? Unusually
well, in my experience ;)

A couple of things...

- We'll be looking for a manpage update please. (Search MAINTAINERS
for "manpage")

- As it's a linux-specific feature, a test under
tools/testing/selftests would be appropriate. I don't know how much
that work will be. Do we have other linux-specific signal extensions
which could piggyback onto that?


2018-11-22 13:02:13

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

Hi, Andrew:

On 11/21/18 4:37 PM, Andrew Morton wrote:
> On Tue, 30 Oct 2018 17:46:29 +0100 Oleg Nesterov <[email protected]> wrote:
>
>> On 10/29, Enke Chen wrote:
>>>
>>> Reviewed-by: Oleg Nesterov <[email protected]>
>>
>> Hmm. I didn't say this ;)
>>
>> But OK, feel free to keep this tag.
>>
>> I do not like this feauture.
>
> Why is that?
>
>> But I see no technical problems in this version
>> and I never pretented I understand the user-space needs, so I won't argue.
>
> The changelog appears to spell this all out quite well? Unusually
> well, in my experience ;)

I also followed up with a little more explanation in the email thread on
10/30/2018:

---
As I explained earlier, the primary application is in the area of network
high-availability / non-stop-forwarding where early fault notification and
early action can help maintain BFD sessions and thus avoid unnecessary
disruption to forwarding while the control-plane is recovering.
---

BTW, I probably should have pointed out this earlier:

BFD stands for "RFC 5880: Bi-directional forwarding detection".

>
> A couple of things...
>
> - We'll be looking for a manpage update please. (Search MAINTAINERS
> for "manpage")

Yes, I will submit a manpage update. Most of the text is already
written in the patch description.

>
> - As it's a linux-specific feature, a test under
> tools/testing/selftests would be appropriate. I don't know how much
> that work will be.

The selftest code was submitted on 10/25/2018:

[PATCH] selftests/prctl: selftest for pre-coredump signal notification

> Do we have other linux-specific signal extensions which could piggyback onto that?

No. There are enough existing signals that an application can choose for this
purpose, such as SIGUSR1, SIGUSR1, and any of the RT signals.

Thanks. -- Enke

2018-11-22 13:53:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

On Wed, 21 Nov 2018 17:09:50 -0800 Enke Chen <[email protected]> wrote:

> Hi, Andrew:
>
> On 11/21/18 4:37 PM, Andrew Morton wrote:
> > On Tue, 30 Oct 2018 17:46:29 +0100 Oleg Nesterov <[email protected]> wrote:
> >
> >> On 10/29, Enke Chen wrote:
> >>>
> >>> Reviewed-by: Oleg Nesterov <[email protected]>
> >>
> >> Hmm. I didn't say this ;)
> >>
> >> But OK, feel free to keep this tag.
> >>
> >> I do not like this feauture.
> >
> > Why is that?
> >
> >> But I see no technical problems in this version
> >> and I never pretented I understand the user-space needs, so I won't argue.
> >
> > The changelog appears to spell this all out quite well? Unusually
> > well, in my experience ;)
>
> I also followed up with a little more explanation in the email thread on
> 10/30/2018:
>
> ---
> As I explained earlier, the primary application is in the area of network
> high-availability / non-stop-forwarding where early fault notification and
> early action can help maintain BFD sessions and thus avoid unnecessary
> disruption to forwarding while the control-plane is recovering.
> ---
>
> BTW, I probably should have pointed out this earlier:
>
> BFD stands for "RFC 5880: Bi-directional forwarding detection".

I saw that. My point is that your above followup wasn't necessary -
the changelog is clear!

> >
> > - As it's a linux-specific feature, a test under
> > tools/testing/selftests would be appropriate. I don't know how much
> > that work will be.
>
> The selftest code was submitted on 10/25/2018:
>
> [PATCH] selftests/prctl: selftest for pre-coredump signal notification

OK, please prepare these as a patch series.

> > Do we have other linux-specific signal extensions which could piggyback onto that?
>
> No. There are enough existing signals that an application can choose for this
> purpose, such as SIGUSR1, SIGUSR1, and any of the RT signals.
>

My point is that if we have previously added any linux-specific signal
expensions then your selftest patch would be an appropriate place where
we could add tests for those features. I'm not saying that you should
add such tests at this time, but please do prepare the selftest as a
thing which tests linux-specific signal extensions in general, not as a
thing which tests pre-coredump signals only.


2018-11-22 18:46:09

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5] kernel/signal: Signal-based pre-coredump notification

Hi, Andrew:

As suggested, I will post them as a patch series (with the same version v5):

[PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification
[PATCH v5 2/2] selftests/prctl: selftest for pre-coredump signal notification

I have a diff for the manpage as well. I guess that it should be submitted separately
from the code.

Thanks. -- Enke

On 11/21/18 5:33 PM, Andrew Morton wrote:
> On Wed, 21 Nov 2018 17:09:50 -0800 Enke Chen <[email protected]> wrote:
>
>> Hi, Andrew:
>>
>> On 11/21/18 4:37 PM, Andrew Morton wrote:
>>> On Tue, 30 Oct 2018 17:46:29 +0100 Oleg Nesterov <[email protected]> wrote:
>>>
>>>> On 10/29, Enke Chen wrote:
>>>>>
>>>>> Reviewed-by: Oleg Nesterov <[email protected]>
>>>>
>>>> Hmm. I didn't say this ;)
>>>>
>>>> But OK, feel free to keep this tag.
>>>>
>>>> I do not like this feauture.
>>>
>>> Why is that?
>>>
>>>> But I see no technical problems in this version
>>>> and I never pretented I understand the user-space needs, so I won't argue.
>>>
>>> The changelog appears to spell this all out quite well? Unusually
>>> well, in my experience ;)
>>
>> I also followed up with a little more explanation in the email thread on
>> 10/30/2018:
>>
>> ---
>> As I explained earlier, the primary application is in the area of network
>> high-availability / non-stop-forwarding where early fault notification and
>> early action can help maintain BFD sessions and thus avoid unnecessary
>> disruption to forwarding while the control-plane is recovering.
>> ---
>>
>> BTW, I probably should have pointed out this earlier:
>>
>> BFD stands for "RFC 5880: Bi-directional forwarding detection".
>
> I saw that. My point is that your above followup wasn't necessary -
> the changelog is clear!
>
>>>
>>> - As it's a linux-specific feature, a test under
>>> tools/testing/selftests would be appropriate. I don't know how much
>>> that work will be.
>>
>> The selftest code was submitted on 10/25/2018:
>>
>> [PATCH] selftests/prctl: selftest for pre-coredump signal notification
>
> OK, please prepare these as a patch series.
>
>>> Do we have other linux-specific signal extensions which could piggyback onto that?
>>
>> No. There are enough existing signals that an application can choose for this
>> purpose, such as SIGUSR1, SIGUSR1, and any of the RT signals.
>>
>
> My point is that if we have previously added any linux-specific signal
> expensions then your selftest patch would be an appropriate place where
> we could add tests for those features. I'm not saying that you should
> add such tests at this time, but please do prepare the selftest as a
> thing which tests linux-specific signal extensions in general, not as a
> thing which tests pre-coredump signals only.
>

2018-11-27 22:55:45

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

[Repost as a series, as suggested by Andrew Morton]

For simplicity and consistency, this patch provides an implementation
for signal-based fault notification prior to the coredump of a child
process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
be used by an application to express its interest and to specify the
signal for such a notification.

Changes to prctl(2):

PR_SET_PREDUMP_SIG (since Linux 4.20.x)
Set the child pre-coredump signal of the calling process to
arg2 (either a signal value in the range 1..maxsig, or 0 to
clear). This is the signal that the calling process will get
prior to the coredump of a child process. This value is
cleared across execve(2), or for the child of a fork(2).

PR_GET_PREDUMP_SIG (since Linux 4.20.x)
Return the current value of the child pre-coredump signal,
in the location pointed to by (int *) arg2.

Background:

As the coredump of a process may take time, in certain time-sensitive
applications it is necessary for a parent process (e.g., a process
manager) to be notified of a child's imminent death before the coredump
so that the parent process can act sooner, such as re-spawning an
application process, or initiating a control-plane fail-over.

One application is BFD. The early fault notification is a critical
component for maintaining BFD sessions (with a timeout value of
50 msec or 100 msec) across a control-plane failure.

Currently there are two ways for a parent process to be notified of a
child process's state change. One is to use the POSIX signal, and
another is to use the kernel connector module. The specific events and
actions are summarized as follows:

Process Event POSIX Signal Connector-based
----------------------------------------------------------------------
ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_STOPPED

ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
SIGCHLD / CLD_CONTINUED

pre_coredump/ N/A proc_coredump_connector()
get_signal()

post_coredump/ do_notify_parent() proc_exit_connector()
do_exit() SIGCHLD / exit_signal
----------------------------------------------------------------------

As shown in the table, the signal-based pre-coredump notification is not
currently available. In some cases using a connector-based notification
can be quite complicated (e.g., when a process manager is written in shell
scripts and thus is subject to certain inherent limitations), and a
signal-based notification would be simpler and better suited.

Signed-off-by: Enke Chen <[email protected]>
Reviewed-by: Oleg Nesterov <[email protected]>
---
v4 -> v5:
Addressed review comments from Oleg Nesterov:
o use rcu_read_lock instead.
o revert back to notify the real_parent.

fs/coredump.c | 23 +++++++++++++++++++++++
fs/exec.c | 3 +++
include/linux/sched/signal.h | 3 +++
include/uapi/linux/prctl.h | 4 ++++
kernel/sys.c | 13 +++++++++++++
5 files changed, 46 insertions(+)

diff --git a/fs/coredump.c b/fs/coredump.c
index e42e17e..740b1bb 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
return err;
}

+/*
+ * While do_notify_parent() notifies the parent of a child's death post
+ * its coredump, this function lets the parent (if so desired) know about
+ * the imminent death of a child just prior to its coredump.
+ */
+static void do_notify_parent_predump(void)
+{
+ struct task_struct *parent;
+ int sig;
+
+ rcu_read_lock();
+ parent = rcu_dereference(current->real_parent);
+ sig = parent->signal->predump_signal;
+ if (sig != 0)
+ do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
+ rcu_read_unlock();
+}
+
void do_coredump(const kernel_siginfo_t *siginfo)
{
struct core_state core_state;
@@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
if (retval < 0)
goto fail_creds;

+ /*
+ * Send the pre-coredump signal to the parent if requested.
+ */
+ do_notify_parent_predump();
+
old_cred = override_creds(cred);

ispipe = format_corename(&cn, &cprm);
diff --git a/fs/exec.c b/fs/exec.c
index fc281b7..7714da7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
/* we have changed execution domain */
tsk->exit_signal = SIGCHLD;

+ /* Clear the pre-coredump signal before loading a new binary */
+ sig->predump_signal = 0;
+
#ifdef CONFIG_POSIX_TIMERS
exit_itimers(sig);
flush_itimer_signals();
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 13789d1..728ef68 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -112,6 +112,9 @@ struct signal_struct {
int group_stop_count;
unsigned int flags; /* see SIGNAL_* flags below */

+ /* The signal sent prior to a child's coredump */
+ int predump_signal;
+
/*
* PR_SET_CHILD_SUBREAPER marks a process, like a service
* manager, to re-parent orphan (double-forking) child processes
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index c0d7ea0..79f0a8a 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -219,4 +219,8 @@ struct prctl_mm_map {
# define PR_SPEC_DISABLE (1UL << 2)
# define PR_SPEC_FORCE_DISABLE (1UL << 3)

+/* Whether to receive signal prior to child's coredump */
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 123bd73..39aa3b8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
break;
+ case PR_SET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ if (!valid_signal((int)arg2))
+ return -EINVAL;
+ me->signal->predump_signal = (int)arg2;
+ break;
+ case PR_GET_PREDUMP_SIG:
+ if (arg3 || arg4 || arg5)
+ return -EINVAL;
+ error = put_user(me->signal->predump_signal,
+ (int __user *)arg2);
+ break;
default:
error = -EINVAL;
break;
--
1.8.3.1

2018-11-27 22:55:56

by Enke Chen (enkechen)

[permalink] [raw]
Subject: [PATCH v5 2/2] selftests/prctl: selftest for pre-coredump signal notification

[Repost as a series, as suggested by Andrew Morton]

Selftest for the pre-coredump signal notification

Signed-off-by: Enke Chen <[email protected]>
---
tools/testing/selftests/prctl/Makefile | 2 +-
tools/testing/selftests/prctl/predump-sig-test.c | 160 +++++++++++++++++++++++
2 files changed, 161 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/prctl/predump-sig-test.c

diff --git a/tools/testing/selftests/prctl/Makefile b/tools/testing/selftests/prctl/Makefile
index c7923b2..f8d60d5 100644
--- a/tools/testing/selftests/prctl/Makefile
+++ b/tools/testing/selftests/prctl/Makefile
@@ -5,7 +5,7 @@ ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/x86/ -e s/x86_64/x86/)

ifeq ($(ARCH),x86)
TEST_PROGS := disable-tsc-ctxt-sw-stress-test disable-tsc-on-off-stress-test \
- disable-tsc-test
+ disable-tsc-test predump-sig-test
all: $(TEST_PROGS)

include ../lib.mk
diff --git a/tools/testing/selftests/prctl/predump-sig-test.c b/tools/testing/selftests/prctl/predump-sig-test.c
new file mode 100644
index 0000000..15d62691
--- /dev/null
+++ b/tools/testing/selftests/prctl/predump-sig-test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018, Enke Chen, Cisco Systems, Inc.
+ *
+ * Tests for prctl(PR_SET_PREDUMP_SIG, ...) / prctl(PR_GET_PREDUMP_SIG, ...)
+ *
+ * When set with prctl(), the specified signal is sent to the parent process
+ * prior to the coredump of a child process.
+ *
+ * Usage: ./predump-sig-test {SIGUSR1 | SIGRT2}
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/wait.h>
+#include <sys/prctl.h>
+#include <signal.h>
+#include <errno.h>
+
+#ifndef PR_SET_PREDUMP_SIG
+#define PR_SET_PREDUMP_SIG 54
+#define PR_GET_PREDUMP_SIG 55
+#endif
+
+#define SIGRT2 (SIGRTMIN + 1)
+
+#define handle_error(msg) \
+ do { perror(msg); exit(EXIT_FAILURE); } while (0)
+
+static sig_idx;
+static siginfo_t siginfo_rcv[2];
+
+static void sigaction_func(int sig, siginfo_t *siginfo, void *arg)
+{
+ memcpy(&siginfo_rcv[sig_idx], siginfo, sizeof(siginfo_t));
+ sig_idx++;
+}
+
+static int set_sigaction(int sig)
+{
+ struct sigaction new_action;
+ int rc;
+
+ memset(&new_action, 0, sizeof(struct sigaction));
+ new_action.sa_sigaction = sigaction_func;
+ new_action.sa_flags = SA_SIGINFO;
+ sigemptyset(&new_action.sa_mask);
+
+ return sigaction(sig, &new_action, NULL);
+}
+
+static int test_prctl(int sig)
+{
+ int sig2, rc;
+
+ rc = prctl(PR_SET_PREDUMP_SIG, sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: setting");
+
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig2, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: getting");
+
+ if (sig2 != sig) {
+ printf("prctl: sig %d, post %d\n", sig, sig2);
+ return -1;
+ }
+ return 0;
+}
+
+static void child_fn(void)
+{
+ int rc, sig;
+
+ printf("\nChild pid: %ld\n", (long)getpid());
+
+ /* Test: Child should not inherit the predump_signal */
+ rc = prctl(PR_GET_PREDUMP_SIG, &sig, 0, 0, 0);
+ if (rc < 0)
+ handle_error("prctl: child");
+
+ printf("child: predump_signal %d\n", sig);
+
+ /* Force coredump here */
+ printf("child: calling abort()\n");
+ fflush(stdout);
+ abort();
+}
+
+static int parent_fn(pid_t child_pid)
+{
+ int i, status, count;
+ siginfo_t *si;
+ pid_t w;
+
+ for (count = 0; count < 2; count++) {
+ w = waitpid(child_pid, &status, 0);
+ printf("\nwaitpid: %d\n", w);
+ if (w < 0)
+ perror("waitpid");
+
+ si = &siginfo_rcv[count];
+ printf("signal: si_signo %d, si_pid %ld, si_uid %d\n",
+ si->si_signo, si->si_pid, si->si_uid);
+ printf("siginfo: si_errno %d, si_code %d, si_status %d\n",
+ si->si_errno, si->si_code, si->si_status);
+ }
+ fflush(stdout);
+}
+
+int main(int argc, char *argv[])
+{
+ pid_t child_pid;
+ int rc, signo;
+
+ if (argc != 2) {
+ printf("invalid number of arguments\n");
+ exit(EXIT_FAILURE);
+ }
+
+ if (strcmp(argv[1], "SIGUSR1") == 0)
+ signo = SIGUSR1;
+ else if (strcmp(argv[1], "SIGRT2") == 0)
+ signo = SIGRT2;
+ else {
+ printf("invalid argument for signal\n");
+ fflush(stdout);
+ exit(EXIT_FAILURE);
+ }
+
+ rc = set_sigaction(SIGCHLD);
+ if (rc < 0)
+ handle_error("set_sigaction: SIGCHLD");
+
+ if (signo != SIGCHLD) {
+ rc = set_sigaction(signo);
+ if (rc < 0)
+ handle_error("set_sigaction: SIGCHLD");
+ }
+
+ /* Test: prctl() setting */
+ rc = test_prctl(0);
+ printf("prctl: sig %d %s\n", 0, (rc == 0) ? "PASS" : "FAIL");
+
+ rc = test_prctl(signo);
+ printf("prctl: sig %d %s\n", signo, (rc == 0) ? "PASS" : "FAIL");
+
+ child_pid = fork();
+ if (child_pid == -1)
+ handle_error("fork");
+
+ if (child_pid == 0) { /* Code executed by child */
+ child_fn();
+ } else { /* Code executed by parent */
+ parent_fn(child_pid);
+ exit(EXIT_SUCCESS);
+ }
+}
--
1.8.3.1


2018-11-28 15:22:13

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

On Tue, Nov 27, 2018 at 10:54:41PM +0000, Enke Chen wrote:
> [Repost as a series, as suggested by Andrew Morton]
>
> For simplicity and consistency, this patch provides an implementation
> for signal-based fault notification prior to the coredump of a child
> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> be used by an application to express its interest and to specify the
> signal for such a notification.
>
> Changes to prctl(2):
>
> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
> Set the child pre-coredump signal of the calling process to
> arg2 (either a signal value in the range 1..maxsig, or 0 to
> clear). This is the signal that the calling process will get
> prior to the coredump of a child process. This value is
> cleared across execve(2), or for the child of a fork(2).
>
> PR_GET_PREDUMP_SIG (since Linux 4.20.x)
> Return the current value of the child pre-coredump signal,
> in the location pointed to by (int *) arg2.
>
> Background:
>
> As the coredump of a process may take time, in certain time-sensitive
> applications it is necessary for a parent process (e.g., a process
> manager) to be notified of a child's imminent death before the coredump
> so that the parent process can act sooner, such as re-spawning an
> application process, or initiating a control-plane fail-over.
>
> One application is BFD. The early fault notification is a critical
> component for maintaining BFD sessions (with a timeout value of
> 50 msec or 100 msec) across a control-plane failure.
>
> Currently there are two ways for a parent process to be notified of a
> child process's state change. One is to use the POSIX signal, and
> another is to use the kernel connector module. The specific events and
> actions are summarized as follows:
>
> Process Event POSIX Signal Connector-based
> ----------------------------------------------------------------------
> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_STOPPED
>
> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
> SIGCHLD / CLD_CONTINUED
>
> pre_coredump/ N/A proc_coredump_connector()
> get_signal()
>
> post_coredump/ do_notify_parent() proc_exit_connector()
> do_exit() SIGCHLD / exit_signal
> ----------------------------------------------------------------------
>
> As shown in the table, the signal-based pre-coredump notification is not
> currently available. In some cases using a connector-based notification
> can be quite complicated (e.g., when a process manager is written in shell
> scripts and thus is subject to certain inherent limitations), and a
> signal-based notification would be simpler and better suited.

Since this is a notification of a change of process status, would it be
more natural to send it through SIGCHLD?

As with other supplementary child status events, a flag could be added
for wait and sigaction.sa_flags to indicate whether the parent wants
this event to be reported or not.

Then a suitable CLD_XXX could be defined for this, and we could
piggyback on PR_{SET,GET}_PDEATHSIG rather than having to have something
new.

(I hadn't been watching this thread closely, so apologies if this has
been discussed already.)

>
> Signed-off-by: Enke Chen <[email protected]>
> Reviewed-by: Oleg Nesterov <[email protected]>
> ---
> v4 -> v5:
> Addressed review comments from Oleg Nesterov:
> o use rcu_read_lock instead.
> o revert back to notify the real_parent.
>
> fs/coredump.c | 23 +++++++++++++++++++++++
> fs/exec.c | 3 +++
> include/linux/sched/signal.h | 3 +++
> include/uapi/linux/prctl.h | 4 ++++
> kernel/sys.c | 13 +++++++++++++
> 5 files changed, 46 insertions(+)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index e42e17e..740b1bb 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
> return err;
> }
>
> +/*
> + * While do_notify_parent() notifies the parent of a child's death post
> + * its coredump, this function lets the parent (if so desired) know about
> + * the imminent death of a child just prior to its coredump.
> + */
> +static void do_notify_parent_predump(void)
> +{
> + struct task_struct *parent;
> + int sig;
> +
> + rcu_read_lock();
> + parent = rcu_dereference(current->real_parent);
> + sig = parent->signal->predump_signal;
> + if (sig != 0)
> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);

Doesn't this send si_code == SI_USER. That seems wrong: the receiving
process wouldn't not be able to distinguish a real pre-coredump
notification from a bogus one sent by kill(2) etc.

SEND_SIG_PRIV also looks wrong, because it assumes that the sender is
"the kernel" so there is no si_pid.

This may be another reason for building on top of SIGCHLD which already
has the required (but weird) "si_pid == affected process" semantics,
rather than si_pid indicating the sender.

> + rcu_read_unlock();
> +}
> +
> void do_coredump(const kernel_siginfo_t *siginfo)
> {
> struct core_state core_state;
> @@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> if (retval < 0)
> goto fail_creds;
>
> + /*
> + * Send the pre-coredump signal to the parent if requested.
> + */
> + do_notify_parent_predump();
> +
> old_cred = override_creds(cred);
>
> ispipe = format_corename(&cn, &cprm);
> diff --git a/fs/exec.c b/fs/exec.c
> index fc281b7..7714da7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
> /* we have changed execution domain */
> tsk->exit_signal = SIGCHLD;
>
> + /* Clear the pre-coredump signal before loading a new binary */
> + sig->predump_signal = 0;
> +
> #ifdef CONFIG_POSIX_TIMERS
> exit_itimers(sig);
> flush_itimer_signals();
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index 13789d1..728ef68 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -112,6 +112,9 @@ struct signal_struct {
> int group_stop_count;
> unsigned int flags; /* see SIGNAL_* flags below */
>
> + /* The signal sent prior to a child's coredump */
> + int predump_signal;
> +
> /*
> * PR_SET_CHILD_SUBREAPER marks a process, like a service
> * manager, to re-parent orphan (double-forking) child processes
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index c0d7ea0..79f0a8a 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -219,4 +219,8 @@ struct prctl_mm_map {
> # define PR_SPEC_DISABLE (1UL << 2)
> # define PR_SPEC_FORCE_DISABLE (1UL << 3)
>
> +/* Whether to receive signal prior to child's coredump */
> +#define PR_SET_PREDUMP_SIG 54
> +#define PR_GET_PREDUMP_SIG 55
> +
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 123bd73..39aa3b8 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> return -EINVAL;
> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> break;
> + case PR_SET_PREDUMP_SIG:
> + if (arg3 || arg4 || arg5)

glibc has

int prctl(int option, ...);

Some prctls() police extra arguments for zeros, but this means that
the userspace caller also has to supply pointless 0 arguments.

It's debatable which is the preferred approach. Did you have any
particular rationale for your choice here?

[...]

Cheers
---Dave

2018-11-29 00:16:36

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

Hi, Dave:

Thanks for your comments. You have indeed missed some of the prior reviews
and discussions. But that is OK.

Please see my replies inline.

On 11/28/18 7:19 AM, Dave Martin wrote:
> On Tue, Nov 27, 2018 at 10:54:41PM +0000, Enke Chen wrote:
>> [Repost as a series, as suggested by Andrew Morton]
>>
>> For simplicity and consistency, this patch provides an implementation
>> for signal-based fault notification prior to the coredump of a child
>> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
>> be used by an application to express its interest and to specify the
>> signal for such a notification.
>>
>> Changes to prctl(2):
>>
>> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
>> Set the child pre-coredump signal of the calling process to
>> arg2 (either a signal value in the range 1..maxsig, or 0 to
>> clear). This is the signal that the calling process will get
>> prior to the coredump of a child process. This value is
>> cleared across execve(2), or for the child of a fork(2).
>>
>> PR_GET_PREDUMP_SIG (since Linux 4.20.x)
>> Return the current value of the child pre-coredump signal,
>> in the location pointed to by (int *) arg2.
>>
>> Background:
>>
>> As the coredump of a process may take time, in certain time-sensitive
>> applications it is necessary for a parent process (e.g., a process
>> manager) to be notified of a child's imminent death before the coredump
>> so that the parent process can act sooner, such as re-spawning an
>> application process, or initiating a control-plane fail-over.
>>
>> One application is BFD. The early fault notification is a critical
>> component for maintaining BFD sessions (with a timeout value of
>> 50 msec or 100 msec) across a control-plane failure.
>>
>> Currently there are two ways for a parent process to be notified of a
>> child process's state change. One is to use the POSIX signal, and
>> another is to use the kernel connector module. The specific events and
>> actions are summarized as follows:
>>
>> Process Event POSIX Signal Connector-based
>> ----------------------------------------------------------------------
>> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
>> SIGCHLD / CLD_STOPPED
>>
>> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
>> SIGCHLD / CLD_CONTINUED
>>
>> pre_coredump/ N/A proc_coredump_connector()
>> get_signal()
>>
>> post_coredump/ do_notify_parent() proc_exit_connector()
>> do_exit() SIGCHLD / exit_signal
>> ----------------------------------------------------------------------
>>
>> As shown in the table, the signal-based pre-coredump notification is not
>> currently available. In some cases using a connector-based notification
>> can be quite complicated (e.g., when a process manager is written in shell
>> scripts and thus is subject to certain inherent limitations), and a
>> signal-based notification would be simpler and better suited.
>
> Since this is a notification of a change of process status, would it be
> more natural to send it through SIGCHLD?
>
> As with other supplementary child status events, a flag could be added
> for wait and sigaction.sa_flags to indicate whether the parent wants
> this event to be reported or not.
>
> Then a suitable CLD_XXX could be defined for this, and we could
> piggyback on PR_{SET,GET}_PDEATHSIG rather than having to have something
> new.
>

> (I hadn't been watching this thread closely, so apologies if this has
> been discussed already.)

Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it
was removed after discussion:

v3 --> v4:

Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
including:
o remove the definition CLD_PREDUMP.
---

You can look at the discussions in the email thread, in particular several
issues pointed out by Eric Biederman, and my reply to Eric.

There are two models 1:1 (one process manager with one child process), and 1:N
(one process manager with many child processes). A legacy signal like SIGCHLD
would not work in the 1:N model due to compression/loss of legacy signals. One
need to use a RT signal in that case.

One more point in my reply:

When an application chooses a signal for pre-coredump notification, it is much
simpler and robust for the signal to be dedicated for that purpose (in the parent)
and not be mixed with other semantics. The "signo + pid" should be sufficient for
the parent process in both 1:1 and 1:N models.

>
>>
>> Signed-off-by: Enke Chen <[email protected]>
>> Reviewed-by: Oleg Nesterov <[email protected]>
>> ---
>> v4 -> v5:
>> Addressed review comments from Oleg Nesterov:
>> o use rcu_read_lock instead.
>> o revert back to notify the real_parent.
>>
>> fs/coredump.c | 23 +++++++++++++++++++++++
>> fs/exec.c | 3 +++
>> include/linux/sched/signal.h | 3 +++
>> include/uapi/linux/prctl.h | 4 ++++
>> kernel/sys.c | 13 +++++++++++++
>> 5 files changed, 46 insertions(+)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index e42e17e..740b1bb 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>> return err;
>> }
>>
>> +/*
>> + * While do_notify_parent() notifies the parent of a child's death post
>> + * its coredump, this function lets the parent (if so desired) know about
>> + * the imminent death of a child just prior to its coredump.
>> + */
>> +static void do_notify_parent_predump(void)
>> +{
>> + struct task_struct *parent;
>> + int sig;
>> +
>> + rcu_read_lock();
>> + parent = rcu_dereference(current->real_parent);
>> + sig = parent->signal->predump_signal;
>> + if (sig != 0)
>> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
>
> Doesn't this send si_code == SI_USER. That seems wrong: the receiving
> process wouldn't not be able to distinguish a real pre-coredump
> notification from a bogus one sent by kill(2) et
The receiving process (i.e., the process manager) knows the PID of all
its child processes. Thus it can easily tell whether the signal is from
a child or not.

>
> SEND_SIG_PRIV also looks wrong, because it assumes that the sender is
> "the kernel" so there is no si_pid.

That is why SEND_SIG_NOINFO is chosen here for carrying the "pid". What
matters to the parent process is the "signo + pid" for identifying the
child process for the pre-coredump notification.

>
> This may be another reason for building on top of SIGCHLD which already
> has the required (but weird) "si_pid == affected process" semantics,
> rather than si_pid indicating the sender.
>
>> + rcu_read_unlock();
>> +}
>> +
>> void do_coredump(const kernel_siginfo_t *siginfo)
>> {
>> struct core_state core_state;
>> @@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>> if (retval < 0)
>> goto fail_creds;
>>
>> + /*
>> + * Send the pre-coredump signal to the parent if requested.
>> + */
>> + do_notify_parent_predump();
>> +
>> old_cred = override_creds(cred);
>>
>> ispipe = format_corename(&cn, &cprm);
>> diff --git a/fs/exec.c b/fs/exec.c
>> index fc281b7..7714da7 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
>> /* we have changed execution domain */
>> tsk->exit_signal = SIGCHLD;
>>
>> + /* Clear the pre-coredump signal before loading a new binary */
>> + sig->predump_signal = 0;
>> +
>> #ifdef CONFIG_POSIX_TIMERS
>> exit_itimers(sig);
>> flush_itimer_signals();
>> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
>> index 13789d1..728ef68 100644
>> --- a/include/linux/sched/signal.h
>> +++ b/include/linux/sched/signal.h
>> @@ -112,6 +112,9 @@ struct signal_struct {
>> int group_stop_count;
>> unsigned int flags; /* see SIGNAL_* flags below */
>>
>> + /* The signal sent prior to a child's coredump */
>> + int predump_signal;
>> +
>> /*
>> * PR_SET_CHILD_SUBREAPER marks a process, like a service
>> * manager, to re-parent orphan (double-forking) child processes
>> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
>> index c0d7ea0..79f0a8a 100644
>> --- a/include/uapi/linux/prctl.h
>> +++ b/include/uapi/linux/prctl.h
>> @@ -219,4 +219,8 @@ struct prctl_mm_map {
>> # define PR_SPEC_DISABLE (1UL << 2)
>> # define PR_SPEC_FORCE_DISABLE (1UL << 3)
>>
>> +/* Whether to receive signal prior to child's coredump */
>> +#define PR_SET_PREDUMP_SIG 54
>> +#define PR_GET_PREDUMP_SIG 55
>> +
>> #endif /* _LINUX_PRCTL_H */
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 123bd73..39aa3b8 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>> return -EINVAL;
>> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
>> break;
>> + case PR_SET_PREDUMP_SIG:
>> + if (arg3 || arg4 || arg5)
>
> glibc has
>
> int prctl(int option, ...);
>
> Some prctls() police extra arguments for zeros, but this means that
> the userspace caller also has to supply pointless 0 arguments.
>
> It's debatable which is the preferred approach. Did you have any
> particular rationale for your choice here?
>

The initial version did not check the values of these unused arguments.
But Jann Horn pointed out the new convention is to enforce the 0 values
so I followed ...

Thanks. -- Enke

2018-11-29 11:56:27

by Dave Martin

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

On Thu, Nov 29, 2018 at 12:15:35AM +0000, Enke Chen wrote:
> Hi, Dave:
>
> Thanks for your comments. You have indeed missed some of the prior reviews
> and discussions. But that is OK.
>
> Please see my replies inline.
>
> On 11/28/18 7:19 AM, Dave Martin wrote:
> > On Tue, Nov 27, 2018 at 10:54:41PM +0000, Enke Chen wrote:
> >> [Repost as a series, as suggested by Andrew Morton]
> >>
> >> For simplicity and consistency, this patch provides an implementation
> >> for signal-based fault notification prior to the coredump of a child
> >> process. A new prctl command, PR_SET_PREDUMP_SIG, is defined that can
> >> be used by an application to express its interest and to specify the
> >> signal for such a notification.
> >>
> >> Changes to prctl(2):
> >>
> >> PR_SET_PREDUMP_SIG (since Linux 4.20.x)
> >> Set the child pre-coredump signal of the calling process to
> >> arg2 (either a signal value in the range 1..maxsig, or 0 to
> >> clear). This is the signal that the calling process will get
> >> prior to the coredump of a child process. This value is
> >> cleared across execve(2), or for the child of a fork(2).
> >>
> >> PR_GET_PREDUMP_SIG (since Linux 4.20.x)
> >> Return the current value of the child pre-coredump signal,
> >> in the location pointed to by (int *) arg2.
> >>
> >> Background:
> >>
> >> As the coredump of a process may take time, in certain time-sensitive
> >> applications it is necessary for a parent process (e.g., a process
> >> manager) to be notified of a child's imminent death before the coredump
> >> so that the parent process can act sooner, such as re-spawning an
> >> application process, or initiating a control-plane fail-over.
> >>
> >> One application is BFD. The early fault notification is a critical
> >> component for maintaining BFD sessions (with a timeout value of
> >> 50 msec or 100 msec) across a control-plane failure.
> >>
> >> Currently there are two ways for a parent process to be notified of a
> >> child process's state change. One is to use the POSIX signal, and
> >> another is to use the kernel connector module. The specific events and
> >> actions are summarized as follows:
> >>
> >> Process Event POSIX Signal Connector-based
> >> ----------------------------------------------------------------------
> >> ptrace_attach() do_notify_parent_cldstop() proc_ptrace_connector()
> >> SIGCHLD / CLD_STOPPED
> >>
> >> ptrace_detach() do_notify_parent_cldstop() proc_ptrace_connector()
> >> SIGCHLD / CLD_CONTINUED
> >>
> >> pre_coredump/ N/A proc_coredump_connector()
> >> get_signal()
> >>
> >> post_coredump/ do_notify_parent() proc_exit_connector()
> >> do_exit() SIGCHLD / exit_signal
> >> ----------------------------------------------------------------------
> >>
> >> As shown in the table, the signal-based pre-coredump notification is not
> >> currently available. In some cases using a connector-based notification
> >> can be quite complicated (e.g., when a process manager is written in shell
> >> scripts and thus is subject to certain inherent limitations), and a
> >> signal-based notification would be simpler and better suited.
> >
> > Since this is a notification of a change of process status, would it be
> > more natural to send it through SIGCHLD?
> >
> > As with other supplementary child status events, a flag could be added
> > for wait and sigaction.sa_flags to indicate whether the parent wants
> > this event to be reported or not.
> >
> > Then a suitable CLD_XXX could be defined for this, and we could
> > piggyback on PR_{SET,GET}_PDEATHSIG rather than having to have something
> > new.
> >
>
> > (I hadn't been watching this thread closely, so apologies if this has
> > been discussed already.)
>
> Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it
> was removed after discussion:
>
> v3 --> v4:
>
> Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
> including:
> o remove the definition CLD_PREDUMP.
> ---
>
> You can look at the discussions in the email thread, in particular several
> issues pointed out by Eric Biederman, and my reply to Eric.

Ah, right.

> There are two models 1:1 (one process manager with one child process), and 1:N
> (one process manager with many child processes). A legacy signal like SIGCHLD
> would not work in the 1:N model due to compression/loss of legacy signals. One
> need to use a RT signal in that case.

SIGCHLD can be redirected to an RT signal via clone(). Are you saying
the signal is still not queued in that case? I had assumed that things
like pthreads rely on this working.

However, one detail I had missed is that only child exits are reported
via the exit signal set by clone(). Other child status changes are
seem to be reported via SIGCHLD always.

Making your supervised processes into clone-children might interact
badly with pthreads if it uses wait(__WCLONE) internally. I've not
looked into that.

> One more point in my reply:
>
> When an application chooses a signal for pre-coredump notification, it is much
> simpler and robust for the signal to be dedicated for that purpose (in the parent)
> and not be mixed with other semantics. The "signo + pid" should be sufficient for
> the parent process in both 1:1 and 1:N models.

What if the signal queue overflows? sigqueue() returns EAGAIN, but I
think that signals queued by the kernel would simply be lost. This
probably won't happen in any non-pathological scenario, but the process
manager may just silently go wrong instead of failing cleanly when/if
this happens.

SIGCHLD + wait() is immune to this problem for other child status
notifications (albeit with higher overhead).

Unless I've missed something fundamental, signals simply aren't a
reliable data transport: if you need 100% reliability, you need to be
using another mechanism, either in combination with a signal, or by
itself.

> >>
> >> Signed-off-by: Enke Chen <[email protected]>
> >> Reviewed-by: Oleg Nesterov <[email protected]>
> >> ---
> >> v4 -> v5:
> >> Addressed review comments from Oleg Nesterov:
> >> o use rcu_read_lock instead.
> >> o revert back to notify the real_parent.
> >>
> >> fs/coredump.c | 23 +++++++++++++++++++++++
> >> fs/exec.c | 3 +++
> >> include/linux/sched/signal.h | 3 +++
> >> include/uapi/linux/prctl.h | 4 ++++
> >> kernel/sys.c | 13 +++++++++++++
> >> 5 files changed, 46 insertions(+)
> >>
> >> diff --git a/fs/coredump.c b/fs/coredump.c
> >> index e42e17e..740b1bb 100644
> >> --- a/fs/coredump.c
> >> +++ b/fs/coredump.c
> >> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
> >> return err;
> >> }
> >>
> >> +/*
> >> + * While do_notify_parent() notifies the parent of a child's death post
> >> + * its coredump, this function lets the parent (if so desired) know about
> >> + * the imminent death of a child just prior to its coredump.
> >> + */
> >> +static void do_notify_parent_predump(void)
> >> +{
> >> + struct task_struct *parent;
> >> + int sig;
> >> +
> >> + rcu_read_lock();
> >> + parent = rcu_dereference(current->real_parent);
> >> + sig = parent->signal->predump_signal;
> >> + if (sig != 0)
> >> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
> >
> > Doesn't this send si_code == SI_USER. That seems wrong: the receiving
> > process wouldn't not be able to distinguish a real pre-coredump
> > notification from a bogus one sent by kill(2) et
> The receiving process (i.e., the process manager) knows the PID of all
> its child processes. Thus it can easily tell whether the signal is from
> a child or not.

But it can't tell whether the child sent the signal via kill(2) etc., or
whether the child started dumping core.

It's cleaner to be able to filter reliably on si_code, especially if the
process you're supervising isn't fully under your control. For example,
even if the supervised process is considered trustworthy, it could still
be exploited by an attacker (or simply go wrong) in a way that causes
it do to a bogus kill().

(If you treat any incoming signal as anything more than a hint, failure
to check si_code is usually a bug in general.)

> >
> > SEND_SIG_PRIV also looks wrong, because it assumes that the sender is
> > "the kernel" so there is no si_pid.
>
> That is why SEND_SIG_NOINFO is chosen here for carrying the "pid". What
> matters to the parent process is the "signo + pid" for identifying the
> child process for the pre-coredump notification.

I think that si_code should at least be SI_KERNEL, but how that is
achieved doesn't really matter.

> >
> > This may be another reason for building on top of SIGCHLD which already
> > has the required (but weird) "si_pid == affected process" semantics,
> > rather than si_pid indicating the sender.
> >
> >> + rcu_read_unlock();
> >> +}
> >> +
> >> void do_coredump(const kernel_siginfo_t *siginfo)
> >> {
> >> struct core_state core_state;
> >> @@ -590,6 +608,11 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> >> if (retval < 0)
> >> goto fail_creds;
> >>
> >> + /*
> >> + * Send the pre-coredump signal to the parent if requested.
> >> + */
> >> + do_notify_parent_predump();
> >> +
> >> old_cred = override_creds(cred);
> >>
> >> ispipe = format_corename(&cn, &cprm);
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index fc281b7..7714da7 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1181,6 +1181,9 @@ static int de_thread(struct task_struct *tsk)
> >> /* we have changed execution domain */
> >> tsk->exit_signal = SIGCHLD;
> >>
> >> + /* Clear the pre-coredump signal before loading a new binary */
> >> + sig->predump_signal = 0;
> >> +
> >> #ifdef CONFIG_POSIX_TIMERS
> >> exit_itimers(sig);
> >> flush_itimer_signals();
> >> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> >> index 13789d1..728ef68 100644
> >> --- a/include/linux/sched/signal.h
> >> +++ b/include/linux/sched/signal.h
> >> @@ -112,6 +112,9 @@ struct signal_struct {
> >> int group_stop_count;
> >> unsigned int flags; /* see SIGNAL_* flags below */
> >>
> >> + /* The signal sent prior to a child's coredump */
> >> + int predump_signal;
> >> +
> >> /*
> >> * PR_SET_CHILD_SUBREAPER marks a process, like a service
> >> * manager, to re-parent orphan (double-forking) child processes
> >> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> >> index c0d7ea0..79f0a8a 100644
> >> --- a/include/uapi/linux/prctl.h
> >> +++ b/include/uapi/linux/prctl.h
> >> @@ -219,4 +219,8 @@ struct prctl_mm_map {
> >> # define PR_SPEC_DISABLE (1UL << 2)
> >> # define PR_SPEC_FORCE_DISABLE (1UL << 3)
> >>
> >> +/* Whether to receive signal prior to child's coredump */
> >> +#define PR_SET_PREDUMP_SIG 54
> >> +#define PR_GET_PREDUMP_SIG 55
> >> +
> >> #endif /* _LINUX_PRCTL_H */
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index 123bd73..39aa3b8 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> >> return -EINVAL;
> >> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> >> break;
> >> + case PR_SET_PREDUMP_SIG:
> >> + if (arg3 || arg4 || arg5)
> >
> > glibc has
> >
> > int prctl(int option, ...);
> >
> > Some prctls() police extra arguments for zeros, but this means that
> > the userspace caller also has to supply pointless 0 arguments.
> >
> > It's debatable which is the preferred approach. Did you have any
> > particular rationale for your choice here?
> >
>
> The initial version did not check the values of these unused arguments.
> But Jann Horn pointed out the new convention is to enforce the 0 values
> so I followed ...

Hmmm, I wasn't aware of this convention when I added PR_SVE_SET_VL etc.,
and there is no clear pattern in sys.c, and nobody commented at the
time.

Of course, it works either way.

Cheers
---Dave

2018-11-30 00:28:52

by Enke Chen (enkechen)

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

Hi, Dave:

On 11/29/18 3:55 AM, Dave Martin wrote:
>> Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it
>> was removed after discussion:
>>
>> v3 --> v4:
>>
>> Addressed review comments from Oleg Nesterov, and Eric W. Biederman,
>> including:
>> o remove the definition CLD_PREDUMP.
>> ---
>>
>> You can look at the discussions in the email thread, in particular several
>> issues pointed out by Eric Biederman, and my reply to Eric.
>
> Ah, right.
>
>> There are two models 1:1 (one process manager with one child process), and 1:N
>> (one process manager with many child processes). A legacy signal like SIGCHLD
>> would not work in the 1:N model due to compression/loss of legacy signals. One
>> need to use a RT signal in that case.
>
> SIGCHLD can be redirected to an RT signal via clone(). Are you saying
> the signal is still not queued in that case? I had assumed that things
> like pthreads rely on this working.
>
> However, one detail I had missed is that only child exits are reported
> via the exit signal set by clone(). Other child status changes are
> seem to be reported via SIGCHLD always.
>
> Making your supervised processes into clone-children might interact
> badly with pthreads if it uses wait(__WCLONE) internally. I've not
> looked into that.

As Oleg commented before:

And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG
is pointless if you have 2 or more children.
In addition, there is really no need to introduce a new semantics to SIGCHLD.
There are enough signals available for one to be designated in the parent process
for the pre-coredump notification.

>
>> One more point in my reply:
>>
>> When an application chooses a signal for pre-coredump notification, it is much
>> simpler and robust for the signal to be dedicated for that purpose (in the parent)
>> and not be mixed with other semantics. The "signo + pid" should be sufficient for
>> the parent process in both 1:1 and 1:N models.
>
> What if the signal queue overflows? sigqueue() returns EAGAIN, but I
> think that signals queued by the kernel would simply be lost. This
> probably won't happen in any non-pathological scenario, but the process
> manager may just silently go wrong instead of failing cleanly when/if
> this happens.

As pointed out by Oleg:

see the legacy_queue() check. Any signal < SIGRTMIN do not queue. IOW, if SIGCHLD
is already pending, then next SIGCHLD is simply ignored.

I went though the code and confirm it.

>
> SIGCHLD + wait() is immune to this problem for other child status
> notifications (albeit with higher overhead).
>
> Unless I've missed something fundamental, signals simply aren't a
> reliable data transport: if you need 100% reliability, you need to be
> using another mechanism, either in combination with a signal, or by
> itself.

Given the right signo, e.g., a RT signal for both models, or SIGUSR1/SIGUSR2
for 1:1 model, the pre-coredump signal notification is 100% reliable, and
it is the simplest solution.

When there are many child processes for the 1:N model, if needed, there is an
API for enlarging queue limit:

setrlimit(RLIMIT_SIGPENDING, xxx);

>
>>>>
>>>> Signed-off-by: Enke Chen <[email protected]>
>>>> Reviewed-by: Oleg Nesterov <[email protected]>
>>>> ---
>>>> v4 -> v5:
>>>> Addressed review comments from Oleg Nesterov:
>>>> o use rcu_read_lock instead.
>>>> o revert back to notify the real_parent.
>>>>
>>>> fs/coredump.c | 23 +++++++++++++++++++++++
>>>> fs/exec.c | 3 +++
>>>> include/linux/sched/signal.h | 3 +++
>>>> include/uapi/linux/prctl.h | 4 ++++
>>>> kernel/sys.c | 13 +++++++++++++
>>>> 5 files changed, 46 insertions(+)
>>>>
>>>> diff --git a/fs/coredump.c b/fs/coredump.c
>>>> index e42e17e..740b1bb 100644
>>>> --- a/fs/coredump.c
>>>> +++ b/fs/coredump.c
>>>> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>>>> return err;
>>>> }
>>>>
>>>> +/*
>>>> + * While do_notify_parent() notifies the parent of a child's death post
>>>> + * its coredump, this function lets the parent (if so desired) know about
>>>> + * the imminent death of a child just prior to its coredump.
>>>> + */
>>>> +static void do_notify_parent_predump(void)
>>>> +{
>>>> + struct task_struct *parent;
>>>> + int sig;
>>>> +
>>>> + rcu_read_lock();
>>>> + parent = rcu_dereference(current->real_parent);
>>>> + sig = parent->signal->predump_signal;
>>>> + if (sig != 0)
>>>> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID);
>>>
>>> Doesn't this send si_code == SI_USER. That seems wrong: the receiving
>>> process wouldn't not be able to distinguish a real pre-coredump
>>> notification from a bogus one sent by kill(2) et
>> The receiving process (i.e., the process manager) knows the PID of all
>> its child processes. Thus it can easily tell whether the signal is from
>> a child or not.
>
> But it can't tell whether the child sent the signal via kill(2) etc., or
> whether the child started dumping core.

Once a signal is chosen and designated for this special purpose, a child
process would not send the signal to its parent via kill(2), right?

>
> It's cleaner to be able to filter reliably on si_code, especially if the
> process you're supervising isn't fully under your control. For example,
> even if the supervised process is considered trustworthy, it could still
> be exploited by an attacker (or simply go wrong) in a way that causes
> it do to a bogus kill().
>
> (If you treat any incoming signal as anything more than a hint, failure
> to check si_code is usually a bug in general.)
>

There is one signal that the application has designated for this special
purpose. The application should take the appropriate action once the signal
is received from its child. If it does not, that is a bug in the application
and not in the kernel.

>>>
>>> SEND_SIG_PRIV also looks wrong, because it assumes that the sender is
>>> "the kernel" so there is no si_pid.
>>
>> That is why SEND_SIG_NOINFO is chosen here for carrying the "pid". What
>> matters to the parent process is the "signo + pid" for identifying the
>> child process for the pre-coredump notification.
>
> I think that si_code should at least be SI_KERNEL, but how that is
> achieved doesn't really matter.
>

Again, it is the "signo + pid" that matter to the parent process. There is
no need to over-specify.


>>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>>> index 123bd73..39aa3b8 100644
>>>> --- a/kernel/sys.c
>>>> +++ b/kernel/sys.c
>>>> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>>>> return -EINVAL;
>>>> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
>>>> break;
>>>> + case PR_SET_PREDUMP_SIG:
>>>> + if (arg3 || arg4 || arg5)
>>>
>>> glibc has
>>>
>>> int prctl(int option, ...);
>>>
>>> Some prctls() police extra arguments for zeros, but this means that
>>> the userspace caller also has to supply pointless 0 arguments.
>>>
>>> It's debatable which is the preferred approach. Did you have any
>>> particular rationale for your choice here?
>>>
>>
>> The initial version did not check the values of these unused arguments.
>> But Jann Horn pointed out the new convention is to enforce the 0 values
>> so I followed ...
>
> Hmmm, I wasn't aware of this convention when I added PR_SVE_SET_VL etc.,
> and there is no clear pattern in sys.c, and nobody commented at the
> time.
>
> Of course, it works either way.

Here is Jann's comment:

--
New prctl() calls should check that the unused arguments are zero, in
order to make it possible to safely add more flags in the future.
--

I can do it either way, but we do need to decide on a convention going forward.

Thanks. -- Enke


2018-11-30 12:07:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

On 11/29, Dave Martin wrote:
>
> SIGCHLD + wait() is immune to this problem for other child status
> notifications (albeit with higher overhead).
>
> Unless I've missed something fundamental, signals simply aren't a
> reliable data transport

Yes. But I hope we are not going to implement WCOREDUMP.

I am not fan of this patch, but at least it is simple.

Oleg.


2018-12-04 22:39:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

On Wed, 28 Nov 2018 16:15:35 -0800 Enke Chen <[email protected]> wrote:

> Thanks for your comments. You have indeed missed some of the prior reviews
> and discussions. But that is OK.

This is why it is best to update the changelog in response to the
review discussion - if person A was wondering about something then it
is likely that person B will wonder about the same thing. Take a
reviewer's questioning as an indication that the changelog was lacking.

I'm not sure what to make of this patchset, really. Oleg sounds
unhappy and that's always a bad sign. And signals are rather ugly
things. Oleg, can you please expand on your concerns?

The arguments against using connector aren't very compelling. Sure,
connector is hard to use from shell scripts but so are signals? And is
there much of a usecase for managing these things with shell scripts
anyway?

Has an eventfd implementation been considered?

I'm sure we've added other kernel->userspace comminication schemes, but
I can't immediately recall them all :(


2018-12-05 06:49:16

by Jann Horn

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

On Thu, Nov 29, 2018 at 3:55 AM Dave Martin <[email protected]> wrote:
> On Thu, Nov 29, 2018 at 12:15:35AM +0000, Enke Chen wrote:
> > Hi, Dave:
> >
> > Thanks for your comments. You have indeed missed some of the prior reviews
> > and discussions. But that is OK.
> >
> > Please see my replies inline.
> >
> > On 11/28/18 7:19 AM, Dave Martin wrote:
> > > On Tue, Nov 27, 2018 at 10:54:41PM +0000, Enke Chen wrote:
> > >> diff --git a/kernel/sys.c b/kernel/sys.c
> > >> index 123bd73..39aa3b8 100644
> > >> --- a/kernel/sys.c
> > >> +++ b/kernel/sys.c
> > >> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
> > >> return -EINVAL;
> > >> error = arch_prctl_spec_ctrl_set(me, arg2, arg3);
> > >> break;
> > >> + case PR_SET_PREDUMP_SIG:
> > >> + if (arg3 || arg4 || arg5)
> > >
> > > glibc has
> > >
> > > int prctl(int option, ...);
> > >
> > > Some prctls() police extra arguments for zeros, but this means that
> > > the userspace caller also has to supply pointless 0 arguments.
> > >
> > > It's debatable which is the preferred approach. Did you have any
> > > particular rationale for your choice here?
> > >
> >
> > The initial version did not check the values of these unused arguments.
> > But Jann Horn pointed out the new convention is to enforce the 0 values
> > so I followed ...
>
> Hmmm, I wasn't aware of this convention when I added PR_SVE_SET_VL etc.,
> and there is no clear pattern in sys.c, and nobody commented at the
> time.
>
> Of course, it works either way.

Looking at the last couple prctls that have been added:

PR_GET_SPECULATION_CTRL/PR_GET_SPECULATION_CTRL: checks unused args
(commit b617cfc858161140d69cc0b5cc211996b557a1c7, by tglx)
PR_SVE_GET_VL/PR_SVE_SET_VL: doesn't check unused args (commit
2d2123bc7c7f843aa9db87720de159a049839862, by Dave Martin)
PR_CAP_AMBIENT: checks unused args (by Andy Lutomirski)
PR_SET_FP_MODE/PR_GET_FP_MODE: doesn't check unused args
PR_MPX_ENABLE_MANAGEMENT/PR_MPX_DISABLE_MANAGEMENT: checks unused
args; this one actually specifically added such checks in commit
e9d1b4f3c60997fe197bf0243cb4a41a44387a88 ("x86, mpx: Strictly enforce
empty prctl() args") and specifically says "should be done for all new
prctl()s":

Description from Michael Kerrisk. He suggested an identical patch
to one I had already coded up and tested.

commit fe3d197f8431 "x86, mpx: On-demand kernel allocation of bounds
tables" added two new prctl() operations, PR_MPX_ENABLE_MANAGEMENT and
PR_MPX_DISABLE_MANAGEMENT. However, no checks were included to ensure
that unused arguments are zero, as is done in many existing prctl()s
and as should be done for all new prctl()s. This patch adds the
required checks.

Suggested-by: Andy Lutomirski <[email protected]>
Suggested-by: Michael Kerrisk <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: Dave Hansen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

2018-12-06 17:31:55

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification

On 12/04, Andrew Morton wrote:
>
> I'm not sure what to make of this patchset, really. Oleg sounds
> unhappy and that's always a bad sign. And signals are rather ugly
> things. Oleg, can you please expand on your concerns?

I don't really know what can I say...

Yes the signals are ugly things, this is one reason. Worse, you need to
use real-time predump_signal if parent has 2 or more children.

And when it comes to child-parent notifications we already have do_wait().
So probably it would be better to implement WCOREDUMP so that the parent
could rely on SIGCHLD + waitpid(WCOREDUMP). Not that I think this is a
good idea and this is much more complex especially if we want to report
this event to ptracer too, while this patch is very simple.

In fact I think that we need something more generic, but I can't suggest
any good idea :/

May be we could simply add a tracepoint or perf_event_coredump, I dunno.
Sure, the usage of perf events is not that simple...

Oleg.