Add/Modify some features of the core_pattern infrastructure:
1) Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
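As an illustration of that scheme (a sketch, not part of the patch): a pipe
helper can opt out of recursive dumping by zeroing its own core limit before
doing any real work:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };

	/* With RLIMIT_CORE == 0 the kernel now refuses to pipe a core
	 * for this process, so a crash in here cannot recurse. */
	if (setrlimit(RLIMIT_CORE, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}

	/* ... consume the core image from stdin ... */
	return 0;
}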
2) Allow for the kernel to wait for a core_pattern process to complete. One of
the things core_pattern processes might do is interrogate the status of a
crashing process via its /proc/pid directory. To ensure that that directory is
not removed prematurely, we wait for the process to exit prior to cleaning it
up.
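To make that concrete, here is a sketch of such a helper, assuming a
core_pattern like "|/usr/local/bin/corehelper %p" (the path and name are
hypothetical); it receives the crashing pid via the %p expansion and can read
the /proc entry safely, since the kernel now waits for it to exit:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[64];
	char line[256];
	FILE *f;

	if (argc < 2)
		return 1;

	/* argv[1] is the crashing pid; /proc/<pid> is guaranteed to
	 * stay around until we exit. */
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	f = fopen(path, "r");
	if (f) {
		while (fgets(line, sizeof(line), f))
			fputs(line, stderr);	/* or parse what you need */
		fclose(f);
	}

	/* ... then drain the core image from stdin ... */
	return 0;
}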
3) Introduce a core_pipe_limit sysctl. As we would like to avoid boundless
process creation in response to large numbers of parallel process crashes, this
sysctl limits the number of parallel dumps we can process. If
core_pipe_limit is set to X, only X dumps will be processed in parallel; dump
X+1 will be dropped and its miss will be logged. Setting this sysctl to zero
will allow limitless parallel dumps, but will also not wait for the
core_pattern process to complete before cleaning up the crashing process, so as
to avoid holding on to all that excess memory if load gets too high.
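One way to exercise the limit (a test sketch; the count of 16 and the sysctl
setting are purely illustrative) is to fork a burst of crashing children and
watch the kernel log once core_pipe_limit is exceeded:

#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int i;

	/* With a pipe in core_pattern and kernel.core_pipe_limit set
	 * below 16, the dumps past the limit should be dropped and
	 * logged instead of spawning 16 helpers at once. Assumes a
	 * non-zero RLIMIT_CORE in the parent shell. */
	for (i = 0; i < 16; i++) {
		if (fork() == 0)
			*(volatile int *)0 = 0;	/* SIGSEGV -> core dump */
	}
	while (wait(NULL) > 0)
		;
	return 0;
}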
This resolves kernel.org bug 13355:
http://bugzilla.kernel.org/show_bug.cgi?id=13355
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
diff --git a/fs/exec.c b/fs/exec.c
index 895823d..d5a635e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,6 +62,8 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
+
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1701,6 +1703,8 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static atomic_t core_dump_count = ATOMIC_INIT(0);
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1717,7 +1721,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ pid_t pid = 0;
+ int dump_count;
audit_core_dumps(signr);
@@ -1784,42 +1789,53 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
if (ispipe) {
+
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE = !0,
+ * but it runs as root, and can do lots of stupid things.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
+
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Capturing too many dumps at once\n");
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
- &file)) {
- printk(KERN_INFO "Core dump to %s pipe failed\n",
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
+ &file, &pid)) {
+ printk(KERN_CRIT "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
+
} else
file = filp_open(corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
if (IS_ERR(file))
- goto fail_unlock;
+ goto fail_dropcount;
inode = file->f_path.dentry->d_inode;
if (inode->i_nlink > 1)
goto close_fail; /* multiple links - don't dump */
@@ -1845,10 +1861,16 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
retval = binfmt->core_dump(signr, regs, file, core_limit);
+ if (ispipe && core_pipe_limit)
+ sys_wait4(pid, NULL, 0, NULL);
+
if (retval)
current->signal->group_exit_code |= 0x80;
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (ispipe)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 384ca8b..07e9f53 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -65,7 +65,8 @@ enum umh_wait {
};
/* Actually execute the sub-process */
-int call_usermodehelper_exec(struct subprocess_info *info, enum umh_wait wait);
+int call_usermodehelper_exec(struct subprocess_info *info, enum umh_wait wait,
+ pid_t *pid);
/* Free the subprocess_info. This is only needed if you're not going
to call call_usermodehelper_exec */
@@ -80,7 +81,7 @@ call_usermodehelper(char *path, char **argv, char **envp, enum umh_wait wait)
info = call_usermodehelper_setup(path, argv, envp, gfp_mask);
if (info == NULL)
return -ENOMEM;
- return call_usermodehelper_exec(info, wait);
+ return call_usermodehelper_exec(info, wait, NULL);
}
static inline int
@@ -95,14 +96,14 @@ call_usermodehelper_keys(char *path, char **argv, char **envp,
return -ENOMEM;
call_usermodehelper_setkeys(info, session_keyring);
- return call_usermodehelper_exec(info, wait);
+ return call_usermodehelper_exec(info, wait, NULL);
}
extern void usermodehelper_init(void);
struct file;
extern int call_usermodehelper_pipe(char *path, char *argv[], char *envp[],
- struct file **filp);
+ struct file **filp, pid_t *pid);
extern int usermodehelper_disable(void);
extern void usermodehelper_enable(void);
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 7e95bed..9828286 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -126,6 +126,7 @@ struct subprocess_info {
char **envp;
enum umh_wait wait;
int retval;
+ pid_t helper_pid;
struct file *stdin;
void (*cleanup)(char **argv, char **envp);
};
@@ -204,7 +205,15 @@ static int wait_for_helper(void *data)
* populate the status, but will return -ECHILD. */
allow_signal(SIGCHLD);
+ /*
+ * Set the pid to zero here, since
+ * The helper process will have exited
+ * by the time the caller reads this
+ */
+ sub_info->helper_pid = 0;
+
pid = kernel_thread(____call_usermodehelper, sub_info, SIGCHLD);
+
if (pid < 0) {
sub_info->retval = pid;
} else {
@@ -253,9 +262,11 @@ static void __call_usermodehelper(struct work_struct *work)
if (wait == UMH_WAIT_PROC || wait == UMH_NO_WAIT)
pid = kernel_thread(wait_for_helper, sub_info,
CLONE_FS | CLONE_FILES | SIGCHLD);
- else
+ else {
pid = kernel_thread(____call_usermodehelper, sub_info,
CLONE_VFORK | SIGCHLD);
+ sub_info->helper_pid = pid;
+ }
switch (wait) {
case UMH_NO_WAIT:
@@ -369,6 +380,7 @@ struct subprocess_info *call_usermodehelper_setup(char *path, char **argv,
sub_info->path = path;
sub_info->argv = argv;
sub_info->envp = envp;
+ sub_info->helper_pid = 0;
sub_info->cred = prepare_usermodehelper_creds();
if (!sub_info->cred) {
kfree(sub_info);
@@ -457,7 +469,7 @@ EXPORT_SYMBOL(call_usermodehelper_stdinpipe);
* (ie. it runs with full root capabilities).
*/
int call_usermodehelper_exec(struct subprocess_info *sub_info,
- enum umh_wait wait)
+ enum umh_wait wait, pid_t *pid)
{
DECLARE_COMPLETION_ONSTACK(done);
int retval = 0;
@@ -481,7 +493,8 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info,
goto unlock;
wait_for_completion(&done);
retval = sub_info->retval;
-
+ if (pid)
+ *pid = sub_info->helper_pid;
out:
call_usermodehelper_freeinfo(sub_info);
unlock:
@@ -502,11 +515,11 @@ EXPORT_SYMBOL(call_usermodehelper_exec);
* lower-level call_usermodehelper_* functions.
*/
int call_usermodehelper_pipe(char *path, char **argv, char **envp,
- struct file **filp)
+ struct file **filp, pid_t *pid)
{
struct subprocess_info *sub_info;
int ret;
-
+
sub_info = call_usermodehelper_setup(path, argv, envp, GFP_KERNEL);
if (sub_info == NULL)
return -ENOMEM;
@@ -515,7 +528,8 @@ int call_usermodehelper_pipe(char *path, char **argv, char **envp,
if (ret < 0)
goto out;
- return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+ ret = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC, pid);
+ return ret;
out:
call_usermodehelper_freeinfo(sub_info);
diff --git a/kernel/sys.c b/kernel/sys.c
index e7998cf..30f7dc8 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1863,7 +1863,7 @@ int orderly_poweroff(bool force)
call_usermodehelper_setcleanup(info, argv_cleanup);
- ret = call_usermodehelper_exec(info, UMH_NO_WAIT);
+ ret = call_usermodehelper_exec(info, UMH_NO_WAIT, NULL);
out:
if (ret && force) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b2970d5..2346c0f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -75,6 +75,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -396,6 +397,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
On Mon, 22 Jun 2009 13:28:18 -0400
Neil Horman <[email protected]> wrote:
> Add/Modify some features of the core_pattern infrastructure:
It would have been preferable to present this as three separate patches
rather than a bundle?
> 1) Change how we detect recursive dumps. Currently we have a mechanism by which
> we try to compare pathnames of the crashing process to the core_pattern path.
> This is broken for a dozen reasons, and just doesn't work in any sort of robust
> way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
> apps set RLIMIT_CORE to zero, we don't write out core files for any process with
> that particular limit set. If the core_pattern is a pipe, any non-zero limit is
> translated to RLIM_INFINITY. This allows complete dumps to be captured, but
> prevents infinite recursion in the event that the core_pattern process itself
> crashes.
Seems reasonable.
> 2) Allow for the kernel to wait for a core_pattern process to complete. One of
> the things core_pattern processes might do is interrogate the status of a
> crashing process via its /proc/pid directory. To ensure that that directory is
> not removed prematurely, we wait for the process to exit prior to cleaning it
> up.
hm.
> 3) Introduce a core_pipe_limit sysctl.
Please document this? Documentation/filesystems/proc.txt, perhaps.
> As we would like to avoid boundless
> process creation in response to large numbers of parallel process crashes, this
> sysctl limits the number of parallel dumps we can process. If
> core_pipe_limit is set to X, only X dumps will be processed in parallel; dump
> X+1 will be dropped and its miss will be logged. Setting this sysctl to zero
> will allow limitless parallel dumps, but will also not wait for the
> core_pattern process to complete before cleaning up the crashing process, so as
> to avoid holding on to all that excess memory if load gets too high.
This seems like it adds unreliability and perhaps DoSability.
Did you consider resolving this problem by throttling (ie: an upper
limit on the number of concurrent dumpers) rather than by discarding?
> +static atomic_t core_dump_count = ATOMIC_INIT(0);
fyi, we decided that the `= ATOMIC_INIT(0)' is no longer needed in this
case - we just use the all-zeroes pattern for atomic_t's.
This won't make any difference if gcc manages to move the atomic_t out of
.data and into .bss. Whether it successfully does that for this
construct I do not know.
Also, core_dump_count could be made static inside do_coredump(), which
is a little neater.
> + if (core_limit == 0) {
> + /*
> + * Normally core limits are irrelevant to pipes, since
> + * we're not writing to the file system, but we use
> + * core_limit of 0 here as a special value. Any
> + * non-zero limit gets set to RLIM_INFINITY below, but
> + * a limit of 0 skips the dump. This is a consistent
> + * way to catch recursive crashes. We can still crash
> + * if the core_pattern binary sets RLIMIT_CORE = !0,
> + * but it runs as root, and can do lots of stupid things.
> + */
> + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> + task_tgid_vnr(current), current->comm);
whoa. What does task_tgid_vnr() do?
<reads that function and three levels of callee>
<gets grumpy at undocumentedness, stops and then guesses>
Thread group leader's PID relative to its pid namespace?
> + printk(KERN_WARNING "Aborting core\n");
> + goto fail_unlock;
> + }
> +
> helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
> if (!helper_argv) {
> printk(KERN_WARNING "%s failed to allocate memory\n",
> __func__);
> goto fail_unlock;
> }
> - /* Terminate the string before the first option */
> - delimit = strchr(corename, ' ');
> - if (delimit)
> - *delimit = '\0';
> - delimit = strrchr(helper_argv[0], '/');
> - if (delimit)
> - delimit++;
> - else
> - delimit = helper_argv[0];
> - if (!strcmp(delimit, current->comm)) {
> - printk(KERN_NOTICE "Recursive core dump detected, "
> - "aborting\n");
> - goto fail_unlock;
> +
> + dump_count = atomic_inc_return(&core_dump_count);
> + if (core_pipe_limit && (core_pipe_limit < dump_count)) {
> + printk(KERN_WARNING "Capturing too many dumps at once\n");
hm, unprivileged-user-triggerable printk. Doesn't matter much.
However if we're really going to emit userspace runtime debugging info
from the kernel, those messages should provide some means by which the
offending userspace can be identified. Which container, user,
application, etc is causing the problem? On a really large and busy
system, a message like this could be quite puzzling.
> + /*
> + * Set the pid to zero here, since
> + * The helper process will have exited
> + * by the time the caller reads this
> + */
There's no need to cram the comments into 50 columns.
This sentence has a random capitalised word in the middle, and no full-stop ;)
> @@ -481,7 +493,8 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info,
> goto unlock;
> wait_for_completion(&done);
> retval = sub_info->retval;
> -
> + if (pid)
> + *pid = sub_info->helper_pid;
I query the addition of helper_pid.
That pid already gets returned in sub_info->retval, does it not? Did
we just duplicate that?
If we indeed do need this new field, can this apparent duplication be
cleaned up? For example, can the copying of the PID into ->retval be
removed, and switch all pid-using sites over to use the new
->helper_pid?
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index b2970d5..2346c0f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -75,6 +75,7 @@ extern int max_threads;
> extern int core_uses_pid;
> extern int suid_dumpable;
> extern char core_pattern[];
> +extern unsigned int core_pipe_limit;
Bah.
We should fix this one day.
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "core_pipe_limit",
> + .data = &core_pipe_limit,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
Please ensure that the core_pipe_limit documentation describes its
units.
It's "number of concurrent coredumps, system-wide", yes?
It does make coredumping unreliable, doesn't it? :(
On Thu, Jun 25, 2009 at 04:30:50PM -0700, Andrew Morton wrote:
> On Mon, 22 Jun 2009 13:28:18 -0400
> Neil Horman <[email protected]> wrote:
>
> > Add/Modify some features of the core_pattern infrastructure:
>
> It would have been preferable to present this as three separate patches
> rather than a bundle?
>
Arguably, yes. I can certainly split the recursive dump detection out into a
separate patch if you like, but I'd rather keep the sysctl and core wait bits
together. They aren't explicitly dependent on one another, but the sysctl
provides a mechanism to limit the backlog of waiting dumps, which I think is
important for people to have if they're going to use the core waiting feature.
> > 1) Change how we detect recursive dumps. Currently we have a mechanism by which
> > we try to compare pathnames of the crashing process to the core_pattern path.
> > This is broken for a dozen reasons, and just doesn't work in any sort of robust
> > way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
> > apps set RLIMIT_CORE to zero, we don't write out core files for any process with
> > that particular limit set. If the core_pattern is a pipe, any non-zero limit is
> > translated to RLIM_INFINITY. This allows complete dumps to be captured, but
> > prevents infinite recursion in the event that the core_pattern process itself
> > crashes.
>
> Seems reasonable.
>
Cool, thanks!
> > 2) Allow for the kernel to wait for a core_pattern process to complete. One of
> > the things core_pattern processes might do is interrogate the status of a
> > crashing process via its /proc/pid directory. To ensure that that directory is
> > not removed prematurely, we wait for the process to exit prior to cleaning it
> > up.
>
> hm.
>
Apport (which this patch was initially written for) does exactly this, and it
normally works because it gets the data it needs out of proc quickly, before the
kernel removes it; this just guarantees that behavior.
> > 3) Introduce a core_pipe_limit sysctl.
>
> Please document this? Documentation/filesystems/proc.txt, perhaps.
>
Sure, I can do that.
> > As we would like to avoid boundless
> > process creation in response to large numbers of parallel process crashes, this
> > sysctl limits the number of parallel dumps we can process. If
> > core_pipe_limit is set to X, only X dumps will be processed in parallel; dump
> > X+1 will be dropped and its miss will be logged. Setting this sysctl to zero
> > will allow limitless parallel dumps, but will also not wait for the
> > core_pattern process to complete before cleaning up the crashing process, so as
> > to avoid holding on to all that excess memory if load gets too high.
>
> This seems like it adds unreliability and perhaps DoSability.
>
Quite the opposite, actually. If this sysctl is zero, no waiting takes place.
Any non-zero value allows only that many processes to wait in do_coredump in
parallel. Any crashes above that simply get logged to syslog, and the
userspace helper isn't forked. It prevents an infinite number of core_pattern
helper processes from running in parallel and blocking forever.
> Did you consider resolving this problem by throttling (ie: an upper
> limit on the number of concurrent dumpers) rather than by discarding?
That's exactly what it does. Perhaps I didn't explain it clearly enough. I'll
repost in the morning, separating the patch into two segments (the recursion
detection and the sysctl+core waiting), and I'll add documentation in proc.txt
that clarifies the meaning and use of the sysctl.
Thanks for the feedback!
Neil
On Thu, Jun 25, 2009 at 04:30:50PM -0700, Andrew Morton wrote:
>
<snip>
Regarding the rest of your comments below:
> > +static atomic_t core_dump_count = ATOMIC_INIT(0);
>
> fyi, we decided that the `= ATOMIC_INIT(0)' is no longer needed in this
> case - we just use the all-zeroes pattern for atomic_t's.
>
> This won't make any difference if gcc manages to move the atomic_t out of
> .data and into .bss. Whether it successfully does that for this
> construct I do not know.
>
Understood, thanks for the heads up.
>
> Also, core_dump_count could be made static inside do_coredump(), which
> is a little neater.
>
I'll make this update when I repost, sure.
> > + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> > + task_tgid_vnr(current), current->comm);
>
> whoa. What does task_tgid_vnr() do?
>
> <reads that function and three levels of callee>
>
> <gets grumpy at undocumentedness, stops and then guesses>
>
> Thread group leader's PID relative to its pid namespace?
>
That's exactly right. It's the same way we compute pid when using the %p
expansion in format_corename. I'll add comments regarding its use in the
updated patch.
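For readers unfamiliar with the tgid/tid split, the userspace analogue (a
standalone sketch; build with -lpthread) is that getpid() reports the thread
group id for every thread while gettid() is per-thread; task_tgid_vnr(current)
is the in-kernel, pid-namespace-aware counterpart of getpid():

#include <pthread.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static void *thread_fn(void *unused)
{
	/* Every thread prints the same pid (the tgid) but its own tid. */
	printf("pid=%d tid=%ld\n", getpid(), (long)syscall(SYS_gettid));
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, thread_fn, NULL);
	pthread_join(t, NULL);
	thread_fn(NULL);
	return 0;
}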
> > + printk(KERN_WARNING "Aborting core\n");
> > + goto fail_unlock;
> > + }
> > +
> > helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
> > if (!helper_argv) {
> > printk(KERN_WARNING "%s failed to allocate memory\n",
> > __func__);
> > goto fail_unlock;
> > }
> > - /* Terminate the string before the first option */
> > - delimit = strchr(corename, ' ');
> > - if (delimit)
> > - *delimit = '\0';
> > - delimit = strrchr(helper_argv[0], '/');
> > - if (delimit)
> > - delimit++;
> > - else
> > - delimit = helper_argv[0];
> > - if (!strcmp(delimit, current->comm)) {
> > - printk(KERN_NOTICE "Recursive core dump detected, "
> > - "aborting\n");
> > - goto fail_unlock;
> > +
> > + dump_count = atomic_inc_return(&core_dump_count);
> > + if (core_pipe_limit && (core_pipe_limit < dump_count)) {
> > + printk(KERN_WARNING "Capturing too many dumps at once\n");
>
> hm, unprivileged-user-triggerable printk. Doesn't matter much.
>
> However if we're really going to emit userspace runtime debugging info
> from the kernel, those messages should provide some means by which the
> offending userspace can be identified. Which container, user,
> application, etc is causing the problem? On a really large and busy
> system, a message like this could be quite puzzling.
>
Ok, I'll make the appropriate updates for repost here.
> > + /*
> > + * Set the pid to zero here, since
> > + * The helper process will have exited
> > + * by the time the caller reads this
> > + */
>
> There's no need to cram the comments into 50 columns.
>
> This sentence has a random capitalised word in the middle, and no full-stop ;)
>
Copy that, me fix bad typing soon :)
> > @@ -481,7 +493,8 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info,
> > goto unlock;
> > wait_for_completion(&done);
> > retval = sub_info->retval;
> > -
> > + if (pid)
> > + *pid = sub_info->helper_pid;
>
> I query the addition of helper_pid.
>
> That pid already gets returned in sub_info->retval, does it not? Did
> we just duplicate that?
>
> If we indeed do need this new field, can this apparent duplication be
> cleaned up? For example, can the copying of the PID into ->retval be
> removed, and switch all pid-using sites over to use the new
> ->helper_pid?
>
IIRC, when I did this initially, it was done because in the case of using
call_usermodehelper with UMH_NO_WAIT, retval contained the thread we create to
do the usermodehelper waiting for us, rather than the usermode helper itself. I
had thought this would be cleaner, so as not to mess with any other callers, but
I'll look at removing it.
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index b2970d5..2346c0f 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -75,6 +75,7 @@ extern int max_threads;
> > extern int core_uses_pid;
> > extern int suid_dumpable;
> > extern char core_pattern[];
> > +extern unsigned int core_pipe_limit;
>
> Bah.
>
> We should fix this one day.
>
Is there a particular way you'd like to see it fixed? I can certainly clean it
up (both core_pipe_limit, and the other core related sysctl variables).
> > + {
> > + .ctl_name = CTL_UNNUMBERED,
> > + .procname = "core_pipe_limit",
> > + .data = &core_pipe_limit,
> > + .maxlen = sizeof(unsigned int),
> > + .mode = 0644,
> > + .proc_handler = &proc_dointvec,
> > + },
>
> Please ensure that the core_pipe_limit documentation describes its
> units.
>
Will do.
> It's "number of concurrent coredumps, system-wide", yes?
>
Correct.
> It does make coredumping unreliable, doesn't it? :(
>
If it's set to a non-zero value, and we're using a pipe in core_pattern, yes.
However, I would argue that it's userspace introducing the unreliability, rather
than this control by itself. If the user space core catching process has a bug
and runs forever, we could crash the system simply by chewing up system resources
if we create lots of segfaulting programs. This is really meant to limit that
behavior. The default, however, is to set this value to zero, which causes no
waiting on user processes, but allows unlimited core dumps same as before, so no
limitations there.
I'll rework the patch with your notes, and repost soon. Thanks!
Neil
On Fri, 26 Jun 2009 06:48:04 -0400 Neil Horman <[email protected]> wrote:
> > > --- a/kernel/sysctl.c
> > > +++ b/kernel/sysctl.c
> > > @@ -75,6 +75,7 @@ extern int max_threads;
> > > extern int core_uses_pid;
> > > extern int suid_dumpable;
> > > extern char core_pattern[];
> > > +extern unsigned int core_pipe_limit;
> >
> > Bah.
> >
> > We should fix this one day.
> >
> Is there a particular way you'd like to see it fixed?
Not in the context of this patch.
One way would be to add a new sysctl-externs.h and then put all the
declarations in there. That file gets included by sysctl.c and by each
file which shares a global with sysctl.c
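A sketch of that idea (the header name follows the suggestion above; the file
is hypothetical and the entries abbreviated):

/* include/linux/sysctl-externs.h -- hypothetical single home for the
 * globals that kernel/sysctl.c shares with their defining files. */
#ifndef _LINUX_SYSCTL_EXTERNS_H
#define _LINUX_SYSCTL_EXTERNS_H

extern int core_uses_pid;		/* defined in fs/exec.c */
extern char core_pattern[];		/* defined in fs/exec.c */
extern unsigned int core_pipe_limit;	/* defined in fs/exec.c */
extern int suid_dumpable;		/* defined in fs/exec.c */
/* ... and so on for the rest of the externs currently in sysctl.c ... */

#endif /* _LINUX_SYSCTL_EXTERNS_H */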
On Fri, Jun 26, 2009 at 09:20:01AM -0700, Andrew Morton wrote:
> On Fri, 26 Jun 2009 06:48:04 -0400 Neil Horman <[email protected]> wrote:
>
> > > > --- a/kernel/sysctl.c
> > > > +++ b/kernel/sysctl.c
> > > > @@ -75,6 +75,7 @@ extern int max_threads;
> > > > extern int core_uses_pid;
> > > > extern int suid_dumpable;
> > > > extern char core_pattern[];
> > > > +extern unsigned int core_pipe_limit;
> > >
> > > Bah.
> > >
> > > We should fix this one day.
> > >
> > Is there a particular way you'd like to see it fixed?
>
> Not in the context of this patch.
>
> One way would be to add a new sysctl-externs.h and then put all the
> declarations in there. That file gets included by sysctl.c and by each
> file which shares a global with sysctl.c
>
Thanks, I'll keep that in mind, and look into squaring it away when I have a few
extra cycles.
Neil
On Thu, Jun 25, 2009 at 04:30:50PM -0700, Andrew Morton wrote:
<snip>
Ok, here's a new version of the changeset, taking into account all of your notes,
Andrew. Patches tested by me on a kernel with your latest -mm patch and it all
works well.
This resolves kernel.org bug 13355:
http://bugzilla.kernel.org/show_bug.cgi?id=13355
Change Notes:
1) Split the patch into two smaller patches. I kept the sysctl with the wait
code for the reasons we discussed earlier.
2) core_dump_count is now statically declared within the scope of the
do_coredump function
3) Commented more clearly the use of the task_tgid_vnr call
4) Made the printk indicating that a crash is over the core dump limit more
useful
5) Fixed some misc. grammatical and formatting errors
6) Removed the helper_pid variable. Turns out it wasn't needed with a minor
change to the usermodehelper code.
7) Documented the sysctl in Documentation/sysctl/kernel.txt
Regards
Neil
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 32 +++++++++++++++++++-------------
1 file changed, 19 insertions(+), 13 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index ebe359f..163cfa7 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1802,22 +1802,28 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
goto fail_unlock;
if (ispipe) {
- helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE = !0,
+ * but it runs as root, and can do lots of stupid things.
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * thread group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
goto fail_unlock;
}
+ helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
+
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
Allow for the kernel to wait for a core_pattern process to complete
One of the things core_pattern processes might do is interrogate the status of a
crashing process via its /proc/pid directory. To ensure that that directory is
not removed prematurely, we wait for the process to exit prior to cleaning it
up.
Since the addition of this feature makes it possible to block the reaping of a
crashed process (if the collecting process never exits), also introduce a new
sysctl: core_pipe_limit. This sysctl, when non-zero, defines the maximum number
of crashing processes that can be collected in parallel. Processes exceeding
this limit are noted in the kernel log buffer and their cores are skipped. If
core_pipe_limit is zero (the default), this feature is disabled entirely. In
other words, the number of cores which may be collected in parallel is
unlimited, but access to a crashing process's /proc/pid directory is not
guaranteed, as the kernel will not wait for the collecting process to exit.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 28 +++++++++++++++++++++++-----
include/linux/kmod.h | 9 +++++----
kernel/kmod.c | 18 ++++++++++++------
kernel/sys.c | 2 +-
kernel/sysctl.c | 9 +++++++++
6 files changed, 72 insertions(+), 16 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a4ccdd1..bcd6a7c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -20,6 +20,7 @@ show up in /proc/sys/kernel:
- acct
- core_pattern
- core_uses_pid
+- core_pipe_limit
- ctrl-alt-del
- dentry-state
- domainname
@@ -123,6 +124,27 @@ the filename.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
ctrl-alt-del:
When the value in this file is 0, ctrl-alt-del is trapped and
diff --git a/fs/exec.c b/fs/exec.c
index 163cfa7..9a5f40c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -64,6 +64,8 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
+
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1735,7 +1737,9 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ pid_t pid = 0;
+ int dump_count;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1824,21 +1828,29 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
- &file)) {
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
+ &file, &pid) < 0) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else
file = filp_open(corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
if (IS_ERR(file))
- goto fail_unlock;
+ goto fail_dropcount;
inode = file->f_path.dentry->d_inode;
if (inode->i_nlink > 1)
goto close_fail; /* multiple links - don't dump */
@@ -1864,10 +1876,16 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
retval = binfmt->core_dump(signr, regs, file, core_limit);
+ if (ispipe && core_pipe_limit)
+ sys_wait4(pid, NULL, 0, NULL);
+
if (retval)
current->signal->group_exit_code |= 0x80;
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (ispipe)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 92213a9..4085ada 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -60,7 +60,8 @@ enum umh_wait {
};
/* Actually execute the sub-process */
-int call_usermodehelper_exec(struct subprocess_info *info, enum umh_wait wait);
+int call_usermodehelper_exec(struct subprocess_info *info, enum umh_wait wait,
+ pid_t *pid);
/* Free the subprocess_info. This is only needed if you're not going
to call call_usermodehelper_exec */
@@ -75,7 +76,7 @@ call_usermodehelper(char *path, char **argv, char **envp, enum umh_wait wait)
info = call_usermodehelper_setup(path, argv, envp, gfp_mask);
if (info == NULL)
return -ENOMEM;
- return call_usermodehelper_exec(info, wait);
+ return call_usermodehelper_exec(info, wait, NULL);
}
static inline int
@@ -90,14 +91,14 @@ call_usermodehelper_keys(char *path, char **argv, char **envp,
return -ENOMEM;
call_usermodehelper_setkeys(info, session_keyring);
- return call_usermodehelper_exec(info, wait);
+ return call_usermodehelper_exec(info, wait, NULL);
}
extern void usermodehelper_init(void);
struct file;
extern int call_usermodehelper_pipe(char *path, char *argv[], char *envp[],
- struct file **filp);
+ struct file **filp, pid_t *pid);
extern int usermodehelper_disable(void);
extern void usermodehelper_enable(void);
diff --git a/kernel/kmod.c b/kernel/kmod.c
index c935b9b..8660ada 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -203,6 +203,7 @@ static int wait_for_helper(void *data)
allow_signal(SIGCHLD);
pid = kernel_thread(____call_usermodehelper, sub_info, SIGCHLD);
+
if (pid < 0) {
sub_info->retval = pid;
} else {
@@ -251,9 +252,11 @@ static void __call_usermodehelper(struct work_struct *work)
if (wait == UMH_WAIT_PROC || wait == UMH_NO_WAIT)
pid = kernel_thread(wait_for_helper, sub_info,
CLONE_FS | CLONE_FILES | SIGCHLD);
- else
+ else {
pid = kernel_thread(____call_usermodehelper, sub_info,
CLONE_VFORK | SIGCHLD);
+ sub_info->retval = pid;
+ }
switch (wait) {
case UMH_NO_WAIT:
@@ -367,6 +370,7 @@ struct subprocess_info *call_usermodehelper_setup(char *path, char **argv,
sub_info->path = path;
sub_info->argv = argv;
sub_info->envp = envp;
+ sub_info->retval = 0;
sub_info->cred = prepare_usermodehelper_creds();
if (!sub_info->cred)
return NULL;
@@ -453,7 +457,7 @@ EXPORT_SYMBOL(call_usermodehelper_stdinpipe);
* (ie. it runs with full root capabilities).
*/
int call_usermodehelper_exec(struct subprocess_info *sub_info,
- enum umh_wait wait)
+ enum umh_wait wait, pid_t *pid)
{
DECLARE_COMPLETION_ONSTACK(done);
int retval = 0;
@@ -477,7 +481,8 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info,
goto unlock;
wait_for_completion(&done);
retval = sub_info->retval;
-
+ if (retval > 0)
+ *pid = sub_info->retval;
out:
call_usermodehelper_freeinfo(sub_info);
unlock:
@@ -498,11 +503,11 @@ EXPORT_SYMBOL(call_usermodehelper_exec);
* lower-level call_usermodehelper_* functions.
*/
int call_usermodehelper_pipe(char *path, char **argv, char **envp,
- struct file **filp)
+ struct file **filp, pid_t *pid)
{
struct subprocess_info *sub_info;
int ret;
-
+
sub_info = call_usermodehelper_setup(path, argv, envp, GFP_KERNEL);
if (sub_info == NULL)
return -ENOMEM;
@@ -511,7 +516,8 @@ int call_usermodehelper_pipe(char *path, char **argv, char **envp,
if (ret < 0)
goto out;
- return call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC);
+ ret = call_usermodehelper_exec(sub_info, UMH_WAIT_EXEC, pid);
+ return ret;
out:
call_usermodehelper_freeinfo(sub_info);
diff --git a/kernel/sys.c b/kernel/sys.c
index 73b72a3..969b713 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1858,7 +1858,7 @@ int orderly_poweroff(bool force)
call_usermodehelper_setcleanup(info, argv_cleanup);
- ret = call_usermodehelper_exec(info, UMH_NO_WAIT);
+ ret = call_usermodehelper_exec(info, UMH_NO_WAIT, NULL);
out:
if (ret && force) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d14953a..3c0de70 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -75,6 +75,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -387,6 +388,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
On Fri, 26 Jun 2009 14:02:22 -0400
Neil Horman <[email protected]> wrote:
>
> core_pattern: Change how we detect recursive dumps with core_pattern pipes
>
> Change how we detect recursive dumps. Currently we have a mechanism by which
> we try to compare pathnames of the crashing process to the core_pattern path.
> This is broken for a dozen reasons, and just doesn't work in any sort of robust
> way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
> apps set RLIMIT_CORE to zero, we don't write out core files for any process with
> that particular limit set. It the core_pattern is a pipe, any non-zero limit is
> translated to RLIM_INFINITY. This allows complete dumps to be captured, but
> prevents infinite recursion in the event that the core_pattern process itself
> crashes.
>
The patch appears to be against 2.6.30 or something. I get rejects due
to some other patch in exec.c which was added three weeks ago. Please
don't do that :(
>
>
> exec.c | 32 +++++++++++++++++++-------------
> 1 file changed, 19 insertions(+), 13 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index ebe359f..163cfa7 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1802,22 +1802,28 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
> goto fail_unlock;
>
> if (ispipe) {
> - helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
> - /* Terminate the string before the first option */
> - delimit = strchr(corename, ' ');
> - if (delimit)
> - *delimit = '\0';
> - delimit = strrchr(helper_argv[0], '/');
> - if (delimit)
> - delimit++;
> - else
> - delimit = helper_argv[0];
> - if (!strcmp(delimit, current->comm)) {
> - printk(KERN_NOTICE "Recursive core dump detected, "
> - "aborting\n");
> + if (core_limit == 0) {
> + /*
> + * Normally core limits are irrelevant to pipes, since
> + * we're not writing to the file system, but we use
> + * core_limit of 0 here as a speacial value. Any
> + * non-zero limit gets set to RLIM_INFINITY below, but
> + * a limit of 0 skips the dump. This is a consistent
> + * way to catch recursive crashes. We can still crash
> + * if the core_pattern binary sets RLIM_CORE = !0
> + * but it runs as root, and can do lots of stupid things
> + * Note that we use task_tgid_vnr here to grab the pid of the
> + * process group leader. That way we get the right pid if a thread
> + * in a multi-threaded core_pattern process dies.
> + */
> + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> + task_tgid_vnr(current), current->comm);
> + printk(KERN_WARNING "Aborting core\n");
> goto fail_unlock;
> }
A few cosmetic things:
- The asterisks don't line up in the comment block. Normally we'll do

	/*
	 *
	 *

rather than

	/*
	*
	*
- The comment overflows 80 columns and makes a mess.
- Would it not be neater to do this check in a separate function?
Then the comment block can go above the function rather than being
all scrunched to the right and do_coredump() (which is already >150
lines) just gets
	if (ispipe) {
+		if (core_limit_is_zero())
+			goto fail_unlock;
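Something along these lines would do it (a sketch of the suggested
refactoring; the function name comes from the example above and the messages
from Neil's patch):

	/*
	 * Pipe helpers run with RLIMIT_CORE == 0, so a zero limit on a
	 * crashing process marks a recursive dump; see the comment that
	 * currently sits inside do_coredump().
	 */
	static int core_limit_is_zero(unsigned long core_limit)
	{
		if (core_limit != 0)
			return 0;

		printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
			task_tgid_vnr(current), current->comm);
		printk(KERN_WARNING "Aborting core\n");
		return 1;
	}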
On 06/26, Neil Horman wrote:
>
> - if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
> - &file)) {
> + if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
> + &file, &pid) < 0) {
...
> + if (ispipe && core_pipe_limit)
> + sys_wait4(pid, NULL, 0, NULL);
How can this work? sys_wait4() can only wait for our children, but the parent
of the dumping thread is khelper.
If we change kmod.c, then I think we need UMH_WAIT_CUSTOM_COMPLETION.
But perhaps do_coredump() can just sleep on file->...->i_pipe->wait waiting
for pipe_inode_info->readers == 0 ?
Oleg.
On 06/26, Neil Horman wrote:
>
> + if (core_limit == 0) {
> + /*
> + * Normally core limits are irrelevant to pipes, since
> + * we're not writing to the file system, but we use
> + * core_limit of 0 here as a speacial value. Any
> + * non-zero limit gets set to RLIM_INFINITY below, but
> + * a limit of 0 skips the dump. This is a consistent
> + * way to catch recursive crashes. We can still crash
> + * if the core_pattern binary sets RLIM_CORE = !0
> + * but it runs as root, and can do lots of stupid things
> + * Note that we use task_tgid_vnr here to grab the pid of the
> + * process group leader. That way we get the right pid if a thread
> + * in a multi-threaded core_pattern process dies.
> + */
> + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> + task_tgid_vnr(current), current->comm);
> + printk(KERN_WARNING "Aborting core\n");
Andrew has already pointed this out: an unprivileged-user-triggerable
printk.
Doesn't look good; if core_pattern starts with "|", any user can set
RLIMIT_CORE = 0 and then just do
	for (;;)
		if (pid = fork())
			kill(pid, SIGQUIT);
to DOS printk/syslog, no?
Oleg.
On Fri, Jun 26, 2009 at 12:37:23PM -0700, Andrew Morton wrote:
> On Fri, 26 Jun 2009 14:02:22 -0400
> Neil Horman <[email protected]> wrote:
>
> >
> > core_pattern: Change how we detect recursive dumps with core_pattern pipes
> >
> > Change how we detect recursive dumps. Currently we have a mechanism by which
> > we try to compare pathnames of the crashing process to the core_pattern path.
> > This is broken for a dozen reasons, and just doesn't work in any sort of robust
> > way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
> > apps set RLIMIT_CORE to zero, we don't write out core files for any process with
> > that particular limit set. It the core_pattern is a pipe, any non-zero limit is
> > translated to RLIM_INFINITY. This allows complete dumps to be captured, but
> > prevents infinite recursion in the event that the core_pattern process itself
> > crashes.
> >
>
> The patch appears to be against 2.6.30 or something. I get rejects due
> to some other patch in exec.c which was added three weeks ago. Please
> don't do that :(
>
No, this patch is against a branch I made from the 2.6.28-rc2 tag, to which I
cleanly applied your -mm patch that I got from kernel.org.
> >
> >
> > exec.c | 32 +++++++++++++++++++-------------
> > 1 file changed, 19 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/exec.c b/fs/exec.c
> > index ebe359f..163cfa7 100644
> > --- a/fs/exec.c
> > +++ b/fs/exec.c
> > @@ -1802,22 +1802,28 @@ int do_coredump(long signr, int exit_code, struct pt_regs * regs)
> > goto fail_unlock;
> >
> > if (ispipe) {
> > - helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
> > - /* Terminate the string before the first option */
> > - delimit = strchr(corename, ' ');
> > - if (delimit)
> > - *delimit = '\0';
> > - delimit = strrchr(helper_argv[0], '/');
> > - if (delimit)
> > - delimit++;
> > - else
> > - delimit = helper_argv[0];
> > - if (!strcmp(delimit, current->comm)) {
> > - printk(KERN_NOTICE "Recursive core dump detected, "
> > - "aborting\n");
> > + if (core_limit == 0) {
> > + /*
> > + * Normally core limits are irrelevant to pipes, since
> > + * we're not writing to the file system, but we use
> > + * core_limit of 0 here as a speacial value. Any
> > + * non-zero limit gets set to RLIM_INFINITY below, but
> > + * a limit of 0 skips the dump. This is a consistent
> > + * way to catch recursive crashes. We can still crash
> > + * if the core_pattern binary sets RLIM_CORE = !0
> > + * but it runs as root, and can do lots of stupid things
> > + * Note that we use task_tgid_vnr here to grab the pid of the
> > + * process group leader. That way we get the right pid if a thread
> > + * in a multi-threaded core_pattern process dies.
> > + */
> > + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> > + task_tgid_vnr(current), current->comm);
> > + printk(KERN_WARNING "Aborting core\n");
> > goto fail_unlock;
> > }
>
> A few cosmetic things:
>
> - The asterisks don't line up in the comment block. Normally we'll do
>
> /*
> *
> *
>
> rather than
>
> /*
> *
> *
>
I'll fix that
> - The comment overflows 80 columns and makes a mess.
>
> - Would it not be neater to do this check in a separate function?
> Then the comment block can go above the function rather than being
> all scrunched to the right and do_coredump() (which is already >150
> lines) just gets
>
> if (ispipe) {
> + if (core_limit_is_zero())
> + goto fail_unlock;
Yeah, I can do that.
Neil
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Fri, Jun 26, 2009 at 06:48:19PM +0200, Oleg Nesterov wrote:
> On 06/26, Neil Horman wrote:
> >
> > - if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
> > - &file)) {
> > + if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
> > + &file, &pid) < 0) {
>
> ...
>
>
> > + if (ispipe && core_pipe_limit)
> > + sys_wait4(pid, NULL, 0, NULL);
>
> How this can work? sys_wait4() can only wait for our childs, but the parent
> of the dumping thread is khelper.
>
>
> If we change kmod.c, then I think we need UMH_WAIT_CUSTOM_COMPLETION.
>
>
> But perhaps do_coredump() can just sleep on file->...->i_pipe->wait waiting
> for pipe_inode_info->readers == 0 ?
>
That's a good idea, I like that better. I'll fix that up and repost Monday
morning.
Neil
> Oleg.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Fri, Jun 26, 2009 at 06:59:08PM +0200, Oleg Nesterov wrote:
> On 06/26, Neil Horman wrote:
> >
> > + if (core_limit == 0) {
> > + /*
> > + * Normally core limits are irrelevant to pipes, since
> > + * we're not writing to the file system, but we use
> > + * core_limit of 0 here as a speacial value. Any
> > + * non-zero limit gets set to RLIM_INFINITY below, but
> > + * a limit of 0 skips the dump. This is a consistent
> > + * way to catch recursive crashes. We can still crash
> > + * if the core_pattern binary sets RLIM_CORE = !0
> > + * but it runs as root, and can do lots of stupid things
> > + * Note that we use task_tgid_vnr here to grab the pid of the
> > + * process group leader. That way we get the right pid if a thread
> > + * in a multi-threaded core_pattern process dies.
> > + */
> > + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> > + task_tgid_vnr(current), current->comm);
> > + printk(KERN_WARNING "Aborting core\n");
>
> Andrew has already pointed out this, unprivileged-user-triggerable
> printk.
>
> Doesn't look good, if core_pattern starts with "|" any user can set
> RLIMIT_CORE = 0 and then just do
>
> for (;;)
> if (pid = fork())
> kill(pid, SIGQUIT);
>
> to DOS printk/syslog, no?
>
I don't think SIGQUIT will trigger this, but SIGSEGV will. Regardless, if you
do that, I would think you have bigger problems on your system. I would be
silent about this, but I'm not sure that's a better solution.
Neil
> Oleg.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On 06/26, Neil Horman wrote:
>
> On Fri, Jun 26, 2009 at 06:59:08PM +0200, Oleg Nesterov wrote:
> >
> > Doesn't look good, if core_pattern starts with "|" any user can set
> > RLIMIT_CORE = 0 and then just do
> >
> > for (;;)
> > if (pid = fork())
> > kill(pid, SIGQUIT);
> >
> > to DOS printk/syslog, no?
> >
> I don't think SIGQUIT will trigger this, but SIGSEGV will.
SIGQUIT falls into SIG_KERNEL_COREDUMP_MASK too.
> Regardless, if you
> do that, I would think you have bigger problems on your system.
OK, agreed.
Neil, I think you are doing useful changes, but can't we cleanup
do_coredump() first?
I just can't look at unnecessary "if (ispipe)" checks...
Oleg.
Given that do_coredump() calls put_cred() on the exit path, it is a bit ugly
to do put_cred() + "goto fail" twice; just add a new "fail_creds" label.
Signed-off-by: Oleg Nesterov <[email protected]>
--- WAIT/fs/exec.c~CD_1_PUT_CRED 2009-06-15 12:48:28.000000000 +0200
+++ WAIT/fs/exec.c 2009-06-26 20:06:01.000000000 +0200
@@ -1732,13 +1732,11 @@ void do_coredump(long signr, int exit_co
binfmt = current->binfmt;
if (!binfmt || !binfmt->core_dump)
- goto fail;
+ goto out;
cred = prepare_creds();
- if (!cred) {
- retval = -ENOMEM;
- goto fail;
- }
+ if (!cred)
+ goto out;
down_write(&mm->mmap_sem);
/*
@@ -1746,8 +1744,7 @@ void do_coredump(long signr, int exit_co
*/
if (mm->core_state || !get_dumpable(mm)) {
up_write(&mm->mmap_sem);
- put_cred(cred);
- goto fail;
+ goto fail_creds;
}
/*
@@ -1761,10 +1758,8 @@ void do_coredump(long signr, int exit_co
}
retval = coredump_wait(exit_code, &core_state);
- if (retval < 0) {
- put_cred(cred);
- goto fail;
- }
+ if (retval < 0)
+ goto fail_creds;
old_cred = override_creds(cred);
@@ -1862,9 +1857,10 @@ fail_unlock:
if (helper_argv)
argv_free(helper_argv);
+ coredump_finish(mm);
revert_creds(old_cred);
+fail_creds:
put_cred(cred);
- coredump_finish(mm);
-fail:
+out:
return;
}
do_coredump() does a lot of file checks which are only needed when !ispipe;
move them into the "else" branch.
Signed-off-by: Oleg Nesterov <[email protected]>
--- WAIT/fs/exec.c~CD_2_ISPIPE 2009-06-26 20:06:01.000000000 +0200
+++ WAIT/fs/exec.c 2009-06-26 20:59:57.000000000 +0200
@@ -1784,9 +1784,6 @@ void do_coredump(long signr, int exit_co
* at which point file size limits and permissions will be imposed
* as it does with any other process
*/
- if ((!ispipe) && (core_limit < binfmt->min_coredump))
- goto fail_unlock;
-
if (ispipe) {
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
@@ -1818,39 +1815,43 @@ void do_coredump(long signr, int exit_co
corename);
goto fail_unlock;
}
- } else
+ } else {
+ if (core_limit < binfmt->min_coredump)
+ goto fail_unlock;
+
file = filp_open(corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
- if (IS_ERR(file))
- goto fail_unlock;
- inode = file->f_path.dentry->d_inode;
- if (inode->i_nlink > 1)
- goto close_fail; /* multiple links - don't dump */
- if (!ispipe && d_unhashed(file->f_path.dentry))
- goto close_fail;
+ if (IS_ERR(file))
+ goto fail_unlock;
- /* AK: actually i see no reason to not allow this for named pipes etc.,
- but keep the previous behaviour for now. */
- if (!ispipe && !S_ISREG(inode->i_mode))
- goto close_fail;
- /*
- * Dont allow local users get cute and trick others to coredump
- * into their pre-created files:
- */
- if (inode->i_uid != current_fsuid())
- goto close_fail;
- if (!file->f_op)
- goto close_fail;
- if (!file->f_op->write)
- goto close_fail;
- if (!ispipe && do_truncate(file->f_path.dentry, 0, 0, file) != 0)
- goto close_fail;
+ if (d_unhashed(file->f_path.dentry))
+ goto close_fail;
+ inode = file->f_path.dentry->d_inode;
+ if (inode->i_nlink > 1)
+ goto close_fail; /* multiple links - don't dump */
+ /* AK: actually i see no reason to not allow this for named
+ pipes etc, but keep the previous behaviour for now. */
+ if (!S_ISREG(inode->i_mode))
+ goto close_fail;
+ /*
+ * Dont allow local users get cute and trick others to coredump
+ * into their pre-created files:
+ */
+ if (inode->i_uid != current_fsuid())
+ goto close_fail;
+ if (!file->f_op)
+ goto close_fail;
+ if (!file->f_op->write)
+ goto close_fail;
+ if (do_truncate(file->f_path.dentry, 0, 0, file) != 0)
+ goto close_fail;
+ }
retval = binfmt->core_dump(signr, regs, file, core_limit);
-
if (retval)
current->signal->group_exit_code |= 0x80;
+
close_fail:
filp_close(file, NULL);
fail_unlock:
> if (!binfmt || !binfmt->core_dump)
> - goto fail;
> + goto out;
[...]
> +out:
> return;
> }
If the label does nothing, just use "return" instead of goto.
Thanks,
Roland
On Fri, Jun 26, 2009 at 09:14:18PM +0200, Oleg Nesterov wrote:
> On 06/26, Neil Horman wrote:
> >
> > On Fri, Jun 26, 2009 at 06:59:08PM +0200, Oleg Nesterov wrote:
> > >
> > > Doesn't look good, if core_pattern starts with "|" any user can set
> > > RLIMIT_CORE = 0 and then just do
> > >
> > > for (;;)
> > > if (pid = fork())
> > > kill(pid, SIGQUIT);
> > >
> > > to DOS printk/syslog, no?
> > >
> > I don't think SIGQUIT will trigger this, but SIGSEGV will.
>
> SIGQUIT falls into SIG_KERNEL_COREDUMP_MASK too.
>
> > Regardless, if you
> > do that, I would think you have bigger problems on your system.
>
> OK, agreed.
>
>
> Neil, I think you are doing useful changes, but can't we cleanup
> do_coredump() first?
>
Thank you, and yes, I've got no problem cleaning up do_coredump first
> I just can't look at unnecessary "if (ispipe)" checks...
>
I can get behind that. I've got some stuff to do this weekend, so I'll wait to
let Andrew roll in your cleanup patches that you've posted (I've got a bunch of
weekend chores anyway), then early next week I'll rediff my patches with your
changes in place + Andrews new set of notes.
Thanks!
Neil
P.S. Andrew - I double-checked, and my patches are on a branch in my git tree
rooted at tag v2.6.28-rc2 with your latest mm1 patch applied:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.28-rc2/2.6.28-rc2-mm1/2.6.28-rc2-mm1.bz2
So I'm not sure where the conflicts are coming from. I'll re-clone the tree
and apply everything from scratch though once Oleg's patches are in place, to see
if there's something I missed.
> Oleg.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On 06/26, Oleg Nesterov wrote:
>
> do_coredump() does a lot of file checks which are only needed when !ispipe,
I wonder why we check d_unhashed() after filp_open()?
Also, we are doing filp_open(O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE); this
2 == O_RDWR looks very strange.
Oleg.
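For reference, the bare 2 in that flags word is O_RDWR written numerically;
spelled out, the call would read:

	file = filp_open(corename,
			 O_CREAT | O_RDWR | O_NOFOLLOW | O_LARGEFILE | flag,
			 0600);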
On 06/26, Roland McGrath wrote:
>
> > if (!binfmt || !binfmt->core_dump)
> > - goto fail;
> > + goto out;
> [...]
> > +out:
> > return;
> > }
>
> If the label does nothing, just use "return" instead of goto.
But this is what the current code does; I just renamed "fail" to "out"
to make sure I didn't miss a goto.
And in fact I am not sure return is better; "goto out" looks more
consistent to me.
But OK, please find the updated patch below.
--------------------------------------------------------------------------
[PATCH 1/2] do_coredump: factor out put_cred() calls
Given that do_coredump() calls put_cred() on the exit path, it is a bit ugly
to do put_cred() + "goto fail" twice; just add a new "fail_creds" label.
Signed-off-by: Oleg Nesterov <[email protected]>
--- WAIT/fs/exec.c~CD_1_PUT_CRED 2009-06-15 12:48:28.000000000 +0200
+++ WAIT/fs/exec.c 2009-06-26 22:30:28.000000000 +0200
@@ -1732,13 +1732,11 @@ void do_coredump(long signr, int exit_co
binfmt = current->binfmt;
if (!binfmt || !binfmt->core_dump)
- goto fail;
+ return;
cred = prepare_creds();
- if (!cred) {
- retval = -ENOMEM;
- goto fail;
- }
+ if (!cred)
+ return;
down_write(&mm->mmap_sem);
/*
@@ -1746,8 +1744,7 @@ void do_coredump(long signr, int exit_co
*/
if (mm->core_state || !get_dumpable(mm)) {
up_write(&mm->mmap_sem);
- put_cred(cred);
- goto fail;
+ goto fail_creds;
}
/*
@@ -1761,10 +1758,8 @@ void do_coredump(long signr, int exit_co
}
retval = coredump_wait(exit_code, &core_state);
- if (retval < 0) {
- put_cred(cred);
- goto fail;
- }
+ if (retval < 0)
+ goto fail_creds;
old_cred = override_creds(cred);
@@ -1862,9 +1857,8 @@ fail_unlock:
if (helper_argv)
argv_free(helper_argv);
+ coredump_finish(mm);
revert_creds(old_cred);
+fail_creds:
put_cred(cred);
- coredump_finish(mm);
-fail:
- return;
}
> One way would be to add a new sysctl-externs.h and then put all the
> declarations in there. That file gets included by sysctl.c and by each
> file which shares a global with sysctl.c
Long ago I had an experimental patch to put sysctls into a new ELF section. Then
you could simply put a DEFINE_SYSCTL(....) into the appropriate source file
which defined the variable, and most of the tables went away.
No externs, no mess, no patch collisions, everything much nicer.
The only problem was that it didn't support the numerical sysctl
space, so that would need to be removed first.
It's deprecated for quite some time now:
	if (msg_count < 5) {
		msg_count++;
		printk(KERN_INFO
			"warning: process `%s' used the deprecated sysctl "
			"system call with ", current->comm);
Should it finally go now? If yes I could polish up the old patch again.
-Andi
--
[email protected] -- Speaking for myself only.
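For readers unfamiliar with the trick: it is the same linker-section technique
used for initcalls. A rough sketch of what a DEFINE_SYSCTL() could look like
(the macro, section name and registration scheme are invented here for
illustration; this is not the old patch itself):

	/* Each use drops one ctl_table entry into a dedicated section. */
	#define DEFINE_SYSCTL(name, var, len, perm, handler)		\
		static struct ctl_table __sysctl_##name			\
		__attribute__((used, section("sysctl_entries"))) = {	\
			.procname	= #name,			\
			.data		= &(var),			\
			.maxlen		= (len),			\
			.mode		= (perm),			\
			.proc_handler	= (handler),			\
		}

	/* In fs/exec.c, right next to the variable, no externs needed: */
	unsigned int core_pipe_limit;
	DEFINE_SYSCTL(core_pipe_limit, core_pipe_limit,
		      sizeof(unsigned int), 0644, proc_dointvec);

	/*
	 * Registration then walks the linker-provided section bounds,
	 * e.g. __start_sysctl_entries[] .. __stop_sysctl_entries[].
	 */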
On Sun, 28 Jun 2009 21:31:12 +0200 Andi Kleen <[email protected]> wrote:
> > One way would be to add a new sysctl-externs.h and then put all the
> > declarations in there. That file gets included by sysctl.c and by each
> > file which shares a global with sysctl.c
>
> Long ago I had a experimental patch to put sysctls into a new ELF section. The
> you could simply put a DEFINE_SYSCTL(....) into the appropiate source file
> which defined the variable and most of the tables went.
>
> No externs, no mess, no patch collisions, everything much nicer.
>
> The only problem was that it didn't support the numerical sysctl
> space, so that would need to be removed first.
>
> It's deprecated for quite some time now:
>
> if (msg_count < 5) {
> msg_count++;
> printk(KERN_INFO
> "warning: process `%s' used the deprecated sysctl "
> "system call with ", current->comm);
>
> Should it finally go now? If yes I could polish up the old patch again.
>
I suspect that it will be a long time before we can actually remove the
numerical sysctl support, if ever. In _theory_ we should support it
for ever. But in practice, we could probably remove it with a minimum
of disruption a few years hence, but it's hard to work this out.
When was the last time we saw a "warning: process `%s' used the
obsolete bdflush system call" warning? A quick google here says 2004.
Is that data? A bit, I guess.
Maybe Eric has thought about this issue?
On Sun, Jun 28, 2009 at 01:52:35PM -0700, Andrew Morton wrote:
> On Sun, 28 Jun 2009 21:31:12 +0200 Andi Kleen <[email protected]> wrote:
>
> > > One way would be to add a new sysctl-externs.h and then put all the
> > > declarations in there. That file gets included by sysctl.c and by each
> > > file which shares a global with sysctl.c
> >
> > Long ago I had a experimental patch to put sysctls into a new ELF section. The
> > you could simply put a DEFINE_SYSCTL(....) into the appropiate source file
> > which defined the variable and most of the tables went.
> >
> > No externs, no mess, no patch collisions, everything much nicer.
> >
> > The only problem was that it didn't support the numerical sysctl
> > space, so that would need to be removed first.
> >
> > It's deprecated for quite some time now:
> >
> > if (msg_count < 5) {
> > msg_count++;
> > printk(KERN_INFO
> > "warning: process `%s' used the deprecated sysctl "
> > "system call with ", current->comm);
> >
> > Should it finally go now? If yes I could polish up the old patch again.
> >
>
> I suspect that it will be a long time before we can actually remove the
> numerical sysctl support, if ever. In _theory_ we should support it
To my knowledge the only one that is really commonly used is the one
used by glibc, but that can be kept of course.
> for ever. But in practice, we could probably remove it with a minimum
> of disruption a few years hence, but it's hard to work this out.
One alternative would be to just keep a translation table
numerical -> symbolic for now and remove that later. That would
still not need any externs at least.
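i.e. something like the following, with the entries invented for illustration:

	static const struct {
		int		ctl_name;	/* old binary id */
		const char	*path;		/* ascii sysctl path */
	} binary_sysctl_map[] = {
		{ KERN_OSTYPE,		"kernel/ostype" },
		{ KERN_OSRELEASE,	"kernel/osrelease" },
		/* ... */
	};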
Or perhaps make an explicit CONFIG_SYSCTL_NUMERICAL and ask
users to report in.
>
> When was the last time we saw a "warning: process `%s' used the
> obsolete bdflush system call" warning? A quick google here says 2004.
> Is that data? A bit, I guess.
bdflush?
> Maybe Eric has thought about this issue?
feature-removal says Sep 2010 (which seems a bit long in the future)
-Andi
--
[email protected] -- Speaking for myself only.
On Sun, 28 Jun 2009 23:00:54 +0200 Andi Kleen <[email protected]> wrote:
> >
> > When was the last time we saw a "warning: process `%s' used the
> > obsolete bdflush system call" warning? A quick google here says 2004.
> > Is that data? A bit, I guess.
>
> bdflush?
I don't understand your answer.
Many years ago (guess: 2001/2002) the bdflush() syscall was deprecated
and we did
	if (msg_count < 5) {
		msg_count++;
		printk(KERN_INFO
			"warning: process `%s' used the obsolete bdflush"
			" system call\n", current->comm);
		printk(KERN_INFO "Fix your initscripts?\n");
	}
to encourage people to stop using it. A brief search indicates that
nobody has reported this message in 4-5 years.
So to answer the question "how long must we leave things in place before
removing them?", that's the only data I am aware of.
Andrew Morton <[email protected]> writes:
> On Sun, 28 Jun 2009 21:31:12 +0200 Andi Kleen <[email protected]> wrote:
>
>> > One way would be to add a new sysctl-externs.h and then put all the
>> > declarations in there. That file gets included by sysctl.c and by each
>> > file which shares a global with sysctl.c
>>
>> Long ago I had a experimental patch to put sysctls into a new ELF section. The
>> you could simply put a DEFINE_SYSCTL(....) into the appropiate source file
>> which defined the variable and most of the tables went.
>>
>> No externs, no mess, no patch collisions, everything much nicer.
>>
>> The only problem was that it didn't support the numerical sysctl
>> space, so that would need to be removed first.
>>
>> It's deprecated for quite some time now:
>>
>> if (msg_count < 5) {
>> msg_count++;
>> printk(KERN_INFO
>> "warning: process `%s' used the deprecated sysctl "
>> "system call with ", current->comm);
>>
>> Should it finally go now? If yes I could polish up the old patch again.
>>
>
> I suspect that it will be a long time before we can actually remove the
> numerical sysctl support, if ever. In _theory_ we should support it
> for ever. But in practice, we could probably remove it with a minimum
> of disruption a few years hence, but it's hard to work this out.
>
> When was the last time we saw a "warning: process `%s' used the
> obsolete bdflush system call" warning? A quick google here says 2004.
> Is that data? A bit, I guess.
>
> Maybe Eric has thought about this issue?
I have a patchset that I intend to dust off and repost now that the
merge window has closed; it converts all of the binary sysctl handling
into compatibility code that calls read/write on the ascii sysctls.
The bulk of the patchset is removing, from all of the subsystems, the
binary sysctl support code that the conversion makes unnecessary.
At which point the maintenance pain of all of the binary sysctls is
essentially removed.
At which point the technical decisions on dumping the binary interface
become much easier. They are reduced to one big file that we can keep or
compile out at our leisure. I still figure September 2010, as documented in
feature-removal.txt, sounds like a good date for finally ripping all of
the binary sysctl code out.
It is certainly time to remove the internal ABIs. The current system
is still too error prone for users adding new sysctls. It was only
a week or two ago that I had to nack a patch for adding a binary sysctl.
Once we have purged the binary parts it should be much easier to simplify
the sysctl code.
So my big question:
Andrew should I toss all 100 or so patches over the wall to you
and your -mm tree? Or should I maintain a public git tree based
at 2.6.31-rc1? Get it into linux-next and ask Linus to pull it when
the merge window comes?
Eric
> Andrew should I toss all 100 or so patches over the wall to you
> and your -mm tree? Or should I maintain a public git tree based
> at 2.6.31-rc1? Get it into linux-next and ask Linus to pull it when
> the merge window comes?
What do these 100 odd patches do exactly?
I think DEFINE_SYSCTL()/ELF section would be the correct direction to go
for all global variable sysctls.
Then the binary sysctls could be handled by a global table
in a separate file like you described
[My old patch back then also had a sysctl_name() syscall to still
allow changing sysctl without mounting /proc, but that wasn't
very popular]
For dynamically generated sysctls (relatively rare but there)
the current interfaces are not great, but could be probably kept.
That all doesn't really need 100 patches though.
-Andi
--
[email protected] -- Speaking for myself only.
Andrew Morton <[email protected]> writes:
> On Sun, 28 Jun 2009 23:00:54 +0200 Andi Kleen <[email protected]> wrote:
>
>> >
>> > When was the last time we saw a "warning: process `%s' used the
>> > obsolete bdflush system call" warning? A quick google here says 2004.
>> > Is that data? A bit, I guess.
>>
>> bdflush?
>
> I don't understand your answer.
>
> Many years ago (guess: 2001/2002) the bdflush() syscall was deprecated
> and we did
>
> if (msg_count < 5) {
> msg_count++;
> printk(KERN_INFO
> "warning: process `%s' used the obsolete bdflush"
> " system call\n", current->comm);
> printk(KERN_INFO "Fix your initscripts?\n");
> }
>
> to encourage people to stop using it. A brief search indicates that
> nobody has reported this message in 4-5 years.
>
> So to answer the question "how long must we leave things in place before
> removing them?", that's the only data I am aware of.
Good point. I just did a quick search and I can find people hitting the
deprecated sysctl message in 2.6.30. So next year might be too soon
to actually remove the code. Still we can remove the internal APIs
for 2.6.32 so it is no big deal.
Eric
On Sun, 28 Jun 2009 14:35:31 -0700 [email protected] (Eric W. Biederman) wrote:
> Andrew should I toss all 100 or so patches over the wall to you
> and your -mm tree? Or should I maintain a public git tree based
> at 2.6.31-rc1? Get it into linux-next and ask Linus to pull it when
> the merge window comes?
The latter would be more convenient for me. It may cause pain for
yourself and Stephen though?
Andi Kleen <[email protected]> writes:
>> Andrew should I toss all 100 or so patches over the wall to you
>> and your -mm tree? Or should I maintain a public git tree based
>> at 2.6.31-rc1? Get it into linux-next and ask Linus to pull it when
>> the merge window comes?
>
> What do these 100 odd patches do exactly?
Mostly a fine-grained killing of ctl_name and strategy.
> I think DEFINE_SYSCTL()/ELF section would be the correct direction to go
> for all global variable sysctls.
Perhaps. I don't know how those data structures interact with
what we have in kernel and in modules.
> Then the binary sysctls could be handled by a global table
> in a separate file like you described
Getting the binary sysctl crud out of the core path should
happen first. That is just a handful of patches.
> For dynamically generated sysctls (relatively rare but there)
> the current interfaces are not great, but could be probably kept.
Things like register_sysctl_path can be greatly improved. Now
that we don't have to worry about the binary paths.
> That all doesn't really need 100 patches though.
If you want the patches to be small enough to be human readable it
takes a lot. If you want the patches to be CC'able to the appropriate
maintainers and you don't want to require them to weed through a bunch of
irrelevant code it takes a lot.
Typos are a real danger in an operation like this.
Eric
Ok, Version 3 of this patchset. This is the same set of patches as before, only
rediffed and modified in the following ways:
1) Rediffed to the head of Linus' tree
2) Rediffed to apply on top of Oleg's 2 patches that clean up do_coredump
3) Modified to use pipe_wait to wait for pipe readers to hit zero rather than
the racy/buggy process of waiting on the reader process to exit.
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
diff --git a/fs/exec.c b/fs/exec.c
index 9e05bd8..9defd20 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE = !0,
+ * but it runs as root, and can do lots of stupid things.
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * thread group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
Allow for the kernel to wait for a core_pattern process to complete
One of the things core_pattern processes might do is interrogate the status of a
crashing process via its /proc/pid directory. To ensure that that directory is
not removed prematurely, we wait for the process to exit prior to cleaning it
up.
Since the addition of this feature makes it possible to block the reaping of a
crashed process (if the collecting process never exits), we also introduce a new
sysctl: core_pipe_limit. This sysctl, when non-zero, defines the maximum number
of crashing processes that can be collected in parallel. Processes exceeding
this limit are noted in the kernel log buffer and their cores are skipped. If
core_pipe_limit is zero (the default), this feature is disabled entirely. In
other words, the number of cores which may be collected in parallel is
unlimited, but access to a crashing process's /proc/pid directory is not
guaranteed, as the kernel will not wait for the collecting process to exit.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 322a00b..bb226ba 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 9defd20..33d6db6 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -63,6 +64,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1716,7 +1718,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
char corename[CORENAME_MAX_SIZE + 1];
struct mm_struct *mm = current->mm;
struct linux_binfmt * binfmt;
- struct inode * inode;
+ struct inode * inode = NULL;
struct file * file;
const struct cred *old_cred;
struct cred *cred;
@@ -1726,7 +1728,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1798,22 +1801,31 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
}
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
+ inode = file->f_path.dentry->d_inode;
} else {
if (core_limit < binfmt->min_coredump)
goto fail_unlock;
@@ -1853,6 +1865,12 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count) {
+ while (core_pipe_limit && inode->i_pipe->readers)
+ pipe_wait(inode->i_pipe);
+ atomic_dec(&core_dump_count);
+ }
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..681052f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
On 06/28, Neil Horman wrote:
>
> Allow for the kernel to wait for a core_pattern process to complete
(please change the subject to match)
> One of the things core_pattern processes might do is interrogate the status of a
> crashing process via its /proc/pid directory. To ensure that that directory is
> not removed prematurely, we wait for the process to exit prior to cleaning it
> up.
>
> Since the addition of this feature makes it possible to block the reaping of a
> crashed process (if the collecting process never exits), Also introduce a new
> sysctl: core_pipe_limit.
Perhaps this sysctl should be added in a separate patch? This patch mixes
differents things imho.
But in fact I don't really understand why we need the new sysctl. Yes,
if the collecting process never exits, the coredumping thread can't be reaped.
But this process runs as root; it can do other bad things. And let's suppose
it just does nothing, say sleeps forever, and does not read the data from the pipe.
In that case, regardless of any sysctls, ->core_dump() never finishes either.
> +fail_dropcount:
> + if (dump_count) {
> + while (core_pipe_limit && inode->i_pipe->readers)
> + pipe_wait(inode->i_pipe);
No, no, this is racy and wrong.
First, it is possible that the reader exits between the ->readers != 0 check and
pipe_wait(); we will sleep forever.
Also, pipe_wait() should be called under pipe_lock(), I guess lockdep
should complain if you test your patch ;)
I'd suggest you make a simple helper:

	static inline void xxx(struct file *file)
	{
		struct pipe_inode_info *pipe = file->...;

		wait_event(pipe->wait, pipe->readers == 0);
	}
Oleg.
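Filled in against do_coredump()'s file, Oleg's suggested helper would come out
roughly as below (hypothetical name; whether the pipe lock is really unneeded
is the open question above):

	static void wait_for_dump_helpers(struct file *file)
	{
		struct pipe_inode_info *pipe =
			file->f_path.dentry->d_inode->i_pipe;

		/*
		 * Unlike the check-then-pipe_wait() loop in the patch,
		 * wait_event() re-checks the condition after queueing
		 * itself, so a reader exiting in the window cannot
		 * leave us sleeping forever.
		 */
		wait_event(pipe->wait, pipe->readers == 0);
	}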
On Mon, Jun 29, 2009 at 12:24:55AM +0200, Oleg Nesterov wrote:
> On 06/28, Neil Horman wrote:
> >
> > Allow for the kernel to wait for a core_pattern process to complete
>
> (please change the subject to match)
>
Fine.
> > One of the things core_pattern processes might do is interrogate the status of a
> > crashing process via its /proc/pid directory. To ensure that that directory is
> > not removed prematurely, we wait for the process to exit prior to cleaning it
> > up.
> >
> > Since the addition of this feature makes it possible to block the reaping of a
> > crashed process (if the collecting process never exits), Also introduce a new
> > sysctl: core_pipe_limit.
>
> Perhaps this sysctl should be added in a separate patch? This patch mixes
> differents things imho.
>
No, I disagree. If we're going to have a sysctl, it should be added in this
patch. I agree that since these processes run as root, we can have all sorts of
bad things happen. But I think there's an advantage to being able to limit the
damage that a core_pattern process can do if it never exits. This is a problem
we can avoid easily, and I'd rather not introduce the possibility of waiting
(forever) on a process without the ability to mitigate the risks that incurs.
> But in fact I don't really understand why do we need the new sysctl. Yes,
> if the collecting process never exits, the coredumping thread can't be reaped.
> But this process runs as root, it can do other bad things. And let's suppose
> it just does nothing, say sleeps forever, and do not read the data from pipe.
> In that case, regardless of any sysctls, ->core_dump() never finishes too.
>
Not always true, in the event that the core file is smaller than the pipe size.
But regardless, if ->core_dump never returns due to the aforementioned
situation, the sysctl provides the ability to mitigate the damage that it can do,
since the dump count is held while ->core_dump is called.
> > +fail_dropcount:
> > + if (dump_count) {
> > + while (core_pipe_limit && inode->i_pipe->readers)
> > + pipe_wait(inode->i_pipe);
>
> No, no, this is racy and wrong.
>
> First, it is possible that it exits between ->readers != 0 check and
> pipe_wait(), we will sleep forever.
>
It's my understanding that pipe_wait returns on any pipe event, including the
closing of a pipe. I would have thought that the above code would catch that,
although, as I type this, I can see how it wouldn't without a lock.
> Also, pipe_wait() should be called under pipe_lock(), I guess lockdep
> should complain if you test your patch ;)
>
I did test it, and received no such lockdep warnings.
> I'd suggest you to make a simple helper,
>
> static inline void xxx(struct file *file)
> {
> struct pipe_inode_info *pipe = file->...;
>
> wait_event(pipe->wait, pipe->readers == 0);
> }
>
> I believe we don't need pipe_lock().
>
Ok, I like that, I'll repost tomorrow morning, after I get some sleep.
Thanks!
Neil
> Oleg.
>
>
>
On 06/29, Oleg Nesterov wrote:
>
> But in fact I don't really understand why do we need the new sysctl.
Sorry Neil, I misread the patch. This sysctl limits the number of coredumps
to pipe in flight, not the number of "wait for ->readers == 0".
Agreed, perhaps makes sense. But imho a separate patch is better. And,
> Yes,
> if the collecting process never exits, the coredumping thread can't be reaped.
> But this process runs as root, it can do other bad things. And let's suppose
> it just does nothing, say sleeps forever, and do not read the data from pipe.
> In that case, regardless of any sysctls, ->core_dump() never finishes too.
so I think this core_pipe_limit is more or less orthogonal to "wait for complete".
Oleg.
On 06/28, Neil Horman wrote:
>
> On Mon, Jun 29, 2009 at 12:24:55AM +0200, Oleg Nesterov wrote:
> >
> > Perhaps this sysctl should be added in a separate patch? This patch mixes
> > differents things imho.
> >
> No, I disagree. If we're going to have a sysctl, It should be added in this
> patch. I agree that since these processes run as root, we can have all sort of
> bad things happen. But I think theres an advantage to being able to limit the
> damage that a core_pattern process can do if it never exits.
Yes, but why should it be added in this patch?
> > But in fact I don't really understand why do we need the new sysctl. Yes,
> > if the collecting process never exits, the coredumping thread can't be reaped.
> > But this process runs as root, it can do other bad things. And let's suppose
> > it just does nothing, say sleeps forever, and do not read the data from pipe.
> > In that case, regardless of any sysctls, ->core_dump() never finishes too.
> >
> Not always true, in the event that the core file is smaller than the pipe size.
sure,
> But regardless, if ->core_dump never returns due to the aforementioned
> situation, the sysctl provides the ability to mitigate the damange that can do,
> since the dump count is held while ->core_dump is called.
Yes, I misread the sysctl code. Perhaps another reason to split this patch ;)
Oleg.
On Sun, Jun 28, 2009 at 03:06:25PM -0700, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> >> Andrew should I toss all 100 or so patches over the wall to you
> >> and your -mm tree? Or should I maintain a public git tree based
> >> at 2.6.31-rc1? Get it into linux-next and ask Linus to pull it when
> >> the merge window comes?
> >
> > What do these 100 odd patches do exactly?
>
> Mostly a fine-grained killing of ctl_name and strategy.
Ok.
The only issue is -- assuming we convert that over to DEFINE_SYSCTL
too (which I think would be much nicer) -- all the maintainers would
need to process two patches. So perhaps it would be better to combine
this into a single update?
> > Then the binary sysctls could be handled by a global table
> > in a separate file like you described
>
> Getting the binary sysctl crud out of the core path should
> happen first. That is just a handful of patches.
Agreed. It should be just a separate table.
> > the current interfaces are not great, but could be probably kept.
>
> Things like register_sysctl_path can be greatly improved. Now
> that we don't have to worry about the binary paths.
With a module DEFINE_SYSCTL() only truly dynamic sysctls (like
the network per-device sysctls) would need that anyway; the
vast majority of callers wouldn't need to call any functions.
-Andi
--
[email protected] -- Speaking for myself only.
On Mon, Jun 29, 2009 at 01:32:00AM +0200, Oleg Nesterov wrote:
> On 06/28, Neil Horman wrote:
> >
> > On Mon, Jun 29, 2009 at 12:24:55AM +0200, Oleg Nesterov wrote:
> > >
> > > Perhaps this sysctl should be added in a separate patch? This patch mixes
> > > different things imho.
> > >
> > No, I disagree. If we're going to have a sysctl, it should be added in this
> > patch. I agree that since these processes run as root, we can have all sorts of
> > bad things happen. But I think there's an advantage to being able to limit the
> > damage that a core_pattern process can do if it never exits.
>
> Yes, but why should it be added in this patch?
>
I agree with what you said earlier, in that the sysctl is orthogonal to the
wait_for_complete functionality, from an implementation standpoint. But I don't
feel as though they are independent from a behavioral standpoint. Given that
the sysctl defines a default value of zero in which unlimited parallel core
dumps are allowed, but none are waited for, separating the patches creates a
behavioral split in which the core_pattern behavior changes for the span of
one commit, and in such a way that the system can deadlock if the core_pattern
process does bad things. To illustrate, say we applied the wait_for_core patch
separately from the sysctl patch. If someone built with both patches, their
core_pattern behavior would appear unchanged, but if they built with only the
first patch, and their core_pattern app had a bug in which the process never
exited properly, they would get an unbounded backlog of unreaped processes. I
could certainly modify the first patch to never wait, and then modify the sysctl
to decide when it was ok to wait, but to add a patch that allows for a wait
state that never happens seems a bit odd to me.
Regards
Neil
On 06/29, Neil Horman wrote:
>
> On Mon, Jun 29, 2009 at 01:32:00AM +0200, Oleg Nesterov wrote:
> > On 06/28, Neil Horman wrote:
> > >
> > > On Mon, Jun 29, 2009 at 12:24:55AM +0200, Oleg Nesterov wrote:
> > > >
> > > > Perhaps this sysctl should be added in a separate patch? This patch mixes
> > > > different things imho.
> > > >
> > > No, I disagree. If we're going to have a sysctl, it should be added in this
> > > patch. I agree that since these processes run as root, we can have all sorts of
> > > bad things happen. But I think there's an advantage to being able to limit the
> > > damage that a core_pattern process can do if it never exits.
> >
> > Yes, but why should it be added in this patch?
> >
> I agree with what you said earlier, in that the sysctl is orthogonal to the
> wait_for_complete functionality, from an implementation standpoint. But I don't
> feel as though they are independent from a behavioral standpoint.
I think they are ;)
> Given that
> the sysctl defines a default value of zero in which unlimited parallel core
> dumps are allowed, but none are waited for, separating the patches creates a
> behavioral split in which the core_pattern behavior changes for the span of
> one commit, and in such a way that the system can deadlock if the core_pattern
> process does bad things. To illustrate, say we applied the wait_for_core patch
> separately from the sysctl patch. If someone built with both patches, their
> core_pattern behavior would appear unchanged, but if they built with only the
> first patch, and their core_pattern app had a bug in which the process never
> exited properly, they would get an unbounded backlog of unreaped processes.
Just send the sysctl first. If this limit makes sense (I think yes), then
it makes sense even without the wait_for_core patch, because a buggy core_pattern
app can cause problems even without wait_for_core.
This looks so obvious to me, but of course please do what you think is right.
(from another message)
>
> So, I'd like to solicit your opinion on something here. I've run into an
> interesting chicken-and-egg problem in attempting to implement this. I found
> while implementing it that the user space component might be relying on
> an EOF condition on stdin to determine when to stop reading from the pipe.
I see.
> I see
> two options:
>
> 1) Require user space to parse the core file during its read, to know how much
> data to extract from the pipe, and call it a user space problem if they read
> more than that and wind up blocking indefinitely?
I don't think we should break user-space.
> 2) Develop another way to detect that userspace is done and has exited.
Yes, as I said we can implement UMH_WAIT_CUSTOM_COMPLETION, but this needs
changes in kmod.c
> I personally like option 2, and I think I can do it using the inotify
Well, it's not good to depend on CONFIG_INOTIFY, and this looks like overkill
to me.
How about (completely untested) patch below?
Oleg.
--- WAIT/fs/exec.c~CD_3_CLOSE_WAIT 2009-06-30 01:26:38.000000000 +0200
+++ WAIT/fs/exec.c 2009-06-30 01:47:18.000000000 +0200
@@ -25,6 +25,7 @@
#include <linux/slab.h>
#include <linux/file.h>
#include <linux/fdtable.h>
+#include <linux/pipe_fs_i.h>
#include <linux/mm.h>
#include <linux/stat.h>
#include <linux/fcntl.h>
@@ -1710,6 +1711,23 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void xxx(struct file *file)
+{
+ struct pipe_inode_info *pipe = file->f_path.dentry->d_inode->i_pipe;
+
+ pipe_lock(pipe);
+ while (pipe->readers) {
+ pipe->readers++;
+ pipe->writers--;
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ pipe->writers++;
+ pipe->readers--;
+ }
+ pipe_unlock(pipe);
+}
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1853,11 +1871,12 @@ void do_coredump(long signr, int exit_co
current->signal->group_exit_code |= 0x80;
close_fail:
+ if (ispipe)
+ xxx(file);
filp_close(file, NULL);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
-
coredump_finish(mm);
revert_creds(old_cred);
fail_creds:
Ok, here's version 4 of this patch set. Based on feedback from the past few days
I've made some changes (noted below). I've tested all of these patches here,
and they work quite well: I'm able to prevent recursive core dumps, wait on
dumps to complete, and limit the number of dumps I handle in parallel.
Change Notes:
1) Diffed against the latest Linus kernel + Oleg's cleanup patches to do_coredump
2) Refactored into 3 patches instead of two (I still don't think it's needed, but
I've received more than one request to pull the sysctl into a separate patch, so
I'm going with consensus here, and it won't hurt anything anyway)
3) Changed how we detect completed user space processes. We need to be able to
close a pipe and then wait on the pipe reader to finish with it. As such we
have to do some trickery with the pipe readers and writers counts to make that
happen.
Patches in subsequent mails
Regards
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 9e05bd8..9defd20 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE != 0,
+ * but it runs as root, and can do lots of stupid things
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * thread group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
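To make the contract above concrete: the pipe helper is expected to run with
RLIMIT_CORE set to 0, so a crash of the helper itself is skipped instead of
recursing. A hypothetical helper could also enforce this defensively on its
own; a minimal userspace sketch (names and structure illustrative only):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	/* ensure a crash of this helper can never pipe another core */
	struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };

	if (setrlimit(RLIMIT_CORE, &rl) != 0)
		perror("setrlimit");

	/* ... read the crashing process's core image from stdin ... */
	return 0;
}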
core_pattern: Introduce core pipe limiting sysctl
Since we can dump cores to a pipe, rather than directly to the filesystem, we
create a condition in which a user can create a very high load on the system
simply by running bad applications. If the pipe reader specified in
core_pattern is poorly written, we can have lots of outstanding resources and
processes in the system. This sysctl introduces an ability to limit that
resource consumption. core_pipe_limit defines how many in-flight dumps may be
run in parallel, dumps beyond this value are skipped and a note is made in the
kernel log. A special value of 0 in core_pipe_limit denotes unlimited core
dumps may be handled (this is the default value).
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 21 +++++++++++++++++----
kernel/sysctl.c | 9 +++++++++
3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 322a00b..bb226ba 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 9defd20..93ab6eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -63,6 +63,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1726,7 +1727,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1798,21 +1800,29 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
}
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else {
if (core_limit < binfmt->min_coredump)
@@ -1853,6 +1863,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..681052f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
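For completeness, administering the new sysctl is an ordinary procfs write,
equivalent to echo 4 > /proc/sys/kernel/core_pipe_limit; a hypothetical C
snippet capping in-flight piped dumps at 4:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/core_pipe_limit", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "4\n");	/* allow at most 4 parallel piped dumps */
	fclose(f);
	return 0;
}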
core_pattern: Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata for a
crashing process in its /proc/<pid> directory. This is racy, however, since
do_coredump in the kernel doesn't wait for the user space process to complete
before it reaps the crashing process. This patch corrects that, allowing the
kernel to wait for the user space process to complete before cleaning up the
crashing process. This is a bit tricky to do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions with Oleg Nesterov off-list about this, and
this is the one we've come up with. We basically add ourselves as an additional
reader (to prevent cleanup of the pipe), write the dump in ->core_dump(), then
iteratively remove ourselves as a writer (to create the EOF condition) and wake
up the user process. Note that we add ourselves as a reader before writing the
file; this closes the race in the window between the time we write the dump and
the time we start checking for the user space process to be done with the pipe.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 50 insertions(+), 2 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 93ab6eb..8dbf5a4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1711,14 +1712,48 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void wait_for_dump_helpers(struct file *file)
+{
+ struct inode *inode;
+ struct pipe_inode_info *pipe;
+
+ if (!file)
+ return;
+
+ inode = file->f_path.dentry->d_inode;
+
+ if (!S_ISFIFO(inode->i_mode))
+ return;
+
+ pipe = inode->i_pipe;
+
+ pipe_lock(pipe);
+ while (pipe->readers > 1) {
+ pipe->writers--;
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ pipe->writers++;
+ }
+
+ /*
+ * This reclaims the additional readers count we took in
+ * do_coredump
+ */
+ pipe->readers--;
+ pipe_unlock(pipe);
+
+}
+
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
char corename[CORENAME_MAX_SIZE + 1];
struct mm_struct *mm = current->mm;
struct linux_binfmt * binfmt;
- struct inode * inode;
- struct file * file;
+ struct inode * inode = NULL;
+ struct file * file = NULL;
const struct cred *old_cred;
struct cred *cred;
int retval = 0;
@@ -1729,6 +1764,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
int helper_argc = 0;
int dump_count = 0;
static atomic_t core_dump_count = ATOMIC_INIT(0);
+ struct pipe_inode_info *pipe;
audit_core_dumps(signr);
@@ -1824,6 +1860,17 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
corename);
goto fail_dropcount;
}
+
+ /*
+ * This lets us wait on a pipe after we close the writing
+ * end. The extra reader count prevents the pipe_inode_info
+ * from getting freed. This extra count is reclaimed in
+ * wait_for_dump_helpers
+ */
+ pipe = file->f_path.dentry->d_inode->i_pipe;
+ pipe_lock(pipe);
+ pipe->readers++;
+ pipe_unlock(pipe);
} else {
if (core_limit < binfmt->min_coredump)
goto fail_unlock;
@@ -1862,6 +1909,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
current->signal->group_exit_code |= 0x80;
close_fail:
+ wait_for_dump_helpers(file);
filp_close(file, NULL);
fail_dropcount:
if (dump_count)
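To show what the wait buys user space, here is a hypothetical collecting
helper: it reads the dump from stdin until the EOF the kernel now provides,
then inspects the crashing task's /proc entry, which the wait keeps alive
until this program exits. The use of %p in core_pattern to pass the crashing
pid as argv[1] is an assumption of this sketch:

#include <stdio.h>

int main(int argc, char **argv)
{
	char buf[4096], path[64];
	size_t n;
	FILE *core = fopen("core.out", "w");
	FILE *status;

	if (!core)
		return 1;
	/* EOF here means the kernel dropped its writer reference */
	while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
		fwrite(buf, 1, n, core);
	fclose(core);

	if (argc > 1) {
		/* still present, because do_coredump is waiting on us */
		snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
		status = fopen(path, "r");
		if (status) {
			while (fgets(buf, sizeof(buf), status))
				fputs(buf, stderr);
			fclose(status);
		}
	}
	return 0;
}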
On 06/30, Neil Horman wrote:
>
> void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> {
> struct core_state core_state;
> char corename[CORENAME_MAX_SIZE + 1];
> struct mm_struct *mm = current->mm;
> struct linux_binfmt * binfmt;
> - struct inode * inode;
> - struct file * file;
> + struct inode * inode = NULL;
> + struct file * file = NULL;
why this change?
> @@ -1824,6 +1860,17 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> corename);
> goto fail_dropcount;
> }
> +
> + /*
> + * This lets us wait on a pipe after we close the writing
> + * end. The extra reader count prevents the pipe_inode_info
> + * from getting freed.
but it can't be freed until we close the file?
> This extra count is reclaimed in
> + * wait_for_dump_helpers
> + */
> + pipe = file->f_path.dentry->d_inode->i_pipe;
> + pipe_lock(pipe);
> + pipe->readers++;
> + pipe_unlock(pipe);
why should we inc ->readers in advance?
> + wait_for_dump_helpers(file);
why do we call it unconditionally and then check ISFIFO? We only need to wait
when ispipe = T, and in that case we know that this file is a pipe.
IOW, could you explain why the (much simpler) patch I sent doesn't work ?
Hmm. And in fact that pipe->readers++ above doesn't look right. What if
the core_pattern task exits? Since we incremented ->readers, we can't notice
the fact that there are no readers, and f_op->write() will hang forever.
Oleg.
On Wed, Jul 01, 2009 at 07:52:57AM +0200, Oleg Nesterov wrote:
> On 06/30, Neil Horman wrote:
> >
> > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > {
> > struct core_state core_state;
> > char corename[CORENAME_MAX_SIZE + 1];
> > struct mm_struct *mm = current->mm;
> > struct linux_binfmt * binfmt;
> > - struct inode * inode;
> > - struct file * file;
> > + struct inode * inode = NULL;
> > + struct file * file = NULL;
>
> why this change?
>
It's part of a cosmetic change; see below.
> > @@ -1824,6 +1860,17 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > corename);
> > goto fail_dropcount;
> > }
> > +
> > + /*
> > + * This lets us wait on a pipe after we close the writing
> > + * end. The extra reader count prevents the pipe_inode_info
> > + * from getting freed.
>
> but it can't be freed until we close the file?
>
Damn, leftover comment from a previous version, needs to be removed.
> > This extra count is reclaimed in
> > + * wait_for_dump_helpers
> > + */
> > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > + pipe_lock(pipe);
> > + pipe->readers++;
> > + pipe_unlock(pipe);
>
> why should we inc ->readers in advance?
>
Read the comment immediately above it and look at the filp_close path. We inc
->readers in advance so as to prevent pipe_inode_info getting freed between the
time we write out the core file and the time we wait on the pipe. If the
userspace helper exits in between those points, inode->i_pipe will be NULL by
the time we get to wait_for_dump_helpers. And a simple null check isn't
sufficient in wait_for_dump_helpers, since that still creates a window between
the check and the alternative increment of readers inside the loop, leading to a
use after free/corruption case.
> > + wait_for_dump_helpers(file);
>
> why do we call it unconditionally and then check ISFIFO? We only need to wait
> when ispipe = T, and in that case we know that this file is a pipe.
>
Cosmetic, I can call it unconditionally here and then check if it's a fifo in the
function, so that in do_coredump I don't have to do the following:
if (is_pipe)
wait_for_dump_helpers(file);
out_unlock:
filp_close(...)
if (is_pipe)
atomic_dec(&core_dump_count);
This is exactly the sort of crap your cleanups to do_coredump attempted to
remove. I thought it best not to undo that work :)
I also do a NULL check in wait_for_dump_helpers, so that if the helper fails to
start properly, it's a fall-through case.
> IOW, could you explain why the (much simpler) patch I sent doesn't work ?
>
In short, because the much simpler patch that you sent is broken. I in fact
tried it as is, and ran across the exact race that I described above, in which
the user space helper exited before we waited on it, resulting in an oops when
we tried to manipulate the i_pipe pointer, which had become NULL.
>
> Hmm. And in fact that pipe->readers++ above doesn't look right. What if
> the core_pattern task exits? Since we incremented ->readers, we can't notice
> the fact that there are no readers, and f_op->write() will hang forever.
>
But if we don't, we can lose the inode->i_pipe pointer. I suppose what we need
to do is increment writers immediately, then decrement writers and increment
readers after the return from ->core_dump
Neil
On 07/01, Neil Horman wrote:
>
> On Wed, Jul 01, 2009 at 07:52:57AM +0200, Oleg Nesterov wrote:
> > > This extra count is reclaimed in
> > > + * wait_for_dump_helpers
> > > + */
> > > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > > + pipe_lock(pipe);
> > > + pipe->readers++;
> > > + pipe_unlock(pipe);
> >
> > why should we inc ->readers in advance?
> >
> Read the comment immediately above it and look at the filp_close path. We inc
> ->readers in advance so as to prevent pipe_inode_info getting freed between the
> time we write out the core file and the time we wait on the pipe.
Can't understand.
call_usermodehelper_stdinpipe() creates 2 files, both share the same
inode/pipe_inode_info (->f_path actually).
Until we close the file returned by call_usermodehelper_pipe(),
pipe_inode_info can't go away.
> If the
> userspace helper exits in between those points, inode->i_pipe will be NULL by
> the time we get to wait_for_dump_helpers.
See above. Can't understand how this can happen.
> > > + wait_for_dump_helpers(file);
> >
> > why do we call it unconditionally and then check ISFIFO? We only need to wait
> > when ispipe = T, and in that case we know that this file is a pipe.
> >
> Cosmetic, I can call it unconditionally here and then check if it's a fifo in the
> function, so that in do_coredump I don't have to do the following:
> if (is_pipe)
> wait_for_dump_helpers(file);
I think the above is better. More straightforward and clean.
> This is exactly the sort of crap your cleanups to do_coredump attempted to
> remove. I thought it best not to undo that work :)
Well. I tried to remove unnecessary "if (ispipe)" checks, yes. But in that
case we can't avoid this check. And your patch still does this check, but
instead of simple "ispipe == T" we check
S_ISFIFO(file->f_path.dentry->d_inode->i_mode), which doesn't look like a cleanup ;)
And please note this relies on the do_coredump()->S_ISREG() check, which can be
removed.
> > IOW, could you explain why the (much simpler) patch I sent doesn't work ?
> >
> In short, because the much simpler patch that you sent is broken. I in fact
> tried it as is, and ran across the exact race that I described above, in which
> the user space helper exited before we waited on it, resulting in an oops when
> we tried to manipulate the i_pipe pointer, which had become NULL.
I must have missed something. And yes, as I said I didn't test my patch.
But I don't understand how this can happen, see above. And look, if this
is possible then dump_write()->pipe_write() should oops too, it doesn't
check inode/pipe != NULL.
OK, I have to check this all. But perhaps you can explain where I am wrong?
Perhaps I should actually apply my patch and test ;)
Oleg.
On Wed, Jul 01, 2009 at 02:25:33PM +0200, Oleg Nesterov wrote:
> On 07/01, Neil Horman wrote:
> >
> > On Wed, Jul 01, 2009 at 07:52:57AM +0200, Oleg Nesterov wrote:
> > > > This extra count is reclaimed in
> > > > + * wait_for_dump_helpers
> > > > + */
> > > > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > > > + pipe_lock(pipe);
> > > > + pipe->readers++;
> > > > + pipe_unlock(pipe);
> > >
> > > why should we inc ->readers in advance?
> > >
> > Read the comment immediately above it and look at the filp_close path. We inc
> > ->readers in advance so as to prevent pipe_inode_info getting freed between the
> > time we write out the core file and the time we wait on the pipe.
>
> Can't understand.
>
> call_usermodehelper_stdinpipe() creates 2 files, both share the same
> inode/pipe_inode_info (->f_path actually).
>
> Until we close the file returned by call_usermodehelper_pipe(),
> pipe_inode_info can't go away.
>
> > If the
> > userspace helper exits in between those points, inode->i_pipe will be NULL by
> > the time we get to wait_for_dump_helpers.
>
> See above. Can't understand how this can happen.
>
Yes, I apologize, in reviewing that code, I don't see how it could happen
either. I think that I must have changed two things at once and gotten erroneous
results in my testing (most likely I did the wait_for_dump_helpers after the
filp_close and that caused the race). I'm reworking the patch now.
> > > > + wait_for_dump_helpers(file);
> > >
> > > why do we call it unconditionally and then check ISFIFO? We only need to wait
> > > when ispipe = T, and in that case we know that this file is a pipe.
> > >
> > Cosmetic, I can call it unconditionally here and then check if it's a fifo in the
> > function, so that in do_coredump I don't have to do the following:
> > if (is_pipe)
> > wait_for_dump_helpers(file);
>
> I think the above is better. More straightforward and clean.
>
> > This is exactly the sort of crap your cleanups to do_coredump attemtped to
> > remove. I thought it best not to undo that work :)
>
> Well. I tried to remove unnecessary "if (ispipe)" checks, yes. But in that
> case we can't avoid this check. And your patch still does this check, but
> instead of simple "ispipe == T" we check
> S_ISFIFO(file->f_path.dentry->d_inode->i_mode), which doesn't look like a cleanup ;)
>
Ok, whatever, I'm not interested in debating what looks prettier, so I'll just
use the ispipe test and be done with it.
I'm testing version 5 and will post the set shortly.
Neil
On 07/01, Neil Horman wrote:
>
> On Wed, Jul 01, 2009 at 02:25:33PM +0200, Oleg Nesterov wrote:
> > On 07/01, Neil Horman wrote:
> > >
> > > On Wed, Jul 01, 2009 at 07:52:57AM +0200, Oleg Nesterov wrote:
> > > > > This extra count is reclaimed in
> > > > > + * wait_for_dump_helpers
> > > > > + */
> > > > > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > > > > + pipe_lock(pipe);
> > > > > + pipe->readers++;
> > > > > + pipe_unlock(pipe);
> > > >
> > > > why should we inc ->readers in advance?
> > > >
> > > Read the comment immediately above it and look at the filp_close path. We inc
> > > ->readers in advance so as to prevent pipe_inode_info getting freed between the
> > > time we write out the core file and the time we wait on the pipe.
> >
> > Can't understand.
> >
> > call_usermodehelper_stdinpipe() creates 2 files, both share the same
> > inode/pipe_inode_info (->f_path actually).
> >
> > Until we close the file returned by call_usermodehelper_pipe(),
> > pipe_inode_info can't go away.
> >
> > > If the
> > > userspace helper exits in between those points, inode->i_pipe will be NULL by
> > > the time we get to wait_for_dump_helpers.
> >
> > See above. Can't understand how this can happen.
> >
> Yes, I apologize, in reviewing that code, I don't see how it could happen
> either. I think that I must have changed two things at once and gotten erroneous
> results in my testing (most likely I did the wait_for_dump_helpers after the
> filp_close and that caused the race).
Yes, this can explain the bug you hit.
Anyway. I applied my patch and tested it with "|/bin/true" in
/proc/sys/kernel/core_pattern. It works even if I add
schedule_timeout_interruptible(3*HZ) before binfmt->core_dump()
to make sure the core_pattern task exits before we start to wait.
Oleg.
Ok, version 5 of this patchset. The first two patches are identical. The
third has some minor cleanups.
Change notes:
1) Patch 3 has wait_for_dump_helpers modified a bit to remove the ISFIFO
check. Instead, we use the ispipe test in do_coredump.
2) wait_for_dump_helpers is modified slightly to move the readers/writers
modification outside of the while loop, so we don't have to constantly churn
those values.
3) Augmented the test that calls wait_for_dump_helpers to include a non-zero check on
core_pipe_limit. This restores the documented sysctl behavior that I intended,
which only allows for waiting on processes when core_pipe_limit is non-zero.
Patches Follow
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 9e05bd8..9defd20 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE != 0,
+ * but it runs as root, and can do lots of stupid things
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * thread group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
core_pattern: Introduce core pipe limiting sysctl
Since we can dump cores to a pipe, rather than directly to the filesystem, we
create a condition in which a user can create a very high load on the system
simply by running bad applications. If the pipe reader specified in
core_pattern is poorly written, we can have lots of outstanding resources and
processes in the system. This sysctl introduces an ability to limit that
resource consumption. core_pipe_limit defines how many in-flight dumps may be
run in parallel, dumps beyond this value are skipped and a note is made in the
kernel log. A special value of 0 in core_pipe_limit denotes unlimited core
dumps may be handled (this is the default value).
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 21 +++++++++++++++++----
kernel/sysctl.c | 9 +++++++++
3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 322a00b..bb226ba 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 9defd20..93ab6eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -63,6 +63,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1726,7 +1727,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1798,21 +1800,29 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
}
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else {
if (core_limit < binfmt->min_coredump)
@@ -1853,6 +1863,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..681052f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
core_pattern: Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata for a
crashing process in its /proc/<pid> directory. This is racy, however, since
do_coredump in the kernel doesn't wait for the user space process to complete
before it reaps the crashing process. This patch corrects that, allowing the
kernel to wait for the user space process to complete before cleaning up the
crashing process. This is a bit tricky to do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions with Oleg Nesterov off-list about this, and
this is the one we've come up with. We basically add ourselves as an additional
reader (to prevent cleanup of the pipe), write the dump in ->core_dump(), then
iteratively remove ourselves as a writer (to create the EOF condition) and wake
up the user process. Note that we add ourselves as a reader before writing the
file; this closes the race in the window between the time we write the dump and
the time we start checking for the user space process to be done with the pipe.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 93ab6eb..d124346 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1711,6 +1712,32 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void wait_for_dump_helpers(struct file *file)
+{
+ struct pipe_inode_info *pipe;
+
+ pipe = file->f_path.dentry->d_inode->i_pipe;
+
+ pipe_lock(pipe);
+ pipe->readers++;
+ pipe->writers--;
+ while (pipe->readers > 1) {
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ }
+
+ /*
+ * This reclaims the additional readers count we took in
+ * do_coredump
+ */
+ pipe->readers--;
+ pipe->writers++;
+ pipe_unlock(pipe);
+
+}
+
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1862,6 +1889,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
current->signal->group_exit_code |= 0x80;
close_fail:
+ if (ispipe && core_pipe_limit)
+ wait_for_dump_helpers(file);
filp_close(file, NULL);
fail_dropcount:
if (dump_count)
Neil, the changelog is not right,
On 07/01, Neil Horman wrote:
>
> We basically add ourselves as an additional
> reader (to prevent cleanup of the pipe), write the dump in ->core_dump(),
This is not what we do. We do not and must not inc readers before
->core_dump(). We only do this in wait_for_dump_helpers().
> note that we add ourselves as a reader before writing the
> file. this closes the race in the window between the time we write the dump and
> the time we start checking for the user space process to be done with the pipe.
again, this doesn't match the patch.
> +static void wait_for_dump_helpers(struct file *file)
> +{
> + struct pipe_inode_info *pipe;
> +
> + pipe = file->f_path.dentry->d_inode->i_pipe;
> +
> + pipe_lock(pipe);
> + pipe->readers++;
> + pipe->writers--;
> + while (pipe->readers > 1) {
> + wake_up_interruptible_sync(&pipe->wait);
> + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> + pipe_wait(pipe);
> + }
> +
> + /*
> + * This reclaims the additional readers count we took in
> + * do_coredump
> + */
now the comment is wrong.
Can't understand why you change readers/writers unconditionally;
we shouldn't have false wakeups since ->writers == 0, but OK.
Oleg.
On Wed, Jul 01, 2009 at 06:06:10PM +0200, Oleg Nesterov wrote:
> Neil, the changelog is not right,
>
> On 07/01, Neil Horman wrote:
> >
> > We basically add ourselves as an additional
> > reader (to prevent cleanup of the pipe), write the dump in ->core_dump(),
>
> This is not what we do. We do not and must not inc readers before
> ->core_dump(). We only do this in wait_for_dump_helpers().
>
> > note that we add ourselves as a reader before writing the
> > file. this closes the race in the window between the time we write the dump and
> > the time we start checking for the user space process to be done with the pipe.
>
> again, this doesn't match the patch.
>
Gahhh! Stupid commit message reuse.
> > +static void wait_for_dump_helpers(struct file *file)
> > +{
> > + struct pipe_inode_info *pipe;
> > +
> > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > +
> > + pipe_lock(pipe);
> > + pipe->readers++;
> > + pipe->writers--;
> > + while (pipe->readers > 1) {
> > + wake_up_interruptible_sync(&pipe->wait);
> > + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> > + pipe_wait(pipe);
> > + }
> > +
> > + /*
> > + * This reclaims the additional readers count we took in
> > + * do_coredump
> > + */
>
> now the comment is wrong.
>
Removing.
> Can't understand why you change readers/writers unconditionally;
> we shouldn't have false wakeups since ->writers == 0, but OK.
>
I do it for exactly the reason I said I was doing it, because we remove
ourselves as a writer (to get the EOF condition that closing would otherwise
give us), and then we add ourselves as a reader (to prevent reaping when the
real reader in userspace actually exits). Your patch does exactly the same
thing; I just moved the increment/decrement outside of the while loop so we don't
needlessly increment and decrement those values multiple times. It's just an
efficiency change.
Neil
> Oleg.
Ok, one more time to fix the comments.
This patchset provides improvements to the core_pattern infrastructure:
1) Make detection of recursive core dumps more robust and accurate
2) Provide a sysctl to limit the number of parallel coredumps in flight when
using pipes
3) Allow the kernel to wait for the completion of user space core dump
processing
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 9e05bd8..9defd20 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE != 0,
+ * but it runs as root, and can do lots of stupid things
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * thread group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
core_pattern: Introduce core pipe limiting sysctl
Since we can dump cores to a pipe, rather than directly to the filesystem, we
create a condition in which a user can create a very high load on the system
simply by running bad applications. If the pipe reader specified in
core_pattern is poorly written, we can have lots of outstanding resources and
processes in the system. This sysctl introduces an ability to limit that
resource consumption. core_pipe_limit defines how many in-flight dumps may be
run in parallel, dumps beyond this value are skipped and a note is made in the
kernel log. A special value of 0 in core_pipe_limit denotes unlimited core
dumps may be handled (this is the default value).
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 21 +++++++++++++++++----
kernel/sysctl.c | 9 +++++++++
3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 322a00b..bb226ba 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 9defd20..93ab6eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -63,6 +63,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1726,7 +1727,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1798,21 +1800,29 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
}
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else {
if (core_limit < binfmt->min_coredump)
@@ -1853,6 +1863,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..681052f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
core_pattern: Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata for a
crashing process in its /proc/<pid> directory. This is racy, however, since
do_coredump in the kernel doesn't wait for the user space process to complete
before it reaps the crashing process. This patch corrects that, allowing the
kernel to wait for the user space process to complete before cleaning up the
crashing process. This is a bit tricky to do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions with Oleg Nesterov off-list about this, and
this is the one we've come up with. We add ourselves as a pipe reader (to
prevent premature cleanup of the pipe_inode_info), and remove ourselves as a
writer (to provide an EOF condition to the writer in user space), then we
iterate until the user space process exits (which we detect by pipe->readers ==
1, hence the > 1 check in the loop). When we exit the loop, we restore the
proper reader/writer values, then we return and let filp_close in do_coredump
clean up the pipe data properly.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 93ab6eb..43872cc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1711,6 +1712,29 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void wait_for_dump_helpers(struct file *file)
+{
+ struct pipe_inode_info *pipe;
+
+ pipe = file->f_path.dentry->d_inode->i_pipe;
+
+ pipe_lock(pipe);
+ pipe->readers++;
+ pipe->writers--;
+
+ while (pipe->readers > 1) {
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ }
+
+ pipe->readers--;
+ pipe->writers++;
+ pipe_unlock(pipe);
+
+}
+
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
current->signal->group_exit_code |= 0x80;
close_fail:
+ if (ispipe && core_pipe_limit)
+ wait_for_dump_helpers(file);
filp_close(file, NULL);
fail_dropcount:
if (dump_count)
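[For illustration only, not part of the patch: a hypothetical pipe helper that
depends on this wait. It assumes core_pattern is set to something like
"|/usr/local/bin/core_helper %p" (the helper path and %p argument are
illustrative) and that core_pipe_limit is non-zero so the kernel actually
waits. The helper drains the core from stdin until the EOF provided by
wait_for_dump_helpers(), and only then reads the crashing process's /proc
data, which stays available until the helper exits.]

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char buf[4096];
	char path[64];
	FILE *status;

	if (argc < 2)
		return 1;

	/* Drain the core image; EOF arrives when the kernel drops its
	 * writer reference in wait_for_dump_helpers(). A real helper
	 * would write this data out somewhere. */
	while (read(STDIN_FILENO, buf, sizeof(buf)) > 0)
		;

	/* The kernel is still waiting for us to exit, so the crashing
	 * process's /proc entry cannot have been reaped yet. */
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	status = fopen(path, "r");
	if (status) {
		while (fgets(buf, sizeof(buf), status))
			fputs(buf, stderr);
		fclose(status);
	}
	return 0;
}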
(add Roland)
Neil, I guess we both are tired of this thread, but I still have questions ;)
On 07/01, Neil Horman wrote:
>
> +static void wait_for_dump_helpers(struct file *file)
> +{
> + struct pipe_inode_info *pipe;
> +
> + pipe = file->f_path.dentry->d_inode->i_pipe;
> +
> + pipe_lock(pipe);
> + pipe->readers++;
> + pipe->writers--;
> +
> + while (pipe->readers > 1) {
> + wake_up_interruptible_sync(&pipe->wait);
> + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> + pipe_wait(pipe);
> + }
> +
> + pipe->readers--;
> + pipe->writers++;
> + pipe_unlock(pipe);
> +
> +}
OK, I think this is simple enough and should work.
This is not exactly correct wrt signals: if we get TIF_SIGPENDING, this
becomes a busy-wait loop.
I'd suggest to do while (->readers && !signal_pending()); this is not
exactly right either, because we have other problems with signals, but
this is another story.
> void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> {
> struct core_state core_state;
> @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> current->signal->group_exit_code |= 0x80;
>
> close_fail:
> + if (ispipe && core_pipe_limit)
> + wait_for_dump_helpers(file);
Oh. I thought I misread the first version, but now I see I got it right.
And now I am confused again.
So, we only wait if core_pipe_limit != 0. Why?
The previous version, v4, called wait_for_dump_helpers() unconditionally.
And this looks more right to me. Once again, even without wait_for_dump()
the coredumping process can't be reaped until core_pattern app reads all
data from the pipe.
I won't insist. However, anybody else please take a look?
core_pipe_limit != 0 limits the number of coredump-via-pipe in flight, OK.
But, should wait_for_dump_helpers() depend on core_pipe_limit != 0?
Oleg.
On Thu, Jul 02, 2009 at 10:29:14AM +0200, Oleg Nesterov wrote:
> (add Roland)
>
> Neil, I guess we both are tired of this thread, but I still have questions ;)
>
> On 07/01, Neil Horman wrote:
> >
> > +static void wait_for_dump_helpers(struct file *file)
> > +{
> > + struct pipe_inode_info *pipe;
> > +
> > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > +
> > + pipe_lock(pipe);
> > + pipe->readers++;
> > + pipe->writers--;
> > +
> > + while (pipe->readers > 1) {
> > + wake_up_interruptible_sync(&pipe->wait);
> > + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> > + pipe_wait(pipe);
> > + }
> > +
> > + pipe->readers--;
> > + pipe->writers++;
> > + pipe_unlock(pipe);
> > +
> > +}
>
> OK, I think this is simple enough and should work.
>
> This is not exactly correct wrt signals: if we get TIF_SIGPENDING, this
> becomes a busy-wait loop.
>
> I'd suggest to do while (->readers && !signal_pending()); this is not
> exactly right either, because we have other problems with signals, but
> this is another story.
>
> > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > {
> > struct core_state core_state;
> > @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > current->signal->group_exit_code |= 0x80;
> >
> > close_fail:
> > + if (ispipe && core_pipe_limit)
> > + wait_for_dump_helpers(file);
>
> Oh. I thought I misread the first version, but now I see I got it right.
> And now I am confused again.
>
> So, we only wait if core_pipe_limit != 0. Why?
>
> The previous version, v4, called wait_for_dump_helpers() unconditionally.
> And this looks more right to me. Once again, even without wait_for_dump()
> the coredumping process can't be reaped until core_pattern app reads all
> data from the pipe.
>
> I won't insist. However, anybody else please take a look?
>
> core_pipe_limit != 0 limits the number of coredump-via-pipe in flight, OK.
>
> But, should wait_for_dump_helpers() depend on core_pipe_limit != 0?
>
I messed this up in v4 and am fixing it here. If you read the documentation I
added in patch 2, you can see that my intent with the core_pipe_limit sysctl was
to designate 0 as a special value allowing unlimited parallel core_dumps in
which we do not wait for any user space process completion (so that current
system behavior can be maintained, which I think is desirable for those user
space helpers who don't need access to a crashing process's metadata via proc).
If you look above in the second patch where we do an atomic_inc_return, you'll
see that we only honor the core_pipe_limit value if it's non-zero. This additional
check restores the behavior I documented in that patch.
Neil
> Oleg.
>
>
On 07/02, Neil Horman wrote:
>
> On Thu, Jul 02, 2009 at 10:29:14AM +0200, Oleg Nesterov wrote:
> > (add Roland)
> >
> > Neil, I guess we both are tired of this thread, but I still have questions ;)
> >
> > On 07/01, Neil Horman wrote:
> > >
> > > +static void wait_for_dump_helpers(struct file *file)
> > > +{
> > > + struct pipe_inode_info *pipe;
> > > +
> > > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > > +
> > > + pipe_lock(pipe);
> > > + pipe->readers++;
> > > + pipe->writers--;
> > > +
> > > + while (pipe->readers > 1) {
> > > + wake_up_interruptible_sync(&pipe->wait);
> > > + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> > > + pipe_wait(pipe);
> > > + }
> > > +
> > > + pipe->readers--;
> > > + pipe->writers++;
> > > + pipe_unlock(pipe);
> > > +
> > > +}
> >
> > OK, I think this is simple enough and should work.
> >
> > This is not exactly correct wrt signals: if we get TIF_SIGPENDING, this
> > becomes a busy-wait loop.
> >
> > I'd suggest to do while (->readers && !signal_pending()); this is not
> > exactly right either, because we have other problems with signals, but
> > this is another story.
> >
> > > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > {
> > > struct core_state core_state;
> > > @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > current->signal->group_exit_code |= 0x80;
> > >
> > > close_fail:
> > > + if (ispipe && core_pipe_limit)
> > > + wait_for_dump_helpers(file);
> >
> > Oh. I thought I misread the first version, but now I see I got it right.
> > And now I am confused again.
> >
> > So, we only wait if core_pipe_limit != 0. Why?
> >
> > The previous version, v4, called wait_for_dump_helpers() unconditionally.
> > And this looks more right to me. Once again, even without wait_for_dump()
> > the coredumping process can't be reaped until core_pattern app reads all
> > data from the pipe.
> >
> > I won't insist. However, anybody else please take a look?
> >
> > core_pipe_limit != 0 limits the number of coredump-via-pipe in flight, OK.
> >
> > But, should wait_for_dump_helpers() depend on core_pipe_limit != 0?
> >
> I messed this up in v4 and am fixing it here. If you read the documentation I
> added in patch 2, you can see that my intent with the core_pipe_limit sysctl was
> to designate 0 as a special value allowing unlimited parallel core_dumps in
> which we do not wait for any user space process completion
We do wait in any case. If core_dump app doesn't read the data from the pipe
->core_dump() can't complete. OK, unless all data fits into pipe buffers.
> (so that current
> system behavior can be maintained, which I think is desirable for those user
> space helpers who don't need access to a crashing process's metadata via proc).
> If you look above in the second patch where we do an atomic_inc_return, you'll
> see that we only honor the core_pipe_limit value if it's non-zero. This additional
> check restores the behavior I documented in that patch.
If you look at my message you will see I am not arguing, but I am asking
others to ack this behaviour.
As for implementation, my only complaint is that wait_for_dump_helpers() lacks
a signal_pending() check; this wasn't answered.
Oleg.
On Thu, Jul 02, 2009 at 01:36:10PM +0200, Oleg Nesterov wrote:
> On 07/02, Neil Horman wrote:
> >
> > On Thu, Jul 02, 2009 at 10:29:14AM +0200, Oleg Nesterov wrote:
> > > (add Roland)
> > >
> > > Neil, I guess we both are tired of this thread, but I still have questions ;)
> > >
> > > On 07/01, Neil Horman wrote:
> > > >
> > > > +static void wait_for_dump_helpers(struct file *file)
> > > > +{
> > > > + struct pipe_inode_info *pipe;
> > > > +
> > > > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > > > +
> > > > + pipe_lock(pipe);
> > > > + pipe->readers++;
> > > > + pipe->writers--;
> > > > +
> > > > + while (pipe->readers > 1) {
> > > > + wake_up_interruptible_sync(&pipe->wait);
> > > > + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> > > > + pipe_wait(pipe);
> > > > + }
> > > > +
> > > > + pipe->readers--;
> > > > + pipe->writers++;
> > > > + pipe_unlock(pipe);
> > > > +
> > > > +}
> > >
> > > OK, I think this is simple enough and should work.
> > >
> > > This is not exactly correct wrt signals: if we get TIF_SIGPENDING, this
> > > becomes a busy-wait loop.
> > >
> > > I'd suggest to do while (->readers && !signal_pending()); this is not
> > > exactly right either, because we have other problems with signals, but
> > > this is another story.
> > >
> > > > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > > {
> > > > struct core_state core_state;
> > > > @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > > current->signal->group_exit_code |= 0x80;
> > > >
> > > > close_fail:
> > > > + if (ispipe && core_pipe_limit)
> > > > + wait_for_dump_helpers(file);
> > >
> > > Oh. I thought I misread the first version, but now I see I got it right.
> > > And now I am confused again.
> > >
> > > So, we only wait if core_pipe_limit != 0. Why?
> > >
> > > The previous version, v4, called wait_for_dump_helpers() unconditionally.
> > > And this looks more right to me. Once again, even without wait_for_dump()
> > > the coredumping process can't be reaped until core_pattern app reads all
> > > data from the pipe.
> > >
> > > I won't insist. However, anybody else please take a look?
> > >
> > > core_pipe_limit != 0 limits the number of coredump-via-pipe in flight, OK.
> > >
> > > But, should wait_for_dump_helpers() depend on core_pipe_limit != 0?
> > >
> > I messed this up in v4 and am fixing it here. If you read the documentation I
> > added in patch 2, you can see that my intent with the core_pipe_limit sysctl was
> > to designate 0 as a special value allowing unlimited parallel core_dumps in
> > which we do not wait for any user space process completion
>
> We do wait in any case. If core_dump app doesn't read the data from the pipe
> ->core_dump() can't complete. OK, unless all data fits into pipe buffers.
>
That's true, but consider the converse situation, in which the userspace app does
read the pipe, so that we return from ->core_dump(). If the user app then
queries the /proc/<pid> directory of the crashing process, we are open to a race.
That's what this wait helps with.
> > (so that current
> > system behavior can be maintained, which I think is desirable for those user
> > space helpers who don't need access to a crashing process's metadata via proc).
> > If you look above in the second patch where we do an atomic_inc_return, you'll
> > see that we only honor the core_pipe_limit value if it's non-zero. This additional
> > check restores the behavior I documented in that patch.
>
> If you look at my message you will see I am not arguing, but I am asking
> others to ack this behaviour.
>
Ok, but you asked the question as to why I added that check; this is the answer.
> As for implementation, my only complaint is that wait_for_dump_helpers() lacks
> a signal_pending() check; this wasn't answered.
>
I'll have to defer to others on this. It seems to me that, given that we are
waiting here in the context of a process that has already received a fatal signal,
there's no opportunity to handle subsequent signals, so we don't really need to
check for them. As for the user space helper, I'm not sure what detecting a
pending signal will do for us here. I agree we busy wait if a signal is
pending, but if we drop out of the loop if a signal is pending then we cancel
the wait early, leading to the early removal of the /proc file for the crashing
process. Could we add a schedule to the loop to allow the user space helper to
run if a signal is pending instead of just dropping the loop?
Neil
> Oleg.
>
>
On 07/02, Neil Horman wrote:
>
> On Thu, Jul 02, 2009 at 01:36:10PM +0200, Oleg Nesterov wrote:
> > On 07/02, Neil Horman wrote:
> > >
> > > > > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > > > {
> > > > > struct core_state core_state;
> > > > > @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > > > current->signal->group_exit_code |= 0x80;
> > > > >
> > > > > close_fail:
> > > > > + if (ispipe && core_pipe_limit)
> > > > > + wait_for_dump_helpers(file);
> > > >
> > > > Oh. I thought I misread the first version, but now I see I got it right.
> > > > And now I am confused again.
> > > >
> > > > So, we only wait if core_pipe_limit != 0. Why?
> > > >
> > > > The previous version, v4, called wait_for_dump_helpers() unconditionally.
> > > > And this looks more right to me. Once again, even without wait_for_dump()
> > > > the coredumping process can't be reaped until core_pattern app reads all
> > > > data from the pipe.
> > > >
> > > > I won't insist. However, anybody else please take a look?
> > > >
> > > > core_pipe_limit != 0 limits the number of coredump-via-pipe in flight, OK.
> > > >
> > > > But, should wait_for_dump_helpers() depend on core_pipe_limit != 0?
> > > >
> > > I messed this up in v4 and am fixing it here. If you read the documentation I
> > > added in patch 2, you can see that my intent with the core_pipe_limit sysctl was
> > > to designate 0 as a special value allowing unlimited parallel core_dumps in
> > > which we do not wait for any user space process completion
> >
> > We do wait in any case. If core_dump app doesn't read the data from the pipe
> > ->core_dump() can't complete. OK, unless all data fits into pipe buffers.
> >
> That's true, but consider the converse situation, in which the userspace app does
> read the pipe, so that we return from ->core_dump(). If the user app then
> queries the /proc/<pid> directory of the crashing process, we are open to a race.
> That's what this wait helps with.
Sure. And I do agree wait_for_dump_helpers() is useful.
And I do agree with core_pipe_limit.
> > > (so that current
> > > system behavior can be maintained, which I think is desirable for those user
> > > space helpers who don't need access to a crashing process's metadata via proc).
> > > If you look above in the second patch where we do an atomic_inc_return, you'll
> > > see that we only honor the core_pipe_limit value if it's non-zero. This additional
> > > check restores the behavior I documented in that patch.
> >
> > If you look at my message you will see I am not arguing, but I am asking
> > others to ack this behaviour.
> >
> Ok, but you asked the question as to why I added that check, this is the answer.
And I still can't understand your answer.
My question is: why don't we do wait_for_dump_helpers() if core_pipe_limit == 0?
Because I don't really understand how core_pipe_limit is connected to
wait_for_dump_helpers(). Because, once again, we have to wait for the core_pattern
app in any case.
> > As for implementation, my only complaint is that wait_for_dump_helpers() lacks
> > a signal_pending() check; this wasn't answered.
> >
> I'll have to defer to others on this. It seems to me that, given that we are
> waiting here in the context of a process that has already received a fatal signal,
> there's no opportunity to handle subsequent signals,
Yes, we can't handle subsequent signals, but this is not needed.
> I agree we busy wait if a signal is
> pending,
Yes. And this is not nice.
> but if we drop out of the loop if a signal is pending then we cancel
> the wait early, leading to the early removal of the /proc file for the crashing
> process.
Yes. But if signal_pending() == T we already have other problems. In
particular pipe_write() can fail, and in this case the coredump won't
complete anyway.
Hopefully this will be changed soon: the coredumping task should ignore
all signals except SIGKILL, which should terminate the coredump,
and in this case of course wait_for_dump_helpers() should abort.
> Could we add a schedule to the loop to allow the user space helper to
> run if a signal is pending instead of just dropping the loop?
Not sure I understand what you mean, but no, we can't return to user-space
to handle the signal. (and just in case, pipe_wait() does schedule()).
Oleg.
On Thu, Jul 02, 2009 at 05:37:36PM +0200, Oleg Nesterov wrote:
> On 07/02, Neil Horman wrote:
> >
> > On Thu, Jul 02, 2009 at 01:36:10PM +0200, Oleg Nesterov wrote:
> > > On 07/02, Neil Horman wrote:
> > > >
> > > > > > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > > > > {
> > > > > > struct core_state core_state;
> > > > > > @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > > > > > current->signal->group_exit_code |= 0x80;
> > > > > >
> > > > > > close_fail:
> > > > > > + if (ispipe && core_pipe_limit)
> > > > > > + wait_for_dump_helpers(file);
> > > > >
> > > > > Oh. I thought I misread the first version, but now I see I got it right.
> > > > > And now I am confused again.
> > > > >
> > > > > So, we only wait if core_pipe_limit != 0. Why?
> > > > >
> > > > > The previous version, v4, called wait_for_dump_helpers() unconditionally.
> > > > > And this looks more right to me. Once again, even without wait_for_dump()
> > > > > the coredumping process can't be reaped until core_pattern app reads all
> > > > > data from the pipe.
> > > > >
> > > > > I won't insist. However, anybody else please take a look?
> > > > >
> > > > > core_pipe_limit != 0 limits the number of coredump-via-pipe in flight, OK.
> > > > >
> > > > > But, should wait_for_dump_helpers() depend on core_pipe_limit != 0?
> > > > >
> > > > I messed this up in v4 and am fixing it here. If you read the documentation I
> > > > added in patch 2, you can see that my intent with the core_pipe_limit sysctl was
> > > > to designate 0 as a special value allowing unlimited parallel core_dumps in
> > > > which we do not wait for any user space process completion
> > >
> > > We do wait in any case. If core_dump app doesn't read the data from the pipe
> > > ->core_dump() can't complete. OK, unless all data fits into pipe buffers.
> > >
> > That's true, but consider the converse situation, in which the userspace app does
> > read the pipe, so that we return from ->core_dump(). If the user app then
> > queries the /proc/<pid> directory of the crashing process, we are open to a race.
> > That's what this wait helps with.
>
> Sure. And I do agree wait_for_dump_helpers() is useful.
>
> And I do agree with core_pipe_limit.
>
> > > > (so that current
> > > > system behavior can be maintained, which I think is desirable for those user
> > > > space helpers who don't need access to a crashing process's metadata via proc).
> > > > If you look above in the second patch where we do an atomic_inc_return, you'll
> > > > see that we only honor the core_pipe_limit value if it's non-zero. This additional
> > > > check restores the behavior I documented in that patch.
> > >
> > > If you look at my message you will see I am not arguing, but I am asking
> > > others to ack this behaviour.
> > >
> > Ok, but you asked the question as to why I added that check; this is the answer.
>
> And I still can't understand your answer.
>
> My question is: why don't we do wait_for_dump_helpers() if core_pipe_limit == 0?
>
I'm sorry if I'm not explaining myself clearly. Perhaps it would be best to say
that I made this choice by design. I wanted core_pipe_limit == 0 to be a
special value in which we did 2 things:
1) Allowed an unlimited number of coredumps-to-pipes in parallel.
2) Disabled waiting on usermode helper processes to complete
I understand what you're saying in that we block in ->core_dump() once the pipe
fills up, but, as you see, we want to be able to wait after we've finished
writing the core (for the reasons we've discussed). Conversely, I see advantage
in not waiting on usermode helpers if they have no need for additional crashing
process info. In short, I see an advantage to being able to disable this
waiting feature from user space. I.e. allowing the crashing process to exit
immediately while the user helper continues to run could be advantageous.
Put it this way: If you want to be able to have an unlimited number of user mode
helpers run in parallel and have the kernel wait on each of them, set
core_pipe_limit to MAXINT, and you effectively have that situation. Since
core_pipe_limit == 0 effectively has to mean the same thing as core_pipe_limit
== MAXINT (in that you have an effectively unbounded number of processes
operating concurrently), why not add in this feature which allows you to disable
the wait after ->core_dump() entirely?
> Because I don't really understand how core_pipe_limit is connected to
> wait_for_dump_helpers(). Because, once again, we have to wait for the core_pattern
> app in any case.
>
> > > As for implementation, my only complaint is that wait_for_dump_helpers() lacks
> > > a signal_pending() check; this wasn't answered.
> > >
> > I'll have to defer to others on this. It seems to me that, given that we are
> > waiting here in the context of a process that has already received a fatal signal,
> > there's no opportunity to handle subsequent signals,
>
> Yes, we can't handle subsequent signals, but this is not needed.
>
Ok.
> > I agree we busy wait if a signal is
> > pending,
>
> Yes. And this is not nice.
>
> > but if we drop out of the loop if a signal is pending then we cancel
> > the wait early, leading to the early removal of the /proc file for the crashing
> > process.
>
> Yes. But if signal_pending() == T we already have other problems. In
> particular pipe_write() can fail, and in this case the coredump won't
> complete anyway.
>
Who's going to call pipe_write? The userspace process isn't going to write to
stdin, and by the time we're in this loop, we're done writing to the pipe
anyway.
> Hopefully this will be changed soon: the coredumping task should ignore
> all signals except SIGKILL, which should terminate the coredump,
> and in this case of course wait_for_dump_helpers() should abort.
>
It sounds like what we should do then is, rather than gate the loop on
signal_pending, we should instead gate it on fatal_signal_pending, which should
only return true if SIGKILL is asserted on the crashing task. Does that sound
reasonable to you?
> > Could we add a schedule to the loop to allow the user space helper to
> > run if a signal is pending instead of just dropping the loop?
>
> Not sure I understand what you mean, but no, we can't return to user-space
> to handle the signal. (and just in case, pipe_wait() does schedule()).
>
Ok, I was starting to think you meant we need to check the helper process for
signals, but I see now you're really concerned about the SIGKILL condition for
the crashing process, and yes, for that condition the signal check makes sense.
Do you agree that gating the loop on the additional check of
fatal_signal_pending makes sense here?
Regards
Neil
> Oleg.
>
>
Patch V7, adjusting wait loop for signal handling.
This patchset provides improvements to the core_pattern infrastructure:
1) Make detection of recursive core dumps more robust and accurate
2) Provide a sysctl to limit the number of parallel coredumps in flight when
using pipes
3) Allow for the kernel to wait for the completion of user space core dump
processing
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 9e05bd8..9defd20 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIM_CORE = !0
+ * but it runs as root, and can do lots of stupid things
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * process group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
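[For illustration only, not part of the patch: the recursion guard relies on
the pipe reader keeping RLIMIT_CORE at zero. The usermode helpers the kernel
spawns already run with a zero core limit, as the changelog notes; a sketch of
a helper enforcing that explicitly, should it ever be re-execed by hand:]

#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
	struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };

	/* With RLIMIT_CORE == 0, the check above makes do_coredump()
	 * abort the pipe dump instead of recursing if we crash. */
	if (setrlimit(RLIMIT_CORE, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}

	/* ... go on to collect the incoming core dump as usual ... */
	return 0;
}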
core_pattern: Introduce core pipe limiting sysctl
Since we can dump cores to pipe, rather than directly to the filesystem, we
create a condition in which a user can create a very high load on the system
simply by running bad applications. If the pipe reader specified in
core_pattern is poorly written, we can have lots of outstanding resources and
processes in the system. This sysctl introduces an ability to limit that
resource consumption. core_pipe_limit defines how many in-flight dumps may be
run in parallel, dumps beyond this value are skipped and a note is made in the
kernel log. A special value of 0 in core_pipe_limit denotes unlimited core
dumps may be handled (this is the default value).
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 21 +++++++++++++++++----
kernel/sysctl.c | 9 +++++++++
3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 322a00b..bb226ba 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 9defd20..93ab6eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -63,6 +63,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1726,7 +1727,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1798,21 +1800,29 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
}
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else {
if (core_limit < binfmt->min_coredump)
@@ -1853,6 +1863,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..681052f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
core_pattern: Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata for a
crashing process in their /proc/<pid> directory. This is racy, however, since
do_coredump in the kernel doesn't wait for the user space process to complete
before it reaps the crashing process. This patch corrects that, allowing the
kernel to wait for the user space process to complete before cleaning up the
crashing process. This is a bit tricky to do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions with Oleg Nesterov off-list about this, and
this is the one we've come up with. We add ourselves as a pipe reader (to
prevent premature cleanup of the pipe_inode_info), and remove ourselves as a
writer (to provide an EOF condition to the writer in user space), then we
iterate until the user space process exits (which we detect by pipe->readers ==
1, hence the > 1 check in the loop). When we exit the loop, we restore the
proper reader/writer values, then we return and let filp_close in do_coredump
clean up the pipe data properly.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 93ab6eb..6b3579e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1711,6 +1712,29 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void wait_for_dump_helpers(struct file *file)
+{
+ struct pipe_inode_info *pipe;
+
+ pipe = file->f_path.dentry->d_inode->i_pipe;
+
+ pipe_lock(pipe);
+ pipe->readers++;
+ pipe->writers--;
+
+ while ((pipe->readers > 1) && (!fatal_signal_pending(current))) {
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ }
+
+ pipe->readers--;
+ pipe->writers++;
+ pipe_unlock(pipe);
+
+}
+
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
current->signal->group_exit_code |= 0x80;
close_fail:
+ if (ispipe && core_pipe_limit)
+ wait_for_dump_helpers(file);
filp_close(file, NULL);
fail_dropcount:
if (dump_count)
On 07/02, Neil Horman wrote:
>
> On Thu, Jul 02, 2009 at 05:37:36PM +0200, Oleg Nesterov wrote:
>
> > And I still can't understand your answer.
> >
> > My question is: why don't we do wait_for_dump_helpers() if core_pipe_limit == 0?
> >
> I'm sorry if I'm not explaining myself clearly. Perhaps it would be best to say
> that I made this choice by design. I wanted core_pipe_limit == 0 to be a
> special value in which we did 2 things:
> 1) Allowed an unlimited number of coredumps-to-pipes in parallel.
> 2) Disabled waiting on usermode helper processes to complete
>
> I understand what you're saying in that we block in ->core_dump() once the pipe
> fills up, but, as you see, we want to be able to wait after we've finished
> writing the core (for the reasons we've discussed). Conversely, I see advantage
> in not waiting on usermode helpers if they have no need for additional crashing
> process info. In short, I see an advantage to being able to disable this
> waiting feature from user space. I.e. allowing the crashing process to exit
> immediately while the user helper continues to run could be advantageous.
>
> Put it this way: If you want to be able to have an unlimited number of user mode
> helpers run in parallel and have the kernel wait on each of them, set
> core_pipe_limit to MAXINT, and you effectively have that situation. Since
> core_pipe_limit == 0 effectively has to mean the same thing as core_pipe_limit
> == MAXINT (in that you have an effectively unbounded number of processes
> operating concurrently), why not add in this feature which allows you to disable
> the wait after ->core_dump() entirely?
>
>
>
> > Because I don't really understand how core_pipe_limit is connected to
> > wait_for_dump_helpers(). Because, once again, we have to wait for the core_pattern
> > app in any case.
> >
> > > > As for implementation, my only complaint is that wait_for_dump_helpers() lacks
> > > > a signal_pending() check; this wasn't answered.
> > > >
> > > I'll have to defer to others on this. It seems to me that, given that we are
> > > waiting here in the context of a process that has already received a fatal signal,
> > > there's no opportunity to handle subsequent signals,
> >
> > Yes, we can't handle subsequent signals, but this is not needed.
> >
> Ok.
>
> > > I agree we busy wait if a signal is
> > > pending,
> >
> > Yes. And this is not nice.
> >
> > > but if we drop out of the loop if a signal is pending then we cancel
> > > the wait early, leading to the early removal of the /proc file for the crashing
> > > process.
> >
> > Yes. But if signal_pending() == T we already have other problems. In
> > particular pipe_write() can fail, and in this case the coredump won't
> > complete anyway.
> >
> Who's going to call pipe_write? The userspace process isn't going to write to
> stdin,
dump_write() calls pipe_write().
> and by the time we're in this loop, we're done writing to the pipe
> anyway.
Sure. But dump_write() can fail if we receive the signal. In that case
it doesn't really matter that wait_for_dump_helpers() aborts.
> > Hopefully this will be changed soon: the coredumping task should ignore
> > all signals except SIGKILL, which should terminate the coredump,
> > and in this case of course wait_for_dump_helpers() should abort.
> >
> It sounds like what we should do then is, rather than gate the loop on
> signal_pending, we should instead gate it on fatal_signal_pending, which should
> only return true if SIGKILL is asserted on the crashing task. Does that sound
> reasonable to you?
No. I'll reply to v7.
Oleg.
On 07/02, Neil Horman wrote:
>
> +static void wait_for_dump_helpers(struct file *file)
> +{
> + struct pipe_inode_info *pipe;
> +
> + pipe = file->f_path.dentry->d_inode->i_pipe;
> +
> + pipe_lock(pipe);
> + pipe->readers++;
> + pipe->writers--;
> +
> + while ((pipe->readers > 1) && (!fatal_signal_pending(current))) {
> + wake_up_interruptible_sync(&pipe->wait);
> + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> + pipe_wait(pipe);
> + }
The kernel can hang on a UP machine.
Suppose that an rt task starts the coredump and then receives a non-fatal
signal before the core_pattern app closes stdin.
In that case this "while()" above becomes a busy-wait loop which never
stops.
Neil, please use signal_pending() as I said. Yes, this means wait_for_dump_helpers()
will abort if we receive a non-fatal signal. But as I said, we have other
problems with signals here; when we fix them, wait_for_dump_helpers()
will be fixed automatically.
Oleg.
Patch V8, converting fatal_signal_pending to signal_pending
This patchset provides improvements to the core_pattern infrastructure:
1) Make detection of recursive core dumps more robust and accurate
2) Provide a sysctl to limit the number of parallel coredumps in flight when
using pipes
3) Allow for the kernel to wait for the completion of user space core dump
processing
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 9e05bd8..9defd20 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIM_CORE = !0
+ * but it runs as root, and can do lots of stupid things
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * process group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
core_pattern: Introduce core pipe limiting sysctl
Since we can dump cores to pipe, rather than directly to the filesystem, we
create a condition in which a user can create a very high load on the system
simply by running bad applications. If the pipe reader specified in
core_pattern is poorly written, we can have lots of outstanding resources and
processes in the system. This sysctl introduces an ability to limit that
resource consumption. core_pipe_limit defines how many in-flight dumps may be
run in parallel, dumps beyond this value are skipped and a note is made in the
kernel log. A special value of 0 in core_pipe_limit denotes unlimited core
dumps may be handled (this is the default value).
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 21 +++++++++++++++++----
kernel/sysctl.c | 9 +++++++++
3 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 322a00b..bb226ba 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 9defd20..93ab6eb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -63,6 +63,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1726,7 +1727,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1798,21 +1800,29 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
goto fail_unlock;
}
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else {
if (core_limit < binfmt->min_coredump)
@@ -1853,6 +1863,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..681052f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
core_pattern: Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata for a
crashing process in their /proc/<pid> directory. This is racy, however, since
do_coredump in the kernel doesn't wait for the user space process to complete
before it reaps the crashing process. This patch corrects that, allowing the
kernel to wait for the user space process to complete before cleaning up the
crashing process. This is a bit tricky to do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions with Oleg Nesterov off-list about this, and
this is the one we've come up with. We add ourselves as a pipe reader (to
prevent premature cleanup of the pipe_inode_info), and remove ourselves as a
writer (to provide an EOF condition to the writer in user space), then we
iterate until the user space process exits (which we detect by pipe->readers ==
1, hence the > 1 check in the loop). When we exit the loop, we restore the
proper reader/writer values, then we return and let filp_close in do_coredump
clean up the pipe data properly.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 93ab6eb..6b3579e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1711,6 +1712,29 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void wait_for_dump_helpers(struct file *file)
+{
+ struct pipe_inode_info *pipe;
+
+ pipe = file->f_path.dentry->d_inode->i_pipe;
+
+ pipe_lock(pipe);
+ pipe->readers++;
+ pipe->writers--;
+
+ while ((pipe->readers > 1) && (!signal_pending(current))) {
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ }
+
+ pipe->readers--;
+ pipe->writers++;
+ pipe_unlock(pipe);
+
+}
+
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
current->signal->group_exit_code |= 0x80;
close_fail:
+ if (ispipe && core_pipe_limit)
+ wait_for_dump_helpers(file);
filp_close(file, NULL);
fail_dropcount:
if (dump_count)
Reposting unchanged with Oleg CC'd for him to ACK
On Fri, Jul 03, 2009 at 06:44:47AM -0400, Neil Horman wrote:
>
> Patch V8, converting fatal_signal_pending to signal_pending
>
>
> This patchset provides improvements to the core_pattern infrastructure:
>
> 1) Make detection of recursive core dumps more robust and accurate
>
> 2) Provide a sysctl to limit the number of parallel coredumps in flight when
> using pipes
>
> 3) Allow for the kernel to wait for the completion of user space core dump
> processing
>
> Neil
>
>
Reposting with Oleg CC'd for him to ACK
On Fri, Jul 03, 2009 at 06:50:08AM -0400, Neil Horman wrote:
> core_pattern: Change how we detect recursive dumps with core_pattern pipes
>
> Change how we detect recursive dumps. Currently we have a mechanism by which
> we try to compare pathnames of the crashing process to the core_pattern path.
> This is broken for a dozen reasons, and just doesn't work in any sort of robust
> way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
> apps set RLIMIT_CORE to zero, we don't write out core files for any process with
> that particular limit set. If the core_pattern is a pipe, any non-zero limit is
> translated to RLIM_INFINITY. This allows complete dumps to be captured, but
> prevents infinite recursion in the event that the core_pattern process itself
> crashes.
>
> Signed-off-by: Neil Horman <[email protected]>
> Reported-by: Earl Chew <[email protected]>
>
>
> exec.c | 43 +++++++++++++++++++++----------------------
> 1 file changed, 21 insertions(+), 22 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 9e05bd8..9defd20 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1776,35 +1776,34 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> lock_kernel();
> ispipe = format_corename(corename, signr);
> unlock_kernel();
> - /*
> - * Don't bother to check the RLIMIT_CORE value if core_pattern points
> - * to a pipe. Since we're not writing directly to the filesystem
> - * RLIMIT_CORE doesn't really apply, as no actual core file will be
> - * created unless the pipe reader choses to write out the core file
> - * at which point file size limits and permissions will be imposed
> - * as it does with any other process
> - */
> +
> if (ispipe) {
> + if (core_limit == 0) {
> + /*
> + * Normally core limits are irrelevant to pipes, since
> + * we're not writing to the file system, but we use
> + * core_limit of 0 here as a special value. Any
> + * non-zero limit gets set to RLIM_INFINITY below, but
> + * a limit of 0 skips the dump. This is a consistent
> + * way to catch recursive crashes. We can still crash
> + * if the core_pattern binary sets RLIM_CORE = !0
> + * but it runs as root, and can do lots of stupid things
> + * Note that we use task_tgid_vnr here to grab the pid of the
> + * process group leader. That way we get the right pid if a thread
> + * in a multi-threaded core_pattern process dies.
> + */
> + printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
> + task_tgid_vnr(current), current->comm);
> + printk(KERN_WARNING "Aborting core\n");
> + goto fail_unlock;
> + }
> +
> helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
> if (!helper_argv) {
> printk(KERN_WARNING "%s failed to allocate memory\n",
> __func__);
> goto fail_unlock;
> }
> - /* Terminate the string before the first option */
> - delimit = strchr(corename, ' ');
> - if (delimit)
> - *delimit = '\0';
> - delimit = strrchr(helper_argv[0], '/');
> - if (delimit)
> - delimit++;
> - else
> - delimit = helper_argv[0];
> - if (!strcmp(delimit, current->comm)) {
> - printk(KERN_NOTICE "Recursive core dump detected, "
> - "aborting\n");
> - goto fail_unlock;
> - }
>
> core_limit = RLIM_INFINITY;
>
Reposting with Oleg Cc'd so he can ACK it
On Fri, Jul 03, 2009 at 06:51:29AM -0400, Neil Horman wrote:
> core_pattern: Introduce core pipe limiting sysctl
>
> Since we can dump cores to pipe, rather than directly to the filesystem, we
> create a condition in which a user can create a very high load on the system
> simply by running bad applications. If the pipe reader specified in
> core_pattern is poorly written, we can have lots of outstanding resources and
> processes in the system. This sysctl introduces an ability to limit that
> resource consumption. core_pipe_limit defines how many in-flight dumps may be
> run in parallel, dumps beyond this value are skipped and a note is made in the
> kernel log. A special value of 0 in core_pipe_limit denotes unlimited core
> dumps may be handled (this is the default value).
>
> Signed-off-by: Neil Horman <[email protected]>
> Reported-by: Earl Chew <[email protected]>
>
>
> Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
> fs/exec.c | 21 +++++++++++++++++----
> kernel/sysctl.c | 9 +++++++++
> 3 files changed, 48 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index 322a00b..bb226ba 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
> - acct
> - auto_msgmni
> - core_pattern
> +- core_pipe_limit
> - core_uses_pid
> - ctrl-alt-del
> - dentry-state
> @@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
>
> ==============================================================
>
> +core_pipe_limit:
> +
> +This sysctl is only applicable when core_pattern is configured to pipe core
> +files to a user space helper (when the first character of core_pattern is a '|',
> +see above). When collecting cores via a pipe to an application, it is
> +occasionally useful for the collecting application to gather data about the
> +crashing process from its /proc/pid directory. In order to do this safely, the
> +kernel must wait for the collecting process to exit, so as not to remove the
> +crashing process's proc files prematurely. This in turn creates the possibility
> +that a misbehaving userspace collecting process can block the reaping of a
> +crashed process simply by never exiting. This sysctl defends against that. It
> +defines how many concurrent crashing processes may be piped to user space
> +applications in parallel. If this value is exceeded, then those crashing
> +processes above that value are noted via the kernel log and their cores are
> +skipped. 0 is a special value, indicating that unlimited processes may be
> +captured in parallel, but that no waiting will take place (i.e. the collecting
> +process is not guaranteed access to /proc/<crashing pid>/). This value defaults
> +to 0.
> +
> +==============================================================
> +
> core_uses_pid:
>
> The default coredump filename is "core". By setting
> diff --git a/fs/exec.c b/fs/exec.c
> index 9defd20..93ab6eb 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -63,6 +63,7 @@
>
> int core_uses_pid;
> char core_pattern[CORENAME_MAX_SIZE] = "core";
> +unsigned int core_pipe_limit;
> int suid_dumpable = 0;
>
> /* The maximal length of core_pattern is also specified in sysctl.c */
> @@ -1726,7 +1727,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
> char **helper_argv = NULL;
> int helper_argc = 0;
> - char *delimit;
> + int dump_count = 0;
> + static atomic_t core_dump_count = ATOMIC_INIT(0);
>
> audit_core_dumps(signr);
>
> @@ -1798,21 +1800,29 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> goto fail_unlock;
> }
>
> + dump_count = atomic_inc_return(&core_dump_count);
> + if (core_pipe_limit && (core_pipe_limit < dump_count)) {
> + printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
> + task_tgid_vnr(current), current->comm);
> + printk(KERN_WARNING "Skipping core dump\n");
> + goto fail_dropcount;
> + }
> +
> helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
> if (!helper_argv) {
> printk(KERN_WARNING "%s failed to allocate memory\n",
> __func__);
> - goto fail_unlock;
> + goto fail_dropcount;
> }
>
> core_limit = RLIM_INFINITY;
>
> /* SIGPIPE can happen, but it's just never processed */
> - if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
> + if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
> &file)) {
> printk(KERN_INFO "Core dump to %s pipe failed\n",
> corename);
> - goto fail_unlock;
> + goto fail_dropcount;
> }
> } else {
> if (core_limit < binfmt->min_coredump)
> @@ -1853,6 +1863,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
>
> close_fail:
> filp_close(file, NULL);
> +fail_dropcount:
> + if (dump_count)
> + atomic_dec(&core_dump_count);
> fail_unlock:
> if (helper_argv)
> argv_free(helper_argv);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 62e4ff9..681052f 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -77,6 +77,7 @@ extern int max_threads;
> extern int core_uses_pid;
> extern int suid_dumpable;
> extern char core_pattern[];
> +extern unsigned int core_pipe_limit;
> extern int pid_max;
> extern int min_free_kbytes;
> extern int pid_max_min, pid_max_max;
> @@ -407,6 +408,14 @@ static struct ctl_table kern_table[] = {
> .proc_handler = &proc_dostring,
> .strategy = &sysctl_string,
> },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "core_pipe_limit",
> + .data = &core_pipe_limit,
> + .maxlen = sizeof(unsigned int),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
> #ifdef CONFIG_PROC_SYSCTL
> {
> .procname = "tainted",
Reposting for Oleg to Ack
On Fri, Jul 03, 2009 at 06:52:33AM -0400, Neil Horman wrote:
> core_pattern: Allow core_pattern pipes to wait for user space to complete
>
> One of the things that user space processes like to do is look at metadata for a
> crashing process in their /proc/<pid> directory. This is racy, however, since
> do_coredump in the kernel doesn't wait for the user space process to complete
> before it reaps the crashing process. This patch corrects that, allowing the
> kernel to wait for the user space process to complete before cleaning up the
> crashing process. This is a bit tricky to do for a few reasons:
>
> 1) The user space process isn't our child, so we can't sys_wait4 on it
> 2) We need to close the pipe before waiting for the user process to complete,
> since the user process may rely on an EOF condition
>
> I've discussed several solutions for this with Oleg Nesterov off-list, and
> this is the one we've come up with. We add ourselves as a pipe reader (to
> prevent premature cleanup of the pipe_inode_info), and remove ourselves as a
> writer (to provide an EOF condition to the writer in user space), then we
> iterate until the user space process exits (which we detect by pipe->readers ==
> 1, hence the > 1 check in the loop). When we exit the loop, we restore the
> proper reader/writer values, then we return and let filp_close in do_coredump
> clean up the pipe data properly.
>
> Signed-off-by: Neil Horman <[email protected]>
> Reported-by: Earl Chew <[email protected]>
>
>
> exec.c | 26 ++++++++++++++++++++++++++
> 1 file changed, 26 insertions(+)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 93ab6eb..6b3579e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -55,6 +55,7 @@
> #include <linux/kmod.h>
> #include <linux/fsnotify.h>
> #include <linux/fs_struct.h>
> +#include <linux/pipe_fs_i.h>
>
> #include <asm/uaccess.h>
> #include <asm/mmu_context.h>
> @@ -1711,6 +1712,29 @@ int get_dumpable(struct mm_struct *mm)
> return (ret >= 2) ? 2 : ret;
> }
>
> +static void wait_for_dump_helpers(struct file *file)
> +{
> + struct pipe_inode_info *pipe;
> +
> + pipe = file->f_path.dentry->d_inode->i_pipe;
> +
> + pipe_lock(pipe);
> + pipe->readers++;
> + pipe->writers--;
> +
> + while ((pipe->readers > 1) && (!signal_pending(current))) {
> + wake_up_interruptible_sync(&pipe->wait);
> + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> + pipe_wait(pipe);
> + }
> +
> + pipe->readers--;
> + pipe->writers++;
> + pipe_unlock(pipe);
> +
> +}
> +
> +
> void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> {
> struct core_state core_state;
> @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> current->signal->group_exit_code |= 0x80;
>
> close_fail:
> + if (ispipe && core_pipe_limit)
> + wait_for_dump_helpers(file);
> filp_close(file, NULL);
> fail_dropcount:
> if (dump_count)
On 07/07, Neil Horman wrote:
>
> > +static void wait_for_dump_helpers(struct file *file)
> > +{
> > + struct pipe_inode_info *pipe;
> > +
> > + pipe = file->f_path.dentry->d_inode->i_pipe;
> > +
> > + pipe_lock(pipe);
> > + pipe->readers++;
> > + pipe->writers--;
> > +
> > + while ((pipe->readers > 1) && (!signal_pending(current))) {
> > + wake_up_interruptible_sync(&pipe->wait);
> > + kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
> > + pipe_wait(pipe);
> > + }
> > +
> > + pipe->readers--;
> > + pipe->writers++;
> > + pipe_unlock(pipe);
> > +
> > +}
> > +
> > +
> > void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > {
> > struct core_state core_state;
> > @@ -1862,6 +1886,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> > current->signal->group_exit_code |= 0x80;
> >
> > close_fail:
> > + if (ispipe && core_pipe_limit)
> > + wait_for_dump_helpers(file);
Looks good to me.
As for "&& core_pipe_limit" here, I agree this is subjective and I see
your point.
Oleg.
Patch V9, re-diffing against latest -mm tree
No changes, just re-diffing against latest -mm to fix up some patch application
static.
Neil
core_pattern: Change how we detect recursive dumps with core_pattern pipes
Change how we detect recursive dumps. Currently we have a mechanism by which
we try to compare pathnames of the crashing process to the core_pattern path.
This is broken for a dozen reasons, and just doesn't work in any sort of robust
way. I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper
apps set RLIMIT_CORE to zero, we don't write out core files for any process with
that particular limit set. If the core_pattern is a pipe, any non-zero limit is
translated to RLIM_INFINITY. This allows complete dumps to be captured, but
prevents infinite recursion in the event that the core_pattern process itself
crashes.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 43 +++++++++++++++++++++----------------------
1 file changed, 21 insertions(+), 22 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 76ed4f6..21d2686 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1788,38 +1788,37 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
lock_kernel();
ispipe = format_corename(corename, signr);
unlock_kernel();
- /*
- * Don't bother to check the RLIMIT_CORE value if core_pattern points
- * to a pipe. Since we're not writing directly to the filesystem
- * RLIMIT_CORE doesn't really apply, as no actual core file will be
- * created unless the pipe reader choses to write out the core file
- * at which point file size limits and permissions will be imposed
- * as it does with any other process
- */
+
if ((!ispipe) && (core_limit < binfmt->min_coredump))
goto fail_unlock;
if (ispipe) {
+ if (core_limit == 0) {
+ /*
+ * Normally core limits are irrelevant to pipes, since
+ * we're not writing to the file system, but we use
+ * core_limit of 0 here as a special value. Any
+ * non-zero limit gets set to RLIM_INFINITY below, but
+ * a limit of 0 skips the dump. This is a consistent
+ * way to catch recursive crashes. We can still crash
+ * if the core_pattern binary sets RLIMIT_CORE != 0
+ * but runs as root, and so can do lots of stupid things.
+ * Note that we use task_tgid_vnr here to grab the pid of the
+ * thread group leader. That way we get the right pid if a thread
+ * in a multi-threaded core_pattern process dies.
+ */
+ printk(KERN_WARNING "Process %d(%s) has RLIMIT_CORE set to 0\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Aborting core\n");
+ goto fail_unlock;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
goto fail_unlock;
}
- /* Terminate the string before the first option */
- delimit = strchr(corename, ' ');
- if (delimit)
- *delimit = '\0';
- delimit = strrchr(helper_argv[0], '/');
- if (delimit)
- delimit++;
- else
- delimit = helper_argv[0];
- if (!strcmp(delimit, current->comm)) {
- printk(KERN_NOTICE "Recursive core dump detected, "
- "aborting\n");
- goto fail_unlock;
- }
core_limit = RLIM_INFINITY;
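Per the changelog above, helper apps are expected to zero their own RLIMIT_CORE
before touching any crash data, which is what trips the new check if the helper
itself crashes. A minimal userspace sketch of such a helper (not part of the
patch; the actual core processing is elided):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	/*
	 * With RLIMIT_CORE set to 0, the do_coredump() check above
	 * refuses to pipe out a core should this process crash,
	 * breaking the recursion.
	 */
	struct rlimit rl = { .rlim_cur = 0, .rlim_max = 0 };

	if (setrlimit(RLIMIT_CORE, &rl) != 0) {
		perror("setrlimit");
		return 1;
	}

	/* ... read the core image from stdin and process it ... */
	return 0;
}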
core_pattern: Introduce core pipe limiting sysctl
Since we can dump cores to pipe, rather than directly to the filesystem, we
create a condition in which a user can create a very high load on the system
simply by running bad applications. If the pipe reader specified in
core_pattern is poorly written, we can have lots of outstanding resources and
processes in the system. This sysctl introduces an ability to limit that
resource consumption. core_pipe_limit defines how many in-flight dumps may be
run in parallel; dumps beyond this value are skipped and a note is made in the
kernel log. A special value of 0 in core_pipe_limit denotes that unlimited core
dumps may be handled (this is the default value).
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
Documentation/sysctl/kernel.txt | 22 ++++++++++++++++++++++
fs/exec.c | 25 +++++++++++++++++++------
kernel/sysctl.c | 9 +++++++++
3 files changed, 50 insertions(+), 6 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 382cfd8..7706b6a 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -21,6 +21,7 @@ show up in /proc/sys/kernel:
- acct
- auto_msgmni
- core_pattern
+- core_pipe_limit
- core_uses_pid
- ctrl-alt-del
- dentry-state
@@ -119,6 +120,27 @@ core_pattern is used to specify a core dumpfile pattern name.
==============================================================
+core_pipe_limit:
+
+This sysctl is only applicable when core_pattern is configured to pipe core
+files to a user space helper (when the first character of core_pattern is a '|',
+see above). When collecting cores via a pipe to an application, it is
+occasionally useful for the collecting application to gather data about the
+crashing process from its /proc/pid directory. In order to do this safely, the
+kernel must wait for the collecting process to exit, so as not to remove the
+crashing process's proc files prematurely. This in turn creates the possibility
+that a misbehaving userspace collecting process can block the reaping of a
+crashed process simply by never exiting. This sysctl defends against that. It
+defines how many concurrent crashing processes may be piped to user space
+applications in parallel. If this value is exceeded, then those crashing
+processes above that value are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes may be
+captured in parallel, but that no waiting will take place (i.e. the collecting
+process is not guaranteed access to /proc/<crashing pid>/). This value defaults
+to 0.
+
+==============================================================
+
core_uses_pid:
The default coredump filename is "core". By setting
diff --git a/fs/exec.c b/fs/exec.c
index 21d2686..5f70d7e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -63,6 +63,7 @@
int core_uses_pid;
char core_pattern[CORENAME_MAX_SIZE] = "core";
+unsigned int core_pipe_limit;
int suid_dumpable = 0;
/* The maximal length of core_pattern is also specified in sysctl.c */
@@ -1733,7 +1734,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
unsigned long core_limit = current->signal->rlim[RLIMIT_CORE].rlim_cur;
char **helper_argv = NULL;
int helper_argc = 0;
- char *delimit;
+ int dump_count = 0;
+ static atomic_t core_dump_count = ATOMIC_INIT(0);
audit_core_dumps(signr);
@@ -1812,29 +1814,37 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
printk(KERN_WARNING "Aborting core\n");
goto fail_unlock;
}
-
+
+ dump_count = atomic_inc_return(&core_dump_count);
+ if (core_pipe_limit && (core_pipe_limit < dump_count)) {
+ printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
+ task_tgid_vnr(current), current->comm);
+ printk(KERN_WARNING "Skipping core dump\n");
+ goto fail_dropcount;
+ }
+
helper_argv = argv_split(GFP_KERNEL, corename+1, &helper_argc);
if (!helper_argv) {
printk(KERN_WARNING "%s failed to allocate memory\n",
__func__);
- goto fail_unlock;
+ goto fail_dropcount;
}
core_limit = RLIM_INFINITY;
/* SIGPIPE can happen, but it's just never processed */
- if (call_usermodehelper_pipe(corename+1, helper_argv, NULL,
+ if (call_usermodehelper_pipe(helper_argv[0], helper_argv, NULL,
&file)) {
printk(KERN_INFO "Core dump to %s pipe failed\n",
corename);
- goto fail_unlock;
+ goto fail_dropcount;
}
} else
file = filp_open(corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
if (IS_ERR(file))
- goto fail_unlock;
+ goto fail_dropcount;
inode = file->f_path.dentry->d_inode;
if (inode->i_nlink > 1)
goto close_fail; /* multiple links - don't dump */
@@ -1864,6 +1874,9 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
current->signal->group_exit_code |= 0x80;
close_fail:
filp_close(file, NULL);
+fail_dropcount:
+ if (dump_count)
+ atomic_dec(&core_dump_count);
fail_unlock:
if (helper_argv)
argv_free(helper_argv);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f267340..ad8c01a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -77,6 +77,7 @@ extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
+extern unsigned int core_pipe_limit;
extern int pid_max;
extern int min_free_kbytes;
extern int pid_max_min, pid_max_max;
@@ -413,6 +414,14 @@ static struct ctl_table kern_table[] = {
.proc_handler = &proc_dostring,
.strategy = &sysctl_string,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "core_pipe_limit",
+ .data = &core_pipe_limit,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
#ifdef CONFIG_PROC_SYSCTL
{
.procname = "tainted",
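As a usage note, the new sysctl can be tuned like any other /proc/sys/kernel
entry. A small sketch of setting it from C (not part of the patch; the limit of
4 is an arbitrary example, equivalent to
"echo 4 > /proc/sys/kernel/core_pipe_limit"):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/core_pipe_limit", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/*
	 * Allow at most four piped dumps in flight; a fifth concurrent
	 * crash is noted in the kernel log and its core is skipped
	 * until one of the helpers exits.
	 */
	fprintf(f, "4\n");
	fclose(f);
	return 0;
}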
core_pattern: Allow core_pattern pipes to wait for user space to complete
One of the things that user space processes like to do is look at metadata for a
crashing process in their /proc/<pid> directory. This is racy, however, since
do_coredump in the kernel doesn't wait for the user space process to complete
before it reaps the crashing process. This patch corrects that, allowing the
kernel to wait for the user space process to complete before cleaning up the
crashing process. This is a bit tricky to do for a few reasons:
1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition
I've discussed several solutions for this with Oleg Nesterov off-list, and
this is the one we've come up with. We add ourselves as a pipe reader (to
prevent premature cleanup of the pipe_inode_info), and remove ourselves as a
writer (to provide an EOF condition to the writer in user space), then we
iterate until the user space process exits (which we detect by pipe->readers ==
1, hence the > 1 check in the loop). When we exit the loop, we restore the
proper reader/writer values, then we return and let filp_close in do_coredump
clean up the pipe data properly.
Signed-off-by: Neil Horman <[email protected]>
Reported-by: Earl Chew <[email protected]>
exec.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/fs/exec.c b/fs/exec.c
index 5f70d7e..510b455 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/kmod.h>
#include <linux/fsnotify.h>
#include <linux/fs_struct.h>
+#include <linux/pipe_fs_i.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1718,6 +1719,29 @@ int get_dumpable(struct mm_struct *mm)
return (ret >= 2) ? 2 : ret;
}
+static void wait_for_dump_helpers(struct file *file)
+{
+ struct pipe_inode_info *pipe;
+
+ pipe = file->f_path.dentry->d_inode->i_pipe;
+
+ pipe_lock(pipe);
+ pipe->readers++;
+ pipe->writers--;
+
+ while ((pipe->readers > 1) && (!signal_pending(current))) {
+ wake_up_interruptible_sync(&pipe->wait);
+ kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
+ pipe_wait(pipe);
+ }
+
+ pipe->readers--;
+ pipe->writers++;
+ pipe_unlock(pipe);
+
+}
+
+
void do_coredump(long signr, int exit_code, struct pt_regs *regs)
{
struct core_state core_state;
@@ -1873,6 +1897,8 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
if (retval)
current->signal->group_exit_code |= 0x80;
close_fail:
+ if (ispipe && core_pipe_limit)
+ wait_for_dump_helpers(file);
filp_close(file, NULL);
fail_dropcount:
if (dump_count)
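For context, a collecting helper written against these semantics can drain the
pipe until EOF (provided by the writers decrement above) and then inspect the
crashing task's /proc entry, which the kernel keeps alive while it waits. A
hypothetical sketch, assuming core_pattern is set to something like
"|/usr/local/bin/helper %p" so that argv[1] carries the crashing pid:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64], buf[4096];
	FILE *status;

	if (argc < 2)
		return 1;

	/* Drain the core image; EOF arrives once the dump is written */
	while (read(STDIN_FILENO, buf, sizeof(buf)) > 0)
		; /* ... persist the core data somewhere useful ... */

	/* Safe: the kernel waits for us before reaping the crashed task */
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	status = fopen(path, "r");
	if (status) {
		while (fgets(buf, sizeof(buf), status))
			fputs(buf, stderr);
		fclose(status);
	}

	return 0;
}

Note this only holds when core_pipe_limit is nonzero; with the default of 0 the
kernel does not wait, and the /proc entry may disappear at any time.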
On Mon, 2009-06-22 at 13:28 -0400, Neil Horman wrote:
> 2) Allow for the kernel to wait for a core_pattern process to complete. One of
> the things core_pattern processes might do is interrogate the status of a
> crashing process via its /proc/pid directory. To ensure that that directory is
> not removed prematurely, we wait for the process to exit prior to cleaning it
> up.
>
Would this mean that the kernel would wait for the pattern process to
complete before PANIC in the case of init core dumping?
I'd find that useful :-)
Scott
--
Scott James Remnant
[email protected]
On Wed, Jul 29, 2009 at 04:13:02PM +0100, Scott James Remnant wrote:
> On Mon, 2009-06-22 at 13:28 -0400, Neil Horman wrote:
>
> > 2) Allow for the kernel to wait for a core_pattern process to complete. One of
> > the things core_pattern processes might do is interrogate the status of a
> > crashing process via its /proc/pid directory. To ensure that that directory is
> > not removed prematurely, we wait for the process to exit prior to cleaning it
> > up.
> >
> Would this mean that the kernel would wait for the pattern process to
> complete before PANIC in the case of init core dumping?
>
> I'd find that useful :-)
>
Not without additional work. If init crashed in the initramfs, I don't think
there's a way to handle that. If it crashes at some later time, I think it just
gets restarted IIRC. I'm sure you can change that behavior, but this patch
doesn't address that.
If you want to debug a custom init process, why not run a wrapper program as
init, that just forks the init you want to run and captures the core when it
crashes?
Neil
> Scott
On Wed, 2009-07-29 at 16:18 -0400, Neil Horman wrote:
> On Wed, Jul 29, 2009 at 04:13:02PM +0100, Scott James Remnant wrote:
> > On Mon, 2009-06-22 at 13:28 -0400, Neil Horman wrote:
> >
> > > 2) Allow for the kernel to wait for a core_pattern process to complete. One of
> > > the things core_pattern processes might do is interrogate the status of a
> > > crashing process via its /proc/pid directory. To ensure that that directory is
> > > not removed prematurely, we wait for the process to exit prior to cleaning it
> > > up.
> > >
> > Would this mean that the kernel would wait for the pattern process to
> > complete before PANIC in the case of init core dumping?
> >
> > I'd find that useful :-)
> >
> Not without additional work. If init crashed in the initramfs, I don't think
> there's a way to handle that. If it crashes at some later time, I think it just
> gets restarted IIRC. I'm sure you can change that behavior, but this patch
> doesn't address that.
>
When the system init daemon crashes, the kernel PANICs. When not using
core_pattern, this is ok: we get a core file. When using apport, as far
as I can tell it never waits for apport to finish, so we don't get the
crash.
I was hoping that by waiting for the core_pattern process to finish,
this might solve this issue.
The other obvious fix I can apply to the init daemon is to reset the
core pattern and deliberately dump core somewhere that can be picked up.
> If you want to debug a custom init process, why not run a wrapper program as
> init, that just forks the init you want to run and captures the core when it
> crashes?
>
Because then that's not the init daemon; it's not pid 1, it doesn't have
processes reparented to it. And it's very annoying having the entire
system reparented to gdb, which doesn't deal so well with that ;-)
Scott
--
Scott James Remnant
[email protected]