2010-06-22 00:12:24

by Edward Allcutt

Subject: [PATCH] fs: limit maximum concurrent coredumps

A moderately large number of large core files being dumped simultaneously
can impose severe latency penalties on other processes due to IO load.

For example, a common configuration for PHP web-servers includes apache's
prefork MPM, mod_php and a PHP opcode cache utilizing shared memory. In
certain failure modes, all requests serviced by PHP result in a segfault.
Enabling coredumps might lead to 10-20 coredumps per second, all attempting
to write a 150-200MB core file. This leads to the whole system becoming
entirely unresponsive for many minutes.

The ability to limit concurrent coredumps allows dumping core to be safely
enabled in these situations without affecting responsiveness of the system
as a whole.

I have several servers running with this patch applied (actually backported
to v2.6.26) and it has allowed me to deal successfully with the situation
described above.

In do_coredump I have pulled dump_count back out to the top-level scope
and core_dump_count is now incremented in the normal path.

Added sysctl parameter for tuning core_max_concurrency.

checkpatch complains about "extern ... core_max_concurrency" but this seems
to be an acceptable exception based on the preponderance of other extern
variables in kernel/sysctl.c

Signed-off-by: Edward Allcutt <[email protected]>
---
Documentation/sysctl/kernel.txt | 10 ++++++++++
fs/exec.c | 21 ++++++++++++++-------
kernel/sysctl.c | 8 ++++++++
3 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 3894eaa..6a58d9b 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -23,6 +23,7 @@ show up in /proc/sys/kernel:
 - bootloader_version [ X86 only ]
 - callhome [ S390 only ]
 - auto_msgmni
+- core_max_concurrency
 - core_pattern
 - core_pipe_limit
 - core_uses_pid
@@ -139,6 +140,15 @@ on has a service contract with IBM.
 
 ==============================================================
 
+core_max_concurrency:
+
+Controls the maximum number of coredumps that can be in progress
+concurrently. Coredumping processes that would cause this value
+to be exceeded are noted via the kernel log and their cores are
+skipped. 0 is a special value, indicating that unlimited processes
+may be captured in parallel. This value defaults to 0.
+
+==============================================================
 core_pattern:
 
 core_pattern is used to specify a core dumpfile pattern name.
diff --git a/fs/exec.c b/fs/exec.c
index e19de6a..90785e1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -62,6 +62,7 @@
 #include "internal.h"
 
 int core_uses_pid;
+unsigned int core_max_concurrency;
 char core_pattern[CORENAME_MAX_SIZE] = "core";
 unsigned int core_pipe_limit;
 int suid_dumpable = 0;
@@ -1844,6 +1845,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
 	int retval = 0;
 	int flag = 0;
 	int ispipe;
+	int dump_count = 0;
 	static atomic_t core_dump_count = ATOMIC_INIT(0);
 	struct coredump_params cprm = {
 		.signr = signr,
@@ -1865,6 +1867,14 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
 	if (!__get_dumpable(cprm.mm_flags))
 		goto fail;
 
+	dump_count = atomic_inc_return(&core_dump_count);
+	if (core_max_concurrency && (core_max_concurrency < dump_count)) {
+		printk(KERN_WARNING "Pid %d(%s) over core_max_concurrency\n",
+		       task_tgid_vnr(current), current->comm);
+		printk(KERN_WARNING "Skipping core dump\n");
+		goto fail;
+	}
+
 	cred = prepare_creds();
 	if (!cred)
 		goto fail;
@@ -1900,7 +1910,6 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
 	unlock_kernel();
 
 	if (ispipe) {
-		int dump_count;
 		char **helper_argv;
 
 		if (cprm.limit == 1) {
@@ -1926,19 +1935,18 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
 		}
 		cprm.limit = RLIM_INFINITY;
 
-		dump_count = atomic_inc_return(&core_dump_count);
 		if (core_pipe_limit && (core_pipe_limit < dump_count)) {
 			printk(KERN_WARNING "Pid %d(%s) over core_pipe_limit\n",
 			       task_tgid_vnr(current), current->comm);
 			printk(KERN_WARNING "Skipping core dump\n");
-			goto fail_dropcount;
+			goto fail_unlock;
 		}
 
 		helper_argv = argv_split(GFP_KERNEL, corename+1, NULL);
 		if (!helper_argv) {
 			printk(KERN_WARNING "%s failed to allocate memory\n",
				__func__);
-			goto fail_dropcount;
+			goto fail_unlock;
 		}
 
 		retval = call_usermodehelper_fns(helper_argv[0], helper_argv,
@@ -1994,14 +2002,13 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
 close_fail:
 	if (cprm.file)
 		filp_close(cprm.file, NULL);
-fail_dropcount:
-	if (ispipe)
-		atomic_dec(&core_dump_count);
 fail_unlock:
 	coredump_finish(mm);
 	revert_creds(old_cred);
 fail_creds:
 	put_cred(cred);
 fail:
+	if (dump_count)
+		atomic_dec(&core_dump_count);
 	return;
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d24f761..3f95916 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -88,6 +88,7 @@ extern int sysctl_oom_dump_tasks;
 extern int max_threads;
 extern int core_uses_pid;
 extern int suid_dumpable;
+extern unsigned int core_max_concurrency;
 extern char core_pattern[];
 extern unsigned int core_pipe_limit;
 extern int pid_max;
@@ -415,6 +416,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.procname	= "core_max_concurrency",
+		.data		= &core_max_concurrency,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.procname	= "core_pattern",
 		.data		= core_pattern,
 		.maxlen		= CORENAME_MAX_SIZE,
--
1.7.1


2010-06-22 01:23:55

by Roland McGrath

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

A core dump is just an instance of a process suddenly reading lots of its
address space and doing lots of filesystem writes, producing the kinds of
thrashing that any such instance might entail. It really seems like the
real solution to this kind of problem will be in some more general kind of
throttling of processes (or whatever manner of collections thereof) when
they go hog-wild on page-ins or filesystem writes, or whatever else. I'm
not trying to get into the details of what that would be. But I have to
cite this hack as the off-topic kludge that it really is. That said, I do
certainly sympathize with the desire for a quick hack that addresses the
scenario you experience.

For the case you described, it seems to me that constraining concurrency
per se would be better than punting core dumps when too concurrent. That
is, you should not skip the dump when you hit the limit. Rather, you
should block in do_coredump() until the next dump already in progress
finishes. (It should be possible to use TASK_KILLABLE so that those dumps
in waiting can be aborted with a follow-on SIGKILL. But Oleg will have to
check on the signals details being right for that.)
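
Concretely, that kind of killable wait could look roughly like the sketch
below (illustration only, not part of the posted patch: core_dump_throttle,
core_dump_release and the wait queue are made-up names; core_max_concurrency
is the sysctl from Edward's patch, and the code assumes it lives next to the
existing coredump code in fs/exec.c):

/*
 * Sketch: instead of skipping the dump, sleep killably until one of the
 * dumps in flight finishes.  Ignores the race where core_max_concurrency
 * changes while a dump is running.
 */
static atomic_t core_dumps_in_flight = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(core_dump_wait);

static int core_dump_throttle(void)
{
	if (!core_max_concurrency)
		return 0;
	/* Returns -ERESTARTSYS if a SIGKILL arrives while waiting. */
	return wait_event_killable(core_dump_wait,
			atomic_add_unless(&core_dumps_in_flight, 1,
					  core_max_concurrency));
}

static void core_dump_release(void)	/* only after a successful throttle */
{
	if (!core_max_concurrency)
		return;
	atomic_dec(&core_dumps_in_flight);
	wake_up(&core_dump_wait);
}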

That won't make your crashers each complete quickly, but it will prevent
the thrashing. Instead of some crashers suddenly not producing dumps at
all, they'll just all queue up waiting to finish crashing but not using any
CPU or IO resources. That way you don't lose any core dumps unless you
want to start SIGKILL'ing things (which oom_kill might do if need be),
you just don't die in flames trying to do nothing but dump cores.


Thanks,
Roland

2010-06-22 01:42:36

by Andrew Morton

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

On Mon, 21 Jun 2010 18:23:03 -0700 (PDT) Roland McGrath <[email protected]> wrote:

> A core dump is just an instance of a process suddenly reading lots of its
> address space and doing lots of filesystem writes, producing the kinds of
> thrashing that any such instance might entail. It really seems like the
> real solution to this kind of problem will be in some more general kind of
> throttling of processes (or whatever manner of collections thereof) when
> they go hog-wild on page-ins or filesystem writes, or whatever else. I'm
> not trying to get into the details of what that would be. But I have to
> cite this hack as the off-topic kludge that it really is. That said, I do
> certainly sympathize with the desire for a quick hack that addresses the
> scenario you experience.

yup.

> For the case you described, it seems to me that constraining concurrency
> per se would be better than punting core dumps when too concurrent. That
> is, you should not skip the dump when you hit the limit. Rather, you
> should block in do_coredump() until the next dump already in progress
> finishes. (It should be possible to use TASK_KILLABLE so that those dumps
> in waiting can be aborted with a follow-on SIGKILL. But Oleg will have to
> check on the signals details being right for that.)

yup.

Might be able to use semaphores for this. Use sema_init(),
down_killable() and up().

Modifying the max concurrency value would require a loop of up()s and
down()s, probably all surrounded by a mutex_lock. Which is a bit ugly,
and should be done in kernel/semaphore.c I guess.
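
For illustration, resizing such a limit semaphore when the sysctl value
changes might look something like this sketch (core_dump_sem, the mutex and
the helper names are hypothetical, not existing kernel code):

static struct semaphore core_dump_sem;
static DEFINE_MUTEX(core_dump_sem_lock);
static unsigned int core_dump_sem_size;	/* slots currently in the semaphore */

/* Called once at boot with the initial core_max_concurrency value. */
static void core_dump_sem_setup(unsigned int slots)
{
	sema_init(&core_dump_sem, slots);
	core_dump_sem_size = slots;
}

/* Called from the sysctl handler when the limit is changed. */
static int core_dump_sem_resize(unsigned int new_size)
{
	mutex_lock(&core_dump_sem_lock);
	while (core_dump_sem_size < new_size) {
		up(&core_dump_sem);		/* add a slot */
		core_dump_sem_size++;
	}
	while (core_dump_sem_size > new_size) {
		/* Removing a slot may block until a dump in flight finishes. */
		if (down_killable(&core_dump_sem)) {
			mutex_unlock(&core_dump_sem_lock);
			return -EINTR;
		}
		core_dump_sem_size--;
	}
	mutex_unlock(&core_dump_sem_lock);
	return 0;
}

Dumpers would then bracket the dump with down_killable(&core_dump_sem) and
up(&core_dump_sem).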

> That won't make your crashers each complete quickly, but it will prevent
> the thrashing. Instead of some crashers suddenly not producing dumps at
> all, they'll just all queue up waiting to finish crashing but not using any
> CPU or IO resources. That way you don't lose any core dumps unless you
> want to start SIGKILL'ing things (which oom_kill might do if need be),
> you just don't die in flames trying to do nothing but dump cores.

A global knob is a bit old-school. Perhaps it should be a per-memcg
knob or something.



otoh, one could perhaps toss all these tasks into a blkio_cgroup and
solve this problem with the block IO controller. After all, that's
what it's for.
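
As a rough userspace-only illustration of that route (the mount point, group
name, weight and PID below are assumptions; the blkio controller needs CFQ
on 2.6.33 or later):

/* Put the crash-prone workers into a low-weight blkio cgroup so their
 * dump I/O is deprioritized relative to everything else. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	/* Assumes the blkio controller is mounted at /cgroup/blkio. */
	mkdir("/cgroup/blkio/coredumpers", 0755);
	/* CFQ proportional weight; 100 is the low end of the range. */
	write_str("/cgroup/blkio/coredumpers/blkio.weight", "100");
	/* Move a worker (pid 1234 as an example) into the group. */
	write_str("/cgroup/blkio/coredumpers/tasks", "1234");
	return 0;
}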

2010-06-22 02:18:22

by KOSAKI Motohiro

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

> A moderately large number of large core files being dumped simultaneously
> can impose severe latency penalties on other processes due to IO load.
>
> For example, a common configuration for PHP web-servers includes apache's
> prefork MPM, mod_php and a PHP opcode cache utilizing shared memory. In
> certain failure modes, all requests serviced by PHP result in a segfault.
> Enabling coredumps might lead to 10-20 coredumps per second, all attempting
> to write a 150-200MB core file. This leads to the whole system becoming
> entirely unresponsive for many minutes.
>
> The ability to limit concurrent coredumps allows dumping core to be safely
> enabled in these situations without affecting responsiveness of the system
> as a whole.
>
> I have several servers running with this patch applied (actually backported
> to v2.6.26) and it has allowed me to deal successfully with the situation
> described above.
>
> In do_coredump I have pulled dump_count back out to the top-level scope
> and core_dump_count is now incremented in the normal path.
>
> Added sysctl parameter for tuning core_max_concurrency.
>
> checkpatch complains about "extern ... core_max_concurrency" but this seems
> to be an acceptable exception based on the preponderance of other extern
> variables in kernel/sysctl.c

To be honest, my customers have had a very similar problem for a long
time. When HPC MPI jobs or apache prefork MPM processes die, the system
generates a large amount of concurrent core dump I/O and consumes all of
the I/O bandwidth. Eventually all other services stall for a long time
and the HA cluster software forcibly shuts such a node down.

My case used two techniques: 1) limit concurrent coredumps (but block
rather than skip, as Roland suggests) 2) reduce I/O priority
automatically in do_coredump(). I don't know whether my way was the
correct one, but I believe we need something for this case.
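
As a very rough sketch of that second technique (illustration only, not the
actual change used on those servers), the dumping task could be demoted to
idle I/O priority before the core is written:

#include <linux/ioprio.h>
#include <linux/sched.h>

/*
 * Illustration: drop the dumping task to idle I/O priority in
 * do_coredump() so the dump cannot starve other I/O.  set_task_ioprio()
 * can fail (e.g. -ENOMEM); a real patch would have to decide whether
 * that matters and whether the old priority needs restoring.
 */
static int coredump_lower_ioprio(void)
{
	return set_task_ioprio(current,
			       IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0));
}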


2010-06-22 03:57:52

by H. Peter Anvin

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

On 06/21/2010 06:23 PM, Roland McGrath wrote:
> A core dump is just an instance of a process suddenly reading lots of its
> address space and doing lots of filesystem writes, producing the kinds of
> thrashing that any such instance might entail.

It is, although with one possibly important difference: the process has
had an involuntary state transition, which may mean that the priority
settings it had as a "live" process are no longer applicable. It
would certainly seem appropriate to give the administrator the option of
altering the priority parameters of coredumping processes.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-06-22 08:44:55

by Jiri Kosina

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

On Mon, 21 Jun 2010, Roland McGrath wrote:

> A core dump is just an instance of a process suddenly reading lots of its
> address space and doing lots of filesystem writes, producing the kinds of
> thrashing that any such instance might entail.

But priority settings don't apply any more to a core-dumping process, do
they?

--
Jiri Kosina
SUSE Labs, Novell Inc.

2010-06-22 08:54:46

by Kamezawa Hiroyuki

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

On Mon, 21 Jun 2010 18:41:16 -0700
Andrew Morton <[email protected]> wrote:

> On Mon, 21 Jun 2010 18:23:03 -0700 (PDT) Roland McGrath <[email protected]> wrote:

> > That won't make your crashers each complete quickly, but it will prevent
> > the thrashing. Instead of some crashers suddenly not producing dumps at
> > all, they'll just all queue up waiting to finish crashing but not using any
> > CPU or IO resources. That way you don't lose any core dumps unless you
> > want to start SIGKILL'ing things (which oom_kill might do if need be),
> > you just don't die in flames trying to do nothing but dump cores.
>
> A global knob is a bit old-school. Perhaps it should be a per-memcg
> knob or something.
>

Hmm, on my desktop it seems that a coredump from a process in a group is
charged against the root cgroup (not against the group it belongs to).
This seemed strange... I chased down why for 2 hours and noticed:

==
[root@bluextal kamezawa]# cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp /var/cache/abrt %p %s %u %c
==
This is fedora-12.

So, for recent distros, dumping core with limited resources may be a job
for the abrt program. It can make use of the I/O cgroup controller plus
direct I/O.

If kernel help is necessary, this helper should perhaps run in the
caller's cgroup.
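
For illustration, even a trivial pipe helper can throttle itself from
userspace, here by dropping its own I/O priority before writing the core
(a toy sketch, nothing like the real abrt-hook-ccpp; the output path,
ioprio constants and buffer size are assumptions):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define IOPRIO_CLASS_IDLE	3
#define IOPRIO_CLASS_SHIFT	13
#define IOPRIO_WHO_PROCESS	1

/* Registered roughly as: |/usr/local/sbin/core-helper %p */
int main(int argc, char **argv)
{
	char path[64], buf[65536];
	size_t n;
	FILE *out;

	/* Idle I/O priority for this helper (class IDLE, data 0). */
	syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT);

	snprintf(path, sizeof(path), "/var/cores/core.%s",
		 argc > 1 ? argv[1] : "unknown");
	out = fopen(path, "w");
	if (!out)
		return 1;

	/* The kernel feeds the core image to us on stdin. */
	while ((n = fread(buf, 1, sizeof(buf), stdin)) > 0)
		fwrite(buf, 1, n, out);

	fclose(out);
	return 0;
}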

Regards,
-Kame

2010-06-22 08:57:43

by Roland McGrath

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

> But priority settings don't apply any more to a core-dumping process, do
> they?

Why wouldn't they?

2010-06-23 16:00:29

by Oleg Nesterov

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

On 06/21, Edward Allcutt wrote:
>
> The ability to limit concurrent coredumps allows dumping core to be safely
> enabled in these situations without affecting responsiveness of the system
> as a whole.

OK, but please note that the patch is not right,

> @@ -1844,6 +1845,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
>  	int retval = 0;
>  	int flag = 0;
>  	int ispipe;
> +	int dump_count = 0;
>  	static atomic_t core_dump_count = ATOMIC_INIT(0);
>  	struct coredump_params cprm = {
>  		.signr = signr,
> @@ -1865,6 +1867,14 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
>  	if (!__get_dumpable(cprm.mm_flags))
>  		goto fail;
>
> +	dump_count = atomic_inc_return(&core_dump_count);
> +	if (core_max_concurrency && (core_max_concurrency < dump_count)) {
> +		printk(KERN_WARNING "Pid %d(%s) over core_max_concurrency\n",
> +		       task_tgid_vnr(current), current->comm);
> +		printk(KERN_WARNING "Skipping core dump\n");
> +		goto fail;
> +	}
> +

We can't return here. We should kill other threads which share the same
->mm in any case.

Suppose that core_dump_count > core_max_concurrency, and we send, say,
SIGQUIT to the process. With this patch SIGQUIT suddenly starts to kill
only a single thread; this must not happen.

If you change the patch to sleep until core_dump_count < core_max_concurrency,
then, again, we should kill other threads first.

Oleg.

2010-06-23 16:05:16

by Oleg Nesterov

Subject: Re: [PATCH] fs: limit maximum concurrent coredumps

On 06/23, Oleg Nesterov wrote:
>
> On 06/21, Edward Allcutt wrote:
> >
> > The ability to limit concurrent coredumps allows dumping core to be safely
> > enabled in these situations without affecting responsiveness of the system
> > as a whole.
>
> OK, but please note that the patch is not right,

OOPS, sorry, I was not exactly right either.

> > @@ -1844,6 +1845,7 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> >  	int retval = 0;
> >  	int flag = 0;
> >  	int ispipe;
> > +	int dump_count = 0;
> >  	static atomic_t core_dump_count = ATOMIC_INIT(0);
> >  	struct coredump_params cprm = {
> >  		.signr = signr,
> > @@ -1865,6 +1867,14 @@ void do_coredump(long signr, int exit_code, struct pt_regs *regs)
> >  	if (!__get_dumpable(cprm.mm_flags))
> >  		goto fail;
> >
> > +	dump_count = atomic_inc_return(&core_dump_count);
> > +	if (core_max_concurrency && (core_max_concurrency < dump_count)) {
> > +		printk(KERN_WARNING "Pid %d(%s) over core_max_concurrency\n",
> > +		       task_tgid_vnr(current), current->comm);
> > +		printk(KERN_WARNING "Skipping core dump\n");
> > +		goto fail;
> > +	}
> > +
>
> We can't return here. We should kill other threads which share the same
> ->mm in any case.
>
> Suppose that core_dump_count > core_max_concurrency, and we send, say,
> SIGQUIT to the process. With this patch SIGQUIT suddenly starts to kill
> only a single thread; this must not happen.

well, the caller does do_group_exit() after do_coredump(); this kills
sub-threads.

However, this doesn't kill other CLONE_VM tasks. Perhaps this is fine,
but I am not sure.

> If you change the patch to sleep until core_dump_count < core_max_concurrency,
> then, again, we should kill other threads first.

Yes, this is true. If we are going to sleep, we shouldn't allow other
threads to run.

Oleg.