2007-11-28 12:38:40

by Robin Holt

Subject: Can we make application core dumps interruptible?


We have a customer machine with 4096 CPUs. When a user application
crashes, it begins dumping core and can tie up the filesystem and
processors for a considerable period of time. Often, they contact the
user, the user says the core dump files will not be useful, and they
reboot the machine. They have already reduced the default core dump size
to not dump anything and taken all reasonable steps to limit core dumps
while still allowing them to be useful for those users that need them.
They would like to not need to reboot.

They hoped for a couple of changes, one of which is a way for a SIGTERM,
SIGKILL, or something along that line to interrupt the core dump process.
Is this the correct direction to take? Are there any better ideas for
handling this?

Thanks,
Robin Holt
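
For reference, the per-process core dump size limit mentioned above is the RLIMIT_CORE resource limit. The following is only a minimal sketch of lowering the soft limit from C; the function name set_core_soft_limit is made up for illustration, and nothing here describes how the customer site actually configured its limits.

```c
#include <sys/resource.h>

/* Hypothetical helper: set the soft core-dump size limit of the calling
 * process. Returns 0 on success, -1 on failure (errno is set by libc). */
int set_core_soft_limit(rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* The soft limit may not exceed the hard limit, so clamp it. */
    if (rl.rlim_max != RLIM_INFINITY && bytes > rl.rlim_max)
        bytes = rl.rlim_max;

    rl.rlim_cur = bytes;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling set_core_soft_limit(0) disables core files for the process and its future children, while leaving the hard limit alone so that a user who does want a dump can raise the soft limit again.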


2007-11-28 13:00:22

by Alan

Subject: Re: Can we make application core dumps interruptible?

> They hoped for a couple of changes, one of which is a way for a SIGTERM,
> SIGKILL, or something along that line to interrupt the core dump process.
> Is this the correct direction to take? Are there any better ideas for
> handling this?

Probably. In addition, current kernels allow you to pipe the core dump via
a userspace helper instead of writing it directly to disk. That may
possibly be helpful?

Alan
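
To make the userspace-helper idea concrete: when /proc/sys/kernel/core_pattern starts with a "|", the kernel runs the named program and feeds it the core image on standard input. Below is a rough sketch of the copying loop such a helper might use to cap how much of the dump reaches disk. The function name copy_capped and the idea of a byte cap are assumptions for illustration, not something proposed in this thread.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper loop: copy at most `cap` bytes from in_fd to out_fd.
 * Returns the number of bytes copied, or -1 on an I/O error. */
long copy_capped(int in_fd, int out_fd, size_t cap)
{
    char buf[65536];
    size_t total = 0;

    while (total < cap) {
        size_t want = cap - total;
        if (want > sizeof(buf))
            want = sizeof(buf);

        ssize_t n = read(in_fd, buf, want);
        if (n == 0)
            break;                      /* end of the dump */
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry the read */
            return -1;
        }

        /* write() may be partial, so loop until the chunk is out. */
        size_t off = 0;
        while (off < (size_t)n) {
            ssize_t w = write(out_fd, buf + off, (size_t)n - off);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += (size_t)w;
        }
        total += (size_t)n;
    }
    return (long)total;
}
```

A helper's main() could call copy_capped(STDIN_FILENO, fd, cap) with fd opened on the destination file. Because the helper is an ordinary process, it can also be killed like any other, which is part of what makes this route attractive here.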

2007-11-29 12:02:21

by Oleg Nesterov

Subject: Re: Can we make application core dumps interruptible?

On 11/28, Robin Holt wrote:
>
> We have a customer machine with 4096 CPUs. When a user application
> crashes, it begins dumping core and can tie up the filesystem and
> processors for a considerable period of time. Often, they contact the
> user, the user says the core dump files will not be useful, and they
> reboot the machine. They have already reduced the default core dump size
> to not dump anything and taken all reasonable steps to limit core dumps
> while still allowing them to be useful for those users that need them.
> They would like to not need to reboot.
>
> They hoped for a couple of changes, one of which is a way for a SIGTERM,
> SIGKILL, or something along that line to interrupt the core dump process.
> Is this the correct direction to take? Are there any better ideas for
> handling this?

Well, I don't know what would be the right solution, but perhaps we can do
something like the patch below. It allows aborting the coredump with kill -9.

Oleg.

--- fs/binfmt_elf.c~ 2007-10-25 16:22:10.000000000 +0400
+++ fs/binfmt_elf.c 2007-11-29 14:47:43.000000000 +0300
@@ -1178,6 +1178,9 @@ out:
  */
 static int dump_write(struct file *file, const void *addr, int nr)
 {
+	if (sigismember(&current->signal->shared_pending.signal, SIGKILL))
+		return 0;
+
 	return file->f_op->write(file, addr, nr, &file->f_pos) == nr;
 }
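
The check-before-each-write pattern in the patch can be tried out in user space. The sketch below is only an analogue, not kernel code: write_unless_pending is a made-up name, and because a process cannot block SIGKILL (so it never sees it pending), the caller passes a blockable signal such as SIGTERM or SIGUSR1 and must block it first with sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <unistd.h>

/* Hypothetical user-space analogue of the patched dump_write().
 * Returns 1 if the whole chunk was written; returns 0 on a short write,
 * a write error, or when `abort_sig` is pending for the caller. */
int write_unless_pending(int fd, const void *addr, size_t nr, int abort_sig)
{
    sigset_t pending;

    /* Refuse to write once the abort signal has been queued. */
    if (sigpending(&pending) == 0 && sigismember(&pending, abort_sig) == 1)
        return 0;

    return write(fd, addr, nr) == (ssize_t)nr;
}
```

As in the patch, a return of 0 is indistinguishable from an ordinary failed write, so a dump loop that already stops on write failure needs no further changes to honour the abort request.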