While running 20 parallel instances of dd as follows:

#!/bin/bash

for i in `seq 1 20`; do
        dd if=/dev/zero of=/export/hda3/dd_$i bs=1073741824 count=1 &
done
wait
on a 16G machine, we noticed that rather than just killing the
processes, the entire kernel went down.  Stracing dd reveals that it
first does an mmap2, which creates 1GB worth of zero page mappings.
It then reads from /dev/zero into those pages, and finally writes
them out.  The machine died during the reads.  Looking at the code,
we noticed that /dev/zero's read operation had at some point been
changed from handing out zero page mappings to actually zeroing the
pages.  Zeroing the pages causes physical pages to be allocated to
the process.
But when the process has exhausted all the memory it can, the kernel
cannot kill it, as it is still in kernel mode allocating more memory.
Consequently, the kernel eventually crashes.
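For reference, the read_zero() loop in drivers/char/mem.c looks
roughly like this (a simplified sketch reconstructed from the diff
context below, not the verbatim source).  Twenty 1GB buffers
oversubscribe the 16GB of RAM, so each dd ends up stuck in this loop
faulting in page after page, with no point at which a pending SIGKILL
is noticed:

        /*
         * Simplified sketch of the pre-patch read_zero() loop.
         * __clear_user() writes zeroes into the user buffer, faulting in
         * (and therefore allocating) a physical page for every untouched
         * virtual page.  Nothing here checks for pending fatal signals.
         */
        while (count) {
                unsigned long unwritten;
                size_t chunk = count;

                if (chunk > PAGE_SIZE)
                        chunk = PAGE_SIZE;      /* just for latency reasons */
                unwritten = __clear_user(buf, chunk);   /* may allocate pages */
                written += chunk - unwritten;
                if (unwritten)
                        break;
                buf += chunk;
                count -= chunk;
                cond_resched(); /* yields the CPU, but cannot kill the task */
        }
        return written ? written : -EFAULT;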
To fix this, I propose that when a fatal signal is pending during a
/dev/zero read operation, we simply return and let the user process die.
Here is a patch that does that.
Signed-off-by: Salman Qazi <[email protected]>
---
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 8f05c38..2ffa36e 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -696,6 +696,11 @@ static ssize_t read_zero(struct file * file, char __user * buf,
 			break;
 		buf += chunk;
 		count -= chunk;
+		/* The exit code here doesn't actually matter, as userland
+		 * will never see it.
+		 */
+		if (fatal_signal_pending(current))
+			return -ENOMEM;
 		cond_resched();
 	}
 	return written ? written : -EFAULT;
On Thu, 4 Jun 2009 13:32:55 -0700 (PDT)
Salman Qazi <[email protected]> wrote:
> While running 20 parallel instances of dd as follows:
>
> #!/bin/bash
>
> for i in `seq 1 20`; do
>         dd if=/dev/zero of=/export/hda3/dd_$i bs=1073741824 count=1 &
> done
> wait
>
> on a 16G machine, we noticed that rather than just killing the
> processes, the entire kernel went down.  Stracing dd reveals that it
> first does an mmap2, which creates 1GB worth of zero page mappings.
> It then reads from /dev/zero into those pages, and finally writes
> them out.  The machine died during the reads.  Looking at the code,
> we noticed that /dev/zero's read operation had at some point been
> changed from handing out zero page mappings to actually zeroing the
> pages.  Zeroing the pages causes physical pages to be allocated to
> the process.
erk, Nick broke dd(1):

  commit 557ed1fa2620dc119adb86b34c614e152a629a80
  Author: Nick Piggin <[email protected]>
  Date:   Tue Oct 16 01:24:40 2007 -0700

      remove ZERO_PAGE

This is the first report I've seen of problems arising from that
change.
> But when the process has exhausted all the memory it can, the kernel
> cannot kill it, as it is still in kernel mode allocating more memory.
> Consequently, the kernel eventually crashes.
>
> To fix this, I propose that when a fatal signal is pending during a
> /dev/zero read operation, we simply return and let the user process die.
> Here is a patch that does that.
>
> Signed-off-by: Salman Qazi <[email protected]>
> ---
> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> index 8f05c38..2ffa36e 100644
> --- a/drivers/char/mem.c
> +++ b/drivers/char/mem.c
> @@ -696,6 +696,11 @@ static ssize_t read_zero(struct file * file, char __user * buf,
>  			break;
>  		buf += chunk;
>  		count -= chunk;
> +		/* The exit code here doesn't actually matter, as userland
> +		 * will never see it.
> +		 */
> +		if (fatal_signal_pending(current))
> +			return -ENOMEM;
>  		cond_resched();
>  	}
>  	return written ? written : -EFAULT;
OK.  I think.

It's presumptuous to return -ENOMEM: we don't _know_ that this signal
came from the oom-killer.  It would be better to return -EINTR here.
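To see why: the errno chosen for a fatal signal is never observed,
because the process is destroyed before read() returns, but EINTR is
what userspace conventionally sees when a caught, non-fatal signal
interrupts a blocking read().  A minimal userspace sketch of that
convention (a hypothetical example using a blocking descriptor such
as a pipe on stdin, not current /dev/zero behaviour):

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_alarm(int sig)
{
        (void)sig;      /* empty handler: exists only to interrupt read() */
}

int main(void)
{
        char buf[4096];
        struct sigaction sa;

        sa.sa_handler = on_alarm;
        sa.sa_flags = 0;                /* deliberately no SA_RESTART */
        sigemptyset(&sa.sa_mask);
        sigaction(SIGALRM, &sa, NULL);
        alarm(1);

        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
        if (n < 0 && errno == EINTR)
                fprintf(stderr, "read() interrupted: caller saw EINTR\n");
        return n < 0 ? 1 : 0;
}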
On Thu, Jun 4, 2009 at 1:50 PM, Andrew Morton <[email protected]> wrote:
> On Thu, 4 Jun 2009 13:32:55 -0700 (PDT)
> Salman Qazi <[email protected]> wrote:
>
>> While running 20 parallel instances of dd as follows:
>>
>> #!/bin/bash
>>
>> for i in `seq 1 20`; do
>>         dd if=/dev/zero of=/export/hda3/dd_$i bs=1073741824 count=1 &
>> done
>> wait
>>
>> on a 16G machine, we noticed that rather than just killing the
>> processes, the entire kernel went down.  Stracing dd reveals that it
>> first does an mmap2, which creates 1GB worth of zero page mappings.
>> It then reads from /dev/zero into those pages, and finally writes
>> them out.  The machine died during the reads.  Looking at the code,
>> we noticed that /dev/zero's read operation had at some point been
>> changed from handing out zero page mappings to actually zeroing the
>> pages.  Zeroing the pages causes physical pages to be allocated to
>> the process.
>
> erk, Nick broke dd(1):
>
>   commit 557ed1fa2620dc119adb86b34c614e152a629a80
>   Author: Nick Piggin <[email protected]>
>   Date:   Tue Oct 16 01:24:40 2007 -0700
>
>       remove ZERO_PAGE
>
> This is the first report I've seen of problems arising from that
> change.
>
>> But when the process has exhausted all the memory it can, the kernel
>> cannot kill it, as it is still in kernel mode allocating more memory.
>> Consequently, the kernel eventually crashes.
>>
>> To fix this, I propose that when a fatal signal is pending during a
>> /dev/zero read operation, we simply return and let the user process die.
>> Here is a patch that does that.
>>
>> Signed-off-by: Salman Qazi <[email protected]>
>> ---
>> diff --git a/drivers/char/mem.c b/drivers/char/mem.c
>> index 8f05c38..2ffa36e 100644
>> --- a/drivers/char/mem.c
>> +++ b/drivers/char/mem.c
>> @@ -696,6 +696,11 @@ static ssize_t read_zero(struct file * file, char __user * buf,
>>  			break;
>>  		buf += chunk;
>>  		count -= chunk;
>> +		/* The exit code here doesn't actually matter, as userland
>> +		 * will never see it.
>> +		 */
>> +		if (fatal_signal_pending(current))
>> +			return -ENOMEM;
>>  		cond_resched();
>>  	}
>>  	return written ? written : -EFAULT;
>
> OK.  I think.
>
> It's presumptuous to return -ENOMEM: we don't _know_ that this signal
> came from the oom-killer.  It would be better to return -EINTR here.
agreed.
>
On Thu, 4 Jun 2009, Andrew Morton wrote:
> >
> > To fix this, I propose that when a fatal signal is pending during a
> > /dev/zero read operation, we simply return and let the user process die.
> > Here is a patch that does that.
> >
> > Signed-off-by: Salman Qazi <[email protected]>
> > ---
> > diff --git a/drivers/char/mem.c b/drivers/char/mem.c
> > index 8f05c38..2ffa36e 100644
> > --- a/drivers/char/mem.c
> > +++ b/drivers/char/mem.c
> > @@ -696,6 +696,11 @@ static ssize_t read_zero(struct file * file, char __user * buf,
> >  			break;
> >  		buf += chunk;
> >  		count -= chunk;
> > +		/* The exit code here doesn't actually matter, as userland
> > +		 * will never see it.
> > +		 */
> > +		if (fatal_signal_pending(current))
> > +			return -ENOMEM;
> >  		cond_resched();
> >  	}
> >  	return written ? written : -EFAULT;
>
> OK. I think.
>
> It's presumptuous to return -ENOMEM: we don't _know_ that this signal
> came from the oom-killer. It would be better to return -EINTR here.
I don't think the error matters in this case, since we literally only care
about fatal signals, but I agree that EINTR is probably better.
That said, it would be even better to basically act as if it were a
signal, and do something like

	return written ? written : -EINTR;

because that might allow us to simply make it _totally_ interruptible some
day. There is nothing that says that /dev/zero shouldn't act like an
interruptible file descriptor (like a pipe), and just return a partial
read.
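Any caller already written for pipe-like semantics handles both
cases.  A sketch of such a loop, with read_full() as a hypothetical
helper name rather than an existing API:

#include <errno.h>
#include <unistd.h>

/*
 * Keep reading until 'count' bytes arrive, EOF, or a hard error.
 * Partial reads and EINTR are simply retried; this is all it would
 * take for userspace to cope with a fully interruptible /dev/zero.
 */
static ssize_t read_full(int fd, char *buf, size_t count)
{
        size_t done = 0;

        while (done < count) {
                ssize_t n = read(fd, buf + done, count - done);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted: retry */
                        return -1;              /* hard error */
                }
                if (n == 0)
                        break;                  /* EOF */
                done += (size_t)n;
        }
        return (ssize_t)done;
}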
If we want to do this for 2.6.30, though, I very much agree with the
notion of limiting it to just fatal signals.
Linus
On Thu, 4 Jun 2009, Linus Torvalds wrote:
>
> If we want to do this for 2.6.30, though, I very much agree with the
> notion of limiting it to just fatal signals.
IOW, I really think the patch should look like the following, and that
this has nothing to do with OOM-killing at all.
Linus
---
 drivers/char/mem.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 8f05c38..65e12bc 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -694,6 +694,9 @@ static ssize_t read_zero(struct file * file, char __user * buf,
 		written += chunk - unwritten;
 		if (unwritten)
 			break;
+		/* Consider changing this to just 'signal_pending()' with lots of testing */
+		if (fatal_signal_pending(current))
+			return written ? written : -EINTR;
 		buf += chunk;
 		count -= chunk;
 		cond_resched();