Date: Sun, 31 Aug 2003 01:57:14 +0200
From: Andrea Arcangeli <andrea@suse.de>
To: Marcelo Tosatti <marcelo@parcelfarce.linux.theplanet.co.uk>
Cc: Marcelo Tosatti <marcelo@conectiva.com.br>,
       Mike Fedyk <mfedyk@matchmail.com>, Antonio Vargas <wind@cocodriloo.com>,
       lkml <linux-kernel@vger.kernel.org>,
       Marc-Christian Petersen <m.c.p@wolk-project.de>
Subject: Re: Andrea VM changes
Message-ID: <20030830235714.GM24409@dualathlon.random>
References: <20030830231904.GL24409@dualathlon.random> <Pine.LNX.4.44.0308302026570.20323-100000@logos.cnet>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.44.0308302026570.20323-100000@logos.cnet>
User-Agent: Mutt/1.4i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4853
Lines: 107

On Sat, Aug 30, 2003 at 08:30:36PM -0300, Marcelo Tosatti wrote:
> 
> 
> On Sun, 31 Aug 2003, Andrea Arcangeli wrote:
> 
> > On Sat, Aug 30, 2003 at 04:21:02PM -0300, Marcelo Tosatti wrote:
> > >
> > > 
> > > On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
> > > 
> > > > >
> > > > > Indeed, you are right.
> > > > >
> > > > > I'll start looking at them Monday. I'll keep you in touch. Thanks.
> > > >
> > > > Andrea,
> > > >
> > > > Would you mind explaining 05_vm_06_swap_out-3 to me?
> > > >
> > > > I see you change shrink_cache, try_to_free_pages_zone, etc.
> > > >
> > > > Can you please give me a detailed explanation of the changes there?
> > > >
> > > > I'd appreciate it very much.
> > > >
> > > > I'll keep looking at other patches for now.
> > > 
> > > 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
> > > OOM killer. Is that right? Why?
> > 
> > because the oom killer is a DoS on servers. On a database setup with 2G
> > free and, say, all tasks 2.7G large, it'll start killing all the
> > thousand database tasks instead of the 2G netscape task that hit a
> > userspace bug and started allocating ram in a loop, and that will
> > make no progress since no physical ram will be released. There's no need
> > for an oom killer to keep the system stable with my vm, and the current
> > probabilistic oom killer in the page fault handler
> 
> So tasks get killed in case of page allocation failure? 

yes.

When alloc_pages returns NULL during page fault handling we just
call do_exit. With 2.2-aa we were even smarter: we also checked whether
the task had iopl privileges (something that at the moment we can do
only in the page fault handler, btw), so we could trust the task and
just send a SIGTERM a few times instead of immediately doing a
do_exit(SIGKILL). That way we wouldn't screw up the graphics card, for
example (killing an iopl task isn't always safe). But I never
forward-ported this very nice feature to 2.4.

If alloc_pages returns NULL in any other case, it's up to the caller to
return -ENOMEM to userspace as the retval of the syscall.
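
The two paths described above can be sketched as a small userspace
model. This is only an illustration, not code from the patches: the enum,
the struct, the function names and MAX_SIGTERM_TRIES are all hypothetical,
and the iopl branch models the 2.2-aa behaviour that was never ported to
2.4 (leave has_iopl false to model 2.4).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes; these names are illustrative, not kernel symbols. */
enum oom_action {
    OOM_NONE,          /* allocation succeeded, nothing to do */
    OOM_SEND_SIGTERM,  /* 2.2-aa style: ask a trusted iopl task to exit */
    OOM_EXIT_SIGKILL   /* stands in for do_exit(SIGKILL) on the faulting task */
};

struct task_model {
    bool has_iopl;      /* task holds I/O port privileges */
    int  sigterm_sent;  /* polite requests already sent to this task */
};

#define MAX_SIGTERM_TRIES 3  /* assumed value for "a few times" */

/* Page fault path: 'page' stands for what alloc_pages() returned. */
enum oom_action fault_oom_action(struct task_model *t, const void *page)
{
    assert(t != NULL);
    if (page != NULL)
        return OOM_NONE;
    if (t->has_iopl && t->sigterm_sent < MAX_SIGTERM_TRIES) {
        t->sigterm_sent++;
        return OOM_SEND_SIGTERM;
    }
    return OOM_EXIT_SIGKILL;
}

/* Every other path: the caller hands the failure back to userspace. */
int syscall_alloc_result(const void *page)
{
    return page != NULL ? 0 : -ENOMEM;
}
```

The point of the split is that only the fault path has to pick a victim,
and the victim is simply whichever task faulted when the allocation
failed; syscall paths never kill anything.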

> > kills the right task most of the time (unlike the stock oom killer, which
> > works well only on desktops or developer machines). So it does a
> > much better job and it doesn't risk DoSing the box due to oom.
> 
> Mind explaining the OOM killing mechanism in more detail?

The current logic depends on alloc_pages returning NULL.

And whether alloc_pages returns NULL depends on how the
swapping/cache-shrinking goes.

The current code in mainline, by contrast, is even OOM-deadlock prone in
the VM: not only can the oom killer make a DoSable wrong selection of
the task on servers, it can even fail to detect an oom condition.
Another thing that can easily fool the current oom killer is mlocked
ram: the current oom killer will be fooled by the fact that there's
still some swap free, so it'll never kick in and the box will deadlock.
This can't happen with my tree since I don't trust the unreliable
statistical information we have from the kernel: we simply have no way
to (efficiently) calculate the number of freeable pages at any given
time, and as such the only reasonable thing we can do is to try to
swap/shrink a number of times and eventually give up (that is like
inefficiently counting the number of freeable pages a few times).
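
A minimal sketch of the "try a number of times, then give up" idea, as a
toy userspace model rather than the real shrink_cache or
try_to_free_pages_zone code. The struct, the function names, the retry
budget and the batch size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RECLAIM_PASSES 6   /* assumed retry budget, not from the patch */
#define RECLAIM_BATCH      32  /* assumed pages freed per pass at most */

/* Toy zone: free pages plus pages a swap/shrink pass could still free. */
struct zone_model {
    unsigned long free_pages;
    unsigned long freeable_pages;  /* the allocator never reads this directly */
};

/* One swap/shrink pass: frees up to 'batch' pages and returns the count. */
unsigned long reclaim_pass(struct zone_model *z, unsigned long batch)
{
    unsigned long n = z->freeable_pages < batch ? z->freeable_pages : batch;
    z->freeable_pages -= n;
    z->free_pages += n;
    return n;
}

/* Returns 1 if a page was allocated, 0 (think NULL) after giving up. */
int alloc_page_model(struct zone_model *z)
{
    int passes = 0;

    assert(z != NULL);
    while (z->free_pages == 0) {
        if (passes++ == MAX_RECLAIM_PASSES)
            return 0;              /* out of retries: report failure */
        reclaim_pass(z, RECLAIM_BATCH);
    }
    z->free_pages--;
    return 1;
}
```

Note that the failure decision never consults a statistic such as the
amount of free swap; it depends only on whether the reclaim passes
actually produced a free page, which is why a zone with nothing freeable
fails after a bounded number of passes instead of looping forever.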

> > Another DoS generated by the oom killer is that it'll try forever to
> > kill an UNINTERRUPTIBLE task hanging on an nfs server that is down, so it
> > hangs the whole box for an unlimited time.
> > 
> > I've an algorithm that will work, and that will provide very good
> > guarantees to kill the "best" task to make the machine usable again,
> > with the needed protection against the security DoSes, but it's in
> > no way similar to the current oom killer.
> 
> My concern is about how this oom killer works. 

This oom killer may make a worse selection of the task to kill on
desktops (the usual ssh now has a chance of being killed), but it fixes
the oom deadlocks and it won't do stupid things on servers should
netscape or whatever other app hit a userspace bug. So I have to prefer
it until I write a reliable algorithm for oom killing that won't fall
into DoSable corner cases so easily (mlock/nfs/database are the three
most common examples of where current mainline can fail; btw, lowmem
shortage is another very common DoS that the oom killer will never
notice. My tree doesn't deadlock [or at least not technically; in
practice it may look like a kernel deadlock even though syscalls return
-ENOMEM ;) ] during lowmem shortage on the 64G boxes).

Andrea