Date: Fri, 6 Dec 2002 15:57:19 +0100
From: Andrea Arcangeli <andrea@suse.de>
To: Andrew Morton <akpm@digeo.com>
Cc: William Lee Irwin III <wli@holomorphy.com>,
       Norman Gaywood <norm@turing.une.edu.au>, linux-kernel@vger.kernel.org
Subject: Re: Maybe a VM bug in 2.4.18-18 from RH 8.0?
Message-ID: <20021206145718.GL1567@dualathlon.random>
References: <3DEFF69F.481AB823@digeo.com> <20021206011733.GF1567@dualathlon.random> <3DEFFEAA.6B386051@digeo.com> <20021206014429.GI1567@dualathlon.random> <20021206021559.GK9882@holomorphy.com> <20021206022853.GJ1567@dualathlon.random> <20021206024140.GL9882@holomorphy.com> <3DF034BB.D5F863B5@digeo.com> <20021206054804.GK1567@dualathlon.random> <3DF049F9.6F83D13@digeo.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <3DF049F9.6F83D13@digeo.com>
User-Agent: Mutt/1.4i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5585
Lines: 119

On Thu, Dec 05, 2002 at 10:55:53PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> > 
> > the
> > algorithm is autotuned at boot and depends on the zone sizes, and it
> > applies to the dma zone too with respect to the normal zone, the highmem
> > case is just one of the cases that the fix for the general problem
> > resolves,
> 
> Linus's incremental min will protect ZONE_DMA in the same manner.

By how many bytes?

> 
> > and you're totally wrong saying that mlocking 700m on a 4G box
> > could kill it.
> 
> It is possible to mlock 700M of the normal zone on a 4G -aa kernel.
> I can't immediately think of anything apart from vma's which will
> make it fall over, but it will run like crap.

You're missing the whole point. The vmas are zone-normal users. You're
saying that you can run out of ZONE_NORMAL if you call
alloc_page(GFP_KERNEL) a few hundred thousand times. Yes, that's not
big news.

I'm saying you *can't* run out of zone-normal due to highmem
allocations, i.e. by running alloc_pages(GFP_HIGHMEM), period.

That's a completely different thing.

I thought you understood what the problem is. I'm not sure why you say
you can run out of zone-normal by running alloc_page(GFP_KERNEL) 100000
times: that has *nothing* to do with the bug we're discussing here. If
you don't want to run out of zone-normal after 100000 GFP_KERNEL page
allocations, the only option is to drop the zone-normal.

The bug we're discussing here is that w/o my fix you will run out of
zone-normal even though you haven't started allocating from zone-normal
yet and you still have 60G free in the highmem zone. This is what the
patch prevents, nothing more, nothing less.
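
To make the scenario concrete, here is a toy model of it. It is only a
sketch and not the code of the patch: the names toy_zone, toy_machine,
toy_alloc_page, toy_normal_survives and lower_zone_reserve are
hypothetical, zones are plain page counters, and pages that land in
zone-normal are assumed to stay pinned because nothing migrates them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not kernel code: two zones counted in pages. */
enum toy_zone_id { TOY_NORMAL = 0, TOY_HIGHMEM = 1, TOY_ZONES = 2 };

struct toy_zone {
	unsigned long free_pages;
	/* Pages that a request falling back from a higher zone must leave free. */
	unsigned long lower_zone_reserve;
};

struct toy_machine {
	struct toy_zone zone[TOY_ZONES];
};

/*
 * Allocate one page for a request whose highest usable zone is 'classzone'.
 * Zones are tried from the classzone downwards. In the classzone itself any
 * free page can be taken; in a lower zone the request must leave
 * lower_zone_reserve pages untouched.
 * Returns the index of the zone used, or -1 if the allocation fails.
 */
static int toy_alloc_page(struct toy_machine *m, int classzone)
{
	assert(classzone >= 0 && classzone < TOY_ZONES);
	for (int z = classzone; z >= 0; z--) {
		unsigned long min = (z == classzone) ? 0 : m->zone[z].lower_zone_reserve;

		if (m->zone[z].free_pages > min) {
			m->zone[z].free_pages--;
			return z;
		}
	}
	return -1;
}

/*
 * Pin highmem-capable pages until the allocator fails, then free only the
 * pages that landed in highmem. Returns true if a zone-normal request
 * (the GFP_KERNEL case) still succeeds afterwards.
 */
static bool toy_normal_survives(unsigned long normal_pages,
				unsigned long highmem_pages,
				unsigned long reserve)
{
	struct toy_machine m = {
		.zone = {
			[TOY_NORMAL]  = { normal_pages, reserve },
			[TOY_HIGHMEM] = { highmem_pages, 0 },
		},
	};
	unsigned long in_highmem = 0;
	int z;

	while ((z = toy_alloc_page(&m, TOY_HIGHMEM)) >= 0)
		if (z == TOY_HIGHMEM)
			in_highmem++;

	/* Highmem becomes free again; the pages pinned in zone-normal do not. */
	m.zone[TOY_HIGHMEM].free_pages += in_highmem;

	return toy_alloc_page(&m, TOY_NORMAL) >= 0;
}
```

With a reserve of 0 the zone-normal request fails even though all of
highmem is free again; with a non-zero reserve it succeeds, because the
highmem-capable requests were never allowed to take the last pages of
zone-normal.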

And it's not really specific to google, they were just unlucky to
trigger it. As said, just allocate plenty of pagetables (they are
highmem capable in my tree and 2.5) or swapoff -a, and you'll run into
the very same scenario that needs my fix, in any normal workload that
allocates more than a few hundred mbytes of ram.

And this is definitely a generic problem, not even specific to linux.
It's an OS wide design problem in balancing different zones that have
overlapping but not equivalent capabilities. It even applies to
zone-dma with respect to zone-normal and zone-highmem, and there's no
other fix for it at the moment.

Mainline fixes it in a very weak way: it reserves only a few megs,
which is not nearly enough if you need to allocate more than one more
inode etc... The lowmem reservation must allow the machine to do
interesting workloads for the whole uptime, not just defer the failure
by a few seconds. A few megs aren't nearly enough.

If interesting workloads need a huge zone-normal, just reserve more of
it at boot and they will work. If all of the zone-normal isn't enough,
you fall into a totally different problem, which is the existence of
zone-normal in the first place. It has nothing to do with this bug, and
you can fix that other problem only by dropping the zone-normal (of
course if you do that you will in turn fix this problem too, but the
problems are different).

The only alternative fix is to be able to migrate pagetables (1st level
only, pte) and all the other highmem capable allocations at runtime
(pagecache, shared memory etc..), which is clearly not possible in 2.5
and 2.4.

Once that is possible/implemented, my fix can go away and you can
simply migrate the highmem capable allocations from zone-normal to
highmem. That would be the only alternative, and also a dynamic/superior
fix, but it's not feasible at the moment, at the very least not in 2.4.
It would also have some performance implications: I'm sure lots of
people prefer to throw away 500M of ram in a 32G machine rather than
risk spending cpu time in memcopies, so it would not be *that*
superior, it would be inferior in some ways.

Reserving 500M of ram on a 32G machine doesn't really matter at all, so
the current fix is certainly the best thing we can do for 2.4, and for
2.5 too unless you want to implement highmem migration for all highmem
capable kernel objects (which would work fine too).

Also your possible multiplier via sysctl remains much inferior to my
fix, which is able to cleanly enforce classzone-point-of-view
watermarks (not fixed watermarks). You would need to change the
multiplier depending on the zone size and depending on the zone to make
it equivalent, so yes, you could implement it equivalently, but it
would be much less clean and readable than my current code (and harder
to tune than my current fix, which takes a kernel parameter at boot).
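
As an illustration of what a classzone-point-of-view watermark means, as
opposed to one fixed watermark per zone, here is a sketch. It is not the
code from my tree: toy_classzone_min, the base_min values and the
per-zone ratio array are assumptions made up for the example. The idea
it shows is only that the minimum a zone keeps free depends on which
classzone the request belongs to, and scales with the size of the
higher zones.

```c
#include <assert.h>

#define TOY_NR_ZONES 3	/* 0 = dma, 1 = normal, 2 = highmem */

/*
 * Return how many pages zone 'z' must keep free when it serves a request
 * whose classzone is 'c' (c >= z).
 *
 * size[]     - zone sizes in pages
 * base_min[] - hypothetical per-zone minimum for requests native to the zone
 * ratio[]    - hypothetical per-zone divisor: one page of zone z is reserved
 *              for every ratio[z] pages of the higher zones up to c
 */
static unsigned long toy_classzone_min(const unsigned long size[],
				       const unsigned long base_min[],
				       const unsigned long ratio[],
				       int z, int c)
{
	unsigned long higher = 0;

	assert(z >= 0 && z <= c && c < TOY_NR_ZONES);
	assert(ratio[z] > 0);

	/* Sum the sizes of the zones above z that this classzone can use. */
	for (int i = z + 1; i <= c; i++)
		higher += size[i];

	return base_min[z] + higher / ratio[z];
}
```

A request native to the zone (c == z) only sees base_min, while a
highmem request falling back into zone-normal sees a much larger
minimum that grows with the amount of highmem. A single multiplier
applied to base_min cannot express that unless it is changed per zone
and per zone size.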

> > 2.5 misses this important fix too btw.
> 
> It does not appear to be an important fix at all.  There have been

Well, if you ignore it, people can use my tree. I personally need that
fix for myself on big boxes, so I'm going to retain it in one form or
another (the form in mainline is too weak as said, and just adding a
multiplier would not be equivalent as said above).

> 2.5 has much bigger problems than this - radix_tree nodes and pte_chains
> in particular.

I'm not saying there aren't bigger problems in 2.5, but I don't classify
this one as a minor one. In fact it was a showstopper for a long time in
2.4 (one of the last ones) until I fixed it, and it still is a problem
because the 2.4 fix is way too weak (a few megs aren't enough to
guarantee that big workloads succeed).

Andrea