2001-12-14 11:52:51

by Ingo Molnar

Subject: [patch] mempool-2.5.1-D0


the attached patch against -pre11 fixes a possible deadlock pointed out by
Arjan: gfp_nowait needs to exclude __GFP_IO as well, to avoid some of the
deeper deadlocks where the first ->alloc() would generate IO.

Ingo

--- linux/mm/mempool.c.orig	Fri Dec 14 12:34:08 2001
+++ linux/mm/mempool.c	Fri Dec 14 12:35:53 2001
@@ -185,7 +185,7 @@
 	struct list_head *tmp;
 	int curr_nr;
 	DECLARE_WAITQUEUE(wait, current);
-	int gfp_nowait = gfp_mask & ~__GFP_WAIT;
+	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
 
 repeat_alloc:
 	element = pool->alloc(gfp_nowait, pool->pool_data);



2001-12-14 16:17:51

by Ingo Molnar

Subject: [patch] mempool-2.5.1-D1


there is another thinko in the mempool code, reported by Suparna
Bhattacharya. If mempool_alloc() is called from an IRQ context then we
return too early. The correct behavior is to allocate GFP_ATOMIC; if that
fails, we look at the pool and return an element, or return NULL if the
pool is empty.
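
For illustration, a rough sketch of that IRQ-context path (this is only a
sketch of the described behaviour, not the literal -D1 patch;
remove_element() is a hypothetical helper for popping an element off the
pool):

	/*
	 * Sketch only: in IRQ context we must not sleep, so try an atomic
	 * ->alloc(), fall back to the pool, and if both fail return NULL
	 * instead of waiting.  ('flags' and 'element' as in mempool_alloc().)
	 */
	if (in_interrupt()) {
		element = pool->alloc(GFP_ATOMIC, pool->pool_data);
		if (element)
			return element;
		spin_lock_irqsave(&pool->lock, flags);
		if (pool->curr_nr) {
			element = remove_element(pool);	/* hypothetical helper */
			spin_unlock_irqrestore(&pool->lock, flags);
			return element;
		}
		spin_unlock_irqrestore(&pool->lock, flags);
		return NULL;
	}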

Ingo


Attachments:
mempool-2.5.1-D1 (1.84 kB)

2001-12-14 17:16:37

by Ingo Molnar

Subject: [patch] mempool-2.5.1-D2


Andrew Morton noticed another bug: run_task_queue() should not be called
while the task is in TASK_UNINTERRUPTIBLE state. The attached patch fixes this.
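
For reference, a rough sketch of the kind of reordering this implies, assuming
the wait path looks roughly like the 2.5.1 mempool_alloc() (sketch only, not
the literal -D2 patch):

	/*
	 * Sketch only: kick the disk task queue while we are still
	 * TASK_RUNNING; only afterwards mark the task TASK_UNINTERRUPTIBLE
	 * and sleep until mempool_free() hands back an element.
	 */
	add_wait_queue_exclusive(&pool->wait, &wait);

	run_task_queue(&tq_disk);	/* may do real work, must not run in
					 * TASK_UNINTERRUPTIBLE */
	set_current_state(TASK_UNINTERRUPTIBLE);

	spin_lock_irqsave(&pool->lock, flags);
	if (!pool->curr_nr) {
		spin_unlock_irqrestore(&pool->lock, flags);
		schedule();
	} else
		spin_unlock_irqrestore(&pool->lock, flags);

	set_current_state(TASK_RUNNING);
	remove_wait_queue(&pool->wait, &wait);
	goto repeat_alloc;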

Ingo


Attachments:
mempool-2.5.1-D2 (1.98 kB)

2001-12-14 22:27:49

by Benjamin LaHaise

Subject: Re: [patch] mempool-2.5.1-D2

On Fri, Dec 14, 2001 at 08:13:49PM +0100, Ingo Molnar wrote:
>
> Andrew Morton noticed another bug: run_task_queue() should not be called
> while the task is in TASK_UNINTERRUPTIBLE state. The attached patch fixes this.

Btw, wouldn't reservation result in the same effect as these mempools for
significantly less code?

-ben

2001-12-15 04:43:55

by Ingo Molnar

Subject: Re: [patch] mempool-2.5.1-D2


On Fri, 14 Dec 2001, Benjamin LaHaise wrote:

> Btw, wouldn't reservation result in the same effect as these mempools
> for significantly less code?

exactly what kind of SLAB based reservation system do you have in mind?
(what interface, how would it work, etc.) Take a look at how bio.c,
highmem.c and raid1.c use the mempool mechanism; the main properties of
mempool cannot be expressed via SLAB reservation:

- mempool allows the use of non-SLAB allocators as the underlying
allocator. (eg. the highmem.c mempool uses the page allocator to alloc
lowmem pages. raid1.c uses 4 allocators: kmalloc(), page_alloc(),
bio_alloc() and mempool_alloc() of a different pool.)

- mempool_alloc(), if called from a process context, never fails. This
simplifies lowlevel IO code (which often must not fail) visibly.

- mempool allows the pooling of arbitrarily complex memory buffers, not
just a single SLAB buffer. (eg. the raid1.c resync pool uses a
combination of alloc_mempool(), bio_alloc() and multiple page_alloc()
buffers. This is also a performance enhancement for raid1.c.)

- mempool handles allocation in a more deadlock-avoidance-aware way than
a normal allocator would do (see the sketch below, after this list):

- first it ->alloc()'s atomically
- then it tries to take from the pool if the pool is at least
half full
- then it ->alloc()'s non-atomically
- then it takes from the pool if it's non-empty
- then it waits for pool elements to be freed

this makes for five different levels of allocation, ordered for
performance and blocking-avoidance, while still kicking the VM and
trying as hard as possible if there is a resource squeeze. In the
normal case we never touch the mempool spinlocks, we just call
->alloc() and if the core allocator does per-CPU caching then we'll
have the exact same high level of scalability as the underlying
allocator.

- mempool adds reservation without increasing the complexity of the
underlying allocators.
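
To make the five levels concrete, here is a pseudo-C sketch of the strategy
described above (locking and waitqueue handling are omitted; take_from_pool()
and wait_for_free() are hypothetical helpers, this is not the literal
mm/mempool.c code):

void *mempool_alloc_sketch(mempool_t *pool, int gfp_mask)
{
	int gfp_nowait = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
	void *element;

repeat:
	/* 1. ->alloc() atomically (no waiting, no IO) */
	element = pool->alloc(gfp_nowait, pool->pool_data);
	if (element)
		return element;

	/* 2. take from the pool if it is at least half full */
	if (pool->curr_nr >= pool->min_nr / 2)
		return take_from_pool(pool);

	/* 3. ->alloc() non-atomically (may block, may do IO) */
	element = pool->alloc(gfp_mask, pool->pool_data);
	if (element)
		return element;

	/* 4. take from the pool if it is non-empty */
	if (pool->curr_nr)
		return take_from_pool(pool);

	/* 5. FIFO-wait until mempool_free() returns an element, then retry */
	wait_for_free(pool);
	goto repeat;
}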

Ingo

2001-12-15 05:29:47

by Benjamin LaHaise

Subject: Re: [patch] mempool-2.5.1-D2

On Sat, Dec 15, 2001 at 07:41:12AM +0100, Ingo Molnar wrote:
> exactly what kind of SLAB based reservation system do you have in mind?
> (what interface, how would it work, etc.) Take a look at how bio.c,
> highmem.c and raid1.c use the mempool mechanism; the main properties of
> mempool cannot be expressed via SLAB reservation:
>
> - mempool allows the use of non-SLAB allocators as the underlying
> allocator. (eg. the highmem.c mempool uses the page allocator to alloc
> lowmem pages. raid1.c uses 4 allocators: kmalloc(), page_alloc(),
> bio_alloc() and mempool_alloc() of a different pool.)

That's of dubious value. Personally, I think there should be two
allocators: slab and page allocs. Anything beyond that seems to be
duplicating functionality.

> - mempool_alloc(), if called from a process context, never fails. This
> simplifies lowlevel IO code (which often must not fail) visibly.

Arguably, the same should be possible with normal allocations. Btw, if
you looked at the page level reservations, it did so but still left the
details of how that memory is reserved up to the vm. The idea behind that
is to reserve memory, but still allow clean and immediately reclaimable
pages to populate the memory until it is allocated (many reservations will
never be touched as they're worst-case journal/whatnot protection). Plus,
with a reservation in hand, an allocation from that reservation will
never fail.
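
Purely as an illustration of what such an interface might look like (these
names are made up for this sketch; none of this is an existing kernel API):

/*
 * Hypothetical reservation interface -- illustration only.  A reservation
 * pins "credit" against the VM rather than specific pages; clean,
 * reclaimable pages may keep occupying the memory until the credit is
 * actually drawn down.
 */
struct mem_reservation;

struct mem_reservation *mem_reserve(int nr_pages);		/* may block or fail */
struct page *alloc_page_reserved(struct mem_reservation *r);	/* never fails */
void mem_unreserve(struct mem_reservation *r);			/* hand credit back */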

> - mempool handles allocation in a more deadlock-avoidance-aware way than
> a normal allocator would do:
>
> - first it ->alloc()'s atomically

Great. Function calls through pointers are really not a good idea on
modern cpus.

> - then it tries to take from the pool if the pool is at least
> half full
> - then it ->alloc()'s non-atomically
> - then it takes from the pool if it's non-empty
> - then it waits for pool elements to be freed

Oh dear. Another set of vm logic that has to be kept in sync with the
behaviour of the slab, alloc_pages and try_to_free_pages. We're already
failing to keep alloc_pages deadlock free; how can you be certain that
this arbitrary "half full pool" condition is not going to cause deadlocks
for $random_arbitrary_driver?

> this makes for five different levels of allocation, ordered for
> performance and blocking-avoidance, while still kicking the VM and
> trying as hard as possible if there is a resource squeeze. In the
> normal case we never touch the mempool spinlocks, we just call
> ->alloc() and if the core allocator does per-CPU caching then we'll
> have the exact same high level of scalability as the underlying
> allocator.

Again, this is duplicating functionality that doesn't need to be
duplicated. The one additional branch for the uncommon case that
reservations add is far, far cheaper and easier to understand.

> - mempool adds reservation without increasing the complexity of the
> underlying allocators.

This is where my basic disagreement with the approach comes from. As
I see it, all of the logic that mempools adds is already present in the
current system (or at the very least should be). To give you some insight
to how I think reservations should work and how they can simplify code in
the current allocators, take the case of an ordinary memory allocation of
a single page. Quite simply, if there are no immediately free pages, we
need to wait for another page to be returned to the free pool (this is
identical to the logic you added in mempool that prevents a pool from
failing an allocation). Right now, memory allocations can fail because
we allow ourselves to grossly overcommit memory usage. That you're adding
mempool to patch over that behaviour is *wrong*, imo. The correct way to
fix this is to make the underlying allocator behave properly: the system
has enough information at the time of the initial allocation to
deterministically say "yes, the vm will be able to allocate this page" or
"no, i have to wait until another user frees up memory". Yes, you can
argue that we don't currently keep all the necessary statistics on hand
to make this determination, but that's a small matter of programming.
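
Purely as a hypothetical illustration of that "deterministic yes/no" idea
(nr_free_pages() exists; the other two counters are made-up names):

/*
 * Hypothetical admission check -- illustration only, not existing kernel
 * code.  Free pages plus cleanly reclaimable pages, minus what has already
 * been promised to reservations, says up front whether an allocation can
 * succeed or must wait.
 */
static int can_allocate(int nr_pages)
{
	unsigned long available = nr_free_pages()
				+ nr_clean_reclaimable_pages()	/* made up */
				- nr_reserved_pages();		/* made up */

	return available >= nr_pages;
}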

The above looks like a bit more of a rant than I'd meant to write, but
I think the current allocator is broken and in need of fixing, and once
fixed there should be no need for yet another layer on top of it.

-ben
--
Fish.

2001-12-15 17:55:56

by Stephan von Krawczynski

Subject: Re: [patch] mempool-2.5.1-D2

On Sat, 15 Dec 2001 07:41:12 +0100 (CET)
Ingo Molnar <[email protected]> wrote:

> - mempool_alloc(), if called from a process context, never fails. This
> simplifies lowlevel IO code (which often must not fail) visibly.

Uh, do you trust your own word? This already sounds like an upcoming deadlock
to me _now_. I saw a lot of trial-and-error during the last month regarding
exactly this point. There have been VM-days where allocs didn't really fail
(set with the right flags), but didn't come back either. And exactly this was
the reason why the stuff was _broken_. Obviously no process can wait an
indefinitely long time to get its alloc fulfilled. And there are conditions
under heavy load where this cannot be met, and you will see a complete stall.

In fact I pretty much agree with Ben's thesis that the current allocator has a
problem. I would not call it broken, but it cannot present the ad-hoc answer to
one (_the_) important question: what is the correct cache page to drop _now_
when resources get low and I have to successfully return an allocation?
This is _the_ central issue that must be solved in a VM with such tremendous
page caching going on as we have now. And what is really important is that the
answer must be presentable ad-hoc. If you have to loop around, wait for I/O or
whatever, then the basic design is already sub-optimal.
Looking at your mempool-ideas one cannot fight the impression that you try to
"patch" around a deficiency of the current code. This cannot be the right thing
to do.

Regards,
Stephan


2001-12-15 20:20:45

by Ingo Molnar

Subject: Re: [patch] mempool-2.5.1-D2


On Sat, 15 Dec 2001, Stephan von Krawczynski wrote:

> > - mempool_alloc(), if called from a process context, never fails. This
> > simplifies lowlevel IO code (which often must not fail) visibly.
>
> Uh, do you trust your own word? This already sounds like an upcoming
> deadlock to me _now_. [...]

please check out how it works. It's not done by 'loop forever until
some allocation succeeds'. It's done by FIFO queueing for pool elements
that are guaranteed to be freed after some reasonable timeout. (and there
is no other freeing path that might leak the elements.)
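
The free side is what guarantees forward progress for those FIFO waiters; a
simplified sketch of the idea (not the literal mm/mempool.c, add_to_pool() is
a hypothetical helper and locking is omitted):

/*
 * Sketch of the mempool_free() idea: if the pool is below its nominal
 * size, the element goes back into the pool and a FIFO waiter is woken;
 * otherwise it is handed back to the underlying allocator.
 */
void mempool_free_sketch(void *element, mempool_t *pool)
{
	if (pool->curr_nr < pool->min_nr) {
		add_to_pool(pool, element);	/* hypothetical helper */
		wake_up(&pool->wait);		/* wakes the oldest exclusive waiter */
		return;
	}
	pool->free(element, pool->pool_data);
}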

> [...] I saw a lot of trial-and-error during the last month regarding
> exactly this point. There have been VM-days where allocs didn't really
> fail (set with the right flags), but didn't come back either. [...]

hm, iirc, the code was just re-trying the allocation infinitely (while
sleeping on kswapd_wait).

> [...] And exactly this was the reason why the stuff was _broken_.
> Obviously no process can wait an indefinitely long time to get its
> alloc fulfilled. And there are conditions under heavy load where this
> cannot be met, and you will see a complete stall.

this is the problem with doing this in the (current) page allocator:
allocation and freeing of pages is done by every process, so the processes
that really need those pages for deadlock avoidance are starved. Identifying
reserved pools and creating closed circuits of allocation/freeing
relations solves this problem - 'outsiders' cannot 'steal' from the
reserve. In addition, creating pools of composite structures helps as well
in cases where multiple allocations are needed to start a guaranteed
freeing operation.

mempool moves deadlock avoidance to a different, and explicit level. If
everything uses mempools then the normal allocators (the page allocator)
can remove all their reserved pools and deadlock-avoidance code.

> [...] Looking at your mempool-ideas one cannot fight the impression
> that you try to "patch" around a deficiency of the current code. This
> cannot be the right thing to do.

to the contrary - i'm not 'patching around' any deficiency, i'm removing
the need to put deadlock avoidance into the page allocator. But in this
transitional period the 'old code' still stays around for a while.
If you look at Ben's patch you'll see the same kind of duality - until a
mechanism is fully used, things like that are unavoidable.

Ingo

2001-12-17 16:19:59

by Stephan von Krawczynski

Subject: Re: [patch] mempool-2.5.1-D2

On Sat, 15 Dec 2001 23:17:56 +0100 (CET)
Ingo Molnar <[email protected]> wrote:

>
> On Sat, 15 Dec 2001, Stephan von Krawczynski wrote:
>
> > > - mempool_alloc(), if called from a process context, never fails. This
> > > simplifies lowlevel IO code (which often must not fail) visibly.
> >
> > Uh, do you trust your own word? This already sounds like an upcoming
> > deadlock to me _now_. [...]
>
> please check it out how it works. It's not done by 'loop forever until
> some allocation succeeds'. It's done by FIFO queueing for pool elements
> that are guaranteed to be freed after some reasonable timeout. (and there
> is no other freeing path that might leak the elements.)

This is like solving a problem by not looking at it. You will obviously _not_
shoot down allocated and still used bios, no matter how long they are going to
take. So your fixed size pool will run out in certain (maybe weird) conditions.
If you cannot resize (alloc additional mem from standard VM) you are just dead.

Look at it from a different point of view: it's basically all the same.
Standard VM has a limited resource and tries to give it away in an intelligent
way. Mempool does the same thing - for a smaller environment. But that is per
se no gain. And just as Andrea pointed out, the not-used part of the resources
is just plain wasted - though he thinks this is _good_ because it is simpler in
design and implementation.

On the other hand, you could just do it vice versa: don't make the mempools,
make a cache-pool. VM handles memory, and we use a fixed size (but resizeable)
mem block as a pure pool for the page cache. Every page that is somehow locked
down (iow _used_, and not simply cached) is pulled out of the cache-pool. The
cache-pool ages (hello rik :-), but stays the same size. You end up with
_lots_ of _free_ mem under normal loads and acceptable performance. This is
not a good design, but it doesn't need to answer the question of which pages
to expel under pressure, because by definition there is nothing to expel/drop
in this design. When mem gets low it really _is_ low, because your
applications ate it all up.

The current design cannot answer this question correctly, because I must not
be able to see allocation failures in a box with 1 GB RAM, very few running
applications - and a huge page cache. But they are there. So there is a
problem, probably in the implementation of a working design. The answer
"drivers must be able to cope with failing allocs" is WRONG WRONG WRONG. They
should not oops, ok, but they cannot stand such a situation, you will always
lose something (probably data).

All your good points about mempool usage come down to the simple fact that
there is a memory reserve that is not touched by the page cache. There are
about 29 ways to achieve this same goal - and most of them are a lot more
straightforward and require fewer changes in the rest of the kernel.

Please _solve_ the problem, do not _spread_ it.

Regards,
Stephan

2001-12-17 18:58:48

by Ingo Molnar

Subject: Re: [patch] mempool-2.5.1-D2


On Mon, 17 Dec 2001, Stephan von Krawczynski wrote:

> [...] You will obviously _not_ shoot down allocated and still used
> bios, no matter how long they are going to take. So your fixed size
> pool will run out in certain (maybe weird) conditions. If you cannot
> resize (alloc additional mem from standard VM) you are just dead.

sure, the pool will run out under heavy VM load. Will it stay empty
forever? Nope, because all mempool users are *required* to deallocate the
buffer after some (reasonable) timeout. (such as IO latency.) This is
pretty much by definition. (Sure there might be weird cases like IO
failure timeouts, but sooner or later the buffer will be returned, and it
will be reused.)

(by the way, this is true for every other reservation solution as well,
just look at the patches. You won't resize on the fly whenever there is a
shortage - that's the problem with shortages, there just won't be more RAM.
If anyone uses reserved pools and doesn't release those buffers then we are
deadlocked. Memory reserves *must not* be used as a kmalloc pool. Doing
that can be considered an advanced form of a 'memory leak'.)

(and there is mempool_resize() if some aspect of the device is changed.)

Ingo

2001-12-17 20:44:41

by Benjamin LaHaise

Subject: Re: [patch] mempool-2.5.1-D2

On Mon, Dec 17, 2001 at 09:56:07PM +0100, Ingo Molnar wrote:
> sure, the pool will run out under heavy VM load. Will it stay empty
> forever? Nope, because all mempool users are *required* to deallocate the
> buffer after some (reasonable) timeout. (such as IO latency.) This is
> pretty much by definition. (Sure there might be weird cases like IO
> failure timeouts, but sooner or later the buffer will be returned, and it
> will be reused.)

loop. deadlock. kmap. deadlock. You've got a lot of code to fix before
this statement is remotely true.

> (by the way, this is true for every other reservation solution as well,
> just look at the patches. You won't resize on the fly whenever there is a
> shortage - that's the problem with shortages, there just won't be more RAM.
> If anyone uses reserved pools and doesn't release those buffers then we are
> deadlocked. Memory reserves *must not* be used as a kmalloc pool. Doing
> that can be considered an advanced form of a 'memory leak'.)

Absolutely. That's why I think we should at least do some work on the design
of the code, so that we have an idea of what the pitfalls are, plus
documentation, before putting it into the kernel.

-ben
--
Fish.

2001-12-17 23:57:35

by Stephan von Krawczynski

Subject: Re: [patch] mempool-2.5.1-D2

>
> On Mon, 17 Dec 2001, Stephan von Krawczynski wrote:
>
> > [...] You will obviously _not_ shoot down allocated and still used
> > bios, no matter how long they are going to take. So your fixed size
> > pool will run out in certain (maybe weird) conditions. If you cannot
> > resize (alloc additional mem from standard VM) you are just dead.
>
> sure, the pool will run out under heavy VM load. Will it stay empty
> forever? Nope, because all mempool users are *required* to deallocate the
> buffer after some (reasonable) timeout. (such as IO latency.) This is
> pretty much by definition. (Sure there might be weird cases like IO
> failure timeouts, but sooner or later the buffer will be returned, and it
> will be reused.)

Hm, and where is the real-world-difference to standard VM? I mean
today your bad-ass application gets shot down by L's oom-killer and
your VM will "refill". So you're not going to die for long in the
current situation either.
I have yet to see the brilliance in mempools. I mean, for sure I can
imagine systems that are going to like it (e.g. embedded) a _lot_. But
these are far off the "standard" system profile.
I have asked this several times now, and I will continue to: where is
the VM _design_ guru who explains the designed short path to drop page
cache when in need of allocatable mem, in a system with aggressive
caching like 2.4? This _must_ exist. If it does not, the whole issue is
broken, and it is obvious that nobody will ever find an acceptable
implementation.
I have turned this problem around about a hundred times now, and as far
as I can see everything comes down to the simple fact that the VM has to
_know_ the difference between an only-cached page and a _really-used_
one. And I do agree with Rik that the only-cached pages need an aging
algorithm, probably a most-simple approach (could be list-ordering).
This should answer the question: who's dropped next?
On the other hand you have aging in the used-pages for finding out
who's swapped out next. BUT I would say that swapping should only
happen when only-cached pages are down to a minimum level (like 5% of
memtotal).
Forgive my simplistic approach, where are the guys to shoot me?
And where the hell is the need for mempool in this rough design idea?

Regards,
Stephan

2001-12-21 19:15:42

by Pavel Machek

Subject: Re: [patch] mempool-2.5.1-D2

Hi!

> - mempool_alloc(), if called from a process context, never fails. This
> simplifies lowlevel IO code (which often must not fail) visibly.

Really? I do not see how you can guarantee this on a machine with a finite
amount of memory.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

2001-12-18 14:45:47

by Ingo Molnar

Subject: Re: [patch] mempool-2.5.1-D2


On Tue, 18 Dec 2001, Stephan von Krawczynski wrote:

> Hm, and where is the real-world-difference to standard VM? I mean
> today your bad-ass application gets shot down by L's oom-killer and
> your VM will "refill". So you're not going to die for long in the
> current situation either. [...]

Think of the following trivial case: 'the whole system is full of dirty
pagecache pages, the rest is kmalloc()ed somewhere'. Nothing to oom,
nothing to kill, plenty of swap left and no RAM. And besides, in this
situation, oom is the worst possible answer, the application getting
oom-ed is not at fault in this case.

Ingo

2001-12-18 15:36:35

by Stephan von Krawczynski

Subject: Re: [patch] mempool-2.5.1-D2

On Tue, 18 Dec 2001 17:43:01 +0100 (CET)
Ingo Molnar <[email protected]> wrote:

>
> On Tue, 18 Dec 2001, Stephan von Krawczynski wrote:
>
> > Hm, and where is the real-world-difference to standard VM? I mean
> > today your bad-ass application gets shot down by L's oom-killer and
> > your VM will "refill". So you're not going to die for long in the
> > current situation either. [...]
>
> Think of the following trivial case: 'the whole system is full of dirty
> pagecache pages, the rest is kmalloc()ed somewhere'. Nothing to oom,
> nothing to kill, plenty of swap left and no RAM. And besides, in this
> situation, oom is the worst possible answer, the application getting
> oom-ed is not at fault in this case.

You are right that this is a broken situation. Now your answer is a _specific_
patch: you say "let the nice people using mempools survive (and fuck the rest
(implicit))". You do not solve the problem, you drive _around_ it for _certain_
VM users (the mempool guys). This is _not_ the correct answer to the situation.
In my eyes "correct" would mean asking: what is the reason for the pagecache
(dirty) being able to eat up all my mem (which may be _plenty_)? Something is
wrong with the design then; remember the basics (very meta, this one :-):

object vm {
	pool with free mem
	pool with cached-only mem
	pool with dirty-cache mem (your naming)
	...
	func cached_to_dirty_mem
	func dirty_to_cached_mem
	func drop_cached_mem
	func alloc_cached_mem
	...
}

Your problem obviously is that the function moving pages from dirty_to_cached
(meaning not dirty) either does not exist or does not work, and that's why your
situation is broken (there is nothing that can be dropped, meaning the pool
with cached-only mem is empty and cannot be refilled from dirty-cache). If it
does not exist, then the basic design relies on _external_ undefines (VM users
dropping the dirty pages themselves at undefined times and in undefined order)
and is therefore provably incomplete and broken for all management cases with
limited resources (like mem). You cannot really intend to save the broken
state by saying "let it break, but not for mempool". As long as you cannot
create a closed circle in VM you will break.

But even inside mempools you fight the same problem. You have to rely on
mempool users giving back the resources early enough to handle the new
requests. If the timed circle doesn't work out as expected, you increase your
mempool (resize). It's all the same. This is exactly like a VM situation
without (or without aggressive) page-cache: you upgrade RAM if you run out.
This analogy brings up the real and simple cure inside your design: you made
the page-cache go away from the mempools. This would obviously cure VM, too.
But it would seriously hit performance, so it's a no-no.

A really working example of limited-resource management inside Linux is the
scheduler. There you have "users" (processes) that work or not, and when there
is "no work" (e.g. idle), you may very well run niced processes (in
simplification, _one_) eating up the "rest" of the resources to make something
out of them. But if a "real" user comes in and wants resources, the nice one
will go away. It is a complete design.

In VM the page-cache should be a special-case "nice" user. It can use all
available resources, but has to vanish if someone really needs them. And this
is currently _not_ solved; it is incomplete, and therefore contains big black
holes, like the situation you described.

Regards,
Stephan