2001-12-15 18:13:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: mempool design


On Sat, 15 Dec 2001, Rik van Riel wrote:

> > such scenarios can only be solved by using/creating independent pools,
> > and/or by using 'composite' pools like raid1.c does. One common
>
> OK, you've convinced me ...
> ... of the fact that you're reinventing Ben's reservation
> mechanism, poorly.

i have to admit that i did not know of Ben's patch until today. I must have
missed it when he released it, and apparently there were no followup
releases(?). I now understand why Ben had to flame me. Anyway, here is his
patch:

http://lwn.net/2001/0531/a/bcrl-reservation.php3

With all respect, even if i had read it before, i'd have done mempool.c
the same way as it is now. (but i'd obviously have Cc:-ed Ben on it during
its development.) I'd like to sum up Ben's patch (Ben please correct me if
i misrepresent your patch in any way):

the patch adds a reservation feature to the page allocator. It defines a
'reservation structure', which causes the true free pages count of
particular page zones to be decreased artificially, thus creating a
virtual reserve of pages. These reservation structures can be assigned to
processes on a codepath basis. Eg. on IRQ entry the current process gets
assigned the IRQ-atomic reservation - and any original reservation is
restored on IRQ-exit. On swapping-code entry, arbitrary processes get the
swapping reservation. kswapd, kupdated and bdflush have their own,
permanent reservations. Freeing into the reserved pools is done by linking
the reservation structure to its "home-zone", which the __free_pages()
code polls and refills. One process has a single active reservation
structure to allocate from.

this approach IMO does not address some fundamental issues:

- Allocations might still fail with NULL. With mempool, allocations in
process contexts are guaranteed to always succeed.

- it does not allow the reservation of higher order allocations, which can
be especially important given the poor higher-order behavior of the page
allocator.

- the reservation patch does not offer deadlock avoidance in critical code
paths with complex allocation patterns (see the examples from my
previous email). Just having separate pools of pages is not enough.

- minor nit #1: reservations are tied to zones, while mempool can take
from different zones, as long as the zones are compatible.

- minor nit #2: reservations add overhead to critical code areas
  (and yes, besides oom-only code, the fast-path is touched as well) such
  as rmqueue() and __free_pages(). Mempool does not add overhead to the
  underlying allocator(s).

- perhaps there is a more advanced patch available (Ben?), but right now i
cannot see how the SLAB allocator can have the same reservation concept
added, without excessive code duplication.

Rik, it would be nice if you could provide a few technical arguments that
underscore your point. If i'm wrong then i'd like to be proven wrong.

Ingo


2001-12-15 18:47:32

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: mempool design

On Sat, Dec 15, 2001 at 08:40:19PM +0100, Ingo Molnar wrote:
> With all respect, even if i had read it before, i'd have done mempool.c
> the same way as it is now. (but i'd obviously have Cc:-ed Ben on it during
> its development.) I'd like to sum up Ben's patch (Ben please correct me if
> i misrepresent your patch in any way):

You're making the assumption that an incomplete patch is useless and
has no design principles behind it. What I disagree with is the design
of mempool, not the implementation. The design for reservations is to
use enforced accounting limits to achieve the effect of separate memory
pools. Mempool's design is to build separate pools on top of existing
pools of memory. Can't you see the obvious duplication that implies?

The first implementation of the reservation patch is full of bogosities,
I'm the first one to admit that. But am I going to go off and write an
entirely new patch that fixes everything and gets the design right to
replace mempool? Not with the current rate of patches being ignored.

-ben

2001-12-15 20:21:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: mempool design


On Sat, 15 Dec 2001, Benjamin LaHaise wrote:

> [...] The design for reservations is to use enforced accounting limits
> to achieve the effect of separate memory pools. [...]

how is this going to handle higher-order pools? How is this going to
handle the need for composite allocations?

I think putting in reservation for higher-order pages is going to make
some parts of page_alloc.c *really* complex, and this complexity i don't
think we need.

> [...] Mempool's design is to build separate pools on top of existing
> pools of memory. Can't you see the obvious duplication that implies?

no. Mempool's design is to build preallocated, reserved,
guaranteed-allocation pools on top of simpler allocators. Simpler
allocators will try reasonably hard to get something allocated, but might
fail as well. The majority of allocations within the kernel have no
deadlock relevance at all. If we allocate a new file structure, or create
a new socket, or are trying to page in overcommitted memory then we can
return with -ENOMEM (or OOM) just fine. Allocators are simpler and faster
without built-in deadlock avoidance and 'reserves'.

so the difference in design is that you are trying to add reservations as
a feature of the allocators themselves, while i'm trying to add it as a
generic, allocator-independent subsystem, which also solved a number of
other problems we always had in the IO code. Under this design, the 'pure'
allocators themselves have no concept of reserved pools at all, so there
is no duplication in concepts. (and no duplication of code.)

so the fundamental question is, should reservation be a core functionality
of allocators, or should it be a separate subsystem. *If* it can be put
into the core allocators (page allocator, SLAB) in a way that gives us all
the features that mempool gives us today, then i think i'd like that
approach.
But i cannot really see how the composite allocation thing can be done via
reservations.

Ingo

2001-12-17 15:06:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: mempool design

On Sat, Dec 15, 2001 at 11:18:33PM +0100, Ingo Molnar wrote:
>
> On Sat, 15 Dec 2001, Benjamin LaHaise wrote:
>
> > [...] The design for reservations is to use enforced accounting limits
> > to achieve the effect of separate memory pools. [...]
>
> how is this going to handle higher-order pools? How is this going to
> handle the need for composite allocations?
>
> I think putting in reservation for higher-order pages is going to make
> some parts of page_alloc.c *really* complex, and this complexity i don't
> think we need.
>
> > [...] Mempool's design is to build separate pools on top of existing
> > pools of memory. Can't you see the obvious duplication that implies?
>
> no. Mempool's design is to build preallocated, reserved,
> guaranteed-allocation pools on top of simpler allocators. Simpler
> allocators will try reasonably hard to get something allocated, but might
> fail as well. The majority of allocations within the kernel have no
> deadlock relevance at all. If we allocate a new file structure, or create
> a new socket, or are trying to page in overcommitted memory then we can
> return with -ENOMEM (or OOM) just fine. Allocators are simpler and faster
> without built-in deadlock avoidance and 'reserves'.
>
> so the difference in design is that you are trying to add reservations as
> a feature of the allocators themselves, while i'm trying to add it as a
> generic, allocator-independent subsystem, which also solved a number of
> other problems we always had in the IO code. Under this design, the 'pure'
> allocators themselves have no concept of reserved pools at all, so there
> is no duplication in concepts. (and no duplication of code.)
>
> so the fundamental question is, should reservation be a core functionality
> of allocators, or should it be a separate subsystem. *If* it can be put
> into the core allocators (page allocator, SLAB) in a way that gives us all
> the features that mempool gives us today, then i think i'd like that
> approach. But i cannot really see how the composite allocation thing can
> be done via reservations.

This whole long thread can be summed up in two points:

1 mempool reserved memory is "wasted" i.e. not usable as cache
2 if the mempool code is moved inside the memory balancing of the
VM we could use this memory as clean, atomically-freeable cache

however option 2 is quite complex: think of when somebody mmaps the page
and we find_lock it etc... we cannot "lock" a reserved page, or it would be
unfreeable, at least unless we're sure this "lock" will go away without
us deadlocking on it while waiting.

so in short solution 1 is much simpler and much more obviously correct,
and the only disadvantage is that it reduces the amount of clean cache
that could potentially be used by the kernel.

If implementation details and code complexity were our last design
priority, solution 2 advocated by Ben, Rik and SCT would be obviously
superior.

At the moment in 2.5 and also in 2.4 we use the "mempool outside VM"
logic just because we can keep it under control without being killed by
the huge complexity of the implementation details with the locking of
clean cache, nesting into the vm etc... Of course I'm considering a
correct implementation of it, not a hack where cache can be mlocked and
the kernel deadlocks because the reserved memory isn't freeable anymore.

Personally I'm more relaxed with the mempool approach because it reduces
the complexity by an order of magnitude, it abstracts the thing without
making the memory balancing more complex and it definitely solves the
problem (if used correctly, i.e. not two alloc_bio in a row from the same
pool from multiple tasks at the same time, as pointed out by Ingo).

If somebody wants that 1% of ram back he can buy another dimm of ram and
plug it into his hardware. I mean, such a 1% of ram lost is something that
can be solved by throwing a few euros at the hardware (and people buy
gigabyte boxes anyway, so they don't need 100% of the ram); the
other complexity cannot be solved with a few euros, it can only be
solved with lots of braincycles and it would be maintenance work as
well. Abstraction and layering definitely help cut down the
complexity of the code.

Andrea

2001-12-17 15:24:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: mempool design


On Mon, 17 Dec 2001, Andrea Arcangeli wrote:

> This whole long thread can be summed up in two points:
>
> 1 mempool reserved memory is "wasted" i.e. not usable as cache

reservations, as in Ben's published (i know, incomplete) implementation,
are 'wasted' as well.

> 2 if the mempool code is moved inside the memory balancing of the
> VM we could use this memory as clean, atomically-freeable cache

i agree - i proposed something like this to SCT about 3-4 years ago (back
when the buffer-cache was still reentrant), and it's still not
implemented. And i'm not betting on it being done soon. Making the
pagecache structures IRQ-safe looks like the same kind of trouble we had
with the IRQ-reentrant buffer-cache. It can be done (in fact it's quite
easy to do the initial bits), but it can bite us in multiple ways. And in
the real deadlock scenarios we have no clean pages anyway.

i personally get the shivers from any global counters where being off by 1
in 1% of the cases will bite us only in 1 out of 10000 systems.

> Personally I'm more relaxed with the mempool approach because it
> reduces the complexity by an order of magnitude, it abstracts the
> thing without making the memory balancing more complex and it
> definitely solves the problem (if used correctly, i.e. not two alloc_bio
> in a row from the same pool from multiple tasks at the same time, as
> pointed out by Ingo).

yep - and as your VM rewrite has proven as well, reducing complexity
and interdependencies within the VM is the top priority at the moment and
brings the most benefits. And the amount of reserved (lost) pool-pages
does not scale up with more RAM in the system - it scales up with more
devices (and more mounted filesystems) in the system. And we have
per-device RAM footprint anyway. So it's not like 'struct page'.

Ingo

2001-12-17 15:45:18

by Victor Yodaiken

[permalink] [raw]
Subject: Re: mempool design

On Mon, Dec 17, 2001 at 04:04:26PM +0100, Andrea Arcangeli wrote:
> If somebody wants that 1% of ram back he can buy another dimm of ram and
> plug it into his hardware. I mean, such a 1% of ram lost is something that
> can be solved by throwing a few euros at the hardware (and people buy
> gigabyte boxes anyway, so they don't need 100% of the ram); the
> other complexity cannot be solved with a few euros, it can only be
> solved with lots of braincycles and it would be maintenance work as
> well. Abstraction and layering definitely help cut down the
> complexity of the code.

I agree with all your arguments up to here. But being able to run Linux
in 4Meg or even 8M is important to a very large class of applications.
Even if you are concerned mostly about bigger systems, making sure NT
remains at a serious disadvantage in the embedded boxes is key, because
MS will certainly hope to use control of SOHO routers, set-top boxes
etc. to set "standards" that will improve their competitiveness in desktop
and beyond. It would be a delicious irony if MS were able to re-use
against Linux the "first control the low end" strategy that allowed them to
vaporize the old-line UNIXes, but irony is not as satisfying as winning.

2001-12-17 15:59:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: mempool design

On Mon, Dec 17, 2001 at 06:21:53PM +0100, Ingo Molnar wrote:
>
> On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
>
> > This whole long thread can be summed up in two points:
> >
> > 1 mempool reserved memory is "wasted" i.e. not usable as cache
>
> reservations, as in Ben's published (i know, incomplete) implementation,
> are 'wasted' as well.

yes, I was referring only to his long term design arguments.

> > 2 if the mempool code is moved inside the memory balancing of the
> > VM we could use this memory as clean, atomically-freeable cache
>
> i agree - i proposed something like this to SCT about 3-4 years ago (back
> when the buffer-cache was still reentrant), and it's still not
> implemented. And i'm not betting on it being done soon. Making the
> pagecache structures IRQ-safe looks like the same kind of trouble we had
> with the IRQ-reentrant buffer-cache. It can be done (in fact it's quite
> easy to do the initial bits), but it can bite us in multiple ways. And in
> the real deadlock scenarios we have no clean pages anyway.

in theory those pages should be reserved, so it would be the same as
the pages in the mempool, but while they do nothing they could hold some
cache data; but for example they couldn't be mapped in any
address space etc... at least unless we're able to atomically unmap
pages and flush the tlb on all cpus etc.. :) it would be a mess and it's not
a coincidence that Ben's first implementation wasn't taking advantage of
it and that in 3-4 years it's still not there yet :). Plus as you
mentioned it would add the local_save_irq overhead to the common path as
well, to be able to do things from irqs (which I didn't consider in
the previous email). That would hurt performance.

> i personally get the shivers from any global counters where being off by 1
> in 1% of the cases will bite us only in 1 out of 10000 systems.

yes, and as said it's a problem that doesn't affect performance or
scalability, nor does it waste a _percentage_ of ram; it only wastes a
_fixed_ amount of ram.

> > Personally I'm more relaxed with the mempool approach because it
> > reduces the complexity by an order of magnitude, it abstracts the
> > thing without making the memory balancing more complex and it
> > definitely solves the problem (if used correctly, i.e. not two alloc_bio
> > in a row from the same pool from multiple tasks at the same time, as
> > pointed out by Ingo).
>
> yep - and as your VM rewrite has proven it as well, reducing complexity
> and interdependencies within the VM is the top priority at the moment and
> brings the most benefits. And the amount of reserved (lost) pool-pages
> does not scale up with more RAM in the system - it scales up with more
> devices (and more mounted filesystems) in the system. And we have
> per-device RAM footprint anyway. So it's not like 'struct page'.

100% agreed (as said above too :).

Andrea

2001-12-17 16:11:19

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: mempool design

On Mon, Dec 17, 2001 at 08:38:02AM -0700, Victor Yodaiken wrote:
> On Mon, Dec 17, 2001 at 04:04:26PM +0100, Andrea Arcangeli wrote:
> > If somebody wants that 1% of ram back he can buy another dimm of ram and
> > plug it into his hardware. I mean, such a 1% of ram lost is something that
> > can be solved by throwing a few euros at the hardware (and people buy
> > gigabyte boxes anyway, so they don't need 100% of the ram); the
> > other complexity cannot be solved with a few euros, it can only be
> > solved with lots of braincycles and it would be maintenance work as
> > well. Abstraction and layering definitely help cut down the
> > complexity of the code.
>
> I agree with all your arguments up to here. But being able to run Linux
> in 4Meg or even 8M is important to a very large class of applications.
> Even if you are concerned mostly about bigger systems, making sure NT
> remains at a serious disadvantage in the embedded boxes is key, because
> MS will certainly hope to use control of SOHO routers, set-top boxes
> etc. to set "standards" that will improve their competitiveness in desktop
> and beyond. It would be a delicious irony if MS were able to re-use
> against Linux the "first control the low end" strategy that allowed them to
> vaporize the old-line UNIXes, but irony is not as satisfying as winning.

I may have been misleading mentioning a 1%: the 1% doesn't mean 1% of ram is
wasted (otherwise adding a new dimm couldn't solve it, because you would
waste even more ram :). As Ingo also mentioned, it's a fixed amount of
ram that is wasted in the mempool.

For very low end machines you can simply define a very small mempool; it
will potentially reduce scalability during heavy I/O with mem shortage,
but it will waste very very little ram (potentially, in the simplest case,
you only need 1 entry in the pool to guarantee deadlock avoidance). And
there's nearly nothing to worry about, we always had those mempools
since 2.0 at least; look at buffer.c and search for the async argument
to the functions allocating the bhs. Now with the bio we have more
mempools because lots of people still use the bh, so in the short term
(before 2.6) we can waste some more bytes, but once the bh and
ll_rw_block are dead most of the bio overhead will go away and we'll
keep only the advantages of doing I/O in more than one page with a
single metadata entity (2.6). The other obvious advantage of the mempool
code is that we share it across all the mempool users, so we'll save
some bytes of icache too by avoiding code duplication compared to 2.4 :).

In fact solution 2) cannot solve your 4M/8M boot problem either,
since such memory would need to be reserved anyways, and it could act
only as clean filesystem cache. So in short the only difference between 1) and
2) would be a little more fs cache in solution 2), but with a huge
implementation complexity and local_save_irq all over the place in the
VM, so with lower performance. It wouldn't make a difference in
functionality (boot or not boot, this is the real problem you worry
about :).

Andrea

2001-12-17 17:34:26

by Geoffrey

[permalink] [raw]
Subject: kernel panic

I'm looking for the proper forum for posting a (possible) kernel bug.
I'm receiving a panic when attempting to write a cdrw under 2.4.12.

I don't know that this is a bug, but I would expect some other activity
rather than a panic/lockup.

Suggestions as to the proper forum would be appreciated.

--
Until later: Geoffrey [email protected]

"...the system (Microsoft passport) carries significant risks to users
that
are not made adequately clear in the technical documentation available."
- David P. Kormann and Aviel D. Rubin, AT&T Labs - Research
- http://www.avirubin.com/passport.html

2001-12-18 00:33:30

by Rik van Riel

[permalink] [raw]
Subject: Re: mempool design

On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> On Mon, Dec 17, 2001 at 06:21:53PM +0100, Ingo Molnar wrote:
> > On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> >
> > > This whole long thread can be summed up in two points:
> > >
> > > 1 mempool reserved memory is "wasted" i.e. not usable as cache
> >
> > reservations, as in Ben's published (i know, incomplete) implementation,
> > are 'wasted' as well.
>
> yes, I was referring only to his long term design arguments.

Long term design arguments don't have to make the short-term
implementation any more complex. I guess you presented a nice
argument to go with the more flexible solution.

cheers,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-18 14:58:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: mempool design


On Mon, 17 Dec 2001, Victor Yodaiken wrote:

> I agree with all your arguments up to here. But being able to run
> Linux in 4Meg or even 8M is important to a very large class of
> applications. [...]

the amount of reserved RAM should be very low. Especially in embedded
applications, which usually have a very controlled environment with a low
number of well-behaving devices, the number of pages needed to be reserved
is very low. I wouldn't worry about this.

Ingo

2001-12-18 16:13:06

by Victor Yodaiken

[permalink] [raw]
Subject: Re: mempool design

On Tue, Dec 18, 2001 at 05:55:14PM +0100, Ingo Molnar wrote:
>
> On Mon, 17 Dec 2001, Victor Yodaiken wrote:
>
> > I agree with all your arguments up to here. But being able to run
> > Linux in 4Meg or even 8M is important to a very large class of
> > applications. [...]
>
> the amount of reserved RAM should be very low. Especially in embedded
> applications, which usually have a very controlled environment with a low
> number of well-behaving devices, the number of pages needed to be reserved
> is very low. I wouldn't worry about this.


Bueno.

2001-12-18 18:43:26

by Alan

[permalink] [raw]
Subject: Re: mempool design

> If somebody wants such 1% of ram back he can buy another dimm of ram and
> plug it into his hardware. I mean such 1% of ram lost is something that
> can be solved by throwing a few euros into the hardware (and people buys
> gigabyte boxes anyways so they don't need all of the 100% of ram), the

How do I add dimms to an embedded board?

> solved with lots of braincycles and it would be maintenance work as
> well. Abstraction and layering definitely help cut down the
> complexity of the code.

I'm not too worried. mempool as an API can relatively easily be persuaded
to do reservations on an underlying allocator at some point in the future.

Alan