LinuxLists.cc - [RFC] My research agenda for 2.7

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> - Page size is represented on a per-address space basis with a shift count.
> In practice, the smallest is 9 (512 byte sector), could imagine 7 (each
> ext2 inode is separate page) or 8 (actual hardsect size on some drives).
> 12 will be the most common size. 13 gives 8K blocksize for, e.g., alpha.
> 21 and 22 give 2M and 4M page size, matching hardware capabilities of
> x86, and other sizes are possible on machines like MIPS, where page size
> is software controllable
> - Implemented only for file-backed memory (page cache)

Per struct address_space? This is an unnecessary limitation.

On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> - Special case these ops in page cache access layer instead of having the
> messy code in the block IO library
> - Subpage struct pages are dynamically allocated. But buffer_heads are gone
> so this is a lateral change.

This gives me the same data structure proliferation chills as bh's.

-- wli

2003-06-25 00:52:17

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> > - Page size is represented on a per-address space basis with a shift
> > count. In practice, the smallest is 9 (512 byte sector), could imagine 7
> > (each ext2 inode is separate page) or 8 (actual hardsect size on some
> > drives). 12 will be the most common size. 13 gives 8K blocksize for,
> > e.g., alpha. 21 and 22 give 2M and 4M page size, matching hardware
> > capabilities of x86, and other sizes are possible on machines like MIPS,
> > where page size is software controllable
> > - Implemented only for file-backed memory (page cache)
>
> Per struct address_space? This is an unnecessary limitation.

It's a sensible limitation, it keeps the radix tree lookup simple.

> On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> > - Special case these ops in page cache access layer instead of having
> > the messy code in the block IO library
> > - Subpage struct pages are dynamically allocated. But buffer_heads are
> > gone so this is a lateral change.
>
> This gives me the same data structure proliferation chills as bh's.

It's not nearly as bad. There is no distinction between subpage and base
struct page for almost all page operations, e.g., locking, IO, data access.

Regards,

Daniel

2003-06-25 00:56:30

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
>> Per struct address_space? This is an unnecessary limitation.

On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> It's a sensible limitation, it keeps the radix tree lookup simple.

It severely limits its usefulness. Dropping in a more flexible data
structure should be fine.

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
>> This gives me the same data structure proliferation chills as bh's.

On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> It's not nearly as bad. There is no distinction between subpage and base
> struct page for almost all page operations, e.g., locking, IO, data access.

But those are code sanitation issues. You need to make sure this
doesn't explode on PAE.

-- wli

2003-06-25 01:12:34

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 03:10, William Lee Irwin III wrote:
> On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> >> Per struct address_space? This is an unnecessary limitation.
>
> On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> > It's a sensible limitation, it keeps the radix tree lookup simple.
>
> It severely limits its usefulness. Dropping in a more flexible data
> structure should be fine.

Eventually it could well make sense to do that, e.g., the radix tree
eventually ought to evolve into a btree of extents (probably). But making
things so complex in the first version, thus losing much of the incremental
development advantage, would not be smart. With a single size of page per
address_space, changes to the radix tree code are limited to a couple of
lines, for example.

But perhaps you'd like to supply some examples where more than one size of
page in the same address space really matters?

> On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> >> This gives me the same data structure proliferation chills as bh's.
>
> On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> > It's not nearly as bad. There is no distinction between subpage and base
> > struct page for almost all page operations, e.g., locking, IO, data
> > access.
>
> But those are code sanitation issues. You need to make sure this
> doesn't explode on PAE.

Indeed, that is important. Good night, see you tomorrow.

Daniel

2003-06-25 01:16:56

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 03:10, William Lee Irwin III wrote:
>> It severely limits its usefulness. Dropping in a more flexible data
>> structure should be fine.

On Wed, Jun 25, 2003 at 03:25:47AM +0200, Daniel Phillips wrote:
> Eventually it could well make sense to do that, e.g., the radix tree
> eventually ought to evolve into a btree of extents (probably). But making
> things so complex in the first version, thus losing much of the incremental
> development advantage, would not be smart. With a single size of page per
> address_space, changes to the radix tree code are limited to a couple of
> lines, for example.
> But perhaps you'd like to supply some examples where more than one size of
> page in the same address space really matters?

Software-refill TLB architectures would very much like to be handed the
largest physically contiguous chunk of memory out of pagecache possible
and map it out using the fewest number of TLB entries possible. Dropping
in a B+ tree to replace radix trees should be a weekend project at worst.
Speculatively allocating elements that are "as large as sane/possible"
will invariably result in variable-sized elements in the same tree.

-- wli

2003-06-25 09:15:51

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

> 1) Active memory defragmentation
>
> I doubt anyone will deny that this is desirable. Active defragmentation will
> eliminate higher order allocation failures for non-atomic allocations, and I
> hope, generally improve the efficiency and transparency of the kernel memory
> allocator.
>

It might be just me, but this scheme sounds a bit complicated (I'm still
absorbing the other two). I find it difficult to see what happens when a
page used by a kernel pointer changes for any case other than vmalloc()
but I probably am missing something.

How about: Move order-0 allocations to slab (ignoring bootstrap issues for
now but shouldn't be hard anyway)

Each cache slab is 2^MAX_GFP_ORDER large and there is three caches
o order0-user
o order0-kreclaim
o order0-knoreclaim

order0-user is for any userspace allocation. These pages should be
trivially reclaimable with rmap available. If a large order block is
necessary, select one slab and reclaim it. This will break LRU ordering
something rotten but I am guessing that LRU ordering is not the prime
concern here. If a defrag daemon exists, scan MAX_DEFRAG_SCAN slabs and
pick the one with the most clean filesystem backed pages to chuck out
(least IO involved in reclaim).

order0-kreclaim is for kernel allocations which are trivially reclaimable
and that can be safely discared knowing that no pointer exists to them.
This is most likely to be usable for slab allocations of caches like
dentry's which can be safely thrown out. A quick look of /proc/slabinfo
shows that most slabs are just 1 page large. Slabs already have a
set_shrinker() callback for the removal of objects so it is likely that
this could be extended for telling caches to clear all objects and discard
a particular slab.

order0-knoreclaim is for kernel allocations which cannot be easily
reclaimed and have pointers to the allocation which are difficult to
reclaim. For all intents and purposes, these are not reclaimable without
impementing swapping in kernel space.

This has a couple of things going for it

o Reclaimable pages are in easy to search globs
o Gets nifty per-cpu alloc and caching that comes with slab automatically
o Freeing high order pages is a case of discarding pages in one slab
o Doesn't require fancy pants updating of pointers or page tables
o Possible ditch the mempool interface as slab already has similar functionality
o Seems simple

Opinions?

--
Mel Gorman

2003-06-26 18:45:55

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 11:29, Mel Gorman wrote:
> > 1) Active memory defragmentation
> >
> > I doubt anyone will deny that this is desirable. Active defragmentation
> > will eliminate higher order allocation failures for non-atomic
> > allocations, and I hope, generally improve the efficiency and
> > transparency of the kernel memory allocator.
>
> It might be just me, but this scheme sounds a bit complicated (I'm still
> absorbing the other two).

Mel,

I probably spent too much time dwelling on the hard cases of page moving, and
didn't sufficiently emphasize that handling a few easy cases would do a
pretty good job, certainly better than we do now. For example:

* Most process pages are easily movable, since usually only page tables
will hold references.

* Most page cache pages are easily movable, likewise

* Similarly, page table pages are not too hard to move

Most slab pages are hard to move. We could try to fix that for certain common
object types, or we could just tell slab to use its own biggish chunks of
memory, which it can play in as it sees fit.

> I find it difficult to see what happens when a
> page used by a kernel pointer changes for any case other than vmalloc()
> but I probably am missing something.

The point you apparently missed is that the defragger will identify and update
those kernel pointers, being careful about races of course.

> How about: Move order-0 allocations to slab (ignoring bootstrap issues for
> now but shouldn't be hard anyway)

That sounds like radical surgery to me, but to each his own experiments.

> Each cache slab is 2^MAX_GFP_ORDER large and there is three caches
> o order0-user
> o order0-kreclaim
> o order0-knoreclaim
>
> order0-user is for any userspace allocation. These pages should be
> trivially reclaimable with rmap available. If a large order block is
> necessary, select one slab and reclaim it. This will break LRU ordering
> something rotten but I am guessing that LRU ordering is not the prime
> concern here. If a defrag daemon exists, scan MAX_DEFRAG_SCAN slabs and
> pick the one with the most clean filesystem backed pages to chuck out
> (least IO involved in reclaim).

Defragmentation by selective eviction is possible, but isn't necessarily
optimal. In the case of memory that isn't swap or file-backed, it isn't even
possible. On the other hand, you may think of the page move case as simply
an optimized evict-and-reload, if that helps understand where I'm going.

Regardless, it would be good to teach vmscan to evict pages that will help
build higher order allocation units, when these are in short supply. This
would be an optimization heuristic; it would still necessary to handle the
general case of data moving in order to make any guarantee re fragmentation
control.

> order0-kreclaim is for kernel allocations which are trivially reclaimable
> and that can be safely discared knowing that no pointer exists to them.
> This is most likely to be usable for slab allocations of caches like
> dentry's which can be safely thrown out. A quick look of /proc/slabinfo
> shows that most slabs are just 1 page large. Slabs already have a
> set_shrinker() callback for the removal of objects so it is likely that
> this could be extended for telling caches to clear all objects and discard
> a particular slab.
>
> order0-knoreclaim is for kernel allocations which cannot be easily
> reclaimed and have pointers to the allocation which are difficult to
> reclaim. For all intents and purposes, these are not reclaimable without
> impementing swapping in kernel space.
>
> This has a couple of things going for it
>
> o Reclaimable pages are in easy to search globs
> o Gets nifty per-cpu alloc and caching that comes with slab automatically
> o Freeing high order pages is a case of discarding pages in one slab
> o Doesn't require fancy pants updating of pointers or page tables

Without updating pointers, active defragmentation is not possible. But
perhaps you meant to say that active defragmentation is not needed?

> o Possible ditch the mempool interface as slab already has similar
> functionality
> o Seems simple
>
> Opinions?

Yes :-)

Regards,

Daniel

2003-06-26 19:47:46

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Thu, 26 Jun 2003, Daniel Phillips wrote:

> * Most process pages are easily movable, since usually only page tables
> will hold references.
>
> * Most page cache pages are easily movable, likewise
>
> * Similarly, page table pages are not too hard to move
>

I think that finding pages like this together is unlikely, especially if
the system has been running a long time. In the worst case you will have
every easily-moved page adjactent to a near-impossible-to-move page. The
buddy allocator could be taught to be selective of which freelists it used
for different types of allocation but that doesn't sound any easier than
moving order0 pages to slab.

There also would be trouble identifing adjactent pages of order 0 which
are are used for these particular task. Buddy allocators, including the
one implemented in Linux, do not record what order allocation a struct
page belongs to. I could be wrong but it just feels like it would be a lot
of work to free up just a few adjacent pages.

Tentatively, I would assert that being able to group these types of pages
together is a pre-requestive for effective defragging and I think moving
order-0 pages to slab would enforce this grouping. For example, the defrag
daemon could scan just order0-user slabs and move pages from sparsly
populated slabs to other fuller slabs and free slabs as it goes.

In other words, order0 caches are not a mutually exclusive goal to
defragging but I bet you a shiny penny it'd make the implementation a lot
easier.

> Most slab pages are hard to move. We could try to fix that for certain
> common object types, or we could just tell slab to use its own biggish
> chunks of memory, which it can play in as it sees fit.
>

Moving slab pages is probably not an option unless some quick way of
updating all the pointers to the objects is found. I would say the only
slab pages that are "movable" belong to caches which can quickly discard
their objects.

> > I find it difficult to see what happens when a
> > page used by a kernel pointer changes for any case other than vmalloc()
> > but I probably am missing something.
>
> The point you apparently missed is that the defragger will identify and
> update those kernel pointers, being careful about races of course.
>

I didn't miss it as such, but I don't see how it could be implemented
either. I also wonder if moving kernel pages is really worth the hassle.
It's perfectly possible that defragging only user pages will be enough.
The only way to be sure would be to walk all struct pages in the mem_map
and see what the layout looks like.

> > How about: Move order-0 allocations to slab (ignoring bootstrap issues for
> > now but shouldn't be hard anyway)
>
> That sounds like radical surgery to me, but to each his own experiments.
>

I don't think it would be too radical because the buddy allocator would
still remain the principal page allocator. What would be a bitch is the
additional storage requirements required for such a large number of slabs
needed to contain pages. At the very least, the slab allocator would need
a means of discarding the slab descriptors for full slabs and maintaining
just pointers to the first page of each full slab. A comparison of page
allocation from buddy and page allocation from slab would also be
necessary to make sure the page allocation performance wasn't butchered.

> Defragmentation by selective eviction is possible, but isn't necessarily
> optimal. In the case of memory that isn't swap or file-backed, it isn't even
> possible. On the other hand, you may think of the page move case as simply
> an optimized evict-and-reload, if that helps understand where I'm going.
>

Well, take this this case

1) Slabs can contain 10 order0 pages
2) Defragger finds two slabs; slabA with 2 pages and slabB with 5
3) Defragger copies 2 pages from slabA to slabB and users rmap to update
page tables

The alternative sounds like it would be scanning mem_map looking for pages
that can be moved which sounds expensive. With order0 in slab, you can
easily find blocks of pages which when moved immeditaly free up a large
adjacent block.

> > o Reclaimable pages are in easy to search globs
> > o Gets nifty per-cpu alloc and caching that comes with slab automatically
> > o Freeing high order pages is a case of discarding pages in one slab
> > o Doesn't require fancy pants updating of pointers or page tables
>
> Without updating pointers, active defragmentation is not possible. But
> perhaps you meant to say that active defragmentation is not needed?
>

I am saying that full defragmentation would be a very expensive operation
which might not work if it cannot find suitably freeable adjacent pages.
If order0 pages were in slab, the whole searching problem becomes trivial
(just go to the relevant cache and scan the slabs).

> > o Possible ditch the mempool interface as slab already has similar
> > functionality
> > o Seems simple
> >
> > Opinions?
>
> Yes :-)
>

heh.

--
Mel Gorman

2003-06-26 19:55:47

by Andrew Morton

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

Mel Gorman <[email protected]> wrote:
>
> Buddy allocators, including the
> one implemented in Linux, do not record what order allocation a struct
> page belongs to.

We can do that.

--- 25/mm/page_alloc.c~a 2003-06-26 13:09:11.000000000 -0700
+++ 25-akpm/mm/page_alloc.c 2003-06-26 13:09:24.000000000 -0700
@@ -123,6 +123,7 @@ static void prep_compound_page(struct pa
SetPageCompound(p);
p->lru.next = (void *)page;
}
+ page[1].index = order;
}

static void destroy_compound_page(struct page *page, unsigned long order)

_

2003-06-27 00:20:26

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Thursday 26 June 2003 22:01, Mel Gorman wrote:
> I think that finding pages like this together is unlikely, especially if
> the system has been running a long time. In the worst case you will have
> every easily-moved page adjactent to a near-impossible-to-move page.

Mel,

I addressed that in my previous post: "Most slab pages are hard to move ... we
could just tell slab to use its own biggish chunks of memory, which it can
play in as it sees fit". Another way of putting this is, we isolate the type
of data that's hard to defrag in its own space, so it can't fragment our
precious movable space. That separate space can expand and contract
chunkwise.

To be sure, this solution means we still can't forcibly create higher order
units in those undefraggable regions, but that is no worse than the current
situation. In the defraggable regions, hopefully the majority of memory, we
have a much improved situation: we can forcibly create as many high order
allocation units as we need, in order to handle, e.g., an Ext2/3 filesystem
that's running with a large blocksize.

> Moving slab pages is probably not an option unless some quick way of
> updating all the pointers to the objects is found.

That was the part in my original post about a "new API". If we choose to do
so, we could fix up most kinds of slab objects to support some protocol to
assist in the effort, or use a data structure better suited to relocation. I
doubt it's necessary to go that far. As far as I can see, slab is not a big
offender in terms of fragmentation at the moment.

> I also wonder if moving kernel pages is really worth the hassle.

That's the question of course. The benefit is getting rid of high order
allocation failures, and gaining some confidence that larger filesystem
blocksizes will work reliably, however the workload evolves. The cost is
some additional complexity, no argument about that. On the other hand, it
would no longer be necessary to use voodoo to try to get the allocator to
magically not fragment.

> It's perfectly possible that defragging only user pages will be enough.

Indeed. That's an excellent place to start. We'd include page cache pages in
that, I'd expect. Then it would be attractive to go on to handle page table
pages, and with that we'd have the low-hanging fruit.

> The alternative sounds like it would be scanning mem_map looking for pages
> that can be moved which sounds expensive.

It looks linear to me, and it's not supposed to happen often. It would
typically be a response to a nasty situation like you'd get by mounting a
volume that uses 32K blocks after running a long time with mostly 4K blocks.

> > Without updating pointers, active defragmentation is not possible. But
> > perhaps you meant to say that active defragmentation is not needed?
>
> I am saying that full defragmentation would be a very expensive operation
> which might not work if it cannot find suitably freeable adjacent pages.

Full defragmentation, whatever that is, would be pointless for the
applications I've described, though perhaps it could be needed by some
bizarre application like software hotplug memory. In the latter case you
wouldn't be too worried about how long it takes.

In general though, the defragger would only run long enough to restore balance
to the free counts across the various orders. In rare cases, it would have
to run synchronously in order to prevent an allocation from failing.

> If order0 pages were in slab, the whole searching problem becomes trivial
> (just go to the relevant cache and scan the slabs).

You might want to write a separate [rfc] to describe your idea. For one
thing, I don't see why you'd want to use slab for that.

Regards,

Daniel

2003-06-27 12:46:31

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> On Thursday 26 June 2003 22:01, Mel Gorman wrote:
> > I think that finding pages like this together is unlikely, especially if
> > the system has been running a long time. In the worst case you will have
> > every easily-moved page adjactent to a near-impossible-to-move page.
>
> I addressed that in my previous post: "Most slab pages are hard to move
> ... we could just tell slab to use its own biggish chunks of memory,
> which it can play in as it sees fit".

Ah, ok, sorry, I missed that but it wouldn't be the first time I missed
something. It was just a few days ago I wrote a pile of material on the
new buddy allocator as part of a publication and still missed that the
order of pages can be identified because of compound pages, thanks Andrew.

> > I also wonder if moving kernel pages is really worth the hassle.
>
> That's the question of course. The benefit is getting rid of high order
> allocation failures, and gaining some confidence that larger filesystem
> blocksizes will work reliably, however the workload evolves.

I'm still working on 2.6 documentation which I expect will be published in
a few months. When I get that written, I'll look into seeing what can be
done with VM Regress to calculate fragmentation and to see how often do
high order allocations actually fail. It might help determine where
defragging is most needed.

IIRC, Martin J. Bligh had a patch which displayed information about the
buddy allocator freelist so that will probably be the starting point. From
there, it should be handy enough to see how intermixed are kernel page
allocations with user allocations. It might turn out that kernel pages
tend to be clustered together anyway.

> > If order0 pages were in slab, the whole searching problem becomes trivial
> > (just go to the relevant cache and scan the slabs).
>
> You might want to write a separate [rfc] to describe your idea. For one
> thing, I don't see why you'd want to use slab for that.
>

You're right, I will need to write a proper RFC one way or the other. I
was thinking of using slabs because that way there wouldn't be need to
scan all of mem_map, just a small number of slabs. I have no basis for
this other than hand waving gestures though.

Anyway, as I know I won't be coding any time soon due to writing docs,
I'll shut up for the moment :-)

--
Mel Gorman

2003-06-27 14:26:44

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 15:00, Mel Gorman wrote:
> On Fri, 27 Jun 2003, Daniel Phillips wrote:
> I was thinking of using slabs because that way there wouldn't be need to
> scan all of mem_map, just a small number of slabs. I have no basis for
> this other than hand waving gestures though.

That's the right idea, it's just not necessary to use slab cache to achieve
it. Actually, to handle huge (hardware) pages efficiently, my first instinct
is to partition them into their own largish chunks as well, and allocate new
chunks as necessary. To get rid of a chunk (because freespace of that type
of chunk has fallen below some threshold) it has to be entirely empty, which
can be accomplished using the same move logic as for defragmentation.

You're right to be worried about intrusion of unmovable pages into regions
that are supposed to be defraggable. It's very easy for some random kernel
code to take a use count on a page and hang onto it forever, making the page
unmovable. My hope is that:

- This doesn't happen much
- Code that does that can be cleaned up
- Even when it happens it won't hurt much
- If all of the above fails, fix the api for the offending code or create
a new one
- If that fails too, give up.

Regards,

Daniel

2003-06-27 14:24:57

by Martin J. Bligh

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

> IIRC, Martin J. Bligh had a patch which displayed information about the
> buddy allocator freelist so that will probably be the starting point. From
> there, it should be handy enough to see how intermixed are kernel page
> allocations with user allocations. It might turn out that kernel pages
> tend to be clustered together anyway.

That should be merged now - /proc/buddyinfo. I guess you could do the same
for allocated pages (though it'd be rather heavyweight ;-))

M.

2003-06-27 14:31:37

by Martin J. Bligh

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

>> > I also wonder if moving kernel pages is really worth the hassle.
>>
>> That's the question of course. The benefit is getting rid of high order
>> allocation failures, and gaining some confidence that larger filesystem
>> blocksizes will work reliably, however the workload evolves.

Oh, BTW ... I suspect you've realised this already, but ....

The buddy allocator is not a good system for getting rid of fragmentation.
If I group pages together in aligned pairs, and F is free and A is
allocated, it'll not do anything useful with this:

F A A F F A A F F A A F F A A F F A F A

because the adjacent "F"s aren't "buddies". It seems that the purpose of
the buddy allocator was to be quick at allocating pages. Now that we stuck
a front end cache on it, in the form of hot & cold pages, that goal no
longer seems paramount - altering it to reduce fragmentation at the source,
rather than actively defrag afterwards would seem like a good goal to me.

M.

2003-06-27 14:39:33

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 16:43, Martin J. Bligh wrote:
> The buddy allocator is not a good system for getting rid of fragmentation.

We've talked in the past about throwing out the buddy allocator and adopting
something more modern and efficient and I hope somebody will actually get
around to doing that. In any event, defragging is an orthogonal issue. Some
allocation strategies may be statistically more resistiant to fragmentation
than others, but no allocator has been invented, or ever will be, that can
guarantee that terminal fragmentation will never occur - only active
defragmentation can provide such a guarantee.

Regards,

Daniel

2003-06-27 14:50:47

by Martin J. Bligh

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

--Daniel Phillips <[email protected]> wrote (on Friday, June 27, 2003 16:54:46 +0200):

> On Friday 27 June 2003 16:43, Martin J. Bligh wrote:
>> The buddy allocator is not a good system for getting rid of fragmentation.
>
> We've talked in the past about throwing out the buddy allocator and adopting
> something more modern and efficient and I hope somebody will actually get
> around to doing that. In any event, defragging is an orthogonal issue. Some
> allocation strategies may be statistically more resistiant to fragmentation
> than others, but no allocator has been invented, or ever will be, that can
> guarantee that terminal fragmentation will never occur - only active
> defragmentation can provide such a guarantee.

Whilst I agree with that in principle, it's inevitably expensive. Thus
whilst we may need to have that code, we should try to avoid using it ;-)

The buddy allocator is obviously flawed in this department ... strategies
like allocating a 4M block to a process up front, then allocing out of that
until we're low on mem, then reclaiming in as large blocks as possible from
those process caches, etc, etc, would obviously help too. Though maybe
we're just permanently low on mem after a while, so it'd be better to just
group pagecache pages together ... that would actually be pretty simple to
change ... hmmm.

M.

2003-06-27 15:05:12

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 17:04, Martin J. Bligh wrote:
> Daniel Phillips <[email protected]> wrote (on Friday, June 27, 2003
> > Some allocation strategies may be statistically more
> > resistiant to fragmentation than others, but no allocator has been
> > invented, or ever will be, that can guarantee that terminal fragmentation
> > will never occur - only active defragmentation can provide such a
> > guarantee.
>
> Whilst I agree with that in principle, it's inevitably expensive. Thus
> whilst we may need to have that code, we should try to avoid using it ;-)

That's exactly the idea. Active defragmentation is just a fallback to handle
currently-unhandled corner cases. A good, efficient allocator that resists
fragmentation in the first place is still needed.

Regards,

Daniel

2003-06-27 15:10:31

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> On Friday 27 June 2003 17:04, Martin J. Bligh wrote:
> > Daniel Phillips <[email protected]> wrote (on Friday, June 27, 2003
> > > Some allocation strategies may be statistically more
> > > resistiant to fragmentation than others, but no allocator has been
> > > invented, or ever will be, that can guarantee that terminal fragmentation
> > > will never occur - only active defragmentation can provide such a
> > > guarantee.
> >
> > Whilst I agree with that in principle, it's inevitably expensive. Thus
> > whilst we may need to have that code, we should try to avoid using it ;-)
>
> That's exactly the idea. Active defragmentation is just a fallback to handle
> currently-unhandled corner cases. A good, efficient allocator that resists
> fragmentation in the first place is still needed.
>

I still suspect moving order0 allocations to slab will be a fragmentation
resistent allocator but my main concern would be that the slab allocator
overhead, both CPU and storage requirements will be too high.

On the other hand, it would do some things you are looking for. For
example, it allocates large blocks of memory in one lump and then
allocates them piecemeal. Second, it would be resistent to the FAFAFA
problem Martin pointed out. As slabs would be allocated in a large block
from the buddy, you are guarenteed that you'll be able to free up buddies.
Lastly, as there would be a cache specifically for userspace pages, a
defragger that looked exclusively at user pages will still be sure of
being able to free adjacent buddies.

I need to write a proper RFC.....

--
Mel Gorman

2003-06-27 15:35:43

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 17:22, Mel Gorman wrote:
> I still suspect moving order0 allocations to slab will be a fragmentation
> resistent allocator but my main concern would be that the slab allocator
> overhead, both CPU and storage requirements will be too high.
>
> On the other hand, it would do some things you are looking for. For
> example, it allocates large blocks of memory in one lump and then
> allocates them piecemeal. Second, it would be resistent to the FAFAFA
> problem Martin pointed out. As slabs would be allocated in a large block
> from the buddy, you are guarenteed that you'll be able to free up buddies.
> Lastly, as there would be a cache specifically for userspace pages, a
> defragger that looked exclusively at user pages will still be sure of
> being able to free adjacent buddies.
>
> I need to write a proper RFC.....

You might want to have a look at this:

http://www.research.att.com/sw/tools/vmalloc/
(Vmalloc: A Memory Allocation Library)

Regards,

Daniel

2003-06-27 15:45:40

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 17:50, Daniel Phillips wrote:
> You might want to have a look at this:
>
> http://www.research.att.com/sw/tools/vmalloc/
> (Vmalloc: A Memory Allocation Library)

Whoops, I retract that. I did a quick scan of the page and ran into a link
labled "Non-Commercial License Agreement and Software", attached to some form
of license loaded with ugly patent words and so on. My immediate reaction
is: stay far, far away from that work, it is encumbered. It would be nice if
somebody knows differently, but for now I must ignore Vo's work.

Regards,

Daniel

2003-06-29 19:11:21

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> > http://www.research.att.com/sw/tools/vmalloc/
> > (Vmalloc: A Memory Allocation Library)
>
> Whoops, I retract that. I did a quick scan of the page and ran into a link
> labled "Non-Commercial License Agreement and Software", attached to some form
> of license loaded with ugly patent words and so on.

The paper itself is still freely available and the idea could still be
implemented but I still think that the slab allocator will be sufficent
for order0 allocations and use caches to cluster the different types of
order0 allocations together. I've read that paper before and I think it is
overkill for the kernel because the allocation pattern isn't that
complicated in comparison to userspace memory management which that
allocator is aimed at. Not to mention that to get the most out of the
allocator, you need to know what your allocation pattern is going to be
like in advance.

Of course, at this point, I still need to write an RFC about using slab
for order0 allocations and ultimatly I'll have to put code where my emails
are :-/

To help show why slab might work though, I updated vmregress and released
0.10 available at http://www.csn.ul.ie/~mel/projects/vmregress/ . The main
addition is that it now can collect some basic statistics on the types of
page allocations. Specifically, with the trace_alloc module loaded and the
trace_pagealloc.diff patch applied (see kernel_patches/ directory),
/proc/vmregress/trace_alloc will print out the number of allocations for
each order that used either the GFP_KERNEL or GFP_USER flags.

To test it, I booted a clean 2.5.73 system[1], started X and ran
konqueror. The results were;

Kernel allocations
------------------
BeforeX After X
Order 0: 6886 14595
Order 1: 335 383
Order 2: 17 23
Order 3: 2 2
Order 4: 3 3
Order 5: 0 0
Order 6: 0 0
Order 7: 0 0
Order 8: 0 0
Order 9: 0 0
Order 10: 0 0

Userspace allocations
---------------------
Order 0: 53050 71505
Order 1: 1 1
Order 2: 0 0
Order 3: 0 0
Order 4: 0 0
Order 5: 0 0
Order 6: 0 0
Order 7: 0 0
Order 8: 0 0
Order 9: 0 0
Order 10: 0 0

As you can see, order0 allocations were a *lot* more common, at least in
my system. Because they are so common in comparison to other orders, I
think that putting order0 in slabs of size 2^MAX_ORDER will make
defragmentation *so* much easier, if not plain simple, because you can
shuffle around order0 pages in the slabs to free up one slab which frees
up one large 2^MAX_ORDER adjacent block of pages.

It's possible that a journalled filesystem (or some other subsystem I am
not familiar with) would show that order > 0 allocations are a lot more
important than my system shows but the means to collect the information is
in vmregress so be my guest. When the time requires, I'll extend vm
regress to see how fragmented memory actually gets.

Much much later, it would be worth looking at the buddy allocator and
seeing would it be worth changing to something much simpler as most of its
work would be handled by the slab allocator instead.

[1] Tonnes of errors were spewed out on uninitialised timers. With highmem
enabled, I got a sleeping while atomic error warning in a loop caused
by default_idle(). I still have to see can I get around to tracking
down if it's a PEBKAC problem or a kernel bug

--
Mel Gorman

2003-06-29 20:54:28

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Sunday 29 June 2003 21:25, Mel Gorman wrote:
> As you can see, order0 allocations were a *lot* more common, at least in
> my system.

Mel,

There's no question that that's the case today. However, there are good
reasons for using a largish filesystem blocksize, 16K for example, once it
becomes possible to do so. With an active volume mounted using 16K blocks,
you'd see that the balance of allocations shifts towards order 2. The size
of the shift will be workload-dependent, ranging from almost no order 2
allocations, to almost all. To keep things interesting, it's quite possible
for the balance to change suddenly and/or strongly.

> Because they are so common in comparison to other orders, I
> think that putting order0 in slabs of size 2^MAX_ORDER will make
> defragmentation *so* much easier, if not plain simple, because you can
> shuffle around order0 pages in the slabs to free up one slab which frees
> up one large 2^MAX_ORDER adjacent block of pages.

But how will you shuffle those pages around?

Regards,

Daniel

2003-06-29 21:12:39

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Sat, 28 Jun 2003, Daniel Phillips wrote:

> There's no question that that's the case today. However, there are good
> reasons for using a largish filesystem blocksize, 16K for example, once it
> becomes possible to do so.

Fair enough. I am not very familiar with filesystems or why it would be
desirable to have a blocksize greater than a page size. I'd be grateful if
you'd recommend a good book, paper or documentation set that would
enlighten me.

> > Because they are so common in comparison to other orders, I
> > think that putting order0 in slabs of size 2^MAX_ORDER will make
> > defragmentation *so* much easier, if not plain simple, because you can
> > shuffle around order0 pages in the slabs to free up one slab which frees
> > up one large 2^MAX_ORDER adjacent block of pages.
>
> But how will you shuffle those pages around?
>

Thats where your defragger would need to kick in. The defragger would scan
at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache
where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
slab with the least objects in them and either:

a) Reclaim the pages by swapping them out or whatever
b) Copy the pages to another slab and update the page tables via rmap

Once the pages are copied from the selected slab, destroy it and you have
a large block of free pages. The point which I am getting across, quite
badly, is that by having order0 pages in slab, you are guarenteed to be
able to quickly find a cluster of pages to move which will free up a
contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
the current slab implementation.

For kernel pages, it would get a lot more complicated but at least the
kernel pages would be clustered together in the order0 cache for kernel
pages.

--
Mel Gorman

2003-06-29 21:39:25

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
> On Sat, 28 Jun 2003, Daniel Phillips wrote:
> > > Because they are so common in comparison to other orders, I
> > > think that putting order0 in slabs of size 2^MAX_ORDER will make
> > > defragmentation *so* much easier, if not plain simple, because you can
> > > shuffle around order0 pages in the slabs to free up one slab which
> > > frees up one large 2^MAX_ORDER adjacent block of pages.
> >
> > But how will you shuffle those pages around?
>
> Thats where your defragger would need to kick in. The defragger would scan
> at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache
> where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
> slab with the least objects in them and either:
> a) Reclaim the pages by swapping them out or whatever
> b) Copy the pages to another slab and update the page tables via rmap
> Once the pages are copied from the selected slab, destroy it and you have
> a large block of free pages.

Yes, now we're on the same page, so to speak. Personally, I don't have much
interest in working on improving the allocator per se. I'd love to see
somebody else take a run at that, and I will occupy myself with the gritty
details of how to move pages without making the system crater.

> The point which I am getting across, quite
> badly, is that by having order0 pages in slab, you are guarenteed to be
> able to quickly find a cluster of pages to move which will free up a
> contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
> the current slab implementation.

I don't see that it's guaranteed, but I do see that organizing pages in
slab-like chunks is a good thing to do - a close reading of my original
response to you shows I was thinking about that.

I also don't see that the slab cache in its current incarnation is the right
tool for the job. It handles things that we just don't need to handle, such
as objects of arbitary size and alignment. I'm sure you could make it work,
but why not just tweak alloc_pages to know about chunks instead?

Regards,

Daniel

2003-06-29 21:54:02

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
>> Thats where your defragger would need to kick in. The defragger would scan
>> at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache
>> where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
>> slab with the least objects in them and either:
>> a) Reclaim the pages by swapping them out or whatever
>> b) Copy the pages to another slab and update the page tables via rmap
>> Once the pages are copied from the selected slab, destroy it and you have
>> a large block of free pages.

On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> Yes, now we're on the same page, so to speak. Personally, I don't have much
> interest in working on improving the allocator per se. I'd love to see
> somebody else take a run at that, and I will occupy myself with the gritty
> details of how to move pages without making the system crater.

This sounds like it's behind dependent on physically scanning slabs,
since one must choose slab pages for replacement on the basis of their
potential to restore contiguity, not merely "dump whatever's replaceable
and check how much got freed".

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
>> The point which I am getting across, quite
>> badly, is that by having order0 pages in slab, you are guarenteed to be
>> able to quickly find a cluster of pages to move which will free up a
>> contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
>> the current slab implementation.

On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> I don't see that it's guaranteed, but I do see that organizing pages in
> slab-like chunks is a good thing to do - a close reading of my original
> response to you shows I was thinking about that.
> I also don't see that the slab cache in its current incarnation is the right
> tool for the job. It handles things that we just don't need to handle, such
> as objects of arbitary size and alignment. I'm sure you could make it work,
> but why not just tweak alloc_pages to know about chunks instead?

A larger question in my mind is what slabs have to do with this at all.
Slabs are for fixed-size allocations. These techniques are explicitly
oriented toward making user allocations more variable, not less.

-- wli

2003-06-29 23:03:00

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Monday 30 June 2003 00:07, William Lee Irwin III wrote:
> On Sunday 29 June 2003 23:26, Mel Gorman wrote:
> > ...I will occupy myself with the
> > gritty details of how to move pages without making the system crater.
>
> This sounds like it's behind dependent on physically scanning slabs,
> since one must choose slab pages for replacement on the basis of their
> potential to restore contiguity, not merely "dump whatever's replaceable
> and check how much got freed".

Though I'm not sure what "behind dependent" means, and I'm not the one
advocating slab for this, it's quite correct that scanning strategy would
need to change, at least when the system runs into cross-order imbalances.
But this isn't much different from the kinds of things we do already.

Regards,

Daniel

2003-07-02 20:58:59

by Mike Fedyk

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Fri, Jun 27, 2003 at 02:00:42PM +0100, Mel Gorman wrote:
> You're right, I will need to write a proper RFC one way or the other. I
> was thinking of using slabs because that way there wouldn't be need to
> scan all of mem_map, just a small number of slabs. I have no basis for
> this other than hand waving gestures though.

Mel,

This sounds much like something I was reading from Larry McVoy using page
objects (like one level higher in magnatude than pages).

I don't remember the URL, but there was something pretty extensive from
Larry already explaining the concept.

2003-07-03 01:50:40

by Larry McVoy

[permalink] [raw]

Subject: Re: [RFC] My research agenda for 2.7

On Wed, Jul 02, 2003 at 02:10:55PM -0700, Mike Fedyk wrote:
> On Fri, Jun 27, 2003 at 02:00:42PM +0100, Mel Gorman wrote:
> > You're right, I will need to write a proper RFC one way or the other. I
> > was thinking of using slabs because that way there wouldn't be need to
> > scan all of mem_map, just a small number of slabs. I have no basis for
> > this other than hand waving gestures though.
>
> Mel,
>
> This sounds much like something I was reading from Larry McVoy using page
> objects (like one level higher in magnatude than pages).
>
> I don't remember the URL, but there was something pretty extensive from
> Larry already explaining the concept.

If we're thinking about the same thing, the basic idea was to store
information into a higher level object and make more intelligent paging
decisions based on the higher level object. In my brain, since I'm a
SunOS guy, that means that you store information in the vnode (inode)
which reflects the status of all pages backed by this inode.

Instead of trying to figure out what to do at the page level, you figure
out what to do at the object level.

Some postings about this:

http://groups.google.com/groups?q=topvn+mcvoy&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3cgeu9%24h96%40fido.asd.sgi.com&rnum=1

http://groups.google.com/groups?q=vnode+mcvoy&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=l0ojgnINN59t%40appserv.Eng.Sun.COM&rnum=12

I can't find the writeup that you are thinking about. I know what you mean,
there was a discussion of paging algs and I went off about how scanning a
page a time is insane. If someone finds the URL let me know.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-07-03 02:06:35