2003-06-24 22:56:25

by Daniel Phillips

Subject: [RFC] My research agenda for 2.7

This note describes several items on my research agenda for the 2.7 kernel
development timeframe. First a little history.

In 2.3/2.4, the dominant change to Linux's memory management subsystem was the
unification of the page and buffer cache, in the sense that most double
storage was eliminated and copying from the buffer to page cache was no
longer required on each file write. In 2.5, the dominant change was the
addition of reverse page mapping. I participated actively in the latter,
being motivated by the belief that with reverse mapping in place, a number of
significant evolutionary improvements to the memory management subsystem
would become possible, or indeed, easy. Here, I list the three main projects
I hope to complete in the 2.7 timeframe, for comment:

1) Active memory defragmentation
2) Variable sized page cache objects
3) Physical block cache

These are ordered from least to most controversial. I believe all three will
prove to be worthwhile improvements to Linux's memory management
subsystem, and hopefully, I support that belief adequately in the text below.
Of course, working code and benchmarks are the final arbiter of goodness, and
will appear in due course.

1) Active memory defragmentation

I doubt anyone will deny that this is desirable. Active defragmentation will
eliminate higher order allocation failures for non-atomic allocations, and I
hope, generally improve the efficiency and transparency of the kernel memory
allocator.

The purpose of active memory defragmentation is to catch corner cases, rather
than to be a complete replacement for the current allocation system. The
most obvious and problematic corner case is the one where all physical memory
units of a given order are used up, in which case the allocator only has two
options: wait or fail. Active defragmentation introduces a third option,
which should eliminate nearly all instances of the former and all of the
latter, except when physical memory is genuinely exhausted for some reason
(i.e., bona fide OOM).

The idea is to run a defragmentation daemon that kicks in whenever
availability of some allocation order falls below a threshold. The defrag
daemon first scans memory for easily movable pages that can form new, free
units of the needed order. If this pass fails, the daemon could go as far as
quiescing the system (a technique already used in RCU locking) and move some
not-so-easy-to-move pages.
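
In rough pseudo-C, the daemon's main loop might look something like the
following. This is only a sketch; every identifier in it is invented for
illustration, not an existing kernel interface:

    /* Sketch: defrag daemon main loop; all names are hypothetical */
    static int kdefragd(void *unused)
    {
        int order;

        for (;;) {
            wait_for_defrag_wakeup();   /* threshold trip or timeout */

            for (order = 1; order < MAX_ORDER; order++) {
                while (free_units(order) < defrag_threshold(order)) {
                    /* pass 1: build a free unit from easily movable pages */
                    if (assemble_free_unit(order, MOVE_EASY))
                        continue;
                    /* pass 2: quiesce and move the not-so-easy pages */
                    if (assemble_free_unit(order, MOVE_HARD))
                        continue;
                    break;  /* genuinely out of memory at this order */
                }
            }
        }
        return 0;
    }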

In order to move a page of physical memory, we need to know what is pointing
at it. This is often easy, for example in the common case when the only
pointer to a page is held by the page cache and the page is not under IO. We
only need to hold the page cache lock and the page table lock to move such a
page.
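
In outline, the easy case goes something like this (a sketch only,
restricted to the unmapped page so just the page cache lock comes into
play; field names are from memory, and can_move() and
replace_cache_page() are invented):

    /* Sketch: relocate a page cache page nothing else references */
    int move_cache_page(struct address_space *mapping, struct page *old)
    {
        struct page *new = alloc_page(GFP_KERNEL);

        if (!new)
            return -ENOMEM;
        spin_lock(&mapping->page_lock);        /* the page cache lock */
        if (!can_move(old)) {                  /* raced: under IO or extra refs */
            spin_unlock(&mapping->page_lock);
            __free_page(new);
            return -EAGAIN;
        }
        copy_highpage(new, old);               /* move the data first */
        replace_cache_page(mapping, old, new); /* swap the radix tree slot */
        spin_unlock(&mapping->page_lock);
        page_cache_release(old);               /* drop the cache's reference */
        return 0;
    }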

Moving anonymous memory is pretty easy as well, with the help of reverse page
mapping. We need to hold the appropriate locks, then walk a page's reverse
map list, updating pointers to a new copy of the page. (If we encounter
nasty corner cases here, we could possibly fall back to a quiescing
strategy.)
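
A sketch of that walk, with the pte_chain layout glossed over
(for_each_rmap_pte() and pte_to_pgprot() are invented names, and the
real thing would batch TLB flushes rather than do this):

    /* Sketch: repoint every pte mapping `old' at the new copy */
    static void move_anon_page(struct page *old, struct page *new)
    {
        pte_t *ptep;

        copy_highpage(new, old);
        pte_chain_lock(old);
        for_each_rmap_pte(old, ptep)    /* walk old's reverse map list */
            set_pte(ptep, mk_pte(new, pte_to_pgprot(*ptep)));
        pte_chain_unlock(old);
        flush_tlb_all();                /* heavy-handed but safe */
    }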

Some difficult situations may be dealt with by creating a new internal kernel
API that provides a way of notifying some subsystem that page ownership
information is wanted, or that certain pages should be reallocated according
to the wishes of the defragmentation daemon. Obviously, there is plenty of
opportunity for over-engineering in this area, but equally, there is
opportunity for clever design work that derives much benefit while avoiding
most of the potential complexity.

Physical memory defragmentation is an enabler for variable-sized pages, next
on my list.

2) Variable sized page objects

This item will no doubt seem as controversial as the first is uncontroversial.
It may help to know that my prototype code, done under 2.4, indicates that
the complete system actually gets smaller with this feature, and possibly
slightly faster as well. Essentially, if we have variable sized pages then
we can eliminate the messy buffer lists that are (sometimes) attached to
pages, and indeed, eliminate buffers themselves. Traditional buffer IO and
data operations can be trivially reexpressed in terms of pages, provided page
objects can take on the same range of sizes as buffers currently do. The
block IO library also gets shorter, since much looping and state handling
becomes redundant.

For my purposes, "variable sized" means that each struct page object can
reference a data frame of any power-of-two size, including sizes smaller than
a physical page. To keep the implementation sane, all pages of a given
address_space are the same size. Also, page objects smaller than a physical
page are only allowed for file-backed memory, not anonymous memory. Rules for
large pages remain to be determined; however, since there is already
considerable work being done in this area, I will not dwell on it further.
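
Concretely, the representation need be no more than a shift count hung
off the address_space; the field and helper names below are provisional:

    /* Sketch: per-address_space page size, expressed as a shift count */
    struct address_space {
        ...
        unsigned page_shift;    /* log2 of page size: 12 for 4K, 9 for 512 */
    };

    #define mapping_page_size(mapping)  (1UL << (mapping)->page_shift)

    /* file offset -> page cache index, using the mapping's own shift */
    static inline pgoff_t offset_to_index(struct address_space *mapping,
                                          loff_t pos)
    {
        return pos >> mapping->page_shift;
    }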

The most interesting aspect of variable sized pages is how subpages are
handled. This could get messy, but fortunately a simple approach is possible.
Subpage page structs do not need to reside in the mem_map; instead they can
be dynamically allocated from slab cache. The extra bookkeeping needed inside
the page cache operations to keep track of this is not much, and particularly,
does not add more than a sub-cycle penalty to the case where subpages are
not used (even so, I expect this penalty to be more than made up by shorter,
straighter paths in the block IO library).
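
Allocating a subpage then amounts to a slab allocation plus a little
initialization. A sketch, where the cache and the use of page->private
as a back pointer are assumptions of mine:

    /* Sketch: subpage descriptors come from slab, not the mem_map */
    static kmem_cache_t *subpage_cachep;

    struct page *alloc_subpage(struct page *base, int i, unsigned shift)
    {
        struct page *sub = kmem_cache_alloc(subpage_cachep, GFP_KERNEL);

        if (!sub)
            return NULL;
        *sub = *base;                         /* inherit flags and mapping */
        sub->private = (unsigned long)base;   /* back pointer to the frame */
        sub->index = (base->index << (PAGE_SHIFT - shift)) + i;
        return sub;
    }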

One benefit of subpages that may not be immediately obvious is the opportunity
to save some space in the mem_map array: with subpages it becomes quite
attractive to use a larger PAGE_CACHE_SIZE, since a filesystem that must use
a small block size for some reason won't cause a lot of additional internal
fragmentation.

But to my mind, the biggest benefit to subpages is the opportunity to
eliminate some redundant state information that is currently shared between
pages and buffer_heads. To be sure, I've been less than successful to date
at communicating the importance of this simplification, but in this case, the
code should be the proof.

Variable-size pages will deliver immediate benefits to filesystems such as
Ext2 and Ext3, in the form of larger volume size limits and more efficient
transfers. As a side effect, we will probably need to implement tail merging
in Ext2/3 to control the resulting increased internal fragmentation, but that
is another story, for another mailing list.

Variable size pages should fit together nicely with the current work being
done on large (2 and 4 MB) page handling, and especially, it will be nice for
architectures like MIPS that can optimize variable sized pages in
hardware.

Some bullet points:

- Rationalize state info: state represented only in struct page, not struct
page + list of struct buffer_head

- 1K filesystems aren't a special case any more

- More efficient IO path, esp for 1K fs

- Net removal of code by simplifying the vfs block IO library (new code
is added to page cache access functions).

- Makes the page locking unit match the filesystem locking unit for most
filesystems

- Generalizes to superpages

- Performance. It's a little more efficient. Eliminates one class of
objects, allowing us to concentrate more on the remaining class.

- Large file blocks (multiple physical pages)

- Eliminate buffer heads. Final use of buffer heads is as data handle for
filesystem metadata (IO function of buffer heads will be taken over by
BIO). Basically, we can just substitute struct page for struct
buffer_head. Can make this transition gradual, after establishing
one buffer head per page.

- Memory pressure now acts on only one class of object, making balancing
more even.

Relies on:

- Active defragmentation

How it works:

- Page size is represented on a per-address space basis with a shift count.
In practice, the smallest is 9 (512 byte sector), could imagine 7 (each
ext2 inode is separate page) or 8 (actual hardsect size on some drives).
12 will be the most common size. 13 gives 8K blocksize for, e.g., alpha.
21 and 22 give 2M and 4M page size, matching hardware capabilities of
x86, and other sizes are possible on machines like MIPS, where page size
is software controllable

- Implemented only for file-backed memory (page cache)

- Special case these ops in page cache access layer instead of having the
messy code in the block IO library

- Subpage struct pages are dynamically allocated. But buffer_heads are gone
so this is a lateral change.

3) Physical block cache

This item is not strictly concerned with memory management, but as it impacts
the same subsystems, I have included it in this note.

In brief, a physical block cache lets the vfs efficiently answer the question:
"given a physical data block, tell me if and where it is mapped into any
address_space on the same volume". This need not be a big change to the
existing strategy: the normal lookup and other operations remain available.
However, the vfs gets the additional responsibility of maintaining a special
per-volume address_space coherently with the multiple file-backed
address_spaces on the volume.
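
In code, the question reduces to an ordinary page cache lookup in the
volume's own mapping; a sketch (the function name is invented, the rest
is existing 2.5 machinery, cited from memory):

    /* Sketch: is physical block `blocknr' cached, and where?
     * bdev->bd_inode->i_mapping is the per-volume address_space. */
    struct page *find_block_page(struct block_device *bdev,
                                 sector_t blocknr, unsigned blockbits)
    {
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        pgoff_t index = blocknr >> (PAGE_CACHE_SHIFT - blockbits);

        return find_get_page(mapping, index);  /* NULL if not cached */
    }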

In fact, we already have such per-volume address spaces, and there really
isn't that much work to do here, in order to test the effects of making the
two types of address_space coherent with one another. One way of looking at
this is that full coherence in this area would complete the work of unifying the
page and buffer caches, started some years ago.

Having discussed this idea with a few developers, I've been warned that
difficult issues will arise with some of the more complex filesystems, such
as Ext3. Fortunately, a prototype physical block cache can be adequately
tested with Ext2 alone, which doesn't have a lot of issues. If this proves
to deliver the performance benefits I expect, further work would be justified
to extend the functionality to other filesystems.

So what are the anticipated performance benefits? I've identified two so far:

1. Physical readahead. That is, we can load a block into the page cache
before we know which address_space, if any, it actually belongs to.
Later, when we do know, additionally entering it into its proper
address space is efficient. This will help with the case of traversing
many small files, which Linux currently handles poorly.

2. Raid 5. The biggest performance problem with Raid 5 stems from the
fact that for small, isolated writes it is forced to read a few blocks
to compute every new parity block, and in the process suffers large
amounts of rotational latency. A big cache helps with this a great deal;
however, the size of cache we could expect to find in, e.g., a high
end scsi drive is not adequate to eliminate the bad effects, and in
any event, bus saturation becomes a very real problem. We could also
have a separate, physical block cache, but this wastes memory and
causes unnecessary copying. Being able to implement the big cache
directly in the page cache is thus a big advantage in terms of memory
savings and reduced data copying. There is also a latency advantage.

Summary

Please note that all of the above is unofficial, experimental work. However,
I do believe that all three of these items have the potential to deliver
substantial improvements in terms of reliability, efficiency and obvious
correctness.

Thank you for your indulgence in reading all the way down to here. The
timeframe for this work is:

- Starts as soon as 2.5 closes
- Prototypes to be posted shortly thereafter

Regards,

Daniel



2003-06-25 00:34:02

by William Lee Irwin III

Subject: Re: [RFC] My research agenda for 2.7

On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> - Page size is represented on a per-address space basis with a shift count.
> In practice, the smallest is 9 (512 byte sector), could imagine 7 (each
> ext2 inode is separate page) or 8 (actual hardsect size on some drives).
> 12 will be the most common size. 13 gives 8K blocksize for, e.g., alpha.
> 21 and 22 give 2M and 4M page size, matching hardware capabilities of
> x86, and other sizes are possible on machines like MIPS, where page size
> is software controllable
> - Implemented only for file-backed memory (page cache)

Per struct address_space? This is an unnecessary limitation.


On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> - Special case these ops in page cache access layer instead of having the
> messy code in the block IO library
> - Subpage struct pages are dynamically allocated. But buffer_heads are gone
> so this is a lateral change.

This gives me the same data structure proliferation chills as bh's.


-- wli

2003-06-25 00:52:17

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> > - Page size is represented on a per-address space basis with a shift
> > count. In practice, the smallest is 9 (512 byte sector), could imagine 7
> > (each ext2 inode is separate page) or 8 (actual hardsect size on some
> > drives). 12 will be the most common size. 13 gives 8K blocksize for,
> > e.g., alpha. 21 and 22 give 2M and 4M page size, matching hardware
> > capabilities of x86, and other sizes are possible on machines like MIPS,
> > where page size is software controllable
> > - Implemented only for file-backed memory (page cache)
>
> Per struct address_space? This is an unnecessary limitation.

It's a sensible limitation, it keeps the radix tree lookup simple.

> On Wed, Jun 25, 2003 at 01:11:01AM +0200, Daniel Phillips wrote:
> > - Special case these ops in page cache access layer instead of having
> > the messy code in the block IO library
> > - Subpage struct pages are dynamically allocated. But buffer_heads are
> > gone so this is a lateral change.
>
> This gives me the same data structure proliferation chills as bh's.

It's not nearly as bad. There is no distinction between subpage and base
struct page for almost all page operations, e.g., locking, IO, data access.

Regards,

Daniel

2003-06-25 00:56:30

by William Lee Irwin III

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
>> Per struct address_space? This is an unnecessary limitation.

On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> It's a sensible limitation, it keeps the radix tree lookup simple.

It severely limits its usefulness. Dropping in a more flexible data
structure should be fine.


On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
>> This gives me the same data structure proliferation chills as bh's.

On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> It's not nearly as bad. There is no distinction between subpage and base
> struct page for almost all page operations, e.g., locking, IO, data access.

But those are code sanitation issues. You need to make sure this
doesn't explode on PAE.


-- wli

2003-06-25 01:12:34

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 03:10, William Lee Irwin III wrote:
> On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> >> Per struct address_space? This is an unnecessary limitation.
>
> On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> > It's a sensible limitation, it keeps the radix tree lookup simple.
>
> It severely limits its usefulness. Dropping in a more flexible data
> structure should be fine.

Eventually it could well make sense to do that, e.g., the radix tree
eventually ought to evolve into a btree of extents (probably). But making
things so complex in the first version, thus losing much of the incremental
development advantage, would not be smart. With a single size of page per
address_space, changes to the radix tree code are limited to a couple of
lines, for example.

But perhaps you'd like to supply some examples where more than one size of
page in the same address space really matters?

> On Wednesday 25 June 2003 02:47, William Lee Irwin III wrote:
> >> This gives me the same data structure proliferation chills as bh's.
>
> On Wed, Jun 25, 2003 at 03:07:18AM +0200, Daniel Phillips wrote:
> > It's not nearly as bad. There is no distinction between subpage and base
> > struct page for almost all page operations, e.g., locking, IO, data
> > access.
>
> But those are code sanitation issues. You need to make sure this
> doesn't explode on PAE.

Indeed, that is important. Good night, see you tomorrow.

Daniel

2003-06-25 01:16:56

by William Lee Irwin III

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 03:10, William Lee Irwin III wrote:
>> It severely limits its usefulness. Dropping in a more flexible data
>> structure should be fine.

On Wed, Jun 25, 2003 at 03:25:47AM +0200, Daniel Phillips wrote:
> Eventually it could well make sense to do that, e.g., the radix tree
> eventually ought to evolve into a btree of extents (probably). But making
> things so complex in the first version, thus losing much of the incremental
> development advantage, would not be smart. With a single size of page per
> address_space, changes to the radix tree code are limited to a couple of
> lines, for example.
> But perhaps you'd like to supply some examples where more than one size of
> page in the same address space really matters?

Software-refill TLB architectures would very much like to be handed the
largest physically contiguous chunk of memory out of pagecache possible
and map it out using the fewest number of TLB entries possible. Dropping
in a B+ tree to replace radix trees should be a weekend project at worst.
Speculatively allocating elements that are "as large as sane/possible"
will invariably result in variable-sized elements in the same tree.


-- wli

2003-06-25 09:15:51

by Mel Gorman

Subject: Re: [RFC] My research agenda for 2.7

> 1) Active memory defragmentation
>
> I doubt anyone will deny that this is desirable. Active defragmentation will
> eliminate higher order allocation failures for non-atomic allocations, and I
> hope, generally improve the efficiency and transparency of the kernel memory
> allocator.
>

It might be just me, but this scheme sounds a bit complicated (I'm still
absorbing the other two). I find it difficult to see what happens when a
page used by a kernel pointer changes for any case other than vmalloc()
but I probably am missing something.

How about: Move order-0 allocations to slab (ignoring bootstrap issues for
now but shouldn't be hard anyway)

Each cache slab is 2^MAX_GFP_ORDER large and there are three caches:
o order0-user
o order0-kreclaim
o order0-knoreclaim

order0-user is for any userspace allocation. These pages should be
trivially reclaimable with rmap available. If a large order block is
necessary, select one slab and reclaim it. This will break LRU ordering
something rotten but I am guessing that LRU ordering is not the prime
concern here. If a defrag daemon exists, scan MAX_DEFRAG_SCAN slabs and
pick the one with the most clean filesystem backed pages to chuck out
(least IO involved in reclaim).

order0-kreclaim is for kernel allocations which are trivially reclaimable
and that can be safely discarded knowing that no pointer exists to them.
This is most likely to be usable for slab allocations of caches like
dentries which can be safely thrown out. A quick look at /proc/slabinfo
shows that most slabs are just 1 page large. Slabs already have a
set_shrinker() callback for the removal of objects so it is likely that
this could be extended for telling caches to clear all objects and discard
a particular slab.

order0-knoreclaim is for kernel allocations which cannot be easily
reclaimed and have pointers to the allocation which are difficult to
reclaim. For all intents and purposes, these are not reclaimable without
implementing swapping in kernel space.
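
Setting the caches up would be nothing unusual; a sketch (whether slab
can be told to always use 2^MAX_GFP_ORDER slabs the way I want is
hand-waved here):

    /* Sketch: three caches of page-sized objects */
    static kmem_cache_t *order0_user, *order0_kreclaim, *order0_knoreclaim;

    void __init order0_caches_init(void)
    {
        order0_user = kmem_cache_create("order0-user", PAGE_SIZE,
                        0, SLAB_HWCACHE_ALIGN, NULL, NULL);
        order0_kreclaim = kmem_cache_create("order0-kreclaim", PAGE_SIZE,
                        0, SLAB_HWCACHE_ALIGN, NULL, NULL);
        order0_knoreclaim = kmem_cache_create("order0-knoreclaim", PAGE_SIZE,
                        0, SLAB_HWCACHE_ALIGN, NULL, NULL);
    }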

This has a couple of things going for it

o Reclaimable pages are in easy to search globs
o Gets nifty per-cpu alloc and caching that comes with slab automatically
o Freeing high order pages is a case of discarding pages in one slab
o Doesn't require fancy pants updating of pointers or page tables
o Possibly ditch the mempool interface as slab already has similar functionality
o Seems simple

Opinions?

--
Mel Gorman

2003-06-26 18:45:55

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Wednesday 25 June 2003 11:29, Mel Gorman wrote:
> > 1) Active memory defragmentation
> >
> > I doubt anyone will deny that this is desirable. Active defragmentation
> > will eliminate higher order allocation failures for non-atomic
> > allocations, and I hope, generally improve the efficiency and
> > transparency of the kernel memory allocator.
>
> It might be just me, but this scheme sounds a bit complicated (I'm still
> absorbing the other two).

Mel,

I probably spent too much time dwelling on the hard cases of page moving, and
didn't sufficiently emphasize that handling a few easy cases would do a
pretty good job, certainly better than we do now. For example:

* Most process pages are easily movable, since usually only page tables
will hold references.

* Most page cache pages are easily movable, likewise

* Similarly, page table pages are not too hard to move

Most slab pages are hard to move. We could try to fix that for certain common
object types, or we could just tell slab to use its own biggish chunks of
memory, which it can play in as it sees fit.

> I find it difficult to see what happens when a
> page used by a kernel pointer changes for any case other than vmalloc()
> but I probably am missing something.

The point you apparently missed is that the defragger will identify and update
those kernel pointers, being careful about races of course.

> How about: Move order-0 allocations to slab (ignoring bootstrap issues for
> now but shouldn't be hard anyway)

That sounds like radical surgery to me, but to each his own experiments.

> Each cache slab is 2^MAX_GFP_ORDER large and there are three caches:
> o order0-user
> o order0-kreclaim
> o order0-knoreclaim
>
> order0-user is for any userspace allocation. These pages should be
> trivially reclaimable with rmap available. If a large order block is
> necessary, select one slab and reclaim it. This will break LRU ordering
> something rotten but I am guessing that LRU ordering is not the prime
> concern here. If a defrag daemon exists, scan MAX_DEFRAG_SCAN slabs and
> pick the one with the most clean filesystem backed pages to chuck out
> (least IO involved in reclaim).

Defragmentation by selective eviction is possible, but isn't necessarily
optimal. In the case of memory that isn't swap or file-backed, it isn't even
possible. On the other hand, you may think of the page move case as simply
an optimized evict-and-reload, if that helps understand where I'm going.

Regardless, it would be good to teach vmscan to evict pages that will help
build higher order allocation units, when these are in short supply. This
would be an optimization heuristic; it would still be necessary to handle the
general case of data moving in order to make any guarantee re fragmentation
control.

> order0-kreclaim is for kernel allocations which are trivially reclaimable
> and that can be safely discarded knowing that no pointer exists to them.
> This is most likely to be usable for slab allocations of caches like
> dentries which can be safely thrown out. A quick look at /proc/slabinfo
> shows that most slabs are just 1 page large. Slabs already have a
> set_shrinker() callback for the removal of objects so it is likely that
> this could be extended for telling caches to clear all objects and discard
> a particular slab.
>
> order0-knoreclaim is for kernel allocations which cannot be easily
> reclaimed and have pointers to the allocation which are difficult to
> reclaim. For all intents and purposes, these are not reclaimable without
> implementing swapping in kernel space.
>
> This has a couple of things going for it
>
> o Reclaimable pages are in easy to search globs
> o Gets nifty per-cpu alloc and caching that comes with slab automatically
> o Freeing high order pages is a case of discarding pages in one slab
> o Doesn't require fancy pants updating of pointers or page tables

Without updating pointers, active defragmentation is not possible. But
perhaps you meant to say that active defragmentation is not needed?

> o Possibly ditch the mempool interface as slab already has similar
> functionality
> o Seems simple
>
> Opinions?

Yes :-)

Regards,

Daniel

2003-06-26 19:47:46

by Mel Gorman

Subject: Re: [RFC] My research agenda for 2.7

On Thu, 26 Jun 2003, Daniel Phillips wrote:

> * Most process pages are easily movable, since usually only page tables
> will hold references.
>
> * Most page cache pages are easily movable, likewise
>
> * Similarly, page table pages are not too hard to move
>

I think that finding pages like this together is unlikely, especially if
the system has been running a long time. In the worst case you will have
every easily-moved page adjacent to a near-impossible-to-move page. The
buddy allocator could be taught to be selective of which freelists it used
for different types of allocation but that doesn't sound any easier than
moving order0 pages to slab.

There would also be trouble identifying adjacent pages of order 0 which
are used for these particular tasks. Buddy allocators, including the
one implemented in Linux, do not record what order allocation a struct
page belongs to. I could be wrong but it just feels like it would be a lot
of work to free up just a few adjacent pages.

Tentatively, I would assert that being able to group these types of pages
together is a prerequisite for effective defragging and I think moving
order-0 pages to slab would enforce this grouping. For example, the defrag
daemon could scan just order0-user slabs and move pages from sparsely
populated slabs to other fuller slabs and free slabs as it goes.

In other words, order0 caches are not a mutually exclusive goal to
defragging but I bet you a shiny penny it'd make the implementation a lot
easier.

> Most slab pages are hard to move. We could try to fix that for certain
> common object types, or we could just tell slab to use its own biggish
> chunks of memory, which it can play in as it sees fit.
>

Moving slab pages is probably not an option unless some quick way of
updating all the pointers to the objects is found. I would say the only
slab pages that are "movable" belong to caches which can quickly discard
their objects.

> > I find it difficult to see what happens when a
> > page used by a kernel pointer changes for any case other than vmalloc()
> > but I probably am missing something.
>
> The point you apparently missed is that the defragger will identify and
> update those kernel pointers, being careful about races of course.
>

I didn't miss it as such, but I don't see how it could be implemented
either. I also wonder if moving kernel pages is really worth the hassle.
It's perfectly possible that defragging only user pages will be enough.
The only way to be sure would be to walk all struct pages in the mem_map
and see what the layout looks like.

> > How about: Move order-0 allocations to slab (ignoring bootstrap issues for
> > now but shouldn't be hard anyway)
>
> That sounds like radical surgery to me, but to each his own experiments.
>

I don't think it would be too radical because the buddy allocator would
still remain the principal page allocator. What would be a bitch is the
additional storage required for such a large number of slabs
needed to contain pages. At the very least, the slab allocator would need
a means of discarding the slab descriptors for full slabs and maintaining
just pointers to the first page of each full slab. A comparison of page
allocation from buddy and page allocation from slab would also be
necessary to make sure the page allocation performance wasn't butchered.

> Defragmentation by selective eviction is possible, but isn't necessarily
> optimal. In the case of memory that isn't swap or file-backed, it isn't even
> possible. On the other hand, you may think of the page move case as simply
> an optimized evict-and-reload, if that helps understand where I'm going.
>

Well, take this case

1) Slabs can contain 10 order0 pages
2) Defragger finds two slabs; slabA with 2 pages and slabB with 5
3) Defragger copies 2 pages from slabA to slabB and uses rmap to update
page tables

The alternative sounds like it would be scanning mem_map looking for pages
that can be moved which sounds expensive. With order0 in slab, you can
easily find blocks of pages which, when moved, immediately free up a large
adjacent block.

> > o Reclaimable pages are in easy to search globs
> > o Gets nifty per-cpu alloc and caching that comes with slab automatically
> > o Freeing high order pages is a case of discarding pages in one slab
> > o Doesn't require fancy pants updating of pointers or page tables
>
> Without updating pointers, active defragmentation is not possible. But
> perhaps you meant to say that active defragmentation is not needed?
>

I am saying that full defragmentation would be a very expensive operation
which might not work if it cannot find suitably freeable adjacent pages.
If order0 pages were in slab, the whole searching problem becomes trivial
(just go to the relevant cache and scan the slabs).

> > o Possible ditch the mempool interface as slab already has similar
> > functionality
> > o Seems simple
> >
> > Opinions?
>
> Yes :-)
>

heh.

--
Mel Gorman

2003-06-26 19:55:47

by Andrew Morton

Subject: Re: [RFC] My research agenda for 2.7

Mel Gorman <[email protected]> wrote:
>
> Buddy allocators, including the
> one implemented in Linux, do not record what order allocation a struct
> page belongs to.

We can do that.


--- 25/mm/page_alloc.c~a 2003-06-26 13:09:11.000000000 -0700
+++ 25-akpm/mm/page_alloc.c 2003-06-26 13:09:24.000000000 -0700
@@ -123,6 +123,7 @@ static void prep_compound_page(struct pa
 		SetPageCompound(p);
 		p->lru.next = (void *)page;
 	}
+	page[1].index = order;
 }
 
 static void destroy_compound_page(struct page *page, unsigned long order)

_
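
Then the order can be read back with something like this (untested):

    /* untested: recover the order recorded above; only meaningful
     * for the head page of a compound page */
    static inline unsigned long page_order(struct page *page)
    {
        return page[1].index;
    }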

2003-06-27 00:20:26

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Thursday 26 June 2003 22:01, Mel Gorman wrote:
> I think that finding pages like this together is unlikely, especially if
> the system has been running a long time. In the worst case you will have
> every easily-moved page adjacent to a near-impossible-to-move page.

Mel,

I addressed that in my previous post: "Most slab pages are hard to move ... we
could just tell slab to use its own biggish chunks of memory, which it can
play in as it sees fit". Another way of putting this is, we isolate the type
of data that's hard to defrag in its own space, so it can't fragment our
precious movable space. That separate space can expand and contract
chunkwise.

To be sure, this solution means we still can't forcibly create higher order
units in those undefraggable regions, but that is no worse than the current
situation. In the defraggable regions, hopefully the majority of memory, we
have a much improved situation: we can forcibly create as many high order
allocation units as we need, in order to handle, e.g., an Ext2/3 filesystem
that's running with a large blocksize.

> Moving slab pages is probably not an option unless some quick way of
> updating all the pointers to the objects is found.

That was the part in my original post about a "new API". If we choose to do
so, we could fix up most kinds of slab objects to support some protocol to
assist in the effort, or use a data structure better suited to relocation. I
doubt it's necessary to go that far. As far as I can see, slab is not a big
offender in terms of fragmentation at the moment.

> I also wonder if moving kernel pages is really worth the hassle.

That's the question of course. The benefit is getting rid of high order
allocation failures, and gaining some confidence that larger filesystem
blocksizes will work reliably, however the workload evolves. The cost is
some additional complexity, no argument about that. On the other hand, it
would no longer be necessary to use voodoo to try to get the allocator to
magically not fragment.

> It's perfectly possible that defragging only user pages will be enough.

Indeed. That's an excellent place to start. We'd include page cache pages in
that, I'd expect. Then it would be attractive to go on to handle page table
pages, and with that we'd have the low-hanging fruit.

> The alternative sounds like it would be scanning mem_map looking for pages
> that can be moved which sounds expensive.

It looks linear to me, and it's not supposed to happen often. It would
typically be a response to a nasty situation like you'd get by mounting a
volume that uses 32K blocks after running a long time with mostly 4K blocks.

> > Without updating pointers, active defragmentation is not possible. But
> > perhaps you meant to say that active defragmentation is not needed?
>
> I am saying that full defragmentation would be a very expensive operation
> which might not work if it cannot find suitably freeable adjacent pages.

Full defragmentation, whatever that is, would be pointless for the
applications I've described, though perhaps it could be needed by some
bizarre application like software hotplug memory. In the latter case you
wouldn't be too worried about how long it takes.

In general though, the defragger would only run long enough to restore balance
to the free counts across the various orders. In rare cases, it would have
to run synchronously in order to prevent an allocation from failing.

> If order0 pages were in slab, the whole searching problem becomes trivial
> (just go to the relevant cache and scan the slabs).

You might want to write a separate [rfc] to describe your idea. For one
thing, I don't see why you'd want to use slab for that.

Regards,

Daniel

2003-06-27 12:46:31

by Mel Gorman

Subject: Re: [RFC] My research agenda for 2.7

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> On Thursday 26 June 2003 22:01, Mel Gorman wrote:
> > I think that finding pages like this together is unlikely, especially if
> > the system has been running a long time. In the worst case you will have
> > every easily-moved page adjacent to a near-impossible-to-move page.
>
> I addressed that in my previous post: "Most slab pages are hard to move
> ... we could just tell slab to use its own biggish chunks of memory,
> which it can play in as it sees fit".

Ah, ok, sorry, I missed that, but it wouldn't be the first time I missed
something. It was just a few days ago that I wrote a pile of material on the
new buddy allocator as part of a publication and still missed that the
order of pages can be identified because of compound pages; thanks, Andrew.

> > I also wonder if moving kernel pages is really worth the hassle.
>
> That's the question of course. The benefit is getting rid of high order
> allocation failures, and gaining some confidence that larger filesystem
> blocksizes will work reliably, however the workload evolves.

I'm still working on 2.6 documentation which I expect will be published in
a few months. When I get that written, I'll look into seeing what can be
done with VM Regress to calculate fragmentation and to see how often
high order allocations actually fail. It might help determine where
defragging is most needed.

IIRC, Martin J. Bligh had a patch which displayed information about the
buddy allocator freelist so that will probably be the starting point. From
there, it should be handy enough to see how intermixed kernel page
allocations are with user allocations. It might turn out that kernel pages
tend to be clustered together anyway.

> > If order0 pages were in slab, the whole searching problem becomes trivial
> > (just go to the relevant cache and scan the slabs).
>
> You might want to write a separate [rfc] to describe your idea. For one
> thing, I don't see why you'd want to use slab for that.
>

You're right, I will need to write a proper RFC one way or the other. I
was thinking of using slabs because that way there wouldn't be a need to
scan all of mem_map, just a small number of slabs. I have no basis for
this other than hand waving gestures though.

Anyway, as I know I won't be coding any time soon due to writing docs,
I'll shut up for the moment :-)

--
Mel Gorman

2003-06-27 14:26:44

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 15:00, Mel Gorman wrote:
> I was thinking of using slabs because that way there wouldn't be a need to
> scan all of mem_map, just a small number of slabs. I have no basis for
> this other than hand waving gestures though.

That's the right idea, it's just not necessary to use slab cache to achieve
it. Actually, to handle huge (hardware) pages efficiently, my first instinct
is to partition them into their own largish chunks as well, and allocate new
chunks as necessary. To get rid of a chunk (because freespace of that type
of chunk has fallen below some threshold) it has to be entirely empty, which
can be accomplished using the same move logic as for defragmentation.

You're right to be worried about intrusion of unmovable pages into regions
that are supposed to be defraggable. It's very easy for some random kernel
code to take a use count on a page and hang onto it forever, making the page
unmovable. My hope is that:

- This doesn't happen much
- Code that does that can be cleaned up
- Even when it happens it won't hurt much
- If all of the above fails, fix the api for the offending code or create
a new one
- If that fails too, give up.

Regards,

Daniel

2003-06-27 14:24:57

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [RFC] My research agenda for 2.7


> IIRC, Martin J. Bligh had a patch which displayed information about the
> buddy allocator freelist so that will probably be the starting point. From
> there, it should be handy enough to see how intermixed kernel page
> allocations are with user allocations. It might turn out that kernel pages
> tend to be clustered together anyway.

That should be merged now - /proc/buddyinfo. I guess you could do the same
for allocated pages (though it'd be rather heavyweight ;-))

M.

2003-06-27 14:31:37

by Martin J. Bligh

Subject: Re: [RFC] My research agenda for 2.7


>> > I also wonder if moving kernel pages is really worth the hassle.
>>
>> That's the question of course. The benefit is getting rid of high order
>> allocation failures, and gaining some confidence that larger filesystem
>> blocksizes will work reliably, however the workload evolves.

Oh, BTW ... I suspect you've realised this already, but ....

The buddy allocator is not a good system for getting rid of fragmentation.
If I group pages together in aligned pairs, and F is free and A is
allocated, it'll not do anything useful with this:

F A A F F A A F F A A F F A A F F A F A

because the adjacent "F"s aren't "buddies". It seems that the purpose of
the buddy allocator was to be quick at allocating pages. Now that we've stuck
a front end cache on it, in the form of hot & cold pages, that goal no
longer seems paramount - altering it to reduce fragmentation at the source,
rather than actively defragging afterwards, would seem like a good goal to me.

M.

2003-06-27 14:39:33

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 16:43, Martin J. Bligh wrote:
> The buddy allocator is not a good system for getting rid of fragmentation.

We've talked in the past about throwing out the buddy allocator and adopting
something more modern and efficient and I hope somebody will actually get
around to doing that. In any event, defragging is an orthogonal issue. Some
allocation strategies may be statistically more resistant to fragmentation
than others, but no allocator has been invented, or ever will be, that can
guarantee that terminal fragmentation will never occur - only active
defragmentation can provide such a guarantee.

Regards,

Daniel

2003-06-27 14:50:47

by Martin J. Bligh

Subject: Re: [RFC] My research agenda for 2.7

--Daniel Phillips <[email protected]> wrote (on Friday, June 27, 2003 16:54:46 +0200):

> On Friday 27 June 2003 16:43, Martin J. Bligh wrote:
>> The buddy allocator is not a good system for getting rid of fragmentation.
>
> We've talked in the past about throwing out the buddy allocator and adopting
> something more modern and efficient and I hope somebody will actually get
> around to doing that. In any event, defragging is an orthogonal issue. Some
> allocation strategies may be statistically more resistant to fragmentation
> than others, but no allocator has been invented, or ever will be, that can
> guarantee that terminal fragmentation will never occur - only active
> defragmentation can provide such a guarantee.

Whilst I agree with that in principle, it's inevitably expensive. Thus
whilst we may need to have that code, we should try to avoid using it ;-)

The buddy allocator is obviously flawed in this department ... strategies
like allocating a 4M block to a process up front, then allocating out of that
until we're low on mem, then reclaiming in as large blocks as possible from
those process caches, etc, etc, would obviously help too. Though maybe
we're just permanently low on mem after a while, so it'd be better to just
group pagecache pages together ... that would actually be pretty simple to
change ... hmmm.

M.

2003-06-27 15:05:12

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 17:04, Martin J. Bligh wrote:
> Daniel Phillips <[email protected]> wrote (on Friday, June 27, 2003 16:54:46 +0200):
> > Some allocation strategies may be statistically more
> > resistant to fragmentation than others, but no allocator has been
> > invented, or ever will be, that can guarantee that terminal fragmentation
> > will never occur - only active defragmentation can provide such a
> > guarantee.
>
> Whilst I agree with that in principle, it's inevitably expensive. Thus
> whilst we may need to have that code, we should try to avoid using it ;-)

That's exactly the idea. Active defragmentation is just a fallback to handle
currently-unhandled corner cases. A good, efficient allocator that resists
fragmentation in the first place is still needed.

Regards,

Daniel

2003-06-27 15:10:31

by Mel Gorman

Subject: Re: [RFC] My research agenda for 2.7

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> On Friday 27 June 2003 17:04, Martin J. Bligh wrote:
> > Daniel Phillips <[email protected]> wrote (on Friday, June 27, 2003 16:54:46 +0200):
> > > Some allocation strategies may be statistically more
> > > resistant to fragmentation than others, but no allocator has been
> > > invented, or ever will be, that can guarantee that terminal fragmentation
> > > will never occur - only active defragmentation can provide such a
> > > guarantee.
> >
> > Whilst I agree with that in principle, it's inevitably expensive. Thus
> > whilst we may need to have that code, we should try to avoid using it ;-)
>
> That's exactly the idea. Active defragmentation is just a fallback to handle
> currently-unhandled corner cases. A good, efficient allocator that resists
> fragmentation in the first place is still needed.
>

I still suspect moving order0 allocations to slab will be a fragmentation
resistant allocator but my main concern would be that the slab allocator
overhead, in both CPU and storage requirements, will be too high.

On the other hand, it would do some things you are looking for. For
example, it allocates large blocks of memory in one lump and then
allocates them piecemeal. Second, it would be resistant to the FAFAFA
problem Martin pointed out. As slabs would be allocated in a large block
from the buddy, you are guaranteed that you'll be able to free up buddies.
Lastly, as there would be a cache specifically for userspace pages, a
defragger that looked exclusively at user pages would still be sure of
being able to free adjacent buddies.

I need to write a proper RFC.....

--
Mel Gorman

2003-06-27 15:35:43

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 17:22, Mel Gorman wrote:
> I still suspect moving order0 allocations to slab will be a fragmentation
> resistant allocator but my main concern would be that the slab allocator
> overhead, in both CPU and storage requirements, will be too high.
>
> On the other hand, it would do some things you are looking for. For
> example, it allocates large blocks of memory in one lump and then
> allocates them piecemeal. Second, it would be resistant to the FAFAFA
> problem Martin pointed out. As slabs would be allocated in a large block
> from the buddy, you are guaranteed that you'll be able to free up buddies.
> Lastly, as there would be a cache specifically for userspace pages, a
> defragger that looked exclusively at user pages would still be sure of
> being able to free adjacent buddies.
>
> I need to write a proper RFC.....

You might want to have a look at this:

http://www.research.att.com/sw/tools/vmalloc/
(Vmalloc: A Memory Allocation Library)

Regards,

Daniel

2003-06-27 15:45:40

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Friday 27 June 2003 17:50, Daniel Phillips wrote:
> You might want to have a look at this:
>
> http://www.research.att.com/sw/tools/vmalloc/
> (Vmalloc: A Memory Allocation Library)

Whoops, I retract that. I did a quick scan of the page and ran into a link
labeled "Non-Commercial License Agreement and Software", attached to some form
of license loaded with ugly patent words and so on. My immediate reaction
is: stay far, far away from that work, it is encumbered. It would be nice if
somebody knows differently, but for now I must ignore Vo's work.

Regards,

Daniel

2003-06-29 19:11:21

by Mel Gorman

Subject: Re: [RFC] My research agenda for 2.7

On Fri, 27 Jun 2003, Daniel Phillips wrote:

> > http://www.research.att.com/sw/tools/vmalloc/
> > (Vmalloc: A Memory Allocation Library)
>
> Whoops, I retract that. I did a quick scan of the page and ran into a link
> labeled "Non-Commercial License Agreement and Software", attached to some form
> of license loaded with ugly patent words and so on.

The paper itself is still freely available and the idea could still be
implemented, but I still think that the slab allocator will be sufficient
for order0 allocations, using caches to cluster the different types of
order0 allocations together. I've read that paper before and I think it is
overkill for the kernel because the allocation pattern isn't that
complicated in comparison to userspace memory management which that
allocator is aimed at. Not to mention that to get the most out of the
allocator, you need to know what your allocation pattern is going to be
like in advance.

Of course, at this point, I still need to write an RFC about using slab
for order0 allocations and ultimately I'll have to put code where my emails
are :-/

To help show why slab might work though, I updated vmregress and released
0.10, available at http://www.csn.ul.ie/~mel/projects/vmregress/. The main
addition is that it now can collect some basic statistics on the types of
page allocations. Specifically, with the trace_alloc module loaded and the
trace_pagealloc.diff patch applied (see kernel_patches/ directory),
/proc/vmregress/trace_alloc will print out the number of allocations for
each order that used either the GFP_KERNEL or GFP_USER flags.

To test it, I booted a clean 2.5.73 system[1], started X and ran
konqueror. The results were;

Kernel allocations
------------------
Before X   After X
Order 0: 6886 14595
Order 1: 335 383
Order 2: 17 23
Order 3: 2 2
Order 4: 3 3
Order 5: 0 0
Order 6: 0 0
Order 7: 0 0
Order 8: 0 0
Order 9: 0 0
Order 10: 0 0

Userspace allocations
---------------------
Order 0: 53050 71505
Order 1: 1 1
Order 2: 0 0
Order 3: 0 0
Order 4: 0 0
Order 5: 0 0
Order 6: 0 0
Order 7: 0 0
Order 8: 0 0
Order 9: 0 0
Order 10: 0 0

As you can see, order0 allocations were a *lot* more common, at least in
my system. Because they are so common in comparison to other orders, I
think that putting order0 in slabs of size 2^MAX_ORDER will make
defragmentation *so* much easier, if not plain simple, because you can
shuffle around order0 pages in the slabs to free up one slab which frees
up one large 2^MAX_ORDER adjacent block of pages.

It's possible that a journalled filesystem (or some other subsystem I am
not familiar with) would show that order > 0 allocations are a lot more
important than my system shows, but the means to collect the information is
in vmregress, so be my guest. When the time requires, I'll extend
vmregress to see how fragmented memory actually gets.

Much much later, it would be worth looking at the buddy allocator and
seeing whether it would be worth changing to something much simpler, as most of its
work would be handled by the slab allocator instead.

[1] Tonnes of errors were spewed out on uninitialised timers. With highmem
enabled, I got a sleeping-while-atomic warning in a loop caused
by default_idle(). I still have to see whether I can get around to
tracking down whether it's a PEBKAC problem or a kernel bug.

--
Mel Gorman

2003-06-29 20:54:28

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Sunday 29 June 2003 21:25, Mel Gorman wrote:
> As you can see, order0 allocations were a *lot* more common, at least in
> my system.

Mel,

There's no question that that's the case today. However, there are good
reasons for using a largish filesystem blocksize, 16K for example, once it
becomes possible to do so. With an active volume mounted using 16K blocks,
you'd see that the balance of allocations shifts towards order 2. The size
of the shift will be workload-dependent, ranging from almost no order 2
allocations, to almost all. To keep things interesting, it's quite possible
for the balance to change suddenly and/or strongly.

> Because they are so common in comparison to other orders, I
> think that putting order0 in slabs of size 2^MAX_ORDER will make
> defragmentation *so* much easier, if not plain simple, because you can
> shuffle around order0 pages in the slabs to free up one slab which frees
> up one large 2^MAX_ORDER adjacent block of pages.

But how will you shuffle those pages around?

Regards,

Daniel

2003-06-29 21:12:39

by Mel Gorman

Subject: Re: [RFC] My research agenda for 2.7

On Sat, 28 Jun 2003, Daniel Phillips wrote:

> There's no question that that's the case today. However, there are good
> reasons for using a largish filesystem blocksize, 16K for example, once it
> becomes possible to do so.

Fair enough. I am not very familiar with filesystems or why it would be
desirable to have a blocksize greater than a page size. I'd be grateful if
you'd recommend a good book, paper or documentation set that would
enlighten me.

> > Because they are so common in comparison to other orders, I
> > think that putting order0 in slabs of size 2^MAX_ORDER will make
> > defragmentation *so* much easier, if not plain simple, because you can
> > shuffle around order0 pages in the slabs to free up one slab which frees
> > up one large 2^MAX_ORDER adjacent block of pages.
>
> But how will you shuffle those pages around?
>

That's where your defragger would need to kick in. The defragger would scan
at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache,
where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
slab with the fewest objects in it and either:

a) Reclaim the pages by swapping them out or whatever
b) Copy the pages to another slab and update the page tables via rmap

Once the pages are copied from the selected slab, destroy it and you have
a large block of free pages. The point which I am getting across, quite
badly, is that by having order0 pages in slab, you are guaranteed to be
able to quickly find a cluster of pages to move which will free up a
contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
the current slab implementation.
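
Roughly, and with every helper below invented:

    /* Sketch: empty the most-nearly-empty slab to free one
     * contiguous 2^MAX_GFP_ORDER block of pages */
    int defrag_order0_user(int max_scan)
    {
        struct slab *victim = emptiest_slab(order0_user, max_scan);
        struct page *page;

        if (!victim)
            return -EAGAIN;
        for_each_page_in_slab(victim, page) {
            if (move_page(page) != 0)    /* try copy + rmap pte fixup... */
                reclaim_page(page);      /* ...else swap or write it out */
        }
        destroy_slab(order0_user, victim);  /* big adjacent block now free */
        return 0;
    }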

For kernel pages, it would get a lot more complicated but at least the
kernel pages would be clustered together in the order0 cache for kernel
pages.

--
Mel Gorman

2003-06-29 21:39:25

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC] My research agenda for 2.7

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
> On Sat, 28 Jun 2003, Daniel Phillips wrote:
> > > Because they are so common in comparison to other orders, I
> > > think that putting order0 in slabs of size 2^MAX_ORDER will make
> > > defragmentation *so* much easier, if not plain simple, because you can
> > > shuffle around order0 pages in the slabs to free up one slab which
> > > frees up one large 2^MAX_ORDER adjacent block of pages.
> >
> > But how will you shuffle those pages around?
>
> That's where your defragger would need to kick in. The defragger would scan
> at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache,
> where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
> slab with the fewest objects in it and either:
> a) Reclaim the pages by swapping them out or whatever
> b) Copy the pages to another slab and update the page tables via rmap
> Once the pages are copied from the selected slab, destroy it and you have
> a large block of free pages.

Yes, now we're on the same page, so to speak. Personally, I don't have much
interest in working on improving the allocator per se. I'd love to see
somebody else take a run at that, and I will occupy myself with the gritty
details of how to move pages without making the system crater.

> The point which I am getting across, quite
> badly, is that by having order0 pages in slab, you are guaranteed to be
> able to quickly find a cluster of pages to move which will free up a
> contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
> the current slab implementation.

I don't see that it's guaranteed, but I do see that organizing pages in
slab-like chunks is a good thing to do - a close reading of my original
response to you shows I was thinking about that.

I also don't see that the slab cache in its current incarnation is the right
tool for the job. It handles things that we just don't need to handle, such
as objects of arbitrary size and alignment. I'm sure you could make it work,
but why not just tweak alloc_pages to know about chunks instead?

Regards,

Daniel

2003-06-29 21:54:02

by William Lee Irwin III

Subject: Re: [RFC] My research agenda for 2.7

On Sunday 29 June 2003 23:26, Mel Gorman wrote:
>> That's where your defragger would need to kick in. The defragger would scan
>> at most MAX_DEFRAG_SCAN slabs belonging to the order0 userspace cache,
>> where MAX_DEFRAG_SCAN is related to how urgent the request is. Select the
>> slab with the fewest objects in it and either:
>> a) Reclaim the pages by swapping them out or whatever
>> b) Copy the pages to another slab and update the page tables via rmap
>> Once the pages are copied from the selected slab, destroy it and you have
>> a large block of free pages.

On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> Yes, now we're on the same page, so to speak. Personally, I don't have much
> interest in working on improving the allocator per se. I'd love to see
> somebody else take a run at that, and I will occupy myself with the gritty
> details of how to move pages without making the system crater.

This sounds like it's behind dependent on physically scanning slabs,
since one must choose slab pages for replacement on the basis of their
potential to restore contiguity, not merely "dump whatever's replaceable
and check how much got freed".


On Sunday 29 June 2003 23:26, Mel Gorman wrote:
>> The point which I am getting across, quite
>> badly, is that by having order0 pages in slab, you are guaranteed to be
>> able to quickly find a cluster of pages to move which will free up a
>> contiguous block of 2^MAX_ORDER pages, or at least 2^MAX_GFP_ORDER with
>> the current slab implementation.

On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> I don't see that it's guaranteed, but I do see that organizing pages in
> slab-like chunks is a good thing to do - a close reading of my original
> response to you shows I was thinking about that.
> I also don't see that the slab cache in its current incarnation is the right
> tool for the job. It handles things that we just don't need to handle, such
> as objects of arbitrary size and alignment. I'm sure you could make it work,
> but why not just tweak alloc_pages to know about chunks instead?

A larger question in my mind is what slabs have to do with this at all.
Slabs are for fixed-size allocations. These techniques are explicitly
oriented toward making user allocations more variable, not less.


-- wli

2003-06-29 23:03:00

by Daniel Phillips

Subject: Re: [RFC] My research agenda for 2.7

On Monday 30 June 2003 00:07, William Lee Irwin III wrote:
> On Sat, Jun 28, 2003 at 11:54:43PM +0200, Daniel Phillips wrote:
> > ...I will occupy myself with the
> > gritty details of how to move pages without making the system crater.
>
> This sounds like it's behind dependent on physically scanning slabs,
> since one must choose slab pages for replacement on the basis of their
> potential to restore contiguity, not merely "dump whatever's replaceable
> and check how much got freed".

Though I'm not sure what "behind dependent" means, and I'm not the one
advocating slab for this, it's quite correct that scanning strategy would
need to change, at least when the system runs into cross-order imbalances.
But this isn't much different from the kinds of things we do already.

Regards,

Daniel

2003-07-02 20:58:59

by Mike Fedyk

Subject: Re: [RFC] My research agenda for 2.7

On Fri, Jun 27, 2003 at 02:00:42PM +0100, Mel Gorman wrote:
> You're right, I will need to write a proper RFC one way or the other. I
> was thinking of using slabs because that way there wouldn't be a need to
> scan all of mem_map, just a small number of slabs. I have no basis for
> this other than hand waving gestures though.

Mel,

This sounds much like something I was reading from Larry McVoy using page
objects (like one level higher in magnatude than pages).

I don't remember the URL, but there was something pretty extensive from
Larry already explaining the concept.

2003-07-03 01:50:40

by Larry McVoy

[permalink] [raw]
Subject: Re: [RFC] My research agenda for 2.7

On Wed, Jul 02, 2003 at 02:10:55PM -0700, Mike Fedyk wrote:
> On Fri, Jun 27, 2003 at 02:00:42PM +0100, Mel Gorman wrote:
> > You're right, I will need to write a proper RFC one way or the other. I
> > was thinking of using slabs because that way there wouldn't be a need to
> > scan all of mem_map, just a small number of slabs. I have no basis for
> > this other than hand waving gestures though.
>
> Mel,
>
> This sounds much like something I was reading from Larry McVoy using page
> objects (like one level higher in magnitude than pages).
>
> I don't remember the URL, but there was something pretty extensive from
> Larry already explaining the concept.

If we're thinking about the same thing, the basic idea was to store
information into a higher level object and make more intelligent paging
decisions based on the higher level object. In my brain, since I'm a
SunOS guy, that means that you store information in the vnode (inode)
which reflects the status of all pages backed by this inode.

Instead of trying to figure out what to do at the page level, you figure
out what to do at the object level.

Some postings about this:

http://groups.google.com/groups?q=topvn+mcvoy&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3cgeu9%24h96%40fido.asd.sgi.com&rnum=1

http://groups.google.com/groups?q=vnode+mcvoy&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=l0ojgnINN59t%40appserv.Eng.Sun.COM&rnum=12

I can't find the writeup that you are thinking about. I know what you mean,
there was a discussion of paging algs and I went off about how scanning a
page at a time is insane. If someone finds the URL, let me know.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2003-07-03 02:06:35

by William Lee Irwin III

Subject: Re: [RFC] My research agenda for 2.7

On Wed, Jul 02, 2003 at 07:04:45PM -0700, Larry McVoy wrote:
> If we're thinking about the same thing, the basic idea was to store
> information into a higher level object and make more intelligent paging
> decisions based on the higher level object. In my brain, since I'm a
> SunOS guy, that means that you store information in the vnode (inode)
> which reflects the status of all pages backed by this inode.
> Instead of trying to figure out what to do at the page level, you figure
> out what to do at the object level.
> Some postings about this:
> http://groups.google.com/groups?q=topvn+mcvoy&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=3cgeu9%24h96%40fido.asd.sgi.com&rnum=1
> http://groups.google.com/groups?q=vnode+mcvoy&start=10&hl=en&lr=&ie=UTF-8&oe=UTF-8&selm=l0ojgnINN59t%40appserv.Eng.Sun.COM&rnum=12
> I can't find the writeup that you are thinking about. I know what you mean,
> there was a discussion of paging algs and I went off about how scanning a
> page a time is insane. If someone finds the URL let me know.

I believe people are already working on file object local page replacement,
though it's more in the planning than implementation phase.


-- wli