2005-01-12 22:47:43

by Tolentino, Matthew E

Subject: RE: [RFC] Avoiding fragmentation through different allocator

Hi Mel!

>Instead of having one global MAX_ORDER-sized array of free lists, there
>are three, one for each type of allocation. Finally, there is a list of
>pages of size 2^MAX_ORDER which is a global pool of the largest pages
>the kernel deals with.

I've got a patch that I've been testing recently for memory
hotplug that does nearly the exact same thing - break up the
management of page allocations based on type - after having
had a number of conversations with Dave Hansen on this topic.
I've also prototyped this to use as an alternative to adding
duplicate zones for delineating between memory that may be
removed and memory that is not likely to ever be removable. I've
only tested in the context of memory hotplug, but it does
greatly simplify memory removal within individual zones. Your
distinction between areas is pretty cool considering I've only
distinguished at the coarser granularity of user vs. kernel
to date. It would be interesting to throw KernelNonReclaimable
into the mix as well although I haven't gotten there yet... ;-)

>Once a 2^MAX_ORDER block of pages is split for a type of allocation, it
>is added to the free-lists for that type, in effect reserving it. Hence,
>over time, pages of the related types can be clustered together. This
>means that if we wanted 2^MAX_ORDER number of pages, we could linearly
>scan a block of pages allocated for UserReclaimable and page each of
>them out.

Interesting. I took a slightly different approach due to some
known delineations between areas that are defined to be non-
removable vs. areas that may be removed at some point. Thus I'm
only managing two distinct free_area lists currently. I'm curious
as to the motivation for having a global MAX_ORDER size list that
is allocation agnostic initially...is it so that the pages can
evolve according to system demands (assuming MAX_ORDER sized
chunks are eventually available again)?

It looks like you left the per_cpu_pages as-is. Did you
consider separating those as well to reflect kernel vs. user
pools?

>- struct free_area free_area[MAX_ORDER];
>+ struct free_area free_area_lists[ALLOC_TYPES][MAX_ORDER];
>+ struct free_area free_area_global;
>+
>+ /*
>+ * This map tracks what each 2^MAX_ORDER sized block has been used for.
>+ * When a page is freed, its index within this bitmap is calculated
>+ * using (address >> MAX_ORDER) * 2. This means that pages will
>+ * always be freed into the correct list in free_area_lists
>+ */
>+ unsigned long *free_area_usemap;

So, the current user/kernelreclaim/kernelnonreclaim determination
is based on this bitmap. Couldn't this be managed in individual
struct pages instead, kind of like the buddy bitmap patches?

I'm trying to figure out one last bug when I remove memory (via
nonlinear sections) that has been dedicated to user allocations.
After which perhaps I'll post it as well, although it is *very*
similar. However it does demonstrate the utility of this approach
for memory hotplug - specifically memory removal - without the
complexity of adding more zones.

matt


2005-01-12 23:16:28

by Mel Gorman

Subject: RE: [RFC] Avoiding fragmentation through different allocator

On Wed, 12 Jan 2005, Tolentino, Matthew E wrote:

> Hi Mel!
>

Hi.

First off, I think the differences in our approaches are based on
motivation. I'm tackling fragmentation whereas you were tackling
hotplug. That distinction will be the root of a lot of "why are you doing
that?" type questions. This is a failing on my part as I'm only beginning
to get back to grips with memory-management-related fun.

> >Instead of having one global MAX_ORDER-sized array of free lists, there
> >are three, one for each type of allocation. Finally, there is a list of
> >pages of size 2^MAX_ORDER which is a global pool of the largest pages
> >the kernel deals with.
>
> I've got a patch that I've been testing recently for memory
> hotplug that does nearly the exact same thing - break up the
> management of page allocations based on type - after having
> had a number of conversations with Dave Hansen on this topic.
> I've also prototyped this to use as an alternative to adding
> duplicate zones for delineating between memory that may be
> removed and memory that is not likely to ever be removable.

I considered adding a new zone but I felt it would be a massive job for
what I considered to be a simple problem. I think my approach is nice
and isolated within the allocator itself and will be less likely to
affect other code.

One possibility is that we could say that the UserRclm and KernRclm pools
are always eligible for hotplug and have hotplug banks only satisfy those
allocations, pushing KernNonRclm allocations to fixed banks. How is it
currently known whether a bank of memory is hotpluggable? Is there a node
for each hotplug bank? If yes, we could flag those nodes to only satisfy
UserRclm and KernRclm allocations and force fallback to other nodes. The
danger is that allocations would fail because non-hotplug banks were
already full and pageout would not happen because the watermarks were
satisfied.

(Bear in mind I can't test hotplug-related issues due to lack of suitable
hardware)

> I've
> only tested in the context of memory hotplug, but it does
> greatly simplify memory removal within individual zones. Your
> distinction between areas is pretty cool considering I've only
> distinguished at the coarser granularity of user vs. kernel
> to date. It would be interesting to throw KernelNonReclaimable
> into the mix as well although I haven't gotten there yet... ;-)
>

If you have already posted a version of the patch (you have feedback so I
guess it's there somewhere), can you send me a link to the thread where
you introduced your approach? It's possible that we just need to merge the
ideas.

> >Once a 2^MAX_ORDER block of pages is split for a type of allocation, it
> >is added to the free-lists for that type, in effect reserving it. Hence,
> >over time, pages of the related types can be clustered together. This
> >means that if we wanted 2^MAX_ORDER number of pages, we could linearly
> >scan a block of pages allocated for UserReclaimable and page each of
> >them out.
>
> Interesting. I took a slightly different approach due to some
> known delineations between areas that are defined to be non-
> removable vs. areas that may be removed at some point. Thus I'm
> only managing two distinct free_area lists currently. I'm curious
> as to the motivation for having a global MAX_ORDER size list that
> is allocation agnostic initially...

It's because I consider all 2^MAX_ORDER blocks of pages in a zone to be
equal, whereas I'm guessing you don't. Until they are split, there is
nothing special about them. It is only when a block is split that I want
it reserved for a purpose.

However, if we knew there were blocks that were hot-pluggable, we could
just have a hotplug-global and non-hotplug-global pool. If it's a UserRclm
or KernRclm allocation, split from hotplug-global, otherwise use
non-hotplug-global. It'd increase the memory requirements of the patch a
bit though.

> is it so that the pages can
> evolve according to system demands (assuming MAX_ORDER sized
> chunks are eventually available again)?
>

Exactly. Once a 2^MAX_ORDER block has been merged again, it will not be
reserved until the next split.

> It looks like you left the per_cpu_pages as-is. Did you
> consider separating those as well to reflect kernel vs. user
> pools?
>

I kept the per-cpu caches for UserRclm-style allocations only because
otherwise a Kernel-nonreclaimable allocation could easily be taken from a
UserRclm pool. Over a period of time, the UserRclm pool would be harder to
defragment. Even if we paged out everything and dumped all buffers, there
would still be kernel non-reclaimable allocations that have to be moved.
The concession I would make there is that allocations for caches could use
the per-cpu caches as they are easy to get rid of.
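
As a rough sketch of what I mean (the helper names below are made up for
illustration; they are not what the patch actually calls things):

/*
 * Illustrative sketch only -- alloc_type(), rmqueue_pcp() and
 * rmqueue_buddy() are hypothetical helpers. The point is that only
 * user-reclaimable order-0 allocations are served from the per-cpu
 * lists; kernel allocations always go to the buddy lists, so a
 * non-reclaimable page is never handed out from a UserRclm block.
 */
static struct page *sketch_alloc(struct zone *zone, unsigned int order,
				 unsigned int gfp_mask)
{
	int type = alloc_type(gfp_mask); /* UserRclm, KernRclm or KernNoRclm */

	if (order == 0 && type == ALLOC_USERRCLM)
		return rmqueue_pcp(zone);		/* per-cpu cache path */

	return rmqueue_buddy(zone, order, type);	/* buddy list path */
}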

> >- struct free_area free_area[MAX_ORDER];
> >+ struct free_area free_area_lists[ALLOC_TYPES][MAX_ORDER];
> >+ struct free_area free_area_global;
> >+
> >+ /*
> >+ * This map tracks what each 2^MAX_ORDER sized block has been used for.
> >+ * When a page is freed, its index within this bitmap is calculated
> >+ * using (address >> MAX_ORDER) * 2. This means that pages will
> >+ * always be freed into the correct list in free_area_lists
> >+ */
> >+ unsigned long *free_area_usemap;
>
> So, the current user/kernelreclaim/kernelnonreclaim determination
> is based on this bitmap. Couldn't this be managed in individual
> struct pages instead, kind of like the buddy bitmap patches?
>

Yes, but it would be a waste of memory, and the struct page flags are
already under a lot of pressure (in fact, I am 99.9999% certain that those
bits are at a premium and Andrew Morton, at least, will be fiercely unhappy
if I try to use another one). As I only care about 2^MAX_ORDER blocks of
pages, I only need two bits per 2^MAX_ORDER pages to track them.
Per-page, I would need at least one bit per page, which would be a fierce
waste of memory.
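
To make that concrete, the lookup is conceptually something like the
sketch below. get_pageblock_type() is the name used in the patch, but the
details here are a paraphrase of the idea rather than the literal code:

/*
 * Conceptual sketch, not the literal patch code. Two bits per
 * 2^MAX_ORDER block record which allocation type the block was last
 * split for, so a freed page can be put back on the matching list in
 * free_area_lists.
 */
static int get_pageblock_type(struct zone *zone, struct page *page)
{
	unsigned long pfn = page_to_pfn(page) - zone->zone_start_pfn;
	unsigned long bitidx = (pfn >> MAX_ORDER) * 2;	/* 2 bits per block */
	int type = 0;

	if (test_bit(bitidx, zone->free_area_usemap))
		type |= 1;
	if (test_bit(bitidx + 1, zone->free_area_usemap))
		type |= 2;

	return type;	/* index into free_area_lists[type][] */
}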

> I'm trying to figure out one last bug when I remove memory (via
> nonlinear sections) that has been dedicated to user allocations.
> After which perhaps I'll post it as well, although it is *very*
> similar. However it does demonstrate the utility of this approach
> for memory hotplug - specifically memory removal - without the
> complexity of adding more zones.
>

When you post that, make sure linux-mm is cc'd and I'll certainly see it.
On the linux-kernel mailing list, I might miss it. Thanks

--
Mel Gorman

2005-01-13 08:07:22

by Hirokazu Takahashi

Subject: Re: [RFC] Avoiding fragmentation through different allocator

Hi Mel,

The global list looks interesting.

> > >Instead of having one global MAX_ORDER-sized array of free lists, there
> > >are three, one for each type of allocation. Finally, there is a list of
> > >pages of size 2^MAX_ORDER which is a global pool of the largest pages
> > >the kernel deals with.

> > is it so that the pages can
> > evolve according to system demands (assuming MAX_ORDER sized
> > chunks are eventually available again)?
> >
>
> Exactly. Once a 2^MAX_ORDER block has been merged again, it will not be
> reserved until the next split.

FYI, MAX_ORDER is huge in some architectures.
I guess another watermark should be introduced instead of MAX_ORDER.

Thanks,
Hirokazu Takahashi.

2005-01-13 10:27:12

by Mel Gorman

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Thu, 13 Jan 2005, Hirokazu Takahashi wrote:

> Hi Mel,
>
> The global list looks interesting.
>
> > > >Instead of having one global MAX_ORDER-sized array of free lists, there
> > > >are three, one for each type of allocation. Finally, there is a list of
> > > >pages of size 2^MAX_ORDER which is a global pool of the largest pages
> > > >the kernel deals with.
>
> > > is it so that the pages can
> > > evolve according to system demands (assuming MAX_ORDER sized
> > > chunks are eventually available again)?
> > >
> >
> > Exactly. Once a 2^MAX_ORDER block has been merged again, it will not be
> > reserved until the next split.
>
> FYI, MAX_ORDER is huge in some architectures.
> I guess another watermark should be introduced instead of MAX_ORDER.
>

It could be, but remember that the watermark will decide what the largest
non-fragmented block size will be, and I am not sure that is something
architectures really want. That is, why would an architecture not want to
push to have the largest possible block available?

If they did really want the option, I could add MAX_FRAG_ORDER (ok, bad
name but it's morning) that architectures can optionally define. Then, in
the main code, just:

#ifndef MAX_FRAG_ORDER
#define MAX_FRAG_ORDER MAX_ORDER
#endif

The global lists would then be expected to hold the lists between
MAX_FRAG_ORDER and MAX_ORDER. Would that make sense and would
architectures really want it? If yes, I can code it up.
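
To sketch how that would look in struct zone (untested, only meant to
illustrate which orders each set of lists would cover):

#ifndef MAX_FRAG_ORDER
#define MAX_FRAG_ORDER MAX_ORDER
#endif

/* per-type lists for the orders below MAX_FRAG_ORDER */
struct free_area free_area_lists[ALLOC_TYPES][MAX_FRAG_ORDER];
/* type-agnostic global lists for orders MAX_FRAG_ORDER .. MAX_ORDER */
struct free_area free_area_global[MAX_ORDER - MAX_FRAG_ORDER + 1];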

--
Mel Gorman

2005-01-16 04:04:11

by Yasunori Goto

Subject: Re: [RFC] Avoiding fragmentation through different allocator

Hello.

I'm also very interested in your patches, because I'm working on
memory hotplug too.

> One possibility is that we could say that the UserRclm and KernRclm pools
> are always eligible for hotplug and have hotplug banks only satisfy those
> allocations, pushing KernNonRclm allocations to fixed banks. How is it
> currently known whether a bank of memory is hotpluggable? Is there a node
> for each hotplug bank? If yes, we could flag those nodes to only satisfy
> UserRclm and KernRclm allocations and force fallback to other nodes.

There are 2 types of memory hotplug.

a) SMP machine case
Some part of memory will be added and removed.

b) NUMA machine case
A whole node can be removed and added.
However, if a block of memory like a DIMM is broken and disabled,
it is close to a).

How to know which banks are hotpluggable is a platform/architecture-
dependent issue.
ex) Asking ACPI.
Just node 0 becomes unremovable, and other nodes are removable.
etc...

In your current patch, the initial attribute of all pages is NoRclm.
But if your patches had an interface to decide which areas will be Rclm
for each arch/platform, that would be good.


> The danger is
> that allocations would fail because non-hotplug banks were already full
> and pageout would not happen because the watermarks were satisfied.

In this case, if the user can change the attribute of a Rclm area to
NoRclm, it is better than nothing.
In the hotplug patches, there will be a new zone, ZONE_REMOVABLE.
But in those patches, this attribute change is a little bit difficult.
(At first remove the pages from the free_area of the removable zone,
then add them to the free_area of the un-removable zone.)
Probably this change is easier in your patch.


> (Bear in mind I can't test hotplug-related issues due to lack of suitable
> hardware)

I also don't have a real hotplug machine now. ;-)
I just use software emulation.

> > It looks like you left the per_cpu_pages as-is. Did you
> > consider separating those as well to reflect kernel vs. user
> > pools?
> >
>
> I kept the per-cpu caches for UserRclm-style allocations only because
> otherwise a Kernel-nonreclaimable allocation could easily be taken from a
> UserRclm pool.

I agree that dividing the per-cpu caches is not a good way. But if a
kernel-nonreclaimable allocation uses the UserRclm pool, its removable
memory bank will suddenly become harder to remove. Is that correct? If
so, it is not good for memory hotplug. Hmmmm.

Anyway, thank you for your patch. It is very interesting.

Bye.

--
Yasunori Goto <ygoto at us.fujitsu.com>


2005-01-16 16:21:47

by Mel Gorman

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Sat, 15 Jan 2005, Yasunori Goto wrote:

> There are 2 types of memory hotplug.
>
> a) SMP machine case
> Some part of memory will be added and removed.
>
> b) NUMA machine case
> A whole node can be removed and added.
> However, if a block of memory like a DIMM is broken and disabled,
> it is close to a).
>
> How to know which banks are hotpluggable is a platform/architecture-
> dependent issue.
> ex) Asking ACPI.
> Just node 0 becomes unremovable, and other nodes are removable.
> etc...
>

Is there an architecture-independent way of finding this out?

> In your current patch, the initial attribute of all pages is NoRclm.
> But if your patches had an interface to decide which areas will be Rclm
> for each arch/platform, that would be good.
>

It doesn't have an API as such. In page_alloc.c, there is a function
get_pageblock_type() that returns what type of allocation the block of
memory is being used for. There is no guarantee that only that type of
allocation is there, though.

>
> > The danger is
> > that allocations would fail because non-hotplug banks were already full
> > and pageout would not happen because the watermarks were satisfied.
>
> In this case, if the user can change the attribute of a Rclm area to
> NoRclm, it is better than nothing.
> In the hotplug patches, there will be a new zone, ZONE_REMOVABLE.

What's the current attitude towards adding a new zone? I felt there would
be resistance as a new zone would affect a lot of code paths and be yet
another zone that needed balancing. For example, is there a HIGHMEM
version of ZONE_REMOVABLE, or could normal and highmem be in this zone?

> But in those patches, this attribute change is a little bit difficult.
> (At first remove the pages from the free_area of the removable zone,
> then add them to the free_area of the un-removable zone.)
> Probably this change is easier in your patch.
>

I think the difficulty would be similar because it's still Move Pages From
A To B.

> I agree that dividing the per-cpu caches is not a good way. But if a
> kernel-nonreclaimable allocation uses the UserRclm pool, its removable
> memory bank will suddenly become harder to remove. Is that correct? If
> so, it is not good for memory hotplug. Hmmmm.
>

It is correct. However, this will only happen in low-memory conditions.
For a kernel-nonreclaimable allocation to use the userrclm pool, three
conditions have to be met;

1. Kernel-nonreclaimable pool has no pages
2. There are no global 2^MAX_ORDER pages
3. Kern-reclaimable pool has no pages

This is because of the fallback order. If you were interested in testing a
particular workload, you could apply the patch, run a workload and then
look at /proc/buddyinfo. There are three counters at the end of the
output like this;

KernNoRclm Fallback count: 0
KernRclm Fallback count: 0
UserRclm Fallback count: 425

A fallback can get counted twice. For example, if KernNoRclm falls back to
KernRclm and then UserRclm, it's considered to be two fallbacks.
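
Put another way, the fallback sequence for a KernNoRclm allocation is
effectively the following (a sketch only, not the literal code from the
patch; ALLOC_GLOBAL is just a stand-in name for the global pool here):

/*
 * Sketch: the order a KernNoRclm request walks through the pools. It
 * only reaches the UserRclm lists when everything before it is empty.
 */
static const int kernnorclm_fallback[] = {
	ALLOC_KERNNORCLM,	/* 1. its own per-type free lists */
	ALLOC_GLOBAL,		/* 2. the global 2^MAX_ORDER pool */
	ALLOC_KERNRCLM,		/* 3. the kernel-reclaimable lists */
	ALLOC_USERRCLM,		/* last resort: user-reclaimable lists */
};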

I also have (yet another) tool that is able to track exactly where each
type of allocation is. If you wanted to know precisely where each page is
and see how many non-reclaimable pages are ending up in the wrong place,
the tool could be modified to do that.

--
Mel Gorman

2005-01-17 23:16:52

by Yasunori Goto

Subject: Re: [RFC] Avoiding fragmentation through different allocator

> > There are 2 types of memory hotplug.
> >
> > a) SMP machine case
> > Some part of memory will be added and removed.
> >
> > b) NUMA machine case
> > A whole node can be removed and added.
> > However, if a block of memory like a DIMM is broken and disabled,
> > it is close to a).
> >
> > How to know which banks are hotpluggable is a platform/architecture-
> > dependent issue.
> > ex) Asking ACPI.
> > Just node 0 becomes unremovable, and other nodes are removable.
> > etc...
> >
>
> Is there an architecture-independent way of finding this out?

No. At least, I have no idea. :-(


> > In your current patch, the initial attribute of all pages is NoRclm.
> > But if your patches had an interface to decide which areas will be Rclm
> > for each arch/platform, that would be good.
> >
>
> It doesn't have an API as such. In page_alloc.c, there is a function
> get_pageblock_type() that returns what type of allocation the block of
> memory is being used for. There is no guarantee that only that type of
> allocation is there, though.

OK. I will write a patch with a function to set it for some arch/platforms.

> What's the current attitude towards adding a new zone? I felt there would
> be resistance as a new zone would affect a lot of code paths and be yet
> another zone that needed balancing. For example, is there a HIGHMEM
> version of ZONE_REMOVABLE, or could normal and highmem be in this zone?

Yes. In my current memory hotplug patch, Removable is like Highmem.
( <http://sourceforge.net/mailarchive/forum.php?forum_id=223>
It is group B of "Hot Add patches for NUMA" )

I tried, before this, to make a new removable zone which could sit
alongside normal and dma. But it needed too much work, as you said, so I
gave it up. I heard Matt-san has some ideas for it, so I'm looking
forward to seeing them.

> > I agree that dividing the per-cpu caches is not a good way. But if a
> > kernel-nonreclaimable allocation uses the UserRclm pool, its removable
> > memory bank will suddenly become harder to remove. Is that correct? If
> > so, it is not good for memory hotplug. Hmmmm.
> >
>
> It is correct. However, this will only happen in low-memory conditions.
> For a kernel-nonreclaimable allocation to use the userrclm pool, three
> conditions have to be met;
>
> 1. Kernel-nonreclaimable pool has no pages
> 2. There are no global 2^MAX_ORDER pages
> 3. Kern-reclaimable pool has no pages

I suppose that if this patch has been running for one year, an unlucky
case might occur. Probably, enterprise systems will not allow that. So,
I will try disabling fallback for KernNoRclm.

Thanks.

--
Yasunori Goto <ygoto at us.fujitsu.com>


2005-01-19 13:45:48

by Mel Gorman

Subject: Re: [RFC] Avoiding fragmentation through different allocator

On Mon, 17 Jan 2005, Yasunori Goto wrote:

> > Is there an architecture-independent way of finding this out?
>
> No. At least, I have no idea. :-(
>

In another mail to Matthew, I suggested that the zone->free_area_usemap
could be used to track hotplug blocks of pages by either using a
bit-pattern of 11 for hotplug pages or adding a third bit.

get_pageblock_type() could then be taught to identify a hotplug region
within page_alloc.c at least. If the information is needed outside the
allocator, it will need more work though.
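
As a sketch of the encoding I have in mind (the names are made up, just
to show the idea):

/*
 * Sketch only. The two existing usemap bits already distinguish the
 * three allocation types, which leaves the fourth pattern (11) free to
 * mark blocks backed by hotpluggable memory; alternatively a third bit
 * per 2^MAX_ORDER block could be added.
 */
#define PAGEBLOCK_KERNNORCLM	0x0	/* 00: kernel, non-reclaimable */
#define PAGEBLOCK_KERNRCLM	0x1	/* 01: kernel, reclaimable */
#define PAGEBLOCK_USERRCLM	0x2	/* 10: user, reclaimable */
#define PAGEBLOCK_HOTPLUG	0x3	/* 11: block in a hotplug bank */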

> > What's the current attitude towards adding a new zone? I felt there would
> > be resistance as a new zone would affect a lot of code paths and be yet
> > another zone that needed balancing. For example, is there a HIGHMEM
> > version of ZONE_REMOVABLE, or could normal and highmem be in this zone?
>
> Yes. In my current memory hotplug patch, Removable is like Highmem.
> ( <http://sourceforge.net/mailarchive/forum.php?forum_id=223>
> It is group B of "Hot Add patches for NUMA" )
>
> I tried, before this, to make a new removable zone which could sit
> alongside normal and dma. But it needed too much work, as you said, so I
> gave it up. I heard Matt-san has some ideas for it, so I'm looking
> forward to seeing them.
>

I'm taking a look through these patches just so I know what the other
approaches were.

> > > I agree that dividing the per-cpu caches is not a good way. But if a
> > > kernel-nonreclaimable allocation uses the UserRclm pool, its removable
> > > memory bank will suddenly become harder to remove. Is that correct? If
> > > so, it is not good for memory hotplug. Hmmmm.
> > >
> >
> > It is correct. However, this will only happen in low-memory conditions.
> > For a kernel-nonreclaimable allocation to use the userrclm pool, three
> > conditions have to be met;
> >
> > 1. Kernel-nonreclaimable pool has no pages
> > 2. There are no global 2^MAX_ORDER pages
> > 3. Kern-reclaimable pool has no pages
>
> I suppose that if this patch has been running for one year, an unlucky
> case might occur. Probably, enterprise systems will not allow that. So,
> I will try disabling fallback for KernNoRclm.
>

I can almost guarantee that will fail in low-memory conditions. Before I
implemented proper fallback logic, I used to get oopses in low-memory
conditions. I found it was because KernNoRclm had nowhere to fall back to,
but there was loads of free memory, so reclaim via kswapd was not taking
place.

So, just disabling fallback is not the right answer.

--
Mel Gorman