2005-11-01 13:57:15

by Ingo Molnar

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Mel Gorman <[email protected]> wrote:

> The set of patches do fix a lot and make a strong start at addressing
> the fragmentation problem, just not 100% of the way. [...]

do you have an expectation to be able to solve the 'fragmentation
problem', all the time, in a 100% way, now or in the future?

> So, with this set of patches, how fragmented you get is dependent on
> the workload and it may still break down and high order allocations
> will fail. But the current situation is that it will definitely break
> down. The fact is that it has been reported that memory hotplug remove
> works with these patches and doesn't without them. Granted, this is
> just one feature on a high-end machine, but it is one solid operation
> we can perform with the patches and cannot without them. [...]

can you always, under any circumstance hot unplug RAM with these patches
applied? If not, do you have any expectation to reach 100%?

Ingo


2005-11-01 14:10:37

by Dave Hansen

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 2005-11-01 at 14:56 +0100, Ingo Molnar wrote:
> * Mel Gorman <[email protected]> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?

In a word, yes.

The current allocator has no design for measuring or reducing
fragmentation. These patches provide the framework for at least
measuring fragmentation.

The patches can not do anything magical and there will be a point where
the system has to make a choice: fragment, or fail an allocation when
there _is_ free memory.

These patches take us in a direction where we are capable of making such
a decision.

> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance hot unplug RAM with these patches
> applied? If not, do you have any expectation to reach 100%?

With these patches, no. There are currently some very nice,
pathological workloads which will still cause fragmentation. But, in
the interest of incremental feature introduction, I think they're a fine
first step. We can effectively reach toward a more comprehensive
solution on top of these patches.

Reaching truly 100% will require some other changes such as being able
to virtually remap things like kernel text.

-- Dave

2005-11-01 14:30:06

by Ingo Molnar

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Dave Hansen <[email protected]> wrote:

> > can you always, under any circumstance hot unplug RAM with these patches
> > applied? If not, do you have any expectation to reach 100%?
>
> With these patches, no. There are currently some very nice,
> pathological workloads which will still cause fragmentation. But, in
> the interest of incremental feature introduction, I think they're a
> fine first step. We can effectively reach toward a more comprehensive
> solution on top of these patches.
>
> Reaching truly 100% will require some other changes such as being able
> to virtually remap things like kernel text.

then we need to see that 100% solution first - at least in terms of
conceptual steps. Not being able to hot-unplug RAM in a 100% way won't
satisfy customers. Whatever solution we choose, it must work 100%. Just
to give a comparison: would you be content with your computer failing to
start up apps 1 time out of 100, saying that 99% is good enough? Or
would you call it what it is: buggy and unreliable?

to stress it: hot unplug is a _feature_ that must work 100%, _not_ some
optimization where 99% is good enough. This is a feature that people
will be depending on if we promise it, and 1% failure rate is not
acceptable. Your 'pathological workload' might be customer X's daily
workload. Unless there is a clear definition of what is possible and
what is not (which definition can be relied upon by users), having a 99%
solution is much worse than the current 0% solution!

worse than that, this is a known _hard_ problem to solve in a 100% way,
and saying 'this patch is a good first step' just lures us (and
customers) into believing that we are only 1% away from the desired 100%
solution, while nothing could be further from the truth. They will
demand the remaining 1%, but can we offer it? Unless you can provide a
clear, agreed-upon path towards the 100% solution, we have nothing
right now.

I have no problems with using higher-order pages for performance
purposes [*], as long as 'failed' allocation (and freeing) actions are
user-invisible. But the moment you make it user-visible, it _must_ work
in a deterministic way!

Ingo

[*] in which case any slowdown in the page allocator must be offset by
the gains.

2005-11-01 14:46:14

by Ingo Molnar

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Mel Gorman <[email protected]> wrote:

> [...] The full 100% solution would be a large set of far-reaching
> patches that would touch a lot of the memory manager. This would get
> rejected because the patches should have arrived piecemeal. These
> patches are one piece. To reach 100%, other mechanisms are also needed
> such as:
>
> o Page migration to move unreclaimable pages like mlock()ed pages or
> kernel pages that had fallen back into easy-reclaim areas. A mechanism
> would also be needed to move things like kernel text. I think the memory
> hotplug tree has done a lot of work here
> o Mechanism for taking regions of memory offline. Again, I think the
> memory hotplug crowd have something for this. If they don't, one of them
> will chime in.
> o Linear page reclaim that linearly scans a region of memory and reclaims
> or moves all the pages in it. I have a proof-of-concept patch that does the
> linear scan and reclaim but it's currently ugly and depends on this set
> of patches being applied.

how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
that is pinned down, and to/from which live pointers may exist? That
alone can prevent RAM from being removable.

Ingo

2005-11-01 14:41:35

by Mel Gorman

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 1 Nov 2005, Ingo Molnar wrote:

> * Mel Gorman <[email protected]> wrote:
>
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?
>

Not now, but I expect to make 100% on demand in the future for all but
GFP_ATOMIC and GFP_NOFS allocations. As GFP_ATOMIC and GFP_NOFS cannot do
any reclaim work themselves, they will still be required to use smaller
orders or private pools that are refilled using GFP_KERNEL if necessary.
The high-order pages would have to be reclaimed by another process like
kswapd, just as happens for order-0 pages today.
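
To illustrate the private-pool idea, here is a minimal sketch (illustrative
only, not part of these patches; it just uses the existing mempool API,
spelled with present-day type names):

        #include <linux/mempool.h>
        #include <linux/gfp.h>
        #include <linux/errno.h>
        #include <linux/init.h>

        static mempool_t *buf_pool;

        /* order-2 buffers; the order is only an example */
        static void *buf_pool_alloc(gfp_t gfp_mask, void *pool_data)
        {
                return (void *)__get_free_pages(gfp_mask, 2);
        }

        static void buf_pool_free(void *element, void *pool_data)
        {
                free_pages((unsigned long)element, 2);
        }

        static int __init buf_pool_init(void)
        {
                /* the reserve is created and refilled with GFP_KERNEL
                 * from process context */
                buf_pool = mempool_create(4, buf_pool_alloc, buf_pool_free, NULL);
                return buf_pool ? 0 : -ENOMEM;
        }

        /* atomic contexts then take from the reserve instead of asking
         * the buddy allocator for a high-order block directly:
         *         buf = mempool_alloc(buf_pool, GFP_ATOMIC);
         */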

> > So, with this set of patches, how fragmented you get is dependent on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will definitely break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance hot unplug RAM with these patches
> applied? If not, do you have any expectation to reach 100%?
>

No, you cannot guarantee hot-unplugging RAM with these patches applied.
Anecdotal evidence suggests your chances are better on PPC64, which is a
start, but we have to start somewhere. The full 100% solution would be a
large set of far-reaching patches that would touch a lot of the memory
manager. This would get rejected because the patches should have
arrived piecemeal. These patches are one piece. To reach 100%, other
mechanisms are also needed such as:

o Page migration to move unreclaimable pages like mlock()ed pages or
kernel pages that had fallen back into easy-reclaim areas. A mechanism
would also be needed to move things like kernel text. I think the memory
hotplug tree has done a lot of work here
o Mechanism for taking regions of memory offline. Again, I think the
memory hotplug crowd have something for this. If they don't, one of them
will chime in.
o Linear page reclaim that linearly scans a region of memory and reclaims
or moves all the pages in it. I have a proof-of-concept patch that does the
linear scan and reclaim but it's currently ugly and depends on this set
of patches being applied.

These patches are the *starting* point that other things like linear page
reclaim can be based on.
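
To make the linear reclaim idea concrete, the core loop is roughly the
following sketch (not the proof-of-concept itself; try_to_reclaim_or_migrate()
is a hypothetical stand-in for the real reclaim/migration work):

        #include <linux/mm.h>

        /* walk a physical range page by page and try to empty it */
        static int linear_reclaim(unsigned long start_pfn, unsigned long end_pfn)
        {
                unsigned long pfn;

                for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                        struct page *page;

                        if (!pfn_valid(pfn))
                                continue;
                        page = pfn_to_page(pfn);
                        if (!page_count(page))
                                continue;       /* already free */
                        /* hypothetical helper: reclaim clean pagecache,
                         * write back or migrate everything else */
                        if (try_to_reclaim_or_migrate(page))
                                return -EAGAIN;
                }
                return 0;
        }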

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-01 14:49:28

by Dave Hansen

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 2005-11-01 at 15:29 +0100, Ingo Molnar wrote:
> * Dave Hansen <[email protected]> wrote:
> > > can you always, under any circumstance hot unplug RAM with these patches
> > > applied? If not, do you have any expectation to reach 100%?
> >
> > With these patches, no. There are currently some very nice,
> > pathological workloads which will still cause fragmentation. But, in
> > the interest of incremental feature introduction, I think they're a
> > fine first step. We can effectively reach toward a more comprehensive
> > solution on top of these patches.
> >
> > Reaching truly 100% will require some other changes such as being able
> > to virtually remap things like kernel text.
>
> then we need to see that 100% solution first - at least in terms of
> conceptual steps.

I don't think saying "truly 100%" really even makes sense. There will
always be restrictions of some kind. For instance, with a 10MB kernel
image, should you be able to shrink the memory in the system below
10MB? ;)

There is also no precedent in existing UNIXes for a 100% solution. From
http://publib.boulder.ibm.com/infocenter/pseries/index.jsp?topic=/com.ibm.aix.doc/aixbman/prftungd/dlpar.htm , a seemingly arbitrary restriction:

A memory region that contains a large page cannot be removed.

What the fragmentation patches _can_ give us is the ability to have 100%
success in removing certain areas: the "user-reclaimable" areas
referenced in the patch. This gives a customer at least the ability to
plan for how dynamically reconfigurable a system should be.

After these patches, the next logical steps are to increase the
knowledge that the slabs have about fragmentation, and to teach some of
the shrinkers about fragmentation.

After that, we'll need some kind of virtual remapping, breaking the 1:1
kernel virtual mapping, so that the most problematic pages can be
remapped. These pages would retain their virtual address, but get a
new physical one. However, this is quite far down the road and will require
some serious evaluation because it impacts how normal devices are able
to do DMA. The ppc64 proprietary hypervisor has features to work around
these issues, and any new hypervisors wishing to support partition
memory hotplug would likely have to follow suit.

-- Dave

2005-11-01 14:51:12

by Dave Hansen

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote:
> o Mechanism for taking regions of memory offline. Again, I think the
> memory hotplug crowd have something for this. If they don't, one of them
> will chime in.

I'm not sure what you're asking for here.

Right now, you can offline based on NUMA node, or physical address.
It's all revealed in sysfs. Sounds like "regions" to me. :)
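
For example, offlining one section from userspace looks something like this
(a sketch; the path follows the memory hotplug sysfs layout, and the section
number is only an example):

        #include <stdio.h>

        int main(void)
        {
                /* each memoryN directory covers one hotpluggable section */
                FILE *f = fopen("/sys/devices/system/memory/memory8/state", "w");

                if (!f)
                        return 1;
                fputs("offline", f);    /* request that the section go offline */
                return fclose(f) ? 1 : 0;
        }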

-- Dave

2005-11-01 15:01:32

by Ingo Molnar

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Dave Hansen <[email protected]> wrote:

> > then we need to see that 100% solution first - at least in terms of
> > conceptual steps.
>
> I don't think saying "truly 100%" really even makes sense. There will
> always be restrictions of some kind. For instance, with a 10MB kernel
> image, should you be able to shrink the memory in the system below
> 10MB? ;)

think of it in terms of filesystem shrinking: yes, obviously you cannot
shrink to below the allocated size, but no user expects to be able to do
it. But users would not accept filesystem shrinking failing for certain
file layouts. In that case we are better off with no ability to shrink:
it makes it clear that we have not solved the problem, yet.

so it's all about expectations: _could_ you reasonably remove a piece of
RAM? Customer will say: "I have stopped all nonessential services, and
free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
the kernel!". No reasonable customer will say: "True, I have all RAM
used up in mlock()ed sections, but i want to remove some RAM
nevertheless".

> There is also no precedent in existing UNIXes for a 100% solution.

does this have any relevance to the point, other than to prove that it's
a hard problem that we should not pretend to be able to solve, without
seeing a clear path towards a solution?

Ingo

2005-11-01 15:23:18

by Dave Hansen

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 2005-11-01 at 16:01 +0100, Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece of
> RAM? Customer will say: "I have stopped all nonessential services, and
> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> the kernel!".

That's an excellent example. Until we have some kind of kernel
remapping, breaking the 1:1 kernel virtual mapping, these pages will
always exist. The easiest example of this kind of memory is kernel
text.

Another example might be a somewhat errant device driver which has
allocated some large buffers and is doing DMA to or from them. In this
case, we need to have APIs to require devices to give up and reacquire
any dynamically allocated structures. If the device driver does not
implement these APIs it is not compatible with memory hotplug.

> > There is also no precedent in existing UNIXes for a 100% solution.
>
> does this have any relevance to the point, other than to prove that it's
> a hard problem that we should not pretend to be able to solve, without
> seeing a clear path towards a solution?

Agreed. It is a hard problem. One that some other UNIXes have not
fully solved.

Here are the steps that I think we need to take. Do you see any holes
in their coverage? Anything that seems infeasible?

1. Fragmentation avoidance
* by itself, increases the likelihood of having an area of memory
which might be easily removed
* very small (if any) performance overhead
* other potential in-kernel users
* creates infrastructure to enforce the "hotplugability" of any
particular area of memory.
2. Driver APIs
* Require that drivers specifically request for areas which must
retain constant physical addresses
* Driver must relinquish control of such areas upon request
* Can be worked around by hypervisors
3. Break 1:1 Kernel Virtual/Physical Mapping
* In any large area of physical memory we wish to remove, there will
likely be very, very few straggler pages, which can not easily be
freed.
* Kernel will transparently move the contents of these physical pages
to new pages, keeping constant virtual addresses.
* Negative TLB impact, as in-kernel large page mappings are broken
down into smaller pages.
* __{p,v}a() become more expensive, likely a table lookup
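
As a rough sketch of that last point (purely illustrative, with made-up
section constants), a non-1:1 __va() would turn the constant PAGE_OFFSET
arithmetic into something like a per-section lookup:

        #define SECTION_SHIFT   24                      /* e.g. 16MB sections */
        #define NR_SECTIONS     (1UL << 12)

        static void *section_virt[NR_SECTIONS];         /* filled in at map time */

        static inline void *phys_to_virt_lookup(unsigned long phys)
        {
                unsigned long offset = phys & ((1UL << SECTION_SHIFT) - 1);

                return (char *)section_virt[phys >> SECTION_SHIFT] + offset;
        }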

I've already done (3) on a limited basis, in the early days of memory
hotplug. Not the remapping, just breaking the 1:1 assumptions. It
wasn't too horribly painful.

We'll also need to make some decisions along the way about what to do
about things like large pages. Is it better to just punt like AIX and
refuse to remove their areas? Break them down into small pages and
degrade performance?

-- Dave

2005-11-01 15:23:15

by Mel Gorman

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 1 Nov 2005, Ingo Molnar wrote:

>
> * Mel Gorman <[email protected]> wrote:
>
> > [...] The full 100% solution would be a large set of far-reaching
> > patches that would touch a lot of the memory manager. This would get
> > rejected because the patches should have arrived piecemeal. These
> > patches are one piece. To reach 100%, other mechanisms are also needed
> > such as:
> >
> > o Page migration to move unreclaimable pages like mlock()ed pages or
> > kernel pages that had fallen back into easy-reclaim areas. A mechanism
> > would also be needed to move things like kernel text. I think the memory
> > hotplug tree has done a lot of work here
> > o Mechanism for taking regions of memory offline. Again, I think the
> > memory hotplug crowd have something for this. If they don't, one of them
> > will chime in.
> > o Linear page reclaim that linearly scans a region of memory and reclaims
> > or moves all the pages in it. I have a proof-of-concept patch that does the
> > linear scan and reclaim but it's currently ugly and depends on this set
> > of patches being applied.
>
> how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
> that is pinned down, and to/from which live pointers may exist? That
> alone can prevent RAM from being removable.
>

It would require the page to have its virtual->physical mapping changed
in the pagetables for each running process and the master page table. That
would be another step on the road to 100% support.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-01 15:24:10

by Mel Gorman

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tue, 1 Nov 2005, Dave Hansen wrote:

> On Tue, 2005-11-01 at 14:41 +0000, Mel Gorman wrote:
> > o Mechanism for taking regions of memory offline. Again, I think the
> > memory hotplug crowd have something for this. If they don't, one of them
> > will chime in.
>
> I'm not sure what you're asking for here.
>
> Right now, you can offline based on NUMA node, or physical address.
> It's all revealed in sysfs. Sounds like "regions" to me. :)
>

Ah yes, that would do the job all right.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-01 16:49:51

by Kamezawa Hiroyuki

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Ingo Molnar wrote:
> so it's all about expectations: _could_ you reasonably remove a piece of
> RAM? Customer will say: "I have stopped all nonessential services, and
> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
> the kernel!". No reasonable customer will say: "True, I have all RAM
> used up in mlock()ed sections, but i want to remove some RAM
> nevertheless".
>
Hi, I'm one of the people on -lhms

In my understanding...
- Memory Hotremove on IBM's LPAR? approach is
[remove some amount of memory from somewhere.]
For this approach, Mel's patch will work well.
But this will not guarantee that a user can remove a specified range of
memory at any time, because how a memory range is used is not defined by an admin
but by the kernel automatically. But to extract some amount of memory,
Mel's patch is very important and they need this.

My own target is NUMA node hotplug; what NUMA node hotplug wants is
- [remove the range of memory] For this approach, the admin should define
*core* nodes and removable nodes. Memory on a removable node is removable.
Dividing areas into removable and not-removable is needed, because
we cannot allocate any kernel objects on a removable area.
A removable area should be 100% removable. The customer can know the limitation before using it.

What I'm considering now is this:
- a removable area is a hot-added area
- a not-removable area is memory which is visible to the kernel at boot time.
(I'd like to achieve this by the limitation: a hot-added node goes into only ZONE_HIGHMEM)
A customer can hot-add their extra memory after boot. This is very easy to understand.
The performance problem is a trade-off. (I'm afraid of this ;)

If a customer wants to guarantee that some memory areas are hot-removable,
he will hot-add them.
I don't think adding memory for the kernel by hot-add is wanted by a customer.

-- Kame

2005-11-01 16:59:58

by Kamezawa Hiroyuki

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Kamezawa Hiroyuki wrote:
> Ingo Molnar wrote:
>
>> so it's all about expectations: _could_ you reasonably remove a piece
>> of RAM? Customer will say: "I have stopped all nonessential services,
>> and free RAM is at 90%, still I cannot remove that piece of faulty
>> RAM, fix the kernel!". No reasonable customer will say: "True, I have
>> all RAM used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of men in -lhms
>
> In my understanding...
> - Memory Hotremove on IBM's LPAR? approach is
> [remove some amount of memory from somewhere.]
> For this approach, Mel's patch will work well.
> But this will not guarantee a user can remove specified range of
> memory at any time because how memory range is used is not defined by
> an admin
> but by the kernel automatically. But to extract some amount of memory,
> Mel's patch is very important and they need this.
>
One more consideration...
Some CPUs which support virtualization will be shipped by some vendors in the near future.
If someone uses a virtualized OS, the only problem is *resizing*.
A hypervisor will be able to remap semi-physical pages anywhere with hardware assistance,
but system resizing needs operating system assistance.
In this direction, [remove some amount of memory from somewhere.] is an important approach.

-- Kame


2005-11-01 17:19:40

by Mel Gorman

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, Kamezawa Hiroyuki wrote:

> Ingo Molnar wrote:
> > so it's all about expectations: _could_ you reasonably remove a piece of
> > RAM? Customer will say: "I have stopped all nonessential services, and free
> > RAM is at 90%, still I cannot remove that piece of faulty RAM, fix the
> > kernel!". No reasonable customer will say: "True, I have all RAM used up in
> > mlock()ed sections, but i want to remove some RAM nevertheless".
> >
> Hi, I'm one of men in -lhms
>
> In my understanding...
> - Memory Hotremove on IBM's LPAR? approach is
> [remove some amount of memory from somewhere.]
> For this approach, Mel's patch will work well.
> But this will not guarantee a user can remove specified range of
> memory at any time because how memory range is used is not defined by an
> admin
> but by the kernel automatically. But to extract some amount of memory,
> Mel's patch is very important and they need this.
>
> My own target is NUMA node hotplug, what NUMA node hotplug want is
> - [remove the range of memory] For this approach, admin should define
> *core* node and removable node. Memory on removable node is removable.
> Dividing area into removable and not-removable is needed, because
> we cannot allocate any kernel's object on removable area.
> Removable area should be 100% removable. Customer can know the limitation
> before using.
>

In this case, we would want some mechanism that says "don't put awkward
pages in this NUMA node" in a clear way. One way we could do this is:

1. Move fallback_allocs to be per-node. fallback_allocs is currently
defined as
int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
        {RCLM_NORCLM, RCLM_FALLBACK, RCLM_KERN,   RCLM_EASY, RCLM_TYPES},
        {RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
        {RCLM_KERN,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_EASY, RCLM_TYPES}
};

The effect is that a RCLM_NORCLM allocation falls back to
RCLM_FALLBACK, RCLM_KERN, RCLM_EASY and then gives up.

2. Architectures would need to provide a function that allocates and
populates a fallback_allocs[][] array. If they do not provide one, a
generic function uses an array like the one above.

3. When adding a node that must be removable, make the array look like
this:

int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
        {RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
        {RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
        {RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
};

The effect of this is that only allocations that are easily reclaimable will
end up in this node. This would be a straightforward addition to build
upon this set of patches. The difference would only be visible to
architectures that cared.

> What I'm considering now is this:
> - removable area is hot-added area
> - not-removable area is memory which is visible to kernel at boot time.
> (I'd like to achieve this by the limitation : hot-added node goes into only
> ZONE_HIGHMEM)


ZONE_HIGHMEM can still end up with PTE pages if allocating PTE pages from
highmem is configured. This is bad. With the above approach, nodes that
are not hot-added that have a ZONE_HIGHMEM will be able to use it for PTEs
as well. But when a node is hot-added, it will have a ZONE_HIGHMEM that is
not used for PTE allocations because they are not RCLM_EASY allocations.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-01 18:07:50

by linux-os (Dick Johnson)

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


On Tue, 1 Nov 2005, Kamezawa Hiroyuki wrote:

> Ingo Molnar wrote:
>> so it's all about expectations: _could_ you reasonably remove a piece of
>> RAM? Customer will say: "I have stopped all nonessential services, and
>> free RAM is at 90%, still I cannot remove that piece of faulty RAM, fix
>> the kernel!". No reasonable customer will say: "True, I have all RAM
>> used up in mlock()ed sections, but i want to remove some RAM
>> nevertheless".
>>
> Hi, I'm one of men in -lhms
>
> In my understanding...
> - Memory Hotremove on IBM's LPAR? approach is
> [remove some amount of memory from somewhere.]
> For this approach, Mel's patch will work well.
> But this will not guarantee a user can remove specified range of
> memory at any time because how memory range is used is not defined by an admin
> but by the kernel automatically. But to extract some amount of memory,
> Mel's patch is very important and they need this.
>
> My own target is NUMA node hotplug, what NUMA node hotplug want is
> - [remove the range of memory] For this approach, admin should define
> *core* node and removable node. Memory on removable node is removable.
> Dividing area into removable and not-removable is needed, because
> we cannot allocate any kernel's object on removable area.
> Removable area should be 100% removable. Customer can know the limitation before using.
>
> What I'm considering now is this:
> - removable area is hot-added area
> - not-removable area is memory which is visible to kernel at boot time.
> (I'd like to achieve this by the limitation : hot-added node goes into only ZONE_HIGHMEM)
> A customer can hot add their extra memory after boot. This is very easy to understand.
> Peformance problem is trade-off.(I'm afraid of this ;)
>
> If a cutomer wants to guarantee some memory areas should be hot-removable,
> he will hot-add them.
> I don't think adding memory for the kernel by hot-add is wanted by a customer.
>
> -- Kame

With ix86 machines, the page directory pointed to by CR3 needs
to always be present in physical memory. This means that there
must always be some RAM that can't be hot-swapped (you can't
put back the contents of the page-directory without using
the CPU which needs the page directory).

This is explained on page 5-21 of the i486 reference manual.
This happens because there is no "present" bit in CR3 as there
are in the page tables themselves.

This problem means that "surprise" swaps are impossible. However,
given a forewarning, it is possible to build a new table somewhere
in existing RAM within the physical constraints required, call
some code there (needs to be a 1:1 translation), disable paging,
then proceed. The problem is that of writing the contents
of the RAM to be replaced out to storage media, so that the new page-table
can be loaded from the new location. This may not work
if the LDT and the GDT are not accessible from their current
locations. If they are in the RAM to be replaced, you are
in a world of hurt taking the "world" apart and putting it
back together again.


Cheers,
Dick Johnson
Penguin : Linux version 2.6.13.4 on an i686 machine (5589.55 BogoMips).
Warning : 98.36% of all statistics are fiction.

2005-11-01 18:24:40

by Rob Landley

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tuesday 01 November 2005 07:56, Ingo Molnar wrote:
> * Mel Gorman <[email protected]> wrote:
> > The set of patches do fix a lot and make a strong start at addressing
> > the fragmentation problem, just not 100% of the way. [...]
>
> do you have an expectation to be able to solve the 'fragmentation
> problem', all the time, in a 100% way, now or in the future?

Considering anybody can allocate memory and never release it, _any_ 100%
solution is going to require migrating existing pages, regardless of
allocation strategy.

> > So, with this set of patches, how fragmented you get is dependant on
> > the workload and it may still break down and high order allocations
> > will fail. But the current situation is that it will defiantly break
> > down. The fact is that it has been reported that memory hotplug remove
> > works with these patches and doesn't without them. Granted, this is
> > just one feature on a high-end machine, but it is one solid operation
> > we can perform with the patches and cannot without them. [...]
>
> can you always, under any circumstance hot unplug RAM with these patches
> applied? If not, do you have any expectation to reach 100%?

You're asking intentionally leading questions, aren't you? Without on-demand
page migration a given area of physical memory would only ever be free by
sheer coincidence. Less fragmented page allocation doesn't address _where_
the free areas are, it just tries to make them contiguous.

A page migration strategy would have to do less work if there's less
fragmention, and it also allows you to cluster the "difficult" cases (such as
kernel structures that just ain't moving) so you can much more easily
hot-unplug everything else. It also makes larger order allocations easier to
do so drivers needing that can load as modules after boot, and it also means
hugetlb comes a lot closer to general purpose infrastructure rather than a
funky boot-time reservation thing. Plus page prezeroing approaches get to
work on larger chunks, and so on.

But any strategy to demand that "this physical memory range must be freed up
now" will by definition require moving pages...

> Ingo

Rob

2005-11-01 18:34:15

by Rob Landley

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Tuesday 01 November 2005 08:46, Ingo Molnar wrote:
> how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
> that is pinned down, and to/from which live pointers may exist? That
> alone can prevent RAM from being removable.

Would you like to apply your "100% or nothing" argument to the virtual memory
management subsystem and see how it sounds in that context? (As an argument
that we shouldn't _have_ one?)

> Ingo

Rob

2005-11-01 19:03:03

by Ingo Molnar

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Rob Landley <[email protected]> wrote:

> On Tuesday 01 November 2005 08:46, Ingo Molnar wrote:
> > how will the 100% solution handle a simple kmalloc()-ed kernel buffer,
> > that is pinned down, and to/from which live pointers may exist? That
> > alone can prevent RAM from being removable.
>
> Would you like to apply your "100% or nothing" argument to the virtual
> memory management subsystem and see how it sounds in that context?
> (As an argument that we shouldn't _have_ one?)

that would be comparing apples to oranges. There is a big difference
between "VM failures under high load", and "failure of VM functionality
for no user-visible reason". The fragmentation problem here has nothing
to do with pathological workloads. It has to do with 'unlucky'
allocation patterns that pin down RAM areas which thus become
non-removable. The RAM module will be non-removable for no user-visible
reason. Possible under zero load, and with lots of free RAM otherwise.

Ingo

2005-11-01 20:31:50

by Joel Schopp

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

>>>The set of patches do fix a lot and make a strong start at addressing
>>>the fragmentation problem, just not 100% of the way. [...]
>>
>>do you have an expectation to be able to solve the 'fragmentation
>>problem', all the time, in a 100% way, now or in the future?
>
>
> Considering anybody can allocate memory and never release it, _any_ 100%
> solution is going to require migrating existing pages, regardless of
> allocation strategy.
>

Three issues here. Fragmentation of memory in general, fragmentation of usage,
and being able to have 100% success rate at removing memory.

We will never be able to have 100% contiguous memory with no fragmentation.
Ever. Certainly not while we have non-movable pieces of memory. Even if we
could move every piece of memory it would be impractical. What these patches do
for general fragmentation is to keep the allocations that never will get freed
away from the rest of memory, so that memory has a chance to form larger
contiguous ranges when it is freed.

By separating memory based on usage there is another side effect. It also makes
possible some more active defragmentation methods on easier memory, because it
doesn't have annoying hard memory scattered throughout. Suddenly we can talk
about being able to do memory hotplug remove on significant portions of memory.
Or allocating these hugepages after boot. Or doing active defragmentation.
Or modules being able to be modules because they don't have to preallocate big
pieces of contiguous memory.

Some people will argue that we need 100% separation of usage or no separation at
all. Well, change the array of fallback to not allow kernel non-reclaimable to
fall back and we are done. A 4-line change, 100% separation. But the tradeoff is
that under memory pressure we might fail allocations when we still have free
memory. There are other options for fallback of course, the fallback_alloc()
function is easily replaceable if somebody wants to. Many of these options get
easier once memory migration is in. The way fallback is done in the current
patches is to maintain current behavior as much as possible, satisfy
allocations, and not affect performance.

As to the 100% success at removing memory, this set of patches doesn't solve
that. But it solves the 80% problem quite nicely (when combined with the memory
migration patches). 80% is great for virtualized systems where the OS has some
choice over which memory to remove, but not the quantity to remove. It is also
a good start to 100%, because we can separate and identify the easy memory from
the hard memory. Dave Hansen has outlined in separate posts how we can get to
100%, including hard memory.

>>can you always, under any circumstance hot unplug RAM with these patches
>>applied? If not, do you have any expectation to reach 100%?
>
>
> You're asking intentionally leading questions, aren't you? Without on-demand
> page migration a given area of physical memory would only ever be free by
> sheer coincidence. Less fragmented page allocation doesn't address _where_
> the free areas are, it just tries to make them contiguous.
>
> A page migration strategy would have to do less work if there's less
> fragmentation, and it also allows you to cluster the "difficult" cases (such as
> kernel structures that just ain't moving) so you can much more easily
> hot-unplug everything else. It also makes larger order allocations easier to
> do so drivers needing that can load as modules after boot, and it also means
> hugetlb comes a lot closer to general purpose infrastructure rather than a
> funky boot-time reservation thing. Plus page prezeroing approaches get to
> work on larger chunks, and so on.
>
> But any strategy to demand that "this physical memory range must be freed up
> now" will by definition require moving pages...

Perfectly stated.

2005-11-02 00:33:14

by Kamezawa Hiroyuki

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Mel Gorman wrote:
> 3. When adding a node that must be removable, make the array look like
> this
>
> int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
>         {RCLM_NORCLM, RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
>         {RCLM_EASY,   RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN,  RCLM_TYPES},
>         {RCLM_KERN,   RCLM_TYPES,    RCLM_TYPES,  RCLM_TYPES, RCLM_TYPES},
> };
>
> The effect of this is only allocations that are easily reclaimable will
> end up in this node. This would be a straight-forward addition to build
> upon this set of patches. The difference would only be visible to
> architectures that cared.
>
Thank you for illustration.
Maybe a fallback list per pgdat/zone is what I need with your patch, right?

-- Kame


2005-11-02 00:51:39

by Nick Piggin

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Dave Hansen wrote:

> What the fragmentation patches _can_ give us is the ability to have 100%
> success in removing certain areas: the "user-reclaimable" areas
> referenced in the patch. This gives a customer at least the ability to
> plan for how dynamically reconfigurable a system should be.
>

But the "user-reclaimable" areas can still be taken over by other
areas which become fragmented.

That's like saying we can already guarantee 100% success in removing
areas that are unfragmented and free, or freeable.

> After these patches, the next logical steps are to increase the
> knowledge that the slabs have about fragmentation, and to teach some of
> the shrinkers about fragmentation.
>

I don't like all this work and complexity and overheads going into a
partial solution.

Look: if you have to guarantee memory can be shrunk, set aside a zone
for it (that only fills with user reclaimable areas). This is better
than the current frag patches because it will give you the 100%
guarantee that you need (provided we have page migration to move mlocked
pages).

If you don't need a guarantee, then our current, simple system does the
job perfectly.

> After that, we'll need some kind of virtual remapping, breaking the 1:1
> kernel virtual mapping, so that the most problematic pages can be
> remapped. These pages would retain their virtual address, but get a
> new physical one. However, this is quite far down the road and will require
> some serious evaluation because it impacts how normal devices are able
> to do DMA. The ppc64 proprietary hypervisor has features to work around
> these issues, and any new hypervisors wishing to support partition
> memory hotplug would likely have to follow suit.
>

I would more like to see something like this happen (provided it was
nicely abstracted away and could be CONFIGed out for the 99.999% of
users who don't need the overhead or complexity).

--
SUSE Labs, Novell Inc.


2005-11-02 06:12:11

by Andrew Morton

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Mel Gorman <[email protected]> wrote:
>
> As GFP_ATOMIC and GFP_NOFS cannot do
> any reclaim work themselves

Both GFP_NOFS and GFP_NOIO can indeed perform direct reclaim. All
we require is __GFP_WAIT.
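
The flag definitions (as in include/linux/gfp.h of this era) make that
clear; of the flags mentioned, only GFP_ATOMIC lacks __GFP_WAIT:

        #define GFP_ATOMIC      (__GFP_HIGH)
        #define GFP_NOIO        (__GFP_WAIT)
        #define GFP_NOFS        (__GFP_WAIT | __GFP_IO)
        #define GFP_KERNEL      (__GFP_WAIT | __GFP_IO | __GFP_FS)

        /* the allocator only enters direct reclaim when the caller can
         * sleep, i.e. when (gfp_mask & __GFP_WAIT) is set */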

2005-11-02 07:19:57

by Ingo Molnar

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Kamezawa Hiroyuki <[email protected]> wrote:

> My own target is NUMA node hotplug, what NUMA node hotplug want is
> - [remove the range of memory] For this approach, admin should define
> *core* node and removable node. Memory on removable node is removable.
> Dividing area into removable and not-removable is needed, because
> we cannot allocate any kernel's object on removable area.
> Removable area should be 100% removable. Customer can know the limitation
> before using.

that's a perfectly fine method, and is quite similar to the 'separate
zone' approach Nick mentioned too. It is also easily understandable for
users/customers.

under such an approach, things become easier as well: if you have zones
you can restrict (no kernel pinned-down allocations, no mlock-ed
pages, etc.), there's no need for any 'fragmentation avoidance' patches!
Basically all of that RAM becomes instantly removable (with some small
complications). That's the beauty of the separate-zones approach. It is
also a limitation: no kernel allocations, so all the highmem-alike
restrictions apply to it too.

but what is a dangerous fallacy is that we will be able to support hot
memory unplug of generic kernel RAM in any reliable way!

you really have to look at this from the conceptual angle: 'can an
approach ever lead to a satisfactory result'? If the answer is 'no',
then we _must not_ add a 90% solution that we _know_ will never be a
100% solution.

for the separate-removable-zones approach we see the end of the tunnel.
Separate zones are well-understood.

generic unpluggable kernel RAM _will not work_.

Ingo

2005-11-02 07:42:31

by Dave Hansen

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote:
> Look: if you have to guarantee memory can be shrunk, set aside a zone
> for it (that only fills with user reclaimable areas). This is better
> than the current frag patches because it will give you the 100%
> guarantee that you need (provided we have page migration to move mlocked
> pages).

With Mel's patches, you can easily add the same guarantee. Look at the
code in fallback_alloc() (patch 5/8). It would be quite easy to modify
the fallback lists to disallow fallbacks into areas from which we would
like to remove memory. That was left out for simplicity. As you say,
they're quite complex as it is. Would you be interested in seeing a
patch to provide those kinds of guarantees?

We've had a bit of experience with a hotpluggable zone approach before.
Just like the current topic patches, you're right, that approach can
also provide strong guarantees. However, the issue comes if the system
ever needs to move memory between such zones, such as if a user ever
decides that they'd prefer to break hotplug guarantees rather than OOM.

Do you think changing what a particular area of memory is being used for
would ever be needed?

One other thing, if we decide to take the zones approach, it would have
no other side benefits for the kernel. It would be for hotplug only and
I don't think even the large page users would get much benefit.

-- Dave

2005-11-02 07:46:45

by Gerrit Huizenga

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


On Wed, 02 Nov 2005 08:19:43 +0100, Ingo Molnar wrote:
>
> * Kamezawa Hiroyuki <[email protected]> wrote:
>
> > My own target is NUMA node hotplug, what NUMA node hotplug want is
> > - [remove the range of memory] For this approach, admin should define
> > *core* node and removable node. Memory on removable node is removable.
> > Dividing area into removable and not-removable is needed, because
> > we cannot allocate any kernel's object on removable area.
> > Removable area should be 100% removable. Customer can know the limitation
> > before using.
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>
> under such an approach, things become easier as well: if you have zones
> you can restrict (no kernel pinned-down allocations, no mlock-ed
> pages, etc.), there's no need for any 'fragmentation avoidance' patches!
> Basically all of that RAM becomes instantly removable (with some small
> complications). That's the beauty of the separate-zones approach. It is
> also a limitation: no kernel allocations, so all the highmem-alike
> restrictions apply to it too.
>
> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>
> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result'? If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>
> generic unpluggable kernel RAM _will not work_.

Actually, it will. Well, depending on terminology.

There are two usage models here - those which intend to remove physical
elements and those where the kernel returns management of its virtualized
"physical" memory to a hypervisor. In the latter case, a hypervisor
already maintains a virtual map of the memory and the OS needs to release
virtualized "physical" memory. I think you are referring to RAM here as
the physical component; however these same defrag patches help where a
hypervisor is maintaining the real physical memory below the operating
system and the OS is managing a virtualized "physical" memory.

On pSeries hardware or with Xen, a client OS can return chunks of memory
to the hypervisor. That memory needs to be returned in chunks of the
size that the hypervisor normally manages/maintains. But long ranges
of physical contiguity are not required. Just shorter ranges, depending
on what the hypervisor maintains, need to be returned from the OS to
the hypervisor.

In other words, if we can return 1 MB chunks, the hypervisor can hand
out those 1 MB chunks to other domains/partitions. So, if we can return
500 1 MB chunks from a 2 GB OS instance, we can add 500 MB dynamically
to another OS image.

This happens to be a *very* satisfactory answer for virtualized environments.

The other answer, which is harder, is to return (free) entire large physical
chunks, e.g. the size of the full memory of a node, allowing a node to be
dynamically removed (or a DIMM/SIMM/etc.).

So, people are working towards two distinct solutions, both of which
require us to do a better job of defragmenting memory (or avoiding
fragmentation in the first place).

gerrit

2005-11-02 07:55:53

by Nick Piggin

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Ingo Molnar wrote:
> * Kamezawa Hiroyuki <[email protected]> wrote:
>
>
>>My own target is NUMA node hotplug, what NUMA node hotplug want is
>>- [remove the range of memory] For this approach, admin should define
>> *core* node and removable node. Memory on removable node is removable.
>> Dividing area into removable and not-removable is needed, because
>> we cannot allocate any kernel's object on removable area.
>> Removable area should be 100% removable. Customer can know the limitation
>> before using.
>
>
> that's a perfectly fine method, and is quite similar to the 'separate
> zone' approach Nick mentioned too. It is also easily understandable for
> users/customers.
>

I agree - and I think it should be easy to configure out of the
kernel for those that don't want the functionality, and should
add very little complexity to core code (all without looking at
the patches so I could be very wrong!).

>
> but what is a dangerous fallacy is that we will be able to support hot
> memory unplug of generic kernel RAM in any reliable way!
>

Very true.

> you really have to look at this from the conceptual angle: 'can an
> approach ever lead to a satisfactory result'? If the answer is 'no',
> then we _must not_ add a 90% solution that we _know_ will never be a
> 100% solution.
>
> for the separate-removable-zones approach we see the end of the tunnel.
> Separate zones are well-understood.
>

Yep, I don't see why this doesn't cover all the needs that the frag
patches attempt (hot unplug, hugepage dynamic reserves).

--
SUSE Labs, Novell Inc.


2005-11-02 08:23:14

by Nick Piggin

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Dave Hansen wrote:
> On Wed, 2005-11-02 at 11:51 +1100, Nick Piggin wrote:
>
>>Look: if you have to guarantee memory can be shrunk, set aside a zone
>>for it (that only fills with user reclaimable areas). This is better
>>than the current frag patches because it will give you the 100%
>>guarantee that you need (provided we have page migration to move mlocked
>>pages).
>
>
> With Mel's patches, you can easily add the same guarantee. Look at the
> code in fallback_alloc() (patch 5/8). It would be quite easy to modify
> the fallback lists to disallow fallbacks into areas from which we would
> like to remove memory. That was left out for simplicity. As you say,
> they're quite complex as it is. Would you be interested in seeing a
> patch to provide those kinds of guarantees?
>

On top of Mel's patch? I think this is essential for any guarantees
that you might be interested in... but it would just mean that now you
have a redundant extra zoning layer.

I think ZONE_REMOVABLE is something that really needs to be looked at
again if you need a hotunplug solution in the kernel.

> We've had a bit of experience with a hotpluggable zone approach before.
> Just like the current topic patches, you're right, that approach can
> also provide strong guarantees. However, the issue comes if the system
> ever needs to move memory between such zones, such as if a user ever
> decides that they'd prefer to break hotplug guarantees rather than OOM.
>

I can imagine one could have a sysctl to allow/disallow non-easy-reclaim
allocations from ZONE_REMOVABLE.

As Ingo says, neither way is going to give a 100% solution - I wouldn't
like to see so much complexity added to bring us from a ZONE_REMOVABLE 80%
solution to a 90% solution. I believe this is where Linus' "perfect is
the enemy of good" quote applies.

> Do you think changing what a particular area of memory is being used for
> would ever be needed?
>

Perhaps, but Mel's patch only guarantees you to change once, same as
ZONE_REMOVABLE. Once you eat up those easy-to-reclaim areas, you can't
get them back.

> One other thing, if we decide to take the zones approach, it would have
> no other side benefits for the kernel. It would be for hotplug only and
> I don't think even the large page users would get much benefit.
>

Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
talking about other higher-order users, I still think we can't guarantee
past about order 1 or 2 with Mel's patch and they simply need to have
some other ways to do things.

But I think using zones would have advantages in that they would help
give zones and zone balancing more scrutiny and test coverage in the
kernel, which is sorely needed since everyone threw out their highmem
systems :P

--
SUSE Labs, Novell Inc.


2005-11-02 08:34:05

by Yasunori Goto

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> > One other thing, if we decide to take the zones approach, it would have
> > no other side benefits for the kernel. It would be for hotplug only and
> > I don't think even the large page users would get much benefit.
> >
>
> Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
> talking about other higher-order users, I still think we can't guarantee
> past about order 1 or 2 with Mel's patch and they simply need to have
> some other ways to do things.

Hmmm. I don't see it at this point.
Why do you think ZONE_REMOVABLE can satisfy hugepage users?
At least, my ZONE_REMOVABLE patch doesn't have any concern about
fragmentation.

Bye.

--
Yasunori Goto

2005-11-02 08:41:46

by Nick Piggin

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Yasunori Goto wrote:
>>>One other thing, if we decide to take the zones approach, it would have
>>>no other side benefits for the kernel. It would be for hotplug only and
>>>I don't think even the large page users would get much benefit.
>>>
>>
>>Hugepage users? They can be satisfied with ZONE_REMOVABLE too. If you're
>>talking about other higher-order users, I still think we can't guarantee
>>past about order 1 or 2 with Mel's patch and they simply need to have
>>some other ways to do things.
>
>
> Hmmm. I don't see it at this point.
> Why do you think ZONE_REMOVABLE can satisfy hugepage users?
> At least, my ZONE_REMOVABLE patch doesn't have any concern about
> fragmentation.
>

Well I think it can satisfy hugepage allocations simply because
we can be reasonably sure of being able to free contiguous regions.
Of course it will be memory no longer easily reclaimable, same as
the case for the frag patches. Nor would the name ZONE_REMOVABLE any
longer be the most appropriate!

But my point is, the basic mechanism is there and is workable.
Hugepages and memory unplug are the two main reasons for IBM to be
pushing this AFAIKS.

--
SUSE Labs, Novell Inc.


2005-11-02 08:48:27

by Nick Piggin

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Gerrit Huizenga wrote:

> So, people are working towards two distinct solutions, both of which
> require us to do a better job of defragmenting memory (or avoiding
> fragmentation in the first place).
>

This is just going around in circles. Even with your fragmentation
avoidance and memory defragmentation, there are still going to be
cases where memory does get fragmented and can't be defragmented.
This is Ingo's point, I believe.

Isn't the solution for your hypervisor problem to dish out pages of
the same size that are used by the virtual machines? Doesn't this
provide you with a nice, 100% solution that doesn't add complexity
where it isn't needed?

--
SUSE Labs, Novell Inc.


2005-11-02 09:13:19

by Gerrit Huizenga

Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
>
> > So, people are working towards two distinct solutions, both of which
> > require us to do a better job of defragmenting memory (or avoiding
> > fragmentation in the first place).
> >
>
> This is just going around in circles. Even with your fragmentation
> avoidance and memory defragmentation, there are still going to be
> cases where memory does get fragmented and can't be defragmented.
> This is Ingo's point, I believe.
>
> Isn't the solution for your hypervisor problem to dish out pages of
> the same size that are used by the virtual machines. Doesn't this
> provide you with a nice, 100% solution that doesn't add complexity
> where it isn't needed?

So do you see the problem with fragmentation if the hypervisor is
handing out, say, 1 MB pages? Or, more likely, something like 64 MB
pages? What are the chances that an entire 64 MB page can be freed
on a large system that has been up a while?

And, if you create zones, you run into all of the zone rebalancing
problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
random allocations, making almost none of it available.

However, with reasonable defragmentation or fragmentation avoidance,
we have some potential to make large chunks available for return to
the hypervisor. And, that same capability continues to help those
who want to remove fixed ranges of physical memory.

gerrit

2005-11-02 09:35:50

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Gerrit Huizenga wrote:
> On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:

>>Isn't the solution for your hypervisor problem to dish out pages of
>>the same size that are used by the virtual machines. Doesn't this
>>provide you with a nice, 100% solution that doesn't add complexity
>>where it isn't needed?
>
>
> So do you see the problem with fragmentation if the hypervisor is
> handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> pages? What are the chances that an entire 64 MB page can be freed
> on a large system that has been up a while?
>

I see the problem, but if you want to be able to shrink memory to a
given size, then you must either introduce a hard limit somewhere, or
have the hypervisor hand out guest sized pages. Use zones, or Xen?

> And, if you create zones, you run into all of the zone rebalancing
> problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> random allocations, making almost none of it available.
>

If there are zone rebalancing problems[*], then it would be great to
have more users of zones because then they will be more likely to get
fixed.

[*] and there are, sadly enough - see the recent patches I posted to
lkml for example. But I'm fairly confident that once the particularly
silly ones have been fixed, zone balancing will no longer be a
derogatory term as has been thrown around (maybe rightly) in this
thread!

--
SUSE Labs, Novell Inc.


2005-11-02 10:18:30

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


On Wed, 02 Nov 2005 20:37:43 +1100, Nick Piggin wrote:
> Gerrit Huizenga wrote:
> > On Wed, 02 Nov 2005 19:50:15 +1100, Nick Piggin wrote:
>
> >>Isn't the solution for your hypervisor problem to dish out pages of
> >>the same size that are used by the virtual machines. Doesn't this
> >>provide you with a nice, 100% solution that doesn't add complexity
> >>where it isn't needed?
> >
> >
> > So do you see the problem with fragmentation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?

So why do you believe there must be a hard limit?

Any reduction in memory usage is going to be workload related.
If the workload is consuming less memory than is available, memory
reclaim is easy (e.g. handle fragmentation, find nice sized chunks).
The workload determines how much the administrator can free. If
the workload is using all of the resources available (e.g. lots of
associated kernel memory locked down, locked user pages, etc.)
then the administrator will logically be able to remove less memory
from the machine.

The amount of memory to be freed up is not determined by some pre-defined
machine constraints but based on the actual workload's use of the machine.

In other words, who really cares if there is some hard limit? The
only limit should be the number of pages not currently needed by
a given workload, not some arbitrary zone size.

> > And, if you create zones, you run into all of the zone rebalancing
> > problems of ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM. In that case, on
> > any long running system, ZONE_HOTPLUGGABLE has been overwhelmed with
> > random allocations, making almost none of it available.
>
> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.
>
> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example. But I'm fairly confident that once the particularly
> silly ones have been fixed, zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!

You are more optimistic here than I. You might have improved the
problem but I think that any zone rebalancing problem is intrinsically
hard given the way those zones are used and the fact that we sort
of want them to be dynamic and yet physically contiguous. Those two
core constraints seem to be relatively at odds with each other.

I'm not a huge fan of dividing memory up into different types which
are all special purposed. Everything that becomes special purposed
over time limits its use and brings up questions on what special purpose
bucket each allocation should use (e.g. ZONE_NORMAL or ZONE_HIGHMEM
or ZONE_DMA or ZONE_HOTPLUGGABLE). And then, when you run out of
ZONE_HIGHMEM and have to reach into ZONE_HOTPLUGGABLE for some pinned
memory allocation, it seems the whole concept leads to a messy
train wreck.

gerrit

2005-11-02 10:41:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Gerrit Huizenga <[email protected]> wrote:

> > generic unpluggable kernel RAM _will not work_.
>
> Actually, it will. Well, depending on terminology.

'generic unpluggable kernel RAM' means what it says: any RAM seen by the
kernel can be unplugged, always. (as long as the unplug request is
reasonable and there is enough free space to migrate in-use pages to).

> There are two usage models here - those which intend to remove
> physical elements and those where the kernel returns management of
> its virtualized "physical" memory to a hypervisor. In the latter
> case, a hypervisor already maintains a virtual map of the memory and
> the OS needs to release virtualized "physical" memory. I think you
> are referring to RAM here as the physical component; however these
> same defrag patches help where a hypervisor is maintaining the real
> physical memory below the operating system and the OS is managing a
> virtualized "physical" memory.

reliable unmapping of "generic kernel RAM" is not possible even in a
virtualized environment. Think of the 'live pointers' problem i outlined
in an earlier mail in this thread today.

Ingo

2005-11-02 11:04:39

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
>
> * Gerrit Huizenga <[email protected]> wrote:
>
> > > generic unpluggable kernel RAM _will not work_.
> >
> > Actually, it will. Well, depending on terminology.
>
> 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> kernel can be unplugged, always. (as long as the unplug request is
> reasonable and there is enough free space to migrate in-use pages to).

Okay, I understand your terminology. Yes, I can not point to any
particular piece of memory and say "I want *that* one" and have that
request succeed. However, I can say "find me 50 chunks of memory
of your choosing" and have a very good chance of finding enough
memory to satisfy my request.

> > There are two usage models here - those which intend to remove
> > physical elements and those where the kernel returns management of
> > its virtualized "physical" memory to a hypervisor. In the latter
> > case, a hypervisor already maintains a virtual map of the memory and
> > the OS needs to release virtualized "physical" memory. I think you
> > are referring to RAM here as the physical component; however these
> > same defrag patches help where a hypervisor is maintaining the real
> > physical memory below the operating system and the OS is managing a
> > virtualized "physical" memory.
>
> reliable unmapping of "generic kernel RAM" is not possible even in a
> virtualized environment. Think of the 'live pointers' problem i outlined
> in an earlier mail in this thread today.

Yeah - and that isn't what is being proposed here. The goal is to ask
the kernel to identify some memory which can be legitimately freed and
hasten the freeing of that memory.

gerrit

2005-11-02 11:22:26

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2 Nov 2005, KAMEZAWA Hiroyuki wrote:

> Mel Gorman wrote:
> > 3. When adding a node that must be removable, make the array look like
> > this
> >
> > int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
> > {RCLM_NORCLM, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
> > {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
> > {RCLM_KERN, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
> > };
> >
> > The effect of this is only allocations that are easily reclaimable will
> > end up in this node. This would be a straight-forward addition to build
> > upon this set of patches. The difference would only be visible to
> > architectures that cared.
> >
> Thank you for illustration.
> maybe fallback_list per pgdat/zone is what I need with your patch. right ?
>

With my patch, yes. With zones, you need to change how zonelists are built
for each node.
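
For illustration, here is a minimal user-space sketch of how an allocator
might walk a per-type fallback table like the one quoted above. The enum
ordering, the names and the walk itself are assumptions made for this sketch
only, not the actual code from the anti-defrag patches:

#include <stdio.h>

/* Illustrative ordering only; the real patch defines these elsewhere. */
enum { RCLM_NORCLM, RCLM_EASY, RCLM_KERN, RCLM_FALLBACK, RCLM_TYPES };

static const char *rclm_names[] = { "NORCLM", "EASY", "KERN", "FALLBACK" };

/* Per-node fallback table for a removable node, as quoted above: each row
 * lists the pools tried for one allocation type, terminated by the
 * RCLM_TYPES sentinel. */
static int fallback_allocs[RCLM_TYPES-1][RCLM_TYPES+1] = {
        {RCLM_NORCLM, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
        {RCLM_EASY, RCLM_FALLBACK, RCLM_NORCLM, RCLM_KERN, RCLM_TYPES},
        {RCLM_KERN, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES, RCLM_TYPES},
};

int main(void)
{
        for (int type = 0; type < RCLM_TYPES - 1; type++) {
                printf("%s allocations try:", rclm_names[type]);
                for (const int *t = fallback_allocs[type]; *t != RCLM_TYPES; t++)
                        printf(" %s", rclm_names[*t]);
                printf("\n");
        }
        return 0;
}

Read this way, NORCLM and KERN requests have nowhere to fall back to on such
a node while EASY requests can still spill into the other pools, which is one
way of reading the "only easily reclaimable allocations end up in this node"
effect described above.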

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 12:01:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


* Gerrit Huizenga <[email protected]> wrote:

>
> On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
> >
> > * Gerrit Huizenga <[email protected]> wrote:
> >
> > > > generic unpluggable kernel RAM _will not work_.
> > >
> > > Actually, it will. Well, depending on terminology.
> >
> > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> > kernel can be unplugged, always. (as long as the unplug request is
> > reasonable and there is enough free space to migrate in-use pages to).
>
> Okay, I understand your terminology. Yes, I can not point to any
> particular piece of memory and say "I want *that* one" and have that
> request succeed. However, I can say "find me 50 chunks of memory
> of your choosing" and have a very good chance of finding enough
> memory to satisfy my request.

but that's obviously not 'generic unpluggable kernel RAM'. It's very
special RAM: RAM that is free or easily freeable. I never argued that
such RAM is not returnable to the hypervisor.

> > reliable unmapping of "generic kernel RAM" is not possible even in a
> > virtualized environment. Think of the 'live pointers' problem i outlined
> > in an earlier mail in this thread today.
>
> Yeah - and that isn't what is being proposed here. The goal is to
> ask the kernel to identify some memory which can be legitimately
> freed and hasten the freeing of that memory.

but that's very easy to identify: check the free list or the clean
list(s). No defragmentation necessary. [unless the unit of RAM mapping
between hypervisor and guest is too coarse (i.e. not 4K pages).]

Ingo

2005-11-02 12:38:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary

On Wed, 2 Nov 2005, Nick Piggin wrote:

> Dave Hansen wrote:
>
> > What the fragmentation patches _can_ give us is the ability to have 100%
> > success in removing certain areas: the "user-reclaimable" areas
> > referenced in the patch. This gives a customer at least the ability to
> > plan for how dynamically reconfigurable a system should be.
> >
>
> But the "user-reclaimable" areas can still be taken over by other
> areas which become fragmented.
>

This is true, we have worst case scenarios. With our patches though, our
assertion is that it takes a lot longer to degrade and in good scenarios like
where the workload is not using all of physical memory, we don't degrade
at all. Assuming we get a page migration or active defragmentation in the
future, it will be a lot longer before they have to do any work. As we
only fragment when there is nothing else to do, page migration will also
have less work to do.

> That's like saying we can already guarantee 100% success in removing
> areas that are unfragmented and free, or freeable.
>
> > After these patches, the next logical steps are to increase the
> > knowledge that the slabs have about fragmentation, and to teach some of
> > the shrinkers about fragmentation.
> >
>
> I don't like all this work and complexity and overheads going into a
> partial solution.
>
> Look: if you have to guarantee memory can be shrunk, set aside a zone
> for it (that only fills with user reclaimable areas). This is better
> than the current frag patches because it will give you the 100%
> guarantee that you need (provided we have page migration to move mlocked
> pages).
>
> If you don't need a guarantee, then our current, simple system does the
> job perfectly.
>

Ok. To me, the rest of the thread are beating around the same points and
no one is giving ground. The points are made so lets summarise. Apologies
if anything is missing.

Problem
=======

Memory gets fragmented meaning that contiguous blocks of memory are not
free and not freeable no matter how much kswapd works

Impact
======
A number of different users are hit, in different ways
Physical Hotplug remove: Hotplug remove needs to be able to free a large
region of memory that is then unplugged. Different architectures have
different ways of doing this
Virtualization hotplug remove: The requirements are lighter here.
Contiguous Regions from 1MiB to 64MiB (figure taken from thread)
must be freed to move the memory between virtual machines
High order allocations: With fragmentation, high order allocations fail.
Depending on the workload, kswapd could work forever and not free up a
4MiB chunk

Who cares
=========
Physical hotplug remove: Vendors of the hardware that support this -
Fujitsu, HP (I think), IBM etc

Virtualization hotplug remove: Sellers of virtualization software, some
hardware like any IBM machine that lists LPAR in it's list of
features. Probably software solutions like Xen are also affected
if they want to be able to grow and shrink the virtual machines on
demand

High order allocations: Ultimately, hugepage users. Today, that is a
feature only big server users like Oracle care about. In the
future I reckon applications will be able to use them for things
like backing the heap by huge pages. Other users like GigE,
loopback devices with large MTUs, some filesystem like CIFS are
all interested although they are also being told to use smaller
pages.

Solutions
=========

Anti-defrag: This solution defines three groups of pages KernNoRclm,
KernRclm and EasyRclm. Small sub-zone regions of size
2^(MAX_ORDER-1) are reserved for each allocation type. If there
are no large blocks available and no reserved pages available, it
falls back and begins to fragment. This tries to delay
fragmentation for as long as possible

New Zone: Add a new zone for easyrclm only allocations. This means that
all kernel pages go in one place and all easyrclm go in another.
This solution would allow us to reclaim contiguous blocks of memory.
(Note: This is basically what Solaris Kernel Cages are)

Note that I am leaving out Growing/Shrinking zone code for the moment.
While zones are currently able to get new pages with something like memory
hotadd, there is no mechanism available to move existing pages from one
zone into another. This will need planning and code. Code exists for page
migration so we can reasonably speculate about what it brings to the table
for both anti-defrag and New Zone approaches.

Pros/Cons of Solutions
======================

Anti-defrag Pros
o Aim9 shows no significant regressions (.37% on page_test). On some
tests, it shows performance gains (> 5% on fork_test)
o Stress tests show that it manages to keep fragmentation down to a far
lower level even without teaching kswapd how to linear reclaim
o Stress tests with a linear reclaim experimental patch shows that it
can successfully find large contiguous chunks of memory
o It is known to help hotplug on PPC64
o No tunables. The approach tries to manage itself as much as possible
o It exists, heavily tested, and synced against the latest -mm1
o Can be compiled away by redefining the RCLM_* macros and the
__GFP_*RCLM flags

Anti-defrag Cons
o More complexity within the page allocator
o Adds a new layer onto the allocator that effectively creates subzones
o Adding a new concept that maintainers have to work with
o Depending on the workload, it fragments anyway

New Zone Pros
o Zones are a well known and understood concept
o For people that do not care about hotplug, they can easily get rid of it
o Provides reliable areas of contiguous groups that can be freed for
HugeTLB pages going to userspace
o Uses existing zone infrastructure for balancing

New Zone Cons
o Zones historically have introduced balancing problems
o Been tried for hotplug and dropped because of being awkward to work with
o It only helps hotplug and potentially HugeTLB pages for userspace
o Tunable required. If you get it wrong, the system suffers a lot
o Needs to be planned for and developed

Scenarios
=========

Let's outline some situations or workloads that can occur

1. Heavy job running that consumes 75% of physical memory. Like a kernel
build

Anti-defrag: It will not fragment as it will never have to fall back. High
order allocations will be possible in the remaining 25%.
Zone-based: After being tuned to a kernel build load, it will not
fragment. Get the tuning wrong, performance suffers or workload
fails. High order allocations will be possible in the remaining 25%.

Future work for scenario 1
Anti-defrag: No problem.
Zone-based: Tune some more if problems occur.

2. Heavy job running that needs 110% of physical memory, swap is used.
Example would be too many simultaneous kernel builds
Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on the
duration and heaviness of the load. High order allocations will
work if kswapd runs for long enough as it will reclaim the
UserRclm reserved areas. Your chances depend on the intensity of
KernNoRclm allocations

Zone-based: After being tuned to the new kernel build load, it will not
fragment. Get it wrong and performance suffers. High order
allocations will work if you're lucky enough to have enough
reclaimable pages together. Your chances are not good

Future work for scenario 2
Anti-defrag: kswapd would need to know how to reclaim EasyRclm pages
from the KernNoRclm, KernRclm and Fallback areas.
Zone-based: Keep tuning

3. HighMem intensive workload with CONFIG_HIGHPTE set. Example would be a
scientific job that was performing a very large calculation on an
anonymous region of memory. Possible that some desktop
workloads are like this - i.e. use large amounts of anonymous
memory

Anti-defrag: For ZONE_HIGHMEM, PTEs are grouped into one area,
everything else into another, no fragmentation. HugeTLB
allocations in ZONE_HIGHMEM will work if kswapd works long enough
Zone-based: PTEs go to anywhere in ZONE_HIGHMEM. Easy-reclaimed goes to
ZONE_HIGHMEM and ZONE_HOTREMOVABLE. ZONE_HIGHMEM fragments,
ZONE_HOTREMOVABLE does not. HugeTLB pages will be available in
ZONE_HOTREMOVABLE, but probably not in ZONE_HIGHMEM.

Future work for scenario 3
Anti-defrag: No problem. On-demand HugeTLB allocation for userspace is
possible. Would work better with linear page reclaim.
Zone-based: Depends if we care that ZONE_HIGHMEM gets fragmented. We
would only care if trying to allocate HugeTLB pages on demand from
ZONE_HIGHMEM. ZONE_HOTREMOVABLE depending on it's size would be
possible. Linear reclaim will help ZONE_HOTREMOVABLE, but not
ZONE_HIGHMEM

4. KBuild. Main concerns here are performance
Anti-defrag: May cause problems because of the .37% drop on page_test.
May cause improvements because of the 5% increase on fork_test. No
figures on kbuild available
Zone-based: No figures available. Depends heavily on being configured
correctly

Future work for scenario 4
Anti-defrag: Try and optimise the paths affected. Alternatively make
anti-defrag a configurable option by altering the values of RCLM_*
and __GFP_*RCLM. (Note, would people be interested in a
compile-time option for anti-defrag or would it make the complexity
worse for people?)
Zone-based: Tune for performance or compile away the zone

5. Physically unplug memory 25% of physical memory

Anti-defrag: Memory in the region gets reclaimed if it's EasyRclm.
Possibly will encounter awkward pages. Known that PPC64 has some
success. Fujitsu uses nodes for hotplug; they would need to
adjust the fallbacks to be fully reliable
Zone-based: If we are unplugging the right zone, reclaim the pages.
Possibly will encounter awkward pages (only mlock in this case)

Future work for scenario 5
Anti-defrag: fallback_allocs per node is needed for Fujitsu to be in any way
reliable. Ability to move awkward pages around. For 100% success,
ability to move kernel pages
Zone-based: Ability to move awkward pages around. There is no 100%
success scenario here. You remove the ZONE_HOTREMOVABLE area or
you turn the machine off.

6. Fsck a large filesystem (known to be a KernNoRclm heavy workload)

Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Contiguous blocks may appear in
the middle of the zone rather than either end.
Zone-based: If misconfigured, performance degrades. As a machine will
not be tuned for fsck, chances of degrading are pretty high. On
the other hand, fsck is something people can wait for

Future work for scenario 6
Anti-defrag: Ideally, in case of fallbacks, page migration would move
awkward pages out of UserRclm areas
Zone-based: Keep tuning if you run into problems

Let's say we agree on a way that ZONE_HOTREMOVABLE can be shrunk in such a
way as to give pages to ZONE_NORMAL and ZONE_HIGHMEM as necessary (and we
have to be able to handle both), Situations 2 and 6 change. Note that this
changing of zones sizes brings all the problems from the anti-defrag
approach to the zone-based approach.

2a. Heavy job running that needs 110% of physical memory, swap is used.
Anti-defrag: UserRclm regions are stolen to prevent too many fallbacks.
KernNoRclm starts stealing UserRclm regions to avoid excessive
fragmentation but some fragmentation occurs. Extent depends on the
duration and heaviness of the load.
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. The zone cannot be
shrunk so ZONE_NORMAL fragments as normal.

Future work for scenario 2a
Anti-defrag: kswapd would need to know how to clean EasyRclm pages
from the KernNoRclm, KernRclm and Fallback reserved areas. When
load drops off, regions will get reserved again for EasyRclm.
Contiguous blocks will become available whenever possible, be it at the
beginning, middle or end of the zone. Page migration would help
fix up single kernel pages left in EasyRclm areas.
Zone-based: Page migration would need to move pages from the end of
the zone so it could be shrunk.

6a. Fsck
Anti-defrag: Memory fragments, but fsck is a short-lived kernel heavy
workload. It is also known that free blocks reappear through the
address space when it finishes. Once the free blocks appear, they
get reserved for the different allocation types on demand and
business continues as usual
Zone-based: ZONE_NORMAL grows into ZONE_HOTREMOVABLE. No mechanism to
shrink it so it doesn't recover

Future work for scenario 6a

Anti-defrag: Same as for Situation 2. kswapd would need to know how to
clean UserRclm pages from the KernNoRclm, KernRclm and Fallback
reserved areas.
Zone-based: Same as for 2a. Page migration would need to move pages
from the end of the zone so it could be shrunk

I've tried to be as objective as possible with the summary.

From the points above though, I think that anti-defrag gets us a lot of
the way, with the complexity isolated in one place. Its downside is that
it can still break down and future work is needed to stop it degrading
(kswapd cleaning UserRclm areas and page migration when we get really
stuck). Zone-based is more reliable but only addresses a limited
situation, principally hotplug and it does not even go 100% of the way for
hotplug. It also depends on a tunable which is not cool and it is static.
If we make the zones growable+shrinkable, we run into all the same
problems that anti-defrag has today.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-02 12:43:06

by Dave Hansen

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, 2005-11-02 at 13:00 +0100, Ingo Molnar wrote:
>
> > Yeah - and that isn't what is being proposed here. The goal is to
> > ask the kernel to identify some memory which can be legitimately
> > freed and hasten the freeing of that memory.
>
> but that's very easy to identify: check the free list or the clean
> list(s). No defragmentation necessary. [unless the unit of RAM mapping
> between hypervisor and guest is too coarse (i.e. not 4K pages).]

It needs to be that coarse in cases where HugeTLB is desired for use.
I'm not sure I could convince the DB guys to give up large pages,
they're pretty hooked on them. ;)

-- Dave

2005-11-02 14:51:52

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> Well I think it can satisfy hugepage allocations simply because
> we can be reasonably sure of being able to free contiguous regions.
> Of course it will be memory no longer easily reclaimable, same as
> the case for the frag patches. Nor would the name ZONE_REMOVABLE any
> longer be the most appropriate!
>
> But my point is, the basic mechanism is there and is workable.
> Hugepages and memory unplug are the two main reasons for IBM to be
> pushing this AFAIKS.

No, that's not true - those are just the "exciting" features that go
on the back of it. Look back in this email thread - there's lots of
other reasons to fix fragmentation. I don't believe you can eliminate
all the order > 0 allocations in the kernel.

2005-11-02 15:02:38

by Gerrit Huizenga

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19


On Wed, 02 Nov 2005 13:00:48 +0100, Ingo Molnar wrote:
>
> * Gerrit Huizenga <[email protected]> wrote:
>
> >
> > On Wed, 02 Nov 2005 11:41:31 +0100, Ingo Molnar wrote:
> > >
> > > * Gerrit Huizenga <[email protected]> wrote:
> > >
> > > > > generic unpluggable kernel RAM _will not work_.
> > > >
> > > > Actually, it will. Well, depending on terminology.
> > >
> > > 'generic unpluggable kernel RAM' means what it says: any RAM seen by the
> > > kernel can be unplugged, always. (as long as the unplug request is
> > > reasonable and there is enough free space to migrate in-use pages to).
> >
> > Okay, I understand your terminology. Yes, I can not point to any
> > particular piece of memory and say "I want *that* one" and have that
> > request succeed. However, I can say "find me 50 chunks of memory
> > of your choosing" and have a very good chance of finding enough
> > memory to satisfy my request.
>
> but that's obviously not 'generic unpluggable kernel RAM'. It's very
> special RAM: RAM that is free or easily freeable. I never argued that
> such RAM is not returnable to the hypervisor.

Okay - and 'generic unpluggable kernel RAM' has not been a goal for
the hypervisor based environments. I believe it is closer to being
a goal for those machines which want to hot-remove DIMMs or physical
memory, e.g. those with IA64 machines wishing to remove entire nodes.

> > > reliable unmapping of "generic kernel RAM" is not possible even in a
> > > virtualized environment. Think of the 'live pointers' problem i outlined
> > > in an earlier mail in this thread today.
> >
> > Yeah - and that isn't what is being proposed here. The goal is to
> > ask the kernel to identify some memory which can be legitimately
> > freed and hasten the freeing of that memory.
>
> but that's very easy to identify: check the free list or the clean
> list(s). No defragmentation necessary. [unless the unit of RAM mapping
> between hypervisor and guest is too coarse (i.e. not 4K pages).]

Ah, but the hypervisor often manages large page sizes, e.g. 64 MB.
It doesn't manage page rights for each guest OS at the 4 K granularity.
Hypervisors are theoretically light in terms of memory needs and
general footprint. Picture the overhead of tracking rights/permissions
of each page of memory and its assignment to any of, say, 256 different
guest operating systems. For a machine of any size, that would be
a huge amount of state for a hypervisor to maintain. Would you
really want a hypervisor to keep that much state? Or is it more
reasonable for a hypervisor to track, say, 64 MB chunks and the
rights of that memory for a number of guest operating systems? Even
if the number of guests is small, the data structures for fast
memory management would grow quickly.
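
(As a rough illustration of the numbers involved, assuming a 256 GB machine
purely for the sake of the arithmetic: 256 GB split into 4 KB pages is about
64 million page frames, so even one byte of per-page rights state for each of
256 guests comes to roughly 16 GB of bookkeeping. Tracked as 64 MB chunks,
the same machine has only 4096 chunks, so the equivalent per-guest state is
trivially small.)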

gerrit

2005-11-03 01:38:06

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wednesday 02 November 2005 02:43, Nick Piggin wrote:

> > Hmmm. I don't see it at this point.
> > Why do you think ZONE_REMOVABLE can satisfy hugepage users?
> > At least, my ZONE_REMOVABLE patch doesn't have any concern about
> > fragmentation.
>
> Well I think it can satisfy hugepage allocations simply because
> we can be reasonably sure of being able to free contiguous regions.
> Of course it will be memory no longer easily reclaimable, same as
> the case for the frag patches. Nor would the name ZONE_REMOVABLE any
> longer be the most appropriate!
>
> But my point is, the basic mechanism is there and is workable.
> Hugepages and memory unplug are the two main reasons for IBM to be
> pushing this AFAIKS.

Who cares what IBM is pushing? I'm interested in fragmentation avoidance for
User Mode Linux.

I use User Mode Linux to virtualize a system build, and one problem I
currently have is that some workloads temporarily use a lot of memory. For
example, I can run a complete system build in about 48 megs of ram: except
for building GCC. That spikes to a couple hundred megabytes. If I allocate
256 megabytes of memory to UML, that's half the memory on my laptop and UML
will just use it for redundant caching and such while desktop performance
gets a bit unhappy with the build going.

UML gets an instance's "physical memory" by allocating a temporary file,
mmapping it, and deleting it (which signals to the vfs that flushing this
data to backing store should only be done under memory pressure from the rest
of the OS, because the file's going away when it's closed so there's no
point in writing the data out otherwise).

With fragmentation reduction and prezeroing, UML suddenly gains the option of
calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of
prezeroing, B) a way of giving memory back to the host OS when it's not in
use.
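
A minimal user-space sketch of the combination described above; this is not
UML's actual code, and the path, sizes and constants are assumptions made
purely for illustration:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

/* Sketch: back a guest's "physical memory" with an unlinked temporary file,
 * then hint to the host that a large idle region can be reclaimed. */
int main(void)
{
        size_t len = 64 << 20;                  /* 64 MiB of guest "RAM" */
        char path[] = "/tmp/guest-mem-XXXXXX";  /* illustrative location */
        int fd = mkstemp(path);

        if (fd < 0 || ftruncate(fd, len) < 0)
                return 1;
        unlink(path);                           /* gone when the fd is closed */

        char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED)
                return 1;

        memset(mem, 0xaa, len);                 /* the guest uses the memory */

        /* The guest later frees a large contiguous block: hint that the host
         * may reclaim the backing pages. As Jeff Dike notes later in this
         * thread, MADV_DONTNEED alone does not discard dirty file-backed
         * data; that is what the MADV_REMOVE work addresses. */
        madvise(mem, len / 2, MADV_DONTNEED);

        munmap(mem, len);
        close(fd);
        return 0;
}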

This has _nothing_ to do with IBM. Or large systems. This is some random
developer trying to run a virtualized system build on his laptop.

(The reason I need to use UML is that I build uClibc with the newest 2.6
kernel headers I can, link apps against it, and then run many of those
apps during later stages of the build. If the kernel headers used to build
libc are sufficiently newer than the kernel the build is running under, I get
segfaults because the new libc tries to use kernel features that aren't there on
the host system, but will be in the final system. I also get the ability to
mknod/chown/chroot without needing root access on the host system for
free...)

Rob

2005-11-03 01:37:36

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wednesday 02 November 2005 09:02, Gerrit Huizenga wrote:
> > but that's obviously not 'generic unpluggable kernel RAM'. It's very
> > special RAM: RAM that is free or easily freeable. I never argued that
> > such RAM is not returnable to the hypervisor.
>
> Okay - and 'generic unpluggable kernel RAM' has not been a goal for
> the hypervisor based environments. I believe it is closer to being
> a goal for those machines which want to hot-remove DIMMs or physical
> memory, e.g. those with IA64 machines wishing to remove entire nodes

Keep in mind that just about any virtualized environment might benefit from
being able to tell the parent system "we're not using this ram". I mentioned
UML, and I can also imagine a Linux driver that signals qemu (or even vmware)
to say "this chunk of physical memory isn't currently in use", and even if
they don't actually _free_ it they can call madvise() on it.

Heck, if we have prezeroing of large blocks, telling your emulator to
madvise(ADV_DONTNEED) the pages for you should just plug right in to that
infrastructure...

Rob

2005-11-03 01:37:34

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wednesday 02 November 2005 03:37, Nick Piggin wrote:
> > So do you see the problem with fragementation if the hypervisor is
> > handing out, say, 1 MB pages? Or, more likely, something like 64 MB
> > pages? What are the chances that an entire 64 MB page can be freed
> > on a large system that has been up a while?
>
> I see the problem, but if you want to be able to shrink memory to a
> given size, then you must either introduce a hard limit somewhere, or
> have the hypervisor hand out guest sized pages. Use zones, or Xen?

In the UML case, I want the system to automatically be able to hand back any
sufficiently large chunks of memory it currently isn't using.

What does this have to do with specifying hard limits of anything? What's to
specify? Workloads vary. Deal with it.

> If there are zone rebalancing problems[*], then it would be great to
> have more users of zones because then they will be more likely to get
> fixed.

Ok, so you want to artificially turn this into a zone balancing issue in hopes
of giving that area of the code more testing when, if zones weren't involved,
there would be no need for balancing at all?

How does that make sense?

> [*] and there are, sadly enough - see the recent patches I posted to
> lkml for example.

I was under the impression that zone balancing is, conceptually speaking, a
difficult problem.

> But I'm fairly confident that once the particularly
> silly ones have been fixed,

Great, you're advocating migrating the fragmentation patches to an area of
code that has known problems you yourself describe as "particularly silly".
A ringing endorsement, that.

The fact that the migrated version wouldn't even address fragmentation
avoidance at all (the topic of this thread!) is apparently a side issue.

> zone balancing will no longer be a
> derogatory term as has been thrown around (maybe rightly) in this
> thread!

If I'm not mistaken, you introduced zones into this thread, you are the
primary (possibly only) proponent of them. Yes, zones are a way of
categorizing memory. They're not a way of defragmenting it.

Rob

2005-11-03 03:12:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary

Mel Gorman wrote:

>
> Ok. To me, the rest of the thread are beating around the same points and
> no one is giving ground. The points are made so lets summarise. Apologies
> if anything is missing.
>

Thanks for attempting a summary of a difficult topic. I have a couple
of suggestions.

> Who cares
> =========
> Physical hotplug remove: Vendors of the hardware that support this -
> Fujitsu, HP (I think), IBM etc
>
> Virtualization hotplug remove: Sellers of virtualization software, some
> hardware like any IBM machine that lists LPAR in it's list of
> features. Probably software solutions like Xen are also affected
> if they want to be able to grow and shrink the virtual machines on
> demand
>

Ingo said that Xen is fine with per page granular freeing - this covers
embedded, desktop and small server users of VMs into the future I'd say.

> High order allocations: Ultimately, hugepage users. Today, that is a
> feature only big server users like Oracle care about. In the
> future I reckon applications will be able to use them for things
> like backing the heap by huge pages. Other users like GigE,
> loopback devices with large MTUs, some filesystem like CIFS are
> all interested although they are also being told to use smaller
> pages.
>

I think that saying it's now OK to use higher order allocations is wrong
because as I said even with your patches they are going to run into
problems.

Actually I think one reason your patches may perform so well is because
there aren't actually a lot of higher order allocations in the kernel.

I think that probably leaves us realistically with demand hugepages,
hot unplug memory, and IBM lpars?


> Pros/Cons of Solutions
> ======================
>
> Anti-defrag Pros
> o Aim9 shows no significant regressions (.37% on page_test). On some
> tests, it shows performance gains (> 5% on fork_test)
> o Stress tests show that it manages to keep fragmentation down to a far
> lower level even without teaching kswapd how to linear reclaim

This sounds like a kind of funny test to me if nobody is actually
using higher order allocations.

When a higher order allocation is attempted, either you will satisfy
it from the kernel region, in which case the vanilla kernel would
have done the same. Or you satisfy it from an easy-reclaim contiguous
region, in which case it is no longer an easy-reclaim contiguous
region.

> o Stress tests with a linear reclaim experimental patch shows that it
> can successfully find large contiguous chunks of memory
> o It is known to help hotplug on PPC64
> o No tunables. The approach tries to manage itself as much as possible

But it has more dreaded heuristics :P

> o It exists, heavily tested, and synced against the latest -mm1
> o Can be compiled away be redefining the RCLM_* macros and the
> __GFP_*RCLM flags
>
> Anti-defrag Cons
> o More complexity within the page allocator
> o Adds a new layer onto the allocator that effectively creates subzones
> o Adding a new concept that maintainers have to work with
> o Depending on the workload, it fragments anyway
>
> New Zone Pros
> o Zones are a well known and understood concept
> o For people that do not care about hotplug, they can easily get rid of it
> o Provides reliable areas of contiguous groups that can be freed for
> HugeTLB pages going to userspace
> o Uses existing zone infrastructure for balancing
>
> New Zone Cons
> o Zones historically have introduced balancing problems
> o Been tried for hotplug and dropped because of being awkward to work with
> o It only helps hotplug and potentially HugeTLB pages for userspace
> o Tunable required. If you get it wrong, the system suffers a lot

Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
them get it right.

> o Needs to be planned for and developed
>

Yasunori Goto had patches around from last year. Not sure what sort
of shape they're in now but I'd think most of the hard work is done.

> Scenarios
> =========
>
> Lets outline some situations then or workloads that can occur
>
> 1. Heavy job running that consumes 75% of physical memory. Like a kernel
> build
>
> Anti-defrag: It will not fragment as it will never have to fall back. High
> order allocations will be possible in the remaining 25%.
> Zone-based: After being tuned to a kernel build load, it will not
> fragment. Get the tuning wrong, performance suffers or workload
> fails. High order allocations will be possible in the remaining 25%.
>

You don't need to continually tune things for each and every possible
workload under the sun. It is like how we currently drive 16GB highmem
systems quite nicely under most workloads with 1GB of normal memory.
Make that an 8:1 ratio if you're worried.

[snip]

>
> I've tried to be as objective as possible with the summary.
>
> From the points above though, I think that anti-defrag gets us a lot of
> the way, with the complexity isolated in one place. Its downside is that
> it can still break down and future work is needed to stop it degrading
> (kswapd cleaning UserRclm areas and page migration when we get really
> stuck). Zone-based is more reliable but only addresses a limited
> situation, principally hotplug and it does not even go 100% of the way for
> hotplug.

To me it seems like it solves the hotplug, lpar hotplug, and hugepages
problems which seem to be the main ones.

> It also depends on a tunable which is not cool and it is static.

I think it is very cool because it means the tiny minority of Linux
users who want this can do so without impacting the rest of the code
or users. This is how Linux has been traditionally run and I still
have a tiny bit of faith left :)

> If we make the zones growable+shrinkable, we run into all the same
> problems that anti-defrag has today.
>

But we don't have the extra zones layer that anti defrag has today.

And anti defrag needs limits if it is to be reliable anyway.

--
SUSE Labs, Novell Inc.


2005-11-03 04:35:56

by Jeff Dike

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> With fragmentation reduction and prezeroing, UML suddenly gains the option of
> calling madvise(DONT_NEED) on sufficiently large blocks as A) a fast way of
> prezeroing, B) a way of giving memory back to the host OS when it's not in
> use.

DONT_NEED is insufficient. It doesn't discard the data in dirty
file-backed pages.

Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
which does do the trick, and I have a UML patch which adds memory
hotplug. This combination does free memory back to the host.
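
For comparison, a two-line sketch of the difference on a mapping like the one
UML uses (reusing mem and len from the earlier mmap sketch), assuming a kernel
with the MADV_REMOVE test patch applied; the behaviour is as described in this
thread, not checked against whatever interface finally gets merged:

        madvise(mem, len, MADV_DONTNEED); /* drops the mappings, but dirty
                                             file-backed data survives */
        madvise(mem, len, MADV_REMOVE);   /* also discards the backing pages,
                                             so the host really gets them back */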

Jeff

2005-11-03 04:41:44

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Rob Landley wrote:

> In the UML case, I want the system to automatically be able to hand back any
> sufficiently large chunks of memory it currently isn't using.
>

I'd just be happy with UML handing back page sized chunks of memory that
it isn't currently using. How does contiguous memory (in either the host
or the guest) help this?

> What does this have to do with specifying hard limits of anything? What's to
> specify? Workloads vary. Deal with it.
>

Umm, if you hadn't bothered to read the thread then I won't go through
it all again. The short of it is that if you want guaranteed unfragmented
memory you have to specify a limit.

>
>>If there are zone rebalancing problems[*], then it would be great to
>>have more users of zones because then they will be more likely to get
>>fixed.
>
>
> Ok, so you want to artificially turn this into a zone balancing issue in hopes
> of giving that area of the code more testing when, if zones weren't involved,
> there would be no need for balancing at all?
>
> How does that make sense?
>

Have you looked at the frag patches? Do you realise that they have to
balance between the different types of memory blocks? Duplicating the
same or similar infrastructure (in this case, a memory zoning facility)
is a bad thing in general.

>
>>[*] and there are, sadly enough - see the recent patches I posted to
>> lkml for example.
>
>
> I was under the impression that zone balancing is, conceptually speaking, a
> difficult problem.
>

I am under the impression that you think proper fragmentation avoidance
is easier.

>
>> But I'm fairly confident that once the particularly
>> silly ones have been fixed,
>
>
> Great, you're advocating migrating the fragmentation patches to an area of
> code that has known problems you yourself describe as "particularly silly".
> A ringing endorsement, that.
>

Err, the point is so we don't now have 2 layers doing very similar things,
at least one of which has "particularly silly" bugs in it.

> The fact that the migrated version wouldn't even address fragmentation
> avoidance at all (the topic of this thread!) is apparently a side issue.
>

Zones can be used to guarantee physically contiguous regions with exactly
the same effectiveness as the frag patches.

>
>> zone balancing will no longer be a
>> derogatory term as has been thrown around (maybe rightly) in this
>> thread!
>
>
> If I'm not mistaken, you introduced zones into this thread, you are the
> primary (possibly only) proponent of them.

So you didn't look at Yasunori Goto's patch from last year that implements
exactly what I described, then?

> Yes, zones are a way of categorizing memory.

Yes, have you read Mel's patches? Guess what they do?

> They're not a way of defragmenting it.

Guess what they don't?



2005-11-03 05:42:43

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wednesday 02 November 2005 23:26, Jeff Dike wrote:
> On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> > With fragmentation reduction and prezeroing, UML suddenly gains the
> > option of calling madvise(DONT_NEED) on sufficiently large blocks as A) a
> > fast way of prezeroing, B) a way of giving memory back to the host OS
> > when it's not in use.
>
> DONT_NEED is insufficient. It doesn't discard the data in dirty
> file-backed pages.

I thought DONT_NEED would discard the page cache, and punch was only needed to
free up the disk space.

I was hoping that since the file was deleted from disk and is already getting
_some_ special treatment (since it's a longstanding "poor man's shared
memory" hack), that madvise wouldn't flush the data to disk, but would just
zero it out. A bit optimistic on my part, I know. :)

> Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
> which does do the trick, and I have a UML patch which adds memory
> hotplug. This combination does free memory back to the host.

I saw it wander by, and am all for it. If it goes in, it's obviously the
right thing to use. You may remember I asked about this two years ago:
http://seclists.org/lists/linux-kernel/2003/Dec/0919.html

And a reply indicated that SVr4 had it, but we don't. I assume the "naming
discussion" mentioned in the recent thread already scrubbed through this old
thread to determine that the SVr4 API was icky.
http://seclists.org/lists/linux-kernel/2003/Dec/0955.html

> Jeff

Rob

2005-11-03 06:08:39

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
> Rob Landley wrote:
> > In the UML case, I want the system to automatically be able to hand back
> > any sufficiently large chunks of memory it currently isn't using.
>
> I'd just be happy with UML handing back page sized chunks of memory that
> it isn't currently using. How does contiguous memory (in either the host
> or the guest) help this?

Smaller chunks of memory are likely to be reclaimed really soon, and adding in
the syscall overhead working with individual pages of memory is almost
guaranteed to slow us down. Plus with punch, we'd be fragmenting the heck
out of the underlying file.

> > What does this have to do with specifying hard limits of anything?
> > What's to specify? Workloads vary. Deal with it.
>
> Umm, if you hadn't bothered to read the thread then I won't go through
> it all again. The short of it is that if you want guaranteed unfragmented
> memory you have to specify a limit.

I read it. It just didn't contain an answer to the question. I want UML to
be able to hand back however much memory it's not using, but handing back
individual pages as we free them and inserting a syscall overhead for every
page freed and allocated is just nuts. (Plus, at page size, the OS isn't
likely to zero them much faster than we can ourselves even without the
syscall overhead.) Defragmentation means we can batch this into a
granularity that makes it worth it.

This has nothing to do with hard limits on anything.

> Have you looked at the frag patches?

I've read Mel's various descriptions, and tried to stay more or less up to
date ever since LWN brought it to my attention. But I can't say I'm a linux
VM system expert. (The last time I felt I had a really firm grasp on it was
before Andrea and Rik started arguing circa 2.4 and Andrea spent six months
just assuming everybody already knew what a classzone was. I've had other
things to do since then...)

> Do you realise that they have to
> balance between the different types of memory blocks?

I realise they merge them back together into larger chunks as they free up
space, and split larger chunks when they haven't got a smaller one.

> Duplicating the
> same or similar infrastructure (in this case, a memory zoning facility)
> is a bad thing in general.

Even when they keep track of very different things? The memory zoning thing
is about where stuff is in physical memory, and it exists because various
hardware that wants to access memory (24 bit DMA, 32 bit DMA, and PAE) is
evil and crippled and we have to humor it by not asking it to do stuff it
can't.

The fragmentation stuff is about what long contiguous runs of free memory we
can arrange, and it's also nice to be able to categorize them as "zeroed" or
"not zeroed" to make new allocations faster. Where they actually are in
memory is not at issue here.

You can have prezeroed memory in 32 bit DMA space, and prezeroed memory in
highmem, but there's memory in both that isn't prezeroed. I thought there
was a hierarchy of zones. You want overlapping, interlaced, randomly laid
out zones.

> >>[*] and there are, sadly enough - see the recent patches I posted to
> >> lkml for example.
> >
> > I was under the impression that zone balancing is, conceptually speaking,
> > a difficult problem.
>
> I am under the impression that you think proper fragmentation avoidance
> is easier.

I was under the impression it was orthogonal to figuring out whether or not a
given bank of physical memory is accessible to your sound blaster without an
IOMMU.

> >> But I'm fairly confident that once the particularly
> >> silly ones have been fixed,
> >
> > Great, you're advocating migrating the fragmentation patches to an area
> > of code that has known problems you yourself describe as "particularly
> > silly". A ringing endorsement, that.
>
> Err, the point is so we don't now have 2 layers doing very similar things,
> at least one of which has "particularly silly" bugs in it.

Similar is not identical. You seem to be implying that the IO elevator and
the network stack queueing should be merged because they do similar things.

> > The fact that the migrated version wouldn't even address fragmentation
> > avoidance at all (the topic of this thread!) is apparently a side issue.
>
> Zones can be used to guarantee physically contiguous regions with exactly
> the same effectiveness as the frag patches.

If you'd like to write a counter-patch to Mel's to prove it...

> >> zone balancing will no longer be a
> >> derogatory term as has been thrown around (maybe rightly) in this
> >> thread!
> >
> > If I'm not mistaken, you introduced zones into this thread, you are the
> > primary (possibly only) proponent of them.
>
> So you didn't look at Yasunori Goto's patch from last year that implements
> exactly what I described, then?

I saw the patch he just posted, if that's what you mean. By his own
admission, it doesn't address fragmentation at all.

> > Yes, zones are a way of categorizing memory.
>
> Yes, have you read Mel's patches? Guess what they do?

The swap file is a way of storing data on disk. So is ext3. Obviously, one
is a trivial extension of the other and there's no reason to have both.

> > They're not a way of defragmenting it.
>
> Guess what they don't?

I have no idea what you intended to mean by that. Mel posted a set of patches
in a thread titled "fragmentation avoidance", and you've been arguing about
hotplug, and pointing to a set of patches from Goto that do not address
fragmentation at all. This confuses me.

Rob

2005-11-03 07:32:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Rob Landley wrote:
> On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
>

>>I'd just be happy with UML handing back page sized chunks of memory that
>>it isn't currently using. How does contiguous memory (in either the host
>>or the guest) help this?
>
>
> Smaller chunks of memory are likely to be reclaimed really soon, and adding in
> the syscall overhead working with individual pages of memory is almost
> guaranteed to slow us down.

Because UML doesn't already make a syscall per individual page of
memory freed? (If I read correctly)

> Plus with punch, we'd be fragmenting the heck
> out of the underlying file.
>

Why? No you wouldn't.

>
>>>What does this have to do with specifying hard limits of anything?
>>>What's to specify? Workloads vary. Deal with it.
>>
>>Umm, if you hadn't bothered to read the thread then I won't go through
>>it all again. The short of it is that if you want guaranteed unfragmented
>>memory you have to specify a limit.
>
>
> I read it. It just didn't contain an answer to the question. I want UML to
> be able to hand back however much memory it's not using, but handing back
> individual pages as we free them and inserting a syscall overhead for every
> page freed and allocated is just nuts. (Plus, at page size, the OS isn't
> likely to zero them much faster than we can ourselves even without the
> syscall overhead.) Defragmentation means we can batch this into a
> granularity that makes it worth it.
>

Oh you have measured it and found out that "defragmentation" makes
it worthwhile?

> This has nothing to do with hard limits on anything.
>

You said:

"What does this have to do with specifying hard limits of
anything? What's to specify? Workloads vary. Deal with it."

And I was answering your very polite questions.

>
>>Have you looked at the frag patches?
>
>
> I've read Mel's various descriptions, and tried to stay more or less up to
> date ever since LWN brought it to my attention. But I can't say I'm a linux
> VM system expert. (The last time I felt I had a really firm grasp on it was
> before Andrea and Rik started arguing circa 2.4 and Andrea spent six months
> just assuming everybody already knew what a classzone was. I've had other
> things to do since then...)
>

Maybe you have better things to do now as well?

>>Duplicating the
>>same or similar infrastructure (in this case, a memory zoning facility)
>>is a bad thing in general.
>
>
> Even when they keep track of very different things? The memory zoning thing
> is about where stuff is in physical memory, and it exists because various
> hardware that wants to access memory (24 bit DMA, 32 bit DMA, and PAE) is
> evil and crippled and we have to humor it by not asking it to do stuff it
> can't.
>

No, the buddy allocator is and always has been what tracks the "long
contiguous runs of free memory". Both zones and Mel's patches classify
blocks of memory according to some criteria. They're not exactly the
same obviously, but they're equivalent in terms of capability to
guarantee contiguous freeable regions.

>
> I was under the impression it was orthogonal to figuring out whether or not a
> given bank of physical memory is accessible to your sound blaster without an
> IOMMU.
>

Huh?

>>Err, the point is so we don't now have 2 layers doing very similar things,
>>at least one of which has "particularly silly" bugs in it.
>
>
> Similar is not identical. You seem to be implying that the IO elevator and
> the network stack queueing should be merged because they do similar things.
>

No I don't.

>
> If you'd like to write a counter-patch to Mel's to prove it...
>

It has already been written as you have been told numerous times.

Now if you'd like to actually learn about what you're commenting on,
that would be really good too.

>>So you didn't look at Yasunori Goto's patch from last year that implements
>>exactly what I described, then?
>
>
> I saw the patch he just posted, if that's what you mean. By his own
> admission, it doesn't address fragmentation at all.
>

It seems to me that it provides exactly the same (actually stronger)
guarantees as the current frag patches do. Or were you going to point
out a bug in the implementation?

>
>>>Yes, zones are a way of categorizing memory.
>>
>>Yes, have you read Mel's patches? Guess what they do?
>
>
> The swap file is a way of storing data on disk. So is ext3. Obviously, one
> is a trivial extension of the other and there's no reason to have both.
>

Don't try to bullshit your way around with stupid analogies please, it
is an utter waste of time.

>
>>>They're not a way of defragmenting it.
>>
>>Guess what they don't?
>
>
> I have no idea what you intended to mean by that. Mel posted a set of patches

What I mean is that Mel's patches aren't a way of defragmenting memory either.
They fit exactly the description you gave for zones (ie. a way of categorizing,
not defragmenting).

> in a thread titled "fragmentation avoidance", and you've been arguing about
> hotplug, and pointing to a set of patches from Goto that do not address
> fragmentation at all. This confuses me.
>

Yeah it does seem like you are confused.

Now let's finish up this subthread and try to keep the SN ratio up, please?
I'm sure Jeff or someone knowledgeable in the area can chime in if there are
concerns about UML.


2005-11-03 12:19:36

by Mel Gorman

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary

On Thu, 3 Nov 2005, Nick Piggin wrote:

> Mel Gorman wrote:
>
> >
> > Ok. To me, the rest of the thread are beating around the same points and
> > no one is giving ground. The points are made so lets summarise. Apologies
> > if anything is missing.
> >
>
> Thanks for attempting a summary of a difficult topic. I have a couple
> of suggestions.
>
> > Who cares
> > =========
> > Physical hotplug remove: Vendors of the hardware that support this -
> > Fujitsu, HP (I think), IBM etc
> >
> > Virtualization hotplug remove: Sellers of virtualization software, some
> > hardware like any IBM machine that lists LPAR in it's list of
> > features. Probably software solutions like Xen are also affected
> > if they want to be able to grow and shrink the virtual machines on
> > demand
> >
>
> Ingo said that Xen is fine with per page granular freeing - this covers
> embedded, desktop and small server users of VMs into the future I'd say.
>

Ok, hard to argue with that.

> > High order allocations: Ultimately, hugepage users. Today, that is a
> > feature only big server users like Oracle care about. In the
> > future I reckon applications will be able to use them for things
> > like backing the heap by huge pages. Other users like GigE,
> > loopback devices with large MTUs, some filesystem like CIFS are
> > all interested although they are also been told use use smaller
> > pages.
> >
>
> I think that saying its now OK to use higher order allocations is wrong
> because as I said even with your patches they are going to run into
> problems.
>

Ok, I have not denied that they will run into problems. I have asserted
that, with more work built upon these patches, we can grant large pages
with a good degree of reliability. Subsystems should still use small
orders whenever possible and at the very least, large orders should be
short-lived.

For userspace users, I would like to move towards better availability of
huge pages without requiring the boot-time tunables that are needed today.
Do we agree that this would be useful at least for a few different users?

HugeTLB user 1: Todays users of hugetlbfs like big databases etc
HugeTLB user 2: HPC jobs that run with sparse data sets
HugeTLB user 3: Desktop applications that use large amounts of address space.

I got a mail from a user of category 2. He said I can quote his email, but
he didn't say I could quote his name, which is inconvenient, but I'm sure he
has good reasons.

To him, low fragmentation is "critical, at least in HPC environments".
Here is the core of his issue;

--- excerpt ---
Take the scenario that you have a large machine that is
used by multiple users, and the usage is regulated by a batch
scheduler. Loadleveler on ibm's for example. PBS on many
others. Both appear to be available in linux environments.

In the case of my codes, I find that having large pages is
extremely beneficial to my run times. As in factors of several,
modulo things that I've coded in by hand to try and avoid the
issues. I don't think my code is in any way unusual in this
magnitude of improvement.
--- excerpt ---

ok, so we have two potential solutions, anti-defrag and zones. We don't
need to rehash the pros and cons. With zones, we just say "just reclaim
the easy reclaim zone, alloc your pages and away we go".

Now, his problem is that the server is not restarted between jobs, and
jobs take days and weeks to complete. The system administrators will not
restart the machine, so getting it to a pristine state is difficult. The
state he gets the system in is the state he works with and, with
fragmentation, he doesn't get large pages unless he is lucky enough to be
the first user of the machine.

With the zone approach, we would just be saying "tune it". Here is what he
says about that

--- excerpt ---
I specifically *don't* want things that I have to beg sysadmins to
tune correctly. They won't get it right because there is no `right'
that is right for everyone. They won't want to change it and it
won't work besides. Been there, done that. My experience is that
with linux so far, and some other non-linux machines too, they
always turn all the page stuff off because it breaks the machine.
--- excerpt ---

This is an example of a real user for whom "tune the size of your zone
correctly" is just not good enough. He makes a novel suggestion on how
anti-defrag + hotplug could be used.

--- excerpt ---
In the context of hotplug stuff and fragmentation avoidance,
this sort of reset would be implemented by performing the
the first step in the hot unplug, to migrate everything off
of that memory, including whatever kernel pages that exist
there, but not the second step. Just leave that memory plugged
in and reset the memory to a sane initial state. Essentially
this would be some sort of pseudo hotunplug followed by a pseudo
hotplug of that memory.
--- excerpt ---

I'm pretty sure this is not what hotplug was aimed at, but it would get him
what he wants: at the least, large pages via echo BigNumber > nr_hugepages.
It also needs hotplug remove to be working for some banks and regions of
memory, although not for the 100% case.
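
For illustration, a minimal sketch of what the userspace side of that looks
like once huge pages are available; the mount point, file name and 2MB page
size below are assumptions, not anything from the patches under discussion:

/*
 * Minimal hugetlbfs consumer sketch.  Assumes something like:
 *     echo 20 > /proc/sys/vm/nr_hugepages
 *     mount -t hugetlbfs none /mnt/huge
 * The mount point, file name and 2MB page size are illustrative only.
 */
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

int main(void)
{
	size_t len = 4 * HPAGE_SIZE;
	void *p;
	int fd;

	fd = open("/mnt/huge/scratch", O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return 1;

	/* each page of this mapping is backed by one huge page from the pool */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;	/* pool empty: this is the failure being discussed */

	memset(p, 0xab, len);	/* touch it through the huge mapping */

	munmap(p, len);
	close(fd);
	unlink("/mnt/huge/scratch");
	return 0;
}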

Ok, this is one example of a user with scientific workloads for whom "tune the
size of the zone" just is not good enough. The admins won't do it for him
because it'll just break for the next scheduled job.

> Actually I think one reason your patches may perform so well is because
> there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages,
> hot unplug memory, and IBM lpars?
>


>
> > Pros/Cons of Solutions
> > ======================
> >
> > Anti-defrag Pros
> > o Aim9 shows no significant regressions (.37% on page_test). On some
> > tests, it shows performance gains (> 5% on fork_test)
> > o Stress tests show that it manages to keep fragmentation down to a far
> > lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually
> using higher order allocations.
>

No one uses them because they always fail. This is a chicken and egg
problem.

> When a higher order allocation is attempted, either you will satisfy
> it from the kernel region, in which case the vanilla kernel would
> have done the same. Or you satisfy it from an easy-reclaim contiguous
> region, in which case it is no longer an easy-reclaim contiguous
> region.
>

Right, but right now, we say "don't use high order allocations ever". With
work, we'll be saying "ok, use high order allocations but they should be
short lived or you won't be allocating them for long"

> > o Stress tests with a linear reclaim experimental patch shows that it
> > can successfully find large contiguous chunks of memory
> > o It is known to help hotplug on PPC64
> > o No tunables. The approach tries to manage itself as much as possible
>
> But it has more dreaded heuristics :P
>

Yeah, but if it gets them wrong, the system chugs along anyway, just
fragmented like it is today. If the zone-based approach gets it wrong, the
system goes down the tubes.

At very worst, the patches give a kernel allocator that is as good as
today's. At very worst, the zone-based approach makes an unusable system.
The performance of the patches is another story. I've been posting aim9
figures based on my test machine. I'm trying to kick an ancient PowerPC
43P Model 150 machine into working. This machine is a different
architecture and ancient (I found it on the way to a skip) so should give
different figures.

> > o It exists, heavily tested, and synced against the latest -mm1
> > o Can be compiled away be redefining the RCLM_* macros and the
> > __GFP_*RCLM flags
> >
> > Anti-defrag Cons
> > o More complexity within the page allocator
> > o Adds a new layer onto the allocator that effectively creates subzones
> > o Adding a new concept that maintainers have to work with
> > o Depending on the workload, it fragments anyway
> >
> > New Zone Pros
> > o Zones are a well known and understood concept
> > o For people that do not care about hotplug, they can easily get rid of it
> > o Provides reliable areas of contiguous groups that can be freed for
> > HugeTLB pages going to userspace
> > o Uses existing zone infrastructure for balancing
> >
> > New Zone Cons
> > o Zones historically have introduced balancing problems
> > o Been tried for hotplug and dropped because of being awkward to work with
> > o It only helps hotplug and potentially HugeTLB pages for userspace
> > o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
> them get it right.
>

Unless you work in a place where the sysadmins will tell you to go away,
such as the HPC user above. I'm not a sysadmin, but I'm pretty sure they
have better things to do than twiddle a tunable all day.

> > o Needs to be planned for and developed
> >
>
> Yasunori Goto had patches around from last year. Not sure what sort
> of shape they're in now but I'd think most of the hard work is done.
>

But Yasunori (thanks for sending the links) himself said when he posted them:

--- excerpt ---
Another one was a bit similar than Mel-san's one.
One of motivation of this patch was to create orthogonal relationship
between Removable and DMA/Normal/Highmem. I thought it is desirable.
Because, ppc64 can treat that all of memory is same (DMA) zone.
I thought that new zone spoiled its good feature.
--- excerpt ---

He thought that the new zone removed the ability of some architectures to
treat all memory the same. My patches give some of the benefits of using
another zone while still preserving an architecture's ability to
treat all memory the same.

> > Scenarios
> > =========
> >
> > Lets outline some situations then or workloads that can occur
> >
> > 1. Heavy job running that consumes 75% of physical memory. Like a kernel
> > build
> >
> > Anti-defrag: It will not fragment as it will never have to fallback.High
> > order allocations will be possible in the remaining 25%.
> > Zone-based: After been tuned to a kernel build load, it will not
> > fragment. Get the tuning wrong, performance suffers or workload
> > fails. High order allocations will be possible in the remaining 25%.
> >
>
> You don't need to continually tune things for each and every possible
> workload under the sun. It is like how we currently drive 16GB highmem
> systems quite nicely under most workloads with 1GB of normal memory.
> Make that an 8:1 ratio if you're worried.
>
> [snip]
>
> >
> > I've tried to be as objective as possible with the summary.
> >
> > > From the points above though, I think that anti-defrag gets us a lot of
> > the way, with the complexity isolated in one place. It's downside is that
> > it can still break down and future work is needed to stop it degrading
> > (kswapd cleaning UserRclm areas and page migration when we get really
> > stuck). Zone-based is more reliable but only addresses a limited
> > situation, principally hotplug and it does not even go 100% of the way for
> > hotplug.
>
> To me it seems like it solves the hotplug, lpar hotplug, and hugepages
> problems which seem to be the main ones.
>
> > It also depends on a tunable which is not cool and it is static.
>
> I think it is very cool because it means the tiny minority of Linux
> users who want this can do so without impacting the rest of the code
> or users. This is how Linux has been traditionally run and I still
> have a tiny bit of faith left :)
>

The impact on the code and users will depend on benchmarks. I've posted
benchmarks that show either very small regressions or else performance
gains. As I write this, some of the aim9 benchmarks have completed on the
PowerPC.

This is a comparison between 2.6.14-rc5-mm1 and
2.6.14-rc5-mm1-mbuddy-v19-defragDisabledViaConfig

1 creat-clo 73500.00 72504.58 -995.42 -1.35% File Creations and Closes/second
2 page_test 30806.13 31076.49 270.36 0.88% System Allocations & Pages/second
3 brk_test 335299.02 341926.35 6627.33 1.98% System Memory Allocations/second
4 jmp_test 1641733.33 1644566.67 2833.34 0.17% Non-local gotos/second
5 signal_test 100883.19 98900.18 -1983.01 -1.97% Signal Traps/second
6 exec_test 116.53 118.44 1.91 1.64% Program Loads/second
7 fork_test 751.70 746.84 -4.86 -0.65% Task Creations/second
8 link_test 30217.11 30463.82 246.71 0.82% Link/Unlink Pairs/second

Performance gains on page_test, brk_test and exec_test. Even with
variances between tests, we are looking at "more or less the same", not
regressions. No user impact there.

This is a comparison between 2.6.14-rc5-mm1 and
2.6.14-rc5-mm1-mbuddy-v19-withantidefrag

1 creat-clo 73500.00 71188.14 -2311.86 -3.15% File Creations and Closes/second
2 page_test 30806.13 31060.96 254.83 0.83% System Allocations & Pages/second
3 brk_test 335299.02 344361.15 9062.13 2.70% System Memory Allocations/second
4 jmp_test 1641733.33 1627228.80 -14504.53 -0.88% Non-local gotos/second
5 signal_test 100883.19 100233.33 -649.86 -0.64% Signal Traps/second
6 exec_test 116.53 117.63 1.10 0.94% Program Loads/second
7 fork_test 751.70 763.73 12.03 1.60% Task Creations/second
8 link_test 30217.11 30322.10 104.99 0.35% Link/Unlink Pairs/second

Performance gains on page_test, brk_test, exec_test and fork_test. Not bad
going for complex overhead. creat-clo took a beating, but what workload
opens and closes files at that rate?

This is an old, small machine. If I hotplug this, I'll be lucky if it ever
turns on again. The aim9 benchmarks on two machines show that there is
similar and, in some cases better, performance with these patches. If a
workload does suffer badly, an additional patch has been supplied that
disables anti-defrag. A run in -mm will tell us if this is the general
case for machines or are my two test boxes running on magic beans.

So, the small number of users that want this, get this. The rest of the
users, who just run the code, should not notice or care. This brings us
back to the main stickler, code complexity. I think that the code has been
very well isolated from the core allocator code, and people looking at the
allocator could avoid it if they really wanted while still knowing what
the buddy allocator was doing.

> > If we make the zones growable+shrinkable, we run into all the same
> > problems that anti-defrag has today.
> >
>
> But we don't have the extra zones layer that anti defrag has today.
>

So, we just have an extra layer on the side that has to be configured. All
of the problems with all of the configuration.

> And anti defrag needs limits if it is to be reliable anyway.
>

I'm confident given time that I can make this manage itself with a very
good degree of reliability.

--
Mel Gorman
Part-time Phd Student, University of Limerick
Java Applications Developer, IBM Dublin Software Lab

2005-11-03 15:34:42

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary

>> Physical hotplug remove: Vendors of the hardware that support this -
>> Fujitsu, HP (I think), IBM etc
>>
>> Virtualization hotplug remove: Sellers of virtualization software, some
>> hardware like any IBM machine that lists LPAR in it's list of
>> features. Probably software solutions like Xen are also affected
>> if they want to be able to grow and shrink the virtual machines on
>> demand
>
> Ingo said that Xen is fine with per page granular freeing - this covers
> embedded, desktop and small server users of VMs into the future I'd say.

Not using large page mappings for the kernel area will be a substantial
performance hit. It's a less efficient approach inside the hypervisor,
and not all VMs / hardware can support it.

>> High order allocations: Ultimately, hugepage users. Today, that is a
>> feature only big server users like Oracle care about. In the
>> future I reckon applications will be able to use them for things
>> like backing the heap by huge pages. Other users like GigE,
>> loopback devices with large MTUs, some filesystem like CIFS are
>> all interested although they are also been told use use smaller
>> pages.
>
> I think that saying its now OK to use higher order allocations is wrong
> because as I said even with your patches they are going to run into
> problems.
>
> Actually I think one reason your patches may perform so well is because
> there aren't actually a lot of higher order allocations in the kernel.
>
> I think that probably leaves us realistically with demand hugepages,
> hot unplug memory, and IBM lpars?

Sigh. You seem obsessed with this. There are various critical places in
the kernel that use higher order allocations. Yes, they're normally
smaller ones rather than larger ones, but .... please try re-reading
the earlier portions of this thread. You are NOT going to be able to
get rid of all higher-order allocations - please quit pretending you
can - living in denial is not going to help us.

If you really, really believe you can do that, please go ahead and prove
it. Until that point, please let go of the "it's only for a few specialized
users" arguement, and acknowledge we DO actually use higher order allocs
in the kernel right now.

>> o Aim9 shows no significant regressions (.37% on page_test). On some
>> tests, it shows performance gains (> 5% on fork_test)
>> o Stress tests show that it manages to keep fragmentation down to a far
>> lower level even without teaching kswapd how to linear reclaim
>
> This sounds like a kind of funny test to me if nobody is actually
> using higher order allocations.

It's a regression test. To, like, test for regressions in the normal
case ;-)

>> New Zone Cons
>> o Zones historically have introduced balancing problems
>> o Been tried for hotplug and dropped because of being awkward to work with
>> o It only helps hotplug and potentially HugeTLB pages for userspace
>> o Tunable required. If you get it wrong, the system suffers a lot
>
> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
> them get it right.

Having met some of them ... that's not a pro ;-) We have quite enough
meaningless tunables already. And to be honest, the bigger problem is
that it's a problem with no correct answer - workloads shift day vs.
night, etc.

> You don't need to continually tune things for each and every possible
> workload under the sun. It is like how we currently drive 16GB highmem
> systems quite nicely under most workloads with 1GB of normal memory.
> Make that an 8:1 ratio if you're worried.

Thanks for turning my 64 bit system back into a 32 bit one. Really
appreciate that. Note the last 5 years of endless whining about all
the problems with large 32 bit systems, and how they're unfixable
and we should all move to 64 bit please.

> To me it seems like it solves the hotplug, lpar hotplug, and hugepages
> problems which seem to be the main ones.

That's because you're not listening, you're going on your own preconceived
notions ...

> I think it is very cool because it means the tiny minority of Linux
> users who want this can do so without impacting the rest of the code
> or users.

Ditto.

M.

2005-11-03 15:44:36

by Jeff Dike

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote:
> I want UML to
> be able to hand back however much memory it's not using, but handing back
> individual pages as we free them and inserting a syscall overhead for every
> page freed and allocated is just nuts. (Plus, at page size, the OS isn't
> likely to zero them much faster than we can ourselves even without the
> syscall overhead.) Defragmentation means we can batch this into a
> granularity that makes it worth it.

I don't think that freeing pages back to the host in free_pages is the
way to go. The normal behavior for a Linux system, virtual or
physical, is to use all the memory it has. So, any memory that's
freed is pretty likely to be reused for something else, wasting any
effort that's made to free pages back to the host.

The one counter-example I can think of is when a large process with a
lot of data exits. Then its data pages will be freed and they may
stay free for a while until the system finds other data to fill them
with.

Also, it's not the virtual machine's job to know how to make the host
perform optimally. It doesn't have the information to do it. It's
perfectly OK for a UML to hang on to memory if the host has plenty
free. So, it's the host's job to make sure that its memory pressure
is reflected to the UMLs.

My current thinking is that you'll have a daemon on the host keeping
track of memory pressure on the host and the UMLs, plugging and
unplugging memory in order to keep the busy machines, including the
host, supplied with memory, and periodically pushing down the memory
of idle UMLs in order to force them to GC their page caches.

With Badari's patch and UML memory hotplug, the infrastructure is
there to make this work. The one thing I'm puzzling over right now is
how to measure memory pressure.

Jeff

2005-11-03 16:23:52

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thu, 2005-11-03 at 11:35 -0500, Jeff Dike wrote:
> On Thu, Nov 03, 2005 at 12:07:33AM -0600, Rob Landley wrote:
> > I want UML to
> > be able to hand back however much memory it's not using, but handing back
> > individual pages as we free them and inserting a syscall overhead for every
> > page freed and allocated is just nuts. (Plus, at page size, the OS isn't
> > likely to zero them much faster than we can ourselves even without the
> > syscall overhead.) Defragmentation means we can batch this into a
> > granularity that makes it worth it.
>
> I don't think that freeing pages back to the host in free_pages is the
> way to go. The normal behavior for a Linux system, virtual or
> physical, is to use all the memory it has. So, any memory that's
> freed is pretty likely to be reused for something else, wasting any
> effort that's made to free pages back to the host.
>
> The one counter-example I can think of is when a large process with a
> lot of data exits. Then its data pages will be freed and they may
> stay free for a while until the system finds other data to fill them
> with.
>
> Also, it's not the virtual machine's job to know how to make the host
> perform optimally. It doesn't have the information to do it. It's
> perfectly OK for a UML to hang on to memory if the host has plenty
> free. So, it's the host's job to make sure that its memory pressure
> is reflected to the UMLs.
>
> My current thinking is that you'll have a daemon on the host keeping
> track of memory pressure on the host and the UMLs, plugging and
> unplugging memory in order to keep the busy machines, including the
> host, supplied with memory, and periodically pushing down the memory
> of idle UMLs in order to force them to GC their page caches.
>
> With Badari's patch and UML memory hotplug, the infrastructure is
> there to make this work. The one thing I'm puzzling over right now is
> how to measure memory pressure.

Yep. This is exactly the issue other product groups normally raise
on Linux. How do we measure memory pressure in Linux? Some of our
software products want to grow or shrink their memory usage depending
on the memory pressure in the system. Since most memory is used for
cache, "free" really doesn't indicate anything - they are monitoring
info in /proc/meminfo and swapping rates to "guess" at the memory
pressure. They want a clear way of finding out "how badly" the system
is under memory pressure. (As a starting point, they want to find out,
out of "cached" memory, how much is really easily "reclaimable"
under memory pressure - without swapping.) I know this is kind of
crazy, but interesting to think about :)
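
For illustration, here is roughly what that guesswork looks like from
userspace; treating MemFree + Buffers + Cached as "probably reclaimable"
is exactly the heuristic being described, not an answer the kernel provides:

/* Rough memory-pressure guess from /proc/meminfo.  This is the kind of
 * heuristic described above; it overcounts dirty/mapped page cache and
 * is in no way a kernel-provided answer. */
#include <stdio.h>
#include <string.h>

static unsigned long meminfo_kb(const char *field)
{
	char line[128];
	unsigned long val = 0;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, field, strlen(field)) == 0) {
			sscanf(line + strlen(field), "%lu", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long total = meminfo_kb("MemTotal:");
	unsigned long guess = meminfo_kb("MemFree:") +
			      meminfo_kb("Buffers:") +
			      meminfo_kb("Cached:");

	printf("total %lu kB, guessed reclaimable %lu kB (%lu%%)\n",
	       total, guess, total ? guess * 100 / total : 0);
	return 0;
}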

Thanks,
Badari

2005-11-03 17:38:19

by Jeff Dike

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thu, Nov 03, 2005 at 08:23:20AM -0800, Badari Pulavarty wrote:
> Yep. This is exactly the issue other product groups normally raise
> on Linux. How do we measure memory pressure in Linux? Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system.

I think this is wrong. Applications shouldn't be measuring host
memory pressure and trying to react to it.

This gives you no way to implement a global memory use policy - you
can't say "App X is the most important thing on the system and must
have all the memory it needs in order run as quickly as possible".

You can't establish any sort of priority between apps when it comes to
memory use, or change those priorities.

And how does this work when the system can change the amount of memory
that it has, such as when the app is inside a UML?

I think the right way to go is for willing apps to have an interface
through which they can be told "change your memory consumption by +-X"
and have a single daemon on the host tracking memory use and memory
pressure, and shuffling memory between the apps.

This allows the admin to set memory use priorities between the apps
and to exempt important ones from having memory pulled.

Measuring at the bottom and pushing memory pressure upwards also works
naturally for virtual machines and the apps running inside them. The
host will push memory pressure at the virtual machines, which in turn
will push that pressure at their apps.

With UML, I have an interface where a daemon on the host can add or
remove memory from an instance. I think the apps that are willing to
adjust should implement something similar.

Jeff

2005-11-03 17:54:57

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thursday 03 November 2005 01:34, Nick Piggin wrote:
> Rob Landley wrote:
> > On Wednesday 02 November 2005 22:43, Nick Piggin wrote:
> >>I'd just be happy with UML handing back page sized chunks of memory that
> >>it isn't currently using. How does contiguous memory (in either the host
> >>or the guest) help this?
> >
> > Smaller chunks of memory are likely to be reclaimed really soon, and
> > adding in the syscall overhead working with individual pages of memory is
> > almost guaranteed to slow us down.
>
> Because UML doesn't already make a syscall per individual page of
> memory freed? (If I read correctly)

UML does a big mmap to get "physical" memory, and then manages itself using
the normal Linux kernel mechanisms for doing so. We even have page tables,
although I'm still somewhat unclear on quite how that works.

> > Plus with punch, we'd be fragmenting the heck
> > out of the underlying file.
>
> Why? No you wouldn't.

Creating holes in the file and freeing up the underlying blocks on disk? 4k
at a time? Randomly scattered?

> > I read it. It just didn't contain an answer the the question. I want
> > UML to be able to hand back however much memory it's not using, but
> > handing back individual pages as we free them and inserting a syscall
> > overhead for every page freed and allocated is just nuts. (Plus, at page
> > size, the OS isn't likely to zero them much faster than we can ourselves
> > even without the syscall overhead.) Defragmentation means we can batch
> > this into a granularity that makes it worth it.
>
> Oh you have measured it and found out that "defragmentation" makes
> it worthwhile?

Lots of work has gone into batching up syscalls and making as few of them as
possible because they are a performance bottleneck. You want to introduce a
syscall for every single individual page of memory allocated or freed.

That's stupid.
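
To make the batching point concrete, a sketch (not UML's actual code) of the
difference being argued about: one madvise() over a contiguous run of free
guest pages versus one call per 4k page:

/* Sketch only -- not UML code.  Hand back a contiguous run of free
 * guest pages with a single call, versus one syscall per page. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	unsigned long pages = 1024;		/* a 4MB run, say */
	unsigned long i;
	char *run = mmap(NULL, pages * psz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (run == MAP_FAILED)
		return 1;

	/* batched: the whole coalesced run goes back in one syscall */
	madvise(run, pages * psz, MADV_DONTNEED);

	/* unbatched: what "a syscall per page freed" looks like */
	for (i = 0; i < pages; i++)
		madvise(run + i * psz, psz, MADV_DONTNEED);

	munmap(run, pages * psz);
	return 0;
}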

> > This has nothing to do with hard limits on anything.
>
> You said:
>
> "What does this have to do with specifying hard limits of
> anything? What's to specify? Workloads vary. Deal with it."
>
> And I was answering your very polite questions.

You didn't answer. You keep saying you've already answered, but there
continues to be no answer. Maybe you think you've answered, but I haven't
seen it yet. You brought up hard limits, I asked what that had to do with
anything, and in response you quote my question back at me.

> >>Have you looked at the frag patches?
> >
> > I've read Mel's various descriptions, and tried to stay more or less up
> > to date ever since LWN brought it to my attention. But I can't say I'm a
> > linux VM system expert. (The last time I felt I had a really firm grasp
> > on it was before Andrea and Rik started arguing circa 2.4 and Andrea
> > spent six months just assuming everybody already knew what a classzone
> > was. I've had other things to do since then...)
>
> Maybe you have better things to do now as well?

Yeah, thanks for reminding me. I need to test Mel's newest round of
fragmentation avoidance patches in my UML build system...

> >>Duplicating the
> >>same or similar infrastructure (in this case, a memory zoning facility)
> >>is a bad thing in general.
> >
> > Even when they keep track of very different things? The memory zoning
> > thing is about where stuff is in physical memory, and it exists because
> > various hardware that wants to access memory (24 bit DMA, 32 bit DMA, and
> > PAE) is evil and crippled and we have to humor it by not asking it to do
> > stuff it can't.
>
> No, the buddy allocator is and always has been what tracks the "long
> contiguous runs of free memory".

We are still discussing fragmentation avoidance, right? (I know _I'm_ trying
to...)

> Both zones and Mels patches classify blocks of memory according to some
> criteria. They're not exactly the same obviously, but they're equivalent in
> terms of capability to guarantee contiguous freeable regions.

Back up.

I don't care _where_ the freeable regions are. I just want them coalesced.

Zones are all about _where_ the memory is.

I'm pretty sure we're arguing past each other.

> > I was under the impression it was orthogonal to figuring out whether or
> > not a given bank of physical memory is accessable to your sound blaster
> > without an IOMMU.
>
> Huh?

Fragmentation avoidance is what is orthogonal to...

> >>Err, the point is so we don't now have 2 layers doing very similar
> >> things, at least one of which has "particularly silly" bugs in it.
> >
> > Similar is not identical. You seem to be implying that the IO elevator
> > and the network stack queueing should be merged because they do similar
> > things.
>
> No I don't.

They're similar though, aren't they? Why should we have different code in
there to do both? (I know why, but that's what your argument sounds like to
me.)

> > If you'd like to write a counter-patch to Mel's to prove it...
>
> It has already been written as you have been told numerous times.

Quoting Yasunori Goto, Yesterday at 2:33 pm,
Message-Id: <[email protected]>

> Hmmm. I don't see at this point.
> Why do you think ZONE_REMOVABLE can satisfy for hugepage.
> At leaset, my ZONE_REMOVABLE patch doesn't any concern about
> fragmentation.

He's NOT ADDRESSING FRAGMENTATION.

So unless you're talking about some OTHER patch, we're talking past each other
again.

> Now if you'd like to actually learn about what you're commenting on,
> that would be really good too.

The feeling is mutual.

> >>So you didn't look at Yasunori Goto's patch from last year that
> >> implements exactly what I described, then?
> >
> > I saw the patch he just posted, if that's what you mean. By his own
> > admission, it doesn't address fragmentation at all.
>
> It seems to be that it provides exactly the same (actually stronger)
> guarantees than the current frag patches do. Or were you going to point
> out a bug in the implementation?

No, I'm going to point out that the author of the patch contradicts you.

> >>>Yes, zones are a way of categorizing memory.
> >>
> >>Yes, have you read Mel's patches? Guess what they do?
> >
> > The swap file is a way of storing data on disk. So is ext3. Obviously,
> > one is a trivial extension of the other and there's no reason to have
> > both.
>
> Don't try to bullshit your way around with stupid analogies please, it
> is an utter waste of time.

I agree that this conversation is a waste of time, and will stop trying to
reason with you now.

Rob

2005-11-03 18:50:58

by Rob Landley

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thursday 03 November 2005 10:23, Badari Pulavarty wrote:

> Yep. This is exactly the issue other product groups normally raise
> on Linux. How do we measure memory pressure in Linux? Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system. Since most memory is used for
> cache, "free" really doesn't indicate anything - they are monitoring
> info in /proc/meminfo and swapping rates to "guess" at the memory
> pressure. They want a clear way of finding out "how badly" the system
> is under memory pressure. (As a starting point, they want to find out,
> out of "cached" memory, how much is really easily "reclaimable"
> under memory pressure - without swapping.) I know this is kind of
> crazy, but interesting to think about :)

If we do ever get prezeroing, we'd want a tuneable to say how much memory
should be spent on random page cache and how much should be prezeroed. And
large chunks of prezeroed memory lying around are what you'd think about
handing back to the host OS...

Rob

2005-11-03 19:21:41

by Jeff Dike

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thu, Nov 03, 2005 at 11:54:10AM -0600, Rob Landley wrote:
> Lots of work has gone into batching up syscalls and making as few of them as
> possible because they are a performance bottleneck. You want to introduce a
> syscall for every single individual page of memory allocated or freed.
>
> That's stupid.

I think what I'm optimizing is TLB flushes, not system calls. With
mmap et al, they are effectively the same thing though.

Jeff

2005-11-04 03:21:37

by Blaisorblade

[permalink] [raw]
Subject: Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thursday 03 November 2005 06:41, Rob Landley wrote:
> On Wednesday 02 November 2005 23:26, Jeff Dike wrote:
> > On Wed, Nov 02, 2005 at 05:28:35PM -0600, Rob Landley wrote:
> > > With fragmentation reduction and prezeroing, UML suddenly gains the
> > > option of calling madvise(DONT_NEED) on sufficiently large blocks as A)
> > > a fast way of prezeroing, B) a way of giving memory back to the host OS
> > > when it's not in use.

> > DONT_NEED is insufficient. It doesn't discard the data in dirty
> > file-backed pages.

> I thought DONT_NEED would discard the page cache, and punch was only needed
> to free up the disk space.
This is correct, but...

> I was hoping that since the file was deleted from disk and is already
> getting _some_ special treatment (since it's a longstanding "poor man's
> shared memory" hack), that madvise wouldn't flush the data to disk, but
> would just zero it out. A bit optimistic on my part, I know. :)

I read at some time that this optimization existed but was deemed obsolete and
removed.

Why obsolete? Because... we have tmpfs! And that's the point. With DONTNEED,
we detach references from page tables, but the content is still pinned: it
_is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
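
A sketch of that distinction (the /dev/shm path is an assumption, and
MADV_REMOVE is only the name of the operation from Badari's proposed patch,
so it is guarded here rather than assumed to exist):

/* On a tmpfs-backed mapping, MADV_DONTNEED only drops the page-table
 * references; tmpfs itself keeps the data, because it *is* the backing
 * store.  Actually freeing the memory needs a hole punch, i.e. the
 * MADV_REMOVE operation from the proposed patch. */
#define _GNU_SOURCE
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = 64 * psz;
	char *mem;
	int fd = open("/dev/shm/uml-scratch", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		return 1;

	memset(mem, 0x42, len);			/* dirty the tmpfs pages */

	madvise(mem, len, MADV_DONTNEED);	/* mapping gone, data still pinned */

#ifdef MADV_REMOVE
	madvise(mem, len, MADV_REMOVE);		/* hole punch: really frees it */
#endif

	munmap(mem, len);
	close(fd);
	unlink("/dev/shm/uml-scratch");
	return 0;
}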

> > Badari Pulavarty has a test patch (google for madvise(MADV_REMOVE))
> > which does do the trick, and I have a UML patch which adds memory
> > hotplug. This combination does free memory back to the host.

> I saw it wander by, and am all for it. If it goes in, it's obviously the
> right thing to use.
Btw, on this side of the picture, I think fragmentation avoidance is not
needed for that.

I guess you refer to using frag. avoidance on the guest (if it matters for the
host, let me know). When it is present, using it will be nice, but
currently we'd do madvise() on a page-per-page basis, and we'd do it on
non-consecutive pages (basically, free pages we either find or free on
purpose).

> You may remember I asked about this two years ago:
> http://seclists.org/lists/linux-kernel/2003/Dec/0919.html

> And a reply indicated that SVr4 had it, but we don't. I assume the "naming
> discussion" mentioned in the recent thread already scrubbed through this
> old thread to determine that the SVr4 API was icky.
> http://seclists.org/lists/linux-kernel/2003/Dec/0955.html

I assume not everybody did (even if somebody pointed out the existence of the
SVr4 API), but there was the need, in at least one usage, for a virtual
address-based API rather than a file-offset-based one like the SVr4 one -
that user would need to implement backward mapping in userspace only for this
purpose, while we already have it in the kernel.

Anyway, the sys_punch() API will follow later - customers need mainly
madvise() for now.
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade






2005-11-04 04:52:35

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Badari Pulavarty <[email protected]> wrote:
>
> > With Badari's patch and UML memory hotplug, the infrastructure is
> > there to make this work. The one thing I'm puzzling over right now is
> > how to measure memory pressure.
>
> Yep. This is exactly the issue other product groups normally raise
> on Linux. How do we measure memory pressure in Linux? Some of our
> software products want to grow or shrink their memory usage depending
> on the memory pressure in the system. Since most memory is used for
> cache, "free" really doesn't indicate anything - they are monitoring
> info in /proc/meminfo and swapping rates to "guess" at the memory
> pressure. They want a clear way of finding out "how badly" the system
> is under memory pressure. (As a starting point, they want to find out,
> out of "cached" memory, how much is really easily "reclaimable"
> under memory pressure - without swapping.) I know this is kind of
> crazy, but interesting to think about :)

Similarly, that SGI patch which was rejected 6-12 months ago to kill off
processes once they started swapping. We thought that it could be done
from userspace, but we need a way for userspace to detect when a task is
being swapped on a per-task basis.

I'm thinking a few numbers in the mm_struct, incremented in the pageout
code, reported via /proc/stat.

2005-11-04 05:36:14

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> Similarly, that SGI patch which was rejected 6-12 months ago to kill off
> processes once they started swapping. We thought that it could be done
> from userspace, but we need a way for userspace to detect when a task is
> being swapped on a per-task basis.
>
> I'm thinking a few numbers in the mm_struct, incremented in the pageout
> code, reported via /proc/stat.

I just sent in a proposed patch for this - one more per-cpuset
number, tracking the recent rate of calls into the synchronous
(direct) page reclaim by tasks in the cpuset.

See the message sent a few minutes ago, with subject:

[PATCH 5/5] cpuset: memory reclaim rate meter

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 05:48:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Paul Jackson <[email protected]> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill off
> > processes once they started swapping. We thought that it could be done
> > from userspace, but we need a way for userspace to detect when a task is
> > being swapped on a per-task basis.
> >
> > I'm thinking a few numbers in the mm_struct, incremented in the pageout
> > code, reported via /proc/stat.
>
> I just sent in a proposed patch for this - one more per-cpuset
> number, tracking the recent rate of calls into the synchronous
> (direct) page reclaim by tasks in the cpuset.
>
> See the message sent a few minutes ago, with subject:
>
> [PATCH 5/5] cpuset: memory reclaim rate meter
>

uh, OK. If that patch is merged, does that make Bron happy, so I don't
have to reply to his plaintive email?

I was kind of thinking that the stats should be per-process (actually
per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.

2005-11-04 06:43:15

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Andrew wrote:
> uh, OK. If that patch is merged, does that make Bron happy, so I don't
> have to reply to his plaintive email?

In theory yes, that should do it. I will ack again, by early next
week, after I have verified this further.

And it should also handle some other folks who have plaintive emails
in my inbox, that haven't gotten bold enough to pester you, yet.

It really is, for the users who know my email address (*), job based
memory pressure, not task based, that matters. Sticking it in a
cpuset, which is the natural job container, is easier, more natural,
and more efficient for all concerned.

It's jobs that are being run in cpusets with dedicated (not shared)
CPUs and Memory Nodes that care about this, so far as I know.

When running a system in a more typical sharing mode, with multiple
jobs and applications competing for the same resources, then the kernel
needs to be master of processor scheduling and memory allocation.

When running jobs in cpusets with dedicated CPUs and Memory Nodes,
then less is being asked of the kernel, and some per-job controls
from userspace make more sense. This is where a simple hook like
this reclaim rate meter comes into play - passing up to user space
another clue to help it do its job.


> I was kind of thinking that the stats should be per-process (actually
> per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.

There may well be a market for these too. But such stats sound like
more work, and the market isn't one that's paying my salary.

So I will leave that challenge on the table for someone else.


(*) Of course, there is some self selection going on here.
Folks not doing cpuset-based jobs are far less likely
to know my email address ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 07:11:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Paul Jackson <[email protected]> wrote:
>
> > I was kind of thinking that the stats should be per-process (actually
> > per-mm) rather than bound to cpusets. /proc/<pid>/pageout-stats or something.
>
> There may well be a market for these too. But such stats sound like
> more work, and the market isn't one that's paying my salary.

But I have to care for all users.

> So I will leave that challenge on the table for someone else.

And I won't merge your patch ;)


Seriously, it does appear that doing it per-task is adequate for your
needs, and it is certainly more general.



I cannot understand why you decided to count only the number of
direct-reclaim events, via a "digitally filtered, constant time based,
event frequency meter".

a) It loses information. If we were to export the number of pages
reclaimed from the mm, filtering can be done in userspace.

b) It omits reclaim performed by kswapd and by other tasks (ok, it's
very cpuset-specific).

c) It only counts synchronous try_to_free_pages() attempts. What if an
attempt only freed pagecache, or didn't manage to free anything?

d) It doesn't notice if kswapd is swapping the heck out of your
not-allocating-any-memory-now process.


I think all the above can be addressed by exporting per-task (actually
per-mm) reclaim info. (I haven't put much though into what info that
should be - page reclaim attempts, mmapped reclaims, swapcache reclaims,
etc)

2005-11-04 07:26:53

by Ingo Molnar

[permalink] [raw]
Subject: [patch] swapin rlimit


* Andrew Morton <[email protected]> wrote:

> Similarly, that SGI patch which was rejected 6-12 months ago to kill
> off processes once they started swapping. We thought that it could be
> done from userspace, but we need a way for userspace to detect when a
> task is being swapped on a per-task basis.

wouldnt the clean solution here be a "swap ulimit"?

I.e. something like the 2-minute quick-hack below (against Linus-curr).

Ingo

---
implement a swap ulimit: RLIMIT_SWAP.

setting the ulimit to 0 causes any swapin activity to kill the task.
Setting the rlimit to 0 is allowed for unprivileged users too, since it
is a decrease of the default RLIM_INFINITY value. I.e. users could run
known-memory-intense jobs with such an ulimit set, and get a guarantee
that they wont put the system into a swap-storm.

Note: it's just swapin that causes the SIGKILL, because at swapout time
it's hard to identify the originating task. Pure swapouts and a buildup
in the swap-cache are not punished, only actual hard swapins. I didnt try
too hard to make the rlimit particularly finegrained - i.e. right now we
only know 'zero' and 'infinity' ...

Signed-off-by: Ingo Molnar <[email protected]>

include/asm-generic/resource.h | 4 +++-
mm/memory.c | 13 +++++++++++++
2 files changed, 16 insertions(+), 1 deletion(-)

Index: linux/include/asm-generic/resource.h
===================================================================
--- linux.orig/include/asm-generic/resource.h
+++ linux/include/asm-generic/resource.h
@@ -44,8 +44,9 @@
#define RLIMIT_NICE 13 /* max nice prio allowed to raise to
0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
+#define RLIMIT_SWAP 15 /* maximum swapspace for task */

-#define RLIM_NLIMITS 15
+#define RLIM_NLIMITS 16

/*
* SuS says limits have to be unsigned.
@@ -86,6 +87,7 @@
[RLIMIT_MSGQUEUE] = { MQ_BYTES_MAX, MQ_BYTES_MAX }, \
[RLIMIT_NICE] = { 0, 0 }, \
[RLIMIT_RTPRIO] = { 0, 0 }, \
+ [RLIMIT_SWAP] = { RLIM_INFINITY, RLIM_INFINITY }, \
}

#endif /* __KERNEL__ */
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -1647,6 +1647,18 @@ void swapin_readahead(swp_entry_t entry,
}

/*
+ * Crude first-approximation swapin-avoidance: if there is a zero swap
+ * rlimit then kill the task.
+ */
+static inline void check_swap_rlimit(void)
+{
+ unsigned long limit = current->signal->rlim[RLIMIT_SWAP].rlim_cur;
+
+ if (limit != RLIM_INFINITY)
+ force_sig(SIGKILL, current);
+}
+
+/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
* We return with mmap_sem still held, but pte unmapped and unlocked.
@@ -1667,6 +1679,7 @@ static int do_swap_page(struct mm_struct
entry = pte_to_swp_entry(orig_pte);
page = lookup_swap_cache(entry);
if (!page) {
+ check_swap_rlimit();
swapin_readahead(entry, address, vma);
page = read_swap_cache_async(entry, vma, address);
if (!page) {
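
(A usage sketch, assuming the patch above were applied: RLIMIT_SWAP is the
constant it introduces, so this only does something useful against such a
kernel.)

/* Run a memory-hungry job with a zero swap rlimit so it gets SIGKILLed
 * on its first hard swapin instead of swap-storming the machine.
 * RLIMIT_SWAP is the constant from the patch above; on an unpatched
 * kernel setrlimit() simply fails with EINVAL. */
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

#ifndef RLIMIT_SWAP
#define RLIMIT_SWAP 15		/* value used by the patch above */
#endif

int main(int argc, char **argv)
{
	struct rlimit zero = { 0, 0 };

	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* lowering the limit below RLIM_INFINITY is allowed unprivileged */
	if (setrlimit(RLIMIT_SWAP, &zero) < 0)
		perror("setrlimit(RLIMIT_SWAP)");

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}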

2005-11-04 07:36:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

Ingo Molnar <[email protected]> wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > off processes once they started swapping. We thought that it could be
> > done from userspace, but we need a way for userspace to detect when a
> > task is being swapped on a per-task basis.
>
> wouldnt the clean solution here be a "swap ulimit"?

Well it's _a_ solution, but it's terribly specific.

How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
non-zero, kill <pid>?
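
(For illustration, the userspace side of that would be something like the
sketch below; /proc/<pid>/nr_swapped_in_pages is hypothetical - it is the
interface being proposed, not one that exists.)

/* Poll a (hypothetical) per-task swapin counter and kill the task once
 * it starts swapping.  /proc/<pid>/nr_swapped_in_pages does not exist;
 * it stands in for whatever per-mm stat the kernel would export. */
#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

static long read_swapins(pid_t pid)
{
	char path[64];
	long n = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/nr_swapped_in_pages", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;		/* task gone, or stat not exported */
	if (fscanf(f, "%ld", &n) != 1)
		n = -1;
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	int pid;

	if (argc != 2 || sscanf(argv[1], "%d", &pid) != 1) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	for (;;) {
		long n = read_swapins(pid);

		if (n < 0)
			return 0;	/* task exited (note the pid-reuse
					   race Arjan points out below) */
		if (n > 0)
			return kill(pid, SIGKILL);
		sleep(5);
	}
}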

2005-11-04 07:46:01

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Andrew wrote:
> > So I will leave that challenge on the table for someone else.
>
> And I won't merge your patch ;)

Be that way ;).


> Seriously, it does appear that doing it per-task is adequate for your
> needs, and it is certainly more general.

My motivations for the per-cpuset, digitally filtered rate, as opposed
to the per-task raw counter mostly have to do with minimizing total
cost (user + kernel) of collecting this information. I have this phobia,
perhaps not well founded, that moving critical scheduling/allocation
decisions like this into user space will fail in some cases because
the cost of gathering the critical information will be too intrusive
on system performance and scalability.

A per-task stat requires walking the tasklist, to build a list of the
tasks to query.

A raw counter requires repeated polling to determine the recent rate of
activity.

The filtered per-cpuset rate avoids any need to repeatedly access
global resources such as the tasklist, and minimizes the total cpu
cycles required to get the interesting stat.
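
(For illustration, a "digitally filtered, constant time based, event
frequency meter" boils down to something like the userspace sketch below,
an exponentially decaying moving rate; it is not the actual cpuset patch
code.)

/* Userspace sketch of a digitally filtered event-rate meter: each event
 * folds the instantaneous rate (1/dt) into an exponentially decaying
 * average.  Illustration only, not the cpuset patch.  Compile with -lm. */
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <unistd.h>

struct rate_meter {
	double rate;	/* filtered events per second */
	time_t last;	/* time of the previous event */
	double tau;	/* filter time constant, in seconds */
};

static void meter_event(struct rate_meter *m, time_t now)
{
	double dt = difftime(now, m->last);
	double w;

	if (dt < 1)
		dt = 1;			/* second granularity is enough here */
	w = exp(-dt / m->tau);		/* older history decays exponentially */
	m->rate = w * m->rate + (1.0 - w) * (1.0 / dt);
	m->last = now;
}

int main(void)
{
	struct rate_meter m = { 0.0, time(NULL), 10.0 };
	int i;

	for (i = 0; i < 5; i++) {	/* pretend a reclaim event every 2s */
		sleep(2);
		meter_event(&m, time(NULL));
		printf("filtered reclaim rate: %.3f events/sec\n", m.rate);
	}
	return 0;
}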


> But I have to care for all users.

Well you should, and well you do.

If you have good reason, or just good instincts, to think that there
are uses for per-task raw counters, then your choice is clear.

As indeed it was clear.

I don't recall hearing of any desire for per-task memory pressure data,
until tonight.

I will miss this patch. It had provided exactly what I thought was
needed, with an extremely small impact on system (kern+user) performance.

Oh well.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 08:02:53

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

Paul Jackson <[email protected]> wrote:
>
> A per-task stat requires walking the tasklist, to build a list of the
> tasks to query.

Nope, just task->mm->whatever.

> A raw counter requires repeated polling to determine the recent rate of
> activity.

True.

> The filtered per-cpuset rate avoids any need to repeatedly access
> global resources such as the tasklist, and minimizes the total cpu
> cycles required to get the interesting stat.
>

Well no. Because the filtered-whatsit takes two spinlocks and does a bunch
of arith for each and every task, each time it calls try_to_free_pages().
The frequency of that could be very high indeed, even when nobody is
interested in the metric which is being maintained(!).

And I'd suggest that only a minority of workloads would be interested in
this metric?

ergo, polling the thing once per five seconds in those situations where we
actually want to poll the thing may well be cheaper, in global terms?

2005-11-04 08:07:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] swapin rlimit


* Andrew Morton <[email protected]> wrote:

> Ingo Molnar <[email protected]> wrote:
> >
> > * Andrew Morton <[email protected]> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping. We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldnt the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

on a system with possibly thousands of tasks, over /proc, on a
high-performance node where for a 0.5% improvement they are willing to
sacrifice maidens? :)

Seriously, while nr_swapped_in_pages ought to be OK, i think there is a
generic problem with /proc based stats.

System instrumentation people are already complaining about how costly
/proc parsing is. If you have to get some nontrivial stat from all
threads in the system, and if Linux doesnt offer that counter or summary
by default, it gets pretty expensive.

One solution i can think of would be to make a binary representation of
/proc/<pid>/stats readonly-mmap-able. This would add a 4K page to every
task tracked that way, and stats updates would have to update this page
too - but it would make instrumentation of running apps really
unintrusive and scalable.

Another addition would be some mechanism for a monitoring app to capture
events in the PID space: so that they can mmap() new tasks [if they are
interested] on a non-polling basis, i.e. not like readdir on /proc. This
capability probably has to be a system-call though, as /proc seems too
quirky for it. The system does not wait on the monitoring app(s) to
catch up - if it's too slow in reacting and the event buffer overflows
then tough luck - monitoring apps will have no impact on the runtime
characteristics of other tasks. In theory this is somewhat similar to
auditing, but the purpose would be quite different, and it only cares
about PID-space events like 'fork/clone', 'exec' and 'exit'.

Ingo

2005-11-04 08:19:34

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

On Thu, 2005-11-03 at 23:36 -0800, Andrew Morton wrote:
> Ingo Molnar <[email protected]> wrote:
> >
> > * Andrew Morton <[email protected]> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping. We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldnt the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

well or do it the other way around

write a counter to such a thing
and kill when it hits zero
(similar to the CPU perf counter stuff on x86)

doing this from userspace is tricky; what if the task dies of natural
causes and the pid gets reused, between the time the userspace app reads
the value and the time it decides the time is up and time for a kill....
(and on a busy server that can be quite a bit of time)

2005-11-04 09:53:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> > A per-task stat requires walking the tasklist, to build a list of the
> > tasks to query.
>
> Nope, just task->mm->whatever.

Nope.

Agreed - once you have the task, then sure, that's enough.

However - a batch scheduler will end up having to figure out which tasks
to query, by either listing the tasks in a cpuset, or
by listing /proc. Either way, that's a tasklist scan. And it will
have to do that pretty much every iteration of polling, since it has
no a priori knowledge of what tasks a job is firing up.


> Well no. Because the filtered-whatsit takes two spinlocks and does a bunch
> of arith for each and every task, each time it calls try_to_free_pages().

Neither spinlock is global - the task and a lock in its cpuset.

I see a fair number of existing locks and semaphores, some global
and some in loops, that look to be in the code invoked by
try_to_free_pages(). And far more arithmetic than in that little
filter.

Granted, its cost is seen by all, for the benefit of few. But other sorts
of per-task or per-mm stats are not going to be free either. I would
have figured that doing something per-page, even the most trivial
"counter++" (better have that mm locked) will likely cost more than
doing something per try_to_free_pages() call.


> The frequency of that could be very high indeed, even when nobody is
> interested in the metric which is being maintained(!)

When I have a task start allocating memory as fast as it can, it is only
able to call try_to_free_pages() about 10 times a second on an idle
ia64 SN2 system, with a single thread, or about 20 times a second
running several threads at once allocating memory.

That's not "very high" in my book.

What sort of load would hit this much more often?


If more folks need these detailed stats, then that's how it should be.

But I am no fan of exposing more than the minimum kernel vm details for
use by production software.

We agree that my per-cpuset memory_reclaim_rate meter certainly hides
more detail than the sorts of stats you are suggesting. I thought that
was good, so long as what was needed was still present.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 10:05:07

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

Arjan wrote:
> doing this from userspace is tricky; what if the task dies of natural
> causes and the pid gets reused, between the time the userspace app reads
> the value and the time it decides the time is up and time for a kill....
> (and on a busy server that can be quite a bit of time)

If pids are being reused within seconds of their being freed up,
then the batch managers running on the big HPC systems I care
about are so screwed it isn't even funny. They depend heavily
on being able to identify the task pids in a job and then doing
something to those tasks (suspend, kill, gather stats, ...).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 10:07:00

by Paul Jackson

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

Ingo wrote:
> Seriously, while nr_swapped_in_pages ought to be OK, i think there is a
> generic problem with /proc based stats.
>
> System instrumentation people are already complaining about how costly
> /proc parsing is. If you have to get some nontrivial stat from all
> threads in the system, and if Linux doesn't offer that counter or summary
> by default, it gets pretty expensive.

Agreed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-04 10:21:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] swapin rlimit


* Bernd Petrovitsch <[email protected]> wrote:

> On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> > * Andrew Morton <[email protected]> wrote:
> >
> > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > off processes once they started swapping. We thought that it could be
> > > done from userspace, but we need a way for userspace to detect when a
> > > task is being swapped on a per-task basis.
> >
> > wouldnt the clean solution here be a "swap ulimit"?
>
> Hmm, where is the difference to "mlockall(MCL_CURRENT|MCL_FUTURE);"?
> OK, mlockall() can only be done by root (processes).

what do you mean? mlockall pins down all pages. swapin ulimit kills the
task (and thus frees all the RAM it had) when it touches swap for the
first time. These two solutions almost oppose each other!
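
For reference, the mlockall() side of that comparison is just this
(minimal sketch of the existing call):

/* Pin the whole address space, current and future mappings alike, so
 * the task can never touch swap. Needs CAP_IPC_LOCK or a sufficient
 * RLIMIT_MEMLOCK. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }
    /* From here on the process cannot be swapped at all - roughly the
     * opposite of being killed on its first swapin. */
    return 0;
}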

Ingo

2005-11-04 10:25:19

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> * Andrew Morton <[email protected]> wrote:
>
> > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > off processes once they started swapping. We thought that it could be
> > done from userspace, but we need a way for userspace to detect when a
> > task is being swapped on a per-task basis.
>
> wouldnt the clean solution here be a "swap ulimit"?

Hmm, where is the difference to "mlockall(MCL_CURRENT|MCL_FUTURE);"?
OK, mlockall() can only be done by root (processes).

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2005-11-04 11:25:00

by Bernd Petrovitsch

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

On Fri, 2005-11-04 at 11:21 +0100, Ingo Molnar wrote:
> * Bernd Petrovitsch <[email protected]> wrote:
> > On Fri, 2005-11-04 at 08:26 +0100, Ingo Molnar wrote:
> > > * Andrew Morton <[email protected]> wrote:
> > > > Similarly, that SGI patch which was rejected 6-12 months ago to kill
> > > > off processes once they started swapping. We thought that it could be
> > > > done from userspace, but we need a way for userspace to detect when a
> > > > task is being swapped on a per-task basis.
> > >
> > > wouldnt the clean solution here be a "swap ulimit"?
> >
> > Hmm, where is the difference to "mlockall(MCL_CURRENT|MCL_FUTURE);"?
> > OK, mlockall() can only be done by root (processes).
>
> what do you mean? mlockall pins down all pages. swapin ulimit kills the
in memory.
> task (and thus frees all the RAM it had) when it touches swap for the
> first time. These two solutions almost oppose each other!

Only almost, IMHO, since locked pages in RAM avoid swapping entirely.
"Complement each other" is probably more correct.

Given the "max locked memory" limit, it should behave pretty much the
same once the process hits its limit.
OK, the difference may be the loaded executable and library pages.

Hmm, delivering a signal on the first swapped-out page might be another
simple solution, and the process could then do something to avoid it.

The nice thing about a "swap ulimit" is that it is easy to understand
what it does (which is always a good thing).
Generating a similar effect by combining two other features is
probably somewhat more arcane.
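
A minimal sketch of setting the existing "max locked memory" limit; a
swap ulimit would presumably be set the same way, but with a new RLIMIT_*
resource that does not exist today:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Allow at most 64 MB of mlock()ed memory (the existing limit). */
    struct rlimit rl = { .rlim_cur = 64 << 20, .rlim_max = 64 << 20 };

    if (setrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }
    /* A swap ulimit would differ in the action taken: instead of
     * failing mlock()/mlockall() past the limit, the task would be
     * killed once it swaps in more than the allowed amount. */
    return 0;
}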

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

2005-11-04 15:15:06

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

On Friday 04 November 2005 01:36, Andrew Morton wrote:
> > wouldnt the clean solution here be a "swap ulimit"?
>
> Well it's _a_ solution, but it's terribly specific.
>
> How hard is it to read /proc/<pid>/nr_swapped_in_pages and if that's
> non-zero, kill <pid>?

Things like make fork off lots of short-lived child processes, and some of those
can be quite memory intensive. (The gcc 4.0.2 build causes an outright swap
storm for me about halfway through, doing genattrtab and then again compiling
the result).

Is there any way for parents to collect their child processes' statistics when
the children exit? Or by the time the actual swapper exits, do we not care
anymore?
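
(For what it's worth, the closest existing mechanism is the rusage a
parent gets back for its waited-for children - a minimal sketch below -
though on Linux the swap counter in it is reported as zero, so it does
not answer the swap question by itself.)

/* A parent can collect aggregate stats of exited children via
 * getrusage(RUSAGE_CHILDREN). ru_majflt (major faults) is maintained
 * on Linux; ru_nswap exists in the struct but stays 0. */
#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    struct rusage ru;

    if (fork() == 0)
        _exit(0);               /* child: do some work, then exit */
    wait(NULL);                 /* reap the child */

    getrusage(RUSAGE_CHILDREN, &ru);
    printf("children: majflt=%ld nswap=%ld\n",
           ru.ru_majflt, ru.ru_nswap);
    return 0;
}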

Rob

2005-11-04 15:19:57

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

> Seriously, it does appear that doing it per-task is adequate for your
> needs, and it is certainly more general.
>
>
>
> I cannot understand why you decided to count only the number of
> direct-reclaim events, via a "digitally filtered, constant time based,
> event frequency meter".
>
> a) It loses information. If we were to export the number of pages
> reclaimed from the mm, filtering can be done in userspace.
>
> b) It omits reclaim performed by kswapd and by other tasks (ok, it's
> very cpuset-specific).
>
> c) It only counts synchronous try_to_free_pages() attempts. What if an
> attempt only freed pagecache, or didn't manage to free anything?
>
> d) It doesn't notice if kswapd is swapping the heck out of your
> not-allocating-any-memory-now process.
>
>
> I think all the above can be addressed by exporting per-task (actually
> per-mm) reclaim info. (I haven't put much thought into what info that
> should be - page reclaim attempts, mmapped reclaims, swapcache reclaims,
> etc)

I've been looking at similar things. When we page out / free something from
a shared library that 10 tasks have mapped, who does that count against
for pressure?

M.

2005-11-04 15:24:05

by Martin Bligh

[permalink] [raw]
Subject: Re: [patch] swapin rlimit

> System instrumentation people are already complaining about how costly
> /proc parsing is. If you have to get some nontrivial stat from all
> threads in the system, and if Linux doesn't offer that counter or summary
> by default, it gets pretty expensive.
>
> One solution I can think of would be to make a binary representation of
> /proc/<pid>/stats readonly-mmap-able. This would add a 4K page to every
> task tracked that way, and stats updates would have to update this page
> too - but it would make instrumentation of running apps really
> unintrusive and scalable.

That would be awesome - the current methods we have are mostly crap. There
are some atomicity issues though. Plus when I suggested this 2 years ago,
everyone told me to piss off, but I'm not bitter ;-) Seriously, we do
need a fast communication mechanism.

M.

2005-11-04 15:27:03

by Martin Bligh

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19



> We agree that my per-cpuset memory_reclaim_rate meter certainly hides
> more detail than the sorts of stats you are suggesting. I thought that
> was good, so long as what was needed was still present.

But it's horribly specific to cpusets. If you want something multi-task,
it would be better if it worked on more generic task groupings.

M.

2005-11-04 15:51:51

by Rob Landley

[permalink] [raw]
Subject: Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > I was hoping that since the file was deleted from disk and is already
> > getting _some_ special treatment (since it's a longstanding "poor man's
> > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > would just zero it out. A bit optimistic on my part, I know. :)
>
> I read at some time that this optimization existed but was deemed obsolete
> and removed.
>
> Why obsolete? Because... we have tmpfs! And that's the point. With
> DONTNEED, we detach references from page tables, but the content is still
> pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)

If I had that kind of control over environment my build would always be
deployed in (including root access), I wouldn't need UML. :)

(P.S. The default for Ubuntu "Horny Hedgehog" is no. The only tmpfs mount
is /dev/shm, and /tmp is on / which is ext3. Yeah, I need to upgrade my
laptop...)

> I guess you refer to using frag. avoidance on the guest

Yes. Moot point since Linus doesn't want it.

> (if it matters for
> the host, let me know). When it will be present using it will be nice, but
> currently we'd do madvise() on a page-per-page basis, and we'd do it on
> non-consecutive pages (basically, free pages we either find or free on
> purpose).

Might be a performance issue if that gets introduced with per-page
granularity, and how do you avoid giving back pages we're about to re-use?
Oh well, bench it when it happens. (And in any case, it needs a tunable to
beat the page cache into submission or there's no free memory to give back.
If there's already such a tuneable, I haven't found it yet.)

Rob

2005-11-04 17:19:14

by Blaisorblade

[permalink] [raw]
Subject: Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

(Note - I've removed a few CCs since there are too many of us; sorry for
any inconvenience.)

On Friday 04 November 2005 16:50, Rob Landley wrote:
> On Thursday 03 November 2005 21:26, Blaisorblade wrote:
> > > I was hoping that since the file was deleted from disk and is already
> > > getting _some_ special treatment (since it's a longstanding "poor man's
> > > shared memory" hack), that madvise wouldn't flush the data to disk, but
> > > would just zero it out. A bit optimistic on my part, I know. :)
> >
> > I read at some time that this optimization existed but was deemed
> > obsolete and removed.
> >
> > Why obsolete? Because... we have tmpfs! And that's the point. With
> > DONTNEED, we detach references from page tables, but the content is still
> > pinned: it _is_ the "disk"! (And you have TMPDIR on tmpfs, right?)
>
> If I had that kind of control over environment my build would always be
> deployed in (including root access), I wouldn't need UML. :)
Yep, right for your case... however currently the majority of users use tmpfs
(I hope for them)...

> > I guess you refer to using frag. avoidance on the guest
>
> Yes. Moot point since Linus doesn't want it.
See the latest lwn.net issue (when it becomes available) on this topic. In
short, however, the real point is that we need this kind of support.

> Might be a performance issue if that gets introduced with per-page
> granularity,
I'm aware of this possibility, and I've said in fact "Frag. avoidance will be
nice to use". However I'm not sure that the system call overhead is so big,
compared to flushing the TLB entries...

But for now we don't have the issue - you don't do hot-unplug frequently.
When somebody writes the auto-hotunplug management daemon we could have a
problem here...
> and how do you avoid giving back pages we're about to re-use?

Jeff's trick is to call the buddy allocator (__get_free_pages()) to get a full
page (it will do any needed work to free memory), so nobody else will use
it, and then madvise() it.
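
A conceptual sketch of that trick as described - the host-side madvise
helper here is a made-up stand-in, not UML's real interface:

/* Grab a page from the guest buddy allocator so nothing else in the
 * guest can use it, then tell the host it may discard the backing
 * memory. host_madvise_dontneed() is an invented stand-in for whatever
 * host-side call UML actually uses. */
static void give_one_page_back_to_host(void)
{
    unsigned long page = __get_free_pages(GFP_KERNEL, 0);

    if (!page)
        return;
    /* The page is now allocated to us and unused, so its contents do
     * not matter; let the host reclaim the real memory behind it. */
    host_madvise_dontneed((void *)page, PAGE_SIZE);
    /* Keep the page allocated (and tracked) so the guest never touches
     * it until we decide to take it back. */
}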

If a better API exists, that will be used.

> Oh well, bench it when it happens. (And in any case, it needs a tunable to
> beat the page cache into submission or there's no free memory to give back.
I couldn't parse your sentence. The allocation will free memory, just as
any allocation does when memory is needed.

However, look at /proc/sys/vm/swappiness, or use Con Kolivas's patches to
find new tunables and policies.
> If there's already such a tuneable, I haven't found it yet.)
--
Inform me of my mistakes, so I can keep imitating Homer Simpson's "Doh!".
Paolo Giarrusso, aka Blaisorblade (Skype ID "PaoloGiarrusso", ICQ 215621894)
http://www.user-mode-linux.org/~blaisorblade

2005-11-04 17:39:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

"Martin J. Bligh" <[email protected]> wrote:
>
> > Seriously, it does appear that doing it per-task is adequate for your
> > needs, and it is certainly more general.
> >
> >
> >
> > I cannot understand why you decided to count only the number of
> > direct-reclaim events, via a "digitally filtered, constant time based,
> > event frequency meter".
> >
> > a) It loses information. If we were to export the number of pages
> > reclaimed from the mm, filtering can be done in userspace.
> >
> > b) It omits reclaim performed by kswapd and by other tasks (ok, it's
> > very cpuset-specific).
> >
> > c) It only counts synchronous try_to_free_pages() attempts. What if an
> > attempt only freed pagecache, or didn't manage to free anything?
> >
> > d) It doesn't notice if kswapd is swapping the heck out of your
> > not-allocating-any-memory-now process.
> >
> >
> > I think all the above can be addressed by exporting per-task (actually
> > per-mm) reclaim info. (I haven't put much thought into what info that
> > should be - page reclaim attempts, mmapped reclaims, swapcache reclaims,
> > etc)
>
> I've been looking at similar things. When we page out / free something from
> a shared library that 10 tasks have mapped, who does that count against
> for pressure?

Count pte unmappings and minor faults and account them against the
mm_struct, I guess.
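
A rough sketch of what that accounting could look like - the names below
are invented, nothing like this exists in the tree, and locking is
ignored:

/* Stand-in for new mm_struct fields; neither exists today. */
struct mm_reclaim_stats {
    unsigned long reclaim_unmaps;   /* ptes unmapped by page reclaim */
    unsigned long minor_faults;     /* faults pulling pages back in */
};

/* Reclaim path: a shared page mapped by ten mms has ten ptes unmapped,
 * so each of those mms gets charged once. */
static void account_reclaim_unmap(struct mm_reclaim_stats *stats)
{
    stats->reclaim_unmaps++;
}

/* Fault path: the cost of reclaim shows up later as the faults needed
 * to bring the pages back in. */
static void account_minor_fault(struct mm_reclaim_stats *stats)
{
    stats->minor_faults++;
}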

2005-11-04 17:44:58

by Rob Landley

[permalink] [raw]
Subject: Re: [uml-devel] Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19

On Friday 04 November 2005 11:18, Blaisorblade wrote:
> > Oh well, bench it when it happens. (And in any case, it needs a tunable
> > to beat the page cache into submission or there's no free memory to give
> > back.
>
> I couldn't parse your sentence. The allocation will free memory like when
> memory is needed.

If you've got a daemon running in the virtual system to hand back memory to
the host, then you don't need a tuneable.

What I was thinking is that if we get prezeroing infrastructure that can use
various prezeroing accelerators (as has been discussed but I don't believe
merged), then a logical prezeroing accelerator for UML would be calling
madvise on the host system. This has the advantage of automatically giving
back to the host system any memory that's not in use, but would require some
way to tell kswapd or some such that keeping around lots of prezeroed memory
is preferable to keeping around lots of page cache.

In my case, I have a workload that can mostly work with 32-48 megs of ram, but
it spikes up to 256 at one point. Right now, I'm telling UML mem=64 megs and
then feeding it a 256 meg swap file on ubd, but this is hideously inefficient when
it actually tries to use this swap file. (And since the host system is
running a 2.6.10 kernel, there's a five minute period during each build where
things on my desktop actually freeze for 15-30 seconds at a time. And this
is on a laptop with 512 megs of ram. I think it's because the disk is so
overwhelmed, and some things (like vim's .swp file, and something similar in
kmail's composer) do a gratuitous fsync...)

> However look at /proc/sys/vm/swappiness

Setting swappiness to 0 triggers the OOM killer on 2.6.14 for a load that
completes with swappiness at 60. I mentioned this on the list a little while
ago and some people asked for copies of my test script...

> or use Con Kolivas's patches to find new tunable and policies.

The daemon you mentioned is an alternative, but I'm not quite sure how rapid
the daemon's reaction is going to be to potential OOM situations when
something suddenly wants an extra 200 megs...

Rob

2005-11-10 18:47:23

by Steve Lord

[permalink] [raw]
Subject: Re: [Lhms-devel] [PATCH 0/7] Fragmentation Avoidance V19 - Summary

Flogging a dead horse here maybe, I missed this whole thread when it was
live, and someone may already have covered this.

Another reason for avoiding memory fragmentation, which may have been lost
in the discussion, is avoiding scatter/gather in I/O. The block layer now
has the smarts to join together physically contiguous pages into a single
scatter/gather element. It always had the smarts to deal with I/O from lots
of small chunks of memory, and let the hardware do the work of reassembling
it. This does not come for free though.

I have come across situations where a raid controller gets cpu bound dealing
with I/O from Linux, but not from Windows. The reason is that Windows seems
to manage to present the same amount of memory in fewer scatter/gather entries.
Because the number of DMA elements is another limiting factor, Windows also
managed to submit larger individual requests. Once Linux reaches steady state,
it ends up submitting one page per scatter gather entry.
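
To illustrate the effect with a toy example (hypothetical helper, not the
real block-layer code): physically contiguous pages collapse into one
scatter/gather entry, while a fully fragmented request costs one entry
per page.

#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Given the physical addresses of the pages in a request, count how
 * many scatter/gather entries the hardware would need. Pages that
 * follow each other in physical memory merge into a single entry. */
static size_t count_sg_entries(const unsigned long *phys, size_t npages)
{
    size_t entries = 0;
    size_t i;

    for (i = 0; i < npages; i++) {
        if (i == 0 || phys[i] != phys[i - 1] + PAGE_SIZE)
            entries++;          /* start a new scatter/gather entry */
    }
    return entries;
}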

OK, if you are going via the page cache, then this is not going to mean anything
unless the idea of having PAGE_CACHE_SIZE > PAGE_SIZE gets dusted off. However,
for userspace <-> disk direct I/O, having the address space of a process
be more physically contiguous could help here. Specifically allocated huge pages
are another way to achieve this, but they do require special coding in an app
to use them.

I'll go back to my day job now ;-)

Steve



Mel Gorman wrote:
> On Thu, 3 Nov 2005, Nick Piggin wrote:
>
>> Mel Gorman wrote:
>>
>>> Ok. To me, the rest of the thread are beating around the same points and
>>> no one is giving ground. The points are made so lets summarise. Apologies
>>> if anything is missing.
>>>
>> Thanks for attempting a summary of a difficult topic. I have a couple
>> of suggestions.
>>
>>> Who cares
>>> =========
>>> Physical hotplug remove: Vendors of the hardware that support this -
>>> Fujitsu, HP (I think), IBM etc
>>>
>>> Virtualization hotplug remove: Sellers of virtualization software, some
>>> hardware like any IBM machine that lists LPAR in it's list of
>>> features. Probably software solutions like Xen are also affected
>>> if they want to be able to grow and shrink the virtual machines on
>>> demand
>>>
>> Ingo said that Xen is fine with per page granular freeing - this covers
>> embedded, desktop and small server users of VMs into the future I'd say.
>>
>
> Ok, hard to argue with that.
>
>>> High order allocations: Ultimately, hugepage users. Today, that is a
>>> feature only big server users like Oracle care about. In the
>>> future I reckon applications will be able to use them for things
>>> like backing the heap by huge pages. Other users like GigE,
>>> loopback devices with large MTUs, some filesystem like CIFS are
>>> all interested although they are also been told use use smaller
>>> pages.
>>>
>> I think that saying it's now OK to use higher order allocations is wrong
>> because as I said even with your patches they are going to run into
>> problems.
>>
>
> Ok, I have not denied that they will run into problems. I have asserted
> that, with more work built upon these patches, we can grant large pages
> with a good degree of reliability. Subsystems should still use small
> orders whenever possible and at the very least, large orders should be
> short-lived.
>
> For userspace users, I would like to move towards better availability of
> huge pages without requiring the boot-time tunables which are required today.
> Do we agree that this would be useful at least for a few different users?
>
> HugeTLB user 1: Todays users of hugetlbfs like big databases etc
> HugeTLB user 2: HPC jobs that run with sparse data sets
> HugeTLB user 3: Desktop applications that use large amounts of address space.
>
> I got a mail from a user of category 2. He said I can quote his email, but
> he didn't say I could quote his name which is inconvenient but I'm sure he
> has good reasons.
>
> To him, low fragmentation is "critical, at least in HPC environments".
> Here is the core of his issue:
>
> --- excerpt ---
> Take the scenario that you have a large machine that is
> used by multiple users, and the usage is regulated by a batch
> scheduler. Loadleveler on ibm's for example. PBS on many
> others. Both appear to be available in linux environments.
>
> In the case of my codes, I find that having large pages is
> extremely beneficial to my run times. As in factors of several,
> modulo things that I've coded in by hand to try and avoid the
> issues. I don't think my code is in any way unusual in this
> magnitude of improvement.
> --- excerpt ---
>
> ok, so we have two potential solutions, anti-defrag and zones. We don't
> need to rehash the pros and cons. With zones, we just say "just reclaim
> the easy reclaim zone, alloc your pages and away we go".
>
> Now, his problem is that the server is not restarted between jobs, and
> jobs take days and weeks to complete. The system administrators will not
> restart the machine, so getting it to a pristine state is a difficulty. The
> state he gets the system in is the state he works with, and with
> fragmentation he doesn't get large pages unless he is lucky enough to be
> the first user of the machine.
>
> With the zone approach, we would just be saying "tune it". Here is what he
> says about that
>
> --- excerpt ---
> I specifically *don't* want things that I have to beg sysadmins to
> tune correctly. They won't get it right because there is no `right'
> that is right for everyone. They won't want to change it and it
> won't work besides. Been there, done that. My experience is that
> with linux so far, and some other non-linux machines too, they
> always turn all the page stuff off because it breaks the machine.
> --- excerpt ---
>
> This is an example of a real user for whom "tune the size of your zone
> correctly" is just not good enough. He makes a novel suggestion on how
> anti-defrag + hotplug could be used.
>
> --- excerpt ---
> In the context of hotplug stuff and fragmentation avoidance,
> this sort of reset would be implemented by performing the
> the first step in the hot unplug, to migrate everything off
> of that memory, including whatever kernel pages that exist
> there, but not the second step. Just leave that memory plugged
> in and reset the memory to a sane initial state. Essentially
> this would be some sort of pseudo hotunplug followed by a pseudo
> hotplug of that memory.
> --- excerpt ---
>
> I'm pretty sure this is not what hotplug was aimed at, but it would get him
> what he wants: at the least, large pages via echo BigNumber > nr_hugepages.
> It also needs hotplug remove to be working for some banks and regions of
> memory, although not for the 100% case.
>
> Ok, this is one example of a user with scientific workloads for whom "tune
> the size of the zone" just is not good enough. The admins won't do it for him
> because it'll just break for the next scheduled job.
>
>> Actually I think one reason your patches may perform so well is because
>> there aren't actually a lot of higher order allocations in the kernel.
>>
>> I think that probably leaves us realistically with demand hugepages,
>> hot unplug memory, and IBM lpars?
>>
>
>
>>> Pros/Cons of Solutions
>>> ======================
>>>
>>> Anti-defrag Pros
>>> o Aim9 shows no significant regressions (.37% on page_test). On some
>>> tests, it shows performance gains (> 5% on fork_test)
>>> o Stress tests show that it manages to keep fragmentation down to a far
>>> lower level even without teaching kswapd how to linear reclaim
>> This sounds like a kind of funny test to me if nobody is actually
>> using higher order allocations.
>>
>
> No one uses them because they always fail. This is a chicken and egg
> problem.
>
>> When a higher order allocation is attempted, either you will satisfy
>> it from the kernel region, in which case the vanilla kernel would
>> have done the same. Or you satisfy it from an easy-reclaim contiguous
>> region, in which case it is no longer an easy-reclaim contiguous
>> region.
>>
>
> Right, but right now, we say "don't use high order allocations ever". With
> work, we'll be saying "ok, use high order allocations but they should be
> short lived or you won't be allocating them for long"
>
>>> o Stress tests with a linear reclaim experimental patch shows that it
>>> can successfully find large contiguous chunks of memory
>>> o It is known to help hotplug on PPC64
>>> o No tunables. The approach tries to manage itself as much as possible
>> But it has more dreaded heuristics :P
>>
>
> Yeah, but if it gets them wrong, the system chugs along anyway, just
> fragmented like it is today. If the zone-based approach gets it wrong, the
> system goes down the tubes.
>
> At very worst, the patches give a kernel allocator that is as good as
> today's. At very worst, the zone-based approach makes an unusable system.
> The performance of the patches is another story. I've been posting aim9
> figures based on my test machine. I'm trying to kick an ancient PowerPC
> 43P Model 150 machine into working. This machine is a different
> architecture and ancient (I found it on the way to a skip) so should give
> different figures.
>
>>> o It exists, heavily tested, and synced against the latest -mm1
> o Can be compiled away by redefining the RCLM_* macros and the
>>> __GFP_*RCLM flags
>>>
>>> Anti-defrag Cons
>>> o More complexity within the page allocator
>>> o Adds a new layer onto the allocator that effectively creates subzones
>>> o Adding a new concept that maintainers have to work with
>>> o Depending on the workload, it fragments anyway
>>>
>>> New Zone Pros
>>> o Zones are a well known and understood concept
>>> o For people that do not care about hotplug, they can easily get rid of it
>>> o Provides reliable areas of contiguous groups that can be freed for
>>> HugeTLB pages going to userspace
>>> o Uses existing zone infrastructure for balancing
>>>
>>> New Zone Cons
>>> o Zones historically have introduced balancing problems
>>> o Been tried for hotplug and dropped because of being awkward to work with
>>> o It only helps hotplug and potentially HugeTLB pages for userspace
>>> o Tunable required. If you get it wrong, the system suffers a lot
>> Pro: it keeps IBM mainframe and pseries sysadmins in a job ;) Let
>> them get it right.
>>
>
> Unless you work in a place where the sysadmins will tell you to go away,
> as with the HPC user above. I'm not a sysadmin, but I'm pretty sure they
> have better things to do than twiddle a tunable all day.
>
>>> o Needs to be planned for and developed
>>>
>> Yasunori Goto had patches around from last year. Not sure what sort
>> of shape they're in now but I'd think most of the hard work is done.
>>
>
> But Yasunori (thanks for sending the links) himself said when he posted:
>
> --- excerpt ---
> Another one was a bit similar than Mel-san's one.
> One of motivation of this patch was to create orthogonal relationship
> between Removable and DMA/Normal/Highmem. I thought it is desirable.
> Because, ppc64 can treat that all of memory is same (DMA) zone.
> I thought that new zone spoiled its good feature.
> --- excerpt ---
>
> He thought that the new zone removed the ability of some architectures to
> treat all memory the same. My patches give some of the benefits of using
> another zone while still preserving an architecture's ability to
> treat all memory the same.
>
>>> Scenarios
>>> =========
>>>
>>> Lets outline some situations then or workloads that can occur
>>>
>>> 1. Heavy job running that consumes 75% of physical memory. Like a kernel
>>> build
>>>
>>> Anti-defrag: It will not fragment as it will never have to fallback.High
>>> order allocations will be possible in the remaining 25%.
>>> Zone-based: After been tuned to a kernel build load, it will not
>>> fragment. Get the tuning wrong, performance suffers or workload
>>> fails. High order allocations will be possible in the remaining 25%.
>>>
>> You don't need to continually tune things for each and every possible
>> workload under the sun. It is like how we currently drive 16GB highmem
>> systems quite nicely under most workloads with 1GB of normal memory.
>> Make that an 8:1 ratio if you're worried.
>>
>> [snip]
>>
>>> I've tried to be as objective as possible with the summary.
>>>
>>>> From the points above though, I think that anti-defrag gets us a lot of
>>> the way, with the complexity isolated in one place. It's downside is that
>>> it can still break down and future work is needed to stop it degrading
>>> (kswapd cleaning UserRclm areas and page migration when we get really
>>> stuck). Zone-based is more reliable but only addresses a limited
>>> situation, principally hotplug and it does not even go 100% of the way for
>>> hotplug.
>> To me it seems like it solves the hotplug, lpar hotplug, and hugepages
>> problems which seem to be the main ones.
>>
>>> It also depends on a tunable which is not cool and it is static.
>> I think it is very cool because it means the tiny minority of Linux
>> users who want this can do so without impacting the rest of the code
>> or users. This is how Linux has been traditionally run and I still
>> have a tiny bit of faith left :)
>>
>
> The impact of the code on users will depend on benchmarks. I've posted
> benchmarks that show there are either very small regressions or else there
> are performance gains. As I write this, some of the aim9 benchmarks have
> completed on the PowerPC.
>
> This is a comparison between 2.6.14-rc5-mm1 and
> 2.6.14-rc5-mm1-mbuddy-v19-defragDisabledViaConfig
>
> 1 creat-clo 73500.00 72504.58 -995.42 -1.35% File Creations and Closes/second
> 2 page_test 30806.13 31076.49 270.36 0.88% System Allocations & Pages/second
> 3 brk_test 335299.02 341926.35 6627.33 1.98% System Memory Allocations/second
> 4 jmp_test 1641733.33 1644566.67 2833.34 0.17% Non-local gotos/second
> 5 signal_test 100883.19 98900.18 -1983.01 -1.97% Signal Traps/second
> 6 exec_test 116.53 118.44 1.91 1.64% Program Loads/second
> 7 fork_test 751.70 746.84 -4.86 -0.65% Task Creations/second
> 8 link_test 30217.11 30463.82 246.71 0.82% Link/Unlink Pairs/second
>
> Performance gains on page_test, brk_test and exec_test. Even with
> variances between tests, we are looking at "more or less the same", not
> regressions. No user impact there.
>
> This is a comparison between 2.6.14-rc5-mm1 and
> 2.6.14-rc5-mm1-mbuddy-v19-withantidefrag
>
> 1 creat-clo 73500.00 71188.14 -2311.86 -3.15% File Creations and Closes/second
> 2 page_test 30806.13 31060.96 254.83 0.83% System Allocations & Pages/second
> 3 brk_test 335299.02 344361.15 9062.13 2.70% System Memory Allocations/second
> 4 jmp_test 1641733.33 1627228.80 -14504.53 -0.88% Non-local gotos/second
> 5 signal_test 100883.19 100233.33 -649.86 -0.64% Signal Traps/second
> 6 exec_test 116.53 117.63 1.10 0.94% Program Loads/second
> 7 fork_test 751.70 763.73 12.03 1.60% Task Creations/second
> 8 link_test 30217.11 30322.10 104.99 0.35% Link/Unlink Pairs/second
>
> Performance gains on page_test, brk_test, exec_test and fork_test. Not bad
> going for complex overhead. creat-clo took a beating, but what workload
> opens and closes files at that rate?
>
> This is an old, small machine. If I hotplug this, I'll be lucky if it ever
> turns on again. The aim9 benchmarks on two machines show that there is
> similar and, in some cases better, performance with these patches. If a
> workload does suffer badly, an additional patch has been supplied that
> disables anti-defrag. A run in -mm will tell us if this is the general
> case for machines or are my two test boxes running on magic beans.
>
> So, the small number of users that want this get it. The rest of the
> users, who just run the code, should not notice or care. This brings us
> back to the main stickler, code complexity. I think that the code has been
> very well isolated from the core allocator code, and people looking at the
> allocator could avoid it if they really wanted to while still knowing what
> the buddy allocator was doing.
>
>>> If we make the zones growable+shrinkable, we run into all the same
>>> problems that anti-defrag has today.
>>>
>> But we don't have the extra zones layer that anti defrag has today.
>>
>
> So, we just have an extra layer on the side that has to be configured. All
> of the problems with all of the configuration.
>
>> And anti defrag needs limits if it is to be reliable anyway.
>>
>
> I'm confident given time that I can make this manage itself with a very
> good degree of reliability.
>