Hi,
The following two emails contain the latest version of the placement policy
for the binary buddy allocator to reduce fragmentation and the prezeroing
patch. The changelogs are with the patches.
The placement policy patch should now be more Hotplug-friendly and I would like
to hear from the Hotplug people if they have more requirements of this patch.
Of interest, rmqueue_bulk() has been taught how to allocate large blocks of
pages and split them up into the requested size. An impact of this is that
refilling the per-cpu caches will sometimes be satisfied with a single 2**4
allocation rather than 16 2**0 allocations. Lastly, the beancounters are
now a configurable option under "Kernel Hacking".
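To illustrate the rmqueue_bulk() change, the refill amounts to something
like this (a simplified sketch, not the patch itself; the helper name is
made up and page accounting is omitted):

/*
 * Refill a per-cpu list by taking one 2**4 block off the free lists
 * and handing it out as 16 order-0 pages, instead of doing 16
 * separate order-0 allocations.
 */
static void refill_pcp_sketch(struct zone *zone, struct list_head *list,
                              unsigned long count)
{
        unsigned int order = 0;
        unsigned long i;
        struct page *page;

        /* Find the order covering 'count' base pages, e.g. 16 -> 2**4 */
        while ((1UL << order) < count)
                order++;

        page = __rmqueue(zone, order);
        if (page == NULL)
                return;

        /* Split the block one base page at a time onto the list */
        for (i = 0; i < (1UL << order); i++)
                list_add(&page[i].lru, list);
}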
In terms of fragmentation, the placement policy still performs really well
and the placement policy's raw performance for aim9 and ghostscript rendering
is comparable to the normal allocator so there should be no regressions
there. I've posted new figures with the patch.
The prezeroing patch still regresses fragmentation slightly but nowhere near
as badly as previously. However, the aim9 figures for the prezeroing patch
suck big-time. I suspect it is because zero-allocations are very common
but the lists are usually empty so there is a lot of list traversal that
yields nothing. Figures posted with patch. I have a solution in mind but
it'll be a while before I implement it.
In case others want to reproduce the allocation and ghostscript
benchmarks, I've included them as scripts in vmregress-0.13
(http://www.skynet.ie/~mel/projects/vmregress/vmregress-0.13.tar.gz) but
they are still not integrated with OSDL's STP tool. The scripts are in
bin/bench-gs.sh and bin/bench-stresshighalloc.sh and both take the --help
switch to explain what they do.
The patches were developed and tested heavily on 2.6.11.
--
Mel Gorman
On Mon, 2005-03-07 at 19:39 +0000, Mel Gorman wrote:
> The placement policy patch should now be more Hotplug-friendly and I
> would like to hear from the Hotplug people if they have more
> requirements of this patch.
It looks like most of what we need is there already. There are two
things that come to mind. We'll likely need some modifications that
will deal with committing memory areas that are larger than MAX_ORDER to
the different allocation pools. That's because a hotplug area (memory
section) might be larger than a single MAX_ORDER area, and each section
may need to be limited to a single allocation type.
The other thing is that we'll probably have to be a lot more strict
about how the allocations fall back. Some users will probably prefer to
kill an application rather than let a kernel allocation fall back into a
user memory area.
But, those are things that can be relatively easily grafted on to your
current code. I'm not horribly concerned about that, and merging
something like that is months and months away.
BTW, I wrote some requirements about how these section divisions might
be dealt with. Note that this is a completely hotplug-centric view of
the whole problem, I didn't discern between reclaimable and
unreclaimable kernel memory as your patch does. This is probably waaaay
more than you wanted to hear, but I thought I'd share anyway. :)
> There are 2 kinds of sections: user and kernel. The traditional
> ZONE_HIGHMEM is full of user sections (except for vmalloc). Any
> section which has slab pages or any kernel caller to alloc_pages() is
> a kernel section.
>
> Some properties of these sections:
> a. User sections are easily removed.
> b. Kernel sections are hard to remove. (considered impossible)
> c. User sections may *NOT* be used for kernel pages if all user
> sections are full. (but, see f.)
> d. Kernel sections may be used for user pages if all user sections are
> full.
> e. A transition from a kernel section to a user section is hard, and
> requires that it be empty of all kernel users.
> f. A transition from a user section to a kernel section is easy.
> (although easy, this should be avoided because it's hard to turn it
> _back_ into a user section)
-- Dave
On Mon, 7 Mar 2005, Dave Hansen wrote:
> On Mon, 2005-03-07 at 19:39 +0000, Mel Gorman wrote:
> > The placement policy patch should now be more Hotplug-friendly and I
> > would like to hear from the Hotplug people if they have more
> > requirements of this patch.
>
> It looks like most of what we need is there already. There are two
> things that come to mind. We'll likely need some modifications that
> will deal with committing memory areas that are larger than MAX_ORDER to
> the different allocation pools. That's because a hotplug area (memory
> section) might be larger than a single MAX_ORDER area, and each section
> may need to be limited to a single allocation type.
>
As you say later, stuff like that can be easily grafted on by fiddling
with the bitmap, just with more than one block of MAX_ORDER pages.
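For example, committing a whole section to one type might look something
like this (a sketch only; set_pageblock_type() is a made-up name for
whatever the bitmap accessor ends up being):

/*
 * Tag every MAX_ORDER block within a hotplug section with a single
 * allocation type. A section may span several MAX_ORDER areas.
 */
static void commit_section_type(struct zone *zone, unsigned long start_pfn,
                                unsigned long nr_pages, int type)
{
        unsigned long pfn;
        unsigned long block = 1UL << (MAX_ORDER - 1); /* pages per area */

        for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn += block)
                set_pageblock_type(zone, pfn, type);
}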
> The other thing is that we'll probably have to be a lot more strict
> about how the allocations fall back. Some users will probably prefer to
> kill an application rather than let a kernel allocation fall back into a
> user memory area.
>
That will be a tad trickier because we'll need a way of specifying a
"fallback policy" at configure time. However, the fallback policy is
currently isolated within one while loop, so having different fallback
policies is doable. The kicker is that there might be nasty
interaction with the page reclaim code where the allocator is not falling
back due to policy but the reclaim code thinks everything is just fine.
> BTW, I wrote some requirements about how these section divisions might
> be dealt with. Note that this is a completely hotplug-centric view of
> the whole problem, I didn't discern between reclaimable and
> unreclaimable kernel memory as your patch does. This is probably waaaay
> more than you wanted to hear, but I thought I'd share anyway. :)
>
No, better to hear about it now so I have something to chew over :)
> > There are 2 kinds of sections: user and kernel. The traditional
> > ZONE_HIGHMEM is full of user sections (except for vmalloc).
And PTEs, if configured to be allocated from high memory. I have not
double-checked but I don't think they can be trivially reclaimed.
> > Any
> > section which has slab pages or any kernel caller to alloc_pages() is
> > a kernel section.
> >
Slab pages could be moved to the user section as long as the cache owner
was able to reclaim the slabs on demand.
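The closest existing mechanism is a shrinker callback, which lets a cache
owner free slabs on demand under pressure. Something like this (sketch;
the my_cache_* names are made up):

extern void my_cache_reclaim(int nr);   /* free up to nr objects */
extern int my_cache_count(void);        /* objects still cached */

static int my_cache_shrink(int nr_to_scan, unsigned int gfp_mask)
{
        if (nr_to_scan)
                my_cache_reclaim(nr_to_scan);
        return my_cache_count();
}

static int __init my_cache_init(void)
{
        /* Register with the VM so slabs are reclaimed under pressure */
        set_shrinker(DEFAULT_SEEKS, my_cache_shrink);
        return 0;
}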
> > Some properties of these sections:
> > a. User sections are easily removed.
> > b. Kernel sections are hard to remove. (considered impossible)
> > c. User sections may *NOT* be used for kernel pages if all user
> > sections are full. (but, see f.)
> > d. Kernel sections may be used for user pages if all user sections are
> > full.
> > e. A transition from a kernel section to a user section is hard, and
> > requires that it be empty of all kernel users.
> > f. A transition from a user section to a kernel section is easy.
> > (although easy, this should be avoided because it's hard to turn it
> > _back_ into a user section)
>
All of these requirements are similar to (just not as strict as) those for
fragmentation so common ground should continue to exist.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
On Thu, 2005-03-10 at 14:31 +0000, Mel Gorman wrote:
> > > There are 2 kinds of sections: user and kernel. The traditional
> > > ZONE_HIGHMEM is full of user sections (except for vmalloc).
>
> And PTEs if configured to be allocated from high memory. I have not double
> checked but I don't think they can be trivially reclaimed.
We've run into a couple of these pieces of highmem that can't be
reclaimed. The latest ones are pages for the new pipe buffers. We could
code these up with a flag, something like __GFP_HIGHMEM_NORCLM, that is
__GFP_HIGHMEM in the normal case, but 0 in the hotplug case (at least
for now).
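Just to sketch the idea (the hotplug config symbol is a placeholder):

#ifdef CONFIG_MEMORY_HOTPLUG
#define __GFP_HIGHMEM_NORCLM    0               /* force lowmem */
#else
#define __GFP_HIGHMEM_NORCLM    __GFP_HIGHMEM   /* normal case */
#endif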
> > > Any
> > > section which has slab pages or any kernel caller to alloc_pages() is
> > > a kernel section.
>
> Slab pages could be moved to the user section as long as the cache owner
> was able to reclaim the slabs on demand.
At least for the large consumers of slab (dentry/inode caches), they
can't quite reclaim on demand. I was picking Dipankar's brain about
this one day, and there are going to be particularly troublesome
dentries, like "/", that will need some serious rethinking before they
can be forcefully freed.
-- Dave
Mel Gorman, responding to Dave Hansen
> > The other thing is that we'll probably have to be a lot more strict
> > about how the allocations fall back. Some users will probably prefer to
> > kill an application rather than let a kernel allocation fall back into a
> > user memory area.
> >
>
> That will be a tad trickier because we'll need a way of specifying a
> "fallback policy" at configure time. However, the fallback policy is
> currently isolated within one while loop, so having different fallback
> policies is doable. The kicker is that there might be nasty
> interaction with the page reclaim code where the allocator is not falling
> back due to policy but the reclaim code thinks everything is just fine.
There is at least one policy, perhaps a few, that I'd like to see in
the current allocator as well.
In particular, I am working on preparing a patch proposal for a policy
that would kill a task rather than invoke the swapper. In
mm/page_alloc.c __alloc_pages(), if one gets down to the point of being
about to kick the swapper, if this policy is enabled (and you're not
in_interrupt() and don't have flag PF_MEMALLOC set), then ask
oom_kill_task() to shoot us instead. For some big HPC jobs that are
carefully sized to fit on the allowed memory nodes, swapping is a fate
worse than death.
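In rough outline, the check would sit in __alloc_pages() like this (a
sketch, not the patch; kill_no_swap_policy() is a made-up name for
whatever predicate ends up answering the question):

        /* Just before kicking the swapper in __alloc_pages(): */
        if (kill_no_swap_policy(current) &&
            !in_interrupt() && !(current->flags & PF_MEMALLOC)) {
                oom_kill_task(current);         /* shoot ourselves instead */
                return NULL;
        }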
The natural place (for me, anyway) to hang such policies is off the
cpuset.
I am hopeful that cpusets will soon hit Linus's tree.
Would it make sense to specify these buddy allocator fallback policies
per cpuset as well?
I'd be glad to investigate providing the cpuset part of the code,
exposing the appropriate boolean, enum, scalar or bitmap type(s) for
userland query and setting, as another file in each cpuset directory, if
that would facilitate this.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Thu, 2005-03-10 at 09:22 -0800, Paul Jackson wrote:
> In particular, I am working on preparing a patch proposal for a policy
> that would kill a task rather than invoke the swapper. In
> mm/page_alloc.c __alloc_pages(), if one gets down to the point of being
> about to kick the swapper, if this policy is enabled (and you're not
> in_interrupt() and don't have flag PF_MEMALLOC set), then ask
> oom_kill_task() to shoot us instead. For some big HPC jobs that are
> carefully sized to fit on the allowed memory nodes, swapping is a fate
> worse than death.
>
> The natural place (for me, anyway) to hang such policies is off the
> cpuset.
>
> I am hopeful that cpusets will soon hit Linus's tree.
>
> Would it make sense to specify these buddy allocator fallback policies
> per cpuset as well?
That seems reasonable, but I don't think there will necessarily be enough
cpuset users to make this reasonable as the only interface.
Seems like something VMA-based along the lines of madvise() or the NUMA
binding API would be more appropriate. Perhaps default policies
inherited from a cpuset, but overridden by other APIs would be a good
compromise.
I have the feeling that applications will want to give the same kind of
notifications for swapping as they would for memory hotplug operations
as well. In decreasing order of pickiness:
1. Don't touch me at all
2. It's OK to migrate these pages elsewhere on this node
3. It's OK to migrate these pages anywhere
4. It's OK to swap these pages out
Although the node part, at least, can almost certainly be done in
combination with the NUMA API.
-- Dave
Dave wrote:
> Perhaps default policies inherited from a cpuset, but overridden by
> other APIs would be a good compromise.
Perhaps. The madvise() and numa calls (mbind, set_mempolicy) only
affect the current task, as is usually appropriate for calls that allow
specification of specific address ranges (strangers shouldn't be messing
in my address space). Some external means to set default policy for
whole tasks seems to be needed, as well, which could well be via the
cpuset.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Thu, 2005-03-10 at 12:11 -0800, Paul Jackson wrote:
> Dave wrote:
> > Perhaps default policies inherited from a cpuset, but overridden by
> > other APIs would be a good compromise.
>
> Perhaps. The madvise() and numa calls (mbind, set_mempolicy) only
> affect the current task, as is usually appropriate for calls that allow
> specification of specific address ranges (strangers shouldn't be messing
> in my address space). Some external means to set default policy for
> whole tasks seems to be needed, as well, which could well be via the
> cpuset.
Shouldn't a particular task know what the policy should be when it is
launched? If the policy is only per-task and known at task exec time,
I'd imagine that a simple exec wrapper setting a flag would be much more
effective than even defining the policy in a cpuset.
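Something along these lines (userspace sketch; set_fallback_policy() is a
hypothetical wrapper for whatever interface gets settled on):

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

extern int set_fallback_policy(pid_t pid, const char *name);

int main(int argc, char *argv[])
{
        if (argc < 2)
                return 1;

        /* Set the policy, then exec the real job with it in place */
        set_fallback_policy(getpid(), "noswapper");
        execvp(argv[1], &argv[1]);
        perror("execvp");
        return 1;
}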
-- Dave
Dave wrote:
> Shouldn't a particular task know what the policy should be when it is
> launched?
No ... not necessarily because the policy isn't known yet, but rather
because it might be imposed earlier in job creation, before the
actual task hierarchy is manifest. This point goes to the heart of one
of the motivations for cpusets themselves.
On a big system, one might have OpenMP threads inside MPI tasks inside
jobs being managed by a batch manager, running on a subset of the
system. The system admins may need to impose these policy decisions
from the outside, and not uniformly across the entire batch managed
arena. The cpuset becomes the named object, to which such attributes
accrue, to take effect on whatever threads, tasks, or jobs end up
thereon.
Do a google search for "mixed openmp mpi", or for "hybrid openmp mpi",
to find examples of such usage, then imagine such jobs running inside a
batch manager, on a portion of a larger system.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401
On Thu, 10 Mar 2005, Dave Hansen wrote:
> On Thu, 2005-03-10 at 09:22 -0800, Paul Jackson wrote:
> > In particular, I am working on preparing a patch proposal for a policy
> > that would kill a task rather than invoke the swapper. In
> > mm/page_alloc.c __alloc_pages(), if one gets down to the point of being
> > about to kick the swapper, if this policy is enabled (and you're not
> > in_interrupt() and don't have flag PF_MEMALLOC set), then ask
> > oom_kill_task() to shoot us instead. For some big HPC jobs that are
> > carefully sized to fit on the allowed memory nodes, swapping is a fate
> > worse than death.
> >
> > The natural place (for me, anyway) to hang such policies is off the
> > cpuset.
> >
I have not read up on cpuset before, so I am assuming you are talking about
http://www.bullopensource.org/cpuset/; correct me if I am wrong.
> > I am hopeful that cpusets will soon hit Linus's tree.
> >
> > Would it make sense to specify these buddy allocator fallback policies
> > per cpuset as well?
>
> That seems reasonable, but I don't think there will necessarily be enough
> cpuset users to make this reasonable as the only interface.
>
> Seems like something VMA-based along the lines of madvise() or the NUMA
> binding API would be more appropriate. Perhaps default policies
> inherited from a cpuset, but overridden by other APIs would be a good
> compromise.
>
I would think that VMA is too fine-grained and there is not a really clean
way of specifying what fallback policy to use with madvise() unless we
hard-code a fixed number of policies. I agree that if cpuset is not
widely used, it should not be the only way of setting policy. However, the
NUMA binding API deals with memory ranges, not PIDs. Implementing the
fallbacks for memory ranges would be at the node or zone level, not per
PID, which could conflict with cpuset (for example, which takes priority
when both cpuset and node policies are set?). So, I don't think the NUMA
binding as it is today is exactly the way to go either.
First though, we have to define how a fallback policy would be implemented
and applied. Right now, there is one hard-coded policy (assuming that the
placement policy gets merged at some point in the future) that is
implemented in __rmqueue() and is applied at the zone level. It is
implemented as a while loop and the information it uses is:

Input:  int *fallback_list        Integer array of fallback allocation types
        struct zone *zone         The zone being allocated from
        int order                 The required allocation order

Output: struct free_area *area    The area we are going to allocate from
        int current_order         The order of the free pages in area
So, a very rough fallback policy setup might look something like:

/*
 * Fallback function fills this struct telling the allocator where to get
 * free pages from
 */
struct fallback {
        struct free_area *area;
        unsigned long current_order;
};

/* struct that describes a fallback policy */
struct fallback_policy {
        /* Fills *result from the fallback list, zone and order */
        void (*fallback)(struct fallback *result, int *fallback_list,
                         struct zone *zone, int order);
        char name[10];
        int id;                 /* set on register */
        struct list_head list;  /* entry in fallback_policies */
};

/* List of all available fallback policies */
static LIST_HEAD(fallback_policies);

/* Default policy to use */
struct fallback_policy default_policy = {
        .fallback = fallback_lowfrag,
        .name = "default",
        .id = 0,
};

/*
 * Register a new fallback policy. Throws a wobbly if the name is already
 * registered
 */
void register_fallback(struct fallback_policy *policy);

/* Unregister a fallback, calls BUG() if policy == default_policy */
void unregister_fallback(struct fallback_policy *policy);
The fallback_lowfrag() function would just implement the current fallback
approach. Other users like hotplug or cpuset would create their own
fallback policy, populate a fallback_policy and register it. I think this
is similar to how IO schedulers are registered.
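A possible shape for register_fallback() itself (sketch only; locking is
hand-waved with a single lock):

static spinlock_t fallback_lock = SPIN_LOCK_UNLOCKED;
static int next_fallback_id = 1;        /* 0 is the default policy */

void register_fallback(struct fallback_policy *policy)
{
        struct fallback_policy *p;

        spin_lock(&fallback_lock);

        /* Throw the promised wobbly on a duplicate name */
        list_for_each_entry(p, &fallback_policies, list)
                BUG_ON(!strcmp(p->name, policy->name));

        policy->id = next_fallback_id++;
        list_add(&policy->list, &fallback_policies);

        spin_unlock(&fallback_lock);
}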
Next... How do we apply it.
I think the sensible level to have a fallback policy is at the mm_struct
level. If the mm does not have a pointer to a fallback_policy, it uses
the default. This would be inherited from either a cpuset or explicitly
set via a specific API (should it inherit across exec()? Initially, I
would think no).
Case 1: The cpuset case
cpuset has a pseudo filesystem called cfs that looks very like /sys
mounted on /dev/cpuset. Each directory under /dev/cpuset is a cpuset and
there are a number of files there. For fallbacks, a new file would be
there called fallback. Catting it would print something like
[default] hotplug noswapper
where the name between [] is what is currently set. To set a new fallback
policy, echo the new string name to fallback like
echo noswapper > fallback
Any process created with pexec() or otherwise part of a cpuset will get
its policy from here.
Case 2: Specific API
The NUMA binding API has a method that looks like this:

#include <linux/mempolicy.h>

int set_mempolicy(int policy, unsigned long *nodemask,
                  unsigned long maxnode);
We could have something like
int set_mempolicy_fallback(int pid, char *policy_name);
Straight off, I don't like that policy_name is a string. Maybe we would
export a mapping of IDs to string names via /sys, although it means users of
the API would have to parse /sys first, which is undesirable.
Case 3: Dirty hack via /proc
We could have a proc entry like /proc/pid/fallback which behaves the same
as the cfs entry. I have a funny feeling that no one will go for this for
anything other than a proof of concept.
> I have the feeling that applications will want to give the same kind of
> notifications for swapping as they would for memory hotplug operations
> as well. In decreasing order of pickiness:
>
> 1. Don't touch me at all
> 2. It's OK to migrate these pages elsewhere on this node
> 3. It's OK to migrate these pages anywhere
> 4. It's OK to swap these pages out
>
I think this is a separate issue because it affects page replacement both
locally and globally as well as allocation fallback policy.
--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab
Mel wrote:
> I have not read up on cpuset before so I am assuming you are talking about
> http://www.bullopensource.org/cpuset/ so correct me if I am wrong.
Yes - that. See also the kernel doc file:
Documentation/cpusets.txt
> I agree that if cpuset is not
> widely used, it should not be the only way of setting policy.
Well ... cpusets just hit Linus's tree four days ago, so I wouldn't expect
them to have achieved world domination quite yet ;).
Cpusets implement hardwall outer limits on cpu and memory usage. The
tasks assigned to a cpuset are only allowed to work within that cpuset.
Within a cpuset, a job may use sched_setaffinity, set_mempolicy, mbind
and madvise to make fine grained placement and related policy choices
however it chooses, subject to the broad, hard constraints of the cpuset.
The imposition of a policy that says a task can't swap is usually, at
least where I see it used, a hard constraint, imposed externally on an
entire job, for the well-being of the rest of the system:
Waking up swappers imposes a burden on the rest of the
system, which some jobs must not be allowed to do.
And wasting further cpu cycles on a job that has exceeded
its allowed memory when it wasn't supposed to (and hence
no longer has any chance of sustaining the in-memory
performance required of it) is a waste of possibly expensive
compute resources.
The natural place for such an externally imposed policy limiting overall
processor or memory usage by a group of tasks is, in my admittedly
biased view, the cpuset.
I envision a per-cpuset file, "policy_kill_no_swap", containing a
boolean "0" or "1" (actually, "0\n" or "1\n"). It defaults to "0". If
set to "1" (by writing "1" to that file) then if any task in that cpuset
gets far enough in the mm/page_alloc.c:__alloc_pages() code to initiate
swapping, that task is killed instead.
I don't see any need to have any other way of specifying this policy
preference by a per task call such as set_mempolicy(2). However if
others saw such a need, I'm open to considering it.
I don't view this fallback stuff like I see Mel describing it. I don't
see it as passing a list of fallback alternatives to a single API.
Rather, each API need only specify one policy. The only place
'fallback' comes into play is if there are multiple APIs (such as both
set_mempolicy and cpusets) that affect the same decision in the kernel
(such as whether to let a task invoke swapping, or to kill it instead).
The 'fallback' is the choice of which API takes precedence here. For
system-wide imposed hardwall limitations, the cpuset should have its
policy enforced. Within those limits, finer grained calls such as
set_mempolicy should prevail.
So, if others did make the case for a second, per-task, way of
specifying this 'kill_no_swap' policy, then:
1) If a task's cpuset policy_kill_no_swap is true, that prevails.
2) Otherwise the per-task setting of kill_no_swap prevails.
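Or, as a one-liner sketch of the decision (both accessors made up):

/* cpuset hardwall policy wins; otherwise the per-task setting holds */
static inline int kill_no_swap(struct task_struct *p)
{
        return cpuset_policy_kill_no_swap(p) || task_kill_no_swap(p);
}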
The choice of where migration is allowed is separate, in my view,
and deserves its own policy flags. I don't know what those flags
should be.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401