Hi
I tested numactl on 2.6.24-rc8-mm1.
and I found strange behavior.
test method and result.
$ numactl --interleave=all ls
set_mempolicy: Invalid argument
setting interleave mask: Invalid argument
numactl command download from
ftp://ftp.suse.com/pub/people/ak/numa/
(I choice numactl-1.0.2)
Of course, older kernel(RHEL5.1) works good.
more detail:
1. my machine node and memory.
$ numactl --hardware
available: 16 nodes (0-15)
node 0 size: 0 MB
node 0 free: 0 MB
node 1 size: 0 MB
node 1 free: 0 MB
node 2 size: 3872 MB
node 2 free: 1487 MB
node 3 size: 4032 MB
node 3 free: 3671 MB
node 4 size: 0 MB
node 4 free: 0 MB
node 5 size: 0 MB
node 5 free: 0 MB
node 6 size: 0 MB
node 6 free: 0 MB
node 7 size: 0 MB
node 7 free: 0 MB
node 8 size: 0 MB
node 8 free: 0 MB
node 9 size: 0 MB
node 9 free: 0 MB
node 10 size: 0 MB
node 10 free: 0 MB
node 11 size: 0 MB
node 11 free: 0 MB
node 12 size: 0 MB
node 12 free: 0 MB
node 13 size: 0 MB
node 13 free: 0 MB
node 14 size: 0 MB
node 14 free: 0 MB
node 15 size: 0 MB
node 15 free: 0 MB
2. numactl behavior of --interleave=all
2.1 scan "/sys/devices/system/node" dir
2.2 calculate max node number
2.3 all bit turn on of existing node.
(i.e. 0xFF generated on my environment.)
2.4 call set_mempolicy()
3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
in mpol_check_policy()
-> check failed when memmoryless node exist.
(i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
4. RHEL5.1 set_mempolicy(2) behavior
4.1 check nodesubset(nodemask argument, node_online_map)
in mpol_check_policy().
-> check success.
I don't know wrong either kernel or libnuma.
Please any comments!
- kosaki
[intentional full quote]
On Sat, Feb 02, 2008 at 05:12:30PM +0900, KOSAKI Motohiro wrote:
> I tested numactl on 2.6.24-rc8-mm1.
> and I found strange behavior.
>
> test method and result.
>
> $ numactl --interleave=all ls
> set_mempolicy: Invalid argument
> setting interleave mask: Invalid argument
>
> numactl command download from
> ftp://ftp.suse.com/pub/people/ak/numa/
> (I choice numactl-1.0.2)
>
>
> Of course, older kernel(RHEL5.1) works good.
>
>
>
> more detail:
>
> 1. my machine node and memory.
>
> $ numactl --hardware
> available: 16 nodes (0-15)
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 size: 0 MB
> node 1 free: 0 MB
> node 2 size: 3872 MB
> node 2 free: 1487 MB
> node 3 size: 4032 MB
> node 3 free: 3671 MB
> node 4 size: 0 MB
> node 4 free: 0 MB
> node 5 size: 0 MB
> node 5 free: 0 MB
> node 6 size: 0 MB
> node 6 free: 0 MB
> node 7 size: 0 MB
> node 7 free: 0 MB
> node 8 size: 0 MB
> node 8 free: 0 MB
> node 9 size: 0 MB
> node 9 free: 0 MB
> node 10 size: 0 MB
> node 10 free: 0 MB
> node 11 size: 0 MB
> node 11 free: 0 MB
> node 12 size: 0 MB
> node 12 free: 0 MB
> node 13 size: 0 MB
> node 13 free: 0 MB
> node 14 size: 0 MB
> node 14 free: 0 MB
> node 15 size: 0 MB
> node 15 free: 0 MB
>
>
> 2. numactl behavior of --interleave=all
> 2.1 scan "/sys/devices/system/node" dir
> 2.2 calculate max node number
> 2.3 all bit turn on of existing node.
> (i.e. 0xFF generated on my environment.)
> 2.4 call set_mempolicy()
>
> 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> 3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> in mpol_check_policy()
>
> -> check failed when memmoryless node exist.
> (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
>
> 4. RHEL5.1 set_mempolicy(2) behavior
> 4.1 check nodesubset(nodemask argument, node_online_map)
> in mpol_check_policy().
>
> -> check success.
>
>
> I don't know wrong either kernel or libnuma.
When the kernel behaviour changes and breaks user space then the kernel
is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
-Andi
Hi Andi,
> > 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> > 3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> > in mpol_check_policy()
> >
> > -> check failed when memmoryless node exist.
> > (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> >
> > 4. RHEL5.1 set_mempolicy(2) behavior
> > 4.1 check nodesubset(nodemask argument, node_online_map)
> > in mpol_check_policy().
> >
> > -> check success.
> >
> > I don't know wrong either kernel or libnuma.
>
> When the kernel behaviour changes and breaks user space then the kernel
> is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
may be yes, may be no.
I have 1 simple question.
Why do libnuma generate bitpattern of all bit on instead
check /sys/devices/system/node/has_high_memory nor
check /sys/devices/system/node/online?
Do you know it?
and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
if CONFIG_HIGHMEM disabled, the has_high_memory file show
the same as the has_normal_memory.
may be, userland process should check has_high_memory file.
but, I am not confident.
Thanks.
- kosaki
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
drivers/base/node.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
Index: b/drivers/base/node.c
===================================================================
--- a/drivers/base/node.c 2008-02-02 17:52:32.000000000 +0900
+++ b/drivers/base/node.c 2008-02-02 18:32:38.000000000 +0900
@@ -276,7 +276,6 @@ static SYSDEV_CLASS_ATTR(has_normal_memo
NULL);
static SYSDEV_CLASS_ATTR(has_cpu, 0444, print_nodes_has_cpu, NULL);
-#ifdef CONFIG_HIGHMEM
static ssize_t print_nodes_has_high_memory(struct sysdev_class *class,
char *buf)
{
@@ -285,15 +284,11 @@ static ssize_t print_nodes_has_high_memo
static SYSDEV_CLASS_ATTR(has_high_memory, 0444, print_nodes_has_high_memory,
NULL);
-#endif
-
struct sysdev_class_attribute *node_state_attr[] = {
&attr_possible,
&attr_online,
&attr_has_normal_memory,
-#ifdef CONFIG_HIGHMEM
&attr_has_high_memory,
-#endif
&attr_has_cpu,
};
@@ -302,7 +297,7 @@ static int node_states_init(void)
int i;
int err = 0;
- for (i = 0; i < NR_NODE_STATES; i++) {
+ for (i = 0; i < ARRAY_SIZE(node_state_attr); i++) {
int ret;
ret = sysdev_class_create_file(&node_class, node_state_attr[i]);
if (!err)
> I have 1 simple question.
> Why do libnuma generate bitpattern of all bit on instead
> check /sys/devices/system/node/has_high_memory nor
> check /sys/devices/system/node/online?
>
> Do you know it?
It's far simpler and cheaper (sysfs is expensive) to do this in the kernel
and besides the kernel can do more easily keep up with dynamic topology
changes.
>
> and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> if CONFIG_HIGHMEM disabled, the has_high_memory file show
> the same as the has_normal_memory.
>
> may be, userland process should check has_high_memory file.
To be honest I've never tried seriously to make 32bit NUMA policy
(with highmem) work well; just kept it at a "should not break"
level. That is because with highmem the kernel's choices at
placing memory are seriously limited anyways so I doubt 32bit
NUMA will ever work very well.
-Andi
On Sat, 2008-02-02 at 18:37 +0900, KOSAKI Motohiro wrote:
> Hi Andi,
>
> > > 3. 2.6.24-rc8-mm1 set_mempolicy(2) behavior
> > > 3.1 check nodesubset(nodemask argument, node_states[N_HIGH_MEMORY])
> > > in mpol_check_policy()
> > >
> > > -> check failed when memmoryless node exist.
> > > (i.e. node_states[N_HIGH_MEMORY] of my machine is 0xc)
> > >
> > > 4. RHEL5.1 set_mempolicy(2) behavior
> > > 4.1 check nodesubset(nodemask argument, node_online_map)
> > > in mpol_check_policy().
> > >
> > > -> check success.
> > >
> > > I don't know wrong either kernel or libnuma.
> >
> > When the kernel behaviour changes and breaks user space then the kernel
> > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
The memoryless nodes patch series changed a lot of things, so just
reverting this one area [mpol_check_policy()] probably won't restore the
prior behavior. A fully populated node mask is not necessarily a proper
subset of node_online_map(). And contextualize_policy() also requires
the mask to be a subset of mems_allowed which also defaults to nodes
with memory.
I don't know how Mel Gorman's "two zonelist" series, which is still
awaiting a window into the -mm tree, affects this behavior. Those
patches will certainly be affected by whatever we decide here.
I don't know the current state of Paul's rework of cpusets and
mems_allowed. That probably resolves this issue, if he still plans on
allowing a fully populated mask to indicate interleaving over all
allowed nodes.
I have a patch that takes a different approach to "interleave=all" that
doesn't solve Paul's and David's requirements. I also have patches to
libnuma and numactl that work with my patches, but I saw no sense in
posting them unless my kernel patches got some traction. If interested,
you can find them at:
http://free.linux.hp.com/~lts/Patches/Numactl/
>
> may be yes, may be no.
>
> I have 1 simple question.
> Why do libnuma generate bitpattern of all bit on instead
> check /sys/devices/system/node/has_high_memory nor
> check /sys/devices/system/node/online?
>
> Do you know it?
In addition to Andi's answer about simplicity, libnuma and numactl
predate the sysfs node masks. There was no way to query what the valid
set of nodes would be, but the kernel allowed a fully populated map. We
broke that with the memoryless nodes rework.
>
> and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> if CONFIG_HIGHMEM disabled, the has_high_memory file show
> the same as the has_normal_memory.
> may be, userland process should check has_high_memory file.
>
> but, I am not confident.
Regarding the patch itself: If others have no problems with displaying
a "has_high_memory" node mask for systems w/o HIGH_MEM configured, I can
live with it.
The current upstream kernel [2.6.24] supports a MPOL_MEMS_ALLOWED flag
to get_mempolicy() to return the nodes allowed in the caller's cpuset.
My numactl patches, mentioned above, support this.
However, as Andi says, we really can't break application behavior. All
applications that use mempolicy don't necessarily use libnuma APIs. So,
a fully populated interleave node mask should be allowed and should
probably mean "all allowed nodes with memory".
I think we'd still need to reduce the interleave policy mask to nodes
with memory when it's installed or find another way to skip memoryless
nodes when interleaving, else we don't get even distribution of
interleaved pages over the nodes that do have memory. This is one of
the memoryless nodes fixes. I THINK this is one of the areas that Paul
and David are investigating.
Christoph, Mel, Paul: any suggestions for a [relatively quick] fix that
doesn't break the memoryless nodes work and doesn't violate cpuset
constraints?
> Thanks.
>
>
> - kosaki
>
>
>
> Signed-off-by: KOSAKI Motohiro <[email protected]>
>
> ---
> drivers/base/node.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> Index: b/drivers/base/node.c
> ===================================================================
> --- a/drivers/base/node.c 2008-02-02 17:52:32.000000000 +0900
> +++ b/drivers/base/node.c 2008-02-02 18:32:38.000000000 +0900
> @@ -276,7 +276,6 @@ static SYSDEV_CLASS_ATTR(has_normal_memo
> NULL);
> static SYSDEV_CLASS_ATTR(has_cpu, 0444, print_nodes_has_cpu, NULL);
>
> -#ifdef CONFIG_HIGHMEM
> static ssize_t print_nodes_has_high_memory(struct sysdev_class *class,
> char *buf)
> {
> @@ -285,15 +284,11 @@ static ssize_t print_nodes_has_high_memo
>
> static SYSDEV_CLASS_ATTR(has_high_memory, 0444, print_nodes_has_high_memory,
> NULL);
> -#endif
> -
> struct sysdev_class_attribute *node_state_attr[] = {
> &attr_possible,
> &attr_online,
> &attr_has_normal_memory,
> -#ifdef CONFIG_HIGHMEM
> &attr_has_high_memory,
> -#endif
> &attr_has_cpu,
> };
>
> @@ -302,7 +297,7 @@ static int node_states_init(void)
> int i;
> int err = 0;
>
> - for (i = 0; i < NR_NODE_STATES; i++) {
> + for (i = 0; i < ARRAY_SIZE(node_state_attr); i++) {
> int ret;
> ret = sysdev_class_create_file(&node_class, node_state_attr[i]);
> if (!err)
>
On Sat, 2 Feb 2008, Andi Kleen wrote:
> To be honest I've never tried seriously to make 32bit NUMA policy
> (with highmem) work well; just kept it at a "should not break"
> level. That is because with highmem the kernel's choices at
> placing memory are seriously limited anyways so I doubt 32bit
> NUMA will ever work very well.
Memory policies do not work reliably with config highmem (I have never
seen such usage because large memory systems are typically 64 bit
which have no highmem, but there are some 32bit numa uses of HIGHMEM) ....
Memory policies are only applied to the highest zone. So if a system has
highmem on some nodes and not on the others then policies will only be
applied if allocations happen to occur on the highmem nodes.
Hi Lee-san
I change subject because 2.6.24 reproduce too.
> I have a patch that takes a different approach to "interleave=all" that
> doesn't solve Paul's and David's requirements. I also have patches to
> libnuma and numactl that work with my patches, but I saw no sense in
> posting them unless my kernel patches got some traction. If interested,
> you can find them at:
>
> http://free.linux.hp.com/~lts/Patches/Numactl/
unfortunately it doesn't works on my test environment ;-)
numactl-orig numactl-with-lee-patch
2.6.24 failed failed
2.6.24-rc8-mm1 failed failed
I got below error messages by all case.
$ numactl --interleave=all ls
set_mempolicy: Invalid argument
setting interleave mask: Invalid argument
I think kernel is need changed too.
I attached bellow.
kernel2.6.24-rc8-mm1 + mypatch + numactl-1.0.2 + leepatch works good.
> > and I made simple patch that has_high_memory exposed however CONFIG_HIGHMEM disabled.
> > if CONFIG_HIGHMEM disabled, the has_high_memory file show
> > the same as the has_normal_memory.
> > may be, userland process should check has_high_memory file.
>
> Regarding the patch itself: If others have no problems with displaying
> a "has_high_memory" node mask for systems w/o HIGH_MEM configured, I can
> live with it.
>
> The current upstream kernel [2.6.24] supports a MPOL_MEMS_ALLOWED flag
> to get_mempolicy() to return the nodes allowed in the caller's cpuset.
> My numactl patches, mentioned above, support this.
OK, I cancel my previous has_high_memory patch.
and, I understood anyone doesn't use 32bit numa.
> However, as Andi says, we really can't break application behavior. All
> applications that use mempolicy don't necessarily use libnuma APIs. So,
> a fully populated interleave node mask should be allowed and should
> probably mean "all allowed nodes with memory".
Agreed.
> I think we'd still need to reduce the interleave policy mask to nodes
> with memory when it's installed or find another way to skip memoryless
> nodes when interleaving, else we don't get even distribution of
> interleaved pages over the nodes that do have memory. This is one of
> the memoryless nodes fixes. I THINK this is one of the areas that Paul
> and David are investigating.
this is good news for me :)
I'll wait his patch post.
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/mempolicy.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c 2008-02-02 17:54:33.000000000 +0900
+++ b/mm/mempolicy.c 2008-02-05 17:49:47.000000000 +0900
@@ -187,9 +187,12 @@ static struct mempolicy *mpol_new(int mo
atomic_set(&policy->refcnt, 1);
switch (mode) {
case MPOL_INTERLEAVE:
- policy->v.nodes = *nodes;
- nodes_and(policy->v.nodes, policy->v.nodes,
- node_states[N_HIGH_MEMORY]);
+ if (nodes) {
+ policy->v.nodes = *nodes;
+ nodes_and(policy->v.nodes, policy->v.nodes,
+ node_states[N_HIGH_MEMORY]);
+ } else
+ policy->v.nodes = node_states[N_HIGH_MEMORY];
if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
@@ -934,7 +937,7 @@ asmlinkage long sys_set_mempolicy(int mo
err = get_nodes(&nodes, nmask, maxnode);
if (err)
return err;
- return do_set_mempolicy(mode, &nodes);
+ return do_set_mempolicy(mode, nodes_empty(nodes) ? NULL : &nodes);
}
asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
Lee wrote:
> I don't know the current state of Paul's rework of cpusets and
> mems_allowed. That probably resolves this issue, if he still plans on
> allowing a fully populated mask to indicate interleaving over all
> allowed nodes.
It got a bit stalled out for the last month (my employer had other
designs on my time.) But I'd really like to drive it home.
What happened so far, in December 2007 and earlier, is that a few of us:
David Rientjes <[email protected]>
[email protected]
Christoph Lameter <[email protected]>
Andi Kleen <[email protected]>
had a discussion, motivated in good part by the need to allow a
mempolicy of MPOL_INTERLEAVE over all nodes currently available in
the cpuset, where that interleave policy was robustly preserved if
the cpuset changed (without requiring the application to somehow
"know" its cpuset had changed and reissuing the set_mempolicy call.)
But that discussion touched on some other long standing deficiencies
in the way that I had originally glued cpusets and memory policies
together. The current mechanism doesn't handle changing cpusets very
well, especially if the number of nodes in the cpuset increases.
Obviously, I can't change the current behaviour, especially of the
mempolicy system calls. I can only add new options that provide new
alternatives.
The patchset I'd like to drive home addresses these issues with a
couple of additional MPOL_* flags, upward compatible, that alter the
way that nodemasks are mapped into cpusets, and remapped if the cpuset
subsequently changes.
The next two steps I need to take are:
1) propose this patch, with careful explanation (it's easy to lose
one's bearings in the mappings and remappings of node numberings)
to a wider audience, such as linux-mm or linux-kernel, and
2) carefully test this, especially on each code path I touched in
mm/mempolicy.c, where the changes were delicate, to ensure I
didn't break any existing code.
There were also some other, smaller patches proposed, by myself and
others. I was preferring to address a wider set of the long standing
issues in this area, but the others above mostly preferred the smaller
patches. This needs to be discussed in a wider forum, and a concensus
reached.
Hopefully this week or next, I will publish this patch proposal.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Hi Paul
> Hopefully this week or next, I will publish this patch proposal.
Great.
at that time, I will join review the patch with presure :)
- kosaki
On (04/02/08 13:20), Lee Schermerhorn didst pronounce:
> > > When the kernel behaviour changes and breaks user space then the kernel
> > > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
>
> The memoryless nodes patch series changed a lot of things, so just
> reverting this one area [mpol_check_policy()] probably won't restore the
> prior behavior. A fully populated node mask is not necessarily a proper
> subset of node_online_map(). And contextualize_policy() also requires
> the mask to be a subset of mems_allowed which also defaults to nodes
> with memory.
>
> I don't know how Mel Gorman's "two zonelist" series, which is still
> awaiting a window into the -mm tree, affects this behavior. Those
> patches will certainly be affected by whatever we decide here.
>
I doubt they'd make a difference to this particular problem.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, 2008-02-05 at 14:31 +0000, Mel Gorman wrote:
> On (04/02/08 13:20), Lee Schermerhorn didst pronounce:
> > > > When the kernel behaviour changes and breaks user space then the kernel
> > > > is usually wrong. Cc'ed Lee S. who maintains the kernel code now.
> >
> > The memoryless nodes patch series changed a lot of things, so just
> > reverting this one area [mpol_check_policy()] probably won't restore the
> > prior behavior. A fully populated node mask is not necessarily a proper
> > subset of node_online_map(). And contextualize_policy() also requires
> > the mask to be a subset of mems_allowed which also defaults to nodes
> > with memory.
> >
> > I don't know how Mel Gorman's "two zonelist" series, which is still
> > awaiting a window into the -mm tree, affects this behavior. Those
> > patches will certainly be affected by whatever we decide here.
> >
>
> I doubt they'd make a difference to this particular problem.
I didn't really think so, but I wanted to give you a heads up regarding
this, as I think it will affect your patches. I'm hoping we'll see them
in -mm soon after the .25 merge window closes. If you get there before
a fix to this issue, so much the better, IMO :-).
Lee
>
Could we focus on the problem instead of discussion of new patches under
development? Can we confirm that what Kosaki sees is a bug?
On Tue, 2008-02-05 at 10:12 -0800, Christoph Lameter wrote:
> Could we focus on the problem instead of discussion of new patches under
> development?
Christoph: you are free to ignore any part of this discussion that you
wish...
> Can we confirm that what Kosaki sees is a bug?
by definition, right? we broke user space.
Lee
On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
> Christoph: you are free to ignore any part of this discussion that you
> wish...
Had the impression that we are ignoring Kosaki's fix to the problem. Can
we fix up his patch to address the immediate issue?
Christoph wrote:
> Can we fix up his patch to address the immediate issue?
Since any of those future patches only add optional modes
with new flags, while preserving current behaviour if you
don't use one of the new flags, therefore the current behavior
has to work as best it can.
Therefore fixes such as this to address immediate issues
are probably needed. Yup.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Tue, 5 Feb 2008, Paul Jackson wrote:
> But that discussion touched on some other long standing deficiencies
> in the way that I had originally glued cpusets and memory policies
> together. The current mechanism doesn't handle changing cpusets very
> well, especially if the number of nodes in the cpuset increases.
>
That's because of the nodemask remaps that are done for the various
mempolicy cases when rebinding the policy. I agree we cannot change that
implementation now even though it is undocumented.
The more alarming result of these remaps is in the MPOL_BIND case, as
we've talked about before. The language in set_mempolicy(2):
The MPOL_BIND policy is a strict policy that restricts memory
allocation to the nodes specified in nodemask. There won't be
allocations on other nodes.
makes it pretty clear that allocations will not be done on other nodes not
provided in the set_mempolicy() nodemask if the task is not swapped out.
But the current implementation allows that if the task is either moved to
a different cpuset or its cpuset's mems change. For example, consider a
task that is allowed nodes 1-3 by its cpuset and asks for a MPOL_BIND
mempolicy of node 2. If that cpuset's mems change to 4-6, the mempolicy
is now effectively a bind on node 5.
> The next two steps I need to take are:
> 1) propose this patch, with careful explanation (it's easy to lose
> one's bearings in the mappings and remappings of node numberings)
> to a wider audience, such as linux-mm or linux-kernel, and
Thanks.
> 2) carefully test this, especially on each code path I touched in
> mm/mempolicy.c, where the changes were delicate, to ensure I
> didn't break any existing code.
>
> There were also some other, smaller patches proposed, by myself and
> others. I was preferring to address a wider set of the long standing
> issues in this area, but the others above mostly preferred the smaller
> patches. This needs to be discussed in a wider forum, and a concensus
> reached.
>
I think if these MPOL_* flags that you're proposing are made as generic as
possible for all possible mempolicies (current and future), it would be
the optimal change. It would prevent us from having to add new flags for
corner-cases in the future and would allow us to keep the flag set as
small as possible. My suggestion of MPOL_F_STATIC_NODEMASK goes a long
way to solve these issues both for MPOL_INTERLEAVE (in conjunction with
storing the set_mempolicy() intent) and the MPOL_BIND discrepency I
mentioned above.
David
On Tue, 5 Feb 2008, Paul Jackson wrote:
> Since any of those future patches only add optional modes
> with new flags, while preserving current behaviour if you
> don't use one of the new flags, therefore the current behavior
> has to work as best it can.
>
There's a subtlety to this issue that allows it to be fixed and easily
extended for two upcoming changes:
- Paul Jackson's mempolicy and cpuset interactions change that will
probably allow set_mempolicy() callers to specify with a MPOL_*
flag whether they are referring to "dynamic" or "static" nodemasks[*],
and
- node hotplug (both add and remove) that will change the state of a
node with an identical id.
Paul, with his patch, will need to preserve the "intent" of the mempolicy
as the nodemask that was passed by the user and attempt on all successive
rebinds to accomodate that intent as much as possible.
So at the time of rebind it is quite simple to intersect the set of system
nodes that have memory with the intent of the mempolicy to yield the
effected nodemask. This nodemask is saved in the mempolicy (pol->v.nodes
in this case for interleave) and only steps through the set of nodes that
can allow interleaved allocations.
When the available nodes changes, either by cpuset change or node hotplug,
the rebind is quite simple when the intent is preserved. So we're going
to need an additional nodemask_t added to struct mempolicy that saves this
intent and modify contextualize_policy() to allow it. This will basically
make any set_mempolicy() call succeed even if the application does not
have access to any of the mempolicy nodes because it is possible that they
will become accessible in the future. In that case the mempolicy is
effectively MPOL_DEFAULT until the desired nodes become available and it
is effected.
David
David wrote:
> The more alarming result of these remaps is in the MPOL_BIND case, as
> we've talked about before. The language in set_mempolicy(2):
You're diving into the middle of a rather involved discussion
we had on the other various patches proposed to extend the
interaction of mempolicy's with cpusets and hotplug.
I choose not to hijack this current thread with my rebuttal,
which you've seen before, of your points here.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Tue, 5 Feb 2008, Paul Jackson wrote:
> David wrote:
> > The more alarming result of these remaps is in the MPOL_BIND case, as
> > we've talked about before. The language in set_mempolicy(2):
>
> You're diving into the middle of a rather involved discussion
> we had on the other various patches proposed to extend the
> interaction of mempolicy's with cpusets and hotplug.
>
I've simply identified that MPOL_BIND mempolicy interactions with a task's
changing mems_allowed as a result of a cpuset move or mems change is also
an issue that can be addressed at the same time as the interleave problem.
And it can be done with the addition of a single MPOL_F_* flag.
> I choose not to hijack this current thread with my rebuttal,
> which you've seen before, of your points here.
>
The issues of mempolicies working over memoryless nodes and supporting
changing cpusets are very closely related and can be addressed in the same
way. It would be disappointing to see a lot of work done to fix the
memoryless node issue or the changing cpuset mems issue and then realize
both could have been fixed quite simply with a relatively small set of
changes.
David
David wrote:
> It would be disappointing to see a lot of work done to fix
The suggested patch of KOSAKI Motohiro didn't look like a lot of work to me.
I continue to prefer not to hijack this thread for that other discussion.
Just presenting your position and calling it "simple" is misleading.
The discussion so far has involved over a hundred messages over months,
and certainly your position, nor mine for that matter, obtained concensus.
How does the patch of KOSAKI Motohiro, earlier in this thread, look to you?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
Here's a patch that addresses the problem w/o requiring change to
numactl or libnuma. It DOES have side affects, discussed in the
description.
Tested with memoryless nodes and restricted cpusets using the numactl
installed with RHEL5.1.
Altho' nominally against 24-mm1, applies cleanly to 2.6.24. Should be
suitable for 'stable' if everyone agrees.
Lee
----------------------------------
[PATCH] 2.6.24-mm1 - mempolicy: silently restrict to allowed nodes
Kosaki-san noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes. This patch attempts to fix that
problem.
Some background:
numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask. set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned. A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory. So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.
NOTE: the same thing will occur when running in a cpuset
with restricted mem_allowed--for the same reason:
node mask contains dis-allowed nodes.
mbind(2), on the other hand, just masks off any nodes in the
nodemask that are not included in the caller's mems_allowed.
In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains
any memoryless nodes. This is somewhat redundant as mpol_new()
will remove memoryless nodes for interleave policy, as will
bind_zonelist()--called by mpol_new() for BIND policy.
Proposed fix:
1) modify contextualize_policy to just remove the non-allowed
nodes, as is currently done in-line for mbind(). This
guarantees that the resulting mask includes only nodes with
memory.
NOTE: this is a [benign, IMO] change in behavior for
set_mempolicy(). Dis-allowed nodes will be silently
ignored, rather than returning an error.
Another, perhaps less benign, change in behavior:
MPOL_PREFERRED policy that specifies only memoryless nodes
or nodes that are disallowed in the cpuset will be interpreted
as "local allocation" as the nodemask will be empty after
the masking in contextualize_policy(). With a bit of
additional hackery I can make this return EINVAL.
Comments?
2) modify mbind() to use contextualize_policy(), like set_mempolicy(),
instead of masking nodes in-line.
3) remove the now redundant check for memoryless nodes from
mpol_check_policy().
4) remove the masking of policy nodes for interleave policy from
mpol_new().
Signed-off-by: Lee Schermerhorn <[email protected]>
mm/mempolicy.c | 18 ++++++++----------
1 file changed, 8 insertions(+), 10 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2008-02-05 11:25:17.000000000 -0500
+++ Linux/mm/mempolicy.c 2008-02-05 16:03:11.000000000 -0500
@@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ return 0;
}
/* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
- nodes_and(policy->v.nodes, policy->v.nodes,
- node_states[N_HIGH_MEMORY]);
if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
@@ -426,9 +424,13 @@ static int contextualize_policy(int mode
if (!nodes)
return 0;
+ /*
+ * Restrict the nodes to the allowed nodes in the cpuset.
+ * This is guaranteed to be a subset of nodes with memory.
+ */
cpuset_update_task_memory_state();
- if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
- return -EINVAL;
+ nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
return mpol_check_policy(mode, nodes);
}
@@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
if (end == start)
return 0;
- if (mpol_check_policy(mode, nmask))
+ if (contextualize_policy(mode, nmask))
return -EINVAL;
new = mpol_new(mode, nmask);
@@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long
err = get_nodes(&nodes, nmask, maxnode);
if (err)
return err;
-#ifdef CONFIG_CPUSETS
- /* Restrict the nodes to the allowed nodes in the cpuset */
- nodes_and(nodes, nodes, current->mems_allowed);
-#endif
return do_mbind(start, len, mode, &nodes, flags);
}
On Tue, 2008-02-05 at 15:33 -0600, Paul Jackson wrote:
> David wrote:
> > It would be disappointing to see a lot of work done to fix
>
> The suggested patch of KOSAKI Motohiro didn't look like a lot of work to me.
>
> I continue to prefer not to hijack this thread for that other discussion.
> Just presenting your position and calling it "simple" is misleading.
> The discussion so far has involved over a hundred messages over months,
> and certainly your position, nor mine for that matter, obtained concensus.
>
> How does the patch of KOSAKI Motohiro, earlier in this thread, look to you?
>
Paul,
It wasn't clear to me whether Kosaki-san's patch required a modified
numactl/libnuma or not. I think so, because that patch doesn't change
the error return in contextualize_policy() and in mpol_check_policy().
My modified numactl/libnuma avoids this by only passing in allowed mems
fetch via get_mempolicy() with the new MEMS_ALLOWED flags.
The patch I just posted doesn't depend on the numactl changes and seems
quite minimal to me. I think it cleans up the differences between
set_mempolicy() and mbind(), as well. However, some may take exception
to the change in behavior--silently ignoring dis-allowed nodes in
set_mempolicy().
Also, your cpuset/mempolicy work will probably need to undo the
unconditional masking in contextualize_policy() and/or save the original
node mask somewhere...
Lee
On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
> mbind(2), on the other hand, just masks off any nodes in the
> nodemask that are not included in the caller's mems_allowed.
Ok so we temporarily adopt these semantics for set_mempolicy.
> 1) modify contextualize_policy to just remove the non-allowed
> nodes, as is currently done in-line for mbind(). This
> guarantees that the resulting mask includes only nodes with
> memory.
Right make ssense. we already contextualize for cpusets.
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2008-02-05 11:25:17.000000000 -0500
> +++ Linux/mm/mempolicy.c 2008-02-05 16:03:11.000000000 -0500
> @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
> return -EINVAL;
> break;
> }
> - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + return 0;
> }
Hmmm... That is a pretty drastic change.
> @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
> switch (mode) {
> case MPOL_INTERLEAVE:
> policy->v.nodes = *nodes;
> - nodes_and(policy->v.nodes, policy->v.nodes,
> - node_states[N_HIGH_MEMORY]);
> if (nodes_weight(policy->v.nodes) == 0) {
> kmem_cache_free(policy_cache, policy);
> return ERR_PTR(-EINVAL);
Do we really need to remove these lines if we change set_mempolicy?
> @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
> if (!nodes)
> return 0;
>
> + /*
> + * Restrict the nodes to the allowed nodes in the cpuset.
> + * This is guaranteed to be a subset of nodes with memory.
> + */
> cpuset_update_task_memory_state();
> - if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> - return -EINVAL;
> + nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +
> return mpol_check_policy(mode, nodes);
> }
>
Ditto?
> @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
> if (end == start)
> return 0;
>
> - if (mpol_check_policy(mode, nmask))
> + if (contextualize_policy(mode, nmask))
> return -EINVAL;
>
> new = mpol_new(mode, nmask);
> @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long
> err = get_nodes(&nodes, nmask, maxnode);
> if (err)
> return err;
> -#ifdef CONFIG_CPUSETS
> - /* Restrict the nodes to the allowed nodes in the cpuset */
> - nodes_and(nodes, nodes, current->mems_allowed);
> -#endif
Would just removing #ifdef CONFIG_CPUSETS work? mems_allowed falls back to
node_possible_map.... Shouldnt that be node_online_map?
Lee wrote:
> [PATCH] 2.6.24-mm1 - mempolicy: silently restrict to allowed nodes
At first glance, I like it. Thanks.
The changes in the exact behaviour of set_mempolicy (and mbind?) seem
to me to be changes for the better -- subtle improvements in the
consistency of handling corner cases.
However I don't have code that depends all that elaborately on the
fine details of these system calls, so I'm easy. If others know of
an existing or likely usage pattern that this patch would break,
that would be interesting input.
Thanks, Lee.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
> The patch I just posted doesn't depend on the numactl changes and seems
> quite minimal to me. I think it cleans up the differences between
> set_mempolicy() and mbind(), as well. However, some may take exception
> to the change in behavior--silently ignoring dis-allowed nodes in
> set_mempolicy().
>
If the intent of the set_mempolicy() call is going to be preserved in the
struct mempolicy with Paul's change, then we're going to allow disallowed
nodes anyway. So the only nodemask errors that we should return are ones
that are empty; nodemasks that include offlined nodes should be allowed to
support node hotplug. Likewise, memoryless nodes should still be saved as
the intent of the syscall.
The change to save the intent or silently ignore disallowed nodes would
also require applications to issue a successive get_mempolicy() call to
know what their current mempolicy is, since it will be able to change with
a cpusets change or node hotplug event. There is no longer this assurance
that if set_mempolicy() returns without an error that the memory policy is
effected. But in the presence of subsystems such as cpusets that allow
those mempolicies to change from beneath the application, there is no way
around that: the nodemask that the mempolicy acts on can dynamically
change at any time.
So I don't see any problem with silently ignoring disallowed nodes and
encourage it so that the kernel accomodates the intent of the mempolicy in
the future if and when it can be effected.
David
Lee wrote:
> Also, your cpuset/mempolicy work will probably need to undo the
> unconditional masking in contextualize_policy() and/or save the original
> node mask somewhere...
Yeah, something like that ... just a small matter of code.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.940.382.4214
On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2008-02-05 11:25:17.000000000 -0500
> +++ Linux/mm/mempolicy.c 2008-02-05 16:03:11.000000000 -0500
> @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
> return -EINVAL;
> break;
> }
> - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + return 0;
> }
>
> /* Generate a custom zonelist for the BIND policy. */
This change will be necessary when the nodemask passed from the syscall is
saved in the struct mempolicy as the intent of the application as well.
> @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
> switch (mode) {
> case MPOL_INTERLEAVE:
> policy->v.nodes = *nodes;
> - nodes_and(policy->v.nodes, policy->v.nodes,
> - node_states[N_HIGH_MEMORY]);
> if (nodes_weight(policy->v.nodes) == 0) {
> kmem_cache_free(policy_cache, policy);
> return ERR_PTR(-EINVAL);
> @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
> if (!nodes)
> return 0;
>
> + /*
> + * Restrict the nodes to the allowed nodes in the cpuset.
> + * This is guaranteed to be a subset of nodes with memory.
> + */
> cpuset_update_task_memory_state();
> - if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> - return -EINVAL;
> + nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> +
> return mpol_check_policy(mode, nodes);
> }
>
I would defer the intersection until later because contextualize_policy()
is called before mpol_new() so we have no struct mempolicy to save the
intent in. It doesn't matter for the sake of this change, I know, but you
could move this intersection to mpol_new() and give us an opportunity to
store the user's nodemask in the mempolicy with a one-line change and get
the same desired result.
You can now remove cpuset_nodes_subset_current_mems_allowed() from
linux/cpuset.h.
> @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
> if (end == start)
> return 0;
>
> - if (mpol_check_policy(mode, nmask))
> + if (contextualize_policy(mode, nmask))
> return -EINVAL;
>
> new = mpol_new(mode, nmask);
> @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long
> err = get_nodes(&nodes, nmask, maxnode);
> if (err)
> return err;
> -#ifdef CONFIG_CPUSETS
> - /* Restrict the nodes to the allowed nodes in the cpuset */
> - nodes_and(nodes, nodes, current->mems_allowed);
> -#endif
> return do_mbind(start, len, mode, &nodes, flags);
> }
>
Looks good, thanks for doing this.
Hi Lee-san
> Here's a patch that addresses the problem w/o requiring change to
> numactl or libnuma. It DOES have side affects, discussed in the
> description.
Thank you!
but unfortunately, My machine is broken phisically today ;-)
I will test it tommorow or later.
On Tue, 2008-02-05 at 14:12 -0800, Christoph Lameter wrote:
> On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
>
> > mbind(2), on the other hand, just masks off any nodes in the
> > nodemask that are not included in the caller's mems_allowed.
>
> Ok so we temporarily adopt these semantics for set_mempolicy.
>
> > 1) modify contextualize_policy to just remove the non-allowed
> > nodes, as is currently done in-line for mbind(). This
> > guarantees that the resulting mask includes only nodes with
> > memory.
>
> Right make ssense. we already contextualize for cpusets.
Only for mbind(). set_mempolicy(), via contextualize_policy() was just
returning EINVAL for invalid nodes in the mask. I don't know if it
always worked like this, or if we did this in the memoryless nodes
series...
>
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2008-02-05 11:25:17.000000000 -0500
> > +++ Linux/mm/mempolicy.c 2008-02-05 16:03:11.000000000 -0500
> > @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
> > return -EINVAL;
> > break;
> > }
> > - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> > + return 0;
> > }
>
> Hmmm... That is a pretty drastic change.
the nodes_subset() would always return true, once we mask it with
cpuset_current_mems_allowed(), right? mems_allowed can now only contain
nodes with memory and if cpusets are not configured,
cpuset_current_mems_allowed() just returns node_states[N_HIGH_MEMORY].
So, I think this is a no-op.
>
> > @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
> > switch (mode) {
> > case MPOL_INTERLEAVE:
> > policy->v.nodes = *nodes;
> > - nodes_and(policy->v.nodes, policy->v.nodes,
> > - node_states[N_HIGH_MEMORY]);
> > if (nodes_weight(policy->v.nodes) == 0) {
> > kmem_cache_free(policy_cache, policy);
> > return ERR_PTR(-EINVAL);
>
> Do we really need to remove these lines if we change set_mempolicy?
Again, with the change to contextualize_policy(), the nodemask is
guaranteed to only contain nodes with memory, so this was redundant.
>
> > @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
> > if (!nodes)
> > return 0;
> >
> > + /*
> > + * Restrict the nodes to the allowed nodes in the cpuset.
> > + * This is guaranteed to be a subset of nodes with memory.
> > + */
> > cpuset_update_task_memory_state();
> > - if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> > - return -EINVAL;
> > + nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> > +
> > return mpol_check_policy(mode, nodes);
> > }
> >
>
> Ditto?
This is the main change in the patch: masking off the invalid nodes
[like sys_mbind() did inline] rather than complaining about them.
However, after I finish testing, I will post an update to this patch
which restores some of the error checks that this change lost.
>
> > @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
> > if (end == start)
> > return 0;
> >
> > - if (mpol_check_policy(mode, nmask))
> > + if (contextualize_policy(mode, nmask))
> > return -EINVAL;
> >
> > new = mpol_new(mode, nmask);
> > @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long
> > err = get_nodes(&nodes, nmask, maxnode);
> > if (err)
> > return err;
> > -#ifdef CONFIG_CPUSETS
> > - /* Restrict the nodes to the allowed nodes in the cpuset */
> > - nodes_and(nodes, nodes, current->mems_allowed);
> > -#endif
>
> Would just removing #ifdef CONFIG_CPUSETS work? mems_allowed falls back to
> node_possible_map.... Shouldnt that be node_online_map?
I removed this because I changed do_mbind() to call the revised
contextualize_policy() that does exactly this masking. I didn't see any
reason to leave the duplicate code there.
I think that mems_allowed now falls back to nodes with memory. Or it
should in the current code. When Paul adds his new magic, that might
change.
Lee
>
On Tue, 2008-02-05 at 18:17 -0800, David Rientjes wrote:
> On Tue, 5 Feb 2008, Lee Schermerhorn wrote:
>
> > Index: Linux/mm/mempolicy.c
> > ===================================================================
> > --- Linux.orig/mm/mempolicy.c 2008-02-05 11:25:17.000000000 -0500
> > +++ Linux/mm/mempolicy.c 2008-02-05 16:03:11.000000000 -0500
> > @@ -131,7 +131,7 @@ static int mpol_check_policy(int mode, n
> > return -EINVAL;
> > break;
> > }
> > - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> > + return 0;
> > }
> >
> > /* Generate a custom zonelist for the BIND policy. */
>
> This change will be necessary when the nodemask passed from the syscall is
> saved in the struct mempolicy as the intent of the application as well.
>
> > @@ -188,8 +188,6 @@ static struct mempolicy *mpol_new(int mo
> > switch (mode) {
> > case MPOL_INTERLEAVE:
> > policy->v.nodes = *nodes;
> > - nodes_and(policy->v.nodes, policy->v.nodes,
> > - node_states[N_HIGH_MEMORY]);
> > if (nodes_weight(policy->v.nodes) == 0) {
> > kmem_cache_free(policy_cache, policy);
> > return ERR_PTR(-EINVAL);
> > @@ -426,9 +424,13 @@ static int contextualize_policy(int mode
> > if (!nodes)
> > return 0;
> >
> > + /*
> > + * Restrict the nodes to the allowed nodes in the cpuset.
> > + * This is guaranteed to be a subset of nodes with memory.
> > + */
> > cpuset_update_task_memory_state();
> > - if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> > - return -EINVAL;
> > + nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> > +
> > return mpol_check_policy(mode, nodes);
> > }
> >
>
> I would defer the intersection until later because contextualize_policy()
> is called before mpol_new() so we have no struct mempolicy to save the
> intent in. It doesn't matter for the sake of this change, I know, but you
> could move this intersection to mpol_new() and give us an opportunity to
> store the user's nodemask in the mempolicy with a one-line change and get
> the same desired result.
Hi, David:
I wanted to avoid a major restructuring of the code for this patch.
However, now that both do_mbind() and do_set_mempolicy() both call
contextualize_policy() [which calls mpol_check_policy()] immediately
before calling mpol_new(), I agree we can push this "contextualization"
down there. I would like to defer this to another patch--perhaps as
part of Paul's rework of mempolicy and cpusets.
Note that there is another caller of mpol_new() --
mpol_shared_policy_init(). We'll need to decide whether that call needs
to be contextualized, as it constructs a policy from the tmpfs or
hugetlbfs superblock, as specified on the mount command [or kernel
command line?]. As this is a privileged operation, one could argue that
it should be exempt from cpuset constraints.
>
> You can now remove cpuset_nodes_subset_current_mems_allowed() from
> linux/cpuset.h.
>
> > @@ -797,7 +799,7 @@ static long do_mbind(unsigned long start
> > if (end == start)
> > return 0;
> >
> > - if (mpol_check_policy(mode, nmask))
> > + if (contextualize_policy(mode, nmask))
> > return -EINVAL;
> >
> > new = mpol_new(mode, nmask);
> > @@ -915,10 +917,6 @@ asmlinkage long sys_mbind(unsigned long
> > err = get_nodes(&nodes, nmask, maxnode);
> > if (err)
> > return err;
> > -#ifdef CONFIG_CPUSETS
> > - /* Restrict the nodes to the allowed nodes in the cpuset */
> > - nodes_and(nodes, nodes, current->mems_allowed);
> > -#endif
> > return do_mbind(start, len, mode, &nodes, flags);
> > }
> >
>
> Looks good, thanks for doing this.
As I mentioned to Christoph, I'll post a new version that I think
handles the error conditions better.
Later,
Lee
I've updated the patch to restore some error checking that my previous
version and the memoryless-nodes series lost.
Again, tested with "numactl --interleave=all" and memtoy on ia64 using
mem= command line argument to simulate memoryless node.
Lee
----------------------------------
[PATCH] 2.6.24-mm1 - mempolicy: silently restrict to allowed nodes
V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
cpuset_nodes_subset_current_mems_allowed() from cpuset.h
Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes. This patch attempts to fix that problem.
Some background:
numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask. set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned. A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory. So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.
NOTE: the same thing will occur when running in a cpuset
with restricted mem_allowed--for the same reason:
node mask contains dis-allowed nodes.
mbind(2), on the other hand, just masks off any nodes in the
nodemask that are not included in the caller's mems_allowed.
In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains
any memoryless nodes. This is somewhat redundant as mpol_new()
will remove memoryless nodes for interleave policy, as will
bind_zonelist()--called by mpol_new() for BIND policy.
Proposed fix:
1) modify contextualize_policy to:
a) remember whether the incoming node mask is empty.
b) if not, restrict the nodemask to allowed nodes, as is
currently done in-line for mbind(). This guarantees
that the resulting mask includes only nodes with memory.
NOTE: this is a [benign, IMO] change in behavior for
set_mempolicy(). Dis-allowed nodes will be
silently ignored, rather than returning an error.
c) pass the "was_empty" state of the incoming mask to
mpol_check_policy() for vetting the user specified
nodemask for MPOL_DEFAULT and MPOL_PREFERRED
2) In mpol_check_policy():
a) MPOL_DEFAULT: require that in coming mask "was_empty"
a) add a case for MPOL_PREFERRED: if in coming was not empty
and resulting mask IS empty, user specified invalid nodes.
Return EINVAL.
b) remove the now redundant check for memoryless nodes
3) modify mbind() to use contextualize_policy(), like set_mempolicy(),
instead of masking nodes in-line. This preserves the current
behavior of mbind() and restores some error checking that the
memoryless-nodes series lost when restricting node masks to
allowed_nodes [== subset of nodes with memory].
4) remove the now redundant masking of policy nodes for interleave
policy from mpol_new().
Signed-off-by: Lee Schermerhorn <[email protected]>
include/linux/cpuset.h | 3 --
mm/mempolicy.c | 50 ++++++++++++++++++++++++++++++++-----------------
2 files changed, 33 insertions(+), 20 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2008-02-05 16:51:19.000000000 -0500
+++ Linux/mm/mempolicy.c 2008-02-06 10:58:57.000000000 -0500
@@ -114,24 +114,37 @@ static void mpol_rebind_policy(struct me
const nodemask_t *newmask);
/* Do sanity checking on a policy */
-static int mpol_check_policy(int mode, nodemask_t *nodes)
+static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)
{
- int empty = nodes_empty(*nodes);
+ int is_empty = nodes_empty(*nodes);
switch (mode) {
case MPOL_DEFAULT:
- if (!empty)
+ /*
+ * require caller to specify an empty nodemask
+ * before "contextualization"
+ */
+ if (!was_empty)
return -EINVAL;
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- /* Preferred will only use the first bit, but allow
- more for now. */
- if (empty)
+ /*
+ * require at least 1 valid node after "contextualization"
+ */
+ if (is_empty)
+ return -EINVAL;
+ break;
+ case MPOL_PREFERRED:
+ /*
+ * Did caller specify invalid nodes?
+ * Don't silently accept this as "local allocation".
+ */
+ if (!was_empty && is_empty)
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ return 0;
}
/* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +201,6 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
- nodes_and(policy->v.nodes, policy->v.nodes,
- node_states[N_HIGH_MEMORY]);
if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
@@ -423,13 +434,22 @@ static int mbind_range(struct vm_area_st
static int contextualize_policy(int mode, nodemask_t *nodes)
{
+ int was_empty;
+
if (!nodes)
return 0;
+ /*
+ * Remember whether in coming nodemask was empty, If not,
+ * restrict the nodes to the allowed nodes in the cpuset.
+ * This is guaranteed to be a subset of nodes with memory.
+ */
cpuset_update_task_memory_state();
- if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
- return -EINVAL;
- return mpol_check_policy(mode, nodes);
+ was_empty = nodes_empty(*nodes);
+ if (!was_empty)
+ nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
+ return mpol_check_policy(mode, nodes, was_empty);
}
@@ -797,7 +817,7 @@ static long do_mbind(unsigned long start
if (end == start)
return 0;
- if (mpol_check_policy(mode, nmask))
+ if (contextualize_policy(mode, nmask))
return -EINVAL;
new = mpol_new(mode, nmask);
@@ -915,10 +935,6 @@ asmlinkage long sys_mbind(unsigned long
err = get_nodes(&nodes, nmask, maxnode);
if (err)
return err;
-#ifdef CONFIG_CPUSETS
- /* Restrict the nodes to the allowed nodes in the cpuset */
- nodes_and(nodes, nodes, current->mems_allowed);
-#endif
return do_mbind(start, len, mode, &nodes, flags);
}
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h 2008-02-05 16:05:15.000000000 -0500
+++ Linux/include/linux/cpuset.h 2008-02-06 10:47:48.000000000 -0500
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
- nodes_subset((nodes), current->mems_allowed)
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
{
Hi Lee-san
Unfortunately, 2.6.24-mm1 can't boot on fujitsu machine.
(hmm, origin.patch cause regression to pci initialization ;-)
instead, I tested 2.6.24 + your patch.
it seem work good :)
Tested-by: KOSAKI Motohiro <[email protected]>
and, I have a bit comment.
> /* Do sanity checking on a policy */
> -static int mpol_check_policy(int mode, nodemask_t *nodes)
> +static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)
was_empty argument is a bit ugly.
Could we unify mpol_check_policy and contextualize_policy?
mpol_check_policy only called from contextualize_policy.
> - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + return 0;
Could we N_POSSIBLE check?
I attached the patch for my idea explain.
on my test environment, your patch and mine works good both.
- kosaki
---
mm/mempolicy.c | 47 +++++++++++++++++++++--------------------------
1 file changed, 21 insertions(+), 26 deletions(-)
Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c 2008-02-07 17:19:09.000000000 +0900
+++ b/mm/mempolicy.c 2008-02-07 17:24:28.000000000 +0900
@@ -114,9 +114,25 @@ static void mpol_rebind_policy(struct me
const nodemask_t *newmask);
/* Do sanity checking on a policy */
-static int mpol_check_policy(int mode, nodemask_t *nodes, int was_empty)
+static int mpol_check_policy(int mode, nodemask_t *nodes)
{
- int is_empty = nodes_empty(*nodes);
+ int was_empty;
+ int is_empty;
+
+ if (!nodes)
+ return 0;
+
+ /*
+ * Remember whether in coming nodemask was empty, If not,
+ * restrict the nodes to the allowed nodes in the cpuset.
+ * This is guaranteed to be a subset of nodes with memory.
+ */
+ cpuset_update_task_memory_state();
+ was_empty = nodes_empty(*nodes);
+ if (!was_empty)
+ nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+
+ is_empty = nodes_empty(*nodes);
switch (mode) {
case MPOL_DEFAULT:
@@ -144,7 +160,7 @@ static int mpol_check_policy(int mode, n
return -EINVAL;
break;
}
- return 0;
+ return nodes_subset(*nodes, node_states[N_POSSIBLE]) ? 0 : -EINVAL;
}
/* Generate a custom zonelist for the BIND policy. */
@@ -432,27 +448,6 @@ static int mbind_range(struct vm_area_st
return err;
}
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
- int was_empty;
-
- if (!nodes)
- return 0;
-
- /*
- * Remember whether in coming nodemask was empty, If not,
- * restrict the nodes to the allowed nodes in the cpuset.
- * This is guaranteed to be a subset of nodes with memory.
- */
- cpuset_update_task_memory_state();
- was_empty = nodes_empty(*nodes);
- if (!was_empty)
- nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
-
- return mpol_check_policy(mode, nodes, was_empty);
-}
-
-
/*
* Update task->flags PF_MEMPOLICY bit: set iff non-default
* mempolicy. Allows more rapid checking of this (combined perhaps
@@ -488,7 +483,7 @@ static long do_set_mempolicy(int mode, n
{
struct mempolicy *new;
- if (contextualize_policy(mode, nodes))
+ if (mpol_check_policy(mode, nodes))
return -EINVAL;
new = mpol_new(mode, nodes);
if (IS_ERR(new))
@@ -817,7 +812,7 @@ static long do_mbind(unsigned long start
if (end == start)
return 0;
- if (contextualize_policy(mode, nmask))
+ if (mpol_check_policy(mode, nmask))
return -EINVAL;
new = mpol_new(mode, nmask);
Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
works on memoryless node."
[Aside: I noticed there were two slightly different distributions for
this topic. I've unified the distribution lists w/o dropping anyone, I
think. Apologies if you'd rather have been dropped...]
Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
folding contextualize_policy() into mpol_check_policy() [because my
"was_empty" argument "was ugly" ;-)]. It does seem to clean up the
code.
I'm still deferring David Rientjes' suggestion to fold
mpol_check_policy() into mpol_new(). We need to sort out whether
mempolicies specified for tmpfs and hugetlbfs mounts always need the
same "contextualization" as user/application installed policies. I
don't want to hold up this bug fix for that discussion. This is
something Paul J will need to address with his cpuset/mempolicy rework,
so we can sort it out in that context.
Again, tested with "numactl --interleave=all" and memtoy on ia64 using
mem= command line argument to simulate memoryless node.
Lee
============================
[PATCH] 2.6.24-mm1 - mempolicy: silently restrict nodemask to allowed nodes
V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
of policy nodemask into mpol_check_policy(). Looks a little
cleaner.
V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
cpuset_nodes_subset_current_mems_allowed() from cpuset.h
Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes. This patch attempts to fix that problem.
Some background:
numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask. set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned. A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory. So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.
NOTE: the same thing will occur when running in a cpuset
with restricted mem_allowed--for the same reason:
node mask contains dis-allowed nodes.
mbind(2), on the other hand, just masks off any nodes in the
nodemask that are not included in the caller's mems_allowed.
In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains
any memoryless nodes. This is somewhat redundant as mpol_new()
will remove memoryless nodes for interleave policy, as will
bind_zonelist()--called by mpol_new() for BIND policy.
Proposed fix:
1) modify contextualize_policy logic to:
a) remember whether the incoming node mask is empty.
b) if not, restrict the nodemask to allowed nodes, as is
currently done in-line for mbind(). This guarantees
that the resulting mask includes only nodes with memory.
NOTE: this is a [benign, IMO] change in behavior for
set_mempolicy(). Dis-allowed nodes will be
silently ignored, rather than returning an error.
c) fold this code into mpol_check_policy(), replace 2 calls to
contextualize_policy() to call mpol_check_policy() directly
and remove contextualize_policy().
2) In existing mpol_check_policy() logic, after "contextualization":
a) MPOL_DEFAULT: require that in coming mask "was_empty"
b) MPOL_{BIND|INTERLEAVE}: require that contextualized nodemask
contains at least one node.
c) add a case for MPOL_PREFERRED: if in coming was not empty
and resulting mask IS empty, user specified invalid nodes.
Return EINVAL.
c) remove the now redundant check for memoryless nodes
3) remove the now redundant masking of policy nodes for interleave
policy from mpol_new().
4) Now that mpol_check_policy() contextualizes the nodemask, remove
the in-line nodes_and() from sys_mbind(). I believe that this
restores mbind() to the behavior before the memoryless-nodes
patch series. E.g., we'll no longer treat an invalid nodemask
with MPOL_PREFERRED as local allocation.
Signed-off-by: Lee Schermerhorn <[email protected]>
include/linux/cpuset.h | 3 --
mm/mempolicy.c | 61 ++++++++++++++++++++++++++++---------------------
2 files changed, 36 insertions(+), 28 deletions(-)
Index: Linux/mm/mempolicy.c
===================================================================
--- Linux.orig/mm/mempolicy.c 2008-02-08 11:11:34.000000000 -0500
+++ Linux/mm/mempolicy.c 2008-02-08 13:40:40.000000000 -0500
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
/* Do sanity checking on a policy */
static int mpol_check_policy(int mode, nodemask_t *nodes)
{
- int empty = nodes_empty(*nodes);
+ int was_empty, is_empty;
+
+ if (!nodes)
+ return 0;
+
+ /*
+ * "Contextualize" the in-coming nodemast for cpusets:
+ * Remember whether in-coming nodemask was empty, If not,
+ * restrict the nodes to the allowed nodes in the cpuset.
+ * This is guaranteed to be a subset of nodes with memory.
+ */
+ cpuset_update_task_memory_state();
+ is_empty = was_empty = nodes_empty(*nodes);
+ if (!was_empty) {
+ nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+ is_empty = nodes_empty(*nodes); /* after "contextualization" */
+ }
switch (mode) {
case MPOL_DEFAULT:
- if (!empty)
+ /*
+ * require caller to specify an empty nodemask
+ * before "contextualization"
+ */
+ if (!was_empty)
return -EINVAL;
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- /* Preferred will only use the first bit, but allow
- more for now. */
- if (empty)
+ /*
+ * require at least 1 valid node after "contextualization"
+ */
+ if (is_empty)
+ return -EINVAL;
+ break;
+ case MPOL_PREFERRED:
+ /*
+ * Did caller specify invalid nodes?
+ * Don't silently accept this as "local allocation".
+ */
+ if (!was_empty && is_empty)
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ return 0;
}
/* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
- nodes_and(policy->v.nodes, policy->v.nodes,
- node_states[N_HIGH_MEMORY]);
if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
return err;
}
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
- if (!nodes)
- return 0;
-
- cpuset_update_task_memory_state();
- if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
- return -EINVAL;
- return mpol_check_policy(mode, nodes);
-}
-
-
/*
* Update task->flags PF_MEMPOLICY bit: set iff non-default
* mempolicy. Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
{
struct mempolicy *new;
- if (contextualize_policy(mode, nodes))
+ if (mpol_check_policy(mode, nodes))
return -EINVAL;
new = mpol_new(mode, nodes);
if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long
err = get_nodes(&nodes, nmask, maxnode);
if (err)
return err;
-#ifdef CONFIG_CPUSETS
- /* Restrict the nodes to the allowed nodes in the cpuset */
- nodes_and(nodes, nodes, current->mems_allowed);
-#endif
return do_mbind(start, len, mode, &nodes, flags);
}
Index: Linux/include/linux/cpuset.h
===================================================================
--- Linux.orig/include/linux/cpuset.h 2008-02-08 11:11:34.000000000 -0500
+++ Linux/include/linux/cpuset.h 2008-02-08 11:12:43.000000000 -0500
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
- nodes_subset((nodes), current->mems_allowed)
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
{
Hi Lee-san
looks good for me.
I'll test about the head of week and report it by another mail.
Thanks!
> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
>
> [Aside: I noticed there were two slightly different distributions for
> this topic. I've unified the distribution lists w/o dropping anyone, I
> think. Apologies if you'd rather have been dropped...]
>
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)]. It does seem to clean up the
> code.
CC'd Greg KH <[email protected]>
I tested this patch on fujitsu memoryless node.
(2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
it seems works good.
Tested-by: KOSAKI Motohiro <[email protected]>
Greg, I hope this patch merge to 2.6.24.x stable tree because
this patch is regression fixed patch.
Please tell me what do i doing for it.
[intentional full quote]
> Was "Re: [2.6.24 regression][BUGFIX] numactl --interleave=all doesn't
> works on memoryless node."
>
> [Aside: I noticed there were two slightly different distributions for
> this topic. I've unified the distribution lists w/o dropping anyone, I
> think. Apologies if you'd rather have been dropped...]
>
> Here's V3 of the patch, accomodating Kosaki Motohiro's suggestion for
> folding contextualize_policy() into mpol_check_policy() [because my
> "was_empty" argument "was ugly" ;-)]. It does seem to clean up the
> code.
>
> I'm still deferring David Rientjes' suggestion to fold
> mpol_check_policy() into mpol_new(). We need to sort out whether
> mempolicies specified for tmpfs and hugetlbfs mounts always need the
> same "contextualization" as user/application installed policies. I
> don't want to hold up this bug fix for that discussion. This is
> something Paul J will need to address with his cpuset/mempolicy rework,
> so we can sort it out in that context.
>
> Again, tested with "numactl --interleave=all" and memtoy on ia64 using
> mem= command line argument to simulate memoryless node.
>
>
> Lee
>
> ============================
> [PATCH] 2.6.24-mm1 - mempolicy: silently restrict nodemask to allowed nodes
>
> V2 -> V3:
> + As suggested by Kosaki Motohito, fold the "contextualization"
> of policy nodemask into mpol_check_policy(). Looks a little
> cleaner.
>
> V1 -> V2:
> + Communicate whether or not incoming node mask was empty to
> mpol_check_policy() for better error checking.
> + As suggested by David Rientjes, remove the now unused
> cpuset_nodes_subset_current_mems_allowed() from cpuset.h
>
> Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
> presence of memoryless nodes. This patch attempts to fix that problem.
>
> Some background:
>
> numactl --interleave=all calls set_mempolicy(2) with a fully
> populated [out to MAXNUMNODES] nodemask. set_mempolicy()
> [in do_set_mempolicy()] calls contextualize_policy() which
> requires that the nodemask be a subset of the current task's
> mems_allowed; else EINVAL will be returned. A task's
> mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
> i.e., nodes with memory. So, a fully populated nodemask will
> be declared invalid if it includes memoryless nodes.
>
> NOTE: the same thing will occur when running in a cpuset
> with restricted mem_allowed--for the same reason:
> node mask contains dis-allowed nodes.
>
> mbind(2), on the other hand, just masks off any nodes in the
> nodemask that are not included in the caller's mems_allowed.
>
> In each case [mbind() and set_mempolicy()], mpol_check_policy()
> will complain [again, resulting in EINVAL] if the nodemask contains
> any memoryless nodes. This is somewhat redundant as mpol_new()
> will remove memoryless nodes for interleave policy, as will
> bind_zonelist()--called by mpol_new() for BIND policy.
>
> Proposed fix:
>
> 1) modify contextualize_policy logic to:
> a) remember whether the incoming node mask is empty.
> b) if not, restrict the nodemask to allowed nodes, as is
> currently done in-line for mbind(). This guarantees
> that the resulting mask includes only nodes with memory.
>
> NOTE: this is a [benign, IMO] change in behavior for
> set_mempolicy(). Dis-allowed nodes will be
> silently ignored, rather than returning an error.
>
> c) fold this code into mpol_check_policy(), replace 2 calls to
> contextualize_policy() to call mpol_check_policy() directly
> and remove contextualize_policy().
>
> 2) In existing mpol_check_policy() logic, after "contextualization":
> a) MPOL_DEFAULT: require that in coming mask "was_empty"
> b) MPOL_{BIND|INTERLEAVE}: require that contextualized nodemask
> contains at least one node.
> c) add a case for MPOL_PREFERRED: if in coming was not empty
> and resulting mask IS empty, user specified invalid nodes.
> Return EINVAL.
> c) remove the now redundant check for memoryless nodes
>
> 3) remove the now redundant masking of policy nodes for interleave
> policy from mpol_new().
>
> 4) Now that mpol_check_policy() contextualizes the nodemask, remove
> the in-line nodes_and() from sys_mbind(). I believe that this
> restores mbind() to the behavior before the memoryless-nodes
> patch series. E.g., we'll no longer treat an invalid nodemask
> with MPOL_PREFERRED as local allocation.
>
> Signed-off-by: Lee Schermerhorn <[email protected]>
>
> include/linux/cpuset.h | 3 --
> mm/mempolicy.c | 61 ++++++++++++++++++++++++++++---------------------
> 2 files changed, 36 insertions(+), 28 deletions(-)
>
> Index: Linux/mm/mempolicy.c
> ===================================================================
> --- Linux.orig/mm/mempolicy.c 2008-02-08 11:11:34.000000000 -0500
> +++ Linux/mm/mempolicy.c 2008-02-08 13:40:40.000000000 -0500
> @@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
> /* Do sanity checking on a policy */
> static int mpol_check_policy(int mode, nodemask_t *nodes)
> {
> - int empty = nodes_empty(*nodes);
> + int was_empty, is_empty;
> +
> + if (!nodes)
> + return 0;
> +
> + /*
> + * "Contextualize" the in-coming nodemast for cpusets:
> + * Remember whether in-coming nodemask was empty, If not,
> + * restrict the nodes to the allowed nodes in the cpuset.
> + * This is guaranteed to be a subset of nodes with memory.
> + */
> + cpuset_update_task_memory_state();
> + is_empty = was_empty = nodes_empty(*nodes);
> + if (!was_empty) {
> + nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
> + is_empty = nodes_empty(*nodes); /* after "contextualization" */
> + }
>
> switch (mode) {
> case MPOL_DEFAULT:
> - if (!empty)
> + /*
> + * require caller to specify an empty nodemask
> + * before "contextualization"
> + */
> + if (!was_empty)
> return -EINVAL;
> break;
> case MPOL_BIND:
> case MPOL_INTERLEAVE:
> - /* Preferred will only use the first bit, but allow
> - more for now. */
> - if (empty)
> + /*
> + * require at least 1 valid node after "contextualization"
> + */
> + if (is_empty)
> + return -EINVAL;
> + break;
> + case MPOL_PREFERRED:
> + /*
> + * Did caller specify invalid nodes?
> + * Don't silently accept this as "local allocation".
> + */
> + if (!was_empty && is_empty)
> return -EINVAL;
> break;
> }
> - return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
> + return 0;
> }
>
> /* Generate a custom zonelist for the BIND policy. */
> @@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
> switch (mode) {
> case MPOL_INTERLEAVE:
> policy->v.nodes = *nodes;
> - nodes_and(policy->v.nodes, policy->v.nodes,
> - node_states[N_HIGH_MEMORY]);
> if (nodes_weight(policy->v.nodes) == 0) {
> kmem_cache_free(policy_cache, policy);
> return ERR_PTR(-EINVAL);
> @@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
> return err;
> }
>
> -static int contextualize_policy(int mode, nodemask_t *nodes)
> -{
> - if (!nodes)
> - return 0;
> -
> - cpuset_update_task_memory_state();
> - if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
> - return -EINVAL;
> - return mpol_check_policy(mode, nodes);
> -}
> -
> -
> /*
> * Update task->flags PF_MEMPOLICY bit: set iff non-default
> * mempolicy. Allows more rapid checking of this (combined perhaps
> @@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
> {
> struct mempolicy *new;
>
> - if (contextualize_policy(mode, nodes))
> + if (mpol_check_policy(mode, nodes))
> return -EINVAL;
> new = mpol_new(mode, nodes);
> if (IS_ERR(new))
> @@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long
> err = get_nodes(&nodes, nmask, maxnode);
> if (err)
> return err;
> -#ifdef CONFIG_CPUSETS
> - /* Restrict the nodes to the allowed nodes in the cpuset */
> - nodes_and(nodes, nodes, current->mems_allowed);
> -#endif
> return do_mbind(start, len, mode, &nodes, flags);
> }
>
> Index: Linux/include/linux/cpuset.h
> ===================================================================
> --- Linux.orig/include/linux/cpuset.h 2008-02-08 11:11:34.000000000 -0500
> +++ Linux/include/linux/cpuset.h 2008-02-08 11:12:43.000000000 -0500
> @@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
> #define cpuset_current_mems_allowed (current->mems_allowed)
> void cpuset_init_current_mems_allowed(void);
> void cpuset_update_task_memory_state(void);
> -#define cpuset_nodes_subset_current_mems_allowed(nodes) \
> - nodes_subset((nodes), current->mems_allowed)
> int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
>
> extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
> @@ -103,7 +101,6 @@ static inline nodemask_t cpuset_mems_all
> #define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
> static inline void cpuset_init_current_mems_allowed(void) {}
> static inline void cpuset_update_task_memory_state(void) {}
> -#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
>
> static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
> {
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Sun, Feb 10, 2008 at 02:29:24PM +0900, KOSAKI Motohiro wrote:
> CC'd Greg KH <[email protected]>
>
> I tested this patch on fujitsu memoryless node.
> (2.6.24 + silently-restrict-nodemask-to-allowed-nodes-V3 insted 2.6.24-mm1)
> it seems works good.
>
> Tested-by: KOSAKI Motohiro <[email protected]>
>
>
> Greg, I hope this patch merge to 2.6.24.x stable tree because
> this patch is regression fixed patch.
> Please tell me what do i doing for it.
Once the patch goes into Linus's tree, feel free to send it to the
[email protected] address so that we can include it in the 2.6.24.x
tree.
thanks,
greg k-h
On Sat, 9 Feb 2008, Greg KH wrote:
>
> Once the patch goes into Linus's tree, feel free to send it to the
> [email protected] address so that we can include it in the 2.6.24.x
> tree.
I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so
I simply don't know whether it's supposed to go into *my* kernel or just
-mm.
There's also been several versions and discussions, so I'd really like to
have somebody send me a final patch with all the acks etc.. One that is
clearly for me, not for -mm.
Linus
On Sat, 9 Feb 2008 23:42:21 -0800 (PST) Linus Torvalds <[email protected]> wrote:
>
>
> On Sat, 9 Feb 2008, Greg KH wrote:
> >
> > Once the patch goes into Linus's tree, feel free to send it to the
> > [email protected] address so that we can include it in the 2.6.24.x
> > tree.
>
> I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so
> I simply don't know whether it's supposed to go into *my* kernel or just
> -mm.
>
> There's also been several versions and discussions, so I'd really like to
> have somebody send me a final patch with all the acks etc.. One that is
> clearly for me, not for -mm.
>
fyi, I won't be able to do much patch-wrangling until Tuesday or Wednesday.
All the big machines are disconnected and mothballed due to domestic
s/carpet/hardwood/g.
On Sat, 2008-02-09 at 23:42 -0800, Linus Torvalds wrote:
>
> On Sat, 9 Feb 2008, Greg KH wrote:
> >
> > Once the patch goes into Linus's tree, feel free to send it to the
> > [email protected] address so that we can include it in the 2.6.24.x
> > tree.
>
> I've been ignoring the patches because they say "PATCH 2.6.24-mm1", and so
> I simply don't know whether it's supposed to go into *my* kernel or just
> -mm.
>
> There's also been several versions and discussions, so I'd really like to
> have somebody send me a final patch with all the acks etc.. One that is
> clearly for me, not for -mm.
>
Kosaki-san: You've tested V3 on '.24. Do you want to repost the patch
refreshed against .24, adding your "Tested-by:" [and "Signed-off-by:",
as the folding of the contextualization into mpol_check_policy() is
based on your code--apologies for not adding it myself]? I'm tied up
with something else for most of this week and won't get to it until
Friday, earliest.
Regards,
Lee
P.S., As Andrew pointed out, I forgot to run checkpatch and the patch
does include a violation thereof.
Hi Andrew
# this is second post of the same patch.
this is backport from -mm to mainline.
original patch is http://marc.info/?l=linux-kernel&m=120250000001182&w=2
my change is only line number change and remove extra space.
please ack.
============================
[PATCH] 2.6.24 - mempolicy: silently restrict nodemask to allowed nodes
V2 -> V3:
+ As suggested by Kosaki Motohito, fold the "contextualization"
of policy nodemask into mpol_check_policy(). Looks a little
cleaner.
V1 -> V2:
+ Communicate whether or not incoming node mask was empty to
mpol_check_policy() for better error checking.
+ As suggested by David Rientjes, remove the now unused
cpuset_nodes_subset_current_mems_allowed() from cpuset.h
Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
presence of memoryless nodes. This patch attempts to fix that problem.
Some background:
numactl --interleave=all calls set_mempolicy(2) with a fully
populated [out to MAXNUMNODES] nodemask. set_mempolicy()
[in do_set_mempolicy()] calls contextualize_policy() which
requires that the nodemask be a subset of the current task's
mems_allowed; else EINVAL will be returned. A task's
mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]--
i.e., nodes with memory. So, a fully populated nodemask will
be declared invalid if it includes memoryless nodes.
NOTE: the same thing will occur when running in a cpuset
with restricted mem_allowed--for the same reason:
node mask contains dis-allowed nodes.
mbind(2), on the other hand, just masks off any nodes in the
nodemask that are not included in the caller's mems_allowed.
In each case [mbind() and set_mempolicy()], mpol_check_policy()
will complain [again, resulting in EINVAL] if the nodemask contains
any memoryless nodes. This is somewhat redundant as mpol_new()
will remove memoryless nodes for interleave policy, as will
bind_zonelist()--called by mpol_new() for BIND policy.
Proposed fix:
1) modify contextualize_policy logic to:
a) remember whether the incoming node mask is empty.
b) if not, restrict the nodemask to allowed nodes, as is
currently done in-line for mbind(). This guarantees
that the resulting mask includes only nodes with memory.
NOTE: this is a [benign, IMO] change in behavior for
set_mempolicy(). Dis-allowed nodes will be
silently ignored, rather than returning an error.
c) fold this code into mpol_check_policy(), replace 2 calls to
contextualize_policy() to call mpol_check_policy() directly
and remove contextualize_policy().
2) In existing mpol_check_policy() logic, after "contextualization":
a) MPOL_DEFAULT: require that in coming mask "was_empty"
b) MPOL_{BIND|INTERLEAVE}: require that contextualized nodemask
contains at least one node.
c) add a case for MPOL_PREFERRED: if in coming was not empty
and resulting mask IS empty, user specified invalid nodes.
Return EINVAL.
c) remove the now redundant check for memoryless nodes
3) remove the now redundant masking of policy nodes for interleave
policy from mpol_new().
4) Now that mpol_check_policy() contextualizes the nodemask, remove
the in-line nodes_and() from sys_mbind(). I believe that this
restores mbind() to the behavior before the memoryless-nodes
patch series. E.g., we'll no longer treat an invalid nodemask
with MPOL_PREFERRED as local allocation.
Signed-off-by: Lee Schermerhorn <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Tested-by: KOSAKI Motohiro <[email protected]>
Acked-by: David Rientjes <[email protected]>
include/linux/cpuset.h | 3 --
mm/mempolicy.c | 61 ++++++++++++++++++++++++++++---------------------
2 files changed, 36 insertions(+), 28 deletions(-)
Index: b/mm/mempolicy.c
===================================================================
--- a/mm/mempolicy.c 2008-02-10 14:27:58.000000000 +0900
+++ b/mm/mempolicy.c 2008-02-12 09:37:35.000000000 +0900
@@ -116,22 +116,51 @@ static void mpol_rebind_policy(struct me
/* Do sanity checking on a policy */
static int mpol_check_policy(int mode, nodemask_t *nodes)
{
- int empty = nodes_empty(*nodes);
+ int was_empty, is_empty;
+
+ if (!nodes)
+ return 0;
+
+ /*
+ * "Contextualize" the in-coming nodemast for cpusets:
+ * Remember whether in-coming nodemask was empty, If not,
+ * restrict the nodes to the allowed nodes in the cpuset.
+ * This is guaranteed to be a subset of nodes with memory.
+ */
+ cpuset_update_task_memory_state();
+ is_empty = was_empty = nodes_empty(*nodes);
+ if (!was_empty) {
+ nodes_and(*nodes, *nodes, cpuset_current_mems_allowed);
+ is_empty = nodes_empty(*nodes); /* after "contextualization" */
+ }
switch (mode) {
case MPOL_DEFAULT:
- if (!empty)
+ /*
+ * require caller to specify an empty nodemask
+ * before "contextualization"
+ */
+ if (!was_empty)
return -EINVAL;
break;
case MPOL_BIND:
case MPOL_INTERLEAVE:
- /* Preferred will only use the first bit, but allow
- more for now. */
- if (empty)
+ /*
+ * require at least 1 valid node after "contextualization"
+ */
+ if (is_empty)
+ return -EINVAL;
+ break;
+ case MPOL_PREFERRED:
+ /*
+ * Did caller specify invalid nodes?
+ * Don't silently accept this as "local allocation".
+ */
+ if (!was_empty && is_empty)
return -EINVAL;
break;
}
- return nodes_subset(*nodes, node_states[N_HIGH_MEMORY]) ? 0 : -EINVAL;
+ return 0;
}
/* Generate a custom zonelist for the BIND policy. */
@@ -188,8 +217,6 @@ static struct mempolicy *mpol_new(int mo
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
- nodes_and(policy->v.nodes, policy->v.nodes,
- node_states[N_HIGH_MEMORY]);
if (nodes_weight(policy->v.nodes) == 0) {
kmem_cache_free(policy_cache, policy);
return ERR_PTR(-EINVAL);
@@ -421,18 +448,6 @@ static int mbind_range(struct vm_area_st
return err;
}
-static int contextualize_policy(int mode, nodemask_t *nodes)
-{
- if (!nodes)
- return 0;
-
- cpuset_update_task_memory_state();
- if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
- return -EINVAL;
- return mpol_check_policy(mode, nodes);
-}
-
-
/*
* Update task->flags PF_MEMPOLICY bit: set iff non-default
* mempolicy. Allows more rapid checking of this (combined perhaps
@@ -468,7 +483,7 @@ static long do_set_mempolicy(int mode, n
{
struct mempolicy *new;
- if (contextualize_policy(mode, nodes))
+ if (mpol_check_policy(mode, nodes))
return -EINVAL;
new = mpol_new(mode, nodes);
if (IS_ERR(new))
@@ -915,10 +930,6 @@ asmlinkage long sys_mbind(unsigned long
err = get_nodes(&nodes, nmask, maxnode);
if (err)
return err;
-#ifdef CONFIG_CPUSETS
- /* Restrict the nodes to the allowed nodes in the cpuset */
- nodes_and(nodes, nodes, current->mems_allowed);
-#endif
return do_mbind(start, len, mode, &nodes, flags);
}
Index: b/include/linux/cpuset.h
===================================================================
--- a/include/linux/cpuset.h 2008-02-10 14:27:58.000000000 +0900
+++ b/include/linux/cpuset.h 2008-02-10 14:33:40.000000000 +0900
@@ -26,8 +26,6 @@ extern nodemask_t cpuset_mems_allowed(st
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
void cpuset_update_task_memory_state(void);
-#define cpuset_nodes_subset_current_mems_allowed(nodes) \
- nodes_subset((nodes), current->mems_allowed)
int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
extern int __cpuset_zone_allowed_softwall(struct zone *z, gfp_t gfp_mask);
@@ -101,7 +99,6 @@ static inline nodemask_t cpuset_mems_all
#define cpuset_current_mems_allowed (node_states[N_HIGH_MEMORY])
static inline void cpuset_init_current_mems_allowed(void) {}
static inline void cpuset_update_task_memory_state(void) {}
-#define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
{
On Tue, 12 Feb 2008 13:30:22 +0900 KOSAKI Motohiro <[email protected]> wrote:
> Hi Andrew
>
> # this is second post of the same patch.
>
> this is backport from -mm to mainline.
> original patch is http://marc.info/?l=linux-kernel&m=120250000001182&w=2
>
> my change is only line number change and remove extra space.
This is identical to what I have now.
> please ack.
As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
such patches for a week or so before sending up to Linus
Should this be backported to 2.6.24.x? If so, the reasons for such a
relatively stern step should be spelled out in the changelog for the
-stable maintiners to evaluate.
Thanks.
On Tue, 12 Feb 2008, KOSAKI Motohiro wrote:
> [PATCH] 2.6.24 - mempolicy: silently restrict nodemask to allowed nodes
>
Linus has already merged this patch into his tree, but next time you
pass along a contribution to a maintainer the first line should read:
From: Lee Schermerhorn <[email protected]>
so the person who actually wrote the patch is listed as the author in the
git commit.
Hi
> > please ack.
>
> As it's now post -rc1 and not a 100% obvious thing, I tend to hang onto
> such patches for a week or so before sending up to Linus
Thanks, really thanks.
> Should this be backported to 2.6.24.x? If so, the reasons for such a
> relatively stern step should be spelled out in the changelog for the
> -stable maintiners to evaluate.
Oh,
you think below reason is not enough, really?
1. it is regression.
2. it is very easy reprodusable on memoryless node machine.
if so, i back down on my backport reclaim.
I don't hope increase your headache ;-)
thanks.
-kosaki