2010-04-22 14:13:18

by Miao Xie

[permalink] [raw]
Subject: [PATCH 1/2] mm: fix bugs of mpol_rebind_nodemask()

- local variable might be an empty nodemask, so must be checked before setting
pol->v.nodes to it.

- nodes_remap() may cause the weight of pol->v.nodes being monotonic decreasing.
and never become large even we pass a nodemask with large weight after
->v.nodes become little.

this patch fixes these two problem.

Signed-off-by: Miao Xie <[email protected]>
---
mm/mempolicy.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 08f40a2..03ba9fc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -291,12 +291,15 @@ static void mpol_rebind_nodemask(struct mempolicy *pol,
else if (pol->flags & MPOL_F_RELATIVE_NODES)
mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
else {
- nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed,
- *nodes);
+ tmp = *nodes;
pol->w.cpuset_mems_allowed = *nodes;
}

- pol->v.nodes = tmp;
+ if (nodes_empty(tmp))
+ pol->v.nodes = *nodes;
+ else
+ pol->v.nodes = tmp;
+
if (!node_isset(current->il_next, tmp)) {
current->il_next = next_node(current->il_next, tmp);
if (current->il_next >= MAX_NUMNODES)
--
1.6.5.2


2010-04-22 21:20:33

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: fix bugs of mpol_rebind_nodemask()

On Thu, 22 Apr 2010, Miao Xie wrote:

> - local variable might be an empty nodemask, so must be checked before setting
> pol->v.nodes to it.
>
> - nodes_remap() may cause the weight of pol->v.nodes being monotonic decreasing.
> and never become large even we pass a nodemask with large weight after
> ->v.nodes become little.
>

That's always been the intention of rebinding a mempolicy nodemask: we
remap the current mempolicy nodes over the new nodemask given the set of
allowed nodes. The nodes_remap() shouldn't be removed.

> this patch fixes these two problem.
>
> Signed-off-by: Miao Xie <[email protected]>
> ---
> mm/mempolicy.c | 9 ++++++---
> 1 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 08f40a2..03ba9fc 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -291,12 +291,15 @@ static void mpol_rebind_nodemask(struct mempolicy *pol,
> else if (pol->flags & MPOL_F_RELATIVE_NODES)
> mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
> else {
> - nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed,
> - *nodes);
> + tmp = *nodes;
> pol->w.cpuset_mems_allowed = *nodes;
> }
>
> - pol->v.nodes = tmp;
> + if (nodes_empty(tmp))
> + pol->v.nodes = *nodes;
> + else
> + pol->v.nodes = tmp;
> +
> if (!node_isset(current->il_next, tmp)) {
> current->il_next = next_node(current->il_next, tmp);
> if (current->il_next >= MAX_NUMNODES)

2010-04-23 01:26:58

by Miao Xie

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: fix bugs of mpol_rebind_nodemask()

on 2010-4-23 5:20, David Rientjes wrote:
> On Thu, 22 Apr 2010, Miao Xie wrote:
>
>> - local variable might be an empty nodemask, so must be checked before setting
>> pol->v.nodes to it.
>>
>> - nodes_remap() may cause the weight of pol->v.nodes being monotonic decreasing.
>> and never become large even we pass a nodemask with large weight after
>> ->v.nodes become little.
>>
>
> That's always been the intention of rebinding a mempolicy nodemask: we
> remap the current mempolicy nodes over the new nodemask given the set of
> allowed nodes. The nodes_remap() shouldn't be removed.

Suppose the current mempolicy nodes is 0-2, we can remap it from 0-2 to 2,
then we can remap it from 2 to 1, but we can't remap it from 2 to 0-2.

that is to say it can't be remaped to a large set of allowed nodes, and the task
just can use the small set of nodes for ever, even the large set of nodes is allowed,
I think it is unreasonable.

Thanks
Miao

>
>> this patch fixes these two problem.
>>
>> Signed-off-by: Miao Xie <[email protected]>
>> ---
>> mm/mempolicy.c | 9 ++++++---
>> 1 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 08f40a2..03ba9fc 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -291,12 +291,15 @@ static void mpol_rebind_nodemask(struct mempolicy *pol,
>> else if (pol->flags & MPOL_F_RELATIVE_NODES)
>> mpol_relative_nodemask(&tmp, &pol->w.user_nodemask, nodes);
>> else {
>> - nodes_remap(tmp, pol->v.nodes, pol->w.cpuset_mems_allowed,
>> - *nodes);
>> + tmp = *nodes;
>> pol->w.cpuset_mems_allowed = *nodes;
>> }
>>
>> - pol->v.nodes = tmp;
>> + if (nodes_empty(tmp))
>> + pol->v.nodes = *nodes;
>> + else
>> + pol->v.nodes = tmp;
>> +
>> if (!node_isset(current->il_next, tmp)) {
>> current->il_next = next_node(current->il_next, tmp);
>> if (current->il_next >= MAX_NUMNODES)
>
>
>

2010-04-23 08:45:30

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: fix bugs of mpol_rebind_nodemask()

On Fri, 23 Apr 2010, Miao Xie wrote:

> Suppose the current mempolicy nodes is 0-2, we can remap it from 0-2 to 2,
> then we can remap it from 2 to 1, but we can't remap it from 2 to 0-2.
>
> that is to say it can't be remaped to a large set of allowed nodes, and the task
> just can use the small set of nodes for ever, even the large set of nodes is allowed,
> I think it is unreasonable.
>

That's been the behavior for at least three years so changing it from
under the applications isn't acceptable, see
Documentation/vm/numa_memory_policy.txt regarding mempolicy rebinds and
the two flags that are defined that can be used to adjust the behavior.

The pol->v.nodes = nodes_empty(tmp) ? *nodes : tmp fix is welcome,
however, as a standalone patch.

2010-04-30 18:48:44

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm: fix bugs of mpol_rebind_nodemask()

On Thu, 29 Apr 2010, Miao Xie wrote:

> > That's been the behavior for at least three years so changing it from
> > under the applications isn't acceptable, see
> > Documentation/vm/numa_memory_policy.txt regarding mempolicy rebinds and
> > the two flags that are defined that can be used to adjust the behavior.
>
> Is the flags what you said MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES?
> But the codes that I changed isn't under MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES.
> The documentation doesn't say what we should do if either of these two flags is not set.
>

MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES allow you to adjust the
behavior of the rebind: the former requires specific nodes to be assigned
to the mempolicy and could suppress the rebind completely, if necessary;
the latter ensures the mempolicy nodemask has a certain weight as nodes
are assigned in a round-robin manner. The behavior that you're referring
to is provided via MPOL_F_RELATIVE_NODES, which guarantees whatever weight
is passed via set_mempolicy() will be preserved when mems are added to a
cpuset.

Regardless of whether the behavior is documented when either flag is
passed, we can't change the long-standing default behavior that people use
when their cpuset mems are rebound: we can only extend the functionality
and the behavior you're seeking is already available with a
MPOL_F_RELATIVE_NODES flag modifier.

> Furthermore, in order to fix no node to alloc memory, when we want to update mempolicy
> and mems_allowed, we expand the set of nodes first (set all the newly nodes) and
> shrink the set of nodes lazily(clean disallowed nodes).

That's a cpuset implementation choice, not a mempolicy one; mempolicies
have nothing to do with an empty current->mems_allowed.

> But remap() breaks the expanding, so if we don't remove remap(), the problem can't be
> fixed. Otherwise, cpuset has to do the rebinding by itself and the code is ugly.
> Like this:
>
> static void cpuset_change_task_nodemask(struct task_struct *tsk, nodemask_t *newmems)
> {
> nodemask_t tmp;
> ...
> /* expand the set of nodes */
> if (!mpol_store_user_nodemask(tsk->mempolicy)) {
> nodes_remap(tmp, ...);
> nodes_or(tsk->mempolicy->v.nodes, tsk->mempolicy->v.nodes, tmp);
> }
> ...
>
> /* shrink the set of nodes */
> if (!mpol_store_user_nodemask(tsk->mempolicy))
> tsk->mempolicy->v.nodes = tmp;
> }
>

I don't see why this is even necessary, the mempolicy code could simply
return numa_node_id() when nodes_empty(current->mempolicy->v.nodes) to
close the race.

[ Your pseudo-code is also lacking task_lock(tsk), which is required to
safely dereference tsk->mempolicy, and this is only available so far in
-mm since the oom killer rewrite. ]