We bind an different node to different vma, Unluckily,
it will bind different vma to same node by checking the /proc/pid/numa_maps.
Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
has introduced the issue. when we change memory policy by seting cpuset.mems,
A process will rebind the specified policy more than one times.
if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
Maybe result in the out of memory which allocating memory from same node.
Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
Signed-off-by: zhong jiang <[email protected]>
---
mm/mempolicy.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e3ab1d9..a60a3be 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
else {
nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
*nodes);
- pol->w.cpuset_mems_allowed = tmp;
+ pol->w.cpuset_mems_allowed = *nodes;
}
if (nodes_empty(tmp))
--
1.7.12.4
(Cc Vlastimil)
On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <[email protected]> wrote:
> We bind an different node to different vma, Unluckily,
> it will bind different vma to same node by checking the /proc/pid/numa_maps.
> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> has introduced the issue. when we change memory policy by seting cpuset.mems,
> A process will rebind the specified policy more than one times.
> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
> Maybe result in the out of memory which allocating memory from same node.
>
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> else {
> nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
> *nodes);
> - pol->w.cpuset_mems_allowed = tmp;
> + pol->w.cpuset_mems_allowed = *nodes;
> }
>
> if (nodes_empty(tmp))
hm, I'm not surprised the code broke. What the heck is going on in
there? It used to have a perfunctory comment, but Vlastimil deleted
it.
Could someone please propose a comment for the above code block
explaining why we're doing what we do?
On 5/25/19 8:28 PM, Andrew Morton wrote:
> (Cc Vlastimil)
Oh dear, 2 years and I forgot all the details about how this works.
> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <[email protected]> wrote:
>
>> We bind an different node to different vma, Unluckily,
>> it will bind different vma to same node by checking the /proc/pid/numa_maps.
>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
>> has introduced the issue. when we change memory policy by seting cpuset.mems,
>> A process will rebind the specified policy more than one times.
>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
>> Maybe result in the out of memory which allocating memory from same node.
I have a hard time understanding what the problem is. Could you please
write it as a (pseudo) reproducer? I.e. an example of the process/admin
mempolicy/cpuset actions that have some wrong observed results vs the
correct expected result.
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>> else {
>> nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>> *nodes);
>> - pol->w.cpuset_mems_allowed = tmp;
>> + pol->w.cpuset_mems_allowed = *nodes;
Looks like a mechanical error on my side when removing the code for
step1+step2 rebinding. Before my commit there was
pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
Since 'step' was removed and thus 0, I should have used *nodes indeed.
Thanks for catching that.
>> }
>>
>> if (nodes_empty(tmp))
>
> hm, I'm not surprised the code broke. What the heck is going on in
> there? It used to have a perfunctory comment, but Vlastimil deleted
> it.
Yeah the comment was specific for the case that was being removed.
> Could someone please propose a comment for the above code block
> explaining why we're doing what we do?
I'll have to relearn this first...
On 2019/5/27 20:23, Vlastimil Babka wrote:
> On 5/25/19 8:28 PM, Andrew Morton wrote:
>> (Cc Vlastimil)
> Oh dear, 2 years and I forgot all the details about how this works.
>
>> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <[email protected]> wrote:
>>
>>> We bind an different node to different vma, Unluckily,
>>> it will bind different vma to same node by checking the /proc/pid/numa_maps.
>>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
>>> has introduced the issue. when we change memory policy by seting cpuset.mems,
>>> A process will rebind the specified policy more than one times.
>>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
>>> Maybe result in the out of memory which allocating memory from same node.
> I have a hard time understanding what the problem is. Could you please
> write it as a (pseudo) reproducer? I.e. an example of the process/admin
> mempolicy/cpuset actions that have some wrong observed results vs the
> correct expected result.
Sorry, I havn't an testcase to reproduce the issue. At first, It was disappeared by
my colleague to configure the xml to start an vm. To his suprise, The bind mempolicy
doesn't work.
Thanks,
zhong jiang
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>>> else {
>>> nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>>> *nodes);
>>> - pol->w.cpuset_mems_allowed = tmp;
>>> + pol->w.cpuset_mems_allowed = *nodes;
> Looks like a mechanical error on my side when removing the code for
> step1+step2 rebinding. Before my commit there was
>
> pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
>
> Since 'step' was removed and thus 0, I should have used *nodes indeed.
> Thanks for catching that.
>
>>> }
>>>
>>> if (nodes_empty(tmp))
>> hm, I'm not surprised the code broke. What the heck is going on in
>> there? It used to have a perfunctory comment, but Vlastimil deleted
>> it.
> Yeah the comment was specific for the case that was being removed.
>
>> Could someone please propose a comment for the above code block
>> explaining why we're doing what we do?
> I'll have to relearn this first...
>
>
On Mon, 27 May 2019 21:58:17 +0800 zhong jiang <[email protected]> wrote:
> On 2019/5/27 20:23, Vlastimil Babka wrote:
> > On 5/25/19 8:28 PM, Andrew Morton wrote:
> >> (Cc Vlastimil)
> > Oh dear, 2 years and I forgot all the details about how this works.
> >
> >> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <[email protected]> wrote:
> >>
> >>> We bind an different node to different vma, Unluckily,
> >>> it will bind different vma to same node by checking the /proc/pid/numa_maps.
> >>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> >>> has introduced the issue. when we change memory policy by seting cpuset.mems,
> >>> A process will rebind the specified policy more than one times.
> >>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
> >>> Maybe result in the out of memory which allocating memory from same node.
> > I have a hard time understanding what the problem is. Could you please
> > write it as a (pseudo) reproducer? I.e. an example of the process/admin
> > mempolicy/cpuset actions that have some wrong observed results vs the
> > correct expected result.
> Sorry, I havn't an testcase to reproduce the issue. At first, It was disappeared by
> my colleague to configure the xml to start an vm. To his suprise, The bind mempolicy
> doesn't work.
So... what do we do with this patch?
> Thanks,
> zhong jiang
> >>> --- a/mm/mempolicy.c
> >>> +++ b/mm/mempolicy.c
> >>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> >>> else {
> >>> nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
> >>> *nodes);
> >>> - pol->w.cpuset_mems_allowed = tmp;
> >>> + pol->w.cpuset_mems_allowed = *nodes;
> > Looks like a mechanical error on my side when removing the code for
> > step1+step2 rebinding. Before my commit there was
> >
> > pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
> >
> > Since 'step' was removed and thus 0, I should have used *nodes indeed.
> > Thanks for catching that.
Was that an ack?
> >>> }
> >>>
> >>> if (nodes_empty(tmp))
> >> hm, I'm not surprised the code broke. What the heck is going on in
> >> there? It used to have a perfunctory comment, but Vlastimil deleted
> >> it.
> > Yeah the comment was specific for the case that was being removed.
> >
> >> Could someone please propose a comment for the above code block
> >> explaining why we're doing what we do?
> > I'll have to relearn this first...
> >
> >
>
On 6/27/19 5:57 AM, Andrew Morton wrote:
> On Mon, 27 May 2019 21:58:17 +0800 zhong jiang <[email protected]> wrote:
>
>> On 2019/5/27 20:23, Vlastimil Babka wrote:
>>> On 5/25/19 8:28 PM, Andrew Morton wrote:
>>>> (Cc Vlastimil)
>>> Oh dear, 2 years and I forgot all the details about how this works.
>>>
>>>> On Sat, 25 May 2019 15:07:23 +0800 zhong jiang <[email protected]> wrote:
>>>>
>>>>> We bind an different node to different vma, Unluckily,
>>>>> it will bind different vma to same node by checking the /proc/pid/numa_maps.
>>>>> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
>>>>> has introduced the issue. when we change memory policy by seting cpuset.mems,
>>>>> A process will rebind the specified policy more than one times.
>>>>> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
>>>>> Maybe result in the out of memory which allocating memory from same node.
>>> I have a hard time understanding what the problem is. Could you please
>>> write it as a (pseudo) reproducer? I.e. an example of the process/admin
>>> mempolicy/cpuset actions that have some wrong observed results vs the
>>> correct expected result.
>> Sorry, I havn't an testcase to reproduce the issue. At first, It was disappeared by
>> my colleague to configure the xml to start an vm. To his suprise, The bind mempolicy
>> doesn't work.
>
> So... what do we do with this patch?
>
>> Thanks,
>> zhong jiang
>>>>> --- a/mm/mempolicy.c
>>>>> +++ b/mm/mempolicy.c
>>>>> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
>>>>> else {
>>>>> nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
>>>>> *nodes);
>>>>> - pol->w.cpuset_mems_allowed = tmp;
>>>>> + pol->w.cpuset_mems_allowed = *nodes;
>>> Looks like a mechanical error on my side when removing the code for
>>> step1+step2 rebinding. Before my commit there was
>>>
>>> pol->w.cpuset_mems_allowed = step ? tmp : *nodes;
>>>
>>> Since 'step' was removed and thus 0, I should have used *nodes indeed.
>>> Thanks for catching that.
>
> Was that an ack?
The fix should be fine, but the description is vague. I'll try to
improve it.
>>>>> }
>>>>>
>>>>> if (nodes_empty(tmp))
>>>> hm, I'm not surprised the code broke. What the heck is going on in
>>>> there? It used to have a perfunctory comment, but Vlastimil deleted
>>>> it.
>>> Yeah the comment was specific for the case that was being removed.
>>>
>>>> Could someone please propose a comment for the above code block
>>>> explaining why we're doing what we do?
>>> I'll have to relearn this first...
>>>
>>>
>>
On 5/25/19 9:07 AM, zhong jiang wrote:
> We bind an different node to different vma, Unluckily,
> it will bind different vma to same node by checking the /proc/pid/numa_maps.
> Commit 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> has introduced the issue. when we change memory policy by seting cpuset.mems,
> A process will rebind the specified policy more than one times.
> if the cpuset_mems_allowed is not equal to user specified nodes. hence the issue will trigger.
> Maybe result in the out of memory which allocating memory from same node.
OK, how about this instead?
mpol_rebind_nodemask() is called for MPOL_BIND and MPOL_INTERLEAVE
mempoclicies when the tasks's cpuset's mems_allowed changes. For
policies created without MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES,
it works by remapping the policy's allowed nodes (stored in v.nodes)
using the previous value of mems_allowed (stored in
w.cpuset_mems_allowed) as the domain of map and the new mems_allowed
(passed as nodes) as the range of the map (see the comment of
bitmap_remap() for details).
The result of remapping is stored back as policy's nodemask in v.nodes,
and the new value of mems_allowed should be stored in
w.cpuset_mems_allowed to facilitate the next rebind, if it happens.
However, commit 213980c0f23b ("mm, mempolicy: simplify rebinding
mempolicies when updating cpusets") introduced a bug where the result of
remapping is stored in w.cpuset_mems_allowed instead. Thus, a
mempolicy's allowed nodes can evolve in an unexpected way after a series
of rebinding due to cpuset mems_allowed changes, possibly binding to a
wrong node or a smaller number of nodes which may e.g. overload them.
This patch fixes the bug so rebinding again works as intended.
> Fixes: 213980c0f23b ("mm, mempolicy: simplify rebinding mempolicies when updating cpusets")
> Signed-off-by: zhong jiang <[email protected]>
(an example of what exactly was the sequence of set_mempolicy and cpuset
mems changes with expected wrt actual results would be nice, but I think
the above should be fine by itself)
Reviewed-by: Vlastimil Babka <[email protected]>
> ---
> mm/mempolicy.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index e3ab1d9..a60a3be 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -345,7 +345,7 @@ static void mpol_rebind_nodemask(struct mempolicy *pol, const nodemask_t *nodes)
> else {
> nodes_remap(tmp, pol->v.nodes,pol->w.cpuset_mems_allowed,
> *nodes);
> - pol->w.cpuset_mems_allowed = tmp;
> + pol->w.cpuset_mems_allowed = *nodes;
> }
>
> if (nodes_empty(tmp))
>