2022-10-26 08:09:30

by Aneesh Kumar K.V

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/26/22 1:13 PM, Feng Tang wrote:
> In the page reclaim path, memory can be demoted from a faster memory tier
> to a slower memory tier. Currently, there is no check against cpuset's
> memory policy, so even if the target demotion node is not allowed
> by cpuset, the demotion will still happen, which breaks the cpuset
> semantics.
>
> So add a cpuset policy check in the demotion path and skip demotion
> if the demotion targets are not allowed by cpuset.
>

What about the vma policy or the task memory policy? Shouldn't we respect
those memory policy restrictions while demoting the page?

-aneesh


2022-10-26 08:17:04

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> On 10/26/22 1:13 PM, Feng Tang wrote:
> > In page reclaim path, memory could be demoted from faster memory tier
> > to slower memory tier. Currently, there is no check about cpuset's
> > memory policy, so even if the target demotion node is not allowed
> > by cpuset, the demotion will still happen, which breaks the cpuset
> > semantics.
> >
> > So add cpuset policy check in the demotion path and skip demotion
> > if the demotion targets are not allowed by cpuset.
> >
>
> What about the vma policy or the task memory policy? Shouldn't we respect
> those memory policy restrictions while demoting the page?

Good question! We have some basic patches that consider memory policy
in the demotion path too; they are still under test and will be posted
soon. The basic idea is similar to this patch.

Thanks,
Feng

> -aneesh

2022-10-26 09:34:09

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed 26-10-22 16:00:13, Feng Tang wrote:
> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > In page reclaim path, memory could be demoted from faster memory tier
> > > to slower memory tier. Currently, there is no check about cpuset's
> > > memory policy, so even if the target demotion node is not allowed
> > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > semantics.
> > >
> > > So add cpuset policy check in the demotion path and skip demotion
> > > if the demotion targets are not allowed by cpuset.
> > >
> >
> > What about the vma policy or the task memory policy? Shouldn't we respect
> > those memory policy restrictions while demoting the page?
>
> Good question! We have some basic patches to consider memory policy
> in demotion path too, which are still under test, and will be posted
> soon. And the basic idea is similar to this patch.

For that you need to consult each vma and its owning task(s), and that
to me sounds like something to be done in folio_check_references.
Relying on memcg to get a cpuset cgroup is really ugly and not really
100% correct. The memory controller might be disabled and then you do not
have your association anymore.

This all can get quite expensive, so the primary question is: does the
existing behavior generate any real issues, or is this more of a
correctness exercise? I mean, it certainly is not great to demote to an
incompatible NUMA node, but are there any reasonable configurations where
the demotion target node is explicitly excluded from the memory
policy/cpuset?
--
Michal Hocko
SUSE Labs

2022-10-26 11:04:15

by Aneesh Kumar K.V

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/26/22 2:49 PM, Michal Hocko wrote:
> On Wed 26-10-22 16:00:13, Feng Tang wrote:
>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>>> On 10/26/22 1:13 PM, Feng Tang wrote:
>>>> In page reclaim path, memory could be demoted from faster memory tier
>>>> to slower memory tier. Currently, there is no check about cpuset's
>>>> memory policy, so even if the target demotion node is not allowed
>>>> by cpuset, the demotion will still happen, which breaks the cpuset
>>>> semantics.
>>>>
>>>> So add cpuset policy check in the demotion path and skip demotion
>>>> if the demotion targets are not allowed by cpuset.
>>>>
>>>
>>> What about the vma policy or the task memory policy? Shouldn't we respect
>>> those memory policy restrictions while demoting the page?
>>
>> Good question! We have some basic patches to consider memory policy
>> in demotion path too, which are still under test, and will be posted
>> soon. And the basic idea is similar to this patch.
>
> For that you need to consult each vma and it's owning task(s) and that
> to me sounds like something to be done in folio_check_references.
> Relying on memcg to get a cpuset cgroup is really ugly and not really
> 100% correct. Memory controller might be disabled and then you do not
> have your association anymore.
>

I was looking at this recently and I am wondering whether we should worry about VM_SHARED
vmas.

ie, a page_to_policy() helper can just reverse-lookup one VMA and fetch the policy, right?
If it is VM_SHARED, it will be a shared policy we can find using vma->vm_file?

For non-anonymous vmas, and anonymous vmas without any policy set, it is the owning task's policy via vma->vm_mm->owner, right?
We don't worry about multiple tasks that could possibly be sharing that page, right?

> This all can get quite expensive so the primary question is, does the
> existing behavior generates any real issues or is this more of an
> correctness exercise? I mean it certainly is not great to demote to an
> incompatible numa node but are there any reasonable configurations when
> the demotion target node is explicitly excluded from memory
> policy/cpuset?

I guess the vma policy is important. Applications want to make sure that they don't
have variable performance, and they go to great lengths to avoid that by using MPOL_BIND.
So when they access the memory they always know the access is satisfied from a specific
set of NUMA nodes. Swap-in can cause a performance impact, but then all continued
access will again come from that specific set of NUMA nodes. With slow-memory demotion that is
not going to be the case. Large in-memory database applications are observed to
be sensitive to these access latencies.
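
As an aside, a minimal userspace sketch of that MPOL_BIND usage (not from the
patch under discussion; the node numbers and buffer size are assumptions, and
it needs libnuma's <numaif.h>):

#include <numaif.h>	/* mbind(), MPOL_BIND, MPOL_MF_STRICT; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	size_t len = 64UL << 20;	/* assumed 64 MiB working set */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Restrict the range to nodes 0 and 1, so every access is
	 * satisfied from this fixed set of NUMA nodes. */
	unsigned long nodemask = (1UL << 0) | (1UL << 1);
	if (mbind(buf, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, MPOL_MF_STRICT)) {
		perror("mbind");
		return 1;
	}
	return 0;
}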


-aneesh

2022-10-26 11:17:30

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
> On 10/26/22 2:49 PM, Michal Hocko wrote:
> > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> >> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> >>> On 10/26/22 1:13 PM, Feng Tang wrote:
> >>>> In page reclaim path, memory could be demoted from faster memory tier
> >>>> to slower memory tier. Currently, there is no check about cpuset's
> > > > memory policy, so even if the target demotion node is not allowed
> >>>> by cpuset, the demotion will still happen, which breaks the cpuset
> >>>> semantics.
> >>>>
> >>>> So add cpuset policy check in the demotion path and skip demotion
> >>>> if the demotion targets are not allowed by cpuset.
> >>>>
> >>>
> >>> What about the vma policy or the task memory policy? Shouldn't we respect
> >>> those memory policy restrictions while demoting the page?
> >>
> >> Good question! We have some basic patches to consider memory policy
> >> in demotion path too, which are still under test, and will be posted
> >> soon. And the basic idea is similar to this patch.
> >
> > For that you need to consult each vma and it's owning task(s) and that
> > to me sounds like something to be done in folio_check_references.
> > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > 100% correct. Memory controller might be disabled and then you do not
> > have your association anymore.
> >
>
> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
> vmas.
>
> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?

How would that help for private mappings shared between parent/child?
Also, reducing this to a single VMA is not really necessary, as
folio_check_references already does most of that work. What is really
missing is the check for other memory policies (i.e. cpusets and per-task
mempolicy). The latter is what can get quite expensive.

> if it VM_SHARE it will be a shared policy we can find using vma->vm_file?
>
> For non anonymous and anon vma not having any policy set it is owning task vma->vm_mm->owner task policy ?

Please note that an mm can be shared even outside of the traditional thread
group, so you would need to go into something like mm_update_next_owner.

> We don't worry about multiple tasks that can be possibly sharing that page right?

Why not?

> > This all can get quite expensive so the primary question is, does the
> > existing behavior generates any real issues or is this more of an
> > correctness exercise? I mean it certainly is not great to demote to an
> > incompatible numa node but are there any reasonable configurations when
> > the demotion target node is explicitly excluded from memory
> > policy/cpuset?
>
> I guess vma policy is important. Applications want to make sure that they don't
> have variable performance and they go to lengths to avoid that by using MEM_BIND.
> So if they access the memory they always know access is satisfied from a specific
> set of NUMA nodes. Swapin can cause performance impact but then all continued
> access will be from a specific NUMA nodes. With slow memory demotion that is
> not going to be the case. Large in-memory database applications are observed to
> be sensitive to these access latencies.

Yes, I do understand that from the correctness POV this is a problem. My
question is whether this is a _practical_ problem really worth fixing,
as it is not really a cheap fix. If there are strong node-locality
assumptions by the userspace then I would expect demotion to be disabled
in the first place.
--
Michal Hocko
SUSE Labs

2022-10-26 12:13:51

by Aneesh Kumar K.V

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/26/22 4:32 PM, Michal Hocko wrote:
> On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
>> On 10/26/22 2:49 PM, Michal Hocko wrote:
>>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
>>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
>>>>>> In page reclaim path, memory could be demoted from faster memory tier
>>>>>> to slower memory tier. Currently, there is no check about cpuset's
>>>>>> memory policy, so even if the target demotion node is not allowed
>>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
>>>>>> semantics.
>>>>>>
>>>>>> So add cpuset policy check in the demotion path and skip demotion
>>>>>> if the demotion targets are not allowed by cpuset.
>>>>>>
>>>>>
>>>>> What about the vma policy or the task memory policy? Shouldn't we respect
>>>>> those memory policy restrictions while demoting the page?
>>>>
>>>> Good question! We have some basic patches to consider memory policy
>>>> in demotion path too, which are still under test, and will be posted
>>>> soon. And the basic idea is similar to this patch.
>>>
>>> For that you need to consult each vma and it's owning task(s) and that
>>> to me sounds like something to be done in folio_check_references.
>>> Relying on memcg to get a cpuset cgroup is really ugly and not really
>>> 100% correct. Memory controller might be disabled and then you do not
>>> have your association anymore.
>>>
>>
>> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
>> vmas.
>>
>> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
>
> How would that help for private mappings shared between parent/child?


Is this MAP_PRIVATE | MAP_SHARED? And won't they get converted to a shared policy for the backing shmem via:

	} else if (vm_flags & VM_SHARED) {
		error = shmem_zero_setup(vma);
		if (error)
			goto free_vma;
	} else {
		vma_set_anonymous(vma);
	}



> Also reducing this to a single VMA is not really necessary as
> folio_check_references already does most of that work. What is really
> missing is to check for other memory policies (i.e. cpusets and per-task
> mempolicy). The later is what can get quite expensive.
>


I agree that walking all the related vmas is already done in folio_check_references. I was
checking whether we really need to check all the vmas in the memory policy case.


>> if it VM_SHARE it will be a shared policy we can find using vma->vm_file?
>>
>> For non anonymous and anon vma not having any policy set it is owning task vma->vm_mm->owner task policy ?
>
> Please note that mm can be shared even outside of the traditional thread
> group so you would need to go into something like mm_update_next_owner
>
>> We don't worry about multiple tasks that can be possibly sharing that page right?
>
> Why not?
>

On the page fault side, for a non-anonymous vma we only respect the memory policy of the
task faulting the page in. With that restriction, could we always say that if the demotion
node is allowed by the policy of any task sharing this vma, we can demote the
page to that specific node?

>>> This all can get quite expensive so the primary question is, does the
>>> existing behavior generates any real issues or is this more of an
>>> correctness exercise? I mean it certainly is not great to demote to an
>>> incompatible numa node but are there any reasonable configurations when
>>> the demotion target node is explicitly excluded from memory
>>> policy/cpuset?
>>
>> I guess vma policy is important. Applications want to make sure that they don't
>> have variable performance and they go to lengths to avoid that by using MEM_BIND.
>> So if they access the memory they always know access is satisfied from a specific
>> set of NUMA nodes. Swapin can cause performance impact but then all continued
>> access will be from a specific NUMA nodes. With slow memory demotion that is
>> not going to be the case. Large in-memory database applications are observed to
>> be sensitive to these access latencies.
>
> Yes, I do understand that from the correctness POV this is a problem. My
> question is whether this is a _practical_ problem worth really being
> fixed as it is not really a cheap fix. If there are strong node locality
> assumptions by the userspace then I would expect demotion to be disabled
> in the first place.

Agreed. Right now, when these applications fail to allocate memory from
the node on which they are running, they fall back to nearby NUMA nodes rather than
failing the database query. They would want to prevent some allocations from falling
back to slow CXL memory, to avoid hot database tables getting allocated from the CXL
memory node. In that case, one way they can consume slow CXL memory is to partition
the data structures using membind and allow cold pages to be demoted to
slow CXL memory, making space for hot page allocations on the running NUMA node.

The other option is to run the application using fsdax.

I am still not clear which option will be used in the end.


-aneesh





2022-10-26 12:37:00

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > memory policy, so even if the target demotion node is not allowed
> > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > semantics.
> > > >
> > > > So add cpuset policy check in the demotion path and skip demotion
> > > > if the demotion targets are not allowed by cpuset.
> > > >
> > >
> > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > those memory policy restrictions while demoting the page?
> >
> > Good question! We have some basic patches to consider memory policy
> > in demotion path too, which are still under test, and will be posted
> > soon. And the basic idea is similar to this patch.
>
> For that you need to consult each vma and it's owning task(s) and that
> to me sounds like something to be done in folio_check_references.
> Relying on memcg to get a cpuset cgroup is really ugly and not really
> 100% correct. Memory controller might be disabled and then you do not
> have your association anymore.

You are right, for the cpuset case the solution depends on 'CONFIG_MEMCG=y',
and the bright side is that most distributions have it enabled.

> This all can get quite expensive so the primary question is, does the
> existing behavior generates any real issues or is this more of an
> correctness exercise? I mean it certainly is not great to demote to an
> incompatible numa node but are there any reasonable configurations when
> the demotion target node is explicitly excluded from memory
> policy/cpuset?

We haven't got a customer report on this, but there are quite a few customers
who use cpuset to bind specific memory nodes to a docker container (you've helped
us solve an OOM issue in such cases), so I think it's practical to respect
the cpuset semantics as much as we can.

Your concern about the expensive cost makes sense! Some raw ideas are:
* if shrink_folio_list is called by kswapd, the folios come from
  the same per-memcg lruvec, so only one check is enough
* if not from kswapd, e.g. called from madvise or DAMON code, we can
  keep a memcg cache, and if the next folio's memcg is the same as the
  cache, we reuse its result. And due to locality, the real
  check is rarely performed.
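
To illustrate the caching idea only, a rough sketch; demotion_allowed_by_cpuset()
is a made-up placeholder for the real cpuset check, not an existing kernel function:

/*
 * Sketch: the caller (e.g. shrink_folio_list()) keeps a one-entry cache on
 * its stack, so the expensive cpuset check only runs when the folio's memcg
 * changes between folios.
 */
struct demotion_cpuset_cache {
	struct mem_cgroup *memcg;
	bool allowed;
};

static bool folio_demotion_allowed(struct folio *folio,
				   struct demotion_cpuset_cache *cache)
{
	struct mem_cgroup *memcg = folio_memcg(folio);

	if (!memcg)
		return true;	/* no association, nothing to enforce */

	if (memcg != cache->memcg) {
		/* demotion_allowed_by_cpuset() stands in for the real
		 * (expensive) cpuset nodemask check. */
		cache->allowed = demotion_allowed_by_cpuset(memcg,
				next_demotion_node(folio_nid(folio)));
		cache->memcg = memcg;
	}
	return cache->allowed;
}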

Thanks,
Feng

> --
> Michal Hocko
> SUSE Labs
>

2022-10-26 13:07:44

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed 26-10-22 17:38:06, Aneesh Kumar K V wrote:
> On 10/26/22 4:32 PM, Michal Hocko wrote:
> > On Wed 26-10-22 16:12:25, Aneesh Kumar K V wrote:
> >> On 10/26/22 2:49 PM, Michal Hocko wrote:
> >>> On Wed 26-10-22 16:00:13, Feng Tang wrote:
> >>>> On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> >>>>> On 10/26/22 1:13 PM, Feng Tang wrote:
> >>>>>> In page reclaim path, memory could be demoted from faster memory tier
> >>>>>> to slower memory tier. Currently, there is no check about cpuset's
> > > > > > memory policy, so even if the target demotion node is not allowed
> >>>>>> by cpuset, the demotion will still happen, which breaks the cpuset
> >>>>>> semantics.
> >>>>>>
> >>>>>> So add cpuset policy check in the demotion path and skip demotion
> >>>>>> if the demotion targets are not allowed by cpuset.
> >>>>>>
> >>>>>
> >>>>> What about the vma policy or the task memory policy? Shouldn't we respect
> >>>>> those memory policy restrictions while demoting the page?
> >>>>
> >>>> Good question! We have some basic patches to consider memory policy
> >>>> in demotion path too, which are still under test, and will be posted
> >>>> soon. And the basic idea is similar to this patch.
> >>>
> >>> For that you need to consult each vma and it's owning task(s) and that
> >>> to me sounds like something to be done in folio_check_references.
> >>> Relying on memcg to get a cpuset cgroup is really ugly and not really
> >>> 100% correct. Memory controller might be disabled and then you do not
> >>> have your association anymore.
> >>>
> >>
> >> I was looking at this recently and I am wondering whether we should worry about VM_SHARE
> >> vmas.
> >>
> >> ie, page_to_policy() can just reverse lookup just one VMA and fetch the policy right?
> >
> > How would that help for private mappings shared between parent/child?
>
>
> this is MAP_PRIVATE | MAP_SHARED?

This is not a valid combination IIRC. What I meant is a simple
MAP_PRIVATE|MAP_ANON that is CoW shared between parent and child.

[...]
--
Michal Hocko
SUSE Labs

2022-10-26 16:02:16

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed 26-10-22 20:20:01, Feng Tang wrote:
> On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > > memory policy, so even if the target demotion node is not allowed
> > > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > > semantics.
> > > > >
> > > > > So add cpuset policy check in the demotion path and skip demotion
> > > > > if the demotion targets are not allowed by cpuset.
> > > > >
> > > >
> > > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > > those memory policy restrictions while demoting the page?
> > >
> > > Good question! We have some basic patches to consider memory policy
> > > in demotion path too, which are still under test, and will be posted
> > > soon. And the basic idea is similar to this patch.
> >
> > For that you need to consult each vma and it's owning task(s) and that
> > to me sounds like something to be done in folio_check_references.
> > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > 100% correct. Memory controller might be disabled and then you do not
> > have your association anymore.
>
> You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
> and the bright side is most of distribution have it on.

CONFIG_MEMCG=y is not sufficient. You would need to enable the memcg
controller at runtime as well.

> > This all can get quite expensive so the primary question is, does the
> > existing behavior generates any real issues or is this more of an
> > correctness exercise? I mean it certainly is not great to demote to an
> > incompatible numa node but are there any reasonable configurations when
> > the demotion target node is explicitly excluded from memory
> > policy/cpuset?
>
> We haven't got customer report on this, but there are quite some customers
> use cpuset to bind some specific memory nodes to a docker (You've helped
> us solve a OOM issue in such cases), so I think it's practical to respect
> the cpuset semantics as much as we can.

Yes, it is definitely better to respect cpusets and all local memory
policies. There is no dispute there. The thing is whether this is really
worth it. How often would cpusets (or policies in general) go actively
against demotion nodes (i.e. exclude those nodes from their allowed node
mask)?

I can imagine workloads which wouldn't like to get their memory demoted
for some reason, but wouldn't it be more practical to tell that
explicitly (e.g. via prctl) rather than configuring cpusets/memory
policies explicitly?

> Your concern about the expensive cost makes sense! Some raw ideas are:
> * if the shrink_folio_list is called by kswapd, the folios come from
> the same per-memcg lruvec, so only one check is enough
> * if not from kswapd, like called form madvise or DAMON code, we can
> save a memcg cache, and if the next folio's memcg is same as the
> cache, we reuse its result. And due to the locality, the real
> check is rarely performed.

memcg is not the expensive part of the thing. You need to get from page
-> all vmas::vm_policy -> mm -> task::mempolicy

--
Michal Hocko
SUSE Labs

2022-10-26 18:08:08

by Yang Shi

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 26-10-22 20:20:01, Feng Tang wrote:
> > On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
> > > On Wed 26-10-22 16:00:13, Feng Tang wrote:
> > > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
> > > > > On 10/26/22 1:13 PM, Feng Tang wrote:
> > > > > > In page reclaim path, memory could be demoted from faster memory tier
> > > > > > to slower memory tier. Currently, there is no check about cpuset's
> > > > > > memory policy, so even if the target demotion node is not allowed
> > > > > > by cpuset, the demotion will still happen, which breaks the cpuset
> > > > > > semantics.
> > > > > >
> > > > > > So add cpuset policy check in the demotion path and skip demotion
> > > > > > if the demotion targets are not allowed by cpuset.
> > > > > >
> > > > >
> > > > > What about the vma policy or the task memory policy? Shouldn't we respect
> > > > > those memory policy restrictions while demoting the page?
> > > >
> > > > Good question! We have some basic patches to consider memory policy
> > > > in demotion path too, which are still under test, and will be posted
> > > > soon. And the basic idea is similar to this patch.
> > >
> > > For that you need to consult each vma and it's owning task(s) and that
> > > to me sounds like something to be done in folio_check_references.
> > > Relying on memcg to get a cpuset cgroup is really ugly and not really
> > > 100% correct. Memory controller might be disabled and then you do not
> > > have your association anymore.
> >
> > You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
> > and the bright side is most of distribution have it on.
>
> CONFIG_MEMCG=y is not sufficient. You would need to enable memcg
> controller during the runtime as well.
>
> > > This all can get quite expensive so the primary question is, does the
> > > existing behavior generates any real issues or is this more of an
> > > correctness exercise? I mean it certainly is not great to demote to an
> > > incompatible numa node but are there any reasonable configurations when
> > > the demotion target node is explicitly excluded from memory
> > > policy/cpuset?
> >
> > We haven't got customer report on this, but there are quite some customers
> > use cpuset to bind some specific memory nodes to a docker (You've helped
> > us solve a OOM issue in such cases), so I think it's practical to respect
> > the cpuset semantics as much as we can.
>
> Yes, it is definitely better to respect cpusets and all local memory
> policies. There is no dispute there. The thing is whether this is really
> worth it. How often would cpusets (or policies in general) go actively
> against demotion nodes (i.e. exclude those nodes from their allowes node
> mask)?
>
> I can imagine workloads which wouldn't like to get their memory demoted
> for some reason but wouldn't it be more practical to tell that
> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> policies explicitly?
>
> > Your concern about the expensive cost makes sense! Some raw ideas are:
> > * if the shrink_folio_list is called by kswapd, the folios come from
> > the same per-memcg lruvec, so only one check is enough
> > * if not from kswapd, like called form madvise or DAMON code, we can
> > save a memcg cache, and if the next folio's memcg is same as the
> > cache, we reuse its result. And due to the locality, the real
> > check is rarely performed.
>
> memcg is not the expensive part of the thing. You need to get from page
> -> all vmas::vm_policy -> mm -> task::mempolicy

Yeah, on the same page with Michal. Figuring out the mempolicy from a page
seems quite expensive, and correctness can't be guaranteed since the
mempolicy can be set per-thread, and mm->owner depends on
CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.

>
> --
> Michal Hocko
> SUSE Labs
>

2022-10-27 07:11:46

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Michal Hocko <[email protected]> writes:

> On Wed 26-10-22 20:20:01, Feng Tang wrote:
>> On Wed, Oct 26, 2022 at 05:19:50PM +0800, Michal Hocko wrote:
>> > On Wed 26-10-22 16:00:13, Feng Tang wrote:
>> > > On Wed, Oct 26, 2022 at 03:49:48PM +0800, Aneesh Kumar K V wrote:
>> > > > On 10/26/22 1:13 PM, Feng Tang wrote:
>> > > > > In page reclaim path, memory could be demoted from faster memory tier
>> > > > > to slower memory tier. Currently, there is no check about cpuset's
>> > > > > memory policy, so even if the target demotion node is not allowed
>> > > > > by cpuset, the demotion will still happen, which breaks the cpuset
>> > > > > semantics.
>> > > > >
>> > > > > So add cpuset policy check in the demotion path and skip demotion
>> > > > > if the demotion targets are not allowed by cpuset.
>> > > > >
>> > > >
>> > > > What about the vma policy or the task memory policy? Shouldn't we respect
>> > > > those memory policy restrictions while demoting the page?
>> > >
>> > > Good question! We have some basic patches to consider memory policy
>> > > in demotion path too, which are still under test, and will be posted
>> > > soon. And the basic idea is similar to this patch.
>> >
>> > For that you need to consult each vma and it's owning task(s) and that
>> > to me sounds like something to be done in folio_check_references.
>> > Relying on memcg to get a cpuset cgroup is really ugly and not really
>> > 100% correct. Memory controller might be disabled and then you do not
>> > have your association anymore.
>>
>> You are right, for cpuset case, the solution depends on 'CONFIG_MEMCG=y',
>> and the bright side is most of distribution have it on.
>
> CONFIG_MEMCG=y is not sufficient. You would need to enable memcg
> controller during the runtime as well.
>
>> > This all can get quite expensive so the primary question is, does the
>> > existing behavior generates any real issues or is this more of an
>> > correctness exercise? I mean it certainly is not great to demote to an
>> > incompatible numa node but are there any reasonable configurations when
>> > the demotion target node is explicitly excluded from memory
>> > policy/cpuset?
>>
>> We haven't got customer report on this, but there are quite some customers
>> use cpuset to bind some specific memory nodes to a docker (You've helped
>> us solve a OOM issue in such cases), so I think it's practical to respect
>> the cpuset semantics as much as we can.
>
> Yes, it is definitely better to respect cpusets and all local memory
> policies. There is no dispute there. The thing is whether this is really
> worth it. How often would cpusets (or policies in general) go actively
> against demotion nodes (i.e. exclude those nodes from their allowes node
> mask)?
>
> I can imagine workloads which wouldn't like to get their memory demoted
> for some reason but wouldn't it be more practical to tell that
> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> policies explicitly?

If my understanding is correct, prctl() configures the process or
thread. How can we get the process/thread configuration at demotion time?

Best Regards,
Huang, Ying

2022-10-27 07:20:13

by Michal Hocko

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu 27-10-22 14:47:22, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
[...]
> > I can imagine workloads which wouldn't like to get their memory demoted
> > for some reason but wouldn't it be more practical to tell that
> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > policies explicitly?
>
> If my understanding were correct, prctl() configures the process or
> thread.

Not necessarily. There are properties which are per address space, like
PR_[GS]ET_THP_DISABLE. This could be very similar.

> How can we get process/thread configuration at demotion time?

As already pointed out in previous emails, you could hook into the
folio_check_references path, more specifically folio_referenced_one,
where you already have all that you need: all vmas mapping the page, from
which it is trivial to get the corresponding vm_mm. If at least one of
them has the flag set then demotion is not allowed (essentially the
same model as VM_LOCKED).
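
To make the model concrete, a rough sketch assuming a hypothetical per-mm flag
MMF_NO_DEMOTE that such a prctl would set; the flag and the function name are
illustrative only, not existing kernel API:

/* Hypothetical rmap_one callback: forbid demotion if any mapping mm has
 * the made-up MMF_NO_DEMOTE flag set (same model as VM_LOCKED). */
static bool folio_check_no_demote_one(struct folio *folio,
				      struct vm_area_struct *vma,
				      unsigned long addr, void *arg)
{
	bool *demotion_forbidden = arg;

	if (vma->vm_mm && test_bit(MMF_NO_DEMOTE, &vma->vm_mm->flags)) {
		*demotion_forbidden = true;
		return false;	/* one hit is enough, stop the walk */
	}
	return true;		/* keep walking the remaining vmas */
}
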
--
Michal Hocko
SUSE Labs

2022-10-27 08:04:39

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
[...]
> > > > This all can get quite expensive so the primary question is, does the
> > > > existing behavior generates any real issues or is this more of an
> > > > correctness exercise? I mean it certainly is not great to demote to an
> > > > incompatible numa node but are there any reasonable configurations when
> > > > the demotion target node is explicitly excluded from memory
> > > > policy/cpuset?
> > >
> > > We haven't got customer report on this, but there are quite some customers
> > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > > us solve a OOM issue in such cases), so I think it's practical to respect
> > > the cpuset semantics as much as we can.
> >
> > Yes, it is definitely better to respect cpusets and all local memory
> > policies. There is no dispute there. The thing is whether this is really
> > worth it. How often would cpusets (or policies in general) go actively
> > against demotion nodes (i.e. exclude those nodes from their allowes node
> > mask)?
> >
> > I can imagine workloads which wouldn't like to get their memory demoted
> > for some reason but wouldn't it be more practical to tell that
> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > policies explicitly?
> >
> > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > > * if the shrink_folio_list is called by kswapd, the folios come from
> > > the same per-memcg lruvec, so only one check is enough
> > > * if not from kswapd, like called form madvise or DAMON code, we can
> > > save a memcg cache, and if the next folio's memcg is same as the
> > > cache, we reuse its result. And due to the locality, the real
> > > check is rarely performed.
> >
> > memcg is not the expensive part of the thing. You need to get from page
> > -> all vmas::vm_policy -> mm -> task::mempolicy
>
> Yeah, on the same page with Michal. Figuring out mempolicy from page
> seems quite expensive and the correctness can't be guranteed since the
> mempolicy could be set per-thread and the mm->task depends on
> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.

Yes, you are right. Our "working" pseudo code for the memory policy looks like
what Michal mentioned; it can't work for all cases, but it tries to
enforce the policy whenever possible:

static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
				  unsigned long addr, void *arg)
{
	bool *skip_demotion = arg;
	struct mempolicy *mpol;
	int nid, dnid;
	bool ret = true;

	mpol = __get_vma_policy(vma, addr);
	if (!mpol) {
		struct task_struct *task;
		if (vma->vm_mm)
			task = vma->vm_mm->owner;

		if (task) {
			mpol = get_task_policy(task);
			if (mpol)
				mpol_get(mpol);
		}
	}

	if (!mpol)
		return ret;

	if (mpol->mode != MPOL_BIND)
		goto put_exit;

	nid = folio_nid(folio);
	dnid = next_demotion_node(nid);
	if (!node_isset(dnid, mpol->nodes)) {
		*skip_demotion = true;
		ret = false;
	}

put_exit:
	mpol_put(mpol);
	return ret;
}

static unsigned int shrink_page_list(struct list_head *page_list,..)
{
	...

	bool skip_demotion = false;
	struct rmap_walk_control rwc = {
		.arg = &skip_demotion,
		.rmap_one = __check_mpol_demotion,
	};

	/* memory policy check */
	rmap_walk(folio, &rwc);
	if (skip_demotion)
		goto keep_locked;
}

And there seems to be no simple solution for getting the memory
policy from a page.

Thanks,
Feng

> >
> > --
> > Michal Hocko
> > SUSE Labs
> >
>

2022-10-27 08:16:11

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Michal Hocko <[email protected]> writes:

> On Thu 27-10-22 14:47:22, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
> [...]
>> > I can imagine workloads which wouldn't like to get their memory demoted
>> > for some reason but wouldn't it be more practical to tell that
>> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > policies explicitly?
>>
>> If my understanding were correct, prctl() configures the process or
>> thread.
>
> Not necessarily. There are properties which are per adddress space like
> PR_[GS]ET_THP_DISABLE. This could be very similar.
>
>> How can we get process/thread configuration at demotion time?
>
> As already pointed out in previous emails. You could hook into
> folio_check_references path, more specifically folio_referenced_one
> where you have all that you need already - all vmas mapping the page and
> then it is trivial to get the corresponding vm_mm. If at least one of
> them has the flag set then the demotion is not allowed (essentially the
> same model as VM_LOCKED).

Got it! Thanks for the detailed explanation.

One bit may not be sufficient. For example, if we want to avoid or
control cross-socket demotion while still allowing demotion to slow memory
nodes in the local socket, we need to specify a node mask to exclude some
NUMA nodes from the demotion targets.

From an overhead point of view, this appears similar to that of the VMA/task
memory policy? We can make mm->owner available for memory tiers
(CONFIG_NUMA && CONFIG_MIGRATION). The advantage is that we don't need
to introduce a new ABI. I guess users may prefer to use `numactl` over a
new ABI?

Best Regards,
Huang, Ying

2022-10-27 08:19:56

by Huang, Ying

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Feng Tang <[email protected]> writes:

> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> [...]
>> > > > This all can get quite expensive so the primary question is, does the
>> > > > existing behavior generates any real issues or is this more of an
>> > > > correctness exercise? I mean it certainly is not great to demote to an
>> > > > incompatible numa node but are there any reasonable configurations when
>> > > > the demotion target node is explicitly excluded from memory
>> > > > policy/cpuset?
>> > >
>> > > We haven't got customer report on this, but there are quite some customers
>> > > use cpuset to bind some specific memory nodes to a docker (You've helped
>> > > us solve a OOM issue in such cases), so I think it's practical to respect
>> > > the cpuset semantics as much as we can.
>> >
>> > Yes, it is definitely better to respect cpusets and all local memory
>> > policies. There is no dispute there. The thing is whether this is really
>> > worth it. How often would cpusets (or policies in general) go actively
>> > against demotion nodes (i.e. exclude those nodes from their allowes node
>> > mask)?
>> >
>> > I can imagine workloads which wouldn't like to get their memory demoted
>> > for some reason but wouldn't it be more practical to tell that
>> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > policies explicitly?
>> >
>> > > Your concern about the expensive cost makes sense! Some raw ideas are:
>> > > * if the shrink_folio_list is called by kswapd, the folios come from
>> > > the same per-memcg lruvec, so only one check is enough
>> > > * if not from kswapd, like called form madvise or DAMON code, we can
>> > > save a memcg cache, and if the next folio's memcg is same as the
>> > > cache, we reuse its result. And due to the locality, the real
>> > > check is rarely performed.
>> >
>> > memcg is not the expensive part of the thing. You need to get from page
>> > -> all vmas::vm_policy -> mm -> task::mempolicy
>>
>> Yeah, on the same page with Michal. Figuring out mempolicy from page
>> seems quite expensive and the correctness can't be guranteed since the
>> mempolicy could be set per-thread and the mm->task depends on
>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>
> Yes, you are right. Our "working" psudo code for mem policy looks like
> what Michal mentioned, and it can't work for all cases, but try to
> enforce it whenever possible:
>
> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> unsigned long addr, void *arg)
> {
> bool *skip_demotion = arg;
> struct mempolicy *mpol;
> int nid, dnid;
> bool ret = true;
>
> mpol = __get_vma_policy(vma, addr);
> if (!mpol) {
> struct task_struct *task;

task = NULL;

> if (vma->vm_mm)
> task = vma->vm_mm->owner;
>
> if (task) {
> mpol = get_task_policy(task);
> if (mpol)
> mpol_get(mpol);
> }
> }
>
> if (!mpol)
> return ret;
>
> if (mpol->mode != MPOL_BIND)
> goto put_exit;
>
> nid = folio_nid(folio);
> dnid = next_demotion_node(nid);
> if (!node_isset(dnid, mpol->nodes)) {
> *skip_demotion = true;
> ret = false;
> }

I think that you need to get a node mask instead. Even if
!node_isset(dnid, mpol->nodes), you may demote to another node in the node
mask.
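
Something along these lines, as a rough sketch; node_demotion_targets() is a
hypothetical helper returning the full demotion-target nodemask for the source
node, not an existing function:

	nodemask_t targets = node_demotion_targets(folio_nid(folio));

	/* Demotion is only forbidden if no allowed target remains. */
	nodes_and(targets, targets, mpol->nodes);
	if (nodes_empty(targets)) {
		*skip_demotion = true;
		ret = false;
	}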

Best Regards,
Huang, Ying

>
> put_exit:
> mpol_put(mpol);
> return ret;
> }
>
> static unsigned int shrink_page_list(struct list_head *page_list,..)
> {
> ...
>
> bool skip_demotion = false;
> struct rmap_walk_control rwc = {
> .arg = &skip_demotion,
> .rmap_one = __check_mpol_demotion,
> };
>
> /* memory policy check */
> rmap_walk(folio, &rwc);
> if (skip_demotion)
> goto keep_locked;
> }
>
> And there seems to be no simple solution for getting the memory
> policy from a page.
>
> Thanks,
> Feng
>
>> >
>> > --
>> > Michal Hocko
>> > SUSE Labs
>> >
>>

2022-10-27 08:23:35

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 03:45:12PM +0800, Huang, Ying wrote:
> Feng Tang <[email protected]> writes:
>
> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> >> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> > [...]
> >> > > > This all can get quite expensive so the primary question is, does the
> >> > > > existing behavior generates any real issues or is this more of an
> >> > > > correctness exercise? I mean it certainly is not great to demote to an
> >> > > > incompatible numa node but are there any reasonable configurations when
> >> > > > the demotion target node is explicitly excluded from memory
> >> > > > policy/cpuset?
> >> > >
> >> > > We haven't got customer report on this, but there are quite some customers
> >> > > use cpuset to bind some specific memory nodes to a docker (You've helped
> >> > > us solve a OOM issue in such cases), so I think it's practical to respect
> >> > > the cpuset semantics as much as we can.
> >> >
> >> > Yes, it is definitely better to respect cpusets and all local memory
> >> > policies. There is no dispute there. The thing is whether this is really
> >> > worth it. How often would cpusets (or policies in general) go actively
> >> > against demotion nodes (i.e. exclude those nodes from their allowes node
> >> > mask)?
> >> >
> >> > I can imagine workloads which wouldn't like to get their memory demoted
> >> > for some reason but wouldn't it be more practical to tell that
> >> > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> > policies explicitly?
> >> >
> >> > > Your concern about the expensive cost makes sense! Some raw ideas are:
> >> > > * if the shrink_folio_list is called by kswapd, the folios come from
> >> > > the same per-memcg lruvec, so only one check is enough
> >> > > * if not from kswapd, like called form madvise or DAMON code, we can
> >> > > save a memcg cache, and if the next folio's memcg is same as the
> >> > > cache, we reuse its result. And due to the locality, the real
> >> > > check is rarely performed.
> >> >
> >> > memcg is not the expensive part of the thing. You need to get from page
> >> > -> all vmas::vm_policy -> mm -> task::mempolicy
> >>
> >> Yeah, on the same page with Michal. Figuring out mempolicy from page
> >> seems quite expensive and the correctness can't be guranteed since the
> >> mempolicy could be set per-thread and the mm->task depends on
> >> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >
> > Yes, you are right. Our "working" psudo code for mem policy looks like
> > what Michal mentioned, and it can't work for all cases, but try to
> > enforce it whenever possible:
> >
> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> > unsigned long addr, void *arg)
> > {
> > bool *skip_demotion = arg;
> > struct mempolicy *mpol;
> > int nid, dnid;
> > bool ret = true;
> >
> > mpol = __get_vma_policy(vma, addr);
> > if (!mpol) {
> > struct task_struct *task;
>
> task = NULL;
>
> > if (vma->vm_mm)
> > task = vma->vm_mm->owner;
> >
> > if (task) {
> > mpol = get_task_policy(task);
> > if (mpol)
> > mpol_get(mpol);
> > }
> > }
> >
> > if (!mpol)
> > return ret;
> >
> > if (mpol->mode != MPOL_BIND)
> > goto put_exit;
> >
> > nid = folio_nid(folio);
> > dnid = next_demotion_node(nid);
> > if (!node_isset(dnid, mpol->nodes)) {
> > *skip_demotion = true;
> > ret = false;
> > }
>
> I think that you need to get a node mask instead. Even if
> !node_isset(dnid, mpol->nodes), you may demote to other node in the node
> mask.

Yes, you are right. This code was written/tested about 2 months ago,
before Aneesh's memory tiering interface patchset. It was listed
to demonstrate the idea of the solution.

Thanks,
Feng

> Best Regards,
> Huang, Ying

2022-10-27 18:14:24

by Yang Shi

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
>
> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> [...]
> > > > > This all can get quite expensive so the primary question is, does the
> > > > > existing behavior generates any real issues or is this more of an
> > > > > correctness exercise? I mean it certainly is not great to demote to an
> > > > > incompatible numa node but are there any reasonable configurations when
> > > > > the demotion target node is explicitly excluded from memory
> > > > > policy/cpuset?
> > > >
> > > > We haven't got customer report on this, but there are quite some customers
> > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > > > us solve a OOM issue in such cases), so I think it's practical to respect
> > > > the cpuset semantics as much as we can.
> > >
> > > Yes, it is definitely better to respect cpusets and all local memory
> > > policies. There is no dispute there. The thing is whether this is really
> > > worth it. How often would cpusets (or policies in general) go actively
> > > against demotion nodes (i.e. exclude those nodes from their allowes node
> > > mask)?
> > >
> > > I can imagine workloads which wouldn't like to get their memory demoted
> > > for some reason but wouldn't it be more practical to tell that
> > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > > policies explicitly?
> > >
> > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > > > * if the shrink_folio_list is called by kswapd, the folios come from
> > > > the same per-memcg lruvec, so only one check is enough
> > > > * if not from kswapd, like called form madvise or DAMON code, we can
> > > > save a memcg cache, and if the next folio's memcg is same as the
> > > > cache, we reuse its result. And due to the locality, the real
> > > > check is rarely performed.
> > >
> > > memcg is not the expensive part of the thing. You need to get from page
> > > -> all vmas::vm_policy -> mm -> task::mempolicy
> >
> > Yeah, on the same page with Michal. Figuring out mempolicy from page
> > seems quite expensive and the correctness can't be guranteed since the
> > mempolicy could be set per-thread and the mm->task depends on
> > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>
> Yes, you are right. Our "working" psudo code for mem policy looks like
> what Michal mentioned, and it can't work for all cases, but try to
> enforce it whenever possible:
>
> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> unsigned long addr, void *arg)
> {
> bool *skip_demotion = arg;
> struct mempolicy *mpol;
> int nid, dnid;
> bool ret = true;
>
> mpol = __get_vma_policy(vma, addr);
> if (!mpol) {
> struct task_struct *task;
> if (vma->vm_mm)
> task = vma->vm_mm->owner;

But this task may not be the task you want, IIUC. For example, the
process has two threads, A and B, and they have different mempolicies. The
vmscan is trying to demote a page belonging to thread A, but the task
may point to thread B, so you actually get the wrong mempolicy, IIUC.

>
> if (task) {
> mpol = get_task_policy(task);
> if (mpol)
> mpol_get(mpol);
> }
> }
>
> if (!mpol)
> return ret;
>
> if (mpol->mode != MPOL_BIND)
> goto put_exit;
>
> nid = folio_nid(folio);
> dnid = next_demotion_node(nid);
> if (!node_isset(dnid, mpol->nodes)) {
> *skip_demotion = true;
> ret = false;
> }
>
> put_exit:
> mpol_put(mpol);
> return ret;
> }
>
> static unsigned int shrink_page_list(struct list_head *page_list,..)
> {
> ...
>
> bool skip_demotion = false;
> struct rmap_walk_control rwc = {
> .arg = &skip_demotion,
> .rmap_one = __check_mpol_demotion,
> };
>
> /* memory policy check */
> rmap_walk(folio, &rwc);
> if (skip_demotion)
> goto keep_locked;
> }
>
> And there seems to be no simple solution for getting the memory
> policy from a page.
>
> Thanks,
> Feng
>
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs
> > >
> >

2022-10-28 04:01:35

by Feng Tang

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
> >
> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> > [...]
> > > > > > This all can get quite expensive so the primary question is, does the
> > > > > > existing behavior generates any real issues or is this more of an
> > > > > > correctness exercise? I mean it certainly is not great to demote to an
> > > > > > incompatible numa node but are there any reasonable configurations when
> > > > > > the demotion target node is explicitly excluded from memory
> > > > > > policy/cpuset?
> > > > >
> > > > > We haven't got customer report on this, but there are quite some customers
> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
> > > > > the cpuset semantics as much as we can.
> > > >
> > > > Yes, it is definitely better to respect cpusets and all local memory
> > > > policies. There is no dispute there. The thing is whether this is really
> > > > worth it. How often would cpusets (or policies in general) go actively
> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
> > > > mask)?
> > > >
> > > > I can imagine workloads which wouldn't like to get their memory demoted
> > > > for some reason but wouldn't it be more practical to tell that
> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > > > policies explicitly?
> > > >
> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
> > > > > the same per-memcg lruvec, so only one check is enough
> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
> > > > > save a memcg cache, and if the next folio's memcg is same as the
> > > > > cache, we reuse its result. And due to the locality, the real
> > > > > check is rarely performed.
> > > >
> > > > memcg is not the expensive part of the thing. You need to get from page
> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
> > >
> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
> > > seems quite expensive and the correctness can't be guranteed since the
> > > mempolicy could be set per-thread and the mm->task depends on
> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >
> > Yes, you are right. Our "working" psudo code for mem policy looks like
> > what Michal mentioned, and it can't work for all cases, but try to
> > enforce it whenever possible:
> >
> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> > unsigned long addr, void *arg)
> > {
> > bool *skip_demotion = arg;
> > struct mempolicy *mpol;
> > int nid, dnid;
> > bool ret = true;
> >
> > mpol = __get_vma_policy(vma, addr);
> > if (!mpol) {
> > struct task_struct *task;
> > if (vma->vm_mm)
> > task = vma->vm_mm->owner;
>
> But this task may not be the task you want IIUC. For example, the
> process has two threads, A and B. They have different mempolicy. The
> vmscan is trying to demote a page belonging to thread A, but the task
> may point to thread B, so you actually get the wrong mempolicy IIUC.

Yes, this is a valid concern! We don't have a good solution for this.
For the memory policy, we may only handle the per-vma policy for now, whose
cost is relatively low, as a best-effort attempt.

Thanks,
Feng



2022-10-28 05:13:02

by Aneesh Kumar K.V

Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On 10/27/22 11:25 PM, Yang Shi wrote:
> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
>>
>> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
>> [...]
>>>>>> This all can get quite expensive so the primary question is, does the
>>>>>> existing behavior generates any real issues or is this more of an
>>>>>> correctness exercise? I mean it certainly is not great to demote to an
>>>>>> incompatible numa node but are there any reasonable configurations when
>>>>>> the demotion target node is explicitly excluded from memory
>>>>>> policy/cpuset?
>>>>>
>>>>> We haven't got customer report on this, but there are quite some customers
>>>>> use cpuset to bind some specific memory nodes to a docker (You've helped
>>>>> us solve a OOM issue in such cases), so I think it's practical to respect
>>>>> the cpuset semantics as much as we can.
>>>>
>>>> Yes, it is definitely better to respect cpusets and all local memory
>>>> policies. There is no dispute there. The thing is whether this is really
>>>> worth it. How often would cpusets (or policies in general) go actively
>>>> against demotion nodes (i.e. exclude those nodes from their allowes node
>>>> mask)?
>>>>
>>>> I can imagine workloads which wouldn't like to get their memory demoted
>>>> for some reason but wouldn't it be more practical to tell that
>>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory
>>>> policies explicitly?
>>>>
>>>>> Your concern about the expensive cost makes sense! Some raw ideas are:
>>>>> * if the shrink_folio_list is called by kswapd, the folios come from
>>>>> the same per-memcg lruvec, so only one check is enough
>>>>> * if not from kswapd, like called form madvise or DAMON code, we can
>>>>> save a memcg cache, and if the next folio's memcg is same as the
>>>>> cache, we reuse its result. And due to the locality, the real
>>>>> check is rarely performed.
>>>>
>>>> memcg is not the expensive part of the thing. You need to get from page
>>>> -> all vmas::vm_policy -> mm -> task::mempolicy
>>>
>>> Yeah, on the same page with Michal. Figuring out mempolicy from page
>>> seems quite expensive and the correctness can't be guranteed since the
>>> mempolicy could be set per-thread and the mm->task depends on
>>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>>
>> Yes, you are right. Our "working" psudo code for mem policy looks like
>> what Michal mentioned, and it can't work for all cases, but try to
>> enforce it whenever possible:
>>
>> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> unsigned long addr, void *arg)
>> {
>> bool *skip_demotion = arg;
>> struct mempolicy *mpol;
>> int nid, dnid;
>> bool ret = true;
>>
>> mpol = __get_vma_policy(vma, addr);
>> if (!mpol) {
>> struct task_struct *task;
>> if (vma->vm_mm)
>> task = vma->vm_mm->owner;
>
> But this task may not be the task you want IIUC. For example, the
> process has two threads, A and B. They have different mempolicy. The
> vmscan is trying to demote a page belonging to thread A, but the task
> may point to thread B, so you actually get the wrong mempolicy IIUC.
>

But if we swap out this page and fault it back in via thread B, the page would
get allocated as per thread B's mempolicy. So if we demote based on thread B's
policy, are we breaking anything?

-aneesh



2022-10-28 06:04:16

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Feng Tang <[email protected]> writes:

> On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
>> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
>> >
>> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
>> > [...]
>> > > > > > This all can get quite expensive so the primary question is, does the
>> > > > > > existing behavior generates any real issues or is this more of an
>> > > > > > correctness exercise? I mean it certainly is not great to demote to an
>> > > > > > incompatible numa node but are there any reasonable configurations when
>> > > > > > the demotion target node is explicitly excluded from memory
>> > > > > > policy/cpuset?
>> > > > >
>> > > > > We haven't got customer report on this, but there are quite some customers
>> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
>> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
>> > > > > the cpuset semantics as much as we can.
>> > > >
>> > > > Yes, it is definitely better to respect cpusets and all local memory
>> > > > policies. There is no dispute there. The thing is whether this is really
>> > > > worth it. How often would cpusets (or policies in general) go actively
>> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
>> > > > mask)?
>> > > >
>> > > > I can imagine workloads which wouldn't like to get their memory demoted
>> > > > for some reason but wouldn't it be more practical to tell that
>> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> > > > policies explicitly?
>> > > >
>> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
>> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
>> > > > > the same per-memcg lruvec, so only one check is enough
>> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
>> > > > > save a memcg cache, and if the next folio's memcg is same as the
>> > > > > cache, we reuse its result. And due to the locality, the real
>> > > > > check is rarely performed.
>> > > >
>> > > > memcg is not the expensive part of the thing. You need to get from page
>> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
>> > >
>> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
>> > > seems quite expensive and the correctness can't be guranteed since the
>> > > mempolicy could be set per-thread and the mm->task depends on
>> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>> >
>> > Yes, you are right. Our "working" psudo code for mem policy looks like
>> > what Michal mentioned, and it can't work for all cases, but try to
>> > enforce it whenever possible:
>> >
>> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> > unsigned long addr, void *arg)
>> > {
>> > bool *skip_demotion = arg;
>> > struct mempolicy *mpol;
>> > int nid, dnid;
>> > bool ret = true;
>> >
>> > mpol = __get_vma_policy(vma, addr);
>> > if (!mpol) {
>> > struct task_struct *task;
>> > if (vma->vm_mm)
>> > task = vma->vm_mm->owner;
>>
>> But this task may not be the task you want IIUC. For example, the
>> process has two threads, A and B. They have different mempolicy. The
>> vmscan is trying to demote a page belonging to thread A, but the task
>> may point to thread B, so you actually get the wrong mempolicy IIUC.
>
> Yes, this is a valid concern! We don't have good solution for this.
> For memory policy, we may only handle the per-vma policy for now whose
> cost is relatively low, as a best-effort try.

Yes. The solution isn't perfect, especially for multi-threaded
processes with thread-specific memory policies. But the proposed code
above can at least support the most common case, that is, running a
workload with `numactl`.
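
(For instance, something as simple as `numactl --membind=0 ./workload`,
which installs a task-wide MPOL_BIND policy covering only node 0; the node
number and command are of course just illustrative.)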

Best Regards,
Huang, Ying

2022-10-28 17:34:30

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 10:09 PM Aneesh Kumar K V
<[email protected]> wrote:
>
> On 10/27/22 11:25 PM, Yang Shi wrote:
> > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
> >>
> >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> >> [...]
> >>>>>> This all can get quite expensive so the primary question is, does the
> >>>>>> existing behavior generates any real issues or is this more of an
> >>>>>> correctness exercise? I mean it certainly is not great to demote to an
> >>>>>> incompatible numa node but are there any reasonable configurations when
> >>>>>> the demotion target node is explicitly excluded from memory
> >>>>>> policy/cpuset?
> >>>>>
> >>>>> We haven't got customer report on this, but there are quite some customers
> >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped
> >>>>> us solve a OOM issue in such cases), so I think it's practical to respect
> >>>>> the cpuset semantics as much as we can.
> >>>>
> >>>> Yes, it is definitely better to respect cpusets and all local memory
> >>>> policies. There is no dispute there. The thing is whether this is really
> >>>> worth it. How often would cpusets (or policies in general) go actively
> >>>> against demotion nodes (i.e. exclude those nodes from their allowes node
> >>>> mask)?
> >>>>
> >>>> I can imagine workloads which wouldn't like to get their memory demoted
> >>>> for some reason but wouldn't it be more practical to tell that
> >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >>>> policies explicitly?
> >>>>
> >>>>> Your concern about the expensive cost makes sense! Some raw ideas are:
> >>>>> * if the shrink_folio_list is called by kswapd, the folios come from
> >>>>> the same per-memcg lruvec, so only one check is enough
> >>>>> * if not from kswapd, like called form madvise or DAMON code, we can
> >>>>> save a memcg cache, and if the next folio's memcg is same as the
> >>>>> cache, we reuse its result. And due to the locality, the real
> >>>>> check is rarely performed.
> >>>>
> >>>> memcg is not the expensive part of the thing. You need to get from page
> >>>> -> all vmas::vm_policy -> mm -> task::mempolicy
> >>>
> >>> Yeah, on the same page with Michal. Figuring out mempolicy from page
> >>> seems quite expensive and the correctness can't be guranteed since the
> >>> mempolicy could be set per-thread and the mm->task depends on
> >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >>
> >> Yes, you are right. Our "working" psudo code for mem policy looks like
> >> what Michal mentioned, and it can't work for all cases, but try to
> >> enforce it whenever possible:
> >>
> >> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> >> unsigned long addr, void *arg)
> >> {
> >> bool *skip_demotion = arg;
> >> struct mempolicy *mpol;
> >> int nid, dnid;
> >> bool ret = true;
> >>
> >> mpol = __get_vma_policy(vma, addr);
> >> if (!mpol) {
> >> struct task_struct *task;
> >> if (vma->vm_mm)
> >> task = vma->vm_mm->owner;
> >
> > But this task may not be the task you want IIUC. For example, the
> > process has two threads, A and B. They have different mempolicy. The
> > vmscan is trying to demote a page belonging to thread A, but the task
> > may point to thread B, so you actually get the wrong mempolicy IIUC.
> >
>
> But if we swap out this page and fault back in via thread B the page would
> get allocated as per thread B mempolicy. So if we demote based on thread B
> policy are we breaking anything?

If the page is demoted by following thread B's mempolicy, didn't it
already break thread A's mempolicy in the first place, if you care
about it? If thread A and thread B have the same mempolicy, then it is
not a problem.

Actually there is another problem for shared pages. If a page is shared
by two processes, P1 and P2, then when you do the rmap walk to find the
task, you may find two contradictory mempolicies; which mempolicy would
you like to obey? Do you have to save all the intermediate mempolicy
results somewhere, or do you just bail out once the first mempolicy is
found?

>
> -aneesh
>
>

2022-10-28 17:37:02

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <[email protected]> wrote:
>
> Feng Tang <[email protected]> writes:
>
> > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
> >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
> >> >
> >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> >> > [...]
> >> > > > > > This all can get quite expensive so the primary question is, does the
> >> > > > > > existing behavior generates any real issues or is this more of an
> >> > > > > > correctness exercise? I mean it certainly is not great to demote to an
> >> > > > > > incompatible numa node but are there any reasonable configurations when
> >> > > > > > the demotion target node is explicitly excluded from memory
> >> > > > > > policy/cpuset?
> >> > > > >
> >> > > > > We haven't got customer report on this, but there are quite some customers
> >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
> >> > > > > the cpuset semantics as much as we can.
> >> > > >
> >> > > > Yes, it is definitely better to respect cpusets and all local memory
> >> > > > policies. There is no dispute there. The thing is whether this is really
> >> > > > worth it. How often would cpusets (or policies in general) go actively
> >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
> >> > > > mask)?
> >> > > >
> >> > > > I can imagine workloads which wouldn't like to get their memory demoted
> >> > > > for some reason but wouldn't it be more practical to tell that
> >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> >> > > > policies explicitly?
> >> > > >
> >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
> >> > > > > the same per-memcg lruvec, so only one check is enough
> >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
> >> > > > > save a memcg cache, and if the next folio's memcg is same as the
> >> > > > > cache, we reuse its result. And due to the locality, the real
> >> > > > > check is rarely performed.
> >> > > >
> >> > > > memcg is not the expensive part of the thing. You need to get from page
> >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
> >> > >
> >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
> >> > > seems quite expensive and the correctness can't be guranteed since the
> >> > > mempolicy could be set per-thread and the mm->task depends on
> >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> >> >
> >> > Yes, you are right. Our "working" psudo code for mem policy looks like
> >> > what Michal mentioned, and it can't work for all cases, but try to
> >> > enforce it whenever possible:
> >> >
> >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> >> > unsigned long addr, void *arg)
> >> > {
> >> > bool *skip_demotion = arg;
> >> > struct mempolicy *mpol;
> >> > int nid, dnid;
> >> > bool ret = true;
> >> >
> >> > mpol = __get_vma_policy(vma, addr);
> >> > if (!mpol) {
> >> > struct task_struct *task;
> >> > if (vma->vm_mm)
> >> > task = vma->vm_mm->owner;
> >>
> >> But this task may not be the task you want IIUC. For example, the
> >> process has two threads, A and B. They have different mempolicy. The
> >> vmscan is trying to demote a page belonging to thread A, but the task
> >> may point to thread B, so you actually get the wrong mempolicy IIUC.
> >
> > Yes, this is a valid concern! We don't have good solution for this.
> > For memory policy, we may only handle the per-vma policy for now whose
> > cost is relatively low, as a best-effort try.
>
> Yes. The solution isn't perfect, especially for multiple-thread
> processes with thread specific memory policy. But the proposed code
> above can support the most common cases at least, that is, run workload
> with `numactl`.

Not only for multiple threads, it may also be broken for shared pages. When
you do the rmap walk, you may get multiple contradictory mempolicies; which
one would you like to obey?

TBH I'm not sure whether such a half-baked solution is worth it or not,
at least at this moment. The cost is not cheap, but the gain may not
be worth it IMHO.

>
> Best Regards,
> Huang, Ying

2022-10-31 01:58:53

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Yang Shi <[email protected]> writes:

> On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <[email protected]> wrote:
>>
>> Feng Tang <[email protected]> writes:
>>
>> > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
>> >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
>> >> >
>> >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
>> >> > [...]
>> >> > > > > > This all can get quite expensive so the primary question is, does the
>> >> > > > > > existing behavior generates any real issues or is this more of an
>> >> > > > > > correctness exercise? I mean it certainly is not great to demote to an
>> >> > > > > > incompatible numa node but are there any reasonable configurations when
>> >> > > > > > the demotion target node is explicitly excluded from memory
>> >> > > > > > policy/cpuset?
>> >> > > > >
>> >> > > > > We haven't got customer report on this, but there are quite some customers
>> >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
>> >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
>> >> > > > > the cpuset semantics as much as we can.
>> >> > > >
>> >> > > > Yes, it is definitely better to respect cpusets and all local memory
>> >> > > > policies. There is no dispute there. The thing is whether this is really
>> >> > > > worth it. How often would cpusets (or policies in general) go actively
>> >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
>> >> > > > mask)?
>> >> > > >
>> >> > > > I can imagine workloads which wouldn't like to get their memory demoted
>> >> > > > for some reason but wouldn't it be more practical to tell that
>> >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >> > > > policies explicitly?
>> >> > > >
>> >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
>> >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
>> >> > > > > the same per-memcg lruvec, so only one check is enough
>> >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
>> >> > > > > save a memcg cache, and if the next folio's memcg is same as the
>> >> > > > > cache, we reuse its result. And due to the locality, the real
>> >> > > > > check is rarely performed.
>> >> > > >
>> >> > > > memcg is not the expensive part of the thing. You need to get from page
>> >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
>> >> > >
>> >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
>> >> > > seems quite expensive and the correctness can't be guranteed since the
>> >> > > mempolicy could be set per-thread and the mm->task depends on
>> >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>> >> >
>> >> > Yes, you are right. Our "working" psudo code for mem policy looks like
>> >> > what Michal mentioned, and it can't work for all cases, but try to
>> >> > enforce it whenever possible:
>> >> >
>> >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> >> > unsigned long addr, void *arg)
>> >> > {
>> >> > bool *skip_demotion = arg;
>> >> > struct mempolicy *mpol;
>> >> > int nid, dnid;
>> >> > bool ret = true;
>> >> >
>> >> > mpol = __get_vma_policy(vma, addr);
>> >> > if (!mpol) {
>> >> > struct task_struct *task;
>> >> > if (vma->vm_mm)
>> >> > task = vma->vm_mm->owner;
>> >>
>> >> But this task may not be the task you want IIUC. For example, the
>> >> process has two threads, A and B. They have different mempolicy. The
>> >> vmscan is trying to demote a page belonging to thread A, but the task
>> >> may point to thread B, so you actually get the wrong mempolicy IIUC.
>> >
>> > Yes, this is a valid concern! We don't have good solution for this.
>> > For memory policy, we may only handle the per-vma policy for now whose
>> > cost is relatively low, as a best-effort try.
>>
>> Yes. The solution isn't perfect, especially for multiple-thread
>> processes with thread specific memory policy. But the proposed code
>> above can support the most common cases at least, that is, run workload
>> with `numactl`.
>
> Not only multi threads, but also may be broken for shared pages. When
> you do rmap walk, you may get multiple contradict mempolicy, which one
> would you like to obey?
>
> TBH I'm not sure whether such half-baked solution is worth it or not,
> at least at this moment. The cost is not cheap, but the gain may not
> be worth it IMHO.

Per my understanding, this can cover most cases. For example, running a
workload with `numactl`, or controlling the page placement of some memory
areas via mbind(). There are still some issues in corner cases, though.
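
To illustrate the mbind() case: binding a range to a specific node is just
a per-VMA MPOL_BIND policy, roughly like the untested userspace snippet
below (the node number and region size are arbitrary; link with -lnuma):

#include <numaif.h>
#include <sys/mman.h>
#include <stddef.h>
#include <stdio.h>

int main(void)
{
	size_t len = 2UL << 20;			/* 2 MiB anonymous region */
	unsigned long nodemask = 1UL << 0;	/* node 0 only (illustrative) */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * MPOL_BIND on this VMA: allocations for it must stay within the
	 * given nodes, and with the check discussed in this thread,
	 * demotion to a node outside the mask would be skipped as well.
	 */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");

	return 0;
}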

Best Regards,
Huang, Ying

2022-10-31 02:15:16

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

Yang Shi <[email protected]> writes:

> On Thu, Oct 27, 2022 at 10:09 PM Aneesh Kumar K V
> <[email protected]> wrote:
>>
>> On 10/27/22 11:25 PM, Yang Shi wrote:
>> > On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
>> >>
>> >> On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
>> >>> On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
>> >> [...]
>> >>>>>> This all can get quite expensive so the primary question is, does the
>> >>>>>> existing behavior generates any real issues or is this more of an
>> >>>>>> correctness exercise? I mean it certainly is not great to demote to an
>> >>>>>> incompatible numa node but are there any reasonable configurations when
>> >>>>>> the demotion target node is explicitly excluded from memory
>> >>>>>> policy/cpuset?
>> >>>>>
>> >>>>> We haven't got customer report on this, but there are quite some customers
>> >>>>> use cpuset to bind some specific memory nodes to a docker (You've helped
>> >>>>> us solve a OOM issue in such cases), so I think it's practical to respect
>> >>>>> the cpuset semantics as much as we can.
>> >>>>
>> >>>> Yes, it is definitely better to respect cpusets and all local memory
>> >>>> policies. There is no dispute there. The thing is whether this is really
>> >>>> worth it. How often would cpusets (or policies in general) go actively
>> >>>> against demotion nodes (i.e. exclude those nodes from their allowes node
>> >>>> mask)?
>> >>>>
>> >>>> I can imagine workloads which wouldn't like to get their memory demoted
>> >>>> for some reason but wouldn't it be more practical to tell that
>> >>>> explicitly (e.g. via prctl) rather than configuring cpusets/memory
>> >>>> policies explicitly?
>> >>>>
>> >>>>> Your concern about the expensive cost makes sense! Some raw ideas are:
>> >>>>> * if the shrink_folio_list is called by kswapd, the folios come from
>> >>>>> the same per-memcg lruvec, so only one check is enough
>> >>>>> * if not from kswapd, like called form madvise or DAMON code, we can
>> >>>>> save a memcg cache, and if the next folio's memcg is same as the
>> >>>>> cache, we reuse its result. And due to the locality, the real
>> >>>>> check is rarely performed.
>> >>>>
>> >>>> memcg is not the expensive part of the thing. You need to get from page
>> >>>> -> all vmas::vm_policy -> mm -> task::mempolicy
>> >>>
>> >>> Yeah, on the same page with Michal. Figuring out mempolicy from page
>> >>> seems quite expensive and the correctness can't be guranteed since the
>> >>> mempolicy could be set per-thread and the mm->task depends on
>> >>> CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
>> >>
>> >> Yes, you are right. Our "working" psudo code for mem policy looks like
>> >> what Michal mentioned, and it can't work for all cases, but try to
>> >> enforce it whenever possible:
>> >>
>> >> static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
>> >> unsigned long addr, void *arg)
>> >> {
>> >> bool *skip_demotion = arg;
>> >> struct mempolicy *mpol;
>> >> int nid, dnid;
>> >> bool ret = true;
>> >>
>> >> mpol = __get_vma_policy(vma, addr);
>> >> if (!mpol) {
>> >> struct task_struct *task;
>> >> if (vma->vm_mm)
>> >> task = vma->vm_mm->owner;
>> >
>> > But this task may not be the task you want IIUC. For example, the
>> > process has two threads, A and B. They have different mempolicy. The
>> > vmscan is trying to demote a page belonging to thread A, but the task
>> > may point to thread B, so you actually get the wrong mempolicy IIUC.
>> >
>>
>> But if we swap out this page and fault back in via thread B the page would
>> get allocated as per thread B mempolicy. So if we demote based on thread B
>> policy are we breaking anything?
>
> If the page is demoted by following thread B's mempolicy, didn't it
> already break thread A's mempolicy in the first place if you care
> about it? If thread A and thread B have the same mempolicy, then it is
> not a problem.
>
> Actually there is another problem for shared page. If a page is shared
> by two processes, P1 and P2, when you do rmap walk to find the task,
> you may find two contradict mempolicy, what mempolicy would you like
> to obey? Do you have to save all the intermediate mempolicy results
> somewhere or you just bail out once the first mempolicy is found?

Yes. There's no perfect solution for this. I suggest avoiding demotion
if any VMA (or task) prevents it, because allowing demotion is the
default policy. And we will not promote the page back if it becomes hot
later, because by default promotion only works for the default memory
policy.

Best Regards,
Huang, Ying

2022-10-31 02:28:03

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH] mm/vmscan: respect cpuset policy during page demotion

On Sat, Oct 29, 2022 at 01:23:53AM +0800, Yang Shi wrote:
> On Thu, Oct 27, 2022 at 10:55 PM Huang, Ying <[email protected]> wrote:
> >
> > Feng Tang <[email protected]> writes:
> >
> > > On Thu, Oct 27, 2022 at 10:55:58AM -0700, Yang Shi wrote:
> > >> On Thu, Oct 27, 2022 at 12:12 AM Feng Tang <[email protected]> wrote:
> > >> >
> > >> > On Thu, Oct 27, 2022 at 01:57:52AM +0800, Yang Shi wrote:
> > >> > > On Wed, Oct 26, 2022 at 8:59 AM Michal Hocko <[email protected]> wrote:
> > >> > [...]
> > >> > > > > > This all can get quite expensive so the primary question is, does the
> > >> > > > > > existing behavior generates any real issues or is this more of an
> > >> > > > > > correctness exercise? I mean it certainly is not great to demote to an
> > >> > > > > > incompatible numa node but are there any reasonable configurations when
> > >> > > > > > the demotion target node is explicitly excluded from memory
> > >> > > > > > policy/cpuset?
> > >> > > > >
> > >> > > > > We haven't got customer report on this, but there are quite some customers
> > >> > > > > use cpuset to bind some specific memory nodes to a docker (You've helped
> > >> > > > > us solve a OOM issue in such cases), so I think it's practical to respect
> > >> > > > > the cpuset semantics as much as we can.
> > >> > > >
> > >> > > > Yes, it is definitely better to respect cpusets and all local memory
> > >> > > > policies. There is no dispute there. The thing is whether this is really
> > >> > > > worth it. How often would cpusets (or policies in general) go actively
> > >> > > > against demotion nodes (i.e. exclude those nodes from their allowes node
> > >> > > > mask)?
> > >> > > >
> > >> > > > I can imagine workloads which wouldn't like to get their memory demoted
> > >> > > > for some reason but wouldn't it be more practical to tell that
> > >> > > > explicitly (e.g. via prctl) rather than configuring cpusets/memory
> > >> > > > policies explicitly?
> > >> > > >
> > >> > > > > Your concern about the expensive cost makes sense! Some raw ideas are:
> > >> > > > > * if the shrink_folio_list is called by kswapd, the folios come from
> > >> > > > > the same per-memcg lruvec, so only one check is enough
> > >> > > > > * if not from kswapd, like called form madvise or DAMON code, we can
> > >> > > > > save a memcg cache, and if the next folio's memcg is same as the
> > >> > > > > cache, we reuse its result. And due to the locality, the real
> > >> > > > > check is rarely performed.
> > >> > > >
> > >> > > > memcg is not the expensive part of the thing. You need to get from page
> > >> > > > -> all vmas::vm_policy -> mm -> task::mempolicy
> > >> > >
> > >> > > Yeah, on the same page with Michal. Figuring out mempolicy from page
> > >> > > seems quite expensive and the correctness can't be guranteed since the
> > >> > > mempolicy could be set per-thread and the mm->task depends on
> > >> > > CONFIG_MEMCG so it doesn't work for !CONFIG_MEMCG.
> > >> >
> > >> > Yes, you are right. Our "working" psudo code for mem policy looks like
> > >> > what Michal mentioned, and it can't work for all cases, but try to
> > >> > enforce it whenever possible:
> > >> >
> > >> > static bool __check_mpol_demotion(struct folio *folio, struct vm_area_struct *vma,
> > >> > unsigned long addr, void *arg)
> > >> > {
> > >> > bool *skip_demotion = arg;
> > >> > struct mempolicy *mpol;
> > >> > int nid, dnid;
> > >> > bool ret = true;
> > >> >
> > >> > mpol = __get_vma_policy(vma, addr);
> > >> > if (!mpol) {
> > >> > struct task_struct *task;
> > >> > if (vma->vm_mm)
> > >> > task = vma->vm_mm->owner;
> > >>
> > >> But this task may not be the task you want IIUC. For example, the
> > >> process has two threads, A and B. They have different mempolicy. The
> > >> vmscan is trying to demote a page belonging to thread A, but the task
> > >> may point to thread B, so you actually get the wrong mempolicy IIUC.
> > >
> > > Yes, this is a valid concern! We don't have good solution for this.
> > > For memory policy, we may only handle the per-vma policy for now whose
> > > cost is relatively low, as a best-effort try.
> >
> > Yes. The solution isn't perfect, especially for multiple-thread
> > processes with thread specific memory policy. But the proposed code
> > above can support the most common cases at least, that is, run workload
> > with `numactl`.
>
> Not only multi threads, but also may be broken for shared pages. When
> you do rmap walk, you may get multiple contradict mempolicy, which one
> would you like to obey?

In our test code, it follows the stricter policy: if the rmap walk
meets a mempolicy disallowing the demotion, it stops the walk and
returns with the 'skip_demotion' flag set.
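
To make that concrete, a rough, untested sketch of such a stop-on-first-disallow
callback, filling in the fragment quoted above, could look like the following.
The 'struct demotion_mpol_arg' container, the way the target node is passed in,
and the MPOL_BIND-only check are only for illustration (the actual code may
differ), and locking plus mempolicy reference counting are elided:

struct demotion_mpol_arg {
	int dnid;		/* candidate demotion target node */
	bool skip_demotion;	/* set when some policy disallows dnid */
};

static bool __check_mpol_demotion(struct folio *folio,
				  struct vm_area_struct *vma,
				  unsigned long addr, void *arg)
{
	struct demotion_mpol_arg *dma = arg;
	struct mempolicy *mpol;

	/* Per-VMA policy first (covers mbind()). */
	mpol = __get_vma_policy(vma, addr);

	/*
	 * Fall back to the owning task's policy. As discussed, mm->owner
	 * is only one of possibly many threads (and needs CONFIG_MEMCG),
	 * so this part is best effort.
	 */
	if (!mpol && vma->vm_mm && vma->vm_mm->owner)
		mpol = vma->vm_mm->owner->mempolicy;

	/* No policy found: nothing to violate, keep walking. */
	if (!mpol)
		return true;

	/*
	 * Stricter policy wins: if this policy binds the mapping to a
	 * nodemask that excludes the demotion target, flag it and stop
	 * the rmap walk right away.
	 */
	if (mpol->mode == MPOL_BIND && !node_isset(dma->dnid, mpol->nodes)) {
		dma->skip_demotion = true;
		return false;
	}

	return true;
}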

Thanks,
Feng