mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
In addition to checking gfpflags_allow_blocking(), it pays attention
to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
this memcg do not exceed their quotas. Using the same GFP flags ensures
that we handle large anonymous folios correctly, including falling back
to smaller orders when there is plenty of memory available in the system
but this memcg is close to its limits.
Signed-off-by: Kefeng Wang <[email protected]>
---
v2:
- fix build when !CONFIG_TRANSPARENT_HUGEPAGE
- update changelog as suggested by Matthew Wilcox
mm/memory.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 5e88d5379127..551f0b21bc42 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4153,8 +4153,8 @@ static bool pte_range_none(pte_t *pte, int nr_pages)
static struct folio *alloc_anon_folio(struct vm_fault *vmf)
{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct vm_area_struct *vma = vmf->vma;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
unsigned long orders;
struct folio *folio;
unsigned long addr;
@@ -4206,15 +4206,21 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
folio = vma_alloc_folio(gfp, order, vma, addr, true);
if (folio) {
+ if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+ folio_put(folio);
+ goto next;
+ }
+ folio_throttle_swaprate(folio, gfp);
clear_huge_page(&folio->page, vmf->address, 1 << order);
return folio;
}
+next:
order = next_order(&orders, order);
}
fallback:
#endif
- return vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+ return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
}
/*
@@ -4281,10 +4287,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
nr_pages = folio_nr_pages(folio);
addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
- if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
- goto oom_free_page;
- folio_throttle_swaprate(folio, GFP_KERNEL);
-
/*
* The memory barrier inside __folio_mark_uptodate makes sure that
* preceding stores to the page contents become visible before
@@ -4338,8 +4340,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
release:
folio_put(folio);
goto unlock;
-oom_free_page:
- folio_put(folio);
oom:
return VM_FAULT_OOM;
}
--
2.27.0
On 17/01/2024 10:39, Kefeng Wang wrote:
> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
> In addition to checking gfpflags_allow_blocking(), it pays attention
> to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
> this memcg do not exceed their quotas. Using the same GFP flags ensures
> that we handle large anonymous folios correctly, including falling back
> to smaller orders when there is plenty of memory available in the system
> but this memcg is close to its limits.
>
> Signed-off-by: Kefeng Wang <[email protected]>
Reviewed-by: Ryan Roberts <[email protected]>
On Wed 17-01-24 18:39:54, Kefeng Wang wrote:
> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
> In addition to checking gfpflags_allow_blocking(), it pays attention
> to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
> this memcg do not exceed their quotas. Using the same GFP flags ensures
> that we handle large anonymous folios correctly, including falling back
> to smaller orders when there is plenty of memory available in the system
> but this memcg is close to its limits.
The changelog is not really clear about the actual problem you are trying
to fix. Is this a pure consistency fix or have you actually seen any
misbehavior? From the patch I suspect you are interested in THPs much
more than regular order-0 pages because those are GFP_KERNEL-like when
it comes to charging. THPs have a variety of options on how aggressively
the allocation should try. From that perspective NORETRY and
RETRY_MAYFAIL are not all that interesting because costly allocations
(which THPs are) already imply MAYFAIL and NORETRY.
GFP_TRANSHUGE_LIGHT is more interesting though because those do not dive
into the direct reclaim at all. With the current code they will reclaim
charges to free up the space for the allocated THP page and that defeats
the light mode. I have a vague recollection of preparing a patch to
address that in the past. Let me have a look at the current code...
... So yes, we still do THP charging the way I remember
(do_huge_pmd_anonymous_page). Your patch touches handle_pte_fault ->
do_anonymous_page path which is not THP AFAICS. Or am I missing
something?
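
As a rough illustration of the GFP_TRANSHUGE_LIGHT point above, here is a
small user-space model, not kernel code. It assumes only that the charge
path may block and reclaim when the direct-reclaim bit is set (which is
what gfpflags_allow_blocking() tests) and that GFP_TRANSHUGE_LIGHT leaves
that bit clear. All MODEL_* names and bit values are made up for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for this model; they are not the kernel's. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_DIRECT_RECLAIM 0x1u /* stands in for __GFP_DIRECT_RECLAIM */
#define MODEL_GFP_KSWAPD_RECLAIM 0x2u /* stands in for __GFP_KSWAPD_RECLAIM */
#define MODEL_GFP_OTHER          0x4u /* stand-in for all non-reclaim bits */

#define MODEL_GFP_KERNEL \
	(MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_KSWAPD_RECLAIM | MODEL_GFP_OTHER)
#define MODEL_GFP_TRANSHUGE_LIGHT (MODEL_GFP_OTHER)
#define MODEL_GFP_TRANSHUGE \
	(MODEL_GFP_TRANSHUGE_LIGHT | MODEL_GFP_DIRECT_RECLAIM)

/* Mirrors the idea of gfpflags_allow_blocking(): direct reclaim allowed? */
bool model_allow_blocking(model_gfp_t gfp)
{
	return (gfp & MODEL_GFP_DIRECT_RECLAIM) != 0;
}

/*
 * Would the memcg charge be allowed to reclaim for this allocation?
 * charge_with_alloc_gfp == false models charging with GFP_KERNEL,
 * charge_with_alloc_gfp == true models reusing the allocation's mask.
 */
bool model_charge_may_reclaim(model_gfp_t alloc_gfp, bool charge_with_alloc_gfp)
{
	model_gfp_t charge_gfp;

	assert(alloc_gfp != 0); /* a mask of 0 is not meaningful in this model */
	charge_gfp = charge_with_alloc_gfp ? alloc_gfp : MODEL_GFP_KERNEL;
	return model_allow_blocking(charge_gfp);
}
```

In this model a light allocation charged with GFP_KERNEL may still reclaim,
while charging it with its own mask may not.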
--
Michal Hocko
SUSE Labs
On 2024/1/18 23:59, Michal Hocko wrote:
> On Wed 17-01-24 18:39:54, Kefeng Wang wrote:
>> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
>> In addition to checking gfpflags_allow_blocking(), it pays attention
>> to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
>> this memcg do not exceed their quotas. Using the same GFP flags ensures
>> that we handle large anonymous folios correctly, including falling back
>> to smaller orders when there is plenty of memory available in the system
>> but this memcg is close to its limits.
>
> The changelog is not really clear about the actual problem you are trying
> to fix. Is this a pure consistency fix or have you actually seen any
> misbehavior? From the patch I suspect you are interested in THPs much
> more than regular order-0 pages because those are GFP_KERNEL-like when
> it comes to charging. THPs have a variety of options on how aggressively
> the allocation should try. From that perspective NORETRY and
> RETRY_MAYFAIL are not all that interesting because costly allocations
> (which THPs are) already imply MAYFAIL and NORETRY.
I have not hit an actual issue; this was found by code inspection.

mTHP was introduced by Ryan (19eaf44954df "mm: thp: support allocation of
anonymous multi-size THP"), so alloc_anon_folio() now has checks for mTHP
similar to those for PMD THP. It tries to allocate a large-order folio
below PMD_ORDER and falls back to an order-0 folio if that fails.
Meanwhile, it gets its GFP flags from vma_thp_gfp_mask() according to the
user configuration, like the PMD THP allocation does, so:

1) The memory charge failure check should be moved into the fallback
logic, because that lets us allocate as large a folio as possible even
when the memcg's memory usage is close to its limit.

2) The same GFP flags should be used for allocation and memcg charging.
Firstly, this is consistent with PMD THP. In addition, depending on the
GFP flags returned by vma_thp_gfp_mask(), GFP_TRANSHUGE_LIGHT makes us
skip direct reclaim, and __GFP_NORETRY makes us skip mem_cgroup_oom(), so
charging a large-order folio won't kill any process.
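
To make point 1) concrete, here is a minimal user-space sketch of the
fallback idea, not the kernel implementation. It assumes the page
allocation itself always succeeds (plenty of free memory in the system) and
that only the memcg charge can fail. struct model_memcg, model_charge(),
model_alloc_anon_folio() and MODEL_MAX_ORDER are hypothetical names for
this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ORDER 8 /* hypothetical cap: largest order tried here */

/* Toy memcg: usage and limit are counted in pages. */
struct model_memcg {
	unsigned long limit;
	unsigned long usage;
};

/* Charge nr_pages if they fit under the limit; no reclaim in this model. */
static bool model_charge(struct model_memcg *cg, unsigned long nr_pages)
{
	if (cg->usage + nr_pages > cg->limit)
		return false;
	cg->usage += nr_pages;
	return true;
}

/*
 * Try each enabled order (bit N of 'orders' set means order N is enabled)
 * from largest to smallest. A failed charge moves on to the next smaller
 * order instead of failing the fault. Order 0 is the final fallback.
 * Returns the order that was charged, or -1 if even one page does not fit.
 */
int model_alloc_anon_folio(struct model_memcg *cg, unsigned long orders)
{
	int order;

	assert(cg != NULL);
	for (order = MODEL_MAX_ORDER; order > 0; order--) {
		if (!(orders & (1UL << order)))
			continue;
		if (model_charge(cg, 1UL << order))
			return order;
		/* Charge failed: fall through to the next enabled order. */
	}
	return model_charge(cg, 1) ? 0 : -1;
}
```

With a limit of 100 pages, 90 pages in use, and orders 4, 3 and 2 enabled,
the order-4 charge (16 pages) does not fit but the order-3 charge (8 pages)
does, so the model returns 3 rather than reporting a failure.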
>
> GFP_TRANSHUGE_LIGHT is more interesting though because those do not dive
> into the direct reclaim at all. With the current code they will reclaim
> charges to free up the space for the allocated THP page and that defeats
> the light mode. I have a vague recollection of preparing a patch to
We are interested in GFP_TRANSHUGE_LIGHT and __GFP_NORETRY, as mentioned
above.
> address that in the past. Let me have a look at the current code...
Yes, that is commit 3b3636924dfe ("mm, memcg: sync allocation and memcg
charge gfp flags for THP") for PMD THP, from you :)
>
> ... So yes, we still do THP charging the way I remember
> (do_huge_pmd_anonymous_page). Your patch touches handle_pte_fault ->
> do_anonymous_page path which is not THP AFAICS. Or am I missing
> something?
mTHP is one kind of THP.
Thanks.
On Fri 19-01-24 10:05:15, Kefeng Wang wrote:
>
>
> On 2024/1/18 23:59, Michal Hocko wrote:
> > On Wed 17-01-24 18:39:54, Kefeng Wang wrote:
> > > mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
> > > In addition to checking gfpflags_allow_blocking(), it pays attention
> > > to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
> > > this memcg do not exceed their quotas. Using the same GFP flags ensures
> > > that we handle large anonymous folios correctly, including falling back
> > > to smaller orders when there is plenty of memory available in the system
> > > but this memcg is close to its limits.
> >
> > The changelog is not really clear about the actual problem you are trying
> > to fix. Is this a pure consistency fix or have you actually seen any
> > misbehavior? From the patch I suspect you are interested in THPs much
> > more than regular order-0 pages because those are GFP_KERNEL-like when
> > it comes to charging. THPs have a variety of options on how aggressively
> > the allocation should try. From that perspective NORETRY and
> > RETRY_MAYFAIL are not all that interesting because costly allocations
> > (which THPs are) already imply MAYFAIL and NORETRY.
>
> I have not hit an actual issue; this was found by code inspection.
>
> mTHP was introduced by Ryan (19eaf44954df "mm: thp: support allocation of
> anonymous multi-size THP"), so alloc_anon_folio() now has checks for mTHP
> similar to those for PMD THP. It tries to allocate a large-order folio
> below PMD_ORDER and falls back to an order-0 folio if that fails.
> Meanwhile, it gets its GFP flags from vma_thp_gfp_mask() according to the
> user configuration, like the PMD THP allocation does, so:
>
> 1) The memory charge failure check should be moved into the fallback
> logic, because that lets us allocate as large a folio as possible even
> when the memcg's memory usage is close to its limit.
>
> 2) The same GFP flags should be used for allocation and memcg charging.
> Firstly, this is consistent with PMD THP. In addition, depending on the
> GFP flags returned by vma_thp_gfp_mask(), GFP_TRANSHUGE_LIGHT makes us
> skip direct reclaim, and __GFP_NORETRY makes us skip mem_cgroup_oom(), so
> charging a large-order folio won't kill any process.
OK, makes sense. Please turn that into the changelog.
> > GFP_TRANSHUGE_LIGHT is more interesting though because those do not dive
> > into the direct reclaim at all. With the current code they will reclaim
> > charges to free up the space for the allocated THP page and that defeats
> > the light mode. I have a vague recollection of preparing a patch to
>
> We are interested in GFP_TRANSHUGE_LIGHT and __GFP_NORETRY, as mentioned
> above.
if mTHP can be smaller than COSTLY_ORDER then you are correct and
NORETRY makes a difference. Please mention that in the changelog as
well.
Thanks!
--
Michal Hocko
SUSE Labs
On 2024/1/19 16:00, Michal Hocko wrote:
> On Fri 19-01-24 10:05:15, Kefeng Wang wrote:
>>
>>
>> On 2024/1/18 23:59, Michal Hocko wrote:
>>> On Wed 17-01-24 18:39:54, Kefeng Wang wrote:
>>>> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
>>>> In addition to checking gfpflags_allow_blocking(), it pays attention
>>>> to __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
>>>> this memcg do not exceed their quotas. Using the same GFP flags ensures
>>>> that we handle large anonymous folios correctly, including falling back
>>>> to smaller orders when there is plenty of memory available in the system
>>>> but this memcg is close to its limits.
>>>
>>> The changelog is not really clear about the actual problem you are trying
>>> to fix. Is this a pure consistency fix or have you actually seen any
>>> misbehavior? From the patch I suspect you are interested in THPs much
>>> more than regular order-0 pages because those are GFP_KERNEL-like when
>>> it comes to charging. THPs have a variety of options on how aggressively
>>> the allocation should try. From that perspective NORETRY and
>>> RETRY_MAYFAIL are not all that interesting because costly allocations
>>> (which THPs are) already imply MAYFAIL and NORETRY.
>>
>> I have not hit an actual issue; this was found by code inspection.
>>
>> mTHP was introduced by Ryan (19eaf44954df "mm: thp: support allocation of
>> anonymous multi-size THP"), so alloc_anon_folio() now has checks for mTHP
>> similar to those for PMD THP. It tries to allocate a large-order folio
>> below PMD_ORDER and falls back to an order-0 folio if that fails.
>> Meanwhile, it gets its GFP flags from vma_thp_gfp_mask() according to the
>> user configuration, like the PMD THP allocation does, so:
>>
>> 1) The memory charge failure check should be moved into the fallback
>> logic, because that lets us allocate as large a folio as possible even
>> when the memcg's memory usage is close to its limit.
>>
>> 2) The same GFP flags should be used for allocation and memcg charging.
>> Firstly, this is consistent with PMD THP. In addition, depending on the
>> GFP flags returned by vma_thp_gfp_mask(), GFP_TRANSHUGE_LIGHT makes us
>> skip direct reclaim, and __GFP_NORETRY makes us skip mem_cgroup_oom(), so
>> charging a large-order folio won't kill any process.
>
> OK, makes sense. Please turn that into the changelog.
Sure.
>
>>> GFP_TRANSHUGE_LIGHT is more interesting though because those do not dive
>>> into the direct reclaim at all. With the current code they will reclaim
>>> charges to free up the space for the allocated THP page and that defeats
>>> the light mode. I have a vague recollection of preparing a patch to
>>
>> We are interested in GFP_TRANSHUGE_LIGHT and __GFP_NORETRY, as mentioned
>> above.
>
> if mTHP can be smaller than COSTLY_ORDER then you are correct and
> NORETRY makes a difference. Please mention that in the changelog as
> well.
>
For the memory cgroup charge, __GFP_NORETRY is checked so that we skip
mem_cgroup_oom() directly. The __GFP_NORETRY check in try_charge_memcg()
does not depend on the folio order or COSTLY_ORDER, so I think NORETRY
should always make a difference for all large-order folios.
On Fri 19-01-24 20:59:22, Kefeng Wang wrote:
> > > > GFP_TRANSHUGE_LIGHT is more interesting though because those do not dive
> > > > into the direct reclaim at all. With the current code they will reclaim
> > > > charges to free up the space for the allocated THP page and that defeats
> > > > the light mode. I have a vague recollection of preparing a patch to
> > >
> > > We are interested in GFP_TRANSHUGE_LIGHT and __GFP_NORETRY, as mentioned
> > > above.
> >
> > if mTHP can be smaller than COSTLY_ORDER then you are correct and
> > NORETRY makes a difference. Please mention that in the changelog as
> > well.
> >
>
> For the memory cgroup charge, __GFP_NORETRY is checked so that we skip
> mem_cgroup_oom() directly. The __GFP_NORETRY check in try_charge_memcg()
> does not depend on the folio order or COSTLY_ORDER, so I think NORETRY
> should always make a difference for all large-order folios.
we do not OOM on COSTLY_ORDER (see mem_cgroup_oom). So NORETRY really
makes a difference for small orders.
--
Michal Hocko
SUSE Labs
On 2024/1/19 23:46, Michal Hocko wrote:
> On Fri 19-01-24 20:59:22, Kefeng Wang wrote:
>>>>> GFP_TRANSHUGE_LIGHT is more interesting though because those do not dive
>>>>> into the direct reclaim at all. With the current code they will reclaim
>>>>> charges to free up the space for the allocated THP page and that defeats
>>>>> the light mode. I have a vague recollection of preparing a patch to
>>>>
>>>> We are interested in GFP_TRANSHUGE_LIGHT and __GFP_NORETRY, as mentioned
>>>> above.
>>>
>>> if mTHP can be smaller than COSTLY_ORDER then you are correct and
>>> NORETRY makes a difference. Please mention that in the changelog as
>>> well.
>>>
>>
>> For the memory cgroup charge, __GFP_NORETRY is checked so that we skip
>> mem_cgroup_oom() directly. The __GFP_NORETRY check in try_charge_memcg()
>> does not depend on the folio order or COSTLY_ORDER, so I think NORETRY
>> should always make a difference for all large-order folios.
>
> we do not OOM on COSTLY_ORDER (see mem_cgroup_oom). So NORETRY really
> makes a difference for small orders.
I see what you mean, but we may be describing different cases. If
GFP_TRANSHUGE | __GFP_NORETRY is returned from vma_thp_gfp_mask(), then
we never get to mem_cgroup_oom(), since it is skipped in
try_charge_memcg(). That is what I wanted to say: in this case there is
no OOM whether order < COSTLY_ORDER or order > COSTLY_ORDER. But if the
GFP mask is GFP_TRANSHUGE, then we may enter mem_cgroup_oom(), and may
OOM if order < COSTLY_ORDER.
So yes, NORETRY really makes a difference for small orders.
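
The cases discussed above can be summarised with a simplified user-space
model of what happens when a charge stays over the limit. This is only a
sketch under stated assumptions, not the logic of try_charge_memcg()
itself: it ignores retries, stock draining and dying tasks, and it assumes
that mem_cgroup_oom() declines to OOM for orders above
PAGE_ALLOC_COSTLY_ORDER (taken to be 3 here). All MODEL_* names and bit
values are made up for the illustration.

```c
#include <assert.h>

/* Hypothetical bit values for this model; they are not the kernel's. */
typedef unsigned int model_gfp_t;
#define MODEL_GFP_DIRECT_RECLAIM 0x1u /* stands in for __GFP_DIRECT_RECLAIM */
#define MODEL_GFP_NORETRY        0x2u /* stands in for __GFP_NORETRY */
#define MODEL_GFP_RETRY_MAYFAIL  0x4u /* stands in for __GFP_RETRY_MAYFAIL */
#define MODEL_GFP_OTHER          0x8u /* stand-in for all remaining bits */

#define MODEL_GFP_TRANSHUGE_LIGHT (MODEL_GFP_OTHER)
#define MODEL_GFP_TRANSHUGE \
	(MODEL_GFP_TRANSHUGE_LIGHT | MODEL_GFP_DIRECT_RECLAIM)

#define MODEL_COSTLY_ORDER 3 /* assumed value of PAGE_ALLOC_COSTLY_ORDER */

enum model_charge_result {
	MODEL_FAIL_WITHOUT_RECLAIM, /* charge fails, no direct reclaim tried */
	MODEL_FAIL_AFTER_RECLAIM,   /* reclaim tried, charge fails, no OOM */
	MODEL_MAY_OOM,              /* reclaim tried, memcg OOM may be invoked */
};

/* Outcome of a charge that stays over the limit even after any reclaim. */
enum model_charge_result model_over_limit_charge(model_gfp_t gfp, int order)
{
	assert(order >= 0);

	/* No direct reclaim bit: the charge cannot block, so it just fails. */
	if (!(gfp & MODEL_GFP_DIRECT_RECLAIM))
		return MODEL_FAIL_WITHOUT_RECLAIM;

	/* NORETRY and RETRY_MAYFAIL give up before the OOM step. */
	if (gfp & (MODEL_GFP_NORETRY | MODEL_GFP_RETRY_MAYFAIL))
		return MODEL_FAIL_AFTER_RECLAIM;

	/* Assumed: costly orders never invoke the memcg OOM killer. */
	if (order > MODEL_COSTLY_ORDER)
		return MODEL_FAIL_AFTER_RECLAIM;

	return MODEL_MAY_OOM;
}
```

In this model NORETRY changes the outcome only for orders at or below the
costly threshold, which matches the conclusion that it really makes a
difference for small orders.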