2024-01-16 07:14:59

by Kefeng Wang

Subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

In order to allocate a large folio whenever possible, move the
memcg charge into alloc_anon_folio() and try the next lower order
if mem_cgroup_charge() fails. Also change GFP_KERNEL to gfp to be
consistent with PMD-sized THP.

Signed-off-by: Kefeng Wang <[email protected]>
---
mm/memory.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5e88d5379127..2e31a407e6f9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4206,15 +4206,21 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
folio = vma_alloc_folio(gfp, order, vma, addr, true);
if (folio) {
+ if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
+ folio_put(folio);
+ goto next;
+ }
+ folio_throttle_swaprate(folio, gfp);
clear_huge_page(&folio->page, vmf->address, 1 << order);
return folio;
}
+next:
order = next_order(&orders, order);
}

fallback:
#endif
- return vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
+ return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
}

/*
@@ -4281,10 +4287,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
nr_pages = folio_nr_pages(folio);
addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);

- if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
- goto oom_free_page;
- folio_throttle_swaprate(folio, GFP_KERNEL);
-
/*
* The memory barrier inside __folio_mark_uptodate makes sure that
* preceding stores to the page contents become visible before
@@ -4338,8 +4340,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
release:
folio_put(folio);
goto unlock;
-oom_free_page:
- folio_put(folio);
oom:
return VM_FAULT_OOM;
}
--
2.27.0



2024-01-16 14:36:10

by Ryan Roberts

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

On 16/01/2024 07:13, Kefeng Wang wrote:
> In order to allocate a large folio whenever possible, move the
> memcg charge into alloc_anon_folio() and try the next lower order
> if mem_cgroup_charge() fails. Also change GFP_KERNEL to gfp to be
> consistent with PMD-sized THP.

I agree that changing gfp gives you consistency. But it's not entirely clear to
me why THP should use one set of flags for this case and single pages another.
Why does this difference exist?

Otherwise, LGTM!

>
> Signed-off-by: Kefeng Wang <[email protected]>
> ---
> mm/memory.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e88d5379127..2e31a407e6f9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4206,15 +4206,21 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> folio = vma_alloc_folio(gfp, order, vma, addr, true);
> if (folio) {
> + if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> + folio_put(folio);
> + goto next;
> + }
> + folio_throttle_swaprate(folio, gfp);
> clear_huge_page(&folio->page, vmf->address, 1 << order);
> return folio;
> }
> +next:
> order = next_order(&orders, order);
> }
>
> fallback:
> #endif
> - return vma_alloc_zeroed_movable_folio(vmf->vma, vmf->address);
> + return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
> }
>
> /*
> @@ -4281,10 +4287,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> nr_pages = folio_nr_pages(folio);
> addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
>
> - if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> - goto oom_free_page;
> - folio_throttle_swaprate(folio, GFP_KERNEL);
> -
> /*
> * The memory barrier inside __folio_mark_uptodate makes sure that
> * preceding stores to the page contents become visible before
> @@ -4338,8 +4340,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> release:
> folio_put(folio);
> goto unlock;
> -oom_free_page:
> - folio_put(folio);
> oom:
> return VM_FAULT_OOM;
> }


2024-01-16 14:51:37

by Matthew Wilcox

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

On Tue, Jan 16, 2024 at 02:35:54PM +0000, Ryan Roberts wrote:
> On 16/01/2024 07:13, Kefeng Wang wrote:
> > In order to allocate a large folio whenever possible, move the
> > memcg charge into alloc_anon_folio() and try the next lower order
> > if mem_cgroup_charge() fails. Also change GFP_KERNEL to gfp to be
> > consistent with PMD-sized THP.
>
> I agree that changing gfp gives you consistency. But it's not entirely clear to
> me why THP should use one set of flags for this case and single pages another.
> Why does this difference exist?

I think it needs to be spelled out much better in the changelog. Here's
my attempt at explaining why we might want this change.

mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
In addition to checking gfpflags_allow_blocking(), it pays attention to
__GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
this memcg do not exceed their quotas. Using the same GFP flags ensures
that we handle large anonymous folios correctly, including falling back
to smaller orders when there is plenty of memory available in the system
but this memcg is close to its limits.

... I remain not-an-expert in memcg and anonymous memory and welcome
improvements to that text.
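
To make the flag semantics concrete, here is a simplified sketch of a
memcg charge path, illustrating how those flags change its behaviour.
This is not the real mem_cgroup_charge()/try_charge() code; apart from
gfpflags_allow_blocking(), the helper names and the retry bound are
placeholders for illustration only.

/*
 * Sketch only: shows the effect of the GFP flags discussed above.
 * The sketch_*() helpers are placeholders, not real kernel APIs.
 */
static int sketch_charge(struct mem_cgroup *memcg, gfp_t gfp,
			 unsigned int nr_pages)
{
	int retries = 5;	/* arbitrary bound for this sketch */

	while (!sketch_charge_would_fit(memcg, nr_pages)) {
		/* Non-blocking allocations cannot reclaim, so fail fast. */
		if (!gfpflags_allow_blocking(gfp))
			return -ENOMEM;

		sketch_reclaim_from_memcg(memcg, nr_pages, gfp);

		/* __GFP_NORETRY: a single reclaim attempt, then give up. */
		if (gfp & __GFP_NORETRY)
			return -ENOMEM;

		if (!retries--) {
			/*
			 * __GFP_RETRY_MAYFAIL: after bounded retries, fail
			 * rather than push the memcg into its OOM path.
			 */
			if (gfp & __GFP_RETRY_MAYFAIL)
				return -ENOMEM;
			return sketch_charge_over_limit_or_oom(memcg, nr_pages);
		}
	}

	sketch_commit_charge(memcg, nr_pages);
	return 0;
}

Depending on the THP defrag setting, the mask from vma_thp_gfp_mask() may
disallow blocking or include __GFP_NORETRY, so a charge that would have
reclaimed hard under GFP_KERNEL can instead fail quickly and let
alloc_anon_folio() fall back to a smaller order.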

2024-01-16 15:08:27

by Ryan Roberts

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

On 16/01/2024 14:51, Matthew Wilcox wrote:
> On Tue, Jan 16, 2024 at 02:35:54PM +0000, Ryan Roberts wrote:
>> On 16/01/2024 07:13, Kefeng Wang wrote:
>>> In order to allocate a large folio whenever possible, move the
>>> memcg charge into alloc_anon_folio() and try the next lower order
>>> if mem_cgroup_charge() fails. Also change GFP_KERNEL to gfp to be
>>> consistent with PMD-sized THP.
>>
>> I agree that changing gfp gives you consistency. But it's not entirely clear to
>> me why THP should use one set of flags for this case and single pages another.
>> Why does this difference exist?
>
> I think it needs to be spelled out much better in the changelog. Here's
> my attempt at explaining why we might want this change.
>
> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
> In addition to checking gfpflags_allow_blocking(), it pays attention to
> __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
> this memcg do not exceed their quotas. Using the same GFP flags ensures
> that we handle large anonymous folios correctly, including falling back
> to smaller orders when there is plenty of memory available in the system
> but this memcg is close to its limits.

Thanks for the explanation. Please add to the commit log.

Essentially you are saying that previously, all mTHP allocations would cause
reclaim from the memcg if the allocation caused the quota to be used up. But
with this change, it might now avoid that reclaim and just fail the charge, if
the flags say so? So then we retry with the next lowest available size. Makes sense!


>
> ... I remain not-an-expert in memcg and anonymous memory and welcome
> improvements to that text.

Me too...

2024-01-16 22:32:49

by kernel test robot

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

Hi Kefeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-memory-move-mem_cgroup_charge-into-alloc_anon_folio/20240116-151640
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240116071302.2282230-1-wangkefeng.wang%40huawei.com
patch subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20240117/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240117/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All error/warnings (new ones prefixed by >>):

mm/memory.c: In function 'alloc_anon_folio':
>> mm/memory.c:4223:31: error: 'vma' undeclared (first use in this function); did you mean 'vmf'?
4223 | return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
| ^~~
| vmf
mm/memory.c:4223:31: note: each undeclared identifier is reported only once for each function it appears in
>> mm/memory.c:4224:1: warning: control reaches end of non-void function [-Wreturn-type]
4224 | }
| ^


vim +4223 mm/memory.c

4153
4154 static struct folio *alloc_anon_folio(struct vm_fault *vmf)
4155 {
4156 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
4157 struct vm_area_struct *vma = vmf->vma;
4158 unsigned long orders;
4159 struct folio *folio;
4160 unsigned long addr;
4161 pte_t *pte;
4162 gfp_t gfp;
4163 int order;
4164
4165 /*
4166 * If uffd is active for the vma we need per-page fault fidelity to
4167 * maintain the uffd semantics.
4168 */
4169 if (unlikely(userfaultfd_armed(vma)))
4170 goto fallback;
4171
4172 /*
4173 * Get a list of all the (large) orders below PMD_ORDER that are enabled
4174 * for this vma. Then filter out the orders that can't be allocated over
4175 * the faulting address and still be fully contained in the vma.
4176 */
4177 orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
4178 BIT(PMD_ORDER) - 1);
4179 orders = thp_vma_suitable_orders(vma, vmf->address, orders);
4180
4181 if (!orders)
4182 goto fallback;
4183
4184 pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
4185 if (!pte)
4186 return ERR_PTR(-EAGAIN);
4187
4188 /*
4189 * Find the highest order where the aligned range is completely
4190 * pte_none(). Note that all remaining orders will be completely
4191 * pte_none().
4192 */
4193 order = highest_order(orders);
4194 while (orders) {
4195 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
4196 if (pte_range_none(pte + pte_index(addr), 1 << order))
4197 break;
4198 order = next_order(&orders, order);
4199 }
4200
4201 pte_unmap(pte);
4202
4203 /* Try allocating the highest of the remaining orders. */
4204 gfp = vma_thp_gfp_mask(vma);
4205 while (orders) {
4206 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
4207 folio = vma_alloc_folio(gfp, order, vma, addr, true);
4208 if (folio) {
4209 if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
4210 folio_put(folio);
4211 goto next;
4212 }
4213 folio_throttle_swaprate(folio, gfp);
4214 clear_huge_page(&folio->page, vmf->address, 1 << order);
4215 return folio;
4216 }
4217 next:
4218 order = next_order(&orders, order);
4219 }
4220
4221 fallback:
4222 #endif
> 4223 return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
> 4224 }
4225

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-16 23:41:49

by kernel test robot

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()

Hi Kefeng,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url: https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-memory-move-mem_cgroup_charge-into-alloc_anon_folio/20240116-151640
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240116071302.2282230-1-wangkefeng.wang%40huawei.com
patch subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()
config: i386-buildonly-randconfig-002-20240116 (https://download.01.org/0day-ci/archive/20240117/[email protected]/config)
compiler: ClangBuiltLinux clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240117/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

>> mm/memory.c:4223:24: error: use of undeclared identifier 'vma'
4223 | return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
| ^
mm/memory.c:4223:36: error: use of undeclared identifier 'vma'
4223 | return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
| ^
2 errors generated.


vim +/vma +4223 mm/memory.c

4153
4154 static struct folio *alloc_anon_folio(struct vm_fault *vmf)
4155 {
4156 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
4157 struct vm_area_struct *vma = vmf->vma;
4158 unsigned long orders;
4159 struct folio *folio;
4160 unsigned long addr;
4161 pte_t *pte;
4162 gfp_t gfp;
4163 int order;
4164
4165 /*
4166 * If uffd is active for the vma we need per-page fault fidelity to
4167 * maintain the uffd semantics.
4168 */
4169 if (unlikely(userfaultfd_armed(vma)))
4170 goto fallback;
4171
4172 /*
4173 * Get a list of all the (large) orders below PMD_ORDER that are enabled
4174 * for this vma. Then filter out the orders that can't be allocated over
4175 * the faulting address and still be fully contained in the vma.
4176 */
4177 orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
4178 BIT(PMD_ORDER) - 1);
4179 orders = thp_vma_suitable_orders(vma, vmf->address, orders);
4180
4181 if (!orders)
4182 goto fallback;
4183
4184 pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
4185 if (!pte)
4186 return ERR_PTR(-EAGAIN);
4187
4188 /*
4189 * Find the highest order where the aligned range is completely
4190 * pte_none(). Note that all remaining orders will be completely
4191 * pte_none().
4192 */
4193 order = highest_order(orders);
4194 while (orders) {
4195 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
4196 if (pte_range_none(pte + pte_index(addr), 1 << order))
4197 break;
4198 order = next_order(&orders, order);
4199 }
4200
4201 pte_unmap(pte);
4202
4203 /* Try allocating the highest of the remaining orders. */
4204 gfp = vma_thp_gfp_mask(vma);
4205 while (orders) {
4206 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
4207 folio = vma_alloc_folio(gfp, order, vma, addr, true);
4208 if (folio) {
4209 if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
4210 folio_put(folio);
4211 goto next;
4212 }
4213 folio_throttle_swaprate(folio, gfp);
4214 clear_huge_page(&folio->page, vmf->address, 1 << order);
4215 return folio;
4216 }
4217 next:
4218 order = next_order(&orders, order);
4219 }
4220
4221 fallback:
4222 #endif
> 4223 return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
4224 }
4225

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-01-17 01:02:36

by Kefeng Wang

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()



On 2024/1/17 5:26, kernel test robot wrote:
> Hi Kefeng,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on akpm-mm/mm-everything]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Kefeng-Wang/mm-memory-move-mem_cgroup_charge-into-alloc_anon_folio/20240116-151640
> base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/r/20240116071302.2282230-1-wangkefeng.wang%40huawei.com
> patch subject: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()
> config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20240117/[email protected]/config)
> compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240117/[email protected]/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <[email protected]>
> | Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
>
> All error/warnings (new ones prefixed by >>):
>


Thanks, will fix the build with !CONFIG_TRANSPARENT_HUGEPAGE.
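
A minimal sketch of one possible fix (an assumption about the next
version, not necessarily what it will actually do) is to stop using the
THP-only 'vma' local in the fallback path, so that the
!CONFIG_TRANSPARENT_HUGEPAGE build only dereferences vmf:

	/*
	 * Sketch: use vmf->vma directly so the fallback return does not
	 * depend on the 'vma' local declared inside the
	 * CONFIG_TRANSPARENT_HUGEPAGE block. Alternatively, the 'vma'
	 * declaration could be hoisted above the #ifdef.
	 */
fallback:
#endif
	return folio_prealloc(vmf->vma->vm_mm, vmf->vma, vmf->address, true);
}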

> mm/memory.c: In function 'alloc_anon_folio':
>>> mm/memory.c:4223:31: error: 'vma' undeclared (first use in this function); did you mean 'vmf'?
> 4223 | return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
> | ^~~
> | vmf
> mm/memory.c:4223:31: note: each undeclared identifier is reported only once for each function it appears in
>>> mm/memory.c:4224:1: warning: control reaches end of non-void function [-Wreturn-type]
> 4224 | }
> | ^
>
>
> vim +4223 mm/memory.c
>
> 4153
> 4154 static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> 4155 {
> 4156 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> 4157 struct vm_area_struct *vma = vmf->vma;
> 4158 unsigned long orders;
> 4159 struct folio *folio;
> 4160 unsigned long addr;
> 4161 pte_t *pte;
> 4162 gfp_t gfp;
> 4163 int order;
> 4164
> 4165 /*
> 4166 * If uffd is active for the vma we need per-page fault fidelity to
> 4167 * maintain the uffd semantics.
> 4168 */
> 4169 if (unlikely(userfaultfd_armed(vma)))
> 4170 goto fallback;
> 4171
> 4172 /*
> 4173 * Get a list of all the (large) orders below PMD_ORDER that are enabled
> 4174 * for this vma. Then filter out the orders that can't be allocated over
> 4175 * the faulting address and still be fully contained in the vma.
> 4176 */
> 4177 orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> 4178 BIT(PMD_ORDER) - 1);
> 4179 orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> 4180
> 4181 if (!orders)
> 4182 goto fallback;
> 4183
> 4184 pte = pte_offset_map(vmf->pmd, vmf->address & PMD_MASK);
> 4185 if (!pte)
> 4186 return ERR_PTR(-EAGAIN);
> 4187
> 4188 /*
> 4189 * Find the highest order where the aligned range is completely
> 4190 * pte_none(). Note that all remaining orders will be completely
> 4191 * pte_none().
> 4192 */
> 4193 order = highest_order(orders);
> 4194 while (orders) {
> 4195 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> 4196 if (pte_range_none(pte + pte_index(addr), 1 << order))
> 4197 break;
> 4198 order = next_order(&orders, order);
> 4199 }
> 4200
> 4201 pte_unmap(pte);
> 4202
> 4203 /* Try allocating the highest of the remaining orders. */
> 4204 gfp = vma_thp_gfp_mask(vma);
> 4205 while (orders) {
> 4206 addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> 4207 folio = vma_alloc_folio(gfp, order, vma, addr, true);
> 4208 if (folio) {
> 4209 if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> 4210 folio_put(folio);
> 4211 goto next;
> 4212 }
> 4213 folio_throttle_swaprate(folio, gfp);
> 4214 clear_huge_page(&folio->page, vmf->address, 1 << order);
> 4215 return folio;
> 4216 }
> 4217 next:
> 4218 order = next_order(&orders, order);
> 4219 }
> 4220
> 4221 fallback:
> 4222 #endif
>> 4223 return folio_prealloc(vma->vm_mm, vma, vmf->address, true);
>> 4224 }
> 4225
>

2024-01-17 01:34:21

by Kefeng Wang

Subject: Re: [PATCH] mm: memory: move mem_cgroup_charge() into alloc_anon_folio()



On 2024/1/16 23:07, Ryan Roberts wrote:
> On 16/01/2024 14:51, Matthew Wilcox wrote:
>> On Tue, Jan 16, 2024 at 02:35:54PM +0000, Ryan Roberts wrote:
>>> On 16/01/2024 07:13, Kefeng Wang wrote:
>>>> In order to allocate a large folio whenever possible, move the
>>>> memcg charge into alloc_anon_folio() and try the next lower order
>>>> if mem_cgroup_charge() fails. Also change GFP_KERNEL to gfp to be
>>>> consistent with PMD-sized THP.
>>>
>>> I agree that changing gfp gives you consistency. But it's not entirely clear to
>>> me why THP should use one set of flags for this case and single pages another.
>>> Why does this difference exist?
>>
>> I think it needs to be spelled out much better in the changelog. Here's
>> my attempt at explaining why we might want this change.
>>
>> mem_cgroup_charge() uses the GFP flags in a fairly sophisticated way.
>> In addition to checking gfpflags_allow_blocking(), it pays attention to
>> __GFP_NORETRY and __GFP_RETRY_MAYFAIL to ensure that processes within
>> this memcg do not exceed their quotas. Using the same GFP flags ensures
>> that we handle large anonymous folios correctly, including falling back
>> to smaller orders when there is plenty of memory available in the system
>> but this memcg is close to its limits.
>
> Thanks for the explanation. Please add to the commit log.

Thanks, that is much better, I will update the changelog. A similar
change was made for THP, see commit 3b3636924dfe ("mm, memcg: sync
allocation and memcg charge gfp flags for THP").

>
> Essentially you are saying that previously, all mTHP allocations would cause
> reclaim from the memcg if the allocation caused the quota to be used up. But
> with this change, it might now avoid that reclaim and just fail the charge, if
> the flags say so? So then we retry with the next lowest available size. Makes sense!
>

With the correct GFP flags, we could get less reclaim and a faster
fallback to the next order, which is what I want too.

>
>>
>> ... I remain not-an-expert in memcg and anonymous memory and welcome
>> improvements to that text.
>
> Me too...