2024-02-20 23:18:04

by Vishal Moola

Subject: [PATCH 0/3] Handle hugetlb faults under the VMA lock

It is generally safe to handle hugetlb faults under the VMA lock. The
only time this is unsafe is when no anon_vma has been allocated for this
vma yet, so we can use vmf_anon_prepare() instead of anon_vma_prepare()
to bail out if necessary. This may only happen for the first non-shared
hugetlb page in the vma.
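
For reference, the bail-out logic relied on here is the existing
vmf_anon_prepare() in mm/memory.c, which looks roughly like this
(comments added for illustration):

	vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
	{
		struct vm_area_struct *vma = vmf->vma;

		/* Nothing to do if an anon_vma is already set up. */
		if (likely(vma->anon_vma))
			return 0;
		/* Cannot safely allocate one under the VMA lock; retry. */
		if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
			vma_end_read(vma);
			return VM_FAULT_RETRY;
		}
		if (__anon_vma_prepare(vma))
			return VM_FAULT_OOM;
		return 0;
	}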

-----
The last patch in this series may cause ltp hugemmap10 to "fail". This
is expected behavior - see the commit message for patch 3 in this series.
The rest of the ltp hugetlb tests pass.

This patchset applies cleanly on top of mm-unstable.

Vishal Moola (Oracle) (3):
mm/memory: Change vmf_anon_prepare() to be non-static
hugetlb: Use vmf_anon_prepare() instead of anon_vma_prepare()
hugetlb: Allow faults to be handled under the VMA lock

include/linux/hugetlb.h | 1 +
mm/hugetlb.c | 33 +++++++++++++++++++++------------
mm/memory.c | 2 +-
3 files changed, 23 insertions(+), 13 deletions(-)

--
2.43.0



2024-02-20 23:18:09

by Vishal Moola

Subject: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static

In order to handle hugetlb faults under the VMA lock, hugetlb can use
vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change
it to be a non-static function so it can be used within hugetlb as well.

Signed-off-by: Vishal Moola (Oracle) <[email protected]>
---
include/linux/hugetlb.h | 1 +
mm/memory.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b1..9b45edb6e303 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -272,6 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
void hugetlb_vma_lock_release(struct kref *kref);
+vm_fault_t vmf_anon_prepare(struct vm_fault *vmf);

int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pud);
diff --git a/mm/memory.c b/mm/memory.c
index 89bcae0b224d..c93b058adfb2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3081,7 +3081,7 @@ static inline vm_fault_t vmf_can_call_fault(const struct vm_fault *vmf)
return VM_FAULT_RETRY;
}

-static vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
+vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;

--
2.43.0


2024-02-20 23:18:22

by Vishal Moola

Subject: [PATCH 2/3] hugetlb: Use vmf_anon_prepare() instead of anon_vma_prepare()

hugetlb_no_page() and hugetlb_wp() call anon_vma_prepare(). In
preparation for hugetlb to safely handle faults under the VMA lock,
use vmf_anon_prepare() here instead.

Additionally, define a struct vm_fault at the top of each function.
These can later be used to convert hugetlb to use struct vm_fault -
similar to mm/memory.

Signed-off-by: Vishal Moola (Oracle) <[email protected]>
---
mm/hugetlb.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..10f57306e1f0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5834,9 +5834,15 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
struct folio *old_folio;
struct folio *new_folio;
int outside_reserve = 0;
- vm_fault_t ret = 0;
+ vm_fault_t ret = 0, anon_ret = 0;
unsigned long haddr = address & huge_page_mask(h);
struct mmu_notifier_range range;
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .real_address = address,
+ .flags = flags,
+ };

/*
* Never handle CoW for uffd-wp protected pages. It should be only
@@ -5960,8 +5966,9 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* When the original hugepage is shared one, it does not have
* anon_vma prepared.
*/
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
+ anon_ret = vmf_anon_prepare(&vmf);
+ if (unlikely(anon_ret)) {
+ ret = anon_ret;
goto out_release_all;
}

@@ -6119,7 +6126,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
pte_t old_pte, unsigned int flags)
{
struct hstate *h = hstate_vma(vma);
- vm_fault_t ret = VM_FAULT_SIGBUS;
+ vm_fault_t ret = VM_FAULT_SIGBUS, anon_ret = 0;
int anon_rmap = 0;
unsigned long size;
struct folio *folio;
@@ -6128,6 +6135,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
unsigned long haddr = address & huge_page_mask(h);
bool new_folio, new_pagecache_folio = false;
u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
+ struct vm_fault vmf = {
+ .vma = vma,
+ .address = haddr,
+ .real_address = address,
+ .flags = flags,
+ };

/*
* Currently, we are forced to kill the process in the event the
@@ -6221,8 +6234,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
new_pagecache_folio = true;
} else {
folio_lock(folio);
- if (unlikely(anon_vma_prepare(vma))) {
- ret = VM_FAULT_OOM;
+
+ anon_ret = vmf_anon_prepare(&vmf);
+ if (unlikely(anon_ret)) {
+ ret = anon_ret;
goto backout_unlocked;
}
anon_rmap = 1;
--
2.43.0


2024-02-20 23:18:33

by Vishal Moola

Subject: [PATCH 3/3] hugetlb: Allow faults to be handled under the VMA lock

Hugetlb can now safely handle faults under the VMA lock, so allow it to
do so.

This patch may cause ltp hugemmap10 to "fail". Hugemmap10 tests hugetlb
counters, and expects the counters to remain unchanged on failure to
handle a fault.

In hugetlb_no_page(), vmf_anon_prepare() may bail out with no anon_vma
under the VMA lock after allocating a folio for the hugepage. In
free_huge_folio(), this folio is completely freed on bailout iff there
is a surplus of hugetlb pages. This removes a folio from the freelist
and decrements the number of hugepages, while ltp expects these counters
to remain unchanged on failure.

Originally this could only happen due to OOM failures, but now it may
also occur after we allocate a hugetlb folio without a suitable anon_vma
under the VMA lock. This should only happen for the first freshly
allocated hugepage in this vma.
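
An illustrative outline of the path in question (not part of the patch,
details simplified):

	/*
	 * hugetlb_no_page()
	 *   allocates a fresh hugetlb folio
	 *   vmf_anon_prepare(&vmf)  -> returns VM_FAULT_RETRY under the VMA lock
	 *   goto backout_unlocked   -> the newly allocated folio is dropped
	 *     free_huge_folio(): with surplus hugetlb pages present, the folio
	 *     is freed outright and the hugepage counters drop, which is what
	 *     hugemmap10 reports as a failure
	 */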

Signed-off-by: Vishal Moola (Oracle) <[email protected]>
---
mm/hugetlb.c | 6 ------
1 file changed, 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 10f57306e1f0..ed472510699d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6376,12 +6376,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int need_wait_lock = 0;
unsigned long haddr = address & huge_page_mask(h);

- /* TODO: Handle faults under the VMA lock */
- if (flags & FAULT_FLAG_VMA_LOCK) {
- vma_end_read(vma);
- return VM_FAULT_RETRY;
- }
-
/*
* Serialize hugepage allocation and instantiation, so that we don't
* get spurious allocation failures if two CPUs race to instantiate
--
2.43.0


2024-02-21 03:37:15

by Matthew Wilcox

Subject: Re: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static

On Tue, Feb 20, 2024 at 03:14:22PM -0800, Vishal Moola (Oracle) wrote:
> In order to handle hugetlb faults under the VMA lock, hugetlb can use
> vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change
> it to be a non-static function so it can be used within hugetlb as well.

I think the prototype for this should probably live in mm/internal.h?
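
That would amount to something like this (illustrative sketch of the
suggested placement):

	/* mm/internal.h */
	vm_fault_t vmf_anon_prepare(struct vm_fault *vmf);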

> +++ b/include/linux/hugetlb.h
> @@ -272,6 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
> int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
> void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
> void hugetlb_vma_lock_release(struct kref *kref);
> +vm_fault_t vmf_anon_prepare(struct vm_fault *vmf);
>
> int pmd_huge(pmd_t pmd);
> int pud_huge(pud_t pud);
>

2024-02-21 03:50:32

by Matthew Wilcox

Subject: Re: [PATCH 2/3] hugetlb: Use vmf_anon_prepare() instead of anon_vma_prepare()

On Tue, Feb 20, 2024 at 03:14:23PM -0800, Vishal Moola (Oracle) wrote:
> +++ b/mm/hugetlb.c
> @@ -5834,9 +5834,15 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> struct folio *old_folio;
> struct folio *new_folio;
> int outside_reserve = 0;
> - vm_fault_t ret = 0;
> + vm_fault_t ret = 0, anon_ret = 0;

Do we need a second variable here? Seems to me like we could
unconditionally assign to ret:

> - if (unlikely(anon_vma_prepare(vma))) {
> - ret = VM_FAULT_OOM;
> + anon_ret = vmf_anon_prepare(&vmf);
> + if (unlikely(anon_ret)) {
> + ret = anon_ret;



> unsigned long haddr = address & huge_page_mask(h);
> struct mmu_notifier_range range;
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = haddr,
> + .real_address = address,
> + .flags = flags,
> + };

We don't usually indent quite so far. One extra tab would be enough.
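
i.e. something like this (illustrative, one tab beyond the declaration):

	struct vm_fault vmf = {
		.vma = vma,
		.address = haddr,
		.real_address = address,
		.flags = flags,
	};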

Also, I thought we talked about creating the vmf in hugetlb_fault(),
then passing it to hugetlb_wp() hugetlb_no_page() and handle_userfault()?
Was there a reason to abandon that idea?


2024-02-21 16:56:07

by Vishal Moola

Subject: Re: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static

On Tue, Feb 20, 2024 at 7:36 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Feb 20, 2024 at 03:14:22PM -0800, Vishal Moola (Oracle) wrote:
> > In order to handle hugetlb faults under the VMA lock, hugetlb can use
> > vmf_anon_prepare() to ensure we can safely prepare an anon_vma. Change
> > it to be a non-static function so it can be used within hugetlb as well.
>
> I think the prototype for this should probably live in mm/internal.h?

That does make more sense; I'll move it for v2.

> > +++ b/include/linux/hugetlb.h
> > @@ -272,6 +272,7 @@ void hugetlb_vma_unlock_write(struct vm_area_struct *vma);
> > int hugetlb_vma_trylock_write(struct vm_area_struct *vma);
> > void hugetlb_vma_assert_locked(struct vm_area_struct *vma);
> > void hugetlb_vma_lock_release(struct kref *kref);
> > +vm_fault_t vmf_anon_prepare(struct vm_fault *vmf);
> >
> > int pmd_huge(pmd_t pmd);
> > int pud_huge(pud_t pud);
> >

2024-02-21 17:16:37

by Vishal Moola

Subject: Re: [PATCH 2/3] hugetlb: Use vmf_anon_prepare() instead of anon_vma_prepare()

On Tue, Feb 20, 2024 at 7:46 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Feb 20, 2024 at 03:14:23PM -0800, Vishal Moola (Oracle) wrote:
> > +++ b/mm/hugetlb.c
> > @@ -5834,9 +5834,15 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
> > struct folio *old_folio;
> > struct folio *new_folio;
> > int outside_reserve = 0;
> > - vm_fault_t ret = 0;
> > + vm_fault_t ret = 0, anon_ret = 0;
>
> Do we need a second variable here? Seems to me like we could
> unconditionally assign to ret:

Hmm, looks like we can directly assign to ret without any problems
in both functions; I'll change that for v2.
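
i.e. roughly (a sketch of what v2 could look like in hugetlb_wp()):

	ret = vmf_anon_prepare(&vmf);
	if (unlikely(ret))
		goto out_release_all;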

> > - if (unlikely(anon_vma_prepare(vma))) {
> > - ret = VM_FAULT_OOM;
> > + anon_ret = vmf_anon_prepare(&vmf);
> > + if (unlikely(anon_ret)) {
> > + ret = anon_ret;
>
>
>
> > unsigned long haddr = address & huge_page_mask(h);
> > struct mmu_notifier_range range;
> > + struct vm_fault vmf = {
> > + .vma = vma,
> > + .address = haddr,
> > + .real_address = address,
> > + .flags = flags,
> > + };
>
> We don't usually indent quite so far. One extra tab would be enough.
>
> Also, I thought we talked about creating the vmf in hugetlb_fault(),
> then passing it to hugetlb_wp() hugetlb_no_page() and handle_userfault()?
> Was there a reason to abandon that idea?

No, I haven't abandoned that idea; I intend to have a separate patchset to go
on top of this one - just keeping them separate since they are conceptually
different. I'm converting each function to use struct vm_fault first, then
shifting it to be passed throughout as an argument while cleaning up the
excess variables lying around. In a sense, working bottom-up instead
of top-down.

2024-02-21 17:31:47

by kernel test robot

Subject: Re: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static

Hi Vishal,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.8-rc5 next-20240221]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Vishal-Moola-Oracle/mm-memory-Change-vmf_anon_prepare-to-be-non-static/20240221-071907
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240220231424.126600-2-vishal.moola%40gmail.com
patch subject: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static
config: arm-defconfig (https://download.01.org/0day-ci/archive/20240222/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project.git f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240222/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> mm/memory.c:3283:12: warning: no previous prototype for function 'vmf_anon_prepare' [-Wmissing-prototypes]
vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
^
mm/memory.c:3283:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
^
static
1 warning generated.


vim +/vmf_anon_prepare +3283 mm/memory.c

3282
> 3283 vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
3284 {
3285 struct vm_area_struct *vma = vmf->vma;
3286
3287 if (likely(vma->anon_vma))
3288 return 0;
3289 if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
3290 vma_end_read(vma);
3291 return VM_FAULT_RETRY;
3292 }
3293 if (__anon_vma_prepare(vma))
3294 return VM_FAULT_OOM;
3295 return 0;
3296 }
3297

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2024-02-21 17:55:33

by Matthew Wilcox

Subject: Re: [PATCH 2/3] hugetlb: Use vmf_anon_prepare() instead of anon_vma_prepare()

On Wed, Feb 21, 2024 at 09:15:51AM -0800, Vishal Moola wrote:
> > > unsigned long haddr = address & huge_page_mask(h);
> > > struct mmu_notifier_range range;
> > > + struct vm_fault vmf = {
> > > + .vma = vma,
> > > + .address = haddr,
> > > + .real_address = address,
> > > + .flags = flags,
> > > + };
> >
> > We don't usually indent quite so far. One extra tab would be enough.
> >
> > Also, I thought we talked about creating the vmf in hugetlb_fault(),
> > then passing it to hugetlb_wp() hugetlb_no_page() and handle_userfault()?
> > Was there a reason to abandon that idea?
>
> No, I haven't abandoned that idea; I intend to have a separate patchset to go
> on top of this one - just keeping them separate since they are conceptually
> different. I'm converting each function to use struct vm_fault first, then
> shifting it to be passed throughout as an argument while cleaning up the
> excess variables lying around. In a sense, working bottom-up instead
> of top-down.

I think you'll find it less work to create it in hugetlb_fault()
first. i.e. patch 2 could be to hoist its creation from half-way down
hugetlb_fault to the top of hugetlb_fault. Patch 3 could pass it
through hugetlb_no_page() to hugetlb_handle_userfault() and remove its
creation there. Now you've already got it, and can make use of it in
this patch, which would be the new patch 4.

If you want to do a cleanup patch afterwards, you could hoist the vmf
creation all the way to handle_mm_fault() ;-)
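
A rough sketch of that shape (argument lists elided, purely illustrative):

	vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
				 unsigned long address, unsigned int flags)
	{
		struct vm_fault vmf = {
			.vma = vma,
			.address = address & huge_page_mask(hstate_vma(vma)),
			.real_address = address,
			.flags = flags,
		};
		...
		/* build vmf once here, then hand &vmf down the call chain */
		ret = hugetlb_no_page(..., &vmf);
		...
	}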

2024-02-21 18:03:04

by Vishal Moola

Subject: Re: [PATCH 2/3] hugetlb: Use vmf_anon_prepare() instead of anon_vma_prepare()

On Wed, Feb 21, 2024 at 9:55 AM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Feb 21, 2024 at 09:15:51AM -0800, Vishal Moola wrote:
> > > > unsigned long haddr = address & huge_page_mask(h);
> > > > struct mmu_notifier_range range;
> > > > + struct vm_fault vmf = {
> > > > + .vma = vma,
> > > > + .address = haddr,
> > > > + .real_address = address,
> > > > + .flags = flags,
> > > > + };
> > >
> > > We don't usually indent quite so far. One extra tab would be enough.
> > >
> > > Also, I thought we talked about creating the vmf in hugetlb_fault(),
> > > then passing it to hugetlb_wp() hugetlb_no_page() and handle_userfault()?
> > > Was there a reason to abandon that idea?
> >
> > No, I haven't abandoned that idea; I intend to have a separate patchset to go
> > on top of this one - just keeping them separate since they are conceptually
> > different. I'm converting each function to use struct vm_fault first, then
> > shifting it to be passed throughout as an argument while cleaning up the
> > excess variables lying around. In a sense, working bottom-up instead
> > of top-down.
>
> I think you'll find it less work to create it in hugetlb_fault()
> first. i.e. patch 2 could be to hoist its creation from half-way down
> hugetlb_fault to the top of hugetlb_fault. Patch 3 could pass it
> through hugetlb_no_page() to hugetlb_handle_userfault() and remove its
> creation there. Now you've already got it, and can make use of it in
> this patch, which would be the new patch 4.

Ah I see, that way would definitely be a lot less work. I'll make that
change for this patchset in v2 then.

> If you want to do a cleanup patch afterwards, you could hoist the vmf
> creation all the way to handle_mm_fault() ;-)

Yeah, I was already looking at doing that in the future patchset :)

2024-02-21 19:38:27

by kernel test robot

Subject: Re: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static

Hi Vishal,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.8-rc5 next-20240221]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Vishal-Moola-Oracle/mm-memory-Change-vmf_anon_prepare-to-be-non-static/20240221-071907
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20240220231424.126600-2-vishal.moola%40gmail.com
patch subject: [PATCH 1/3] mm/memory: Change vmf_anon_prepare() to be non-static
config: alpha-allnoconfig (https://download.01.org/0day-ci/archive/20240222/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240222/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> mm/memory.c:3283:12: warning: no previous prototype for 'vmf_anon_prepare' [-Wmissing-prototypes]
3283 | vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
| ^~~~~~~~~~~~~~~~


vim +/vmf_anon_prepare +3283 mm/memory.c

3282
> 3283 vm_fault_t vmf_anon_prepare(struct vm_fault *vmf)
3284 {
3285 struct vm_area_struct *vma = vmf->vma;
3286
3287 if (likely(vma->anon_vma))
3288 return 0;
3289 if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
3290 vma_end_read(vma);
3291 return VM_FAULT_RETRY;
3292 }
3293 if (__anon_vma_prepare(vma))
3294 return VM_FAULT_OOM;
3295 return 0;
3296 }
3297

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki