2022-10-30 22:02:54

by Peter Xu

Subject: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

This can be seen as a follow-up series to Mike's recent hugetlb vma lock
series for pmd unsharing, so this series also depends on that one. However,
some huge_pte_offset() paths still seem to be racy against pmd unsharing
(as they don't take the vma lock); more below.

Hopefully this series can make the resolution for pmd unsharing more
complete.

Problem
=======

huge_pte_offset() is a major helper used by hugetlb code paths to walk a
hugetlb pgtable. It's used almost everywhere, since the walk is needed even
before taking the pgtable lock.

huge_pte_offset() is always called with the mmap lock held, either for read
or for write.

For normal memory types that is enough, since any pgtable removal requires
the mmap write lock (e.g. munmap or mm destruction). However, hugetlb has
the pmd unshare feature, which means that not only can a pgtable page be
removed from under us during the walk, but the page at risk can be the very
pgtable page we are walking (after an unshare, it can only be the shared
PUD-level pgtable page that contains 512 huge pmd entries, and only for
VM_SHARED mappings). This is possible because, even though freeing the
pgtable page requires the mmap write lock, that lock is not necessarily
ours: the shared page is effectively another mm's pgtable, so the walk is
still at risk even when we hold current->mm's mmap lock.
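
Schematically, the worry is something like below (a sketch only, not a
reproduced report; huge_pte_offset() and huge_ptep_get() are the existing
helpers):

    walker                                 someone else
    (holds current->mm's mmap lock)        (can free the page without that lock)
    -------------------------------        -------------------------------------
    ptep = huge_pte_offset(mm, addr, sz);
      /* ptep points into the shared
         pmd pgtable page */
                                           pmd unshare drops the last reference,
                                           the pmd pgtable page gets freed
    pte = huge_ptep_get(ptep);
      /* use-after-free */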

The recent work from Mike on the vma lock resolves most of this already:
pmd unsharing is forbidden while the lock is held, so there is no further
risk of the pgtable page being freed.

But that only works if we take the vma lock around every use of
huge_pte_offset(). As of the latest mm-unstable a bunch of call sites
already do, but quite a few still don't, for various reasons; e.g. the vma
lock may not be applicable in contexts that are not allowed to sleep, such
as FOLL_NOWAIT.

I have no report showing that such a race can actually be triggered, but
code-wise I see nothing that stops it from happening. This series tries to
resolve that problem.

Resolution
==========

What this series proposes is that, besides using the vma lock, we can also
use RCU to protect the pgtable page from being freed from under us while
huge_pte_offset() is used. The idea is similar to RCU fast-gup. Note that
fast-gup is already safe regarding pmd unsharing, even without the vma lock,
because fast-gup relies on RCU to protect walking any pgtable page,
including another mm's.

Applying the same idea to huge_pte_offset() means that, with proper RCU
protection, the pte_t* pointer returned from huge_pte_offset() is always
safe to access and de-reference, together with the pgtable lock that is
bound to the pgtable page.
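
As a rough illustration of the pattern (a sketch only; the exact rule is
spelled out by the comment that patch 2 adds, and the helpers used below
are the existing hugetlb ones):

	/* Sketch of the intended pattern, not a verbatim hunk from the series */
	rcu_read_lock();	/* pins the (possibly shared) pmd pgtable page */
	ptep = huge_pte_offset(mm, addr, huge_page_size(h));
	if (!ptep) {
		rcu_read_unlock();
		return;
	}
	ptl = huge_pte_lock(h, mm, ptep);
	/* ... safe to access *ptep here, e.g. huge_ptep_get(ptep) ... */
	spin_unlock(ptl);
	rcu_read_unlock();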

Patch Layout
============

Patch 1 is a trivial cleanup that I noticed while working on this. Please
shout if anyone thinks I should post it separately; otherwise hopefully I
can just carry it over in this series.

Patch 2 is the guts of the patchset, describing how the helper
huge_pte_offset() should be used correctly. It is only a comment patch but
should be the most important one, as the follow-up patches are just trying
to follow the rule it sets up.

The remaining patches go over all the call sites of huge_pte_offset() and
make sure each of them is covered: either the caller already holds the vma
lock (which is perfectly good enough for safety here; the last patch adds
comments to all such callers explaining why they are safe, so that we won't
miss a single case), or the patch adds RCU protection around that caller of
huge_pte_offset().

Tests
=====

Only lightly tested with the hugetlb kselftests, including uffd; no more
errors are triggered than with current mm-unstable (hugetlb-madvise fails
both before and after this series with "Unexpected number of free huge
pages line 207"; I haven't had time to look into it yet).

Since so far this was only discussed quickly with Mike in the other thread,
I'm marking it as RFC for now, as I could have missed something.

Comments welcomed, thanks.

Peter Xu (10):
mm/hugetlb: Let vma_offset_start() to return start
mm/hugetlb: Comment huge_pte_offset() for its locking requirements
mm/hugetlb: Make hugetlb_vma_maps_page() RCU-safe
mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe
mm/hugetlb: Make walk_hugetlb_range() RCU-safe
mm/hugetlb: Make page_vma_mapped_walk() RCU-safe
mm/hugetlb: Make hugetlb_follow_page_mask() RCU-safe
mm/hugetlb: Make follow_hugetlb_page RCU-safe
mm/hugetlb: Make hugetlb_fault() RCU-safe
mm/hugetlb: Comment at rest huge_pte_offset() places

arch/arm64/mm/hugetlbpage.c | 32 ++++++++++++++++++++++++++
fs/hugetlbfs/inode.c | 39 ++++++++++++++++++--------------
fs/userfaultfd.c | 4 ++++
include/linux/rmap.h | 3 +++
mm/hugetlb.c | 45 +++++++++++++++++++++++++++++++++++--
mm/page_vma_mapped.c | 7 +++++-
mm/pagewalk.c | 5 +++++
7 files changed, 115 insertions(+), 20 deletions(-)

--
2.37.3



2022-10-30 22:03:15

by Peter Xu

Subject: [PATCH RFC 06/10] mm/hugetlb: Make page_vma_mapped_walk() RCU-safe

RCU makes sure the pte_t* won't go away from under us. Please refer to the
comment above huge_pte_offset() for more information.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/rmap.h | 3 +++
mm/page_vma_mapped.c | 7 ++++++-
2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index bd3504d11b15..d2c5e69a56f2 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -408,6 +408,9 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
pte_unmap(pvmw->pte);
if (pvmw->ptl)
spin_unlock(pvmw->ptl);
+ /* Hugetlb uses RCU lock for safe access of huge_pte_offset() */
+ if (is_vm_hugetlb_page(pvmw->vma))
+ rcu_read_unlock();
}

bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw);
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..513210a59d7b 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -169,10 +169,15 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pte)
return not_found(pvmw);

+ /* For huge_pte_offset() */
+ rcu_read_lock();
+
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, size);
- if (!pvmw->pte)
+ if (!pvmw->pte) {
+ rcu_read_unlock();
return false;
+ }

pvmw->ptl = huge_pte_lock(hstate, mm, pvmw->pte);
if (!check_pte(pvmw))
--
2.37.3


2022-10-30 22:03:16

by Peter Xu

Subject: [PATCH RFC 03/10] mm/hugetlb: Make hugetlb_vma_maps_page() RCU-safe

RCU makes sure the pte_t* won't go away from under us. Please refer to the
comment above huge_pte_offset() for more information.

Signed-off-by: Peter Xu <[email protected]>
---
fs/hugetlbfs/inode.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ac3a69fe29c3..b9e7825079c7 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -387,21 +387,26 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
unsigned long addr, struct page *page)
{
pte_t *ptep, pte;
+ bool result = false;
+
+ /* For huge_pte_offset() */
+ rcu_read_lock();

ptep = huge_pte_offset(vma->vm_mm, addr,
huge_page_size(hstate_vma(vma)));

if (!ptep)
- return false;
+ goto out;

pte = huge_ptep_get(ptep);
if (huge_pte_none(pte) || !pte_present(pte))
- return false;
+ goto out;

if (pte_page(pte) == page)
- return true;
-
- return false;
+ result = true;
+out:
+ rcu_read_unlock();
+ return result;
}

/*
--
2.37.3


2022-10-30 22:03:27

by Peter Xu

Subject: [PATCH RFC 04/10] mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe

RCU makes sure the pte_t* won't go away from under us. Please refer to the
comment above huge_pte_offset() for more information.

Signed-off-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 07c81ab3fd4d..4e813e68e4f8 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -243,6 +243,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,

mmap_assert_locked(mm);

+ /* For huge_pte_offset() */
+ rcu_read_lock();
+
ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));

if (!ptep)
@@ -261,6 +264,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
ret = true;
out:
+ rcu_read_unlock();
return ret;
}
#else
--
2.37.3


2022-10-30 22:03:30

by Peter Xu

Subject: [PATCH RFC 10/10] mm/hugetlb: Comment at rest huge_pte_offset() places

This makes sure we cover all the existing huge_pte_offset() callers and
documents why each of them is safe with regard to pmd unsharing.

Signed-off-by: Peter Xu <[email protected]>
---
mm/hugetlb.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d336d286394..270bfc578115 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4822,6 +4822,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
last_addr_mask = hugetlb_mask_last_page(h);
for (addr = src_vma->vm_start; addr < src_vma->vm_end; addr += sz) {
spinlock_t *src_ptl, *dst_ptl;
+ /* With vma lock held, safe without RCU */
src_pte = huge_pte_offset(src, addr, sz);
if (!src_pte) {
addr |= last_addr_mask;
@@ -5026,6 +5027,7 @@ int move_hugetlb_page_tables(struct vm_area_struct *vma,
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(mapping);
for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
+ /* With vma lock held, safe without RCU */
src_pte = huge_pte_offset(mm, old_addr, sz);
if (!src_pte) {
old_addr |= last_addr_mask;
@@ -5097,6 +5099,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
last_addr_mask = hugetlb_mask_last_page(h);
address = start;
for (; address < end; address += sz) {
+ /* With vma lock held, safe without RCU */
ptep = huge_pte_offset(mm, address, sz);
if (!ptep) {
address |= last_addr_mask;
@@ -5402,6 +5405,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
mutex_lock(&hugetlb_fault_mutex_table[hash]);
hugetlb_vma_lock_read(vma);
spin_lock(ptl);
+ /* With vma lock held, safe without RCU */
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep &&
pte_same(huge_ptep_get(ptep), pte)))
@@ -5440,6 +5444,7 @@ static vm_fault_t hugetlb_wp(struct mm_struct *mm, struct vm_area_struct *vma,
* before the page tables are altered
*/
spin_lock(ptl);
+ /* With vma lock (and even pgtable lock) held, safe without RCU */
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
/* Break COW or unshare */
@@ -6511,6 +6516,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
last_addr_mask = hugetlb_mask_last_page(h);
for (; address < end; address += psize) {
spinlock_t *ptl;
+ /* With vma lock held, safe without RCU */
ptep = huge_pte_offset(mm, address, psize);
if (!ptep) {
address |= last_addr_mask;
@@ -7060,7 +7066,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,

saddr = page_table_shareable(svma, vma, addr, idx);
if (saddr) {
+ /*
+ * huge_pmd_share() (or say its solo caller,
+ * huge_pte_alloc()) always takes the hugetlb vma
+ * lock, so it's always safe to walk the pgtable of
+ * the process, even without RCU.
+ */
spte = huge_pte_offset(svma->vm_mm, saddr,
vma_mmu_pagesize(svma));
if (spte) {
get_page(virt_to_page(spte));
@@ -7420,6 +7433,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma)
hugetlb_vma_lock_write(vma);
i_mmap_lock_write(vma->vm_file->f_mapping);
for (address = start; address < end; address += PUD_SIZE) {
+ /* With vma lock held, safe without RCU */
ptep = huge_pte_offset(mm, address, sz);
if (!ptep)
continue;
--
2.37.3


2022-11-01 05:58:52

by Nadav Amit

Subject: Re: [PATCH RFC 10/10] mm/hugetlb: Comment at rest huge_pte_offset() places

On Oct 30, 2022, at 2:30 PM, Peter Xu <[email protected]> wrote:

> + /* With vma lock held, safe without RCU */
> src_pte = huge_pte_offset(src, addr, sz);

Just another option to consider: you can create an inline function
huge_pte_offset_locked_mm(), which would do the lockdep_assert_held_*().

I personally would prefer it, since it would clarify exactly which lock you
care about and "make the code document itself".
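
A minimal sketch of what such a wrapper could look like (the name and the
exact assertion are hypothetical here, just to illustrate the suggestion):

	/* Hypothetical helper per the suggestion above, not from the series */
	static inline pte_t *huge_pte_offset_locked_mm(struct mm_struct *mm,
						       unsigned long addr,
						       unsigned long sz)
	{
		mmap_assert_locked(mm); /* or an assertion on the hugetlb vma lock */
		return huge_pte_offset(mm, addr, sz);
	}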

2022-11-02 18:23:37

by James Houghton

Subject: Re: [PATCH RFC 04/10] mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe

On Sun, Oct 30, 2022 at 2:29 PM Peter Xu <[email protected]> wrote:
>
> RCU makes sure the pte_t* won't go away from under us. Please refer to the
> comment above huge_pte_offset() for more information.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> fs/userfaultfd.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 07c81ab3fd4d..4e813e68e4f8 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -243,6 +243,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
>
> mmap_assert_locked(mm);
>
> + /* For huge_pte_offset() */
> + rcu_read_lock();

userfaultfd_huge_must_wait is called after we set the task's state to
blocking. Is it always safe to call rcu_read_lock (and
rcu_read_unlock) in this case? (With my basic understanding of RCU,
this seems like it should be safe, but I'm not sure.)

- James


> +
> ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
>
> if (!ptep)
> @@ -261,6 +264,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> ret = true;
> out:
> + rcu_read_unlock();
> return ret;
> }
> #else
> --
> 2.37.3
>

2022-11-02 21:28:22

by Peter Xu

Subject: Re: [PATCH RFC 04/10] mm/hugetlb: Make userfaultfd_huge_must_wait() RCU-safe

On Wed, Nov 02, 2022 at 11:06:16AM -0700, James Houghton wrote:
> On Sun, Oct 30, 2022 at 2:29 PM Peter Xu <[email protected]> wrote:
> >
> > RCU makes sure the pte_t* won't go away from under us. Please refer to the
> > comment above huge_pte_offset() for more information.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > fs/userfaultfd.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index 07c81ab3fd4d..4e813e68e4f8 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -243,6 +243,9 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> >
> > mmap_assert_locked(mm);
> >
> > + /* For huge_pte_offset() */
> > + rcu_read_lock();
>
> userfaultfd_huge_must_wait is called after we set the task's state to
> blocking. Is it always safe to call rcu_read_lock (and
> rcu_read_unlock) in this case? (With my basic understanding of RCU,
> this seems like it should be safe, but I'm not sure.)

I'm not aware of an issue here, but please shoot if you have any further
concerns or clues, because I'm definitely not an RCU person so I can
overlook things.

What I remember is that my smoke test was run with LOCKDEP, and it hasn't
triggered anything so far.

--
Peter Xu


2022-11-02 21:47:36

by Peter Xu

Subject: Re: [PATCH RFC 10/10] mm/hugetlb: Comment at rest huge_pte_offset() places

On Mon, Oct 31, 2022 at 10:39:33PM -0700, Nadav Amit wrote:
> On Oct 30, 2022, at 2:30 PM, Peter Xu <[email protected]> wrote:
>
> > + /* With vma lock held, safe without RCU */
> > src_pte = huge_pte_offset(src, addr, sz);
>
> Just another option to consider: you can create an inline function
> huge_pte_offset_locked_mm(), which would do the lockdep_assert_held_*().
>
> I personally would prefer it, since it would clarify exactly the lock you
> care about and "make the code document itself”.

That's a great suggestion, I'll give it a shot in the next version.

Thanks!

--
Peter Xu


2022-11-04 00:46:47

by Mike Kravetz

Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On 10/30/22 17:29, Peter Xu wrote:
> Resolution
> ==========
>
> What this patch proposed is, besides using the vma lock, we can also use
> RCU to protect the pgtable page from being freed from under us when
> huge_pte_offset() is used. The idea is kind of similar to RCU fast-gup.
> Note that fast-gup is very safe regarding pmd unsharing even before vma
> lock, because fast-gup relies on RCU to protect walking any pgtable page,
> including another mm's.
>
> To apply the same idea to huge_pte_offset(), it means with proper RCU
> protection the pte_t* pointer returned from huge_pte_offset() can also be
> always safe to access and de-reference, along with the pgtable lock that
> was bound to the pgtable page.
>
> Patch Layout
> ============
>
> Patch 1 is a trivial cleanup that I noticed when working on this. Please
> shoot if anyone think I should just post it separately, or hopefully I can
> still just carry it over.
>
> Patch 2 is the gut of the patchset, describing how we should use the helper
> huge_pte_offset() correctly. Only a comment patch but should be the most
> important one, as the follow up patches are just trying to follow the rule
> it setup here.
>
> The rest patches resolve all the call sites of huge_pte_offset() to make
> sure either it's with the vma lock (which is perfectly good enough for
> safety in this case; the last patch commented on all those callers to make
> sure we won't miss a single case, and why they're safe). Besides, each of
> the patch will add rcu protection to one caller of huge_pte_offset().
>
> Tests
> =====
>
> Only lightly tested on hugetlb kselftests including uffd, no more errors
> triggered than current mm-unstable (hugetlb-madvise fails before/after
> here, with error "Unexpected number of free huge pages line 207"; haven't
> really got time to look into it).

Do not worry about the madvise test failure, that is caused by a recent
change.

Unless I am missing something, the basic strategy in this series is to
wrap calls to huge_pte_offset and subsequent ptep access with
rcu_read_lock/unlock calls. I must embarrassingly admit that it has
been a loooong time since I had to look at rcu usage and may not know
what I am talking about. However, I seem to recall that one needs to
somehow flag the data items being protected from update/freeing. I
do not see anything like that in the huge_pmd_unshare routine where the
pmd page pointer is updated. Or, is it where the pmd page pointer is
referenced in huge_pte_offset?

Please ignore if you are certain of this rcu usage, otherwise I will
spend some time reeducating myself.
--
Mike Kravetz

2022-11-04 15:51:34

by Peter Xu

Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

Hi, Mike,

On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> On 10/30/22 17:29, Peter Xu wrote:
> > Resolution
> > ==========
> >
> > What this patch proposed is, besides using the vma lock, we can also use
> > RCU to protect the pgtable page from being freed from under us when
> > huge_pte_offset() is used. The idea is kind of similar to RCU fast-gup.
> > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > including another mm's.
> >
> > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > always safe to access and de-reference, along with the pgtable lock that
> > was bound to the pgtable page.
> >
> > Patch Layout
> > ============
> >
> > Patch 1 is a trivial cleanup that I noticed when working on this. Please
> > shoot if anyone think I should just post it separately, or hopefully I can
> > still just carry it over.
> >
> > Patch 2 is the gut of the patchset, describing how we should use the helper
> > huge_pte_offset() correctly. Only a comment patch but should be the most
> > important one, as the follow up patches are just trying to follow the rule
> > it setup here.
> >
> > The rest patches resolve all the call sites of huge_pte_offset() to make
> > sure either it's with the vma lock (which is perfectly good enough for
> > safety in this case; the last patch commented on all those callers to make
> > sure we won't miss a single case, and why they're safe). Besides, each of
> > the patch will add rcu protection to one caller of huge_pte_offset().
> >
> > Tests
> > =====
> >
> > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > here, with error "Unexpected number of free huge pages line 207"; haven't
> > really got time to look into it).
>
> Do not worry about the madvise test failure, that is caused by a recent
> change.
>
> Unless I am missing something, the basic strategy in this series is to
> wrap calls to huge_pte_offset and subsequent ptep access with
> rcu_read_lock/unlock calls. I must embarrassingly admit that it has
> been a loooong time since I had to look at rcu usage and may not know
> what I am talking about. However, I seem to recall that one needs to
> somehow flag the data items being protected from update/freeing. I
> do not see anything like that in the huge_pmd_unshare routine where
> pmd page pointer is updated. Or, is it where the pmd page pointer is
> referenced in huge_pte_offset?

Right. The RCU proposed here is trying to protect the pmd pgtable page,
which is normally freed in an RCU pattern. Please refer to
tlb_remove_table_free() (which can be called from tlb_finish_mmu()), where
it's released with the RCU API:

call_rcu(&batch->rcu, tlb_remove_table_rcu);
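
To spell out the pairing in RCU terms (a sketch, assuming the arch selects
MMU_GATHER_RCU_TABLE_FREE; see below for the archs where it's not selected):

	reader (what this series adds)          pgtable freeing side
	------------------------------          -------------------------------------
	rcu_read_lock();                        tlb_finish_mmu()
	ptep = huge_pte_offset(mm, addr, sz);     tlb_remove_table_free()
	/* dereference *ptep */                     call_rcu(&batch->rcu, ...);
	rcu_read_unlock();                      /* the page is only freed after all
	                                           ongoing RCU read sections end */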

I mentioned fast-gup just as a reference for the same kind of usage, since
fast-gup has the same risk without RCU or a similar (IPI-based) protection.
I can definitely be clearer, and I will enrich the cover letter in the next
post.

In short, my understanding is that pgtable pages (including the shared PUD
page for hugetlb) need to be freed with caution, because there can be
software walking the pages with no locks. In our case, even though
huge_pte_offset() is called with the mmap lock, due to pmd sharing it is
not necessarily the same mmap lock that is held when the pgtable gets
freed, so it is similar to holding no lock at all, imo. Hence
huge_pte_offset() needs to be protected just like what we do with fast-gup.

Please also feel free to refer to the comment chunk at the start of
asm-generic/tlb.h for more information on the mmu gather API.

>
> Please ignore if you are certain of this rcu usage, otherwise I will
> spend some time reeducating myself.

I'm not certain, and I'd like to get any form of comment. :)

Sorry if this RFC version is confusing, but if it at least manages to
explain the problem we have, and we can agree on the problem first, that
will already be a step forward to me. So far that is more important than
how we resolve it, whether with RCU, the vma lock, or anything else.

For a non-RFC series, I think I need to be more careful about some details,
e.g., RCU protection of a pgtable page only works when the arch selects
MMU_GATHER_RCU_TABLE_FREE. I thought that was always the case at least for
the archs that enable pmd sharing, but I'm actually wrong:

arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
arch/x86/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE

arch/arm/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
arch/arm64/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
arch/powerpc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
arch/s390/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
arch/sparc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP
arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
arch/x86/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT

I think it means that at least on RISC-V, RCU_TABLE_FREE is not enabled, so
we'll need to rely on IPIs (e.g. I think we need to replace rcu_read_lock()
with local_irq_disable(), on RISC-V only, for what this patchset wants to
do). In the next version, I plan to add a helper, let's name it
huge_pte_walker_lock() for now, and it should be one of three options:

- if !ARCH_WANT_HUGE_PMD_SHARE: it's no-op
- else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
- else: it should be local_irq_disable()

With that, I think we'll strictly follow what we have with fast-gup, and in
the meantime it should add zero overhead on archs that do not have pmd
sharing; a rough sketch follows below.
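
Something like this (the name and the exact config checks are tentative):

	/* Tentative sketch only; name and config checks may change */
	static inline void huge_pte_walker_lock(void)
	{
		if (!IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE))
			return;		/* no pmd sharing: nothing can go away */
		if (IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE))
			rcu_read_lock();
		else
			local_irq_disable();	/* table free relies on IPIs */
	}

	static inline void huge_pte_walker_unlock(void)
	{
		if (!IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE))
			return;
		if (IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE))
			rcu_read_unlock();
		else
			local_irq_enable();
	}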

Hope the above helps a bit in filling in the missing pieces of the cover
letter. Again, if anything is still missing I'd be more than glad to know.

Thanks,

--
Peter Xu


2022-11-04 16:20:15

by Mike Kravetz

Subject: Re: [PATCH RFC 00/10] mm/hugetlb: Make huge_pte_offset() thread-safe for pmd unshare

On 11/04/22 11:02, Peter Xu wrote:
> Hi, Mike,
>
> On Thu, Nov 03, 2022 at 05:21:46PM -0700, Mike Kravetz wrote:
> > On 10/30/22 17:29, Peter Xu wrote:
> > > Resolution
> > > ==========
> > >
> > > What this patch proposed is, besides using the vma lock, we can also use
> > > RCU to protect the pgtable page from being freed from under us when
> > > huge_pte_offset() is used. The idea is kind of similar to RCU fast-gup.
> > > Note that fast-gup is very safe regarding pmd unsharing even before vma
> > > lock, because fast-gup relies on RCU to protect walking any pgtable page,
> > > including another mm's.
> > >
> > > To apply the same idea to huge_pte_offset(), it means with proper RCU
> > > protection the pte_t* pointer returned from huge_pte_offset() can also be
> > > always safe to access and de-reference, along with the pgtable lock that
> > > was bound to the pgtable page.
> > >
> > > Patch Layout
> > > ============
> > >
> > > Patch 1 is a trivial cleanup that I noticed when working on this. Please
> > > shoot if anyone think I should just post it separately, or hopefully I can
> > > still just carry it over.
> > >
> > > Patch 2 is the gut of the patchset, describing how we should use the helper
> > > huge_pte_offset() correctly. Only a comment patch but should be the most
> > > important one, as the follow up patches are just trying to follow the rule
> > > it setup here.
> > >
> > > The rest patches resolve all the call sites of huge_pte_offset() to make
> > > sure either it's with the vma lock (which is perfectly good enough for
> > > safety in this case; the last patch commented on all those callers to make
> > > sure we won't miss a single case, and why they're safe). Besides, each of
> > > the patch will add rcu protection to one caller of huge_pte_offset().
> > >
> > > Tests
> > > =====
> > >
> > > Only lightly tested on hugetlb kselftests including uffd, no more errors
> > > triggered than current mm-unstable (hugetlb-madvise fails before/after
> > > here, with error "Unexpected number of free huge pages line 207"; haven't
> > > really got time to look into it).
> >
> > Do not worry about the madvise test failure, that is caused by a recent
> > change.
> >
> > Unless I am missing something, the basic strategy in this series is to
> > wrap calls to huge_pte_offset and subsequent ptep access with
> > rcu_read_lock/unlock calls. I must embarrassingly admit that it has
> > been a loooong time since I had to look at rcu usage and may not know
> > what I am talking about. However, I seem to recall that one needs to
> > somehow flag the data items being protected from update/freeing. I
> > do not see anything like that in the huge_pmd_unshare routine where
> > pmd page pointer is updated. Or, is it where the pmd page pointer is
> > referenced in huge_pte_offset?
>
> Right. The RCU proposed here is trying to protect the pmd pgtable page
> that will normally be freed in rcu pattern. Please refer to
> tlb_remove_table_free() (which can be called from tlb_finish_mmu()) where
> it's released with RCU API:
>
> call_rcu(&batch->rcu, tlb_remove_table_rcu);
>

Thanks! That is the piece of the puzzle I was missing.

> I mentioned fast-gup just to refererence on the same usage as fast-gup has
> the same risk if without RCU or similar protections that is IPI-based, but
> I definitely can be even clearer, and I will enrich the cover letter in the
> next post.
>
> In short, my understanding is pgtable pages (including the shared PUD page
> for hugetlb) needs to be freed with caution because there can be softwares
> that are walking the pages with no locks. In our case, even though
> huge_pte_offset() is with the mmap lock, due to the pmd sharing it's not
> always having the same mmap lock as when the pgtable needs to be freed, so
> it's similar to having no lock here, imo. Then huge_pte_offset() needs to
> be protected just like what we do with fast-gup.
>
> Please also feel free to refer to the comment chunk at the start of
> asm-generic/tlb.h for more information on the mmu gather API.
>
> >
> > Please ignore if you are certain of this rcu usage, otherwise I will
> > spend some time reeducating myself.

Sorry for any misunderstanding. I am very happy with the RFC and the
work you have done. I was just missing the piece about rcu
synchronization when the page table was removed.

> I'm not certain, and I'd like to get any form of comment. :)
>
> Sorry if this RFC version is confusing, but if it can try to at least
> explain what the problem we have and if we can agree on the problem first
> then that'll already be a step forward to me. So far that's more important
> than how we resolve it, using RCU or vma lock or anything else.
>
> For a non-rfc series, I think I need to be more careful on some details,
> e.g., the RCU protection for pgtable page is only used when the arch
> supports MMU_GATHER_RCU_TABLE_FREE. I thought that's always supported at
> least for pmd sharing enabled archs, but I'm actually wrong:
>
> arch/arm64/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
> arch/riscv/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
> arch/x86/Kconfig: select ARCH_WANT_HUGE_PMD_SHARE
>
> arch/arm/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP && ARM_LPAE
> arch/arm64/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/powerpc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/s390/Kconfig: select MMU_GATHER_RCU_TABLE_FREE
> arch/sparc/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if SMP
> arch/sparc/include/asm/tlb_64.h:#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
> arch/x86/Kconfig: select MMU_GATHER_RCU_TABLE_FREE if PARAVIRT
>
> I think it means at least on RISCV RCU_TABLE_FREE is not enabled and we'll
> need to rely on the IPIs (e.g. I think we need to replace rcu_read_lock()
> with local_irq_disable() on RISCV only for what this patchset wanted to
> do). In the next version, I plan to add a helper, let's name it
> huge_pte_walker_lock() for now, and it should be one of the three options:
>
> - if !ARCH_WANT_HUGE_PMD_SHARE: it's no-op
> - else if MMU_GATHER_RCU_TABLE_FREE: it should be rcu_read_lock()
> - else: it should be local_irq_disable()
>
> With that, I think we'll strictly follow what we have with fast-gup, at the
> meantime it should add zero overhead on archs that does not have pmd sharing.
>
> Hope above helps a bit on extending the missing pieces of the cover
> letter. Or again if anything missing I'd be more than glad to know..
--
Mike Kravetz