LinuxLists.cc - [PATCH v1] mm/ksm: update stale comment in write_protect

2022-08-31 08:45:44

Subject: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

The comment is stale, because a TLB flush is no longer sufficient and
required to synchronize against concurrent GUP-fast. This used to be true
in the past, whereby a TLB flush would have implied an IPI on architectures
that support GUP-fast, resulting in GUP-fast that disables local interrupts
from completing before completing the flush.

However, ever since general RCU GUP-fast was introduced in
commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"),
this handling no longer applies in general. RCU primarily prevents page
tables from getting freed while they might still get walked by GUP-fast,
but we can race with GUP-fast after clearing the PTE and flushing the
TLB.

Nowadays, we can see a refcount change from GUP-fast at any point in
time. However, GUP-fast detects concurrent PTE changes by looking up the
PTE, temporarily grabbing a reference, and dropping the reference again if
the PTE changed in the meantime.

An explanation by Jason Gunthorpe regarding how existing memory barriers
should be fine to make sure that concurrent GUP-fast cannot succeed in
grabbing a reference with write permissions after we cleared the PTE and
flushed the TLB can be found at [1].

Note that clearing PageAnonExclusive via page_try_share_anon_rmap()
might still need some explicit memory barriers to not have any possible
races with RCU GUP-fast.

[1] https://lkml.kernel.org/r/[email protected]

Cc: Andrew Morton <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---
mm/ksm.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 42ab153335a2..e88291f63461 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1072,23 +1072,20 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
swapped = PageSwapCache(page);
flush_cache_page(vma, pvmw.address, page_to_pfn(page));
/*
- * Ok this is tricky, when get_user_pages_fast() run it doesn't
- * take any lock, therefore the check that we are going to make
- * with the pagecount against the mapcount is racy and
- * O_DIRECT can happen right after the check.
- * So we clear the pte and flush the tlb before the check
- * this assure us that no O_DIRECT can happen after the check
- * or in the middle of the check.
- *
- * No need to notify as we are downgrading page table to read
- * only not changing it to point to a new page.
+ * Especially if we're downgrading protection, make sure to
+ * flush the TLB now. No need to notify as we are not changing
+ * the PTE to point at a different page.
*
* See Documentation/mm/mmu_notifier.rst
*/
entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
+
/*
- * Check that no O_DIRECT or similar I/O is in progress on the
- * page
+ * Make sure that there are no unexpected references (e.g.,
+ * concurrent O_DIRECT). Note that while concurrent GUP-fast
+ * could raise the refcount temporarily to grab a write
+ * reference, it will observe the changed PTE and drop that
+ * temporary reference again.
*/
if (page_mapcount(page) + 1 + swapped != page_count(page)) {
set_pte_at(mm, pvmw.address, pvmw.pte, entry);
--
2.37.1

2022-08-31 18:24:40

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
>
> The comment is stale, because a TLB flush is no longer sufficient and
> required to synchronize against concurrent GUP-fast. This used to be true
> in the past, whereby a TLB flush would have implied an IPI on architectures
> that support GUP-fast, resulting in GUP-fast that disables local interrupts
> from completing before completing the flush.

Hmm... it seems there might be problem for THP collapse IIUC. THP
collapse clears and flushes pmd before doing anything on pte and
relies on interrupt disable of fast GUP to serialize against fast GUP.
But if TLB flush is no longer sufficient, then we may run into the
below race IIUC:

CPU A CPU B
THP collapse fast GUP

gup_pmd_range() <-- see valid pmd

gup_pte_range() <-- work on pte
clear pmd and flush TLB
__collapse_huge_page_isolate()
isolate page <-- before GUP bump refcount

pin the page
__collapse_huge_page_copy()
copy data to huge page
clear pte (don't flush TLB)
Install huge pmd for huge page

return the obsolete page

>
> However, ever since general RCU GUP-fast was introduced in
> commit 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()"),
> this handling no longer applies in general. RCU primarily prevents page
> tables from getting freed while they might still get walked by GUP-fast,
> but we can race with GUP-fast after clearing the PTE and flushing the
> TLB.
>
> Nowadays, we can see a refcount change from GUP-fast at any point in
> time. However, GUP-fast detects concurrent PTE changes by looking up the
> PTE, temporarily grabbing a reference, and dropping the reference again if
> the PTE changed in the meantime.
>
> An explanation by Jason Gunthorpe regarding how existing memory barriers
> should be fine to make sure that concurrent GUP-fast cannot succeed in
> grabbing a reference with write permissions after we cleared the PTE and
> flushed the TLB can be found at [1].
>
> Note that clearing PageAnonExclusive via page_try_share_anon_rmap()
> might still need some explicit memory barriers to not have any possible
> races with RCU GUP-fast.
>
> [1] https://lkml.kernel.org/r/[email protected]
>
> Cc: Andrew Morton <[email protected]>
> Cc: Jason Gunthorpe <[email protected]>
> Cc: John Hubbard <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Peter Xu <[email protected]>
> Signed-off-by: David Hildenbrand <[email protected]>
> ---
> mm/ksm.c | 21 +++++++++------------
> 1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 42ab153335a2..e88291f63461 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1072,23 +1072,20 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page,
> swapped = PageSwapCache(page);
> flush_cache_page(vma, pvmw.address, page_to_pfn(page));
> /*
> - * Ok this is tricky, when get_user_pages_fast() run it doesn't
> - * take any lock, therefore the check that we are going to make
> - * with the pagecount against the mapcount is racy and
> - * O_DIRECT can happen right after the check.
> - * So we clear the pte and flush the tlb before the check
> - * this assure us that no O_DIRECT can happen after the check
> - * or in the middle of the check.
> - *
> - * No need to notify as we are downgrading page table to read
> - * only not changing it to point to a new page.
> + * Especially if we're downgrading protection, make sure to
> + * flush the TLB now. No need to notify as we are not changing
> + * the PTE to point at a different page.
> *
> * See Documentation/mm/mmu_notifier.rst
> */
> entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
> +
> /*
> - * Check that no O_DIRECT or similar I/O is in progress on the
> - * page
> + * Make sure that there are no unexpected references (e.g.,
> + * concurrent O_DIRECT). Note that while concurrent GUP-fast
> + * could raise the refcount temporarily to grab a write
> + * reference, it will observe the changed PTE and drop that
> + * temporary reference again.
> */
> if (page_mapcount(page) + 1 + swapped != page_count(page)) {
> set_pte_at(mm, pvmw.address, pvmw.pte, entry);
> --
> 2.37.1
>
>

2022-08-31 19:00:32

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On 31.08.22 19:55, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
>>
>> The comment is stale, because a TLB flush is no longer sufficient and
>> required to synchronize against concurrent GUP-fast. This used to be true
>> in the past, whereby a TLB flush would have implied an IPI on architectures
>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
>> from completing before completing the flush.
>
> Hmm... it seems there might be problem for THP collapse IIUC. THP
> collapse clears and flushes pmd before doing anything on pte and
> relies on interrupt disable of fast GUP to serialize against fast GUP.
> But if TLB flush is no longer sufficient, then we may run into the
> below race IIUC:
>
> CPU A CPU B
> THP collapse fast GUP
>
> gup_pmd_range() <-- see valid pmd
>
> gup_pte_range() <-- work on pte
> clear pmd and flush TLB
> __collapse_huge_page_isolate()
> isolate page <-- before GUP bump refcount
>
> pin the page
> __collapse_huge_page_copy()
> copy data to huge page
> clear pte (don't flush TLB)
> Install huge pmd for huge page
>
> return the obsolete page

Hm, the is_refcount_suitable() check runs while the PTE hasn't been
cleared yet. And we don't check if the PMD changed once we're in
gup_pte_range().

The comment most certainly should be stale as well -- unless there is
some kind of an implicit IPI broadcast being done.

2667f50e8b81 mentions: "The RCU page table free logic coupled with an
IPI broadcast on THP split (which is a rare event), allows one to
protect a page table walker by merely disabling the interrupts during
the walk."

I'm not able to quickly locate that IPI broadcast -- maybe there is one
being done here (in collapse) as well?

--
Thanks,

David / dhildenb

2022-08-31 19:13:24

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
>
> On 31.08.22 19:55, Yang Shi wrote:
> > On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> The comment is stale, because a TLB flush is no longer sufficient and
> >> required to synchronize against concurrent GUP-fast. This used to be true
> >> in the past, whereby a TLB flush would have implied an IPI on architectures
> >> that support GUP-fast, resulting in GUP-fast that disables local interrupts
> >> from completing before completing the flush.
> >
> > Hmm... it seems there might be problem for THP collapse IIUC. THP
> > collapse clears and flushes pmd before doing anything on pte and
> > relies on interrupt disable of fast GUP to serialize against fast GUP.
> > But if TLB flush is no longer sufficient, then we may run into the
> > below race IIUC:
> >
> > CPU A CPU B
> > THP collapse fast GUP
> >
> > gup_pmd_range() <-- see valid pmd
> >
> > gup_pte_range() <-- work on pte
> > clear pmd and flush TLB
> > __collapse_huge_page_isolate()
> > isolate page <-- before GUP bump refcount
> >
> > pin the page
> > __collapse_huge_page_copy()
> > copy data to huge page
> > clear pte (don't flush TLB)
> > Install huge pmd for huge page
> >
> > return the obsolete page
>
> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
> cleared yet. And we don't check if the PMD changed once we're in
> gup_pte_range().

Yes

>
> The comment most certainly should be stale as well -- unless there is
> some kind of an implicit IPI broadcast being done.
>
> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
> IPI broadcast on THP split (which is a rare event), allows one to
> protect a page table walker by merely disabling the interrupts during
> the walk."
>
> I'm not able to quickly locate that IPI broadcast -- maybe there is one
> being done here (in collapse) as well?

The TLB flush may call IPI. I'm supposed it is arch dependent, right?
Some do use IPI, some may not.

>
> --
> Thanks,
>
> David / dhildenb
>

2022-08-31 19:53:27

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 12:36 PM David Hildenbrand <[email protected]> wrote:
>
> On 31.08.22 21:34, Yang Shi wrote:
> > On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 31.08.22 21:08, Yang Shi wrote:
> >>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
> >>>>
> >>>> On 31.08.22 19:55, Yang Shi wrote:
> >>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> >>>>>>
> >>>>>> The comment is stale, because a TLB flush is no longer sufficient and
> >>>>>> required to synchronize against concurrent GUP-fast. This used to be true
> >>>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
> >>>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
> >>>>>> from completing before completing the flush.
> >>>>>
> >>>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
> >>>>> collapse clears and flushes pmd before doing anything on pte and
> >>>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
> >>>>> But if TLB flush is no longer sufficient, then we may run into the
> >>>>> below race IIUC:
> >>>>>
> >>>>> CPU A CPU B
> >>>>> THP collapse fast GUP
> >>>>>
> >>>>> gup_pmd_range() <-- see valid pmd
> >>>>>
> >>>>> gup_pte_range() <-- work on pte
> >>>>> clear pmd and flush TLB
> >>>>> __collapse_huge_page_isolate()
> >>>>> isolate page <-- before GUP bump refcount
> >>>>>
> >>>>> pin the page
> >>>>> __collapse_huge_page_copy()
> >>>>> copy data to huge page
> >>>>> clear pte (don't flush TLB)
> >>>>> Install huge pmd for huge page
> >>>>>
> >>>>> return the obsolete page
> >>>>
> >>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
> >>>> cleared yet. And we don't check if the PMD changed once we're in
> >>>> gup_pte_range().
> >>>
> >>> Yes
> >>>
> >>>>
> >>>> The comment most certainly should be stale as well -- unless there is
> >>>> some kind of an implicit IPI broadcast being done.
> >>>>
> >>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
> >>>> IPI broadcast on THP split (which is a rare event), allows one to
> >>>> protect a page table walker by merely disabling the interrupts during
> >>>> the walk."
> >>>>
> >>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
> >>>> being done here (in collapse) as well?
> >>>
> >>> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
> >>> Some do use IPI, some may not.
> >>
> >> Right, and the whole idea of the RCU GUP-fast was to support
> >> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
> >> it -- but maybe it does so for PMDs?
> >
> > It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
> >
> > So maybe we should implement pmdp_collapse_flush() for those arches to
> > issue IPI.
>
> ... or find another way to detect and handle this in GUP-fast?
>
> Not sure if, for handling PMDs, it could be sufficient to propagate the
> pmdp pointer + value and double check that the values didn't change.

Should work too, right before pinning the page.

pmdp_collapse_flush() is actually just called by khugepaged, so arch
specific implementation should not be a problem and we avoid making
gup fast more complicated.

>
> --
> Thanks,
>
> David / dhildenb
>

2022-08-31 19:54:13

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 10:55:43AM -0700, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> >
> > The comment is stale, because a TLB flush is no longer sufficient and
> > required to synchronize against concurrent GUP-fast. This used to be true
> > in the past, whereby a TLB flush would have implied an IPI on architectures
> > that support GUP-fast, resulting in GUP-fast that disables local interrupts
> > from completing before completing the flush.
>
> Hmm... it seems there might be problem for THP collapse IIUC. THP
> collapse clears and flushes pmd before doing anything on pte and
> relies on interrupt disable of fast GUP to serialize against fast GUP.
> But if TLB flush is no longer sufficient, then we may run into the
> below race IIUC:
>
> CPU A CPU B
> THP collapse fast GUP
>
> gup_pmd_range() <-- see valid pmd
>
> gup_pte_range() <-- work on pte
> clear pmd and flush TLB
> __collapse_huge_page_isolate()
> isolate page <-- before GUP bump refcount
>
> pin the page
> __collapse_huge_page_copy()
> copy data to huge page
> clear pte (don't flush TLB)
> Install huge pmd for huge page
>
> return the obsolete page

Maybe the pmd level tlb flush is still needed, but on pte level it's
optional (where we can rely on fast-gup rechecking on the pte change)?

--
Peter Xu

2022-08-31 19:54:13

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On 31.08.22 21:15, David Hildenbrand wrote:
> On 31.08.22 21:08, Yang Shi wrote:
>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 31.08.22 19:55, Yang Shi wrote:
>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> The comment is stale, because a TLB flush is no longer sufficient and
>>>>> required to synchronize against concurrent GUP-fast. This used to be true
>>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
>>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
>>>>> from completing before completing the flush.
>>>>
>>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
>>>> collapse clears and flushes pmd before doing anything on pte and
>>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
>>>> But if TLB flush is no longer sufficient, then we may run into the
>>>> below race IIUC:
>>>>
>>>> CPU A CPU B
>>>> THP collapse fast GUP
>>>>
>>>> gup_pmd_range() <-- see valid pmd
>>>>
>>>> gup_pte_range() <-- work on pte
>>>> clear pmd and flush TLB
>>>> __collapse_huge_page_isolate()
>>>> isolate page <-- before GUP bump refcount
>>>>
>>>> pin the page
>>>> __collapse_huge_page_copy()
>>>> copy data to huge page
>>>> clear pte (don't flush TLB)
>>>> Install huge pmd for huge page
>>>>
>>>> return the obsolete page
>>>
>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
>>> cleared yet. And we don't check if the PMD changed once we're in
>>> gup_pte_range().
>>
>> Yes
>>
>>>
>>> The comment most certainly should be stale as well -- unless there is
>>> some kind of an implicit IPI broadcast being done.
>>>
>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
>>> IPI broadcast on THP split (which is a rare event), allows one to
>>> protect a page table walker by merely disabling the interrupts during
>>> the walk."
>>>
>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
>>> being done here (in collapse) as well?
>>
>> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
>> Some do use IPI, some may not.
>
> Right, and the whole idea of the RCU GUP-fast was to support
> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
> it -- but maybe it does so for PMDs?

Looking into the details (and the outdated comment for gup_pte_range()
we should fixup), THP splitting used in the past pmdp_splitting_flush()
for triggering an IPI broadcast.

However, that has been removed in 4b471e8898c3 ("mm, thp: remove
infrastructure for handling splitting PMDs") due to refcount handling
changes that no longer require it.

Consequently, I don't think we can expect an IPI broadcast to sync with
GUP-fast at that point ...

--
Thanks,

David / dhildenb

2022-08-31 19:56:42

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <[email protected]> wrote:
>
> On 31.08.22 21:08, Yang Shi wrote:
> > On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 31.08.22 19:55, Yang Shi wrote:
> >>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> >>>>
> >>>> The comment is stale, because a TLB flush is no longer sufficient and
> >>>> required to synchronize against concurrent GUP-fast. This used to be true
> >>>> in the past, whereby a TLB flush would have implied an IPI on architectures
> >>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
> >>>> from completing before completing the flush.
> >>>
> >>> Hmm... it seems there might be problem for THP collapse IIUC. THP
> >>> collapse clears and flushes pmd before doing anything on pte and
> >>> relies on interrupt disable of fast GUP to serialize against fast GUP.
> >>> But if TLB flush is no longer sufficient, then we may run into the
> >>> below race IIUC:
> >>>
> >>> CPU A CPU B
> >>> THP collapse fast GUP
> >>>
> >>> gup_pmd_range() <-- see valid pmd
> >>>
> >>> gup_pte_range() <-- work on pte
> >>> clear pmd and flush TLB
> >>> __collapse_huge_page_isolate()
> >>> isolate page <-- before GUP bump refcount
> >>>
> >>> pin the page
> >>> __collapse_huge_page_copy()
> >>> copy data to huge page
> >>> clear pte (don't flush TLB)
> >>> Install huge pmd for huge page
> >>>
> >>> return the obsolete page
> >>
> >> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
> >> cleared yet. And we don't check if the PMD changed once we're in
> >> gup_pte_range().
> >
> > Yes
> >
> >>
> >> The comment most certainly should be stale as well -- unless there is
> >> some kind of an implicit IPI broadcast being done.
> >>
> >> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
> >> IPI broadcast on THP split (which is a rare event), allows one to
> >> protect a page table walker by merely disabling the interrupts during
> >> the walk."
> >>
> >> I'm not able to quickly locate that IPI broadcast -- maybe there is one
> >> being done here (in collapse) as well?
> >
> > The TLB flush may call IPI. I'm supposed it is arch dependent, right?
> > Some do use IPI, some may not.
>
> Right, and the whole idea of the RCU GUP-fast was to support
> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
> it -- but maybe it does so for PMDs?

It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.

So maybe we should implement pmdp_collapse_flush() for those arches to
issue IPI.

>
> --
> Thanks,
>
> David / dhildenb
>

2022-08-31 19:59:34

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On 31.08.22 21:08, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
>>
>> On 31.08.22 19:55, Yang Shi wrote:
>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> The comment is stale, because a TLB flush is no longer sufficient and
>>>> required to synchronize against concurrent GUP-fast. This used to be true
>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
>>>> from completing before completing the flush.
>>>
>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
>>> collapse clears and flushes pmd before doing anything on pte and
>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
>>> But if TLB flush is no longer sufficient, then we may run into the
>>> below race IIUC:
>>>
>>> CPU A CPU B
>>> THP collapse fast GUP
>>>
>>> gup_pmd_range() <-- see valid pmd
>>>
>>> gup_pte_range() <-- work on pte
>>> clear pmd and flush TLB
>>> __collapse_huge_page_isolate()
>>> isolate page <-- before GUP bump refcount
>>>
>>> pin the page
>>> __collapse_huge_page_copy()
>>> copy data to huge page
>>> clear pte (don't flush TLB)
>>> Install huge pmd for huge page
>>>
>>> return the obsolete page
>>
>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
>> cleared yet. And we don't check if the PMD changed once we're in
>> gup_pte_range().
>
> Yes
>
>>
>> The comment most certainly should be stale as well -- unless there is
>> some kind of an implicit IPI broadcast being done.
>>
>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
>> IPI broadcast on THP split (which is a rare event), allows one to
>> protect a page table walker by merely disabling the interrupts during
>> the walk."
>>
>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
>> being done here (in collapse) as well?
>
> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
> Some do use IPI, some may not.

Right, and the whole idea of the RCU GUP-fast was to support
architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
it -- but maybe it does so for PMDs?

--
Thanks,

David / dhildenb

2022-08-31 20:44:18

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On 31.08.22 21:34, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 31.08.22 21:08, Yang Shi wrote:
>>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 31.08.22 19:55, Yang Shi wrote:
>>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
>>>>>>
>>>>>> The comment is stale, because a TLB flush is no longer sufficient and
>>>>>> required to synchronize against concurrent GUP-fast. This used to be true
>>>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
>>>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
>>>>>> from completing before completing the flush.
>>>>>
>>>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
>>>>> collapse clears and flushes pmd before doing anything on pte and
>>>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
>>>>> But if TLB flush is no longer sufficient, then we may run into the
>>>>> below race IIUC:
>>>>>
>>>>> CPU A CPU B
>>>>> THP collapse fast GUP
>>>>>
>>>>> gup_pmd_range() <-- see valid pmd
>>>>>
>>>>> gup_pte_range() <-- work on pte
>>>>> clear pmd and flush TLB
>>>>> __collapse_huge_page_isolate()
>>>>> isolate page <-- before GUP bump refcount
>>>>>
>>>>> pin the page
>>>>> __collapse_huge_page_copy()
>>>>> copy data to huge page
>>>>> clear pte (don't flush TLB)
>>>>> Install huge pmd for huge page
>>>>>
>>>>> return the obsolete page
>>>>
>>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
>>>> cleared yet. And we don't check if the PMD changed once we're in
>>>> gup_pte_range().
>>>
>>> Yes
>>>
>>>>
>>>> The comment most certainly should be stale as well -- unless there is
>>>> some kind of an implicit IPI broadcast being done.
>>>>
>>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
>>>> IPI broadcast on THP split (which is a rare event), allows one to
>>>> protect a page table walker by merely disabling the interrupts during
>>>> the walk."
>>>>
>>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
>>>> being done here (in collapse) as well?
>>>
>>> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
>>> Some do use IPI, some may not.
>>
>> Right, and the whole idea of the RCU GUP-fast was to support
>> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
>> it -- but maybe it does so for PMDs?
>
> It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
>
> So maybe we should implement pmdp_collapse_flush() for those arches to
> issue IPI.

... or find another way to detect and handle this in GUP-fast?

Not sure if, for handling PMDs, it could be sufficient to propagate the
pmdp pointer + value and double check that the values didn't change.

--
Thanks,

David / dhildenb

2022-08-31 21:05:55

by John Hubbard

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On 8/31/22 12:43, Yang Shi wrote:
>>> It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
>>>
>>> So maybe we should implement pmdp_collapse_flush() for those arches to
>>> issue IPI.
>>
>> ... or find another way to detect and handle this in GUP-fast?
>>
>> Not sure if, for handling PMDs, it could be sufficient to propagate the
>> pmdp pointer + value and double check that the values didn't change.
>
> Should work too, right before pinning the page.
>
> pmdp_collapse_flush() is actually just called by khugepaged, so arch
> specific implementation should not be a problem and we avoid making
> gup fast more complicated.
>

And just to pile on, about that gup fast complexity: depending upon IPIs
added a lot of complexity, not just because of the IPI dependency, but
more importantly because only some arches even *have* IPIs. So an
entirely different set of reasoning has to be used *in addition* to
working through the IPI story. And sure enough, we can see the fallout:
you are uncovering lots of half-correct comments in that area.

So getting rid of the dependency on IPIs in gup fast would go a long way
to simplifying it, and maybe even improving overall CPU load (insert
some hand-wavy notes here about IPIs being worse than things like RCU).

But the real win is in the complexity reduction in gup fast.

thanks,

--
John Hubbard
NVIDIA

2022-08-31 21:24:47

by Peter Xu

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 01:38:21PM -0700, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 11:52 AM Peter Xu <[email protected]> wrote:
> >
> > On Wed, Aug 31, 2022 at 10:55:43AM -0700, Yang Shi wrote:
> > > On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> > > >
> > > > The comment is stale, because a TLB flush is no longer sufficient and
> > > > required to synchronize against concurrent GUP-fast. This used to be true
> > > > in the past, whereby a TLB flush would have implied an IPI on architectures
> > > > that support GUP-fast, resulting in GUP-fast that disables local interrupts
> > > > from completing before completing the flush.
> > >
> > > Hmm... it seems there might be problem for THP collapse IIUC. THP
> > > collapse clears and flushes pmd before doing anything on pte and
> > > relies on interrupt disable of fast GUP to serialize against fast GUP.
> > > But if TLB flush is no longer sufficient, then we may run into the
> > > below race IIUC:
> > >
> > > CPU A CPU B
> > > THP collapse fast GUP
> > >
> > > gup_pmd_range() <-- see valid pmd
> > >
> > > gup_pte_range() <-- work on pte
> > > clear pmd and flush TLB
> > > __collapse_huge_page_isolate()
> > > isolate page <-- before GUP bump refcount
> > >
> > > pin the page
> > > __collapse_huge_page_copy()
> > > copy data to huge page
> > > clear pte (don't flush TLB)
> > > Install huge pmd for huge page
> > >
> > > return the obsolete page
> >
> > Maybe the pmd level tlb flush is still needed, but on pte level it's
> > optional (where we can rely on fast-gup rechecking on the pte change)?
>
> Do you mean in khugepaged?

What I wanted to say before was that the immediate tlb flush (after pgtable
entry cleared) seems to be only needed by pmd level to guarantee safety
with concurrent fast-gup, since fast-gup can detect pte change after
pinning, and that should already guarantee safe concurrent fast-gup to me.

After reading the other emails, afaiu we're on the same page.

> It does TLB flush, but some arches may not use IPI.

Yeah, I see that ppc book3s code has customized pmdp_collapse_flush() to
explicit do the IPIs besides tlb flush using smp calls.

I assume pmdp_collapse_flush() should always be properly implemented to
guarantee safety against fast-gup, or I also agree it could be a bug.

--
Peter Xu

2022-08-31 21:37:24

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 11:52 AM Peter Xu <[email protected]> wrote:
>
> On Wed, Aug 31, 2022 at 10:55:43AM -0700, Yang Shi wrote:
> > On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> > >
> > > The comment is stale, because a TLB flush is no longer sufficient and
> > > required to synchronize against concurrent GUP-fast. This used to be true
> > > in the past, whereby a TLB flush would have implied an IPI on architectures
> > > that support GUP-fast, resulting in GUP-fast that disables local interrupts
> > > from completing before completing the flush.
> >
> > Hmm... it seems there might be problem for THP collapse IIUC. THP
> > collapse clears and flushes pmd before doing anything on pte and
> > relies on interrupt disable of fast GUP to serialize against fast GUP.
> > But if TLB flush is no longer sufficient, then we may run into the
> > below race IIUC:
> >
> > CPU A CPU B
> > THP collapse fast GUP
> >
> > gup_pmd_range() <-- see valid pmd
> >
> > gup_pte_range() <-- work on pte
> > clear pmd and flush TLB
> > __collapse_huge_page_isolate()
> > isolate page <-- before GUP bump refcount
> >
> > pin the page
> > __collapse_huge_page_copy()
> > copy data to huge page
> > clear pte (don't flush TLB)
> > Install huge pmd for huge page
> >
> > return the obsolete page
>
> Maybe the pmd level tlb flush is still needed, but on pte level it's
> optional (where we can rely on fast-gup rechecking on the pte change)?

Do you mean in khugepaged? It does TLB flush, but some arches may not use IPI.

>
> --
> Peter Xu
>

2022-08-31 22:42:06

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 12:43 PM Yang Shi <[email protected]> wrote:
>
> On Wed, Aug 31, 2022 at 12:36 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 31.08.22 21:34, Yang Shi wrote:
> > > On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <[email protected]> wrote:
> > >>
> > >> On 31.08.22 21:08, Yang Shi wrote:
> > >>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
> > >>>>
> > >>>> On 31.08.22 19:55, Yang Shi wrote:
> > >>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> > >>>>>>
> > >>>>>> The comment is stale, because a TLB flush is no longer sufficient and
> > >>>>>> required to synchronize against concurrent GUP-fast. This used to be true
> > >>>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
> > >>>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
> > >>>>>> from completing before completing the flush.
> > >>>>>
> > >>>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
> > >>>>> collapse clears and flushes pmd before doing anything on pte and
> > >>>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
> > >>>>> But if TLB flush is no longer sufficient, then we may run into the
> > >>>>> below race IIUC:
> > >>>>>
> > >>>>> CPU A CPU B
> > >>>>> THP collapse fast GUP
> > >>>>>
> > >>>>> gup_pmd_range() <-- see valid pmd
> > >>>>>
> > >>>>> gup_pte_range() <-- work on pte
> > >>>>> clear pmd and flush TLB
> > >>>>> __collapse_huge_page_isolate()
> > >>>>> isolate page <-- before GUP bump refcount
> > >>>>>
> > >>>>> pin the page
> > >>>>> __collapse_huge_page_copy()
> > >>>>> copy data to huge page
> > >>>>> clear pte (don't flush TLB)
> > >>>>> Install huge pmd for huge page
> > >>>>>
> > >>>>> return the obsolete page
> > >>>>
> > >>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
> > >>>> cleared yet. And we don't check if the PMD changed once we're in
> > >>>> gup_pte_range().
> > >>>
> > >>> Yes
> > >>>
> > >>>>
> > >>>> The comment most certainly should be stale as well -- unless there is
> > >>>> some kind of an implicit IPI broadcast being done.
> > >>>>
> > >>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
> > >>>> IPI broadcast on THP split (which is a rare event), allows one to
> > >>>> protect a page table walker by merely disabling the interrupts during
> > >>>> the walk."
> > >>>>
> > >>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
> > >>>> being done here (in collapse) as well?
> > >>>
> > >>> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
> > >>> Some do use IPI, some may not.
> > >>
> > >> Right, and the whole idea of the RCU GUP-fast was to support
> > >> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
> > >> it -- but maybe it does so for PMDs?
> > >
> > > It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
> > >
> > > So maybe we should implement pmdp_collapse_flush() for those arches to
> > > issue IPI.
> >
> > ... or find another way to detect and handle this in GUP-fast?
> >
> > Not sure if, for handling PMDs, it could be sufficient to propagate the
> > pmdp pointer + value and double check that the values didn't change.
>
> Should work too, right before pinning the page.

I actually mean the same place for checking pte. So, something like:

diff --git a/mm/gup.c b/mm/gup.c
index 5abdaf487460..2b0703403902 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2392,7 +2392,8 @@ static int gup_pte_range(pmd_t pmd, unsigned
long addr, unsigned long end,
goto pte_unmap;
}

- if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+ if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
+ unlikely(pte_val(pte) != pte_val(*ptep))) {
gup_put_folio(folio, 1, flags);
goto pte_unmap;
}

It doesn't build, just shows the idea.

>
> pmdp_collapse_flush() is actually just called by khugepaged, so arch
> specific implementation should not be a problem and we avoid making
> gup fast more complicated.
>
> >
> > --
> > Thanks,
> >
> > David / dhildenb
> >

2022-08-31 23:01:47

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 1:59 PM John Hubbard <[email protected]> wrote:
>
> On 8/31/22 12:43, Yang Shi wrote:
> >>> It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
> >>>
> >>> So maybe we should implement pmdp_collapse_flush() for those arches to
> >>> issue IPI.
> >>
> >> ... or find another way to detect and handle this in GUP-fast?
> >>
> >> Not sure if, for handling PMDs, it could be sufficient to propagate the
> >> pmdp pointer + value and double check that the values didn't change.
> >
> > Should work too, right before pinning the page.
> >
> > pmdp_collapse_flush() is actually just called by khugepaged, so arch
> > specific implementation should not be a problem and we avoid making
> > gup fast more complicated.
> >
>
> And just to pile on, about that gup fast complexity: depending upon IPIs
> added a lot of complexity, not just because of the IPI dependency, but
> more importantly because only some arches even *have* IPIs. So an
> entirely different set of reasoning has to be used *in addition* to
> working through the IPI story. And sure enough, we can see the fallout:
> you are uncovering lots of half-correct comments in that area.
>
> So getting rid of the dependency on IPIs in gup fast would go a long way
> to simplifying it, and maybe even improving overall CPU load (insert
> some hand-wavy notes here about IPIs being worse than things like RCU).
>
> But the real win is in the complexity reduction in gup fast.

Thanks, John. Yeah, I still had some wrong impressions about how to
serialize against fast GUP. If you guys thought fixing the problem in
gup code is the preferred way, I won't insist on arch-specific
pmdp_collapse_flush().

>
>
> thanks,
>
> --
> John Hubbard
> NVIDIA

2022-08-31 23:12:29

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 2:09 PM Peter Xu <[email protected]> wrote:
>
> On Wed, Aug 31, 2022 at 01:38:21PM -0700, Yang Shi wrote:
> > On Wed, Aug 31, 2022 at 11:52 AM Peter Xu <[email protected]> wrote:
> > >
> > > On Wed, Aug 31, 2022 at 10:55:43AM -0700, Yang Shi wrote:
> > > > On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> > > > >
> > > > > The comment is stale, because a TLB flush is no longer sufficient and
> > > > > required to synchronize against concurrent GUP-fast. This used to be true
> > > > > in the past, whereby a TLB flush would have implied an IPI on architectures
> > > > > that support GUP-fast, resulting in GUP-fast that disables local interrupts
> > > > > from completing before completing the flush.
> > > >
> > > > Hmm... it seems there might be problem for THP collapse IIUC. THP
> > > > collapse clears and flushes pmd before doing anything on pte and
> > > > relies on interrupt disable of fast GUP to serialize against fast GUP.
> > > > But if TLB flush is no longer sufficient, then we may run into the
> > > > below race IIUC:
> > > >
> > > > CPU A CPU B
> > > > THP collapse fast GUP
> > > >
> > > > gup_pmd_range() <-- see valid pmd
> > > >
> > > > gup_pte_range() <-- work on pte
> > > > clear pmd and flush TLB
> > > > __collapse_huge_page_isolate()
> > > > isolate page <-- before GUP bump refcount
> > > >
> > > > pin the page
> > > > __collapse_huge_page_copy()
> > > > copy data to huge page
> > > > clear pte (don't flush TLB)
> > > > Install huge pmd for huge page
> > > >
> > > > return the obsolete page
> > >
> > > Maybe the pmd level tlb flush is still needed, but on pte level it's
> > > optional (where we can rely on fast-gup rechecking on the pte change)?
> >
> > Do you mean in khugepaged?
>
> What I wanted to say before was that the immediate tlb flush (after pgtable
> entry cleared) seems to be only needed by pmd level to guarantee safety
> with concurrent fast-gup, since fast-gup can detect pte change after
> pinning, and that should already guarantee safe concurrent fast-gup to me.

Yeah, so ptep_clear() is used in __collapse_huge_page_copy() instead
of clear and flush.

>
> After reading the other emails, afaiu we're on the same page.
>
> > It does TLB flush, but some arches may not use IPI.
>
> Yeah, I see that ppc book3s code has customized pmdp_collapse_flush() to
> explicit do the IPIs besides tlb flush using smp calls.
>
> I assume pmdp_collapse_flush() should always be properly implemented to
> guarantee safety against fast-gup, or I also agree it could be a bug.

This was what I thought before I saw this patch.

>
> --
> Peter Xu
>

2022-09-01 07:21:35

by David Hildenbrand

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On 01.09.22 00:18, Yang Shi wrote:
> On Wed, Aug 31, 2022 at 12:43 PM Yang Shi <[email protected]> wrote:
>>
>> On Wed, Aug 31, 2022 at 12:36 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 31.08.22 21:34, Yang Shi wrote:
>>>> On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 31.08.22 21:08, Yang Shi wrote:
>>>>>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
>>>>>>>
>>>>>>> On 31.08.22 19:55, Yang Shi wrote:
>>>>>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> The comment is stale, because a TLB flush is no longer sufficient and
>>>>>>>>> required to synchronize against concurrent GUP-fast. This used to be true
>>>>>>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
>>>>>>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
>>>>>>>>> from completing before completing the flush.
>>>>>>>>
>>>>>>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
>>>>>>>> collapse clears and flushes pmd before doing anything on pte and
>>>>>>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
>>>>>>>> But if TLB flush is no longer sufficient, then we may run into the
>>>>>>>> below race IIUC:
>>>>>>>>
>>>>>>>> CPU A CPU B
>>>>>>>> THP collapse fast GUP
>>>>>>>>
>>>>>>>> gup_pmd_range() <-- see valid pmd
>>>>>>>>
>>>>>>>> gup_pte_range() <-- work on pte
>>>>>>>> clear pmd and flush TLB
>>>>>>>> __collapse_huge_page_isolate()
>>>>>>>> isolate page <-- before GUP bump refcount
>>>>>>>>
>>>>>>>> pin the page
>>>>>>>> __collapse_huge_page_copy()
>>>>>>>> copy data to huge page
>>>>>>>> clear pte (don't flush TLB)
>>>>>>>> Install huge pmd for huge page
>>>>>>>>
>>>>>>>> return the obsolete page
>>>>>>>
>>>>>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
>>>>>>> cleared yet. And we don't check if the PMD changed once we're in
>>>>>>> gup_pte_range().
>>>>>>
>>>>>> Yes
>>>>>>
>>>>>>>
>>>>>>> The comment most certainly should be stale as well -- unless there is
>>>>>>> some kind of an implicit IPI broadcast being done.
>>>>>>>
>>>>>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
>>>>>>> IPI broadcast on THP split (which is a rare event), allows one to
>>>>>>> protect a page table walker by merely disabling the interrupts during
>>>>>>> the walk."
>>>>>>>
>>>>>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
>>>>>>> being done here (in collapse) as well?
>>>>>>
>>>>>> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
>>>>>> Some do use IPI, some may not.
>>>>>
>>>>> Right, and the whole idea of the RCU GUP-fast was to support
>>>>> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
>>>>> it -- but maybe it does so for PMDs?
>>>>
>>>> It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
>>>>
>>>> So maybe we should implement pmdp_collapse_flush() for those arches to
>>>> issue IPI.
>>>
>>> ... or find another way to detect and handle this in GUP-fast?
>>>
>>> Not sure if, for handling PMDs, it could be sufficient to propagate the
>>> pmdp pointer + value and double check that the values didn't change.
>>
>> Should work too, right before pinning the page.
>
> I actually mean the same place for checking pte. So, something like:
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 5abdaf487460..2b0703403902 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2392,7 +2392,8 @@ static int gup_pte_range(pmd_t pmd, unsigned
> long addr, unsigned long end,
> goto pte_unmap;
> }
>
> - if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> + unlikely(pte_val(pte) != pte_val(*ptep))) {
> gup_put_folio(folio, 1, flags);
> goto pte_unmap;
> }
>
> It doesn't build, just shows the idea.

Exactly what I had in mind. We should add a comment spelling out that
this is for handling huge PMD collapse.

--
Thanks,

David / dhildenb

2022-09-01 18:08:36

by Yang Shi

[permalink] [raw]

Subject: Re: [PATCH v1] mm/ksm: update stale comment in write_protect_page()

On Wed, Aug 31, 2022 at 11:58 PM David Hildenbrand <[email protected]> wrote:
>
> On 01.09.22 00:18, Yang Shi wrote:
> > On Wed, Aug 31, 2022 at 12:43 PM Yang Shi <[email protected]> wrote:
> >>
> >> On Wed, Aug 31, 2022 at 12:36 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 31.08.22 21:34, Yang Shi wrote:
> >>>> On Wed, Aug 31, 2022 at 12:15 PM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 31.08.22 21:08, Yang Shi wrote:
> >>>>>> On Wed, Aug 31, 2022 at 11:29 AM David Hildenbrand <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 31.08.22 19:55, Yang Shi wrote:
> >>>>>>>> On Wed, Aug 31, 2022 at 1:30 AM David Hildenbrand <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> The comment is stale, because a TLB flush is no longer sufficient and
> >>>>>>>>> required to synchronize against concurrent GUP-fast. This used to be true
> >>>>>>>>> in the past, whereby a TLB flush would have implied an IPI on architectures
> >>>>>>>>> that support GUP-fast, resulting in GUP-fast that disables local interrupts
> >>>>>>>>> from completing before completing the flush.
> >>>>>>>>
> >>>>>>>> Hmm... it seems there might be problem for THP collapse IIUC. THP
> >>>>>>>> collapse clears and flushes pmd before doing anything on pte and
> >>>>>>>> relies on interrupt disable of fast GUP to serialize against fast GUP.
> >>>>>>>> But if TLB flush is no longer sufficient, then we may run into the
> >>>>>>>> below race IIUC:
> >>>>>>>>
> >>>>>>>> CPU A CPU B
> >>>>>>>> THP collapse fast GUP
> >>>>>>>>
> >>>>>>>> gup_pmd_range() <-- see valid pmd
> >>>>>>>>
> >>>>>>>> gup_pte_range() <-- work on pte
> >>>>>>>> clear pmd and flush TLB
> >>>>>>>> __collapse_huge_page_isolate()
> >>>>>>>> isolate page <-- before GUP bump refcount
> >>>>>>>>
> >>>>>>>> pin the page
> >>>>>>>> __collapse_huge_page_copy()
> >>>>>>>> copy data to huge page
> >>>>>>>> clear pte (don't flush TLB)
> >>>>>>>> Install huge pmd for huge page
> >>>>>>>>
> >>>>>>>> return the obsolete page
> >>>>>>>
> >>>>>>> Hm, the is_refcount_suitable() check runs while the PTE hasn't been
> >>>>>>> cleared yet. And we don't check if the PMD changed once we're in
> >>>>>>> gup_pte_range().
> >>>>>>
> >>>>>> Yes
> >>>>>>
> >>>>>>>
> >>>>>>> The comment most certainly should be stale as well -- unless there is
> >>>>>>> some kind of an implicit IPI broadcast being done.
> >>>>>>>
> >>>>>>> 2667f50e8b81 mentions: "The RCU page table free logic coupled with an
> >>>>>>> IPI broadcast on THP split (which is a rare event), allows one to
> >>>>>>> protect a page table walker by merely disabling the interrupts during
> >>>>>>> the walk."
> >>>>>>>
> >>>>>>> I'm not able to quickly locate that IPI broadcast -- maybe there is one
> >>>>>>> being done here (in collapse) as well?
> >>>>>>
> >>>>>> The TLB flush may call IPI. I'm supposed it is arch dependent, right?
> >>>>>> Some do use IPI, some may not.
> >>>>>
> >>>>> Right, and the whole idea of the RCU GUP-fast was to support
> >>>>> architectures that don't do it. x86-64 does it. IIRC, powerpc doesn't do
> >>>>> it -- but maybe it does so for PMDs?
> >>>>
> >>>> It looks powerpc does issue IPI for pmd flush. But arm64 doesn't IIRC.
> >>>>
> >>>> So maybe we should implement pmdp_collapse_flush() for those arches to
> >>>> issue IPI.
> >>>
> >>> ... or find another way to detect and handle this in GUP-fast?
> >>>
> >>> Not sure if, for handling PMDs, it could be sufficient to propagate the
> >>> pmdp pointer + value and double check that the values didn't change.
> >>
> >> Should work too, right before pinning the page.
> >
> > I actually mean the same place for checking pte. So, something like:
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 5abdaf487460..2b0703403902 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2392,7 +2392,8 @@ static int gup_pte_range(pmd_t pmd, unsigned
> > long addr, unsigned long end,
> > goto pte_unmap;
> > }
> >
> > - if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> > + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp)) ||
> > + unlikely(pte_val(pte) != pte_val(*ptep))) {
> > gup_put_folio(folio, 1, flags);
> > goto pte_unmap;
> > }
> >
> > It doesn't build, just shows the idea.
>
> Exactly what I had in mind. We should add a comment spelling out that
> this is for handling huge PMD collapse.

Yeah, I will prepare a patch soon.

>
>
> --
> Thanks,
>
> David / dhildenb
>