For most of the page walk paths, it is logically always better to retry
the pmd walk when we hit the pmd_trans_unstable() race. We could treat the
unstable pmd as a none pmd (per the comment above pmd_trans_unstable()),
but in most cases we're not even treating it as a none pmd. If we're going
to fix it anyway, a retry is the most accurate fix.
I've gone over all the pmd_trans_unstable() special cases, and this patch
should cover all the remaining places where we should properly retry on an
unstable pmd. With ACTION_AGAIN, introduced in 2020, that is easy to
achieve.
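For reference, the pagewalk core consumes ACTION_AGAIN by redoing the
current iteration before moving on; from memory, the relevant part of
walk_pmd_range() in mm/pagewalk.c looks roughly like this (simplified
sketch, not the exact code):

	do {
again:
		next = pmd_addr_end(addr, end);
		/* ... pmd_none() handling ... */
		walk->action = ACTION_SUBTREE;
		if (ops->pmd_entry)
			err = ops->pmd_entry(pmd, addr, next, walk);
		if (err)
			break;
		/* The callback saw an unstable pmd: redo this pmd */
		if (walk->action == ACTION_AGAIN)
			goto again;
		/* ... otherwise descend into the pte level ... */
	} while (pmd++, addr = next, addr != end);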
These are the call sites that I think should be fixed with it:
*** fs/proc/task_mmu.c:
smaps_pte_range[634] if (pmd_trans_unstable(pmd))
clear_refs_pte_range[1194] if (pmd_trans_unstable(pmd))
pagemap_pmd_range[1542] if (pmd_trans_unstable(pmdp))
gather_pte_stats[1891] if (pmd_trans_unstable(pmd))
*** mm/memcontrol.c:
mem_cgroup_count_precharge_pte_range[6024] if (pmd_trans_unstable(pmd))
mem_cgroup_move_charge_pte_range[6244] if (pmd_trans_unstable(pmd))
*** mm/memory-failure.c:
hwpoison_pte_range[794] if (pmd_trans_unstable(pmdp))
*** mm/mempolicy.c:
queue_folios_pte_range[517] if (pmd_trans_unstable(pmd))
*** mm/madvise.c:
madvise_cold_or_pageout_pte_range[425] if (pmd_trans_unstable(pmd))
madvise_free_pte_range[625] if (pmd_trans_unstable(pmd))
IIUC, most of them may not be a big issue even without a retry, because
they're already not strict (smaps, pte_stats, MADV_COLD, ..): at worst it
means the statistics can be inaccurate, or one 2M chunk fewer gets
processed. But some of them could have a functional error without the
retry, afaiu: e.g. pagemap, where the output buffer can be shifted over
the unstable pmd range, so IIUC the pagemap result can be wrong.
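To illustrate the pagemap case: userspace derives the address that each
entry describes purely from the entry's position in the buffer, so
silently skipping a range shifts every later entry onto the wrong
address. A hypothetical userspace reader (illustration only, not from the
kernel tree) relies on exactly that positional ABI:

	#include <fcntl.h>
	#include <stdint.h>
	#include <unistd.h>

	/*
	 * Entry i of the returned buffer is assumed to describe the page
	 * at start + i * page_size -- which is why the kernel must emit
	 * one entry per page, or everything after a skipped pmd range
	 * lands on the wrong address.
	 */
	static int read_pagemap(uint64_t start, size_t npages, uint64_t *buf)
	{
		long pagesz = sysconf(_SC_PAGESIZE);
		int fd = open("/proc/self/pagemap", O_RDONLY);
		ssize_t ret;

		if (fd < 0)
			return -1;
		ret = pread(fd, buf, npages * sizeof(uint64_t),
			    (off_t)(start / pagesz) * sizeof(uint64_t));
		close(fd);
		return ret == (ssize_t)(npages * sizeof(uint64_t)) ? 0 : -1;
	}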
Meanwhile, these call sites all look fine and don't need any change:
*** include/linux/pgtable.h:
pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
*** mm/gup.c:
follow_pmd_mask[695] if (pmd_trans_unstable(pmd))
*** mm/mapping_dirty_helpers.c:
wp_clean_pmd_entry[131] if (!pmd_trans_unstable(&pmdval))
*** mm/memory.c:
do_anonymous_page[4060] if (unlikely(pmd_trans_unstable(vmf->pmd)))
*** mm/migrate_device.c:
migrate_vma_insert_page[616] if (unlikely(pmd_trans_unstable(pmdp)))
*** mm/mincore.c:
mincore_pte_range[116] if (pmd_trans_unstable(pmd)) {
Signed-off-by: Peter Xu <[email protected]>
---
fs/proc/task_mmu.c | 17 +++++++++++++----
mm/madvise.c | 8 ++++++--
mm/memcontrol.c | 8 ++++++--
mm/memory-failure.c | 4 +++-
mm/mempolicy.c | 4 +++-
5 files changed, 31 insertions(+), 10 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6259dd432eeb..823eaba5c6bf 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
goto out;
+ }
+
/*
* The mmap_lock held all the way back in m_start() is what
* keeps khugepaged out of here and from collapsing things
@@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
return err;
}
- if (pmd_trans_unstable(pmdp))
+ if (pmd_trans_unstable(pmdp)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
@@ -1888,8 +1895,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
return 0;
}
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
#endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
do {
diff --git a/mm/madvise.c b/mm/madvise.c
index 78cd12581628..0fd81712022c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -424,8 +424,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}
regular_folio:
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
#endif
tlb_change_page_size(tlb, PAGE_SIZE);
orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -626,8 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
goto next;
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
tlb_change_page_size(tlb, PAGE_SIZE);
orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ee433be4c3b..15e50f033e41 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6021,8 +6021,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
return 0;
}
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE)
if (get_mctgt_type(vma, addr, *pte, NULL))
@@ -6241,8 +6243,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
return 0;
}
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
retry:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
for (; addr != end; addr += PAGE_SIZE) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 004a02f44271..c97fb2b7ab4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -791,8 +791,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
goto out;
}
- if (pmd_trans_unstable(pmdp))
+ if (pmd_trans_unstable(pmdp)) {
+ walk->action = ACTION_AGAIN;
goto out;
+ }
mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
addr, &ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f06ca8c18e62..af8907b4aad1 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,8 +514,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (ptl)
return queue_folios_pmd(pmd, ptl, addr, end, walk);
- if (pmd_trans_unstable(pmd))
+ if (pmd_trans_unstable(pmd)) {
+ walk->action = ACTION_AGAIN;
return 0;
+ }
mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
for (; addr != end; pte++, addr += PAGE_SIZE) {
--
2.40.1
On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <[email protected]> wrote:
>
> [...]
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 6259dd432eeb..823eaba5c6bf 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -631,8 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> goto out;
> }
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> goto out;
> + }
> +
> /*
> * The mmap_lock held all the way back in m_start() is what
> * keeps khugepaged out of here and from collapsing things
> @@ -1191,8 +1194,10 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
> return 0;
> }
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
> + }
>
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> return err;
> }
>
> - if (pmd_trans_unstable(pmdp))
> + if (pmd_trans_unstable(pmdp)) {
> + walk->action = ACTION_AGAIN;
> return 0;
Had a quick look at the pagemap code, I agree with your analysis:
"returning 0" may mess up pagemap, and a retry should be fine. But I'm
wondering whether we should just fill in empty entries. Anyway, I don't
have a strong opinion on this; I'm just a little concerned about a
potential indefinite retry.
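Something like the below is what I mean (untested sketch; it reuses the
existing add_to_pagemap() and make_pme() helpers from the same file, the
same way pagemap_pte_hole() reports holes):

	if (pmd_trans_unstable(pmdp)) {
		pagemap_entry_t pme = make_pme(0, 0);

		/* Report the unstable range as empty instead of retrying */
		for (; addr != end; addr += PAGE_SIZE) {
			err = add_to_pagemap(addr, &pme, pm);
			if (err)
				return err;
		}
		return 0;
	}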
> + }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> /*
> @@ -1888,8 +1895,10 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
> return 0;
> }
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
> + }
> #endif
> orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> do {
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 78cd12581628..0fd81712022c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -424,8 +424,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> }
>
> regular_folio:
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
> + }
> #endif
> tlb_change_page_size(tlb, PAGE_SIZE);
> orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> @@ -626,8 +628,10 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
> goto next;
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
> + }
>
> tlb_change_page_size(tlb, PAGE_SIZE);
> orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ee433be4c3b..15e50f033e41 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6021,8 +6021,10 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> return 0;
> }
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
Either retrying or keeping it as is is fine with me. I'm not aware of
anyone complaining about noticeably inaccurate charges due to this. But we
may have the potential indefinite retry anyway, so if you don't want to
risk that, it may be better to just keep it as is.
> + }
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE)
> if (get_mctgt_type(vma, addr, *pte, NULL))
> @@ -6241,8 +6243,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> return 0;
> }
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
> + }
> retry:
> pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> for (; addr != end; addr += PAGE_SIZE) {
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 004a02f44271..c97fb2b7ab4a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -791,8 +791,10 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
> goto out;
> }
>
> - if (pmd_trans_unstable(pmdp))
> + if (pmd_trans_unstable(pmdp)) {
> + walk->action = ACTION_AGAIN;
> goto out;
> + }
This may be worth retrying IMHO.
>
> mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
> addr, &ptl);
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f06ca8c18e62..af8907b4aad1 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -514,8 +514,10 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
> if (ptl)
> return queue_folios_pmd(pmd, ptl, addr, end, walk);
>
> - if (pmd_trans_unstable(pmd))
> + if (pmd_trans_unstable(pmd)) {
> + walk->action = ACTION_AGAIN;
> return 0;
> + }
Either retrying or keeping it as is is fine.
>
> mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> for (; addr != end; pte++, addr += PAGE_SIZE) {
> --
> 2.40.1
>
>
On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <[email protected]> wrote:
> >
> > [...]
> >
> > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > return err;
> > }
> >
> > - if (pmd_trans_unstable(pmdp))
> > + if (pmd_trans_unstable(pmdp)) {
> > + walk->action = ACTION_AGAIN;
> > return 0;
>
> Had a quick look at the pagemap code, I agree with your analysis:
> "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> wondering whether we should just fill in empty entries. Anyway, I don't
> have a strong opinion on this; I'm just a little concerned about a
> potential indefinite retry.
Yes, a none pte is still an option. But if we're going to fix this anyway,
it seems better to fix it with the more accurate approach that has since
popped up, and it's even less change (just set walk->action rather than
doing ad-hoc things at the different call sites).
I see that you're worried about a potential deadloop here, so I hope we
can discuss it all together here.
Unlike normal checks, a pmd_trans_unstable() hit means something must have
changed within a very short window just before, or it should simply never
trigger if nothing changed concurrently under us; so it's not an
"if (flag==true)" style check, which would be much more likely to loop.
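For reference, from memory the helper is roughly this
(include/linux/pgtable.h, simplified):

	static inline int pmd_trans_unstable(pmd_t *pmd)
	{
	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
		/*
		 * Returns 1 when the pmd is none, a huge pmd, or a THP
		 * migration entry -- i.e. anything a pte walk can't
		 * handle -- and clears the pmd if it's corrupted.
		 */
		return pmd_none_or_trans_huge_or_clear_bad(pmd);
	#else
		return 0;
	#endif
	}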
If we look at the places that I didn't touch, most of them suggest a retry
in one form or another. So if there's a worry about retrying, this also
wouldn't be the first place to do it (and for an "unstable" API, retrying
until it's stable really is the most natural thing to do).
So in general, it seems to me that if we deadloop on pmd_trans_unstable()
for whatever reason, then something much more wrong must already have
happened..
Thanks,
--
Peter Xu
On Mon, Jun 5, 2023 at 12:20 PM Peter Xu <[email protected]> wrote:
>
> On Mon, Jun 05, 2023 at 11:46:04AM -0700, Yang Shi wrote:
> > On Fri, Jun 2, 2023 at 4:06 PM Peter Xu <[email protected]> wrote:
> > >
> > > [...]
> > >
> > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > > return err;
> > > }
> > >
> > > - if (pmd_trans_unstable(pmdp))
> > > + if (pmd_trans_unstable(pmdp)) {
> > > + walk->action = ACTION_AGAIN;
> > > return 0;
> >
> > Had a quick look at the pagemap code, I agree with your analysis:
> > "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> > wondering whether we should just fill in empty entries. Anyway, I don't
> > have a strong opinion on this; I'm just a little concerned about a
> > potential indefinite retry.
>
> Yes, a none pte is still an option. But if we're going to fix this anyway,
> it seems better to fix it with the more accurate approach that has since
> popped up, and it's even less change (just set walk->action rather than
> doing ad-hoc things at the different call sites).
>
> I see that you're worried about a potential deadloop here, so I hope we
> can discuss it all together here.
>
> Unlike normal checks, a pmd_trans_unstable() hit means something must have
> changed within a very short window just before, or it should simply never
> trigger if nothing changed concurrently under us; so it's not an
> "if (flag==true)" style check, which would be much more likely to loop.
>
> If we look at the places that I didn't touch, most of them suggest a retry
> in one form or another. So if there's a worry about retrying, this also
> wouldn't be the first place to do it (and for an "unstable" API, retrying
> until it's stable really is the most natural thing to do).
IIUC, other than do_anonymous_page(), which suggests a retry (retrying
the page fault), the others may not. For example:
- follow_pmd_mask: returns -EBUSY
- wp_clean_pmd_entry: it actually just retries for the pmd_none case,
but the pagewalk code handles pmd_none by skipping it, so it basically
retries only once
- mincore_pte_range: the range is treated as unmapped by calling
__mincore_unmapped_range
Anyway, I really don't have a strong opinion on this; I may just be
over-concerned. I just thought: if nobody cares whether the result is
accurate or not, why do we bother fixing those cases?
>
> So in general, it seems to me that if we deadloop on pmd_trans_unstable()
> for whatever reason, then something much more wrong must already have
> happened..
>
> Thanks,
>
> --
> Peter Xu
>
On Tue, Jun 06, 2023 at 12:12:03PM -0700, Yang Shi wrote:
> > > > @@ -1539,8 +1544,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> > > > return err;
> > > > }
> > > >
> > > > - if (pmd_trans_unstable(pmdp))
> > > > + if (pmd_trans_unstable(pmdp)) {
> > > > + walk->action = ACTION_AGAIN;
> > > > return 0;
> > >
> > > Had a quick look at the pagemap code, I agree with your analysis:
> > > "returning 0" may mess up pagemap, and a retry should be fine. But I'm
> > > wondering whether we should just fill in empty entries. Anyway, I don't
> > > have a strong opinion on this; I'm just a little concerned about a
> > > potential indefinite retry.
> >
> > Yes, a none pte is still an option. But if we're going to fix this anyway,
> > it seems better to fix it with the more accurate approach that has since
> > popped up, and it's even less change (just set walk->action rather than
> > doing ad-hoc things at the different call sites).
> >
> > I see that you're worried about a potential deadloop here, so I hope we
> > can discuss it all together here.
> >
> > Unlike normal checks, a pmd_trans_unstable() hit means something must have
> > changed within a very short window just before, or it should simply never
> > trigger if nothing changed concurrently under us; so it's not an
> > "if (flag==true)" style check, which would be much more likely to loop.
> >
> > If we look at the places that I didn't touch, most of them suggest a retry
> > in one form or another. So if there's a worry about retrying, this also
> > wouldn't be the first place to do it (and for an "unstable" API, retrying
> > until it's stable really is the most natural thing to do).
>
> IIUC, other than do_anonymous_page(), which suggests a retry (retrying
> the page fault), the others may not. For example:
> - follow_pmd_mask: returns -EBUSY
I assumed -EBUSY means that if the caller still needs the page, it'll
just have to retry.
It's actually a very rare errno to return from follow page... e.g. some
gup callers may not even be able to handle -EBUSY afaiu, and neither does
the gup core (__get_user_pages); afaiu it'll just be forwarded upwards.
My bet is that it's just very rare, and only possible with FOLL_SPLIT_PMD
for now. I had a quick look: at least kvm handles -EBUSY as a real fault
(hva_to_pfn, where it should translate that -EBUSY into
KVM_PFN_ERR_FAULT); I think it would crash the hypervisor directly if gup
ever returned it one day (not for now, since that never happens without
FOLL_SPLIT_PMD)..
> - wp_clean_pmd_entry: it actually just retries for the pmd_none case,
> but the pagewalk code handles pmd_none by skipping it, so it basically
> retries only once
This one is very special IMHO: pmd_trans_unstable() should in most cases
be used after someone has checked the pmd value, before walking the ptes.
I have a feeling it's kind of abused here, to check whether it's a huge
pmd in whatever format.. IMHO it should just use the other pmd_*()
helpers, but I won't touch it in this series.
> - mincore_pte_range: the range is treated as unmapped by calling
> __mincore_unmapped_range
Correct.
Also pmd_devmap_trans_unstable(), called from 3 call sites:
pmd_devmap_trans_unstable[1418] return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
filemap_map_pmd[3423] if (pmd_devmap_trans_unstable(vmf->pmd)) {
finish_fault[4390] if (pmd_devmap_trans_unstable(vmf->pmd))
handle_pte_fault[4922] if (pmd_devmap_trans_unstable(vmf->pmd))
They all suggest a retry of the page fault, afaiu.
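For example, from memory the handle_pte_fault() site is roughly the
below; returning 0 makes the fault handler bail out, and the hardware
simply re-triggers the fault, which is the retry:

	if (pmd_devmap_trans_unstable(vmf->pmd))
		/*
		 * Unstable pmd: bail out of the fault now; the access
		 * will re-fault and find a stable pmd next time.
		 */
		return 0;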
So indeed not all of them retry, but for the ones that don't, I doubt
whether that's the best we could have. Again, I'll leave that out of this
series.
>
> Anyway, I really don't have a strong opinion on this; I may just be
> over-concerned. I just thought: if nobody cares whether the result is
> accurate or not, why do we bother fixing those cases?
Because we're already at it anyway, and it's easier and cleaner to do so? :)
I would say that if we're fixing this, the most important ones are the
pagemap and memcg paths so far; but since I'm at it and I checked all of
them anyway, I figured I should just make everywhere do the right thing,
in the same pattern, when handling an unstable pmd. Especially since
everything this patch touches uses the walk_page*() routines (I left the
special ones to the first patches), it's quite straightforward IMHO to
switch them altogether to ACTION_AGAIN.
Thanks,
--
Peter Xu