Here is the promised splitup of what I posted a few days ago as
5/7 mm/thp: fix page_vma_mapped_walk() if huge page mapped by ptes.
These are against 5.13-rc5 plus yesterday's mm/thp series:
https://lore.kernel.org/linux-mm/[email protected]/T/#u
(sorry about the misthreading, I missed linux-mm and lkml at first).
03/10 and 04/10 of that series also touched mm/page_vma_mapped.c.
I've marked all of these for stable: many are merely cleanups,
but I think they are much better before the main fix than after.
01/11 makes the opposite cleanup to the earlier 5/7: that preferred
to use pvmw->page, this prefers to use page - it could go either way.
11/11 is a new fix that I noticed while splitting up.
01/11 mm: page_vma_mapped_walk(): use page for pvmw->page
02/11 mm: page_vma_mapped_walk(): settle PageHuge on entry
03/11 mm: page_vma_mapped_walk(): use pmd_read_atomic()
04/11 mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd
05/11 mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block
06/11 mm: page_vma_mapped_walk(): crossing page table boundary
07/11 mm: page_vma_mapped_walk(): add a level of indentation
08/11 mm: page_vma_mapped_walk(): use goto instead of while (1)
09/11 mm: page_vma_mapped_walk(): get vma_address_end() earlier
10/11 mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes
11/11 mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk()
mm/page_vma_mapped.c | 160 +++++++++++++++++++++++++++------------------
1 file changed, 96 insertions(+), 64 deletions(-)
Hugh
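
For orientation before the individual patches, a minimal sketch of how
callers typically drive the walk - modelled on page_mapped_in_vma(),
illustrative only and not part of the series (rmap walkers such as
try_to_unmap_one() call page_vma_mapped_walk() in a while loop instead,
to visit every mapping):

/*
 * Illustrative sketch, modelled on page_mapped_in_vma(); not part of
 * the series.  A true return leaves pvmw.pte (or pvmw.pmd) pointing at
 * a mapping of the page, with the relevant page table lock held until
 * page_vma_mapped_walk_done().
 */
static bool page_mapped_in_vma_sketch(struct page *page,
				      struct vm_area_struct *vma,
				      unsigned long address)
{
	struct page_vma_mapped_walk pvmw = {
		.page = page,
		.vma = vma,
		.address = address,
		.flags = PVMW_SYNC,
	};

	if (!page_vma_mapped_walk(&pvmw))
		return false;
	/* pvmw.pte or pvmw.pmd now maps the page, with its lock held */
	page_vma_mapped_walk_done(&pvmw);
	return true;
}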
page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page was
used, sometimes pvmw->page itself: use the local copy "page" throughout.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e37bd43904af..a6dbf714ca15 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -156,7 +156,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pte)
goto next_pte;
- if (unlikely(PageHuge(pvmw->page))) {
+ if (unlikely(PageHuge(page))) {
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
if (!pvmw->pte)
@@ -217,8 +217,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* cannot return prematurely, while zap_huge_pmd() has
* cleared *pmd but not decremented compound_mapcount().
*/
- if ((pvmw->flags & PVMW_SYNC) &&
- PageTransCompound(pvmw->page)) {
+ if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
spin_unlock(ptl);
@@ -234,9 +233,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return true;
next_pte:
/* Seek to next pte only makes sense for THP */
- if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
+ if (!PageTransHuge(page) || PageHuge(page))
return not_found(pvmw);
- end = vma_address_end(pvmw->page, pvmw->vma);
+ end = vma_address_end(page, pvmw->vma);
do {
pvmw->address += PAGE_SIZE;
if (pvmw->address >= end)
--
2.26.2
page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case
out of the way at the start, so no need to worry about it later.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a6dbf714ca15..7c0504641fb8 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -153,10 +153,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pmd && !pvmw->pte)
return not_found(pvmw);
- if (pvmw->pte)
- goto next_pte;
-
if (unlikely(PageHuge(page))) {
+ /* The only possible mapping was handled on last iteration */
+ if (pvmw->pte)
+ return not_found(pvmw);
+
/* when pud is not present, pte will be NULL */
pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
if (!pvmw->pte)
@@ -168,6 +169,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
return true;
}
+
+ if (pvmw->pte)
+ goto next_pte;
restart:
pgd = pgd_offset(mm, pvmw->address);
if (!pgd_present(*pgd))
@@ -233,7 +237,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return true;
next_pte:
/* Seek to next pte only makes sense for THP */
- if (!PageTransHuge(page) || PageHuge(page))
+ if (!PageTransHuge(page))
return not_found(pvmw);
end = vma_address_end(page, pvmw->vma);
do {
--
2.26.2
page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
have a multi-word pmd entry, for which READ_ONCE() is not good enough.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 7c0504641fb8..973c3c4e72cc 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
pud = pud_offset(p4d, pvmw->address);
if (!pud_present(*pud))
return false;
+
pvmw->pmd = pmd_offset(pud, pvmw->address);
/*
* Make sure the pmd value isn't cached in a register by the
* compiler and used as a stale value after we've observed a
* subsequent update.
*/
- pmde = READ_ONCE(*pvmw->pmd);
+ pmde = pmd_read_atomic(pvmw->pmd);
+ barrier();
+
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
if (likely(pmd_trans_huge(*pvmw->pmd))) {
--
2.26.2
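
As background for the change above, a toy illustration of the tearing
concern on configurations whose pmd_t spans two machine words, such as
i386 with PAE - deliberately not kernel code, and not the real PAE
implementation:

/*
 * Toy illustration only.  If the two halves of a two-word pmd are
 * loaded separately while another CPU replaces the entry, the reader
 * can see the new low word combined with the old high word - a value
 * that never existed.  pmd_read_atomic() reads the halves in a careful
 * order so that a pmd_none() check is not fooled by a half-updated
 * entry, and the barrier() added above stops the compiler from reusing
 * a stale cached value on a later access.
 */
typedef struct { unsigned int low, high; } toy_pmd_t;

static toy_pmd_t toy_torn_read(volatile toy_pmd_t *pmdp)
{
	toy_pmd_t v;

	v.low  = pmdp->low;	/* a concurrent update may land here...  */
	v.high = pmdp->high;	/* ...leaving low and high mismatched    */
	return v;
}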
page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
use it in subsequent tests, instead of repeatedly dereferencing pointer.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 973c3c4e72cc..81000dd0b5da 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -194,18 +194,19 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- if (likely(pmd_trans_huge(*pvmw->pmd))) {
+ pmde = *pvmw->pmd;
+ if (likely(pmd_trans_huge(pmde))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
- if (pmd_page(*pvmw->pmd) != page)
+ if (pmd_page(pmde) != page)
return not_found(pvmw);
return true;
- } else if (!pmd_present(*pvmw->pmd)) {
+ } else if (!pmd_present(pmde)) {
if (thp_migration_supported()) {
if (!(pvmw->flags & PVMW_MIGRATION))
return not_found(pvmw);
- if (is_migration_entry(pmd_to_swp_entry(*pvmw->pmd))) {
- swp_entry_t entry = pmd_to_swp_entry(*pvmw->pmd);
+ if (is_migration_entry(pmd_to_swp_entry(pmde))) {
+ swp_entry_t entry = pmd_to_swp_entry(pmde);
if (migration_entry_to_page(entry) != page)
return not_found(pvmw);
--
2.26.2
page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
follow the same "return not_found, return not_found, return true" pattern
as the block above it (note: returning not_found there is never premature,
since existence or prior existence of huge pmd guarantees good alignment).
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 30 ++++++++++++++----------------
1 file changed, 14 insertions(+), 16 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 81000dd0b5da..b96fae568bc2 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -201,24 +201,22 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pmd_page(pmde) != page)
return not_found(pvmw);
return true;
- } else if (!pmd_present(pmde)) {
- if (thp_migration_supported()) {
- if (!(pvmw->flags & PVMW_MIGRATION))
- return not_found(pvmw);
- if (is_migration_entry(pmd_to_swp_entry(pmde))) {
- swp_entry_t entry = pmd_to_swp_entry(pmde);
+ }
+ if (!pmd_present(pmde)) {
+ swp_entry_t entry;
- if (migration_entry_to_page(entry) != page)
- return not_found(pvmw);
- return true;
- }
- }
- return not_found(pvmw);
- } else {
- /* THP pmd was split under us: handle on pte level */
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
+ if (!thp_migration_supported() ||
+ !(pvmw->flags & PVMW_MIGRATION))
+ return not_found(pvmw);
+ entry = pmd_to_swp_entry(pmde);
+ if (!is_migration_entry(entry) ||
+ migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
}
+ /* THP pmd was split under us: handle on pte level */
+ spin_unlock(pvmw->ptl);
+ pvmw->ptl = NULL;
} else if (!pmd_present(pmde)) {
/*
* If PVMW_SYNC, take and drop THP pmd lock so that we
--
2.26.2
page_vma_mapped_walk() cleanup: adjust the test for crossing page table
boundary - I believe pvmw->address is always page-aligned, but nothing
else here assumed that; and remember to reset pvmw->pte to NULL after
unmapping the page table, though I never saw any bug from that.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index b96fae568bc2..0fe6e558d336 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -247,16 +247,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->address >= end)
return not_found(pvmw);
/* Did we cross page table boundary? */
- if (pvmw->address % PMD_SIZE == 0) {
- pte_unmap(pvmw->pte);
+ if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
if (pvmw->ptl) {
spin_unlock(pvmw->ptl);
pvmw->ptl = NULL;
}
+ pte_unmap(pvmw->pte);
+ pvmw->pte = NULL;
goto restart;
- } else {
- pvmw->pte++;
}
+ pvmw->pte++;
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
--
2.26.2
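
For anyone puzzling over the new mask, a standalone userspace sketch of
why (address & (PMD_SIZE - PAGE_SIZE)) == 0 detects a page table
crossing even if the address were not page-aligned (values assume
x86_64's 4KiB pages and 2MiB pmds, purely for illustration):

/* Standalone sketch, not kernel code. */
#include <stdio.h>

#define PAGE_SIZE 0x1000UL
#define PMD_SIZE  0x200000UL

int main(void)
{
	/* deliberately not page-aligned, to show the difference */
	unsigned long addr = PMD_SIZE - 2 * PAGE_SIZE + 0x40;
	int i;

	for (i = 0; i < 4; i++, addr += PAGE_SIZE) {
		printf("%#lx: old (addr %% PMD_SIZE == 0) -> %d,"
		       " new (addr & (PMD_SIZE - PAGE_SIZE)) == 0 -> %d\n",
		       addr,
		       addr % PMD_SIZE == 0,
		       (addr & (PMD_SIZE - PAGE_SIZE)) == 0);
	}
	return 0;
}

The new test fires exactly once, when the page-index bits of the address
wrap into the next page table; the old modulo test never fires for an
unaligned address.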
page_vma_mapped_walk() cleanup: add a level of indentation to much of
the body, making no functional change in this commit, but reducing the
later diff when this is all converted to a loop.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 109 +++++++++++++++++++++++----------------------
1 file changed, 56 insertions(+), 53 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 0fe6e558d336..0840079ef7d2 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -173,65 +173,68 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pte)
goto next_pte;
restart:
- pgd = pgd_offset(mm, pvmw->address);
- if (!pgd_present(*pgd))
- return false;
- p4d = p4d_offset(pgd, pvmw->address);
- if (!p4d_present(*p4d))
- return false;
- pud = pud_offset(p4d, pvmw->address);
- if (!pud_present(*pud))
- return false;
-
- pvmw->pmd = pmd_offset(pud, pvmw->address);
- /*
- * Make sure the pmd value isn't cached in a register by the
- * compiler and used as a stale value after we've observed a
- * subsequent update.
- */
- pmde = pmd_read_atomic(pvmw->pmd);
- barrier();
-
- if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
- pvmw->ptl = pmd_lock(mm, pvmw->pmd);
- pmde = *pvmw->pmd;
- if (likely(pmd_trans_huge(pmde))) {
- if (pvmw->flags & PVMW_MIGRATION)
- return not_found(pvmw);
- if (pmd_page(pmde) != page)
- return not_found(pvmw);
- return true;
- }
- if (!pmd_present(pmde)) {
- swp_entry_t entry;
+ {
+ pgd = pgd_offset(mm, pvmw->address);
+ if (!pgd_present(*pgd))
+ return false;
+ p4d = p4d_offset(pgd, pvmw->address);
+ if (!p4d_present(*p4d))
+ return false;
+ pud = pud_offset(p4d, pvmw->address);
+ if (!pud_present(*pud))
+ return false;
- if (!thp_migration_supported() ||
- !(pvmw->flags & PVMW_MIGRATION))
- return not_found(pvmw);
- entry = pmd_to_swp_entry(pmde);
- if (!is_migration_entry(entry) ||
- migration_entry_to_page(entry) != page)
- return not_found(pvmw);
- return true;
- }
- /* THP pmd was split under us: handle on pte level */
- spin_unlock(pvmw->ptl);
- pvmw->ptl = NULL;
- } else if (!pmd_present(pmde)) {
+ pvmw->pmd = pmd_offset(pud, pvmw->address);
/*
- * If PVMW_SYNC, take and drop THP pmd lock so that we
- * cannot return prematurely, while zap_huge_pmd() has
- * cleared *pmd but not decremented compound_mapcount().
+ * Make sure the pmd value isn't cached in a register by the
+ * compiler and used as a stale value after we've observed a
+ * subsequent update.
*/
- if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
- spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+ pmde = pmd_read_atomic(pvmw->pmd);
+ barrier();
+
+ if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
+ pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+ pmde = *pvmw->pmd;
+ if (likely(pmd_trans_huge(pmde))) {
+ if (pvmw->flags & PVMW_MIGRATION)
+ return not_found(pvmw);
+ if (pmd_page(pmde) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ if (!pmd_present(pmde)) {
+ swp_entry_t entry;
- spin_unlock(ptl);
+ if (!thp_migration_supported() ||
+ !(pvmw->flags & PVMW_MIGRATION))
+ return not_found(pvmw);
+ entry = pmd_to_swp_entry(pmde);
+ if (!is_migration_entry(entry) ||
+ migration_entry_to_page(entry) != page)
+ return not_found(pvmw);
+ return true;
+ }
+ /* THP pmd was split under us: handle on pte level */
+ spin_unlock(pvmw->ptl);
+ pvmw->ptl = NULL;
+ } else if (!pmd_present(pmde)) {
+ /*
+ * If PVMW_SYNC, take and drop THP pmd lock so that we
+ * cannot return prematurely, while zap_huge_pmd() has
+ * cleared *pmd but not decremented compound_mapcount().
+ */
+ if ((pvmw->flags & PVMW_SYNC) &&
+ PageTransCompound(page)) {
+ spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
+
+ spin_unlock(ptl);
+ }
+ return false;
}
- return false;
+ if (!map_pte(pvmw))
+ goto next_pte;
}
- if (!map_pte(pvmw))
- goto next_pte;
while (1) {
unsigned long end;
--
2.26.2
page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
and use "goto this_pte", in place of the "while (1)" loop at the end.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 0840079ef7d2..006f4814abbc 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -144,6 +144,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
{
struct mm_struct *mm = pvmw->vma->vm_mm;
struct page *page = pvmw->page;
+ unsigned long end;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -234,10 +235,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}
if (!map_pte(pvmw))
goto next_pte;
- }
- while (1) {
- unsigned long end;
-
+this_pte:
if (check_pte(pvmw))
return true;
next_pte:
@@ -266,6 +264,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
spin_lock(pvmw->ptl);
}
+ goto this_pte;
}
}
--
2.26.2
page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start,
rather than later at next_pte. It's a little unnecessary overhead on the
first call, but makes for a simpler loop in the following commit.
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 006f4814abbc..f6839f536645 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -171,6 +171,15 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return true;
}
+ /*
+ * Seek to next pte only makes sense for THP.
+ * But more important than that optimization, is to filter out
+ * any PageKsm page: whose page->index misleads vma_address()
+ * and vma_address_end() to disaster.
+ */
+ end = PageTransCompound(page) ?
+ vma_address_end(page, pvmw->vma) :
+ pvmw->address + PAGE_SIZE;
if (pvmw->pte)
goto next_pte;
restart:
@@ -239,10 +248,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (check_pte(pvmw))
return true;
next_pte:
- /* Seek to next pte only makes sense for THP */
- if (!PageTransHuge(page))
- return not_found(pvmw);
- end = vma_address_end(page, pvmw->vma);
do {
pvmw->address += PAGE_SIZE;
if (pvmw->address >= end)
--
2.26.2
Running certain tests with a DEBUG_VM kernel would crash within hours,
on the total_mapcount BUG() in split_huge_page_to_list(), while trying
to free up some memory by punching a hole in a shmem huge page: split's
try_to_unmap() was unable to find all the mappings of the page (which,
on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
Crash dumps showed two tail pages of a shmem huge page remained mapped
by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
of a long unmapped range; and no page table had yet been allocated for
the head of the huge page to be mapped into.
Although designed to handle these odd misaligned huge-page-mapped-by-pte
cases, page_vma_mapped_walk() falls short by returning false prematurely
when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
are cases when a huge page may span the boundary, with ptes present in
the next.
Restructure page_vma_mapped_walk() as a loop to continue in these cases,
while keeping its layout much as before. Add a step_forward() helper to
advance pvmw->address across those boundaries: originally I tried to use
mm's standard p?d_addr_end() macros, but hit the same crash 512 times
less often: because of the way redundant levels are folded together,
but folded differently in different configurations, it was just too
difficult to use them correctly; and step_forward() is simpler anyway.
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index f6839f536645..6eb2f1863506 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -116,6 +116,13 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
return pfn_is_match(pvmw->page, pfn);
}
+static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
+{
+ pvmw->address = (pvmw->address + size) & ~(size - 1);
+ if (!pvmw->address)
+ pvmw->address = ULONG_MAX;
+}
+
/**
* page_vma_mapped_walk - check if @pvmw->page is mapped in @pvmw->vma at
* @pvmw->address
@@ -183,16 +190,22 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
if (pvmw->pte)
goto next_pte;
restart:
- {
+ do {
pgd = pgd_offset(mm, pvmw->address);
- if (!pgd_present(*pgd))
- return false;
+ if (!pgd_present(*pgd)) {
+ step_forward(pvmw, PGDIR_SIZE);
+ continue;
+ }
p4d = p4d_offset(pgd, pvmw->address);
- if (!p4d_present(*p4d))
- return false;
+ if (!p4d_present(*p4d)) {
+ step_forward(pvmw, P4D_SIZE);
+ continue;
+ }
pud = pud_offset(p4d, pvmw->address);
- if (!pud_present(*pud))
- return false;
+ if (!pud_present(*pud)) {
+ step_forward(pvmw, PUD_SIZE);
+ continue;
+ }
pvmw->pmd = pmd_offset(pud, pvmw->address);
/*
@@ -240,7 +253,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
spin_unlock(ptl);
}
- return false;
+ step_forward(pvmw, PMD_SIZE);
+ continue;
}
if (!map_pte(pvmw))
goto next_pte;
@@ -270,7 +284,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
spin_lock(pvmw->ptl);
}
goto this_pte;
- }
+ } while (pvmw->address < end);
+
+ return false;
}
/**
--
2.26.2
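
A standalone sketch of how the new step_forward() helper behaves: it
rounds the address up to the start of the next region of the given
size, and saturates to ULONG_MAX rather than wrapping to zero (sizes
below assume x86_64's 2MiB pmds and 512GiB pgd entries, purely for
illustration):

/* Standalone sketch, not kernel code. */
#include <stdio.h>
#include <limits.h>

#define PMD_SIZE   0x200000UL
#define PGDIR_SIZE 0x8000000000UL

struct toy_pvmw { unsigned long address; };

static void step_forward(struct toy_pvmw *pvmw, unsigned long size)
{
	pvmw->address = (pvmw->address + size) & ~(size - 1);
	if (!pvmw->address)
		pvmw->address = ULONG_MAX;
}

int main(void)
{
	struct toy_pvmw pvmw = { .address = 0x1ff000 };

	step_forward(&pvmw, PMD_SIZE);
	printf("%#lx\n", pvmw.address);	/* 0x200000: start of next pmd */

	pvmw.address = ULONG_MAX - PGDIR_SIZE + 1;
	step_forward(&pvmw, PGDIR_SIZE);
	printf("%#lx\n", pvmw.address);	/* saturates to ULONG_MAX */
	return 0;
}

The saturation avoids the address wrapping to zero and satisfying the
enclosing loop's "pvmw->address < end" test again.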
Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
ptlock in the PVMW_SYNC case? That too might have been responsible for
BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
I've never seen any.
Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
mm/page_vma_mapped.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 6eb2f1863506..7ae4a016304b 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -277,6 +277,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
goto restart;
}
pvmw->pte++;
+ if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
+ pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+ spin_lock(pvmw->ptl);
+ }
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
--
2.26.2
Thanks for this, I've read through page_vma_mapped_walk() a few too many times
recently and it annoyed me enough to start writing some patches to clean it up,
but happy to see this instead.
I confirmed pvmw->page isn't modified here or in map/check_pte(pvmw), so the
patch looks good to me:
Reviewed-by: Alistair Popple <[email protected]>
On Thursday, 10 June 2021 4:34:40 PM AEST Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page was
> used, sometimes pvmw->page itself: use the local copy "page" throughout.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
> ---
> mm/page_vma_mapped.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e37bd43904af..a6dbf714ca15 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -156,7 +156,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> if (pvmw->pte)
> goto next_pte;
>
> - if (unlikely(PageHuge(pvmw->page))) {
> + if (unlikely(PageHuge(page))) {
> /* when pud is not present, pte will be NULL */
> pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> if (!pvmw->pte)
> @@ -217,8 +217,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> - if ((pvmw->flags & PVMW_SYNC) &&
> - PageTransCompound(pvmw->page)) {
> + if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
> spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>
> spin_unlock(ptl);
> @@ -234,9 +233,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> return true;
> next_pte:
> /* Seek to next pte only makes sense for THP */
> - if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
> + if (!PageTransHuge(page) || PageHuge(page))
> return not_found(pvmw);
> - end = vma_address_end(pvmw->page, pvmw->vma);
> + end = vma_address_end(page, pvmw->vma);
> do {
> pvmw->address += PAGE_SIZE;
> if (pvmw->address >= end)
> --
> 2.26.2
>
On Wed, Jun 09, 2021 at 11:34:40PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: sometimes the local copy of pvmw->page was
> used, sometimes pvmw->page itself: use the local copy "page" throughout.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
A question below.
> ---
> mm/page_vma_mapped.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index e37bd43904af..a6dbf714ca15 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -156,7 +156,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> if (pvmw->pte)
> goto next_pte;
>
> - if (unlikely(PageHuge(pvmw->page))) {
> + if (unlikely(PageHuge(page))) {
> /* when pud is not present, pte will be NULL */
> pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> if (!pvmw->pte)
> @@ -217,8 +217,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * cannot return prematurely, while zap_huge_pmd() has
> * cleared *pmd but not decremented compound_mapcount().
> */
> - if ((pvmw->flags & PVMW_SYNC) &&
> - PageTransCompound(pvmw->page)) {
> + if ((pvmw->flags & PVMW_SYNC) && PageTransCompound(page)) {
> spinlock_t *ptl = pmd_lock(mm, pvmw->pmd);
>
> spin_unlock(ptl);
> @@ -234,9 +233,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> return true;
> next_pte:
> /* Seek to next pte only makes sense for THP */
> - if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
> + if (!PageTransHuge(page) || PageHuge(page))
> return not_found(pvmw);
> - end = vma_address_end(pvmw->page, pvmw->vma);
> + end = vma_address_end(page, pvmw->vma);
> do {
> pvmw->address += PAGE_SIZE;
> if (pvmw->address >= end)
I see two more pvmw->page in this loop. Do you leave them here as the code
will be rewritten later in the patchset?
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:36:36PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case
> out of the way at the start, so no need to worry about it later.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> have a multi-word pmd entry, for which READ_ONCE() is not good enough.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
> ---
> mm/page_vma_mapped.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 7c0504641fb8..973c3c4e72cc 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> pud = pud_offset(p4d, pvmw->address);
> if (!pud_present(*pud))
> return false;
> +
> pvmw->pmd = pmd_offset(pud, pvmw->address);
> /*
> * Make sure the pmd value isn't cached in a register by the
> * compiler and used as a stale value after we've observed a
> * subsequent update.
> */
> - pmde = READ_ONCE(*pvmw->pmd);
> + pmde = pmd_read_atomic(pvmw->pmd);
> + barrier();
> +
Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
mm/hmm.c uses the same pattern as you are and I tend to think that the
rest of pmd_read_atomic() users may be broken.
Am I wrong?
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:40:08PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
> use it in subsequent tests, instead of repeatedly dereferencing pointer.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:42:12PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
> follow the same "return not_found, return not_found, return true" pattern
> as the block above it (note: returning not_found there is never premature,
> since existence or prior existence of huge pmd guarantees good alignment).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:44:10PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: adjust the test for crossing page table
> boundary - I believe pvmw->address is always page-aligned, but nothing
> else here assumed that;
Maybe we should just get it aligned instead? (PMD_SIZE - PAGE_SIZE) is not
the most obvious mask calculation.
> and remember to reset pvmw->pte to NULL after
> unmapping the page table, though I never saw any bug from that.
Okay, it's fair enough.
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:46:30PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: add a level of indentation to much of
> the body, making no functional change in this commit, but reducing the
> later diff when this is all converted to a loop.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Thanks, it helps a lot.
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:50:23PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the start,
> rather than later at next_pte. It's a little unnecessary overhead on the
> first call, but makes for a simpler loop in the following commit.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:48:27PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
> and use "goto this_pte", in place of the "while (1)" loop at the end.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:52:37PM -0700, Hugh Dickins wrote:
> Running certain tests with a DEBUG_VM kernel would crash within hours,
> on the total_mapcount BUG() in split_huge_page_to_list(), while trying
> to free up some memory by punching a hole in a shmem huge page: split's
> try_to_unmap() was unable to find all the mappings of the page (which,
> on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
>
> Crash dumps showed two tail pages of a shmem huge page remained mapped
> by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
> of a long unmapped range; and no page table had yet been allocated for
> the head of the huge page to be mapped into.
>
> Although designed to handle these odd misaligned huge-page-mapped-by-pte
> cases, page_vma_mapped_walk() falls short by returning false prematurely
> when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
> are cases when a huge page may span the boundary, with ptes present in
> the next.
>
> Restructure page_vma_mapped_walk() as a loop to continue in these cases,
> while keeping its layout much as before. Add a step_forward() helper to
> advance pvmw->address across those boundaries: originally I tried to use
> mm's standard p?d_addr_end() macros, but hit the same crash 512 times
> less often: because of the way redundant levels are folded together,
> but folded differently in different configurations, it was just too
> difficult to use them correctly; and step_forward() is simpler anyway.
>
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Wed, Jun 09, 2021 at 11:54:46PM -0700, Hugh Dickins wrote:
> Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
> ptlock in the PVMW_SYNC case? That too might have been responsible for
> BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
> I've never seen any.
>
> Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Cc: <[email protected]>
> > mm/page_vma_mapped.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 7c0504641fb8..973c3c4e72cc 100644
> > +++ b/mm/page_vma_mapped.c
> > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > pud = pud_offset(p4d, pvmw->address);
> > if (!pud_present(*pud))
> > return false;
> > +
> > pvmw->pmd = pmd_offset(pud, pvmw->address);
> > /*
> > * Make sure the pmd value isn't cached in a register by the
> > * compiler and used as a stale value after we've observed a
> > * subsequent update.
> > */
> > - pmde = READ_ONCE(*pvmw->pmd);
> > + pmde = pmd_read_atomic(pvmw->pmd);
> > + barrier();
> > +
>
> Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> mm/hmm.c uses the same pattern as you are and I tend to think that the
> rest of pmd_read_atomic() users may be broken.
>
> Am I wrong?
I agree with you, something called _atomic should not require the
caller to provide barriers.
I think the issue is simply that the two implementations of
pmd_read_atomic() should use READ_ONCE() internally, no?
Jason
On Thu, Jun 10, 2021 at 11:55:22AM +0300, Kirill A. Shutemov wrote:
> > @@ -234,9 +233,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > return true;
> > next_pte:
> > /* Seek to next pte only makes sense for THP */
> > - if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
> > + if (!PageTransHuge(page) || PageHuge(page))
> > return not_found(pvmw);
> > - end = vma_address_end(pvmw->page, pvmw->vma);
> > + end = vma_address_end(page, pvmw->vma);
> > do {
> > pvmw->address += PAGE_SIZE;
> > if (pvmw->address >= end)
>
> I see two more pvmw->page in this loop. Do you leave them here as the code
> will be rewritten later in the patchset?
I think they've got removed in the previous series ("[PATCH v2 04/10] mm/thp: fix
vma_address() if virtual address below file offset").
Reviewed-by: Peter Xu <[email protected]>
--
Peter Xu
On Wed, Jun 09, 2021 at 11:36:36PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case
> out of the way at the start, so no need to worry about it later.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
> ---
> mm/page_vma_mapped.c | 12 ++++++++----
> 1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index a6dbf714ca15..7c0504641fb8 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -153,10 +153,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> if (pvmw->pmd && !pvmw->pte)
> return not_found(pvmw);
>
> - if (pvmw->pte)
> - goto next_pte;
> -
> if (unlikely(PageHuge(page))) {
> + /* The only possible mapping was handled on last iteration */
> + if (pvmw->pte)
> + return not_found(pvmw);
> +
> /* when pud is not present, pte will be NULL */
> pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> if (!pvmw->pte)
Would it be even nicer to move the initial check to be after PageHuge() too?
if (pvmw->pmd && !pvmw->pte)
return not_found(pvmw);
It looks already better, so no strong opinion.
Reviewed-by: Peter Xu <[email protected]>
Thanks,
--
Peter Xu
On Wed, Jun 09, 2021 at 11:40:08PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
> use it in subsequent tests, instead of repeatedly dereferencing pointer.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
Reviewed-by: Peter Xu <[email protected]>
--
Peter Xu
On Wed, Jun 09, 2021 at 11:42:12PM -0700, Hugh Dickins wrote:
> page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
> follow the same "return not_found, return not_found, return true" pattern
> as the block above it (note: returning not_found there is never premature,
> since existence or prior existence of huge pmd guarantees good alignment).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
> ---
> mm/page_vma_mapped.c | 30 ++++++++++++++----------------
> 1 file changed, 14 insertions(+), 16 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 81000dd0b5da..b96fae568bc2 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -201,24 +201,22 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> if (pmd_page(pmde) != page)
> return not_found(pvmw);
> return true;
> - } else if (!pmd_present(pmde)) {
> - if (thp_migration_supported()) {
> - if (!(pvmw->flags & PVMW_MIGRATION))
> - return not_found(pvmw);
> - if (is_migration_entry(pmd_to_swp_entry(pmde))) {
> - swp_entry_t entry = pmd_to_swp_entry(pmde);
> + }
> + if (!pmd_present(pmde)) {
> + swp_entry_t entry;
>
> - if (migration_entry_to_page(entry) != page)
> - return not_found(pvmw);
> - return true;
> - }
> - }
> - return not_found(pvmw);
> - } else {
> - /* THP pmd was split under us: handle on pte level */
> - spin_unlock(pvmw->ptl);
> - pvmw->ptl = NULL;
> + if (!thp_migration_supported() ||
> + !(pvmw->flags & PVMW_MIGRATION))
> + return not_found(pvmw);
> + entry = pmd_to_swp_entry(pmde);
> + if (!is_migration_entry(entry) ||
> + migration_entry_to_page(entry) != page)
We'll need to do s/migration_entry_to_page/pfn_swap_entry_to_page/, depending
on whether Alistair's series lands first or not I guess (as you mentioned in
the cover letter).
Thanks for the change, it does look much better.
Reviewed-by: Peter Xu <[email protected]>
> + return not_found(pvmw);
> + return true;
> }
> + /* THP pmd was split under us: handle on pte level */
> + spin_unlock(pvmw->ptl);
> + pvmw->ptl = NULL;
> } else if (!pmd_present(pmde)) {
> /*
> * If PVMW_SYNC, take and drop THP pmd lock so that we
> --
> 2.26.2
>
--
Peter Xu
On Thu, 10 Jun 2021, Peter Xu wrote:
> On Thu, Jun 10, 2021 at 11:55:22AM +0300, Kirill A. Shutemov wrote:
> > > @@ -234,9 +233,9 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > return true;
> > > next_pte:
> > > /* Seek to next pte only makes sense for THP */
> > > - if (!PageTransHuge(pvmw->page) || PageHuge(pvmw->page))
> > > + if (!PageTransHuge(page) || PageHuge(page))
> > > return not_found(pvmw);
> > > - end = vma_address_end(pvmw->page, pvmw->vma);
> > > + end = vma_address_end(page, pvmw->vma);
> > > do {
> > > pvmw->address += PAGE_SIZE;
> > > if (pvmw->address >= end)
> >
> > I see two more pvmw->page in this loop. Do you leave them here as the code
> > will be rewritten later in the patchset?
That would be tacky; but I cannot see them (apart from in check_pte()).
>
> I think they've got removed in the previous series ("[PATCH v2 04/10] mm/thp: fix
> vma_address() if virtual address below file offset").
Yes, I think you've found the right explanation.
>
> Reviewed-by: Peter Xu <[email protected]>
Thanks,
Hugh
On Thu, 10 Jun 2021, Peter Xu wrote:
> On Wed, Jun 09, 2021 at 11:36:36PM -0700, Hugh Dickins wrote:
> > page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case
> > out of the way at the start, so no need to worry about it later.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Cc: <[email protected]>
> > ---
> > mm/page_vma_mapped.c | 12 ++++++++----
> > 1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index a6dbf714ca15..7c0504641fb8 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -153,10 +153,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > if (pvmw->pmd && !pvmw->pte)
> > return not_found(pvmw);
> >
> > - if (pvmw->pte)
> > - goto next_pte;
> > -
> > if (unlikely(PageHuge(page))) {
> > + /* The only possible mapping was handled on last iteration */
> > + if (pvmw->pte)
> > + return not_found(pvmw);
> > +
> > /* when pud is not present, pte will be NULL */
> > pvmw->pte = huge_pte_offset(mm, pvmw->address, page_size(page));
> > if (!pvmw->pte)
>
> Would it be even nicer to move the initial check to be after PageHuge() too?
>
> if (pvmw->pmd && !pvmw->pte)
> return not_found(pvmw);
>
> It looks already better, so no strong opinion.
It hadn't occurred to me to move that. I probably wanted those two
"The only possible mapping" comments near each other. I don't see any
particular reason to change them around now - beyond the title saying
"settle PageHuge on entry", whereas this is a few lines after entry!
Let's leave it as is.
>
> Reviewed-by: Peter Xu <[email protected]>
Thanks,
Hugh
On Thu, 10 Jun 2021, Kirill A. Shutemov wrote:
> On Wed, Jun 09, 2021 at 11:44:10PM -0700, Hugh Dickins wrote:
> > page_vma_mapped_walk() cleanup: adjust the test for crossing page table
> > boundary - I believe pvmw->address is always page-aligned, but nothing
> > else here assumed that;
>
> Maybe we should just get it aligned instead? (PMD_SIZE - PAGE_SIZE) is not
> the most obvious mask calculation.
Would you prefer it with another line of comment after the
/* Did we cross page table boundary? */
Maybe,
/* Is address always page-aligned? No need to assume that. */
I just don't see the point in forcing alignment when the test is good
(and don't know for sure whether address is always aligned there or not).
I prefer to leave it as is, just letting the commit message document it.
>
> > and remember to reset pvmw->pte to NULL after
> > unmapping the page table, though I never saw any bug from that.
>
> Okay, it's fair enough.
Thanks,
Hugh
On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> > > Cc: <[email protected]>
> > > mm/page_vma_mapped.c | 5 ++++-
> > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > index 7c0504641fb8..973c3c4e72cc 100644
> > > +++ b/mm/page_vma_mapped.c
> > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > pud = pud_offset(p4d, pvmw->address);
> > > if (!pud_present(*pud))
> > > return false;
> > > +
> > > pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > /*
> > > * Make sure the pmd value isn't cached in a register by the
> > > * compiler and used as a stale value after we've observed a
> > > * subsequent update.
> > > */
> > > - pmde = READ_ONCE(*pvmw->pmd);
> > > + pmde = pmd_read_atomic(pvmw->pmd);
> > > + barrier();
> > > +
> >
> > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > rest of pmd_read_atomic() users may be broken.
> >
> > Am I wrong?
>
> I agree with you, something called _atomic should not require the
> caller to provide barriers.
>
> I think the issue is simply that the two implementations of
> pmd_read_atomic() should use READ_ONCE() internally, no?
I've had great difficulty coming up with answers for you.
This patch was based on two notions I've had lodged in my mind
for several years:
1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
atomically (even if the pmd_t spans multiple words); but reading again
after all this time the comment above it, it seems to be more specialized
than I'd thought (biggest selling point being for when you want to check
pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
so really needs that barrier() to be as safe as the previous READ_ONCE().
2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
because READ_ONCE() did not work (did not give the necessary guarantee? or
did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.
But I've now come across some changes that Will Deacon made last year:
the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
"a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
Restructure sparc32 MMU page-table layout") put an end to sparc32's
typedef struct { unsigned long pmdv[16]; } pmd_t.
It looks like my justification for this 03/11 patch is out-of-date,
and the patch should be dropped from the series.
As to your questions about pmd_read_atomic() usage elsewhere:
please don't force me to think so hard! (And you've set me half-
wondering, whether there are sneaky THP transitions, perhaps of the
"unstable" kind, that page_vma_mapped_walk() should be paying more
attention to: but for sanity's sake I won't go there, not now.)
Hugh
On Thu, Jun 10, 2021 at 04:02:47PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Kirill A. Shutemov wrote:
> > On Wed, Jun 09, 2021 at 11:44:10PM -0700, Hugh Dickins wrote:
> > > page_vma_mapped_walk() cleanup: adjust the test for crossing page table
> > > boundary - I believe pvmw->address is always page-aligned, but nothing
> > > else here assumed that;
> >
> > Maybe we should just get it aligned instead? (PMD_SIZE - PAGE_SIZE) is not
> > the most obvious mask calculation.
>
> Would you prefer it with another line of comment after the
> /* Did we cross page table boundary? */
>
> Maybe,
> /* Is address always page-aligned? No need to assume that. */
>
> I just don't see the point in forcing alignment when the test is good
> (and don't know for sure whether address is always aligned there or not).
>
> I prefer to leave it as is, just letting the commit message document it.
Okay.
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > >
> > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > Cc: <[email protected]>
> > > > mm/page_vma_mapped.c | 5 ++++-
> > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > pud = pud_offset(p4d, pvmw->address);
> > > > if (!pud_present(*pud))
> > > > return false;
> > > > +
> > > > pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > /*
> > > > * Make sure the pmd value isn't cached in a register by the
> > > > * compiler and used as a stale value after we've observed a
> > > > * subsequent update.
> > > > */
> > > > - pmde = READ_ONCE(*pvmw->pmd);
> > > > + pmde = pmd_read_atomic(pvmw->pmd);
> > > > + barrier();
> > > > +
> > >
> > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > rest of pmd_read_atomic() users may be broken.
> > >
> > > Am I wrong?
> >
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> >
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
>
> I've had great difficulty coming up with answers for you.
>
> This patch was based on two notions I've had lodged in my mind
> for several years:
>
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().
IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock, and AFAIK
our modern understanding of the memory model is that this requires
READ_ONCE().
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
{
pmdval_t ret;
- u32 *tmp = (u32 *)pmdp;
+ u32 *tmp = READ_ONCE((u32 *)pmdp);
ret = (pmdval_t) (*tmp);
if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
* or we can end up with a partial pmd.
*/
smp_rmb();
- ret |= ((pmdval_t)*(tmp + 1)) << 32;
+ ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
}
return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
* only going to work, if the pmdval_t isn't larger than
* an unsigned long.
*/
- return *pmdp;
+ return READ_ONCE(*pmdp);
}
#endif
> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee? or
> did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.
This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
dropped here "because it doesn't work on non-scalar types" due to a
(now 8 year old) gcc bug.
According to the gcc bug, READ_ONCE() on anything that is a scalar-sized
struct triggers GCC to ignore the READ_ONCEyness. To work around this
bug, READ_ONCE() could never be used on any of the struct-protected
page table elements. While I am not 100% sure, it looks like
this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
required compiler this is all just cruft. Use READ_ONCE() here too...
> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.
Indeed, that is good news
> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard! (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)
If I recall, (and I didn't recheck this right now) the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.
Otherwise it returns the flags and a garbage pointer because under the
THP protocol a PMD pointing at a page can be converted to a PTE table
if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
page requires mmap sem in write mode so once a PTE table is observed
'locklessly' the value is stable.. Or at least so says the documentation
pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..
As far as is page_vma_mapped_walk correct.. Everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know
the rules around migration entries to know right/wrong.
I suspect it probably is required to manipulate a migration entry
while holding the mmap_sem in write mode??
There is some list of changes to the page table that require holding
the mmap sem in write mode that I've never seen documented for..
Jason
On Thu, 10 Jun 2021, Kirill A. Shutemov wrote:
> On Wed, Jun 09, 2021 at 11:54:46PM -0700, Hugh Dickins wrote:
> > Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
> > ptlock in the PVMW_SYNC case? That too might have been responsible for
> > BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
> > I've never seen any.
> >
> > Fixes: ace71a19cec5 ("mm: introduce page_vma_mapped_walk()")
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Cc: <[email protected]>
>
> Acked-by: Kirill A. Shutemov <[email protected]>
Thanks Kirill.
And Wang Yugui has now reported the good news, that this afterthought
patch finally fixes the unmap_page() BUGs they were hitting on 5.10.
Andrew, please add a link to
https://lore.kernel.org/linux-mm/[email protected]/
and
Tested-by: Wang Yugui <[email protected]>
Thanks,
Hugh
On Fri, 11 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> > On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > >
> > > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > > Cc: <[email protected]>
> > > > > mm/page_vma_mapped.c | 5 ++++-
> > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > > +++ b/mm/page_vma_mapped.c
> > > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > > pud = pud_offset(p4d, pvmw->address);
> > > > > if (!pud_present(*pud))
> > > > > return false;
> > > > > +
> > > > > pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > > /*
> > > > > * Make sure the pmd value isn't cached in a register by the
> > > > > * compiler and used as a stale value after we've observed a
> > > > > * subsequent update.
> > > > > */
> > > > > - pmde = READ_ONCE(*pvmw->pmd);
> > > > > + pmde = pmd_read_atomic(pvmw->pmd);
> > > > > + barrier();
> > > > > +
> > > >
> > > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > > rest of pmd_read_atomic() users may be broken.
> > > >
> > > > Am I wrong?
> > >
> > > I agree with you, something called _atomic should not require the
> > > caller to provide barriers.
> > >
> > > I think the issue is simply that the two implementations of
> > > pmd_read_atomic() should use READ_ONCE() internally, no?
> >
> > I've had great difficulty coming up with answers for you.
> >
> > This patch was based on two notions I've had lodged in my mind
> > for several years:
> >
> > 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> > atomically (even if the pmd_t spans multiple words); but reading again
> > after all this time the comment above it, it seems to be more specialized
> > than I'd thought (biggest selling point being for when you want to check
> > pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
> > so really needs that barrier() to be as safe as the previous READ_ONCE().
>
> IMHO it is a simple bug that the 64 bit version of this is not marked
> with READ_ONCE() internally. It is reading data without a lock, and AFAIK
> our modern understanding of the memory model is that this requires
> READ_ONCE().
>
> diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> index e896ebef8c24cb..0bf1fdec928e71 100644
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> {
> pmdval_t ret;
> - u32 *tmp = (u32 *)pmdp;
> + u32 *tmp = READ_ONCE((u32 *)pmdp);
>
> ret = (pmdval_t) (*tmp);
> if (ret) {
> @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> * or we can end up with a partial pmd.
> */
> smp_rmb();
> - ret |= ((pmdval_t)*(tmp + 1)) << 32;
> + ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> }
Maybe that. Or maybe now (since Will's changes) it can just do
one READ_ONCE() of the whole, then adjust its local copy.
>
> return (pmd_t) { ret };
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 46b13780c2c8c9..c8c7a3307d2773 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> * only going to work, if the pmdval_t isn't larger than
> * an unsigned long.
> */
> - return *pmdp;
> + return READ_ONCE(*pmdp);
> }
> #endif
And it should now be possible to delete the #ifdef THP barrier() in
function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().
>
>
> > 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> > because READ_ONCE() did not work (did not give the necessary guarantee? or
> > did not build?) on architectures with multiple word pmd_ts e.g. i386 PAE.
>
> This is really interesting, the git history e37c69827063 ("mm: replace
> ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was
> dropped here "because it doesn't work on non-scalar types" due to a
> (now 8 year old) gcc bug.
>
> According to the gcc bug READ_ONCE() on anything that is a scalar
> sized struct triggers GCC to ignore the READ_ONCEyness. To work around
> this bug, READ_ONCE can never be used on any of the struct
> protected page table elements. While I am not 100% sure, it looks like
> this is a pre gcc 4.9 bug, and since gcc 4.9 is now the minimum
> required compiler this is all just cruft. Use READ_ONCE() here too...
Oh, thanks particularly for looking into the gcc end of it:
I never knew that part of the story.
>
> > But I've now come across some changes that Will Deacon made last year:
> > the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> > native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> > "a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
> > Restructure sparc32 MMU page-table layout") put an end to sparc32's
> > typedef struct { unsigned long pmdv[16]; } pmd_t.
>
> Indeed, that is good news
>
> > As to your questions about pmd_read_atomic() usage elsewhere:
> > please don't force me to think so hard! (And you've set me half-
> > wondering, whether there are sneaky THP transitions, perhaps of the
> > "unstable" kind, that page_vma_mapped_walk() should be paying more
> > attention to: but for sanity's sake I won't go there, not now.)
>
> If I recall, (and I didn't recheck this right now) the only thing
> pmd_read_atomic() provides is the special property that if the pmd's
> flags are observed to point to a pte table then pmd_read_atomic() will
> reliably return the pte table pointer.
I expect you're right, I haven't rechecked. But it does also provide that
special guarantee on matching pmd_none() when half matches pmd_none():
which one or more callers want, but irrelevant where I added it.
>
> Otherwise it returns the flags and a garbage pointer because under the
> THP protocol a PMD pointing at a page can be converted to a PTE table
> if you hold the mmap sem in read mode. Upgrading a PTE table to a PMD
> page requires mmap sem in write mode so once a PTE table is observed
> 'locklessly' the value is stable.. Or at least so says the documentation
>
> pmd_read_atomic() is badly named, tricky to use, and missing the
> READ_ONCE() because it is so old..
I think yes to each of those three points.
>
> As far as is page_vma_mapped_walk correct.. Everything except
> is_pmd_migration_entry() is fine to my eye, and I simply don't know
> > the rules around migration entries to know right/wrong.
So long as the swap "type" is entirely in the same word as the pte
flags would be, I think is_pmd_migration_entry() should be fine.
There it's just looking for a hint: is it worth taking pmd_lock()
to obtain a reliable pmde?
>
> I suspect it probably is required to manipulate a migration entry
> while holding the mmap_sem in write mode??
I don't think in terms of mmap_sem/mmap_lock here: this is on the
rmap lookup path, and typically mmap_lock is not held (whereas I
was surprised to find the comment above pmd_read_atomic() saying a
lot about it - another reason it's inappropriate in pvmw I guess).
The important lock here is pmd_lock() a.k.a. pmd_trans_huge_lock():
get it when it's necessary, but (the tricky part) try to avoid the
overhead and contention in getting it when not necessary. Before
THP it was easy, but various THP transitions make it hard.
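(A rough sketch of that shape, for illustration: the unlocked read is only
a hint, and pmd_lock() is taken, and the pmd re-read, before anything is
decided. This is a simplified fragment assuming page_vma_mapped_walk()'s
locals pvmw and mm; check_huge_pmd() is a hypothetical helper, not the
actual mm/page_vma_mapped.c code.)

    pmd_t pmde = READ_ONCE(*pvmw->pmd);    /* unstable: only a hint */

    if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
        /* hint says huge: now pay for pmd_lock() and recheck */
        pvmw->ptl = pmd_lock(mm, pvmw->pmd);
        pmde = *pvmw->pmd;                 /* stable under the lock */
        if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde))
            return check_huge_pmd(pvmw, pmde);   /* hypothetical helper */
        /* raced with a split or zap: not huge after all */
        spin_unlock(pvmw->ptl);
        pvmw->ptl = NULL;
    }
    /* hint (or recheck) says not huge: walk the ptes without pmd_lock() */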
>
> There is some list of changes to the page table that require holding
> the mmap sem in write mode that I've never seen documented anywhere..
That is a subtler subject than I dare get into at the moment.
But if you're just doing lookups, pmd_lock() is enough.
I'll direct a further mail in this thread to Andrew now,
asking him to drop this patch, when convenient.
Hugh
On Fri, 11 Jun 2021, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> > On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > >
> > > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > > Cc: <[email protected]>
> > > > > mm/page_vma_mapped.c | 5 ++++-
> > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > > +++ b/mm/page_vma_mapped.c
> > > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > > > pud = pud_offset(p4d, pvmw->address);
> > > > > if (!pud_present(*pud))
> > > > > return false;
> > > > > +
> > > > > pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > > > /*
> > > > > * Make sure the pmd value isn't cached in a register by the
> > > > > * compiler and used as a stale value after we've observed a
> > > > > * subsequent update.
> > > > > */
> > > > > - pmde = READ_ONCE(*pvmw->pmd);
> > > > > + pmde = pmd_read_atomic(pvmw->pmd);
> > > > > + barrier();
> > > > > +
Andrew, please drop this patch from your queue: it's harmless,
but it pretends to do some good when it does not. The situation has
changed since I originally wrote it, gcc min version has been raised,
Will Deacon has corrected the applicability of READ_ONCE() to pmd_t in
April 2020 commits, pmd_read_atomic() is not as magic as I thought (it
has good uses but not here), and the commit comment is no longer right.
However... if you're nearing the point of a fresh mmotm, probably best
to leave it in for now and we sort out the mess later. Because although
it's functionally independent from the other patches in the series,
there is of course the "indentation" patch, and this falls in the middle
of what's indented a level there.
I don't imagine that will tax your abilities to their limit, but after
that there's the main bugfix patch, which expects a blank line I added
in this one, no longer there once this is reverted. Enough to make me
sigh, and just write to say "drop, when you have time".
Let me know if you'd prefer a resend of the series (today? not sure).
(Discussion of pmd_read_atomic() and barrier() and mm_find_pmd() and
suchlike may continue in this thread, but this 03/11 should just be
dropped. Whether it's in or out should not affect Alistair's series.)
Thanks,
Hugh
On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > index e896ebef8c24cb..0bf1fdec928e71 100644
> > +++ b/arch/x86/include/asm/pgtable-3level.h
> > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> > static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > {
> > pmdval_t ret;
> > - u32 *tmp = (u32 *)pmdp;
> > + u32 *tmp = READ_ONCE((u32 *)pmdp);
> >
> > ret = (pmdval_t) (*tmp);
> > if (ret) {
> > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > * or we can end up with a partial pmd.
> > */
> > smp_rmb();
> > - ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > + ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> > }
>
> Maybe that. Or maybe now (since Will's changes) it can just do
> one READ_ONCE() of the whole, then adjust its local copy.
I think the smp_rmb() is critical here to ensure a PTE table pointer
is coherent, READ_ONCE is not a substitute, unless I am
misunderstanding what Will's changes are???
> > @@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > * only going to work, if the pmdval_t isn't larger than
> > * an unsigned long.
> > */
> > - return *pmdp;
> > + return READ_ONCE(*pmdp);
> > }
> > #endif
>
> And it should now be possible to delete the #ifdef THP barrier() in
> function_with_long_name_I_didn't_look_up() which calls pmd_read_atomic().
Yes - I think I know what you mean :)
> > If I recall, (and I didn't recheck this right now) the only thing
> > pmd_read_atomic() provides is the special property that if the pmd's
> > flags are observed to point to a pte table then pmd_read_atomic() will
> > reliably return the pte table pointer.
>
> I expect you're right, I haven't rechecked. But it does also provide that
> special guarantee on matching pmd_none() when half matches pmd_none():
> which one or more callers want, but irrelevant where I added it.
Ah, this I don't know much about.. Hum, I'd have to think about it way
too much to have an opinion on whether the races around pmd_none are
meaningful enough to resolve with _atomic vs READ_ONCE()
> > As far as is page_vma_mapped_walk correct.. Everything except
> > is_pmd_migration_entry() is fine to my eye, and I simply don't know
> > the rules around migration entries to know right/wrong.
>
> So long as the swap "type" is entirely in the same word as the pte
> flags would be, I think is_pmd_migration_entry() should be fine.
> There it's just looking for a hint: is it worth taking pmd_lock()
> to obtain a reliable pmde?
I wonder if anyone knows whether the arches guarantee that?
> > I suspect it probably is required to manipulate a migration entry
> > while holding the mmap_sem in write mode??
>
> I don't think in terms of mmap_sem/mmap_lock here: this is on the
> rmap lookup path, and typically mmap_lock is not held (whereas I
> was surprised to find the comment above pmd_read_atomic() saying a
> lot about it - another reason it's inappropriate in pvmw I guess).
Ah.. Honestly I'm not familiar with all the ways to lock a VMA so that
the page tables it spans are guaranteed not to be freed.
I just saw the _offset() ladder without any page table spinlocks and
knew this was one of the lockless walkers that, somehow, relies on
another lock to ensure the page table level memory is not concurrently
freed.
In that case I take it back, I have no idea if this is correct or not
because it calls map_pte() which does pte_offset_map() based on that
READ_ONCE.
pte_offset_map() without holding the pmd_lock is only OK if you know
that the pmd can't be upgraded to a THP - and the only locking for
that I have memorized is the mmap lock :\
I can't tell what other lock is protecting the page tables here or if
that lock is held during the THP upgrade process.. Sorry
> > There is some list of changes to the page table that require
> > holding the mmap sem in write mode that I've never seen documented
> > anywhere..
>
> That is a subtler subject than I dare get into at the moment.
> But if you're just doing lookups, pmd_lock() is enough.
It is enough if you take it; I don't see page_vma_mapped_walk() taking
it in the map_pte() flow :\
Jason
On Fri, Jun 11, 2021 at 04:42:49PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 11, 2021 at 12:05:42PM -0700, Hugh Dickins wrote:
> > > diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> > > index e896ebef8c24cb..0bf1fdec928e71 100644
> > > +++ b/arch/x86/include/asm/pgtable-3level.h
> > > @@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
> > > static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > > {
> > > pmdval_t ret;
> > > - u32 *tmp = (u32 *)pmdp;
> > > + u32 *tmp = READ_ONCE((u32 *)pmdp);
> > >
> > > ret = (pmdval_t) (*tmp);
> > > if (ret) {
> > > @@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
> > > * or we can end up with a partial pmd.
> > > */
> > > smp_rmb();
> > > - ret |= ((pmdval_t)*(tmp + 1)) << 32;
> > > + ret |= READ_ONCE((pmdval_t)*(tmp + 1)) << 32;
> > > }
> >
> > Maybe that. Or maybe now (since Will's changes) it can just do
> > one READ_ONCE() of the whole, then adjust its local copy.
>
> I think the smp_rmb() is critical here to ensure a PTE table pointer
> is coherent, READ_ONCE is not a substitute, unless I am
> misunderstanding what Will's changes are???
Yes, I agree that the barrier is needed here for x86 PAE. I would really
have liked to enforce native-sized access in READ_ONCE(), but unfortunately
there is plenty of code out there which is resilient to a 64-bit access
being split into two separate 32-bit accesses and so I wasn't able to go
that far.
That being said, pmd_read_atomic() probably _should_ be using READ_ONCE()
because using it inconsistently can give rise to broken codegen, e.g. if
you do:
    pmdval_t x, y, z;

    x = *pmdp;              // Invalid
    y = READ_ONCE(*pmdp);   // Valid
    if (pmd_valid(y))
        z = *pmdp;          // Invalid again!
Then the compiler can allocate the same register for x and z, but will
issue an additional load for y. If a concurrent update takes place to the
pmd which transitions from Invalid -> Valid, then it will look as though
things went back in time, because z will be stale. We actually hit this
on arm64 in practice [1].
Will
[1] https://lore.kernel.org/lkml/[email protected]/
On Tue, Jun 15, 2021 at 10:46:39AM +0100, Will Deacon wrote:
> Then the compiler can allocate the same register for x and z, but will
> issue an additional load for y. If a concurrent update takes place to the
> pmd which transitions from Invalid -> Valid, then it will look as though
> things went back in time, because z will be stale. We actually hit this
> on arm64 in practice [1].
The fact you actually hit this in the real world just seems to confirm
my thinking that the mm's lax use of the memory model is something
that deserves addressing.
Honestly I'm not sure the fix to stick a READ_ONCE in the macros is
very robust. I prefer the gup_fast pattern of:
pmd_t pmd = READ_ONCE(*pmdp);
pte_offset_phys(&pmd, addr);
To correctly force the READ_ONCE under unlocked access and then
consistently use the single read of the unstable data.
It seems more maintainable ('hey look at me, I have no locks!') and has
fewer possibilities for obscure order-related bugs to creep in.
Jason
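(A slightly fuller sketch of the pattern Jason describes, loosely modelled
on the gup_fast style: read the pmd exactly once, then derive everything
from the local copy. The function and its arguments are illustrative, not
the actual mm/gup.c code.)

    static bool walk_one_pmd(pmd_t *pmdp, unsigned long addr)
    {
        pmd_t pmd = READ_ONCE(*pmdp);    /* the only read of *pmdp */
        pte_t *pte;

        if (pmd_none(pmd) || !pmd_present(pmd))
            return false;
        if (pmd_trans_huge(pmd))
            return false;    /* caller handles the huge case under pmd_lock() */

        /* every later step derives from the local copy, never *pmdp again */
        pte = pte_offset_map(&pmd, addr);
        /* ... examine *pte under whatever lock the caller relies on ... */
        pte_unmap(pte);
        return true;
    }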
On Tue, Jun 15, 2021 at 09:42:07PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 15, 2021 at 10:46:39AM +0100, Will Deacon wrote:
>
> > Then the compiler can allocate the same register for x and z, but will
> > issue an additional load for y. If a concurrent update takes place to the
> > pmd which transitions from Invalid -> Valid, then it will look as though
> > things went back in time, because z will be stale. We actually hit this
> > on arm64 in practice [1].
>
> The fact you actually hit this in the real world just seems to confirm
> my thinking that the mm's lax use of the memory model is something
> that deserves addressing.
>
> Honestly I'm not sure the fix to stick a READ_ONCE in the macros is
> very robust. I prefer the gup_fast pattern of:
>
> pmd_t pmd = READ_ONCE(*pmdp);
> pte_offset_phys(&pmd, addr);
>
> To correctly force the READ_ONCE under unlocked access and then
> consistently use the single read of the unstable data.
>
> It seems more maintainable ('hey look at me, I have no locks!') and has
> fewer possibilities for obscure order-related bugs to creep in.
Oh, no objection to cleaning this up. It was an "issuing msync(2) causes
data loss argh!" issue, so adding READ_ONCE() to all the macros was the
most straightforward way to solve the immediate problem.
Generally speaking, I think all accesses to live page-tables should be
using READ_ONCE(), as there's also hardware updates from the CPU table
walker to contend with. If that's done in the caller and the macros are
changed to operate on the loaded value, all the better (although this
probably doesn't work so well once you get into rmw operations such as
ptep_test_and_clear_young()).
Will
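(For illustration of that last point, a small sketch: a lookup can be
satisfied from a local READ_ONCE() copy, but a read-modify-write helper
has to be handed the live entry, so it cannot take the "loaded value"
form. vma, addr and ptep are assumed from the surrounding walk.)

    pte_t pte = READ_ONCE(*ptep);    /* fine for inspection */
    int young = 0;

    if (pte_present(pte) && pte_young(pte))
        /* the rmw must act on *ptep itself, not on the local copy */
        young = ptep_test_and_clear_young(vma, addr, ptep);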