2024-06-13 00:07:57

by Barry Song

Subject: [PATCH RFC 0/3] mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()

From: Barry Song <[email protected]>

This series was originally suggested by David while we were trying to
weaken the WARN_ON in __folio_add_anon_rmap() as part of bringing up
mTHP swapin [1]. The patchset is also preparatory work for mTHP
swapin.

folio_add_new_anon_rmap() assumes that new anon rmaps are always
exclusive. However, this assumption doesn’t hold for cases like
do_swap_page(), where a new anon folio might be added to the swapcache
and is therefore not necessarily exclusive.

The patchset extends folio_add_new_anon_rmap() with an rmap flags
argument so that it can handle both exclusive and non-exclusive new
anon folios. do_swap_page() is updated to use this extended API.
Consequently, all new anon folios now consistently use
folio_add_new_anon_rmap(), and the special case for !folio_test_anon()
in __folio_add_anon_rmap() can be safely removed.

In conclusion, new anon folios always use folio_add_new_anon_rmap(),
regardless of exclusivity. Old anon folios continue to use
__folio_add_anon_rmap() via folio_add_anon_rmap_pmd() and
folio_add_anon_rmap_ptes().
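
For illustration, a simplified sketch of the resulting rmap logic in
do_swap_page() after patches 1 and 2 (the exact context is in the
patches below):

        /* ksm created a completely new copy */
        if (unlikely(folio != swapcache && swapcache)) {
                folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
                folio_add_lru_vma(folio, vma);
        } else if (!folio_test_anon(folio)) {
                /* folio is new to the anon rmap, e.g. freshly swapped in */
                folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
        } else {
                folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
                                         rmap_flags);
        }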

[1] https://lore.kernel.org/linux-mm/[email protected]/

Barry Song (3):
mm: extend rmap flags arguments for folio_add_new_anon_rmap
mm: do_swap_page: use folio_add_new_anon_rmap() if
folio_test_anon(folio)==false
mm: remove folio_test_anon(folio)==false path in
__folio_add_anon_rmap()

include/linux/rmap.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memory.c | 12 +++++++-----
mm/migrate_device.c | 2 +-
mm/rmap.c | 32 ++++++++++++--------------------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
9 files changed, 26 insertions(+), 32 deletions(-)

--
2.34.1



2024-06-13 00:08:11

by Barry Song

Subject: [PATCH RFC 1/3] mm: extend rmap flags arguments for folio_add_new_anon_rmap

From: Barry Song <[email protected]>

In the case of do_swap_page(), a new anonymous folio isn’t necessarily
exclusive. This patch extends the rmap flags so that a new anon folio
can be treated as either exclusive or non-exclusive. To maintain the
current behavior, all existing callers keep passing RMAP_EXCLUSIVE.
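
For example (illustrative only; RMAP_NONE and RMAP_EXCLUSIVE are the
existing rmap_t flag values):

        /* fully exclusive new anon folio: what all existing callers pass */
        folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);

        /* new anon folio that may be shared, e.g. still in the swapcache */
        folio_add_new_anon_rmap(folio, vma, addr, RMAP_NONE);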

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
include/linux/rmap.h | 2 +-
kernel/events/uprobes.c | 2 +-
mm/huge_memory.c | 2 +-
mm/khugepaged.c | 2 +-
mm/memory.c | 10 +++++-----
mm/migrate_device.c | 2 +-
mm/rmap.c | 15 +++++++++------
mm/swapfile.c | 2 +-
mm/userfaultfd.c | 2 +-
9 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index cae38a2a643d..01a64e7e72b9 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -244,7 +244,7 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
void folio_add_anon_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *, unsigned long address, rmap_t flags);
void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
- unsigned long address);
+ unsigned long address, rmap_t flags);
void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
struct vm_area_struct *);
#define folio_add_file_rmap_pte(folio, page, vma) \
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..c20368aa33dd 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -181,7 +181,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

if (new_page) {
folio_get(new_folio);
- folio_add_new_anon_rmap(new_folio, vma, addr);
+ folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(new_folio, vma);
} else
/* no new page, just dec_mm_counter for old_page */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f409ea9fcc18..09a83e43c71a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -973,7 +973,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,

entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
- folio_add_new_anon_rmap(folio, vma, haddr);
+ folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 774a97e6e2da..4d759a7487d0 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1213,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,

spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
- folio_add_new_anon_rmap(folio, vma, address);
+ folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
diff --git a/mm/memory.c b/mm/memory.c
index 54d7d2acdf39..2f94921091fb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,7 +930,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
*prealloc = NULL;
copy_user_highpage(&new_folio->page, page, addr, src_vma);
__folio_mark_uptodate(new_folio);
- folio_add_new_anon_rmap(new_folio, dst_vma, addr);
+ folio_add_new_anon_rmap(new_folio, dst_vma, addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(new_folio, dst_vma);
rss[MM_ANONPAGES]++;

@@ -3400,7 +3400,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* some TLBs while the old PTE remains in others.
*/
ptep_clear_flush(vma, vmf->address, vmf->pte);
- folio_add_new_anon_rmap(new_folio, vma, vmf->address);
+ folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE);
folio_add_lru_vma(new_folio, vma);
BUG_ON(unshare && pte_write(entry));
set_pte_at(mm, vmf->address, vmf->pte, entry);
@@ -4337,7 +4337,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)

/* ksm created a completely new copy */
if (unlikely(folio != swapcache && swapcache)) {
- folio_add_new_anon_rmap(folio, vma, address);
+ folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -4592,7 +4592,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
#endif
- folio_add_new_anon_rmap(folio, vma, addr);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
setpte:
if (vmf_orig_pte_uffd_wp(vmf))
@@ -4790,7 +4790,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
VM_BUG_ON_FOLIO(nr != 1, folio);
- folio_add_new_anon_rmap(folio, vma, addr);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
} else {
folio_add_file_rmap_ptes(folio, page, nr, vma);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 051d0a3ccbee..6d66dc1c6ffa 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -658,7 +658,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
goto unlock_abort;

inc_mm_counter(mm, MM_ANONPAGES);
- folio_add_new_anon_rmap(folio, vma, addr);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
if (!folio_is_zone_device(folio))
folio_add_lru_vma(folio, vma);
folio_get(folio);
diff --git a/mm/rmap.c b/mm/rmap.c
index b9e5943c8349..e612d999811a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1406,14 +1406,14 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
* This means the inc-and-test can be bypassed.
* The folio does not have to be locked.
*
- * If the folio is pmd-mappable, it is accounted as a THP. As the folio
- * is new, it's assumed to be mapped exclusively by a single process.
+ * If the folio is pmd-mappable, it is accounted as a THP.
*/
void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
- unsigned long address)
+ unsigned long address, rmap_t flags)
{
int nr = folio_nr_pages(folio);
int nr_pmdmapped = 0;
+ bool exclusive = flags & RMAP_EXCLUSIVE;

VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
VM_BUG_ON_VMA(address < vma->vm_start ||
@@ -1424,7 +1424,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
if (likely(!folio_test_large(folio))) {
/* increment count (starts at -1) */
atomic_set(&folio->_mapcount, 0);
- SetPageAnonExclusive(&folio->page);
+ if (exclusive)
+ SetPageAnonExclusive(&folio->page);
} else if (!folio_test_pmd_mappable(folio)) {
int i;

@@ -1433,7 +1434,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,

/* increment count (starts at -1) */
atomic_set(&page->_mapcount, 0);
- SetPageAnonExclusive(page);
+ if (exclusive)
+ SetPageAnonExclusive(page);
}

/* increment count (starts at -1) */
@@ -1445,7 +1447,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
/* increment count (starts at -1) */
atomic_set(&folio->_large_mapcount, 0);
atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED);
- SetPageAnonExclusive(&folio->page);
+ if (exclusive)
+ SetPageAnonExclusive(&folio->page);
nr_pmdmapped = nr;
}

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9c6d8e557c0f..ae1d2700f6a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1911,7 +1911,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,

folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
} else { /* ksm created a completely new copy */
- folio_add_new_anon_rmap(folio, vma, addr);
+ folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
}
new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 5e7f2801698a..8dedaec00486 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -216,7 +216,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
folio_add_lru(folio);
folio_add_file_rmap_pte(folio, page, dst_vma);
} else {
- folio_add_new_anon_rmap(folio, dst_vma, dst_addr);
+ folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, dst_vma);
}

--
2.34.1


2024-06-13 00:08:25

by Barry Song

Subject: [PATCH RFC 2/3] mm: do_swap_page: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false

From: Barry Song <[email protected]>

For the !folio_test_anon(folio) case, we can now invoke folio_add_new_anon_rmap()
with rmap flags that mark the folio as either exclusive or non-exclusive. This
avoids tripping the VM_WARN_ON_FOLIO check in __folio_add_anon_rmap() once mTHP
swapin is brought up.

static __always_inline void __folio_add_anon_rmap(struct folio *folio,
                struct page *page, int nr_pages, struct vm_area_struct *vma,
                unsigned long address, rmap_t flags, enum rmap_level level)
{
        ...
        if (unlikely(!folio_test_anon(folio))) {
                VM_WARN_ON_FOLIO(folio_test_large(folio) &&
                                 level != RMAP_LEVEL_PMD, folio);
        }
        ...
}

It also enhances the code’s readability.

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/memory.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index 2f94921091fb..9c962f62f928 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4339,6 +4339,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache && swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
+ } else if (!folio_test_anon(folio)) {
+ folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
rmap_flags);
--
2.34.1


2024-06-13 00:08:39

by Barry Song

Subject: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

From: Barry Song <[email protected]>

The folio_test_anon(folio)==false case within do_swap_page() has been
relocated to folio_add_new_anon_rmap(). Additionally, two other callers
consistently pass anonymous folios.

stack 1:
remove_migration_pmd
-> folio_add_anon_rmap_pmd
-> __folio_add_anon_rmap

stack 2:
__split_huge_pmd_locked
-> folio_add_anon_rmap_ptes
-> __folio_add_anon_rmap

__folio_add_anon_rmap() now only needs to handle the
folio_test_anon(folio)==true case.

Suggested-by: David Hildenbrand <[email protected]>
Signed-off-by: Barry Song <[email protected]>
---
mm/rmap.c | 17 +++--------------
1 file changed, 3 insertions(+), 14 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index e612d999811a..e84c706c8241 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1299,21 +1299,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,

nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped);

- if (unlikely(!folio_test_anon(folio))) {
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
- /*
- * For a PTE-mapped large folio, we only know that the single
- * PTE is exclusive. Further, __folio_set_anon() might not get
- * folio->index right when not given the address of the head
- * page.
- */
- VM_WARN_ON_FOLIO(folio_test_large(folio) &&
- level != RMAP_LEVEL_PMD, folio);
- __folio_set_anon(folio, vma, address,
- !!(flags & RMAP_EXCLUSIVE));
- } else if (likely(!folio_test_ksm(folio))) {
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+
+ if (likely(!folio_test_ksm(folio)))
__page_check_anon_rmap(folio, page, vma, address);
- }

__folio_mod_stat(folio, nr, nr_pmdmapped);

--
2.34.1


2024-06-13 08:43:07

by Barry Song

Subject: Re: [PATCH RFC 1/3] mm: extend rmap flags arguments for folio_add_new_anon_rmap

On Thu, Jun 13, 2024 at 12:07 PM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> In the case of do_swap_page(), a new anonymous folio isn’t necessarily
> exclusive. This patch extends the rmap flags to allow treating a new
> anon folio as either exclusive or non-exclusive. To maintain the current
> behavior, we always use EXCLUSIVE as arguments.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> include/linux/rmap.h | 2 +-
> kernel/events/uprobes.c | 2 +-
> mm/huge_memory.c | 2 +-
> mm/khugepaged.c | 2 +-
> mm/memory.c | 10 +++++-----
> mm/migrate_device.c | 2 +-
> mm/rmap.c | 15 +++++++++------
> mm/swapfile.c | 2 +-
> mm/userfaultfd.c | 2 +-
> 9 files changed, 21 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index cae38a2a643d..01a64e7e72b9 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -244,7 +244,7 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
> void folio_add_anon_rmap_pmd(struct folio *, struct page *,
> struct vm_area_struct *, unsigned long address, rmap_t flags);
> void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
> - unsigned long address);
> + unsigned long address, rmap_t flags);
> void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> struct vm_area_struct *);
> #define folio_add_file_rmap_pte(folio, page, vma) \
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..c20368aa33dd 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -181,7 +181,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
>
> if (new_page) {
> folio_get(new_folio);
> - folio_add_new_anon_rmap(new_folio, vma, addr);
> + folio_add_new_anon_rmap(new_folio, vma, addr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(new_folio, vma);
> } else
> /* no new page, just dec_mm_counter for old_page */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f409ea9fcc18..09a83e43c71a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -973,7 +973,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
>
> entry = mk_huge_pmd(page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> - folio_add_new_anon_rmap(folio, vma, haddr);
> + folio_add_new_anon_rmap(folio, vma, haddr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 774a97e6e2da..4d759a7487d0 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1213,7 +1213,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> spin_lock(pmd_ptl);
> BUG_ON(!pmd_none(*pmd));
> - folio_add_new_anon_rmap(folio, vma, address);
> + folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> pgtable_trans_huge_deposit(mm, pmd, pgtable);
> set_pmd_at(mm, address, pmd, _pmd);
> diff --git a/mm/memory.c b/mm/memory.c
> index 54d7d2acdf39..2f94921091fb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -930,7 +930,7 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
> *prealloc = NULL;
> copy_user_highpage(&new_folio->page, page, addr, src_vma);
> __folio_mark_uptodate(new_folio);
> - folio_add_new_anon_rmap(new_folio, dst_vma, addr);
> + folio_add_new_anon_rmap(new_folio, dst_vma, addr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(new_folio, dst_vma);
> rss[MM_ANONPAGES]++;
>
> @@ -3400,7 +3400,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
> * some TLBs while the old PTE remains in others.
> */
> ptep_clear_flush(vma, vmf->address, vmf->pte);
> - folio_add_new_anon_rmap(new_folio, vma, vmf->address);
> + folio_add_new_anon_rmap(new_folio, vma, vmf->address, RMAP_EXCLUSIVE);
> folio_add_lru_vma(new_folio, vma);
> BUG_ON(unshare && pte_write(entry));
> set_pte_at(mm, vmf->address, vmf->pte, entry);
> @@ -4337,7 +4337,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
> /* ksm created a completely new copy */
> if (unlikely(folio != swapcache && swapcache)) {
> - folio_add_new_anon_rmap(folio, vma, address);
> + folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> } else {
> folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> @@ -4592,7 +4592,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> count_mthp_stat(folio_order(folio), MTHP_STAT_ANON_FAULT_ALLOC);
> #endif
> - folio_add_new_anon_rmap(folio, vma, addr);
> + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> setpte:
> if (vmf_orig_pte_uffd_wp(vmf))
> @@ -4790,7 +4790,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
> /* copy-on-write page */
> if (write && !(vma->vm_flags & VM_SHARED)) {
> VM_BUG_ON_FOLIO(nr != 1, folio);
> - folio_add_new_anon_rmap(folio, vma, addr);
> + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> } else {
> folio_add_file_rmap_ptes(folio, page, nr, vma);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 051d0a3ccbee..6d66dc1c6ffa 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -658,7 +658,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> goto unlock_abort;
>
> inc_mm_counter(mm, MM_ANONPAGES);
> - folio_add_new_anon_rmap(folio, vma, addr);
> + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> if (!folio_is_zone_device(folio))
> folio_add_lru_vma(folio, vma);
> folio_get(folio);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b9e5943c8349..e612d999811a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1406,14 +1406,14 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
> * This means the inc-and-test can be bypassed.
> * The folio does not have to be locked.
> *
> - * If the folio is pmd-mappable, it is accounted as a THP. As the folio
> - * is new, it's assumed to be mapped exclusively by a single process.
> + * If the folio is pmd-mappable, it is accounted as a THP.
> */
> void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> - unsigned long address)
> + unsigned long address, rmap_t flags)
> {
> int nr = folio_nr_pages(folio);
> int nr_pmdmapped = 0;
> + bool exclusive = flags & RMAP_EXCLUSIVE;
>
> VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
> VM_BUG_ON_VMA(address < vma->vm_start ||
> @@ -1424,7 +1424,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> if (likely(!folio_test_large(folio))) {
> /* increment count (starts at -1) */
> atomic_set(&folio->_mapcount, 0);
> - SetPageAnonExclusive(&folio->page);
> + if (exclusive)
> + SetPageAnonExclusive(&folio->page);
> } else if (!folio_test_pmd_mappable(folio)) {
> int i;
>
> @@ -1433,7 +1434,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>
> /* increment count (starts at -1) */
> atomic_set(&page->_mapcount, 0);
> - SetPageAnonExclusive(page);
> + if (exclusive)
> + SetPageAnonExclusive(page);
> }
>
> /* increment count (starts at -1) */
> @@ -1445,7 +1447,8 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> /* increment count (starts at -1) */
> atomic_set(&folio->_large_mapcount, 0);
> atomic_set(&folio->_nr_pages_mapped, ENTIRELY_MAPPED);
> - SetPageAnonExclusive(&folio->page);
> + if (exclusive)
> + SetPageAnonExclusive(&folio->page);
> nr_pmdmapped = nr;
> }
>

I am missing this hunk:

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1408,7 +1408,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
VM_BUG_ON_VMA(address < vma->vm_start ||
address + (nr << PAGE_SHIFT) > vma->vm_end, vma);
__folio_set_swapbacked(folio);
- __folio_set_anon(folio, vma, address, true);
+ __folio_set_anon(folio, vma, address, exclusive);

if (likely(!folio_test_large(folio))) {
/* increment count (starts at -1) */


> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9c6d8e557c0f..ae1d2700f6a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1911,7 +1911,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>
> folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
> } else { /* ksm created a completely new copy */
> - folio_add_new_anon_rmap(folio, vma, addr);
> + folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> }
> new_pte = pte_mkold(mk_pte(page, vma->vm_page_prot));
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 5e7f2801698a..8dedaec00486 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -216,7 +216,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> folio_add_lru(folio);
> folio_add_file_rmap_pte(folio, page, dst_vma);
> } else {
> - folio_add_new_anon_rmap(folio, dst_vma, dst_addr);
> + folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, dst_vma);
> }
>
> --
> 2.34.1
>

2024-06-13 08:47:33

by Barry Song

Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
>
> From: Barry Song <[email protected]>
>
> The folio_test_anon(folio)==false case within do_swap_page() has been
> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> consistently pass anonymous folios.
>
> stack 1:
> remove_migration_pmd
> -> folio_add_anon_rmap_pmd
> -> __folio_add_anon_rmap
>
> stack 2:
> __split_huge_pmd_locked
> -> folio_add_anon_rmap_ptes
> -> __folio_add_anon_rmap
>
> __folio_add_anon_rmap() only needs to handle the cases
> folio_test_anon(folio)==true now.

My team reported a case where swapoff() is calling
folio_add_anon_rmap_pte(), *not* folio_add_anon_rmap_ptes(),
with a new anon folio (!folio_test_anon(folio)).

I will double check all folio_add_anon_rmap_pte() cases.

>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/rmap.c | 17 +++--------------
> 1 file changed, 3 insertions(+), 14 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index e612d999811a..e84c706c8241 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1299,21 +1299,10 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>
> nr = __folio_add_rmap(folio, page, nr_pages, level, &nr_pmdmapped);
>
> - if (unlikely(!folio_test_anon(folio))) {
> - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> - /*
> - * For a PTE-mapped large folio, we only know that the single
> - * PTE is exclusive. Further, __folio_set_anon() might not get
> - * folio->index right when not given the address of the head
> - * page.
> - */
> - VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> - level != RMAP_LEVEL_PMD, folio);
> - __folio_set_anon(folio, vma, address,
> - !!(flags & RMAP_EXCLUSIVE));
> - } else if (likely(!folio_test_ksm(folio))) {
> + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +
> + if (likely(!folio_test_ksm(folio)))
> __page_check_anon_rmap(folio, page, vma, address);
> - }
>
> __folio_mod_stat(folio, nr, nr_pmdmapped);
>
> --
> 2.34.1
>

2024-06-13 08:52:18

by David Hildenbrand

Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On 13.06.24 10:46, Barry Song wrote:
> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
>>
>> From: Barry Song <[email protected]>
>>
>> The folio_test_anon(folio)==false case within do_swap_page() has been
>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
>> consistently pass anonymous folios.
>>
>> stack 1:
>> remove_migration_pmd
>> -> folio_add_anon_rmap_pmd
>> -> __folio_add_anon_rmap
>>
>> stack 2:
>> __split_huge_pmd_locked
>> -> folio_add_anon_rmap_ptes
>> -> __folio_add_anon_rmap
>>
>> __folio_add_anon_rmap() only needs to handle the cases
>> folio_test_anon(folio)==true now.
>
> My team reported a case where swapoff() is calling
> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> with one new anon (!folio_test_anon(folio)).
>
> I will double check all folio_add_anon_rmap_pte() cases.

Right, swapoff() path is a bit special (e.g., we don't do any kind of
batching on the swapoff() path).

But we should never get a new large anon folio there, or could that now
happen?

--
Cheers,

David / dhildenb


2024-06-13 09:06:45

by Barry Song

Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.06.24 10:46, Barry Song wrote:
> > On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
> >>
> >> From: Barry Song <[email protected]>
> >>
> >> The folio_test_anon(folio)==false case within do_swap_page() has been
> >> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> >> consistently pass anonymous folios.
> >>
> >> stack 1:
> >> remove_migration_pmd
> >> -> folio_add_anon_rmap_pmd
> >> -> __folio_add_anon_rmap
> >>
> >> stack 2:
> >> __split_huge_pmd_locked
> >> -> folio_add_anon_rmap_ptes
> >> -> __folio_add_anon_rmap
> >>
> >> __folio_add_anon_rmap() only needs to handle the cases
> >> folio_test_anon(folio)==true now.
> >
> > My team reported a case where swapoff() is calling
> > folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> > with one new anon (!folio_test_anon(folio)).
> >
> > I will double check all folio_add_anon_rmap_pte() cases.
>
> Right, swapoff() path is a bit special (e.g., we don't do any kind of
> batching on the swapoff() path).
>
> But we should never get a new large anon folio there, or could that now
> happen?

My team encountered the following issue while testing this RFC
series on real hardware. Let me take a look to identify the
problem.

[ 261.214195][T11285] page:000000004cdd779e refcount:4 mapcount:1
mapping:00000000ba142c22 index:0x1 pfn:0x1b30a6
[ 261.214213][T11285] memcg:ffffff8003678000
[ 261.214217][T11285] aops:swap_aops
[ 261.214233][T11285] flags:
0x2000000000081009(locked|uptodate|owner_priv_1|swapbacked|zone=1|kasantag=0x0)
[ 261.214241][T11285] page_type: 0x0()
[ 261.214246][T11285] raw: 2000000000081009 0000000000000000
dead000000000122 0000000000000000
[ 261.214251][T11285] raw: 0000000000000001 00000000000d84b3
0000000400000000 ffffff8003678000
[ 261.214254][T11285] page dumped because:
VM_WARN_ON_FOLIO(!folio_test_anon(folio))
[ 261.214257][T11285] page_owner tracks the page as allocated
[ 261.214260][T11285] page last allocated via order 0, migratetype
Movable, gfp_mask 0x2140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), pid
11285, tgid 11285 (swapoff), ts 261214177545, free_ts 261151875699
[ 261.214268][T11285] post_alloc_hook+0x1b8/0x1c0
[ 261.214284][T11285] prep_new_page+0x28/0x13c
[ 261.214291][T11285] get_page_from_freelist+0x198c/0x1aa4
[ 261.214298][T11285] __alloc_pages+0x15c/0x330
[ 261.214304][T11285] __folio_alloc+0x1c/0x4c
[ 261.214310][T11285] __read_swap_cache_async+0xd8/0x48c
[ 261.214320][T11285] swap_cluster_readahead+0x158/0x324
[ 261.214326][T11285] swapin_readahead+0x64/0x448
[ 261.214331][T11285] __arm64_sys_swapoff+0x6ec/0x14b0
[ 261.214337][T11285] invoke_syscall+0x58/0x114
[ 261.214353][T11285] el0_svc_common+0xac/0xe0
[ 261.214360][T11285] do_el0_svc+0x1c/0x28
[ 261.214366][T11285] el0_svc+0x38/0x68
[ 261.214372][T11285] el0t_64_sync_handler+0x68/0xbc
[ 261.214376][T11285] el0t_64_sync+0x1a8/0x1ac
[ 261.214381][T11285] page last free pid 90 tgid 90 stack trace:
[ 261.214386][T11285] free_unref_page_prepare+0x338/0x374
[ 261.214395][T11285] free_unref_page_list+0x84/0x378
[ 261.214400][T11285] shrink_folio_list+0x1234/0x13e4
[ 261.214409][T11285] evict_folios+0x1458/0x19b4
[ 261.214417][T11285] try_to_shrink_lruvec+0x1c8/0x264
[ 261.214422][T11285] shrink_one+0xa8/0x234
[ 261.214427][T11285] shrink_node+0xb38/0xde0
[ 261.214432][T11285] balance_pgdat+0x7a4/0xdb4
[ 261.214437][T11285] kswapd+0x290/0x4e4
[ 261.214442][T11285] kthread+0x114/0x1bc
[ 261.214459][T11285] ret_from_fork+0x10/0x20
[ 261.214477][T11285] ------------[ cut here ]------------
[ 261.214480][T11285] WARNING: CPU: 3 PID: 11285 at mm/rmap.c:1305
folio_add_anon_rmap_ptes+0x2b4/0x330

[ 261.215403][T11285] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
-SSBS BTYPE=--)
[ 261.215423][T11285] pc : folio_add_anon_rmap_ptes+0x2b4/0x330
[ 261.215428][T11285] lr : folio_add_anon_rmap_ptes+0x2b4/0x330
[ 261.215433][T11285] sp : ffffffc0a7dbbbf0
[ 261.215437][T11285] x29: ffffffc0a7dbbbf0 x28: ffffff8040860a08
x27: ffffff80db480000
[ 261.215445][T11285] x26: fffffffe04cc2980 x25: ffffff808f77f120
x24: 0000007b44941000
[ 261.215452][T11285] x23: 0000000000000001 x22: 0000000000000001
x21: fffffffe04cc2980
[ 261.215459][T11285] x20: ffffff80db480000 x19: fffffffe04cc2980
x18: ffffffed011dae80
[ 261.215465][T11285] x17: 0000000000000001 x16: ffffffffffffffff
x15: 0000000000000004
[ 261.215471][T11285] x14: ffffff82fb73fac0 x13: 0000000000000003
x12: 0000000000000003
[ 261.215476][T11285] x11: 00000000fffeffff x10: c0000000fffeffff x9
: 0d46c0889b468e00
[ 261.215483][T11285] x8 : 0d46c0889b468e00 x7 : 545b5d3935343431 x6
: 322e31363220205b
[ 261.215489][T11285] x5 : ffffffed013de407 x4 : ffffffed00698967 x3
: 0000000000000000
[ 261.215495][T11285] x2 : 0000000000000000 x1 : ffffffc0a7dbb8c0 x0
: ffffff8068c15c80
[ 261.215501][T11285] Call trace:
[ 261.215504][T11285] folio_add_anon_rmap_ptes+0x2b4/0x330
[ 261.215509][T11285] __arm64_sys_swapoff+0x8cc/0x14b0
[ 261.215516][T11285] invoke_syscall+0x58/0x114
[ 261.215526][T11285] el0_svc_common+0xac/0xe0
[ 261.215532][T11285] do_el0_svc+0x1c/0x28
[ 261.215539][T11285] el0_svc+0x38/0x68
[ 261.215544][T11285] el0t_64_sync_handler+0x68/0xbc
[ 261.215548][T11285] el0t_64_sync+0x1a8/0x1ac
[ 261.215552][T11285] ---[ end trace 0000000000000000 ]---


>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-06-13 09:12:29

by David Hildenbrand

Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On 13.06.24 11:06, Barry Song wrote:
> On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 13.06.24 10:46, Barry Song wrote:
>>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
>>>>
>>>> From: Barry Song <[email protected]>
>>>>
>>>> The folio_test_anon(folio)==false case within do_swap_page() has been
>>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
>>>> consistently pass anonymous folios.
>>>>
>>>> stack 1:
>>>> remove_migration_pmd
>>>> -> folio_add_anon_rmap_pmd
>>>> -> __folio_add_anon_rmap
>>>>
>>>> stack 2:
>>>> __split_huge_pmd_locked
>>>> -> folio_add_anon_rmap_ptes
>>>> -> __folio_add_anon_rmap
>>>>
>>>> __folio_add_anon_rmap() only needs to handle the cases
>>>> folio_test_anon(folio)==true now.
>>>
>>> My team reported a case where swapoff() is calling
>>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
>>> with one new anon (!folio_test_anon(folio)).
>>>
>>> I will double check all folio_add_anon_rmap_pte() cases.
>>
>> Right, swapoff() path is a bit special (e.g., we don't do any kind of
>> batching on the swapoff() path).
>>
>> But we should never get a new large anon folio there, or could that now
>> happen?
>
> My team encountered the following issue while testing this RFC
> series on real hardware. Let me take a look to identify the
> problem.
>
> [ 261.214195][T11285] page:000000004cdd779e refcount:4 mapcount:1
> mapping:00000000ba142c22 index:0x1 pfn:0x1b30a6
> [ 261.214213][T11285] memcg:ffffff8003678000
> [ 261.214217][T11285] aops:swap_aops
> [ 261.214233][T11285] flags:
> 0x2000000000081009(locked|uptodate|owner_priv_1|swapbacked|zone=1|kasantag=0x0)
> [ 261.214241][T11285] page_type: 0x0()
> [ 261.214246][T11285] raw: 2000000000081009 0000000000000000
> dead000000000122 0000000000000000
> [ 261.214251][T11285] raw: 0000000000000001 00000000000d84b3
> 0000000400000000 ffffff8003678000
> [ 261.214254][T11285] page dumped because:
> VM_WARN_ON_FOLIO(!folio_test_anon(folio))
> [ 261.214257][T11285] page_owner tracks the page as allocated
> [ 261.214260][T11285] page last allocated via order 0, migratetype
> Movable, gfp_mask 0x2140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), pid
> 11285, tgid 11285 (swapoff), ts 261214177545, free_ts 261151875699
> [ 261.214268][T11285] post_alloc_hook+0x1b8/0x1c0
> [ 261.214284][T11285] prep_new_page+0x28/0x13c
> [ 261.214291][T11285] get_page_from_freelist+0x198c/0x1aa4
> [ 261.214298][T11285] __alloc_pages+0x15c/0x330
> [ 261.214304][T11285] __folio_alloc+0x1c/0x4c
> [ 261.214310][T11285] __read_swap_cache_async+0xd8/0x48c
> [ 261.214320][T11285] swap_cluster_readahead+0x158/0x324
> [ 261.214326][T11285] swapin_readahead+0x64/0x448
> [ 261.214331][T11285] __arm64_sys_swapoff+0x6ec/0x14b0
> [ 261.214337][T11285] invoke_syscall+0x58/0x114
> [ 261.214353][T11285] el0_svc_common+0xac/0xe0
> [ 261.214360][T11285] do_el0_svc+0x1c/0x28
> [ 261.214366][T11285] el0_svc+0x38/0x68
> [ 261.214372][T11285] el0t_64_sync_handler+0x68/0xbc
> [ 261.214376][T11285] el0t_64_sync+0x1a8/0x1ac
> [ 261.214381][T11285] page last free pid 90 tgid 90 stack trace:
> [ 261.214386][T11285] free_unref_page_prepare+0x338/0x374
> [ 261.214395][T11285] free_unref_page_list+0x84/0x378
> [ 261.214400][T11285] shrink_folio_list+0x1234/0x13e4
> [ 261.214409][T11285] evict_folios+0x1458/0x19b4
> [ 261.214417][T11285] try_to_shrink_lruvec+0x1c8/0x264
> [ 261.214422][T11285] shrink_one+0xa8/0x234
> [ 261.214427][T11285] shrink_node+0xb38/0xde0
> [ 261.214432][T11285] balance_pgdat+0x7a4/0xdb4
> [ 261.214437][T11285] kswapd+0x290/0x4e4
> [ 261.214442][T11285] kthread+0x114/0x1bc
> [ 261.214459][T11285] ret_from_fork+0x10/0x20
> [ 261.214477][T11285] ------------[ cut here ]------------
> [ 261.214480][T11285] WARNING: CPU: 3 PID: 11285 at mm/rmap.c:1305
> folio_add_anon_rmap_ptes+0x2b4/0x330
>
> [ 261.215403][T11285] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
> -SSBS BTYPE=--)
> [ 261.215423][T11285] pc : folio_add_anon_rmap_ptes+0x2b4/0x330
> [ 261.215428][T11285] lr : folio_add_anon_rmap_ptes+0x2b4/0x330
> [ 261.215433][T11285] sp : ffffffc0a7dbbbf0
> [ 261.215437][T11285] x29: ffffffc0a7dbbbf0 x28: ffffff8040860a08
> x27: ffffff80db480000
> [ 261.215445][T11285] x26: fffffffe04cc2980 x25: ffffff808f77f120
> x24: 0000007b44941000
> [ 261.215452][T11285] x23: 0000000000000001 x22: 0000000000000001
> x21: fffffffe04cc2980
> [ 261.215459][T11285] x20: ffffff80db480000 x19: fffffffe04cc2980
> x18: ffffffed011dae80
> [ 261.215465][T11285] x17: 0000000000000001 x16: ffffffffffffffff
> x15: 0000000000000004
> [ 261.215471][T11285] x14: ffffff82fb73fac0 x13: 0000000000000003
> x12: 0000000000000003
> [ 261.215476][T11285] x11: 00000000fffeffff x10: c0000000fffeffff x9
> : 0d46c0889b468e00
> [ 261.215483][T11285] x8 : 0d46c0889b468e00 x7 : 545b5d3935343431 x6
> : 322e31363220205b
> [ 261.215489][T11285] x5 : ffffffed013de407 x4 : ffffffed00698967 x3
> : 0000000000000000
> [ 261.215495][T11285] x2 : 0000000000000000 x1 : ffffffc0a7dbb8c0 x0
> : ffffff8068c15c80
> [ 261.215501][T11285] Call trace:
> [ 261.215504][T11285] folio_add_anon_rmap_ptes+0x2b4/0x330
> [ 261.215509][T11285] __arm64_sys_swapoff+0x8cc/0x14b0
> [ 261.215516][T11285] invoke_syscall+0x58/0x114
> [ 261.215526][T11285] el0_svc_common+0xac/0xe0
> [ 261.215532][T11285] do_el0_svc+0x1c/0x28
> [ 261.215539][T11285] el0_svc+0x38/0x68
> [ 261.215544][T11285] el0t_64_sync_handler+0x68/0xbc
> [ 261.215548][T11285] el0t_64_sync+0x1a8/0x1ac
> [ 261.215552][T11285] ---[ end trace 0000000000000000 ]---

Ah, yes. In unuse_pte(), you'll have to do the right thing if
!folio_test_anon(folio) before doing the folio_add_anon_rmap_pte().

You might have a fresh anon folio in the swapcache that was never mapped
(hopefully order-0, otherwise we'd likely be in trouble).

--
Cheers,

David / dhildenb


2024-06-13 09:23:42

by David Hildenbrand

Subject: Re: [PATCH RFC 2/3] mm: do_swap_page: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false

On 13.06.24 02:07, Barry Song wrote:
> From: Barry Song <[email protected]>
>
> For the !folio_test_anon(folio) case, we can now invoke folio_add_new_anon_rmap()
> with the rmap flags set to either EXCLUSIVE or non-EXCLUSIVE. This action will
> suppress the VM_WARN_ON_FOLIO check within __folio_add_anon_rmap() while initiating
> the process of bringing up mTHP swapin.
>
> static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> struct page *page, int nr_pages, struct vm_area_struct *vma,
> unsigned long address, rmap_t flags, enum rmap_level level)
> {
> ...
> if (unlikely(!folio_test_anon(folio))) {
> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> level != RMAP_LEVEL_PMD, folio);
> }
> ...
> }
>
> It also enhances the code’s readability.
>
> Suggested-by: David Hildenbrand <[email protected]>
> Signed-off-by: Barry Song <[email protected]>
> ---
> mm/memory.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2f94921091fb..9c962f62f928 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4339,6 +4339,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(folio != swapcache && swapcache)) {
> folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> folio_add_lru_vma(folio, vma);
> + } else if (!folio_test_anon(folio)) {
> + folio_add_new_anon_rmap(folio, vma, address, rmap_flags);

So, with large folio swapin, we would never end up here if any swp PTE
is !exclusive, because we would make sure to allocate a large folio only
for suitable regions, correct?

It can certainly happen during swapout + refault with large folios. But
there, we will always have folio_test_anon() still set and wouldn't run
into this code path.

(it will all be better with per-folio PAE bit, but that will take some
more time until I actually get to implement it properly, handling all
the nasty corner cases)

Better add a comment regarding why we are sure that the complete thing
is exclusive (e.g., currently only small folios).

> } else {
> folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> rmap_flags);

--
Cheers,

David / dhildenb


2024-06-13 09:58:58

by Barry Song

Subject: Re: [PATCH RFC 2/3] mm: do_swap_page: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false

On Thu, Jun 13, 2024 at 9:23 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.06.24 02:07, Barry Song wrote:
> > From: Barry Song <[email protected]>
> >
> > For the !folio_test_anon(folio) case, we can now invoke folio_add_new_anon_rmap()
> > with the rmap flags set to either EXCLUSIVE or non-EXCLUSIVE. This action will
> > suppress the VM_WARN_ON_FOLIO check within __folio_add_anon_rmap() while initiating
> > the process of bringing up mTHP swapin.
> >
> > static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> > struct page *page, int nr_pages, struct vm_area_struct *vma,
> > unsigned long address, rmap_t flags, enum rmap_level level)
> > {
> > ...
> > if (unlikely(!folio_test_anon(folio))) {
> > VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> > level != RMAP_LEVEL_PMD, folio);
> > }
> > ...
> > }
> >
> > It also enhances the code’s readability.
> >
> > Suggested-by: David Hildenbrand <[email protected]>
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > mm/memory.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2f94921091fb..9c962f62f928 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4339,6 +4339,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (unlikely(folio != swapcache && swapcache)) {
> > folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > folio_add_lru_vma(folio, vma);
> > + } else if (!folio_test_anon(folio)) {
> > + folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>
> So, with large folio swapin, we would never end up here if any swp PTE
> is !exclusive, because we would make sure to allocate a large folio only
> for suitable regions, correct?
>
> It can certainly happen during swapout + refault with large folios. But
> there, we will always have folio_test_anon() still set and wouldn't run
> into this code path.
>
> (it will all be better with per-folio PAE bit, but that will take some
> more time until I actually get to implement it properly, handling all
> the nasty corner cases)
>
> Better add a comment regarding why we are sure that the complete thing
> is exclusive (e.g., currently only small folios).

No, patch 1/3 enables both cases to call folio_add_new_anon_rmap(). Currently,
small folios could be non-exclusive. I suppose your question is whether
we should ensure that all pages within an mTHP have the same exclusivity
status, rather than always being exclusive?

>
> > } else {
> > folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
> > rmap_flags);
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-06-13 10:37:36

by David Hildenbrand

Subject: Re: [PATCH RFC 2/3] mm: do_swap_page: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false

On 13.06.24 11:58, Barry Song wrote:
> On Thu, Jun 13, 2024 at 9:23 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 13.06.24 02:07, Barry Song wrote:
>>> From: Barry Song <[email protected]>
>>>
>>> For the !folio_test_anon(folio) case, we can now invoke folio_add_new_anon_rmap()
>>> with the rmap flags set to either EXCLUSIVE or non-EXCLUSIVE. This action will
>>> suppress the VM_WARN_ON_FOLIO check within __folio_add_anon_rmap() while initiating
>>> the process of bringing up mTHP swapin.
>>>
>>> static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>> struct page *page, int nr_pages, struct vm_area_struct *vma,
>>> unsigned long address, rmap_t flags, enum rmap_level level)
>>> {
>>> ...
>>> if (unlikely(!folio_test_anon(folio))) {
>>> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
>>> level != RMAP_LEVEL_PMD, folio);
>>> }
>>> ...
>>> }
>>>
>>> It also enhances the code’s readability.
>>>
>>> Suggested-by: David Hildenbrand <[email protected]>
>>> Signed-off-by: Barry Song <[email protected]>
>>> ---
>>> mm/memory.c | 2 ++
>>> 1 file changed, 2 insertions(+)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 2f94921091fb..9c962f62f928 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4339,6 +4339,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> if (unlikely(folio != swapcache && swapcache)) {
>>> folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
>>> folio_add_lru_vma(folio, vma);
>>> + } else if (!folio_test_anon(folio)) {
>>> + folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
>>
>> So, with large folio swapin, we would never end up here if any swp PTE
>> is !exclusive, because we would make sure to allocate a large folio only
>> for suitable regions, correct?
>>
>> It can certainly happen during swapout + refault with large folios. But
>> there, we will always have folio_test_anon() still set and wouldn't run
>> into this code path.
>>
>> (it will all be better with per-folio PAE bit, but that will take some
>> more time until I actually get to implement it properly, handling all
>> the nasty corner cases)
>>
>> Better add a comment regarding why we are sure that the complete thing
>> is exclusive (e.g., currently only small folios).
>
> No, patch 1/3 enables both cases to call folio_add_new_anon_rmap(). Currently,
> small folios could be non-exclusive. I suppose your question is
> whether we should
> ensure that all pages within a mTHP have the same exclusivity status,
> rather than
> always being exclusive?

We must never end up calling folio_add_new_anon_rmap(folio, vma,
address, RMAP_EXCLUSIVE) if part of the folio is non-exclusive.

I think we *may* call folio_add_new_anon_rmap(folio, vma, address, 0) if
part of the folio is exclusive [it got swapped out, so there cannot be
GUP references]. But, to be future proof (single PAE bit per folio), we
better avoid that on swapin if possible.

For small folios, it's clear that it cannot be partially exclusive
(single page).

We better comment why we adhere to the above. Like:

"We currently ensure that new folios cannot be partially exclusive: they
are either fully exclusive or fully shared."

--
Cheers,

David / dhildenb


2024-06-13 10:56:06

by Barry Song

Subject: Re: [PATCH RFC 2/3] mm: do_swap_page: use folio_add_new_anon_rmap() if folio_test_anon(folio)==false

On Thu, Jun 13, 2024 at 10:37 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.06.24 11:58, Barry Song wrote:
> > On Thu, Jun 13, 2024 at 9:23 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 13.06.24 02:07, Barry Song wrote:
> >>> From: Barry Song <[email protected]>
> >>>
> >>> For the !folio_test_anon(folio) case, we can now invoke folio_add_new_anon_rmap()
> >>> with the rmap flags set to either EXCLUSIVE or non-EXCLUSIVE. This action will
> >>> suppress the VM_WARN_ON_FOLIO check within __folio_add_anon_rmap() while initiating
> >>> the process of bringing up mTHP swapin.
> >>>
> >>> static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> >>> struct page *page, int nr_pages, struct vm_area_struct *vma,
> >>> unsigned long address, rmap_t flags, enum rmap_level level)
> >>> {
> >>> ...
> >>> if (unlikely(!folio_test_anon(folio))) {
> >>> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> >>> level != RMAP_LEVEL_PMD, folio);
> >>> }
> >>> ...
> >>> }
> >>>
> >>> It also enhances the code’s readability.
> >>>
> >>> Suggested-by: David Hildenbrand <[email protected]>
> >>> Signed-off-by: Barry Song <[email protected]>
> >>> ---
> >>> mm/memory.c | 2 ++
> >>> 1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index 2f94921091fb..9c962f62f928 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4339,6 +4339,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >>> if (unlikely(folio != swapcache && swapcache)) {
> >>> folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> >>> folio_add_lru_vma(folio, vma);
> >>> + } else if (!folio_test_anon(folio)) {
> >>> + folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
> >>
> >> So, with large folio swapin, we would never end up here if any swp PTE
> >> is !exclusive, because we would make sure to allocate a large folio only
> >> for suitable regions, correct?
> >>
> >> It can certainly happen during swapout + refault with large folios. But
> >> there, we will always have folio_test_anon() still set and wouldn't run
> >> into this code path.
> >>
> >> (it will all be better with per-folio PAE bit, but that will take some
> >> more time until I actually get to implement it properly, handling all
> >> the nasty corner cases)
> >>
> >> Better add a comment regarding why we are sure that the complete thing
> >> is exclusive (e.g., currently only small folios).
> >
> > No, patch 1/3 enables both cases to call folio_add_new_anon_rmap(). Currently,
> > small folios could be non-exclusive. I suppose your question is
> > whether we should
> > ensure that all pages within a mTHP have the same exclusivity status,
> > rather than
> > always being exclusive?
>
> We must never end up calling folio_add_new_anon_rmap(folio, vma,
> address, RMAP_EXCLUSIVE) if part of the folio is non-exclusive.

Agreed.

>
> I think we *may* call folio_add_new_anon_rmap(folio, vma, address, 0) if
> part of the folio is exclusive [it go swapped out, so there cannot be
> GUP references]. But, to be future proof (single PAE bit per folio), we
> better avoid that on swapin if possible.
>
> For small folios, it's clear that it cannot be partially exclusive
> (single page).

Yes, for small folios, it is much simpler; they are either entirely exclusive or
entirely non-exclusive.

For the initial swapin patch[1], which only supports SYNC-IO mTHP swapin,
mTHP is always entirely exclusive. I am also considering non-SYNCHRONOUS_IO
swapcache-based mTHP swapin but intend to start with the entirely exclusive
cases.

[1] https://lore.kernel.org/linux-mm/[email protected]/

>
> We better comment why we obey to the above. Like
>
> "We currently ensure that new folios cannot be partially exclusive: they
> are either fully exclusive or fully shared."
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-06-14 07:56:30

by Barry Song

Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <[email protected]> wrote:
>
> On 13.06.24 11:06, Barry Song wrote:
> > On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 13.06.24 10:46, Barry Song wrote:
> >>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
> >>>>
> >>>> From: Barry Song <[email protected]>
> >>>>
> >>>> The folio_test_anon(folio)==false case within do_swap_page() has been
> >>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> >>>> consistently pass anonymous folios.
> >>>>
> >>>> stack 1:
> >>>> remove_migration_pmd
> >>>> -> folio_add_anon_rmap_pmd
> >>>> -> __folio_add_anon_rmap
> >>>>
> >>>> stack 2:
> >>>> __split_huge_pmd_locked
> >>>> -> folio_add_anon_rmap_ptes
> >>>> -> __folio_add_anon_rmap
> >>>>
> >>>> __folio_add_anon_rmap() only needs to handle the cases
> >>>> folio_test_anon(folio)==true now.
> >>>
> >>> My team reported a case where swapoff() is calling
> >>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> >>> with one new anon (!folio_test_anon(folio)).
> >>>
> >>> I will double check all folio_add_anon_rmap_pte() cases.
> >>
> >> Right, swapoff() path is a bit special (e.g., we don't do any kind of
> >> batching on the swapoff() path).
> >>
> >> But we should never get a new large anon folio there, or could that now
> >> happen?
> >
> > My team encountered the following issue while testing this RFC
> > series on real hardware. Let me take a look to identify the
> > problem.
> >
> > [ 261.214195][T11285] page:000000004cdd779e refcount:4 mapcount:1
> > mapping:00000000ba142c22 index:0x1 pfn:0x1b30a6
> > [ 261.214213][T11285] memcg:ffffff8003678000
> > [ 261.214217][T11285] aops:swap_aops
> > [ 261.214233][T11285] flags:
> > 0x2000000000081009(locked|uptodate|owner_priv_1|swapbacked|zone=1|kasantag=0x0)
> > [ 261.214241][T11285] page_type: 0x0()
> > [ 261.214246][T11285] raw: 2000000000081009 0000000000000000
> > dead000000000122 0000000000000000
> > [ 261.214251][T11285] raw: 0000000000000001 00000000000d84b3
> > 0000000400000000 ffffff8003678000
> > [ 261.214254][T11285] page dumped because:
> > VM_WARN_ON_FOLIO(!folio_test_anon(folio))
> > [ 261.214257][T11285] page_owner tracks the page as allocated
> > [ 261.214260][T11285] page last allocated via order 0, migratetype
> > Movable, gfp_mask 0x2140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), pid
> > 11285, tgid 11285 (swapoff), ts 261214177545, free_ts 261151875699
> > [ 261.214268][T11285] post_alloc_hook+0x1b8/0x1c0
> > [ 261.214284][T11285] prep_new_page+0x28/0x13c
> > [ 261.214291][T11285] get_page_from_freelist+0x198c/0x1aa4
> > [ 261.214298][T11285] __alloc_pages+0x15c/0x330
> > [ 261.214304][T11285] __folio_alloc+0x1c/0x4c
> > [ 261.214310][T11285] __read_swap_cache_async+0xd8/0x48c
> > [ 261.214320][T11285] swap_cluster_readahead+0x158/0x324
> > [ 261.214326][T11285] swapin_readahead+0x64/0x448
> > [ 261.214331][T11285] __arm64_sys_swapoff+0x6ec/0x14b0
> > [ 261.214337][T11285] invoke_syscall+0x58/0x114
> > [ 261.214353][T11285] el0_svc_common+0xac/0xe0
> > [ 261.214360][T11285] do_el0_svc+0x1c/0x28
> > [ 261.214366][T11285] el0_svc+0x38/0x68
> > [ 261.214372][T11285] el0t_64_sync_handler+0x68/0xbc
> > [ 261.214376][T11285] el0t_64_sync+0x1a8/0x1ac
> > [ 261.214381][T11285] page last free pid 90 tgid 90 stack trace:
> > [ 261.214386][T11285] free_unref_page_prepare+0x338/0x374
> > [ 261.214395][T11285] free_unref_page_list+0x84/0x378
> > [ 261.214400][T11285] shrink_folio_list+0x1234/0x13e4
> > [ 261.214409][T11285] evict_folios+0x1458/0x19b4
> > [ 261.214417][T11285] try_to_shrink_lruvec+0x1c8/0x264
> > [ 261.214422][T11285] shrink_one+0xa8/0x234
> > [ 261.214427][T11285] shrink_node+0xb38/0xde0
> > [ 261.214432][T11285] balance_pgdat+0x7a4/0xdb4
> > [ 261.214437][T11285] kswapd+0x290/0x4e4
> > [ 261.214442][T11285] kthread+0x114/0x1bc
> > [ 261.214459][T11285] ret_from_fork+0x10/0x20
> > [ 261.214477][T11285] ------------[ cut here ]------------
> > [ 261.214480][T11285] WARNING: CPU: 3 PID: 11285 at mm/rmap.c:1305
> > folio_add_anon_rmap_ptes+0x2b4/0x330
> >
> > [ 261.215403][T11285] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
> > -SSBS BTYPE=--)
> > [ 261.215423][T11285] pc : folio_add_anon_rmap_ptes+0x2b4/0x330
> > [ 261.215428][T11285] lr : folio_add_anon_rmap_ptes+0x2b4/0x330
> > [ 261.215433][T11285] sp : ffffffc0a7dbbbf0
> > [ 261.215437][T11285] x29: ffffffc0a7dbbbf0 x28: ffffff8040860a08
> > x27: ffffff80db480000
> > [ 261.215445][T11285] x26: fffffffe04cc2980 x25: ffffff808f77f120
> > x24: 0000007b44941000
> > [ 261.215452][T11285] x23: 0000000000000001 x22: 0000000000000001
> > x21: fffffffe04cc2980
> > [ 261.215459][T11285] x20: ffffff80db480000 x19: fffffffe04cc2980
> > x18: ffffffed011dae80
> > [ 261.215465][T11285] x17: 0000000000000001 x16: ffffffffffffffff
> > x15: 0000000000000004
> > [ 261.215471][T11285] x14: ffffff82fb73fac0 x13: 0000000000000003
> > x12: 0000000000000003
> > [ 261.215476][T11285] x11: 00000000fffeffff x10: c0000000fffeffff x9
> > : 0d46c0889b468e00
> > [ 261.215483][T11285] x8 : 0d46c0889b468e00 x7 : 545b5d3935343431 x6
> > : 322e31363220205b
> > [ 261.215489][T11285] x5 : ffffffed013de407 x4 : ffffffed00698967 x3
> > : 0000000000000000
> > [ 261.215495][T11285] x2 : 0000000000000000 x1 : ffffffc0a7dbb8c0 x0
> > : ffffff8068c15c80
> > [ 261.215501][T11285] Call trace:
> > [ 261.215504][T11285] folio_add_anon_rmap_ptes+0x2b4/0x330
> > [ 261.215509][T11285] __arm64_sys_swapoff+0x8cc/0x14b0
> > [ 261.215516][T11285] invoke_syscall+0x58/0x114
> > [ 261.215526][T11285] el0_svc_common+0xac/0xe0
> > [ 261.215532][T11285] do_el0_svc+0x1c/0x28
> > [ 261.215539][T11285] el0_svc+0x38/0x68
> > [ 261.215544][T11285] el0t_64_sync_handler+0x68/0xbc
> > [ 261.215548][T11285] el0t_64_sync+0x1a8/0x1ac
> > [ 261.215552][T11285] ---[ end trace 0000000000000000 ]---
>
> Ah, yes. in unuse_pte(), you'll have to do the right thing if
> !folio_test(anon) before doing the folio_add_anon_rmap_pte().
>
> You might have a fresh anon folio in the swapcache that was never mapped
> (hopefully order-0, otherwise we'd likely be in trouble).

Yes. It is order-0

[ 261.214260][T11285] page last allocated via order 0, migratetype

Otherwise, we would have encountered this VM_WARN_ON_FOLIO?

__folio_add_anon_rmap()
{
        ...
        VM_WARN_ON_FOLIO(folio_test_large(folio) &&
                         level != RMAP_LEVEL_PMD, folio);
        ...
}

Given that nobody has ever reported this warning, I assume all callers
using folio_add_anon_rmap_pte(s) are currently safe to move to the following?

if (!folio_test_anon(folio))
        folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
else
        folio_add_anon_rmap_pte(s)

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-06-14 08:51:31

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On 14.06.24 09:56, Barry Song wrote:
> On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <[email protected]> wrote:
>>
>> On 13.06.24 11:06, Barry Song wrote:
>>> On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
>>>>
>>>> On 13.06.24 10:46, Barry Song wrote:
>>>>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
>>>>>>
>>>>>> From: Barry Song <[email protected]>
>>>>>>
>>>>>> The folio_test_anon(folio)==false case within do_swap_page() has been
>>>>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
>>>>>> consistently pass anonymous folios.
>>>>>>
>>>>>> stack 1:
>>>>>> remove_migration_pmd
>>>>>> -> folio_add_anon_rmap_pmd
>>>>>> -> __folio_add_anon_rmap
>>>>>>
>>>>>> stack 2:
>>>>>> __split_huge_pmd_locked
>>>>>> -> folio_add_anon_rmap_ptes
>>>>>> -> __folio_add_anon_rmap
>>>>>>
>>>>>> __folio_add_anon_rmap() only needs to handle the cases
>>>>>> folio_test_anon(folio)==true now.
>>>>>
>>>>> My team reported a case where swapoff() is calling
>>>>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
>>>>> with one new anon (!folio_test_anon(folio)).
>>>>>
>>>>> I will double check all folio_add_anon_rmap_pte() cases.
>>>>
>>>> Right, swapoff() path is a bit special (e.g., we don't do any kind of
>>>> batching on the swapoff() path).
>>>>
>>>> But we should never get a new large anon folio there, or could that now
>>>> happen?
>>>
>>> My team encountered the following issue while testing this RFC
>>> series on real hardware. Let me take a look to identify the
>>> problem.
>>>
>>> [ 261.214195][T11285] page:000000004cdd779e refcount:4 mapcount:1
>>> mapping:00000000ba142c22 index:0x1 pfn:0x1b30a6
>>> [ 261.214213][T11285] memcg:ffffff8003678000
>>> [ 261.214217][T11285] aops:swap_aops
>>> [ 261.214233][T11285] flags:
>>> 0x2000000000081009(locked|uptodate|owner_priv_1|swapbacked|zone=1|kasantag=0x0)
>>> [ 261.214241][T11285] page_type: 0x0()
>>> [ 261.214246][T11285] raw: 2000000000081009 0000000000000000
>>> dead000000000122 0000000000000000
>>> [ 261.214251][T11285] raw: 0000000000000001 00000000000d84b3
>>> 0000000400000000 ffffff8003678000
>>> [ 261.214254][T11285] page dumped because:
>>> VM_WARN_ON_FOLIO(!folio_test_anon(folio))
>>> [ 261.214257][T11285] page_owner tracks the page as allocated
>>> [ 261.214260][T11285] page last allocated via order 0, migratetype
>>> Movable, gfp_mask 0x2140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), pid
>>> 11285, tgid 11285 (swapoff), ts 261214177545, free_ts 261151875699
>>> [ 261.214268][T11285] post_alloc_hook+0x1b8/0x1c0
>>> [ 261.214284][T11285] prep_new_page+0x28/0x13c
>>> [ 261.214291][T11285] get_page_from_freelist+0x198c/0x1aa4
>>> [ 261.214298][T11285] __alloc_pages+0x15c/0x330
>>> [ 261.214304][T11285] __folio_alloc+0x1c/0x4c
>>> [ 261.214310][T11285] __read_swap_cache_async+0xd8/0x48c
>>> [ 261.214320][T11285] swap_cluster_readahead+0x158/0x324
>>> [ 261.214326][T11285] swapin_readahead+0x64/0x448
>>> [ 261.214331][T11285] __arm64_sys_swapoff+0x6ec/0x14b0
>>> [ 261.214337][T11285] invoke_syscall+0x58/0x114
>>> [ 261.214353][T11285] el0_svc_common+0xac/0xe0
>>> [ 261.214360][T11285] do_el0_svc+0x1c/0x28
>>> [ 261.214366][T11285] el0_svc+0x38/0x68
>>> [ 261.214372][T11285] el0t_64_sync_handler+0x68/0xbc
>>> [ 261.214376][T11285] el0t_64_sync+0x1a8/0x1ac
>>> [ 261.214381][T11285] page last free pid 90 tgid 90 stack trace:
>>> [ 261.214386][T11285] free_unref_page_prepare+0x338/0x374
>>> [ 261.214395][T11285] free_unref_page_list+0x84/0x378
>>> [ 261.214400][T11285] shrink_folio_list+0x1234/0x13e4
>>> [ 261.214409][T11285] evict_folios+0x1458/0x19b4
>>> [ 261.214417][T11285] try_to_shrink_lruvec+0x1c8/0x264
>>> [ 261.214422][T11285] shrink_one+0xa8/0x234
>>> [ 261.214427][T11285] shrink_node+0xb38/0xde0
>>> [ 261.214432][T11285] balance_pgdat+0x7a4/0xdb4
>>> [ 261.214437][T11285] kswapd+0x290/0x4e4
>>> [ 261.214442][T11285] kthread+0x114/0x1bc
>>> [ 261.214459][T11285] ret_from_fork+0x10/0x20
>>> [ 261.214477][T11285] ------------[ cut here ]------------
>>> [ 261.214480][T11285] WARNING: CPU: 3 PID: 11285 at mm/rmap.c:1305
>>> folio_add_anon_rmap_ptes+0x2b4/0x330
>>>
>>> [ 261.215403][T11285] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
>>> -SSBS BTYPE=--)
>>> [ 261.215423][T11285] pc : folio_add_anon_rmap_ptes+0x2b4/0x330
>>> [ 261.215428][T11285] lr : folio_add_anon_rmap_ptes+0x2b4/0x330
>>> [ 261.215433][T11285] sp : ffffffc0a7dbbbf0
>>> [ 261.215437][T11285] x29: ffffffc0a7dbbbf0 x28: ffffff8040860a08
>>> x27: ffffff80db480000
>>> [ 261.215445][T11285] x26: fffffffe04cc2980 x25: ffffff808f77f120
>>> x24: 0000007b44941000
>>> [ 261.215452][T11285] x23: 0000000000000001 x22: 0000000000000001
>>> x21: fffffffe04cc2980
>>> [ 261.215459][T11285] x20: ffffff80db480000 x19: fffffffe04cc2980
>>> x18: ffffffed011dae80
>>> [ 261.215465][T11285] x17: 0000000000000001 x16: ffffffffffffffff
>>> x15: 0000000000000004
>>> [ 261.215471][T11285] x14: ffffff82fb73fac0 x13: 0000000000000003
>>> x12: 0000000000000003
>>> [ 261.215476][T11285] x11: 00000000fffeffff x10: c0000000fffeffff x9
>>> : 0d46c0889b468e00
>>> [ 261.215483][T11285] x8 : 0d46c0889b468e00 x7 : 545b5d3935343431 x6
>>> : 322e31363220205b
>>> [ 261.215489][T11285] x5 : ffffffed013de407 x4 : ffffffed00698967 x3
>>> : 0000000000000000
>>> [ 261.215495][T11285] x2 : 0000000000000000 x1 : ffffffc0a7dbb8c0 x0
>>> : ffffff8068c15c80
>>> [ 261.215501][T11285] Call trace:
>>> [ 261.215504][T11285] folio_add_anon_rmap_ptes+0x2b4/0x330
>>> [ 261.215509][T11285] __arm64_sys_swapoff+0x8cc/0x14b0
>>> [ 261.215516][T11285] invoke_syscall+0x58/0x114
>>> [ 261.215526][T11285] el0_svc_common+0xac/0xe0
>>> [ 261.215532][T11285] do_el0_svc+0x1c/0x28
>>> [ 261.215539][T11285] el0_svc+0x38/0x68
>>> [ 261.215544][T11285] el0t_64_sync_handler+0x68/0xbc
>>> [ 261.215548][T11285] el0t_64_sync+0x1a8/0x1ac
>>> [ 261.215552][T11285] ---[ end trace 0000000000000000 ]---
>>
>> Ah, yes. In unuse_pte(), you'll have to do the right thing if
>> !folio_test_anon(folio) before doing the folio_add_anon_rmap_pte().
>>
>> You might have a fresh anon folio in the swapcache that was never mapped
>> (hopefully order-0, otherwise we'd likely be in trouble).
>
> Yes. It is order-0
>
> [ 261.214260][T11285] page last allocated via order 0, migratetype
>
> Otherwise, we would have encountered this VM_WARN_ON_FOLIO?
>
> __folio_add_anon_rmap()
> {
> ...
> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> level != RMAP_LEVEL_PMD, folio);
> ...
> }
>
> Given that nobody has ever reported this warning, I assume all callers
> using folio_add_anon_rmap_pte(s) are right now safe to move to ?

Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
small folio if !anon. If that ever changes, we can assess the situation.

So far, only the swap path created "new" anon folios without calling the
right function; all other code handles that correctly.
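
Roughly, the unuse_pte() hunk being discussed could then look like the
sketch below (illustrative only; the variable names are the ones already
used on the swapoff path, and the VM_WARN_ON_ONCE placement is the
suggestion from above, not code that exists yet):

	/*
	 * Sketch: a !anon folio here is expected to be a freshly
	 * allocated, never-mapped (order-0) swapcache folio, so treat
	 * it as a new anon folio; warn once if a large folio ever
	 * shows up on this path.
	 */
	if (!folio_test_anon(folio)) {
		VM_WARN_ON_ONCE(folio_test_large(folio));
		folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
	} else {
		folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
	}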

--
Cheers,

David / dhildenb


2024-06-14 08:56:22

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Fri, Jun 14, 2024 at 8:51 PM David Hildenbrand <[email protected]> wrote:
>
> On 14.06.24 09:56, Barry Song wrote:
> > On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <[email protected]> wrote:
> >>
> >> On 13.06.24 11:06, Barry Song wrote:
> >>> On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
> >>>>
> >>>> On 13.06.24 10:46, Barry Song wrote:
> >>>>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
> >>>>>>
> >>>>>> From: Barry Song <[email protected]>
> >>>>>>
> >>>>>> The folio_test_anon(folio)==false case within do_swap_page() has been
> >>>>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> >>>>>> consistently pass anonymous folios.
> >>>>>>
> >>>>>> stack 1:
> >>>>>> remove_migration_pmd
> >>>>>> -> folio_add_anon_rmap_pmd
> >>>>>> -> __folio_add_anon_rmap
> >>>>>>
> >>>>>> stack 2:
> >>>>>> __split_huge_pmd_locked
> >>>>>> -> folio_add_anon_rmap_ptes
> >>>>>> -> __folio_add_anon_rmap
> >>>>>>
> >>>>>> __folio_add_anon_rmap() only needs to handle the cases
> >>>>>> folio_test_anon(folio)==true now.
> >>>>>
> >>>>> My team reported a case where swapoff() is calling
> >>>>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> >>>>> with one new anon (!folio_test_anon(folio)).
> >>>>>
> >>>>> I will double check all folio_add_anon_rmap_pte() cases.
> >>>>
> >>>> Right, swapoff() path is a bit special (e.g., we don't do any kind of
> >>>> batching on the swapoff() path).
> >>>>
> >>>> But we should never get a new large anon folio there, or could that now
> >>>> happen?
> >>>
> >>> My team encountered the following issue while testing this RFC
> >>> series on real hardware. Let me take a look to identify the
> >>> problem.
> >>>
> >>> [ 261.214195][T11285] page:000000004cdd779e refcount:4 mapcount:1
> >>> mapping:00000000ba142c22 index:0x1 pfn:0x1b30a6
> >>> [ 261.214213][T11285] memcg:ffffff8003678000
> >>> [ 261.214217][T11285] aops:swap_aops
> >>> [ 261.214233][T11285] flags:
> >>> 0x2000000000081009(locked|uptodate|owner_priv_1|swapbacked|zone=1|kasantag=0x0)
> >>> [ 261.214241][T11285] page_type: 0x0()
> >>> [ 261.214246][T11285] raw: 2000000000081009 0000000000000000
> >>> dead000000000122 0000000000000000
> >>> [ 261.214251][T11285] raw: 0000000000000001 00000000000d84b3
> >>> 0000000400000000 ffffff8003678000
> >>> [ 261.214254][T11285] page dumped because:
> >>> VM_WARN_ON_FOLIO(!folio_test_anon(folio))
> >>> [ 261.214257][T11285] page_owner tracks the page as allocated
> >>> [ 261.214260][T11285] page last allocated via order 0, migratetype
> >>> Movable, gfp_mask 0x2140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), pid
> >>> 11285, tgid 11285 (swapoff), ts 261214177545, free_ts 261151875699
> >>> [ 261.214268][T11285] post_alloc_hook+0x1b8/0x1c0
> >>> [ 261.214284][T11285] prep_new_page+0x28/0x13c
> >>> [ 261.214291][T11285] get_page_from_freelist+0x198c/0x1aa4
> >>> [ 261.214298][T11285] __alloc_pages+0x15c/0x330
> >>> [ 261.214304][T11285] __folio_alloc+0x1c/0x4c
> >>> [ 261.214310][T11285] __read_swap_cache_async+0xd8/0x48c
> >>> [ 261.214320][T11285] swap_cluster_readahead+0x158/0x324
> >>> [ 261.214326][T11285] swapin_readahead+0x64/0x448
> >>> [ 261.214331][T11285] __arm64_sys_swapoff+0x6ec/0x14b0
> >>> [ 261.214337][T11285] invoke_syscall+0x58/0x114
> >>> [ 261.214353][T11285] el0_svc_common+0xac/0xe0
> >>> [ 261.214360][T11285] do_el0_svc+0x1c/0x28
> >>> [ 261.214366][T11285] el0_svc+0x38/0x68
> >>> [ 261.214372][T11285] el0t_64_sync_handler+0x68/0xbc
> >>> [ 261.214376][T11285] el0t_64_sync+0x1a8/0x1ac
> >>> [ 261.214381][T11285] page last free pid 90 tgid 90 stack trace:
> >>> [ 261.214386][T11285] free_unref_page_prepare+0x338/0x374
> >>> [ 261.214395][T11285] free_unref_page_list+0x84/0x378
> >>> [ 261.214400][T11285] shrink_folio_list+0x1234/0x13e4
> >>> [ 261.214409][T11285] evict_folios+0x1458/0x19b4
> >>> [ 261.214417][T11285] try_to_shrink_lruvec+0x1c8/0x264
> >>> [ 261.214422][T11285] shrink_one+0xa8/0x234
> >>> [ 261.214427][T11285] shrink_node+0xb38/0xde0
> >>> [ 261.214432][T11285] balance_pgdat+0x7a4/0xdb4
> >>> [ 261.214437][T11285] kswapd+0x290/0x4e4
> >>> [ 261.214442][T11285] kthread+0x114/0x1bc
> >>> [ 261.214459][T11285] ret_from_fork+0x10/0x20
> >>> [ 261.214477][T11285] ------------[ cut here ]------------
> >>> [ 261.214480][T11285] WARNING: CPU: 3 PID: 11285 at mm/rmap.c:1305
> >>> folio_add_anon_rmap_ptes+0x2b4/0x330
> >>>
> >>> [ 261.215403][T11285] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT
> >>> -SSBS BTYPE=--)
> >>> [ 261.215423][T11285] pc : folio_add_anon_rmap_ptes+0x2b4/0x330
> >>> [ 261.215428][T11285] lr : folio_add_anon_rmap_ptes+0x2b4/0x330
> >>> [ 261.215433][T11285] sp : ffffffc0a7dbbbf0
> >>> [ 261.215437][T11285] x29: ffffffc0a7dbbbf0 x28: ffffff8040860a08
> >>> x27: ffffff80db480000
> >>> [ 261.215445][T11285] x26: fffffffe04cc2980 x25: ffffff808f77f120
> >>> x24: 0000007b44941000
> >>> [ 261.215452][T11285] x23: 0000000000000001 x22: 0000000000000001
> >>> x21: fffffffe04cc2980
> >>> [ 261.215459][T11285] x20: ffffff80db480000 x19: fffffffe04cc2980
> >>> x18: ffffffed011dae80
> >>> [ 261.215465][T11285] x17: 0000000000000001 x16: ffffffffffffffff
> >>> x15: 0000000000000004
> >>> [ 261.215471][T11285] x14: ffffff82fb73fac0 x13: 0000000000000003
> >>> x12: 0000000000000003
> >>> [ 261.215476][T11285] x11: 00000000fffeffff x10: c0000000fffeffff x9
> >>> : 0d46c0889b468e00
> >>> [ 261.215483][T11285] x8 : 0d46c0889b468e00 x7 : 545b5d3935343431 x6
> >>> : 322e31363220205b
> >>> [ 261.215489][T11285] x5 : ffffffed013de407 x4 : ffffffed00698967 x3
> >>> : 0000000000000000
> >>> [ 261.215495][T11285] x2 : 0000000000000000 x1 : ffffffc0a7dbb8c0 x0
> >>> : ffffff8068c15c80
> >>> [ 261.215501][T11285] Call trace:
> >>> [ 261.215504][T11285] folio_add_anon_rmap_ptes+0x2b4/0x330
> >>> [ 261.215509][T11285] __arm64_sys_swapoff+0x8cc/0x14b0
> >>> [ 261.215516][T11285] invoke_syscall+0x58/0x114
> >>> [ 261.215526][T11285] el0_svc_common+0xac/0xe0
> >>> [ 261.215532][T11285] do_el0_svc+0x1c/0x28
> >>> [ 261.215539][T11285] el0_svc+0x38/0x68
> >>> [ 261.215544][T11285] el0t_64_sync_handler+0x68/0xbc
> >>> [ 261.215548][T11285] el0t_64_sync+0x1a8/0x1ac
> >>> [ 261.215552][T11285] ---[ end trace 0000000000000000 ]---
> >>
> >> Ah, yes. in unuse_pte(), you'll have to do the right thing if
> >> !folio_test(anon) before doing the folio_add_anon_rmap_pte().
> >>
> >> You might have a fresh anon folio in the swapcache that was never mapped
> >> (hopefully order-0, otherwise we'd likely be in trouble).
> >
> > Yes. It is order-0
> >
> > [ 261.214260][T11285] page last allocated via order 0, migratetype
> >
> > Otherwise, we would have encountered this VM_WARN_ON_FOLIO?
> >
> > __folio_add_anon_rmap()
> > {
> > ...
> > VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> > level != RMAP_LEVEL_PMD, folio);
> > ...
> > }
> >
> > Given that nobody has ever reported this warning, I assume all callers
> > using folio_add_anon_rmap_pte(s) are right now safe to move to ?
>
> Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
> small folio if !anon. If that ever changes, we can assess the situation.
>

This patch actually has a WARN_ON for all !anon folios, so it extends the
WARN to small folios as well.

- if (unlikely(!folio_test_anon(folio))) {
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
- /*
- * For a PTE-mapped large folio, we only know that the single
- * PTE is exclusive. Further, __folio_set_anon() might not get
- * folio->index right when not given the address of the head
- * page.
- */
- VM_WARN_ON_FOLIO(folio_test_large(folio) &&
- level != RMAP_LEVEL_PMD, folio);
- __folio_set_anon(folio, vma, address,
- !!(flags & RMAP_EXCLUSIVE));
- } else if (likely(!folio_test_ksm(folio))) {
+ VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+

> Only swap created "new" anon folios without properly calling the right
> function so far, all other code handles that correctly.
>
> --
> Cheers,
>
> David / dhildenb
>

2024-06-14 09:00:00

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Fri, Jun 14, 2024 at 8:56 PM Barry Song <[email protected]> wrote:
>
> On Fri, Jun 14, 2024 at 8:51 PM David Hildenbrand <[email protected]> wrote:
> >
> > On 14.06.24 09:56, Barry Song wrote:
> > > On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <[email protected]> wrote:
> > >>
> > >> On 13.06.24 11:06, Barry Song wrote:
> > >>> On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
> > >>>>
> > >>>> On 13.06.24 10:46, Barry Song wrote:
> > >>>>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
> > >>>>>>
> > >>>>>> From: Barry Song <[email protected]>
> > >>>>>>
> > >>>>>> The folio_test_anon(folio)==false case within do_swap_page() has been
> > >>>>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> > >>>>>> consistently pass anonymous folios.
> > >>>>>>
> > >>>>>> stack 1:
> > >>>>>> remove_migration_pmd
> > >>>>>> -> folio_add_anon_rmap_pmd
> > >>>>>> -> __folio_add_anon_rmap
> > >>>>>>
> > >>>>>> stack 2:
> > >>>>>> __split_huge_pmd_locked
> > >>>>>> -> folio_add_anon_rmap_ptes
> > >>>>>> -> __folio_add_anon_rmap
> > >>>>>>
> > >>>>>> __folio_add_anon_rmap() only needs to handle the cases
> > >>>>>> folio_test_anon(folio)==true now.
> > >>>>>
> > >>>>> My team reported a case where swapoff() is calling
> > >>>>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> > >>>>> with one new anon (!folio_test_anon(folio)).
> > >>>>>
> > >>>>> I will double check all folio_add_anon_rmap_pte() cases.
> > >>>>
> > >>>> Right, swapoff() path is a bit special (e.g., we don't do any kind of
> > >>>> batching on the swapoff() path).
> > >>>>
> > >>>> But we should never get a new large anon folio there, or could that now
> > >>>> happen?
> > >>>
> > >>> My team encountered the following issue while testing this RFC
> > >>> series on real hardware. Let me take a look to identify the
> > >>> problem.
> > >>>
> > >>> [...]
> > >>
> > >> Ah, yes. in unuse_pte(), you'll have to do the right thing if
> > >> !folio_test(anon) before doing the folio_add_anon_rmap_pte().
> > >>
> > >> You might have a fresh anon folio in the swapcache that was never mapped
> > >> (hopefully order-0, otherwise we'd likely be in trouble).
> > >
> > > Yes. It is order-0
> > >
> > > [ 261.214260][T11285] page last allocated via order 0, migratetype
> > >
> > > Otherwise, we would have encountered this VM_WARN_ON_FOLIO?
> > >
> > > __folio_add_anon_rmap()
> > > {
> > > ...
> > > VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> > > level != RMAP_LEVEL_PMD, folio);
> > > ...
> > > }
> > >
> > > Given that nobody has ever reported this warning, I assume all callers
> > > using folio_add_anon_rmap_pte(s) are right now safe to move to ?
> >
> > Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
> > small folio if !anon. If that ever changes, we can assess the situation.
> >
>
> this patch actually has a WARN_ON for all !anon, so it extends the WARN
> to small.
>
> - if (unlikely(!folio_test_anon(folio))) {
> - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> - /*
> - * For a PTE-mapped large folio, we only know that the single
> - * PTE is exclusive. Further, __folio_set_anon() might not get
> - * folio->index right when not given the address of the head
> - * page.
> - */
> - VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> - level != RMAP_LEVEL_PMD, folio);
> - __folio_set_anon(folio, vma, address,
> - !!(flags & RMAP_EXCLUSIVE));
> - } else if (likely(!folio_test_ksm(folio))) {
> + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> +
>
> > Only swap created "new" anon folios without properly calling the right
> > function so far, all other code handles that correctly.

as I assume patch 2/3 should have moved all !anon cases to
folio_add_new_anon_rmap():

diff --git a/mm/ksm.c b/mm/ksm.c
index d2641bc2efc9..c913ba8c37eb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1420,7 +1420,14 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
*/
if (!is_zero_pfn(page_to_pfn(kpage))) {
folio_get(kfolio);
- folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
+ /*
+ * We currently ensure that new folios cannot be partially
+ * exclusive: they are either fully exclusive or fully shared.
+ */
+ if (!folio_test_anon(kfolio))
+ folio_add_new_anon_rmap(kfolio, vma, addr, RMAP_NONE);
+ else
+ folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
newpte = mk_pte(kpage, vma->vm_page_prot);
} else {
/*
diff --git a/mm/memory.c b/mm/memory.c
index 1f24ecdafe05..16797cc22058 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4339,6 +4339,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache && swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
+ } else if (!folio_test_anon(folio)) {
+ /*
+ * We currently ensure that new folios cannot be partially
+ * exclusive: they are either fully exclusive or fully shared.
+ */
+ folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
rmap_flags);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ae1d2700f6a3..ac53e46de48a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1908,8 +1908,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
if (pte_swp_exclusive(old_pte))
rmap_flags |= RMAP_EXCLUSIVE;
-
- folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
+ /*
+ * We currently ensure that new folios cannot be partially
+ * exclusive: they are either fully exclusive or fully shared.
+ */
+ if (!folio_test_anon(folio))
+ folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
+ else
+ folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
} else { /* ksm created a completely new copy */
folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);


> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

2024-06-14 09:05:47

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On 14.06.24 10:58, Barry Song wrote:
> On Fri, Jun 14, 2024 at 8:56 PM Barry Song <[email protected]> wrote:
>>
>> On Fri, Jun 14, 2024 at 8:51 PM David Hildenbrand <[email protected]> wrote:
>>>
>>> On 14.06.24 09:56, Barry Song wrote:
>>>> On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <[email protected]> wrote:
>>>>>
>>>>> On 13.06.24 11:06, Barry Song wrote:
>>>>>> On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
>>>>>>>
>>>>>>> On 13.06.24 10:46, Barry Song wrote:
>>>>>>>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> From: Barry Song <[email protected]>
>>>>>>>>>
>>>>>>>>> The folio_test_anon(folio)==false case within do_swap_page() has been
>>>>>>>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
>>>>>>>>> consistently pass anonymous folios.
>>>>>>>>>
>>>>>>>>> stack 1:
>>>>>>>>> remove_migration_pmd
>>>>>>>>> -> folio_add_anon_rmap_pmd
>>>>>>>>> -> __folio_add_anon_rmap
>>>>>>>>>
>>>>>>>>> stack 2:
>>>>>>>>> __split_huge_pmd_locked
>>>>>>>>> -> folio_add_anon_rmap_ptes
>>>>>>>>> -> __folio_add_anon_rmap
>>>>>>>>>
>>>>>>>>> __folio_add_anon_rmap() only needs to handle the cases
>>>>>>>>> folio_test_anon(folio)==true now.
>>>>>>>>
>>>>>>>> My team reported a case where swapoff() is calling
>>>>>>>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
>>>>>>>> with one new anon (!folio_test_anon(folio)).
>>>>>>>>
>>>>>>>> I will double check all folio_add_anon_rmap_pte() cases.
>>>>>>>
>>>>>>> Right, swapoff() path is a bit special (e.g., we don't do any kind of
>>>>>>> batching on the swapoff() path).
>>>>>>>
>>>>>>> But we should never get a new large anon folio there, or could that now
>>>>>>> happen?
>>>>>>
>>>>>> My team encountered the following issue while testing this RFC
>>>>>> series on real hardware. Let me take a look to identify the
>>>>>> problem.
>>>>>>
>>>>>> [...]
>>>>>
>>>>> Ah, yes. in unuse_pte(), you'll have to do the right thing if
>>>>> !folio_test(anon) before doing the folio_add_anon_rmap_pte().
>>>>>
>>>>> You might have a fresh anon folio in the swapcache that was never mapped
>>>>> (hopefully order-0, otherwise we'd likely be in trouble).
>>>>
>>>> Yes. It is order-0
>>>>
>>>> [ 261.214260][T11285] page last allocated via order 0, migratetype
>>>>
>>>> Otherwise, we would have encountered this VM_WARN_ON_FOLIO?
>>>>
>>>> __folio_add_anon_rmap()
>>>> {
>>>> ...
>>>> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
>>>> level != RMAP_LEVEL_PMD, folio);
>>>> ...
>>>> }
>>>>
>>>> Given that nobody has ever reported this warning, I assume all callers
>>>> using folio_add_anon_rmap_pte(s) are right now safe to move to ?
>>>
>>> Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
>>> small folio if !anon. If that ever changes, we can assess the situation.
>>>
>>
>> this patch actually has a WARN_ON for all !anon, so it extends the WARN
>> to small.

Not what I mean, see below:

>>
>> - if (unlikely(!folio_test_anon(folio))) {
>> - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> - /*
>> - * For a PTE-mapped large folio, we only know that the single
>> - * PTE is exclusive. Further, __folio_set_anon() might not get
>> - * folio->index right when not given the address of the head
>> - * page.
>> - */
>> - VM_WARN_ON_FOLIO(folio_test_large(folio) &&
>> - level != RMAP_LEVEL_PMD, folio);
>> - __folio_set_anon(folio, vma, address,
>> - !!(flags & RMAP_EXCLUSIVE));
>> - } else if (likely(!folio_test_ksm(folio))) {
>> + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
>> +
>>
>>> Only swap created "new" anon folios without properly calling the right
>>> function so far, all other code handles that correctly.
>
> as I assume patch2/3 should have moved all !anon to
> folio_add_new_anon_rmap()
>
> diff --git a/mm/ksm.c b/mm/ksm.c
> index d2641bc2efc9..c913ba8c37eb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1420,7 +1420,14 @@ static int replace_page(struct vm_area_struct
> *vma, struct page *page,
> */
> if (!is_zero_pfn(page_to_pfn(kpage))) {
> folio_get(kfolio);
> - folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> + /*
> + * We currently ensure that new folios cannot be partially
> + * exclusive: they are either fully exclusive or fully shared.
> + */
> + if (!folio_test_anon(kfolio))
> + folio_add_new_anon_rmap(kfolio, vma, addr, RMAP_NONE);
> + else
> > + folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);

I don't think that is required? We are only working with anon folios. Or
were you able to trigger this? (which would be weird)

[...]

> - folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
> + /*
> + * We currently ensure that new folios cannot be partially
> + * exclusive: they are either fully exclusive or fully shared.
> + */
> + if (!folio_test_anon(folio))

Here I suggest changing the comment to (and adding the VM_WARN_ON_ONCE):

/*
* We currently only expect small !anon folios, for which we know that
* they are either fully exclusive or fully shared. If we ever get large
* folios here, we have to be careful.
*/
if (!folio_test_anon(folio)) {
VM_WARN_ON_ONCE(folio_test_large(folio));
folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
} else {
...


--
Cheers,

David / dhildenb


2024-06-14 09:34:06

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Fri, Jun 14, 2024 at 9:04 PM David Hildenbrand <[email protected]> wrote:
>
> On 14.06.24 10:58, Barry Song wrote:
> > On Fri, Jun 14, 2024 at 8:56 PM Barry Song <[email protected]> wrote:
> >>
> >> On Fri, Jun 14, 2024 at 8:51 PM David Hildenbrand <[email protected]> wrote:
> >>>
> >>> On 14.06.24 09:56, Barry Song wrote:
> >>>> On Thu, Jun 13, 2024 at 9:12 PM David Hildenbrand <[email protected]> wrote:
> >>>>>
> >>>>> On 13.06.24 11:06, Barry Song wrote:
> >>>>>> On Thu, Jun 13, 2024 at 8:49 PM David Hildenbrand <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On 13.06.24 10:46, Barry Song wrote:
> >>>>>>>> On Thu, Jun 13, 2024 at 12:08 PM Barry Song <[email protected]> wrote:
> >>>>>>>>>
> >>>>>>>>> From: Barry Song <[email protected]>
> >>>>>>>>>
> >>>>>>>>> The folio_test_anon(folio)==false case within do_swap_page() has been
> >>>>>>>>> relocated to folio_add_new_anon_rmap(). Additionally, two other callers
> >>>>>>>>> consistently pass anonymous folios.
> >>>>>>>>>
> >>>>>>>>> stack 1:
> >>>>>>>>> remove_migration_pmd
> >>>>>>>>> -> folio_add_anon_rmap_pmd
> >>>>>>>>> -> __folio_add_anon_rmap
> >>>>>>>>>
> >>>>>>>>> stack 2:
> >>>>>>>>> __split_huge_pmd_locked
> >>>>>>>>> -> folio_add_anon_rmap_ptes
> >>>>>>>>> -> __folio_add_anon_rmap
> >>>>>>>>>
> >>>>>>>>> __folio_add_anon_rmap() only needs to handle the cases
> >>>>>>>>> folio_test_anon(folio)==true now.
> >>>>>>>>
> >>>>>>>> My team reported a case where swapoff() is calling
> >>>>>>>> folio_add_anon_rmap_pte *not* folio_add_anon_rmap_ptes
> >>>>>>>> with one new anon (!folio_test_anon(folio)).
> >>>>>>>>
> >>>>>>>> I will double check all folio_add_anon_rmap_pte() cases.
> >>>>>>>
> >>>>>>> Right, swapoff() path is a bit special (e.g., we don't do any kind of
> >>>>>>> batching on the swapoff() path).
> >>>>>>>
> >>>>>>> But we should never get a new large anon folio there, or could that now
> >>>>>>> happen?
> >>>>>>
> >>>>>> My team encountered the following issue while testing this RFC
> >>>>>> series on real hardware. Let me take a look to identify the
> >>>>>> problem.
> >>>>>>
> >>>>>> [...]
> >>>>>
> >>>>> Ah, yes. in unuse_pte(), you'll have to do the right thing if
> >>>>> !folio_test(anon) before doing the folio_add_anon_rmap_pte().
> >>>>>
> >>>>> You might have a fresh anon folio in the swapcache that was never mapped
> >>>>> (hopefully order-0, otherwise we'd likely be in trouble).
> >>>>
> >>>> Yes. It is order-0
> >>>>
> >>>> [ 261.214260][T11285] page last allocated via order 0, migratetype
> >>>>
> >>>> Otherwise, we would have encountered this VM_WARN_ON_FOLIO?
> >>>>
> >>>> __folio_add_anon_rmap()
> >>>> {
> >>>> ...
> >>>> VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> >>>> level != RMAP_LEVEL_PMD, folio);
> >>>> ...
> >>>> }
> >>>>
> >>>> Given that nobody has ever reported this warning, I assume all callers
> >>>> using folio_add_anon_rmap_pte(s) are right now safe to move to ?
> >>>
> >>> Yes, and we should likely add a VM_WARN_ON_ONCE() here that we have a
> >>> small folio if !anon. If that ever changes, we can assess the situation.
> >>>
> >>
> >> this patch actually has a WARN_ON for all !anon, so it extends the WARN
> >> to small.
>
> Not what I mean, see below:
>
> >>
> >> - if (unlikely(!folio_test_anon(folio))) {
> >> - VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
> >> - /*
> >> - * For a PTE-mapped large folio, we only know that the single
> >> - * PTE is exclusive. Further, __folio_set_anon() might not get
> >> - * folio->index right when not given the address of the head
> >> - * page.
> >> - */
> >> - VM_WARN_ON_FOLIO(folio_test_large(folio) &&
> >> - level != RMAP_LEVEL_PMD, folio);
> >> - __folio_set_anon(folio, vma, address,
> >> - !!(flags & RMAP_EXCLUSIVE));
> >> - } else if (likely(!folio_test_ksm(folio))) {
> >> + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
> >> +
> >>
> >>> Only swap created "new" anon folios without properly calling the right
> >>> function so far, all other code handles that correctly.
> >
> > as I assume patch2/3 should have moved all !anon to
> > folio_add_new_anon_rmap()
> >
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index d2641bc2efc9..c913ba8c37eb 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -1420,7 +1420,14 @@ static int replace_page(struct vm_area_struct
> > *vma, struct page *page,
> > */
> > if (!is_zero_pfn(page_to_pfn(kpage))) {
> > folio_get(kfolio);
> > - folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
> > + /*
> > + * We currently ensure that new folios cannot be partially
> > + * exclusive: they are either fully exclusive or fully shared.
> > + */
> > + if (!folio_test_anon(kfolio))
> > + folio_add_new_anon_rmap(kfolio, vma, addr, RMAP_NONE);
> > + else
> > > + folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
>
> I don't think that is required? We are only working with anon folios. Or
> were you able to trigger this? (which would be weird)

I didn't trigger this, but I am not sure if kfolio is always anon based on
the code context.

For page, it is 100% anon (otherwise "goto out"), but I am not quite sure
about kpage from the code context.

static int try_to_merge_one_page(struct vm_area_struct *vma,
struct page *page, struct page *kpage)
{
pte_t orig_pte = __pte(0);
int err = -EFAULT;

if (page == kpage) /* ksm page forked */
return 0;

if (!PageAnon(page))
goto out;
....
}

Then I saw this

static int replace_page(struct vm_area_struct *vma, struct page *page,
struct page *kpage, pte_t orig_pte)
{
...
VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
kfolio);
}

If kfolio is always anon, we should have used
VM_BUG_ON_FOLIO(PageAnonExclusive(kpage), kfolio)
just like
VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
without "folio_test_anon(kfolio)".

So I lost my way.

>
> [...]
>
> > - folio_add_anon_rmap_pte(folio, page, vma, addr, rmap_flags);
> > + /*
> > + * We currently ensure that new folios cannot be partially
> > + * exclusive: they are either fully exclusive or fully shared.
> > + */
> > + if (!folio_test_anon(folio))
>
> Here I suggest changing the comment to (and adding the VM_WARN_ON_ONCE):
>
> /*
> * We currently only expect small !anon folios, for which we know that
> * they are either fully exclusive or fully shared. If we ever get large
> * folios here, we have to be careful.
> */
> if (!folio_test_anon(folio)) {
> VM_WARN_ON_ONCE(folio_test_large(folio));
> folio_add_new_anon_rmap(folio, vma, addr, rmap_flags);
> } else {
> ...

looks good to me.

>
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry

2024-06-14 11:10:17

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

>> I don't think that is required? We are only working with anon folios. Or
>> were you able to trigger this? (which would be weird)
>
> I didn't trigger this, but I am not sure if kfolio is always anon based on
> the code context.
>
> For page, it is 100% anon (otherwise "goto out"), but I am not quite sure
> about kpage from the code context.
>
> static int try_to_merge_one_page(struct vm_area_struct *vma,
> struct page *page, struct page *kpage)
> {
> pte_t orig_pte = __pte(0);
> int err = -EFAULT;
>
> if (page == kpage) /* ksm page forked */
> return 0;
>
> if (!PageAnon(page))
> goto out;
> ....
> }
>
> Then I saw this
>
> static int replace_page(struct vm_area_struct *vma, struct page *page,
> struct page *kpage, pte_t orig_pte)
> {
> ...
> VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
> VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
> kfolio);
> }
>
> If kfolio is always anon, we should have used
> VM_BUG_ON_FOLIO(PageAnonExclusive(kpage), folio)
> just like
> VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
> without "folio_test_anon(kfolio)".
>
> So I lost my way.

try_to_merge_one_page() is either called with a KSM page (anon) from
try_to_merge_with_ksm_page() or with the shared zeropage (!anon and must
never become anon) from cmp_and_merge_page().

So in replace_page(), we either have a KSM/anon page or the shared
zeropage.

We never updated the documentation of replace_page() to spell out that
"kpage" can also be the shared zeropage.

Note how replace_page() does not call folio_add_anon_rmap_pte() for the
shared zeropage.

If we had to craft a new anon page there, things would be going terribly
wrong.

So no, this (!anon -> anon) must not happen, and if it were to happen,
it would be a real bug, and your check in folio_add_anon_rmap_pte()
would catch it.
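
Put differently, the existing replace_page() flow looks roughly like the
sketch below (trimmed; the non-zeropage branch is the one quoted earlier in
this thread, the zeropage branch is paraphrased from mm/ksm.c and omits the
dirty-bit/zero-page accounting details):

	if (!is_zero_pfn(page_to_pfn(kpage))) {
		/* KSM page: already anon, take an anon rmap reference */
		folio_get(kfolio);
		folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
		newpte = mk_pte(kpage, vma->vm_page_prot);
	} else {
		/* shared zeropage: mapped via a special pte, no anon rmap */
		newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
					       vma->vm_page_prot));
	}

so the !anon case never reaches folio_add_anon_rmap_pte() from here.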

--
Cheers,

David / dhildenb


2024-06-14 22:25:15

by Barry Song

[permalink] [raw]
Subject: Re: [PATCH RFC 3/3] mm: remove folio_test_anon(folio)==false path in __folio_add_anon_rmap()

On Fri, Jun 14, 2024 at 11:10 PM David Hildenbrand <[email protected]> wrote:
>
> >> I don't think that is required? We are only working with anon folios. Or
> >> were you able to trigger this? (which would be weird)
> >
> > I didn't trigger this, but I am not sure if kfolio is always anon based on
> > the code context.
> >
> > for page, it is 100% anon(otherwise "goto out"), but I am not quite
> > sure about kpage
> > by the code context.
> >
> > static int try_to_merge_one_page(struct vm_area_struct *vma,
> > struct page *page, struct page *kpage)
> > {
> > pte_t orig_pte = __pte(0);
> > int err = -EFAULT;
> >
> > if (page == kpage) /* ksm page forked */
> > return 0;
> >
> > if (!PageAnon(page))
> > goto out;
> > ....
> > }
> >
> > Then I saw this
> >
> > static int replace_page(struct vm_area_struct *vma, struct page *page,
> > struct page *kpage, pte_t orig_pte)
> > {
> > ...
> > VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
> > VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
> > kfolio);
> > }
> >
> > If kfolio is always anon, we should have used
> > VM_BUG_ON_FOLIO(PageAnonExclusive(kpage), folio)
> > just like
> > VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
> > without "folio_test_anon(kfolio)".
> >
> > So I lost my way.
>
> try_to_merge_one_page() is either called with a KSM page (anon) from
> try_to_merge_with_ksm_page() or with the shared zeropage (!anon and must
> never become anon) from cmp_and_merge_page().
>
> So in replace_page(), we either have an ksm/anon page or the shared
> zeropage.
>
> We never updated the documentation of replace_page() to spell out that
> "kpage" can also be the shared zeropage.
>
> Note how replace_page() calls folio_add_anon_rmap_pte() not for the
> shared zeropage.
>
> If we would have to craft a new anon page things would be going terribly
> wrong.
>
> So not, this (!anon -> anon) must not happen and if it were to happen,
> it would be a real bug and your check in folio_add_anon_rmap_pte()
> would catch it.

Thanks very much for the explanation. I wonder if the below can help
improve the doc. If yes, I may add a separate patch for it in v2.

diff --git a/mm/ksm.c b/mm/ksm.c
index d2641bc2efc9..56b10265e617 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1411,14 +1411,13 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
goto out_mn;
}
VM_BUG_ON_PAGE(PageAnonExclusive(page), page);
- VM_BUG_ON_FOLIO(folio_test_anon(kfolio) && PageAnonExclusive(kpage),
- kfolio);
-
/*
* No need to check ksm_use_zero_pages here: we can only have a
* zero_page here if ksm_use_zero_pages was enabled already.
*/
if (!is_zero_pfn(page_to_pfn(kpage))) {
+ VM_BUG_ON_FOLIO(!folio_test_anon(kfolio) || PageAnonExclusive(kpage),
+ kfolio);
folio_get(kfolio);
folio_add_anon_rmap_pte(kfolio, kpage, vma, addr, RMAP_NONE);
newpte = mk_pte(kpage, vma->vm_page_prot);

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry