2022-08-11 17:16:07

by Peter Xu

Subject: [PATCH v4 0/7] mm: Remember a/d bits for migration entries

v4:
- Added r-bs for Ying
- Some cosmetic changes here and there [Ying]
- Fix smaps to only dump PFN for pfn swap entries for both pte/pmd [Ying]
- Remove max_swapfile_size(), export swapfile_maximum_size variable [Ying]
- In migrate_vma_collect_pmd() only read A/D if pte_present()

rfc: https://lore.kernel.org/all/[email protected]
v1: https://lore.kernel.org/all/[email protected]
v2: https://lore.kernel.org/all/[email protected]
v3: https://lore.kernel.org/all/[email protected]

Problem
=======

When migrating a page, right now we always mark the migrated page as old &
clean.

However, that could lead to at least two problems:

(1) We lose the real hot/cold information that we could have preserved.
That information shouldn't change even if the backing page is changed
after the migration.

(2) There is always extra overhead on the immediate next access to any
migrated page, because the hardware MMU needs cycles to set the young
bit again for reads, and the dirty bit for writes, as long as the
hardware MMU supports these bits.

Many of the recent upstream works have shown that (2) is not trivial and is
actually very measurable. In my test case, reading a 1G chunk of memory
- jumping in page-size intervals - could take 99ms just because of the
extra work to set the young bit on a generic x86_64 system, compared to 4ms
if the young bit is already set.

This issue was originally reported by Andrea Arcangeli.

Solution
========

To solve this problem, this patchset tries to remember the young/dirty bits
in the migration entries and carry them over when recovering the ptes.

We have the chance to do so because on many systems the swap offset is not
really fully used. Migration entries use the swp offset to store the PFN
only, while the PFN is normally not as large as the swp offset allows. It
means we do have some free bits in the swp offset that we can use to store
things like the A/D bits, and that's how this series approaches the
problem.

max_swapfile_size() is used here to detect the per-arch offset width in swp
entries. We'll automatically remember the A/D bits when we find that the
swp offset field is large enough to keep both the PFN and the extra bits.

Since max_swapfile_size() can be slow, the last two patches cache its
result and also cache swap_migration_ad_supported as a whole.
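
To illustrate the idea, here is a simplified sketch of that check (patch 5
adds the real version as migration_entry_supports_ad() in swapops.h; this
sketch drops the !CONFIG_SWAP fallback):

static inline bool migration_entry_supports_ad(void)
{
        /*
         * max_swapfile_size() returns the max supported swp-offset plus 1,
         * so the A/D bits can only be carried over when the arch swp
         * offset is wide enough to hold the PFN plus the two extra bits.
         */
        return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
}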

Known Issues / TODOs
====================

We still haven't taught madvise() to recognize the new A/D bits in
migration entries, namely for MADV_COLD/MADV_FREE. E.g., when MADV_COLD
hits a migration entry, it's not clear yet whether we should clear the A
bit or just drop the entry directly.

We also didn't teach idle page tracking about the new migration entries,
because that would need a larger rework of the rmap pgtable walk. However,
things should already be better, because before this patchset a page would
always be old after migration, so the series fixes a potential false
negative in idle page tracking when pages are migrated before being
observed.

The other thing is that the migration A/D bits will not start working for
private device swap entries. The code is there for completeness, but since
private device swap entries do not yet have fields to store the A/D bits,
even if we persist A/D across a present pte switching to a migration entry,
we'll lose them again when the migration entry is converted to a private
device swap entry.

Tests
=====

After the patchset is applied, the immediate read access test [1] of the
above 1G chunk after migration shrinks from 99ms to 4ms. The test is done
by moving 1G of pages from node 0->1->0 and then reading it in page-size
jumps. The test is on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.

A similar effect can also be measured when writing the memory for the first
time after migration.

After applying the patchset, both the initial read and write immediately
after a page is migrated perform similarly to before the migration
happened.
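
For reference, the measured access pattern is roughly the following (a
minimal userspace sketch approximating the swap-young program at [1], not a
copy of it; the buffer size and the elided migration step are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        size_t size = 1UL << 30, psize = sysconf(_SC_PAGESIZE), i;
        char *buf = malloc(size);
        volatile char c;
        struct timespec t1, t2;

        memset(buf, 1, size);   /* populate and dirty every page */

        /* ... migrate the buffer node 0 -> 1 -> 0 here, e.g. move_pages(2) ... */

        clock_gettime(CLOCK_MONOTONIC, &t1);
        for (i = 0; i < size; i += psize)
                c = buf[i];     /* one read per page, page-size jumps */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("read took %ld us\n",
               (t2.tv_sec - t1.tv_sec) * 1000000L +
               (t2.tv_nsec - t1.tv_nsec) / 1000);
        (void)c;
        free(buf);
        return 0;
}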

Patch Layout
============

Patch 1-2: Cleanups, either from previous versions or on the swapops.h macros.

Patch 3-4: Prepare for the introduction of the migration A/D bits.

Patch 5: The core patch to remember young/dirty bit in swap offsets.

Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.

Please review, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c

Peter Xu (7):
mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
mm/swap: Comment all the ifdef in swapops.h
mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
mm/thp: Carry over dirty bit when thp splits on pmd
mm: Remember young/dirty bit for page migrations
mm/swap: Cache maximum swapfile size when init swap
mm/swap: Cache swap migration A/D bits support

arch/arm64/mm/hugetlbpage.c | 2 +-
arch/x86/include/asm/pgtable-3level.h | 8 +-
arch/x86/mm/init.c | 2 +-
fs/proc/task_mmu.c | 20 +++-
include/linux/swapfile.h | 5 +-
include/linux/swapops.h | 145 +++++++++++++++++++++++---
mm/hmm.c | 2 +-
mm/huge_memory.c | 27 ++++-
mm/memory-failure.c | 2 +-
mm/migrate.c | 6 +-
mm/migrate_device.c | 6 ++
mm/page_vma_mapped.c | 6 +-
mm/rmap.c | 5 +-
mm/swapfile.c | 15 ++-
14 files changed, 214 insertions(+), 37 deletions(-)

--
2.32.0


2022-08-11 17:16:26

by Peter Xu

Subject: [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations

When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, marking the page as old in the new page table using
either pte_mkold() or pmd_mkold(), and keeping the pte clean.

That's fine functionally, but it's not friendly to page reclaim because the
page being moved can be actively accessed during the procedure. Not to
mention that having the hardware set the young bit can bring quite some
overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds to
set the bit. The same slowdown applies to the dirty bit when the memory is
first written after page migration happened.

Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated. To achieve it, define a new set of
bits in the migration swap offset field to cache the A/D bits of the old
pte. Then when removing/recovering the migration entry, we can recover the
A/D bits even if the page has changed.

One thing to mention is that here we use max_swapfile_size() to detect how
many swp offset bits we have, and we'll only enable this feature if we know
the swp offset is big enough to store both the PFN value and the A/D bits.
Otherwise the A/D bits are dropped as before.

Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 18 +++++++-
mm/migrate.c | 6 ++-
mm/migrate_device.c | 6 +++
mm/rmap.c | 5 ++-
5 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index bd4c6f0c2103..36e462e116af 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,10 @@

#ifdef CONFIG_MMU

+#ifdef CONFIG_SWAP
+#include <linux/swapfile.h>
+#endif /* CONFIG_SWAP */
+
/*
* swapcache pages are stored in the swapper_space radix tree. We want to
* get good packing density in that tree, so the index should be dense in
@@ -35,6 +39,31 @@
#endif /* MAX_PHYSMEM_BITS */
#define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)

+/**
+ * Migration swap entry specific bitfield definitions. Layout:
+ *
+ * |----------+--------------------|
+ * | swp_type | swp_offset |
+ * |----------+--------+-+-+-------|
+ * | | resv |D|A| PFN |
+ * |----------+--------+-+-+-------|
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set (bit A)
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set (bit D)
+ *
+ * Note: A/D bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset. By default we'll ignore A/D bits
+ * when migrating a page. Please refer to migration_entry_supports_ad()
+ * for more information. If there're more bits besides PFN and A/D bits,
+ * they should be reserved and always be zeros.
+ */
+#define SWP_MIG_YOUNG_BIT (SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT (SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS (SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG BIT(SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY BIT(SWP_MIG_DIRTY_BIT)
+
static inline bool is_pfn_swap_entry(swp_entry_t entry);

/* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +294,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
return swp_entry(SWP_MIGRATION_WRITE, offset);
}

+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations. The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+ /*
+ * max_swapfile_size() returns the max supported swp-offset plus 1.
+ * We can support the migration A/D bits iff the pfn swap entry has
+ * the offset large enough to cover all of them (PFN, A & D bits).
+ */
+#ifdef CONFIG_SWAP
+ return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else /* CONFIG_SWAP */
+ return false;
+#endif /* CONFIG_SWAP */
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_entry(swp_type(entry),
+ swp_offset(entry) | SWP_MIG_YOUNG);
+ return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_offset(entry) & SWP_MIG_YOUNG;
+ /* Keep the old behavior of aging page after migration */
+ return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_entry(swp_type(entry),
+ swp_offset(entry) | SWP_MIG_DIRTY);
+ return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_offset(entry) & SWP_MIG_DIRTY;
+ /* Keep the old behavior of clean page after migration */
+ return false;
+}
+
extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl);
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +391,25 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
return 0;
}

+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+ return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+ return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+ return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+ return false;
+}
#endif /* CONFIG_MIGRATION */

typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2f68e034ddec..ac858fd9c1f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2111,7 +2111,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = is_writable_migration_entry(entry);
if (PageAnon(page))
anon_exclusive = is_readable_exclusive_migration_entry(entry);
- young = false;
+ young = is_migration_entry_young(entry);
+ dirty = is_migration_entry_dirty(entry);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
@@ -2171,6 +2172,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
else
swp_entry = make_readable_migration_entry(
page_to_pfn(page + i));
+ if (young)
+ swp_entry = make_migration_entry_young(swp_entry);
+ if (dirty)
+ swp_entry = make_migration_entry_dirty(swp_entry);
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
@@ -3180,6 +3185,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
else
entry = make_readable_migration_entry(page_to_pfn(page));
+ if (pmd_young(pmdval))
+ entry = make_migration_entry_young(entry);
+ if (pmd_dirty(pmdval))
+ entry = make_migration_entry_dirty(entry);
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3205,13 +3214,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)

entry = pmd_to_swp_entry(*pvmw->pmd);
get_page(new);
- pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+ pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (is_writable_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
if (pmd_swp_uffd_wp(*pvmw->pmd))
pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+ if (!is_migration_entry_young(entry))
+ pmde = pmd_mkold(pmde);
+ /* NOTE: this may contain setting soft-dirty on some archs */
+ if (PageDirty(new) && is_migration_entry_dirty(entry))
+ pmde = pmd_mkdirty(pmde);

if (PageAnon(new)) {
rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..0433a71d2bee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -198,7 +198,7 @@ static bool remove_migration_pte(struct folio *folio,
#endif

folio_get(folio);
- pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+ pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
if (pte_swp_soft_dirty(*pvmw.pte))
pte = pte_mksoft_dirty(pte);

@@ -206,6 +206,10 @@ static bool remove_migration_pte(struct folio *folio,
* Recheck VMA as permissions can change since migration started
*/
entry = pte_to_swp_entry(*pvmw.pte);
+ if (!is_migration_entry_young(entry))
+ pte = pte_mkold(pte);
+ if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+ pte = pte_mkdirty(pte);
if (is_writable_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..e450b318b01b 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
else
entry = make_readable_migration_entry(
page_to_pfn(page));
+ if (pte_present(pte)) {
+ if (pte_young(pte))
+ entry = make_migration_entry_young(entry);
+ if (pte_dirty(pte))
+ entry = make_migration_entry_dirty(entry);
+ }
swp_pte = swp_entry_to_pte(entry);
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
else
entry = make_readable_migration_entry(
page_to_pfn(subpage));
-
+ if (pte_young(pteval))
+ entry = make_migration_entry_young(entry);
+ if (pte_dirty(pteval))
+ entry = make_migration_entry_dirty(entry);
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
--
2.32.0

2022-08-11 17:16:54

by Peter Xu

Subject: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry

We've got a bunch of special swap entries that store a PFN inside the swap
offset field. To fetch the PFN, normally the user just calls swp_offset(),
assuming that'll be the PFN.

Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
max possible length of a PFN on the host, meanwhile doing a proper check
against MAX_PHYSMEM_BITS to make sure the swap offset can always store the
PFN properly, using the BUILD_BUG_ON() in is_pfn_swap_entry().

One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN. In the meantime, this patch also prepares us
for the future possibility of storing more information inside the swp
offset field, so the assumption that "swp_offset(entry)" is the PFN will
very soon no longer hold.

Replace many of the swp_offset() callers with swp_offset_pfn() where
appropriate. Note that many of the existing users are not candidates for
the replacement, e.g.:

(1) when the swap entry is not a pfn swap entry at all, or,
(2) when we want to keep the whole swp_offset but only change the swp type.

The latter can happen when fork() is triggered upon a write-migration swap
entry pte: we may want to only change the migration type from write->read
but keep the rest, so it's not "fetching the PFN" but "changing the swap
type only". Those cases are left aside so that when there is more
information within the swp offset it will be carried over naturally.

While at it, drop hwpoison_entry_to_pfn() because that's exactly what the
new swp_offset_pfn() is about.

Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/mm/hugetlbpage.c | 2 +-
fs/proc/task_mmu.c | 20 +++++++++++++++++---
include/linux/swapops.h | 35 +++++++++++++++++++++++++++++------
mm/hmm.c | 2 +-
mm/memory-failure.c | 2 +-
mm/page_vma_mapped.c | 6 +++---
6 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 0795028f017c..35e9a468d13e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));

- return page_folio(pfn_to_page(swp_offset(entry)));
+ return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}

void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index d56c65f98d00..b3e79128fca0 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1419,9 +1419,19 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
if (pte_swp_uffd_wp(pte))
flags |= PM_UFFD_WP;
entry = pte_to_swp_entry(pte);
- if (pm->show_pfn)
+ if (pm->show_pfn) {
+ pgoff_t offset;
+ /*
+ * For PFN swap offsets, keeping the offset field
+ * to be PFN only to be compatible with old smaps.
+ */
+ if (is_pfn_swap_entry(entry))
+ offset = swp_offset_pfn(entry);
+ else
+ offset = swp_offset(entry);
frame = swp_type(entry) |
- (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
+ (offset << MAX_SWAPFILES_SHIFT);
+ }
flags |= PM_SWAP;
migration = is_migration_entry(entry);
if (is_pfn_swap_entry(entry))
@@ -1478,7 +1488,11 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
unsigned long offset;

if (pm->show_pfn) {
- offset = swp_offset(entry) +
+ if (is_pfn_swap_entry(entry))
+ offset = swp_offset_pfn(entry);
+ else
+ offset = swp_offset(entry);
+ offset = offset +
((addr & ~PMD_MASK) >> PAGE_SHIFT);
frame = swp_type(entry) |
(offset << MAX_SWAPFILES_SHIFT);
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a2901ff4f1e..bd4c6f0c2103 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -23,6 +23,20 @@
#define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)

+/*
+ * Definitions only for PFN swap entries (see is_pfn_swap_entry()). To
+ * store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries
+ * can use the extra bits to store other information besides PFN.
+ */
+#ifdef MAX_PHYSMEM_BITS
+#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#else /* MAX_PHYSMEM_BITS */
+#define SWP_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
+#endif /* MAX_PHYSMEM_BITS */
+#define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)
+
+static inline bool is_pfn_swap_entry(swp_entry_t entry);
+
/* Clear all flags but only keep swp_entry_t related information */
static inline pte_t pte_swp_clear_flags(pte_t pte)
{
@@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
return entry.val & SWP_OFFSET_MASK;
}

+/*
+ * This should only be called upon a pfn swap entry to get the PFN stored
+ * in the swap entry. Please refers to is_pfn_swap_entry() for definition
+ * of pfn swap entry.
+ */
+static inline unsigned long swp_offset_pfn(swp_entry_t entry)
+{
+ VM_BUG_ON(!is_pfn_swap_entry(entry));
+ return swp_offset(entry) & SWP_PFN_MASK;
+}
+
/* check whether a pte points to a swap entry */
static inline int is_swap_pte(pte_t pte)
{
@@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte)

static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
{
- struct page *p = pfn_to_page(swp_offset(entry));
+ struct page *p = pfn_to_page(swp_offset_pfn(entry));

/*
* Any use of migration entries may only occur while the
@@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
*/
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
+ /* Make sure the swp offset can always store the needed fields */
+ BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+
return is_migration_entry(entry) || is_device_private_entry(entry) ||
is_device_exclusive_entry(entry);
}
@@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
}

-static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
-{
- return swp_offset(entry);
-}
-
static inline void num_poisoned_pages_inc(void)
{
atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index f2aa63b94d9b..3850fb625dda 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = swp_offset(entry) | cpu_flags;
+ *hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
return 0;
}

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0dfed9d7b273..e48f6f6a259d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
swp_entry_t swp = pte_to_swp_entry(pte);

if (is_hwpoison_entry(swp))
- pfn = hwpoison_entry_to_pfn(swp);
+ pfn = swp_offset_pfn(swp);
}

if (!pfn || pfn != poisoned_pfn)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535a..93e13fc17d3c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
!is_device_exclusive_entry(entry))
return false;

- pfn = swp_offset(entry);
+ pfn = swp_offset_pfn(entry);
} else if (is_swap_pte(*pvmw->pte)) {
swp_entry_t entry;

@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
!is_device_exclusive_entry(entry))
return false;

- pfn = swp_offset(entry);
+ pfn = swp_offset_pfn(entry);
} else {
if (!pte_present(*pvmw->pte))
return false;
@@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
entry = pmd_to_swp_entry(pmde);
if (!is_migration_entry(entry) ||
- !check_pmd(swp_offset(entry), pvmw))
+ !check_pmd(swp_offset_pfn(entry), pvmw))
return not_found(pvmw);
return true;
}
--
2.32.0

2022-08-11 17:36:53

by Peter Xu

Subject: [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap

We used to have max_swapfile_size() fetching the maximum supported
swapfile size per arch.

As the number of max_swapfile_size() callers grows, this patch introduces a
variable "swapfile_maximum_size" to cache the value of the old
max_swapfile_size(), so that we don't need to calculate it every time.

Caching the value in swapfile_init() is safe because by the time we reach
that phase we should have initialized all the relevant information. Here
the major arch to take care of is x86, which defines the max swapfile size
based on the L1TF mitigation.

Here both X86_BUG_L1TF and l1tf_mitigation should have been set up properly
by the time swapfile_init() is reached. As a reference, the code path looks
like this for x86:

- start_kernel
  - setup_arch
    - early_cpu_init
      - early_identify_cpu --> setup X86_BUG_L1TF
    - parse_early_param
      - l1tf_cmdline --> set l1tf_mitigation
  - check_bugs
    - l1tf_select_mitigation --> set l1tf_mitigation
  - arch_call_rest_init
    - rest_init
      - kernel_init
        - kernel_init_freeable
          - do_basic_setup
            - do_initcalls --> calls swapfile_init() (initcall level 4)

On non-x86 archs the swapfile size only depends on the swp pte format, so
caching it is safe there too.

While at it, rename max_swapfile_size() to arch_max_swapfile_size(),
because an arch can define its own version, so it's more straightforward to
have "arch_" as its prefix. In the meantime, export swapfile_maximum_size
to replace the old usages of max_swapfile_size().

Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/mm/init.c | 2 +-
include/linux/swapfile.h | 3 ++-
include/linux/swapops.h | 2 +-
mm/swapfile.c | 7 +++++--
4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 82a042c03824..9121bc1b9453 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
}

#ifdef CONFIG_SWAP
-unsigned long max_swapfile_size(void)
+unsigned long arch_max_swapfile_size(void)
{
unsigned long pages;

diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 54078542134c..165e0bd04862 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -8,6 +8,7 @@
*/
extern struct swap_info_struct *swap_info[];
extern unsigned long generic_max_swapfile_size(void);
-extern unsigned long max_swapfile_size(void);
+/* Maximum swapfile size supported for the arch (not inclusive). */
+extern unsigned long swapfile_maximum_size;

#endif /* _LINUX_SWAPFILE_H */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 36e462e116af..f25b566643f1 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -307,7 +307,7 @@ static inline bool migration_entry_supports_ad(void)
* the offset large enough to cover all of them (PFN, A & D bits).
*/
#ifdef CONFIG_SWAP
- return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+ return swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS);
#else /* CONFIG_SWAP */
return false;
#endif /* CONFIG_SWAP */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fdccd2f1422..3cc64399df44 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
static int least_priority = -1;
+unsigned long swapfile_maximum_size;

static const char Bad_file[] = "Bad swap file entry ";
static const char Unused_file[] = "Unused swap file entry ";
@@ -2816,7 +2817,7 @@ unsigned long generic_max_swapfile_size(void)
}

/* Can be overridden by an architecture for additional checks. */
-__weak unsigned long max_swapfile_size(void)
+__weak unsigned long arch_max_swapfile_size(void)
{
return generic_max_swapfile_size();
}
@@ -2856,7 +2857,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
p->cluster_next = 1;
p->cluster_nr = 0;

- maxpages = max_swapfile_size();
+ maxpages = swapfile_maximum_size;
last_page = swap_header->info.last_page;
if (!last_page) {
pr_warn("Empty swap-file\n");
@@ -3677,6 +3678,8 @@ static int __init swapfile_init(void)
for_each_node(nid)
plist_head_init(&swap_avail_heads[nid]);

+ swapfile_maximum_size = arch_max_swapfile_size();
+
return 0;
}
subsys_initcall(swapfile_init);
--
2.32.0

2022-08-12 02:51:44

by Huang, Ying

Subject: Re: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry

Peter Xu <[email protected]> writes:

> We've got a bunch of special swap entries that stores PFN inside the swap
> offset fields. To fetch the PFN, normally the user just calls swp_offset()
> assuming that'll be the PFN.
>
> Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
> max possible length of a PFN on the host, meanwhile doing proper check with
> MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs
> properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
>
> One reason to do so is we never tried to sanitize whether swap offset can
> really fit for storing PFN. At the meantime, this patch also prepares us
> with the future possibility to store more information inside the swp offset
> field, so assuming "swp_offset(entry)" to be the PFN will not stand any
> more very soon.
>
> Replace many of the swp_offset() callers to use swp_offset_pfn() where
> proper. Note that many of the existing users are not candidates for the
> replacement, e.g.:
>
> (1) When the swap entry is not a pfn swap entry at all, or,
> (2) when we wanna keep the whole swp_offset but only change the swp type.
>
> For the latter, it can happen when fork() triggered on a write-migration
> swap entry pte, we may want to only change the migration type from
> write->read but keep the rest, so it's not "fetching PFN" but "changing
> swap type only". They're left aside so that when there're more information
> within the swp offset they'll be carried over naturally in those cases.
>
> Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
> the new swp_offset_pfn() is about.
>
> Signed-off-by: Peter Xu <[email protected]>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <[email protected]>

> ---
> arch/arm64/mm/hugetlbpage.c | 2 +-
> fs/proc/task_mmu.c | 20 +++++++++++++++++---
> include/linux/swapops.h | 35 +++++++++++++++++++++++++++++------
> mm/hmm.c | 2 +-
> mm/memory-failure.c | 2 +-
> mm/page_vma_mapped.c | 6 +++---
> 6 files changed, 52 insertions(+), 15 deletions(-)
>
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index 0795028f017c..35e9a468d13e 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
> {
> VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
>
> - return page_folio(pfn_to_page(swp_offset(entry)));
> + return page_folio(pfn_to_page(swp_offset_pfn(entry)));
> }
>
> void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index d56c65f98d00..b3e79128fca0 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1419,9 +1419,19 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
> if (pte_swp_uffd_wp(pte))
> flags |= PM_UFFD_WP;
> entry = pte_to_swp_entry(pte);
> - if (pm->show_pfn)
> + if (pm->show_pfn) {
> + pgoff_t offset;
> + /*
> + * For PFN swap offsets, keeping the offset field
> + * to be PFN only to be compatible with old smaps.
> + */
> + if (is_pfn_swap_entry(entry))
> + offset = swp_offset_pfn(entry);
> + else
> + offset = swp_offset(entry);
> frame = swp_type(entry) |
> - (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> + (offset << MAX_SWAPFILES_SHIFT);
> + }
> flags |= PM_SWAP;
> migration = is_migration_entry(entry);
> if (is_pfn_swap_entry(entry))
> @@ -1478,7 +1488,11 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
> unsigned long offset;
>
> if (pm->show_pfn) {
> - offset = swp_offset(entry) +
> + if (is_pfn_swap_entry(entry))
> + offset = swp_offset_pfn(entry);
> + else
> + offset = swp_offset(entry);
> + offset = offset +
> ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> frame = swp_type(entry) |
> (offset << MAX_SWAPFILES_SHIFT);
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 3a2901ff4f1e..bd4c6f0c2103 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -23,6 +23,20 @@
> #define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
> #define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)
>
> +/*
> + * Definitions only for PFN swap entries (see is_pfn_swap_entry()). To
> + * store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries
> + * can use the extra bits to store other information besides PFN.
> + */
> +#ifdef MAX_PHYSMEM_BITS
> +#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
> +#else /* MAX_PHYSMEM_BITS */
> +#define SWP_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
> +#endif /* MAX_PHYSMEM_BITS */
> +#define SWP_PFN_MASK (BIT(SWP_PFN_BITS) - 1)
> +
> +static inline bool is_pfn_swap_entry(swp_entry_t entry);
> +
> /* Clear all flags but only keep swp_entry_t related information */
> static inline pte_t pte_swp_clear_flags(pte_t pte)
> {
> @@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
> return entry.val & SWP_OFFSET_MASK;
> }
>
> +/*
> + * This should only be called upon a pfn swap entry to get the PFN stored
> + * in the swap entry. Please refers to is_pfn_swap_entry() for definition
> + * of pfn swap entry.
> + */
> +static inline unsigned long swp_offset_pfn(swp_entry_t entry)
> +{
> + VM_BUG_ON(!is_pfn_swap_entry(entry));
> + return swp_offset(entry) & SWP_PFN_MASK;
> +}
> +
> /* check whether a pte points to a swap entry */
> static inline int is_swap_pte(pte_t pte)
> {
> @@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte)
>
> static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> {
> - struct page *p = pfn_to_page(swp_offset(entry));
> + struct page *p = pfn_to_page(swp_offset_pfn(entry));
>
> /*
> * Any use of migration entries may only occur while the
> @@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
> */
> static inline bool is_pfn_swap_entry(swp_entry_t entry)
> {
> + /* Make sure the swp offset can always store the needed fields */
> + BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
> +
> return is_migration_entry(entry) || is_device_private_entry(entry) ||
> is_device_exclusive_entry(entry);
> }
> @@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
> return swp_type(entry) == SWP_HWPOISON;
> }
>
> -static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
> -{
> - return swp_offset(entry);
> -}
> -
> static inline void num_poisoned_pages_inc(void)
> {
> atomic_long_inc(&num_poisoned_pages);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f2aa63b94d9b..3850fb625dda 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> cpu_flags = HMM_PFN_VALID;
> if (is_writable_device_private_entry(entry))
> cpu_flags |= HMM_PFN_WRITE;
> - *hmm_pfn = swp_offset(entry) | cpu_flags;
> + *hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
> return 0;
> }
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 0dfed9d7b273..e48f6f6a259d 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
> swp_entry_t swp = pte_to_swp_entry(pte);
>
> if (is_hwpoison_entry(swp))
> - pfn = hwpoison_entry_to_pfn(swp);
> + pfn = swp_offset_pfn(swp);
> }
>
> if (!pfn || pfn != poisoned_pfn)
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 8e9e574d535a..93e13fc17d3c 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> !is_device_exclusive_entry(entry))
> return false;
>
> - pfn = swp_offset(entry);
> + pfn = swp_offset_pfn(entry);
> } else if (is_swap_pte(*pvmw->pte)) {
> swp_entry_t entry;
>
> @@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> !is_device_exclusive_entry(entry))
> return false;
>
> - pfn = swp_offset(entry);
> + pfn = swp_offset_pfn(entry);
> } else {
> if (!pte_present(*pvmw->pte))
> return false;
> @@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> return not_found(pvmw);
> entry = pmd_to_swp_entry(pmde);
> if (!is_migration_entry(entry) ||
> - !check_pmd(swp_offset(entry), pvmw))
> + !check_pmd(swp_offset_pfn(entry), pvmw))
> return not_found(pvmw);
> return true;
> }

2022-08-12 03:11:44

by Huang, Ying

Subject: Re: [PATCH v4 6/7] mm/swap: Cache maximum swapfile size when init swap

Peter Xu <[email protected]> writes:

> We used to have swapfile_maximum_size() fetching a maximum value of
> swapfile size per-arch.
>
> As the caller of max_swapfile_size() grows, this patch introduce a variable
> "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so
> that we don't need to calculate the value every time.
>
> Caching the value in swapfile_init() is safe because when reaching the
> phase we should have initialized all the relevant information. Here the
> major arch to take care of is x86, which defines the max swapfile size
> based on L1TF mitigation.
>
> Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
> when reaching swapfile_init(). As a reference, the code path looks like
> this for x86:
>
> - start_kernel
> - setup_arch
> - early_cpu_init
> - early_identify_cpu --> setup X86_BUG_L1TF
> - parse_early_param
> - l1tf_cmdline --> set l1tf_mitigation
> - check_bugs
> - l1tf_select_mitigation --> set l1tf_mitigation
> - arch_call_rest_init
> - rest_init
> - kernel_init
> - kernel_init_freeable
> - do_basic_setup
> - do_initcalls --> calls swapfile_init() (initcall level 4)
>
> The swapfile size only depends on swp pte format on non-x86 archs, so
> caching it is safe too.
>
> Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because
> arch can define its own function, so it's more straightforward to have
> "arch_" as its prefix. At the meantime, export swapfile_maximum_size to
> replace the old usages of max_swapfile_size().
>
> Signed-off-by: Peter Xu <[email protected]>

LGTM.

Reviewed-by: "Huang, Ying" <[email protected]>

Best Regards,
Huang, Ying

> ---
> arch/x86/mm/init.c | 2 +-
> include/linux/swapfile.h | 3 ++-
> include/linux/swapops.h | 2 +-
> mm/swapfile.c | 7 +++++--
> 4 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 82a042c03824..9121bc1b9453 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
> }
>
> #ifdef CONFIG_SWAP
> -unsigned long max_swapfile_size(void)
> +unsigned long arch_max_swapfile_size(void)
> {
> unsigned long pages;
>
> diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
> index 54078542134c..165e0bd04862 100644
> --- a/include/linux/swapfile.h
> +++ b/include/linux/swapfile.h
> @@ -8,6 +8,7 @@
> */
> extern struct swap_info_struct *swap_info[];
> extern unsigned long generic_max_swapfile_size(void);
> -extern unsigned long max_swapfile_size(void);
> +/* Maximum swapfile size supported for the arch (not inclusive). */
> +extern unsigned long swapfile_maximum_size;
>
> #endif /* _LINUX_SWAPFILE_H */
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 36e462e116af..f25b566643f1 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -307,7 +307,7 @@ static inline bool migration_entry_supports_ad(void)
> * the offset large enough to cover all of them (PFN, A & D bits).
> */
> #ifdef CONFIG_SWAP
> - return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> + return swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS);
> #else /* CONFIG_SWAP */
> return false;
> #endif /* CONFIG_SWAP */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1fdccd2f1422..3cc64399df44 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
> /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> long total_swap_pages;
> static int least_priority = -1;
> +unsigned long swapfile_maximum_size;
>
> static const char Bad_file[] = "Bad swap file entry ";
> static const char Unused_file[] = "Unused swap file entry ";
> @@ -2816,7 +2817,7 @@ unsigned long generic_max_swapfile_size(void)
> }
>
> /* Can be overridden by an architecture for additional checks. */
> -__weak unsigned long max_swapfile_size(void)
> +__weak unsigned long arch_max_swapfile_size(void)
> {
> return generic_max_swapfile_size();
> }
> @@ -2856,7 +2857,7 @@ static unsigned long read_swap_header(struct swap_info_struct *p,
> p->cluster_next = 1;
> p->cluster_nr = 0;
>
> - maxpages = max_swapfile_size();
> + maxpages = swapfile_maximum_size;
> last_page = swap_header->info.last_page;
> if (!last_page) {
> pr_warn("Empty swap-file\n");
> @@ -3677,6 +3678,8 @@ static int __init swapfile_init(void)
> for_each_node(nid)
> plist_head_init(&swap_avail_heads[nid]);
>
> + swapfile_maximum_size = arch_max_swapfile_size();
> +
> return 0;
> }
> subsys_initcall(swapfile_init);

2022-08-23 21:21:37

by Yu Zhao

Subject: Re: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry

On Thu, Aug 11, 2022 at 8:33 PM Huang, Ying <[email protected]> wrote:
>
> Peter Xu <[email protected]> writes:
>
> > We've got a bunch of special swap entries that stores PFN inside the swap
> > offset fields. To fetch the PFN, normally the user just calls swp_offset()
> > assuming that'll be the PFN.
> >
> > Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
> > max possible length of a PFN on the host, meanwhile doing proper check with
> > MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs
> > properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
> >
> > One reason to do so is we never tried to sanitize whether swap offset can
> > really fit for storing PFN. At the meantime, this patch also prepares us
> > with the future possibility to store more information inside the swp offset
> > field, so assuming "swp_offset(entry)" to be the PFN will not stand any
> > more very soon.
> >
> > Replace many of the swp_offset() callers to use swp_offset_pfn() where
> > proper. Note that many of the existing users are not candidates for the
> > replacement, e.g.:
> >
> > (1) When the swap entry is not a pfn swap entry at all, or,
> > (2) when we wanna keep the whole swp_offset but only change the swp type.
> >
> > For the latter, it can happen when fork() triggered on a write-migration
> > swap entry pte, we may want to only change the migration type from
> > write->read but keep the rest, so it's not "fetching PFN" but "changing
> > swap type only". They're left aside so that when there're more information
> > within the swp offset they'll be carried over naturally in those cases.
> >
> > Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
> > the new swp_offset_pfn() is about.
> >
> > Signed-off-by: Peter Xu <[email protected]>
>
> LGTM, Thanks!
>
> Reviewed-by: "Huang, Ying" <[email protected]>

Hi,

I hit the following crash on mm-everything-2022-08-22-22-59. Please take a look.

Thanks.

kernel BUG at include/linux/swapops.h:117!
CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L
6.0.0-dbg-DEV #2
RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
FS: 00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
change_pte_range+0x36e/0x880
change_p4d_range+0x2e8/0x670
change_protection_range+0x14e/0x2c0
mprotect_fixup+0x1ee/0x330
do_mprotect_pkey+0x34c/0x440
__x64_sys_mprotect+0x1d/0x30

2022-08-23 22:27:25

by Peter Xu

Subject: Re: [PATCH v4 3/7] mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry

On Tue, Aug 23, 2022 at 03:01:09PM -0600, Yu Zhao wrote:
> On Thu, Aug 11, 2022 at 8:33 PM Huang, Ying <[email protected]> wrote:
> >
> > Peter Xu <[email protected]> writes:
> >
> > > We've got a bunch of special swap entries that stores PFN inside the swap
> > > offset fields. To fetch the PFN, normally the user just calls swp_offset()
> > > assuming that'll be the PFN.
> > >
> > > Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
> > > max possible length of a PFN on the host, meanwhile doing proper check with
> > > MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs
> > > properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
> > >
> > > One reason to do so is we never tried to sanitize whether swap offset can
> > > really fit for storing PFN. At the meantime, this patch also prepares us
> > > with the future possibility to store more information inside the swp offset
> > > field, so assuming "swp_offset(entry)" to be the PFN will not stand any
> > > more very soon.
> > >
> > > Replace many of the swp_offset() callers to use swp_offset_pfn() where
> > > proper. Note that many of the existing users are not candidates for the
> > > replacement, e.g.:
> > >
> > > (1) When the swap entry is not a pfn swap entry at all, or,
> > > (2) when we wanna keep the whole swp_offset but only change the swp type.
> > >
> > > For the latter, it can happen when fork() triggered on a write-migration
> > > swap entry pte, we may want to only change the migration type from
> > > write->read but keep the rest, so it's not "fetching PFN" but "changing
> > > swap type only". They're left aside so that when there're more information
> > > within the swp offset they'll be carried over naturally in those cases.
> > >
> > > Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
> > > the new swp_offset_pfn() is about.
> > >
> > > Signed-off-by: Peter Xu <[email protected]>
> >
> > LGTM, Thanks!
> >
> > Reviewed-by: "Huang, Ying" <[email protected]>
>
> Hi,
>
> I hit the following crash on mm-everything-2022-08-22-22-59. Please take a look.
>
> Thanks.
>
> kernel BUG at include/linux/swapops.h:117!
> CPU: 46 PID: 5245 Comm: EventManager_De Tainted: G S O L
> 6.0.0-dbg-DEV #2
> RIP: 0010:pfn_swap_entry_to_page+0x72/0xf0
> Code: c6 48 8b 36 48 83 fe ff 74 53 48 01 d1 48 83 c1 08 48 8b 09 f6
> c1 01 75 7b 66 90 48 89 c1 48 8b 09 f6 c1 01 74 74 5d c3 eb 9e <0f> 0b
> 48 ba ff ff ff ff 03 00 00 00 eb ae a9 ff 0f 00 00 75 13 48
> RSP: 0018:ffffa59e73fabb80 EFLAGS: 00010282
> RAX: 00000000ffffffe8 RBX: 0c00000000000000 RCX: ffffcd5440000000
> RDX: 1ffffffffff7a80a RSI: 0000000000000000 RDI: 0c0000000000042b
> RBP: ffffa59e73fabb80 R08: ffff9965ca6e8bb8 R09: 0000000000000000
> R10: ffffffffa5a2f62d R11: 0000030b372e9fff R12: ffff997b79db5738
> R13: 000000000000042b R14: 0c0000000000042b R15: 1ffffffffff7a80a
> FS: 00007f549d1bb700(0000) GS:ffff99d3cf680000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000440d035b3180 CR3: 0000002243176004 CR4: 00000000003706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> change_pte_range+0x36e/0x880
> change_p4d_range+0x2e8/0x670
> change_protection_range+0x14e/0x2c0
> mprotect_fixup+0x1ee/0x330
> do_mprotect_pkey+0x34c/0x440
> __x64_sys_mprotect+0x1d/0x30

The VM_BUG_ON added in this patch seems to have revealed a real bug,
because we probably shouldn't call pfn_swap_entry_to_page() upon e.g. a
genuine swap pte.
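
For illustration only (a hypothetical sketch, not the actual fix, and the
local variable name is assumed), the caller would need to check
is_pfn_swap_entry() before touching the page:

        if (is_swap_pte(oldpte)) {
                swp_entry_t entry = pte_to_swp_entry(oldpte);

                if (is_pfn_swap_entry(entry)) {
                        /* migration/device-private: the offset is a PFN */
                        struct page *page = pfn_swap_entry_to_page(entry);
                        /* ... */
                } else {
                        /* genuine swap entry: no PFN to convert to a page */
                }
        }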

I'll post a patch shortly, thanks.

--
Peter Xu

2022-09-12 00:03:15

by Andrew Morton

Subject: Re: [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations

On Thu, 11 Aug 2022 12:13:29 -0400 Peter Xu <[email protected]> wrote:

> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure. Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
>
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated. To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
>
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset is big enough to store both the PFN value and the A/D bits.
> Otherwise the A/D bits are dropped like before.
>

There was some discussion over v3 of this patch, but none over v4.

Can people please review this patch series so we can get moving with it?

Thanks.

2022-09-13 01:15:59

by Huang, Ying

Subject: Re: [PATCH v4 5/7] mm: Remember young/dirty bit for page migrations

Andrew Morton <[email protected]> writes:

> On Thu, 11 Aug 2022 12:13:29 -0400 Peter Xu <[email protected]> wrote:
>
>> When page migration happens, we always ignore the young/dirty bit settings
>> in the old pgtable, and marking the page as old in the new page table using
>> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>>
>> That's fine from functional-wise, but that's not friendly to page reclaim
>> because the moving page can be actively accessed within the procedure. Not
>> to mention hardware setting the young bit can bring quite some overhead on
>> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
>> The same slowdown problem to dirty bits when the memory is first written
>> after page migration happened.
>>
>> Actually we can easily remember the A/D bit configuration and recover the
>> information after the page is migrated. To achieve it, define a new set of
>> bits in the migration swap offset field to cache the A/D bits for old pte.
>> Then when removing/recovering the migration entry, we can recover the A/D
>> bits even if the page changed.
>>
>> One thing to mention is that here we used max_swapfile_size() to detect how
>> many swp offset bits we have, and we'll only enable this feature if we know
>> the swp offset is big enough to store both the PFN value and the A/D bits.
>> Otherwise the A/D bits are dropped like before.
>>
>
> There was some discussion over v3 of this patch, but none over v4.
>
> Can people please review this patch series so we can get moving with it?

Most discussions over v3 were about the migrate_device.c code. There were
some bugs, and they have been fixed by Alistair via [1].

This patch itself is good. Sorry for bothering.

Reviewed-by: "Huang, Ying" <[email protected]>

[1] https://lore.kernel.org/linux-mm/9f801e9d8d830408f2ca27821f606e09aa856899.1662078528.git-series.apopple@nvidia.com/

Best Regards,
Huang, Ying

2022-11-21 05:17:46

by Raghavendra K T

[permalink] [raw]
Subject: Re: [PATCH v4 0/7] mm: Remember a/d bits for migration entries

On 8/11/2022 9:43 PM, Peter Xu wrote:
> v4:
> - Added r-bs for Ying
> - Some cosmetic changes here and there [Ying]
> - Fix smaps to only dump PFN for pfn swap entries for both pte/pmd [Ying]
> - Remove max_swapfile_size(), export swapfile_maximum_size variable [Ying]
> - In migrate_vma_collect_pmd() only read A/D if pte_present()
>
> rfc: https://lore.kernel.org/all/[email protected]
> v1: https://lore.kernel.org/all/[email protected]
> v2: https://lore.kernel.org/all/[email protected]
> v3: https://lore.kernel.org/all/[email protected]
>
> Problem
> =======
>
> When migrate a page, right now we always mark the migrated page as old &
> clean.
>
> However that could lead to at least two problems:
>
> (1) We lost the real hot/cold information while we could have persisted.
> That information shouldn't change even if the backing page is changed
> after the migration,
>
> (2) There can be always extra overhead on the immediate next access to
> any migrated page, because hardware MMU needs cycles to set the young
> bit again for reads, and dirty bits for write, as long as the
> hardware MMU supports these bits.
>
> Many of the recent upstream works showed that (2) is not something trivial
> and actually very measurable. In my test case, reading 1G chunk of memory
> - jumping in page size intervals - could take 99ms just because of the
> extra setting on the young bit on a generic x86_64 system, comparing to 4ms
> if young set.
>
> This issue is originally reported by Andrea Arcangeli.
>
> Solution
> ========
>
> To solve this problem, this patchset tries to remember the young/dirty bits
> in the migration entries and carry them over when recovering the ptes.
>
> We have the chance to do so because in many systems the swap offset is not
> really fully used. Migration entries use swp offset to store PFN only,
> while the PFN is normally not as large as swp offset and normally smaller.
> It means we do have some free bits in swp offset that we can use to store
> things like A/D bits, and that's how this series tried to approach this
> problem.
>
> max_swapfile_size() is used here to detect per-arch offset length in swp
> entries. We'll automatically remember the A/D bits when we find that we
> have enough swp offset field to keep both the PFN and the extra bits.
>
> Since max_swapfile_size() can be slow, the last two patches cache the
> results for it and also swap_migration_ad_supported as a whole.
>
> Known Issues / TODOs
> ====================
>
> We still haven't taught madvise() to recognize the new A/D bits in
> migration entries, namely MADV_COLD/MADV_FREE. E.g. when MADV_COLD upon a
> migration entry. It's not clear yet on whether we should clear the A bit,
> or we should just drop the entry directly.
>
> We didn't teach idle page tracking on the new migration entries, because
> it'll need larger rework on the tree on rmap pgtable walk. However it
> should make it already better because before this patchset page will be old
> page after migration, so the series will fix potential false negative of
> idle page tracking when pages were migrated before observing.
>
> The other thing is migration A/D bits will not start to working for private
> device swap entries. The code is there for completeness but since private
> device swap entries do not yet have fields to store A/D bits, even if we'll
> persistent A/D across present pte switching to migration entry, we'll lose
> it again when the migration entry converted to private device swap entry.
>
> Tests
> =====
>
> After the patchset applied, the immediate read access test [1] of above 1G
> chunk after migration can shrink from 99ms to 4ms. The test is done by
> moving 1G pages from node 0->1->0 then read it in page size jumps. The
> test is with Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
>
> Similar effect can also be measured when writting the memory the 1st time
> after migration.
>
> After applying the patchset, both initial immediate read/write after page
> migrated will perform similarly like before migration happened.

I was able to test on an AMD EPYC 64-core, 2-NUMA-node (Milan) system
clocked at 3.72 GHz.

I am seeing a similar improvement for the test mentioned above (swap-young).

base: (6.0)
--------------
Write (node 0) took 562202 (us)
Read (node 0) took 7790 (us)
Move to node 1 took 474876(us)
Move to node 0 took 642805(us)
Read (node 0) took 81364 (us)
Write (node 0) took 12887 (us)
Read (node 0) took 5202 (us)
Write (node 0) took 4533 (us)
Read (node 0) took 5229 (us)
Write (node 0) took 4558 (us)
Read (node 0) took 5198 (us)
Write (node 0) took 4551 (us)
Read (node 0) took 5218 (us)
Write (node 0) took 4534 (us)

patched
-------------
Write (node 0) took 250232 (us)
Read (node 0) took 3262 (us)
Move to node 1 took 640636(us)
Move to node 0 took 449051(us)
Read (node 0) took 2966 (us)
Write (node 0) took 2720 (us)
Read (node 0) took 2891 (us)
Write (node 0) took 2560 (us)
Read (node 0) took 2899 (us)
Write (node 0) took 2568 (us)
Read (node 0) took 2890 (us)
Write (node 0) took 2568 (us)
Read (node 0) took 2897 (us)
Write (node 0) took 2563 (us)

Please feel free to add FWIW
Tested-by: Raghavendra K T <[email protected]>

>
> Patch Layout
> ============
>
> Patch 1-2: Cleanups from either previous versions or on swapops.h macros.
>
> Patch 3-4: Prepare for the introduction of migration A/D bits
>
> Patch 5: The core patch to remember young/dirty bit in swap offsets.
>
> Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.
>
> Please review, thanks.
>
> [1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
>
> Peter Xu (7):
> mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
> mm/swap: Comment all the ifdef in swapops.h
> mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
> mm/thp: Carry over dirty bit when thp splits on pmd
> mm: Remember young/dirty bit for page migrations
> mm/swap: Cache maximum swapfile size when init swap
> mm/swap: Cache swap migration A/D bits support
>
> arch/arm64/mm/hugetlbpage.c | 2 +-
> arch/x86/include/asm/pgtable-3level.h | 8 +-
> arch/x86/mm/init.c | 2 +-
> fs/proc/task_mmu.c | 20 +++-
> include/linux/swapfile.h | 5 +-
> include/linux/swapops.h | 145 +++++++++++++++++++++++---
> mm/hmm.c | 2 +-
> mm/huge_memory.c | 27 ++++-
> mm/memory-failure.c | 2 +-
> mm/migrate.c | 6 +-
> mm/migrate_device.c | 6 ++
> mm/page_vma_mapped.c | 6 +-
> mm/rmap.c | 5 +-
> mm/swapfile.c | 15 ++-
> 14 files changed, 214 insertions(+), 37 deletions(-)
>


2022-11-21 15:44:33

by Peter Xu

Subject: Re: [PATCH v4 0/7] mm: Remember a/d bits for migration entries

Hi, Raghavendra,

On Mon, Nov 21, 2022 at 10:45:45AM +0530, Raghavendra K T wrote:
> I was able to test on AMD EPYC 64 core 2 numa node (Milan) 3.72 GHz clocked
> system
>
> am seeing the similar improvement for the test mentioned above (swap-young)
>
> base: (6.0)
> --------------
> Write (node 0) took 562202 (us)
> Read (node 0) took 7790 (us)
> Move to node 1 took 474876(us)
> Move to node 0 took 642805(us)
> Read (node 0) took 81364 (us)
> Write (node 0) took 12887 (us)
> Read (node 0) took 5202 (us)
> Write (node 0) took 4533 (us)
> Read (node 0) took 5229 (us)
> Write (node 0) took 4558 (us)
> Read (node 0) took 5198 (us)
> Write (node 0) took 4551 (us)
> Read (node 0) took 5218 (us)
> Write (node 0) took 4534 (us)
>
> patched
> -------------
> Write (node 0) took 250232 (us)
> Read (node 0) took 3262 (us)
> Move to node 1 took 640636(us)
> Move to node 0 took 449051(us)
> Read (node 0) took 2966 (us)
> Write (node 0) took 2720 (us)
> Read (node 0) took 2891 (us)
> Write (node 0) took 2560 (us)
> Read (node 0) took 2899 (us)
> Write (node 0) took 2568 (us)
> Read (node 0) took 2890 (us)
> Write (node 0) took 2568 (us)
> Read (node 0) took 2897 (us)
> Write (node 0) took 2563 (us)
>
> Please feel free to add FWIW
> Tested-by: Raghavendra K T <[email protected]>

The series has already landed in v6.1-rc1 so it's a bit late to apply the
Tested-by tag, but still thanks a lot for your tests and update!

It seems the memory size differs between the two rounds of the test, since
even the initial write differs in time, but I think the results still show
the improvement, because what matters is the first read/write after
migration, and that can be compared with the 2nd/3rd/... reads/writes
afterwards.

Side note: there's actually one tiny thing that got removed from the
series, on handling the dirty bit of thp split (434e3d15d92b), but it seems
there's hope we have found the root cause of the issues on sparc64 and
loongarch, so we may have a chance to re-apply it.

--
Peter Xu