v3:
- Rebased to akpm/mm-unstable
- Use BIT() [Nadav]
- One more patch to add comment for "ifdef"s [Nadav]
- Added one ascii table for migration swp offset layout [David]
- Move comment to be above "if"s [Ying]
- Separate the dirty bit carry-over of pmd split to separate patch [Ying]
- Added two patches to cache both max_swapfile_size and
migration_entry_supports_ad() at the end
rfc: https://lore.kernel.org/all/[email protected]
v1: https://lore.kernel.org/all/[email protected]
v2: https://lore.kernel.org/all/[email protected]
Problem
=======
When migrating a page, right now we always mark the migrated page as old &
clean.
However that can lead to at least two problems:
(1) We lose the real hot/cold information that we could have preserved.
    That information shouldn't change even if the backing page is changed
    after the migration.
(2) There is always extra overhead on the immediate next access to any
    migrated page, because the hardware MMU needs extra cycles to set the
    young bit again for reads, and the dirty bit for writes, as long as the
    hardware MMU supports these bits.
Much of the recent upstream work has shown that (2) is not trivial and is
actually very measurable.  In my test case, reading a 1G chunk of memory -
jumping in page size intervals - could take 99ms on a generic x86_64 system
just because of the extra work of setting the young bit, compared to 4ms if
the young bit was already set.
This issue was originally reported by Andrea Arcangeli.
Solution
========
To solve this problem, this patchset remembers the young/dirty bits in the
migration entries and carries them over when recovering the ptes.
We have the chance to do so because on many systems the swap offset is not
really fully used.  Migration entries use the swp offset to store the PFN
only, and the PFN is normally smaller than what the swp offset field can
hold.  That means we do have some free bits in the swp offset that we can
use to store things like the A/D bits, and that's how this series
approaches the problem.
max_swapfile_size() is used here to detect the per-arch swp offset length.
We'll automatically remember the A/D bits when we find that the swp offset
field is large enough to keep both the PFN and the extra bits.
Since max_swapfile_size() can be slow, the last two patches cache its
result, and also cache the migration_entry_supports_ad() check as a whole
in swap_migration_ad_supported.
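For reference, a condensed sketch of the resulting encoding; the real
definitions are introduced in patch 5 (include/linux/swapops.h), this only
restates them in one place:
  /* Bits available for the PFN inside a swp offset */
  #define SWP_PFN_BITS        (MAX_PHYSMEM_BITS - PAGE_SHIFT)
  #define SWP_PFN_MASK        ((1UL << SWP_PFN_BITS) - 1)
  /* The A/D bits sit right above the PFN in migration entries */
  #define SWP_MIG_YOUNG_BIT   (SWP_PFN_BITS)
  #define SWP_MIG_DIRTY_BIT   (SWP_PFN_BITS + 1)
  #define SWP_MIG_TOTAL_BITS  (SWP_PFN_BITS + 2)
  static inline bool migration_entry_supports_ad(void)
  {
          /* Only enabled if the arch swp offset can hold PFN + A + D bits */
          return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
  }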
Tests
=====
With the patchset applied, the immediate read access test [1] on the above
1G chunk after migration shrinks from 99ms to 4ms.  The test is done by
moving 1G of pages from node 0->1->0 and then reading it in page size
jumps.  The test machine uses an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz.
A similar effect can also be measured when writing the memory for the
first time after migration.
With the patchset applied, both the initial read and the initial write
after a page is migrated perform similarly to before the migration
happened.
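For illustration, here is a minimal sketch of the timed read loop; the
real test program is [1], and the names below (read_pages_ns, buf, psize)
are illustrative only, assuming buf is the 1G mapping that was just
migrated across nodes:
  #include <stddef.h>
  #include <stdint.h>
  #include <time.h>
  /*
   * Read one byte per page.  Each access forces the MMU to set the young
   * bit if it is not already set, which is what the 99ms vs 4ms numbers
   * above are measuring.
   */
  static uint64_t read_pages_ns(volatile char *buf, size_t size, size_t psize)
  {
          struct timespec t0, t1;
          uint64_t sum = 0;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (size_t off = 0; off < size; off += psize)
                  sum += buf[off];
          clock_gettime(CLOCK_MONOTONIC, &t1);
          (void)sum;
          return (t1.tv_sec - t0.tv_sec) * 1000000000ULL +
                 (t1.tv_nsec - t0.tv_nsec);
  }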
Patch Layout
============
Patch 1-2: Cleanups from either previous versions or on swapops.h macros.
Patch 3-4: Prepare for the introduction of migration A/D bits
Patch 5: The core patch to remember young/dirty bit in swap offsets.
Patch 6-7: Cache relevant fields to make migration_entry_supports_ad() fast.
Please review, thanks.
[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
Peter Xu (7):
mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
mm/swap: Comment all the ifdef in swapops.h
mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
mm/thp: Carry over dirty bit when thp splits on pmd
mm: Remember young/dirty bit for page migrations
mm/swap: Cache maximum swapfile size when init swap
mm/swap: Cache swap migration A/D bits support
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/x86/include/asm/pgtable-3level.h | 8 +-
arch/x86/mm/init.c | 2 +-
include/linux/swapfile.h | 1 +
include/linux/swapops.h | 145 +++++++++++++++++++++++---
mm/hmm.c | 2 +-
mm/huge_memory.c | 24 ++++-
mm/memory-failure.c | 2 +-
mm/migrate.c | 6 +-
mm/migrate_device.c | 4 +
mm/page_vma_mapped.c | 6 +-
mm/rmap.c | 5 +-
mm/swapfile.c | 18 +++-
13 files changed, 194 insertions(+), 31 deletions(-)
--
2.32.0
Introduce a variable swap_migration_ad_supported to cache whether the arch
supports swap migration A/D bits.
One thing to mention here is that SWP_MIG_TOTAL_BITS internally references
the macro MAX_PHYSMEM_BITS, which involves a function call on x86 (it is a
constant on all other archs).
It's safe to reference it in swapfile_init() because by the time we reach
it we're already at initcall level 4, so the 5-level pgtable setup for
x86_64 must have been initialized (right after early_identify_cpu()
finishes).
- start_kernel
- setup_arch
- early_cpu_init
- get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
- early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
- arch_call_rest_init
- rest_init
- kernel_init
- kernel_init_freeable
- do_basic_setup
- do_initcalls --> calls swapfile_init() (initcall level 4)
This should slightly speed up migration swap entry handling.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapfile.h | 1 +
include/linux/swapops.h | 7 +------
mm/swapfile.c | 8 ++++++++
3 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
index 54078542134c..87ec5e2cdb02 100644
--- a/include/linux/swapfile.h
+++ b/include/linux/swapfile.h
@@ -9,5 +9,6 @@
extern struct swap_info_struct *swap_info[];
extern unsigned long generic_max_swapfile_size(void);
extern unsigned long max_swapfile_size(void);
+extern bool swap_migration_ad_supported;
#endif /* _LINUX_SWAPFILE_H */
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 0e9579b90659..e6afc77c51ad 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -301,13 +301,8 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
*/
static inline bool migration_entry_supports_ad(void)
{
- /*
- * max_swapfile_size() returns the max supported swp-offset plus 1.
- * We can support the migration A/D bits iff the pfn swap entry has
- * the offset large enough to cover all of them (PFN, A & D bits).
- */
#ifdef CONFIG_SWAP
- return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+ return swap_migration_ad_supported;
#else /* CONFIG_SWAP */
return false;
#endif /* CONFIG_SWAP */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 794fa37bd0c3..c49cf25f0d08 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -64,6 +64,9 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
long total_swap_pages;
static int least_priority = -1;
static unsigned long swapfile_maximum_size;
+#ifdef CONFIG_MIGRATION
+bool swap_migration_ad_supported;
+#endif /* CONFIG_MIGRATION */
static const char Bad_file[] = "Bad swap file entry ";
static const char Unused_file[] = "Unused swap file entry ";
@@ -3685,6 +3688,11 @@ static int __init swapfile_init(void)
swapfile_maximum_size = arch_max_swapfile_size();
+#ifdef CONFIG_MIGRATION
+ if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
+ swap_migration_ad_supported = true;
+#endif /* CONFIG_MIGRATION */
+
return 0;
}
subsys_initcall(swapfile_init);
--
2.32.0
We used to have max_swapfile_size() fetching the maximum swapfile size
per-arch.
As the callers of max_swapfile_size() grow, this patch introduces a
variable "swapfile_maximum_size" to cache the value of the old
max_swapfile_size(), so that we don't need to recalculate it every time.
Caching the value in swapfile_init() is safe because by the time we reach
that phase all the relevant information should have been initialized.  The
major arch to look after here is x86, which defines the maximum size based
on the L1TF mitigation.
Both X86_BUG_L1TF and l1tf_mitigation should have been set up properly by
the time swapfile_init() is reached.  As a reference, the code path looks
like this for x86:
- start_kernel
- setup_arch
- early_cpu_init
- early_identify_cpu --> setup X86_BUG_L1TF
- parse_early_param
- l1tf_cmdline --> set l1tf_mitigation
- check_bugs
- l1tf_select_mitigation --> set l1tf_mitigation
- arch_call_rest_init
- rest_init
- kernel_init
- kernel_init_freeable
- do_basic_setup
- do_initcalls --> calls swapfile_init() (initcall level 4)
The swapfile size only depends on the swp pte format on non-x86 archs, so
caching it is safe there too.
While at it, rename max_swapfile_size() to arch_max_swapfile_size(),
because an arch can define its own version, so the "arch_" prefix is more
straightforward.  Meanwhile, keep the max_swapfile_size() function to
fetch the value from the cache initialized in swapfile_init().
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/mm/init.c | 2 +-
mm/swapfile.c | 10 +++++++++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 82a042c03824..9121bc1b9453 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
}
#ifdef CONFIG_SWAP
-unsigned long max_swapfile_size(void)
+unsigned long arch_max_swapfile_size(void)
{
unsigned long pages;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1fdccd2f1422..794fa37bd0c3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
/* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
long total_swap_pages;
static int least_priority = -1;
+static unsigned long swapfile_maximum_size;
static const char Bad_file[] = "Bad swap file entry ";
static const char Unused_file[] = "Unused swap file entry ";
@@ -2816,11 +2817,16 @@ unsigned long generic_max_swapfile_size(void)
}
/* Can be overridden by an architecture for additional checks. */
-__weak unsigned long max_swapfile_size(void)
+__weak unsigned long arch_max_swapfile_size(void)
{
return generic_max_swapfile_size();
}
+unsigned long max_swapfile_size(void)
+{
+ return swapfile_maximum_size;
+}
+
static unsigned long read_swap_header(struct swap_info_struct *p,
union swap_header *swap_header,
struct inode *inode)
@@ -3677,6 +3683,8 @@ static int __init swapfile_init(void)
for_each_node(nid)
plist_head_init(&swap_avail_heads[nid]);
+ swapfile_maximum_size = arch_max_swapfile_size();
+
return 0;
}
subsys_initcall(swapfile_init);
--
2.32.0
swapops.h contains quite a few layers of ifdefs, and some of the "else"
and "endif" lines don't carry a comment naming the macro they pair with,
so it's hard to follow what they refer to.  Add the comments.
Suggested-by: Nadav Amit <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index a3d435bf9f97..3a2901ff4f1e 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -247,8 +247,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
#ifdef CONFIG_HUGETLB_PAGE
extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
-#endif
-#else
+#endif /* CONFIG_HUGETLB_PAGE */
+#else /* CONFIG_MIGRATION */
static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
{
return swp_entry(0, 0);
@@ -276,7 +276,7 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
#ifdef CONFIG_HUGETLB_PAGE
static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
-#endif
+#endif /* CONFIG_HUGETLB_PAGE */
static inline int is_writable_migration_entry(swp_entry_t entry)
{
return 0;
@@ -286,7 +286,7 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
return 0;
}
-#endif
+#endif /* CONFIG_MIGRATION */
typedef unsigned long pte_marker;
@@ -426,7 +426,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
{
return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
}
-#else
+#else /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
struct page *page)
{
@@ -455,7 +455,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
{
return 0;
}
-#endif
+#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
#ifdef CONFIG_MEMORY_FAILURE
@@ -495,7 +495,7 @@ static inline void num_poisoned_pages_sub(long i)
atomic_long_sub(i, &num_poisoned_pages);
}
-#else
+#else /* CONFIG_MEMORY_FAILURE */
static inline swp_entry_t make_hwpoison_entry(struct page *page)
{
@@ -514,7 +514,7 @@ static inline void num_poisoned_pages_inc(void)
static inline void num_poisoned_pages_sub(long i)
{
}
-#endif
+#endif /* CONFIG_MEMORY_FAILURE */
static inline int non_swap_entry(swp_entry_t entry)
{
--
2.32.0
When page migration happens, we always ignore the young/dirty bit settings
in the old pgtable, marking the page as old in the new page table using
either pte_mkold() or pmd_mkold(), and keeping the pte clean.
That's fine functionally, but it's not friendly to page reclaim because
the migrating page can be actively accessed during the procedure.  Not to
mention that having the hardware set the young bit again can bring quite
some overhead on some systems, e.g. x86_64 needs a few hundred nanoseconds
to set the bit.  The same slowdown applies to the dirty bit when the
memory is first written after the migration.
Actually we can easily remember the A/D bit configuration and recover the
information after the page is migrated.  To achieve it, define a new set
of bits in the migration swap offset field to cache the A/D bits of the
old pte.  Then when removing/recovering the migration entry, we can
recover the A/D bits even if the backing page changed.
One thing to mention is that we use max_swapfile_size() here to detect how
many swp offset bits we have, and we'll only enable this feature if we
know the swp offset is big enough to store both the PFN value and the A/D
bits.  Otherwise the A/D bits are dropped like before.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 18 +++++++-
mm/migrate.c | 6 ++-
mm/migrate_device.c | 4 ++
mm/rmap.c | 5 ++-
5 files changed, 128 insertions(+), 4 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index e1accbcd1136..0e9579b90659 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -8,6 +8,10 @@
#ifdef CONFIG_MMU
+#ifdef CONFIG_SWAP
+#include <linux/swapfile.h>
+#endif /* CONFIG_SWAP */
+
/*
* swapcache pages are stored in the swapper_space radix tree. We want to
* get good packing density in that tree, so the index should be dense in
@@ -35,6 +39,31 @@
#endif /* MAX_PHYSMEM_BITS */
#define SWP_PFN_MASK ((1UL << SWP_PFN_BITS) - 1)
+/**
+ * Migration swap entry specific bitfield definitions. Layout:
+ *
+ * |----------+--------------------|
+ * | swp_type | swp_offset |
+ * |----------+--------+-+-+-------|
+ * | | resv |D|A| PFN |
+ * |----------+--------+-+-+-------|
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set (bit A)
+ * @SWP_MIG_DIRTY_BIT: Whether the page used to have dirty bit set (bit D)
+ *
+ * Note: A/D bits will be stored in migration entries iff there're enough
+ * free bits in arch specific swp offset. By default we'll ignore A/D bits
+ * when migrating a page. Please refer to migration_entry_supports_ad()
+ * for more information. If there're more bits besides PFN and A/D bits,
+ * they should be reserved and always be zeros.
+ */
+#define SWP_MIG_YOUNG_BIT (SWP_PFN_BITS)
+#define SWP_MIG_DIRTY_BIT (SWP_PFN_BITS + 1)
+#define SWP_MIG_TOTAL_BITS (SWP_PFN_BITS + 2)
+
+#define SWP_MIG_YOUNG BIT(SWP_MIG_YOUNG_BIT)
+#define SWP_MIG_DIRTY BIT(SWP_MIG_DIRTY_BIT)
+
static inline bool is_pfn_swap_entry(swp_entry_t entry);
/* Clear all flags but only keep swp_entry_t related information */
@@ -265,6 +294,57 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
return swp_entry(SWP_MIGRATION_WRITE, offset);
}
+/*
+ * Returns whether the host has large enough swap offset field to support
+ * carrying over pgtable A/D bits for page migrations. The result is
+ * pretty much arch specific.
+ */
+static inline bool migration_entry_supports_ad(void)
+{
+ /*
+ * max_swapfile_size() returns the max supported swp-offset plus 1.
+ * We can support the migration A/D bits iff the pfn swap entry has
+ * the offset large enough to cover all of them (PFN, A & D bits).
+ */
+#ifdef CONFIG_SWAP
+ return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
+#else /* CONFIG_SWAP */
+ return false;
+#endif /* CONFIG_SWAP */
+}
+
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_entry(swp_type(entry),
+ swp_offset(entry) | SWP_MIG_YOUNG);
+ return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_offset(entry) & SWP_MIG_YOUNG;
+ /* Keep the old behavior of aging page after migration */
+ return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_entry(swp_type(entry),
+ swp_offset(entry) | SWP_MIG_DIRTY);
+ return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+ if (migration_entry_supports_ad())
+ return swp_offset(entry) & SWP_MIG_DIRTY;
+ /* Keep the old behavior of clean page after migration */
+ return false;
+}
+
extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl);
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -311,6 +391,25 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
return 0;
}
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+ return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+ return false;
+}
+
+static inline swp_entry_t make_migration_entry_dirty(swp_entry_t entry)
+{
+ return entry;
+}
+
+static inline bool is_migration_entry_dirty(swp_entry_t entry)
+{
+ return false;
+}
#endif /* CONFIG_MIGRATION */
typedef unsigned long pte_marker;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e8e78d1bac5f..1644e9f59d73 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2089,7 +2089,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = is_writable_migration_entry(entry);
if (PageAnon(page))
anon_exclusive = is_readable_exclusive_migration_entry(entry);
- young = false;
+ young = is_migration_entry_young(entry);
+ dirty = is_migration_entry_dirty(entry);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
@@ -2148,6 +2149,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
else
swp_entry = make_readable_migration_entry(
page_to_pfn(page + i));
+ if (young)
+ swp_entry = make_migration_entry_young(swp_entry);
+ if (dirty)
+ swp_entry = make_migration_entry_dirty(swp_entry);
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
@@ -3157,6 +3162,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
else
entry = make_readable_migration_entry(page_to_pfn(page));
+ if (pmd_young(pmdval))
+ entry = make_migration_entry_young(entry);
+ if (pmd_dirty(pmdval))
+ entry = make_migration_entry_dirty(entry);
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3182,13 +3191,18 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
get_page(new);
- pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+ pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (is_writable_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
if (pmd_swp_uffd_wp(*pvmw->pmd))
pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+ if (!is_migration_entry_young(entry))
+ pmde = pmd_mkold(pmde);
+ /* NOTE: this may contain setting soft-dirty on some archs */
+ if (PageDirty(new) && is_migration_entry_dirty(entry))
+ pmde = pmd_mkdirty(pmde);
if (PageAnon(new)) {
rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 6a1597c92261..0433a71d2bee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -198,7 +198,7 @@ static bool remove_migration_pte(struct folio *folio,
#endif
folio_get(folio);
- pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+ pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
if (pte_swp_soft_dirty(*pvmw.pte))
pte = pte_mksoft_dirty(pte);
@@ -206,6 +206,10 @@ static bool remove_migration_pte(struct folio *folio,
* Recheck VMA as permissions can change since migration started
*/
entry = pte_to_swp_entry(*pvmw.pte);
+ if (!is_migration_entry_young(entry))
+ pte = pte_mkold(pte);
+ if (folio_test_dirty(folio) && is_migration_entry_dirty(entry))
+ pte = pte_mkdirty(pte);
if (is_writable_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 27fb37d65476..699f821b8443 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
else
entry = make_readable_migration_entry(
page_to_pfn(page));
+ if (pte_young(pte))
+ entry = make_migration_entry_young(entry);
+ if (pte_dirty(pte))
+ entry = make_migration_entry_dirty(entry);
swp_pte = swp_entry_to_pte(entry);
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..28aef434ea41 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
else
entry = make_readable_migration_entry(
page_to_pfn(subpage));
-
+ if (pte_young(pteval))
+ entry = make_migration_entry_young(entry);
+ if (pte_dirty(pteval))
+ entry = make_migration_entry_dirty(entry);
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
--
2.32.0
Replace all the magic "5" with the macro.
Reviewed-by: David Hildenbrand <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24..28421a887209 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
/* We always extract/encode the offset by shifting it all the way up, and then down again */
#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
-#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
-#define __swp_type(x) (((x).val) & 0x1f)
-#define __swp_offset(x) ((x).val >> 5)
-#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
+#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
+#define __swp_type(x) (((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
+#define __swp_offset(x) ((x).val >> SWP_TYPE_BITS)
+#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})
/*
* Normally, __swp_entry() converts from arch-independent swp_entry_t to
--
2.32.0
We've got a bunch of special swap entries that store the PFN inside the
swap offset field.  To fetch the PFN, normally the user just calls
swp_offset(), assuming that'll be the PFN.
Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
maximum possible length of a PFN on the host, while doing a proper check
against MAX_PHYSMEM_BITS to make sure the swap offset can always store the
PFN, using the BUILD_BUG_ON() in is_pfn_swap_entry().
One reason to do so is that we never tried to sanitize whether the swap
offset can really fit a PFN.  Meanwhile, this patch also prepares for the
future possibility of storing more information inside the swp offset
field, so assuming "swp_offset(entry)" to be the PFN will very soon no
longer hold.
Replace many of the swp_offset() callers with swp_offset_pfn() where
appropriate.  Note that many of the existing users are not candidates for
the replacement, e.g.:
(1) When the swap entry is not a pfn swap entry at all, or,
(2) when we want to keep the whole swp_offset but only change the swp type.
For the latter, this can happen when fork() is triggered on a
write-migration swap entry pte: we may want to only change the migration
type from write->read but keep the rest, so it's not "fetching the PFN"
but "changing the swap type only".  Those are left aside so that when
there is more information within the swp offset it will be carried over
naturally in those cases.
While at it, drop hwpoison_entry_to_pfn() because that's exactly what the
new swp_offset_pfn() is about.
Signed-off-by: Peter Xu <[email protected]>
---
arch/arm64/mm/hugetlbpage.c | 2 +-
include/linux/swapops.h | 35 +++++++++++++++++++++++++++++------
mm/hmm.c | 2 +-
mm/memory-failure.c | 2 +-
mm/page_vma_mapped.c | 6 +++---
5 files changed, 35 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 0795028f017c..35e9a468d13e 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -245,7 +245,7 @@ static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry)
{
VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry));
- return page_folio(pfn_to_page(swp_offset(entry)));
+ return page_folio(pfn_to_page(swp_offset_pfn(entry)));
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a2901ff4f1e..e1accbcd1136 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -23,6 +23,20 @@
#define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)
+/*
+ * Definitions only for PFN swap entries (see is_pfn_swap_entry()). To
+ * store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries
+ * can use the extra bits to store other information besides PFN.
+ */
+#ifdef MAX_PHYSMEM_BITS
+#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#else /* MAX_PHYSMEM_BITS */
+#define SWP_PFN_BITS (BITS_PER_LONG - PAGE_SHIFT)
+#endif /* MAX_PHYSMEM_BITS */
+#define SWP_PFN_MASK ((1UL << SWP_PFN_BITS) - 1)
+
+static inline bool is_pfn_swap_entry(swp_entry_t entry);
+
/* Clear all flags but only keep swp_entry_t related information */
static inline pte_t pte_swp_clear_flags(pte_t pte)
{
@@ -64,6 +78,17 @@ static inline pgoff_t swp_offset(swp_entry_t entry)
return entry.val & SWP_OFFSET_MASK;
}
+/*
+ * This should only be called upon a pfn swap entry to get the PFN stored
+ * in the swap entry. Please refers to is_pfn_swap_entry() for definition
+ * of pfn swap entry.
+ */
+static inline unsigned long swp_offset_pfn(swp_entry_t entry)
+{
+ VM_BUG_ON(!is_pfn_swap_entry(entry));
+ return swp_offset(entry) & SWP_PFN_MASK;
+}
+
/* check whether a pte points to a swap entry */
static inline int is_swap_pte(pte_t pte)
{
@@ -369,7 +394,7 @@ static inline int pte_none_mostly(pte_t pte)
static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
{
- struct page *p = pfn_to_page(swp_offset(entry));
+ struct page *p = pfn_to_page(swp_offset_pfn(entry));
/*
* Any use of migration entries may only occur while the
@@ -387,6 +412,9 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
*/
static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
+ /* Make sure the swp offset can always store the needed fields */
+ BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+
return is_migration_entry(entry) || is_device_private_entry(entry) ||
is_device_exclusive_entry(entry);
}
@@ -475,11 +503,6 @@ static inline int is_hwpoison_entry(swp_entry_t entry)
return swp_type(entry) == SWP_HWPOISON;
}
-static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry)
-{
- return swp_offset(entry);
-}
-
static inline void num_poisoned_pages_inc(void)
{
atomic_long_inc(&num_poisoned_pages);
diff --git a/mm/hmm.c b/mm/hmm.c
index f2aa63b94d9b..3850fb625dda 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -253,7 +253,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
cpu_flags |= HMM_PFN_WRITE;
- *hmm_pfn = swp_offset(entry) | cpu_flags;
+ *hmm_pfn = swp_offset_pfn(entry) | cpu_flags;
return 0;
}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 14439806b5ef..c8a7d38c1da4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -632,7 +632,7 @@ static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
swp_entry_t swp = pte_to_swp_entry(pte);
if (is_hwpoison_entry(swp))
- pfn = hwpoison_entry_to_pfn(swp);
+ pfn = swp_offset_pfn(swp);
}
if (!pfn || pfn != poisoned_pfn)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 8e9e574d535a..93e13fc17d3c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -86,7 +86,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
!is_device_exclusive_entry(entry))
return false;
- pfn = swp_offset(entry);
+ pfn = swp_offset_pfn(entry);
} else if (is_swap_pte(*pvmw->pte)) {
swp_entry_t entry;
@@ -96,7 +96,7 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
!is_device_exclusive_entry(entry))
return false;
- pfn = swp_offset(entry);
+ pfn = swp_offset_pfn(entry);
} else {
if (!pte_present(*pvmw->pte))
return false;
@@ -221,7 +221,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
entry = pmd_to_swp_entry(pmde);
if (!is_migration_entry(entry) ||
- !check_pmd(swp_offset(entry), pvmw))
+ !check_pmd(swp_offset_pfn(entry), pvmw))
return not_found(pvmw);
return true;
}
--
2.32.0
Carry over the dirty bit from the pmd to the ptes when a huge pmd splits.
It shouldn't be a correctness issue since with pmd_dirty() set we'll have
the page marked dirty anyway, but having the dirty bit carried over helps
the first writes to the split ptes on some archs like x86.
Signed-off-by: Peter Xu <[email protected]>
---
mm/huge_memory.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0611b2fd145a..e8e78d1bac5f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2005,7 +2005,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pgtable_t pgtable;
pmd_t old_pmd, _pmd;
bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
- bool anon_exclusive = false;
+ bool anon_exclusive = false, dirty = false;
unsigned long addr;
int i;
@@ -2098,6 +2098,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
SetPageDirty(page);
write = pmd_write(old_pmd);
young = pmd_young(old_pmd);
+ dirty = pmd_dirty(old_pmd);
soft_dirty = pmd_soft_dirty(old_pmd);
uffd_wp = pmd_uffd_wp(old_pmd);
@@ -2161,6 +2162,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_wrprotect(entry);
if (!young)
entry = pte_mkold(entry);
+ /* NOTE: this may set soft-dirty too on some archs */
+ if (dirty)
+ entry = pte_mkdirty(entry);
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
if (uffd_wp)
--
2.32.0
Peter Xu <[email protected]> writes:
> swapops.h contains quite a few layers of ifdef, some of the "else" and
> "endif" doesn't get proper comment on the macro so it's hard to follow on
> what are they referring to. Add the comments.
>
> Suggested-by: Nadav Amit <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
> ---
> include/linux/swapops.h | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index a3d435bf9f97..3a2901ff4f1e 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -247,8 +247,8 @@ extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> #ifdef CONFIG_HUGETLB_PAGE
> extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl);
> extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
> -#endif
> -#else
> +#endif /* CONFIG_HUGETLB_PAGE */
> +#else /* CONFIG_MIGRATION */
> static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
> {
> return swp_entry(0, 0);
> @@ -276,7 +276,7 @@ static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> #ifdef CONFIG_HUGETLB_PAGE
> static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { }
> static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
> -#endif
> +#endif /* CONFIG_HUGETLB_PAGE */
> static inline int is_writable_migration_entry(swp_entry_t entry)
> {
> return 0;
> @@ -286,7 +286,7 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
> return 0;
> }
>
> -#endif
> +#endif /* CONFIG_MIGRATION */
>
> typedef unsigned long pte_marker;
>
> @@ -426,7 +426,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> {
> return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
> }
> -#else
> +#else /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
> static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
> struct page *page)
> {
> @@ -455,7 +455,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
> {
> return 0;
> }
> -#endif
> +#endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>
> #ifdef CONFIG_MEMORY_FAILURE
>
> @@ -495,7 +495,7 @@ static inline void num_poisoned_pages_sub(long i)
> atomic_long_sub(i, &num_poisoned_pages);
> }
>
> -#else
> +#else /* CONFIG_MEMORY_FAILURE */
>
> static inline swp_entry_t make_hwpoison_entry(struct page *page)
> {
> @@ -514,7 +514,7 @@ static inline void num_poisoned_pages_inc(void)
> static inline void num_poisoned_pages_sub(long i)
> {
> }
> -#endif
> +#endif /* CONFIG_MEMORY_FAILURE */
>
> static inline int non_swap_entry(swp_entry_t entry)
> {
Peter Xu <[email protected]> writes:
> Replace all the magic "5" with the macro.
>
> Reviewed-by: David Hildenbrand <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
Reviewed-by: "Huang, Ying" <[email protected]>
Best Regards,
Huang, Ying
> ---
> arch/x86/include/asm/pgtable-3level.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
> index e896ebef8c24..28421a887209 100644
> --- a/arch/x86/include/asm/pgtable-3level.h
> +++ b/arch/x86/include/asm/pgtable-3level.h
> @@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
> /* We always extract/encode the offset by shifting it all the way up, and then down again */
> #define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
>
> -#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
> -#define __swp_type(x) (((x).val) & 0x1f)
> -#define __swp_offset(x) ((x).val >> 5)
> -#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
> +#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
> +#define __swp_type(x) (((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
> +#define __swp_offset(x) ((x).val >> SWP_TYPE_BITS)
> +#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})
>
> /*
> * Normally, __swp_entry() converts from arch-independent swp_entry_t to
Peter Xu <[email protected]> writes:
> Carry over the dirty bit from pmd to pte when a huge pmd splits. It
> shouldn't be a correctness issue since when pmd_dirty() we'll have the page
> marked dirty anyway, however having dirty bit carried over helps the next
> initial writes of split ptes on some archs like x86.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/huge_memory.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 0611b2fd145a..e8e78d1bac5f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2005,7 +2005,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> pgtable_t pgtable;
> pmd_t old_pmd, _pmd;
> bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> - bool anon_exclusive = false;
> + bool anon_exclusive = false, dirty = false;
> unsigned long addr;
> int i;
>
> @@ -2098,6 +2098,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> SetPageDirty(page);
> write = pmd_write(old_pmd);
> young = pmd_young(old_pmd);
> + dirty = pmd_dirty(old_pmd);
Nitpick: This can be put under
if (pmd_dirty(old_pmd))
SetPageDirty(page);
Not a big deal.
Reviewed-by: "Huang, Ying" <[email protected]>
> soft_dirty = pmd_soft_dirty(old_pmd);
> uffd_wp = pmd_uffd_wp(old_pmd);
>
> @@ -2161,6 +2162,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> entry = pte_wrprotect(entry);
> if (!young)
> entry = pte_mkold(entry);
> + /* NOTE: this may set soft-dirty too on some archs */
> + if (dirty)
> + entry = pte_mkdirty(entry);
> if (soft_dirty)
> entry = pte_mksoft_dirty(entry);
> if (uffd_wp)
Peter Xu <[email protected]> writes:
> We've got a bunch of special swap entries that stores PFN inside the swap
> offset fields. To fetch the PFN, normally the user just calls swp_offset()
> assuming that'll be the PFN.
>
> Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
> max possible length of a PFN on the host, meanwhile doing proper check with
> MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs
> properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
>
> One reason to do so is we never tried to sanitize whether swap offset can
> really fit for storing PFN. At the meantime, this patch also prepares us
> with the future possibility to store more information inside the swp offset
> field, so assuming "swp_offset(entry)" to be the PFN will not stand any
> more very soon.
>
> Replace many of the swp_offset() callers to use swp_offset_pfn() where
> proper. Note that many of the existing users are not candidates for the
> replacement, e.g.:
>
> (1) When the swap entry is not a pfn swap entry at all, or,
> (2) when we wanna keep the whole swp_offset but only change the swp type.
>
> For the latter, it can happen when fork() triggered on a write-migration
> swap entry pte, we may want to only change the migration type from
> write->read but keep the rest, so it's not "fetching PFN" but "changing
> swap type only". They're left aside so that when there're more information
> within the swp offset they'll be carried over naturally in those cases.
>
> Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
> the new swp_offset_pfn() is about.
>
> Signed-off-by: Peter Xu <[email protected]>
The patch itself looks good.  But I searched swp_entry() in the kernel
source code and found that we need to do more.
For example, in pte_to_pagemap_entry()
	frame = swp_type(entry) |
		(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
If it's a migration entry, we need
	frame = swp_type(entry) |
		(swp_offset_pfn(entry) << MAX_SWAPFILES_SHIFT);
So I think you need to search all swp_offset() calls in the kernel source
and check whether they need to be changed.
Best Regards,
Huang, Ying
[snip]
Peter Xu <[email protected]> writes:
> Introduce a variable swap_migration_ad_supported to cache whether the arch
> supports swap migration A/D bits.
>
> Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally
> reference the other macro MAX_PHYSMEM_BITS, which is a function call on
> x86 (constant on all the rest of archs).
>
> It's safe to reference it in swapfile_init() because when reaching here
> we're already during initcalls level 4 so we must have initialized 5-level
> pgtable for x86_64 (right after early_identify_cpu() finishes).
>
> - start_kernel
> - setup_arch
> - early_cpu_init
> - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
> - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
> - arch_call_rest_init
> - rest_init
> - kernel_init
> - kernel_init_freeable
> - do_basic_setup
> - do_initcalls --> calls swapfile_init() (initcall level 4)
>
> This should slightly speed up the migration swap entry handlings.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/swapfile.h | 1 +
> include/linux/swapops.h | 7 +------
> mm/swapfile.c | 8 ++++++++
> 3 files changed, 10 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
> index 54078542134c..87ec5e2cdb02 100644
> --- a/include/linux/swapfile.h
> +++ b/include/linux/swapfile.h
> @@ -9,5 +9,6 @@
> extern struct swap_info_struct *swap_info[];
> extern unsigned long generic_max_swapfile_size(void);
> extern unsigned long max_swapfile_size(void);
> +extern bool swap_migration_ad_supported;
>
> #endif /* _LINUX_SWAPFILE_H */
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 0e9579b90659..e6afc77c51ad 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -301,13 +301,8 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> */
> static inline bool migration_entry_supports_ad(void)
> {
> - /*
> - * max_swapfile_size() returns the max supported swp-offset plus 1.
> - * We can support the migration A/D bits iff the pfn swap entry has
> - * the offset large enough to cover all of them (PFN, A & D bits).
> - */
> #ifdef CONFIG_SWAP
> - return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> + return swap_migration_ad_supported;
> #else /* CONFIG_SWAP */
> return false;
> #endif /* CONFIG_SWAP */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 794fa37bd0c3..c49cf25f0d08 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -64,6 +64,9 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
> long total_swap_pages;
> static int least_priority = -1;
> static unsigned long swapfile_maximum_size;
> +#ifdef CONFIG_MIGRATION
> +bool swap_migration_ad_supported;
> +#endif /* CONFIG_MIGRATION */
>
> static const char Bad_file[] = "Bad swap file entry ";
> static const char Unused_file[] = "Unused swap file entry ";
> @@ -3685,6 +3688,11 @@ static int __init swapfile_init(void)
>
> swapfile_maximum_size = arch_max_swapfile_size();
>
> +#ifdef CONFIG_MIGRATION
> + if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
> + swap_migration_ad_supported = true;
> +#endif /* CONFIG_MIGRATION */
> +
> return 0;
> }
> subsys_initcall(swapfile_init);
I don't think it's necessary to add a variable for such a simple function,
and it's not a super hot path.  But I don't have strong opinions here.
Best Regards,
Huang, Ying
Peter Xu <[email protected]> writes:
> We used to have swapfile_maximum_size() fetching a maximum value of
> swapfile size per-arch.
>
> As the caller of max_swapfile_size() grows, this patch introduce a variable
> "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so
> that we don't need to calculate the value every time.
>
> Caching the value in swapfile_init() is safe because when reaching the
> phase we should have initialized all the relevant information. Here the
> major arch to look after is x86, which defines the max size based on L1TF
> mitigation.
>
> Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
> when reaching swapfile_init(). As a reference, the code path looks like
> this for x86:
>
> - start_kernel
> - setup_arch
> - early_cpu_init
> - early_identify_cpu --> setup X86_BUG_L1TF
> - parse_early_param
> - l1tf_cmdline --> set l1tf_mitigation
> - check_bugs
> - l1tf_select_mitigation --> set l1tf_mitigation
> - arch_call_rest_init
> - rest_init
> - kernel_init
> - kernel_init_freeable
> - do_basic_setup
> - do_initcalls --> calls swapfile_init() (initcall level 4)
>
> The swapfile size only depends on swp pte format on non-x86 archs, so
> caching it is safe too.
>
> Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because
> arch can define its own function, so it's more straightforward to have
> "arch_" as its prefix. At the meantime, keep the swapfile_maximum_size()
> function to fetch the value from the cache initialized in swapfile_init().
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> arch/x86/mm/init.c | 2 +-
> mm/swapfile.c | 10 +++++++++-
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 82a042c03824..9121bc1b9453 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
> }
>
> #ifdef CONFIG_SWAP
> -unsigned long max_swapfile_size(void)
> +unsigned long arch_max_swapfile_size(void)
> {
> unsigned long pages;
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1fdccd2f1422..794fa37bd0c3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
> /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> long total_swap_pages;
> static int least_priority = -1;
> +static unsigned long swapfile_maximum_size;
>
> static const char Bad_file[] = "Bad swap file entry ";
> static const char Unused_file[] = "Unused swap file entry ";
> @@ -2816,11 +2817,16 @@ unsigned long generic_max_swapfile_size(void)
> }
>
> /* Can be overridden by an architecture for additional checks. */
> -__weak unsigned long max_swapfile_size(void)
> +__weak unsigned long arch_max_swapfile_size(void)
> {
> return generic_max_swapfile_size();
> }
>
> +unsigned long max_swapfile_size(void)
> +{
> + return swapfile_maximum_size;
> +}
> +
It appears unnecessary to hide a variable with a function.  Why not just
use the variable directly?
Best Regards,
Huang, Ying
> static unsigned long read_swap_header(struct swap_info_struct *p,
> union swap_header *swap_header,
> struct inode *inode)
> @@ -3677,6 +3683,8 @@ static int __init swapfile_init(void)
> for_each_node(nid)
> plist_head_init(&swap_avail_heads[nid]);
>
> + swapfile_maximum_size = arch_max_swapfile_size();
> +
> return 0;
> }
> subsys_initcall(swapfile_init);
Peter Xu <[email protected]> writes:
> When page migration happens, we always ignore the young/dirty bit settings
> in the old pgtable, and marking the page as old in the new page table using
> either pte_mkold() or pmd_mkold(), and keeping the pte clean.
>
> That's fine from functional-wise, but that's not friendly to page reclaim
> because the moving page can be actively accessed within the procedure. Not
> to mention hardware setting the young bit can bring quite some overhead on
> some systems, e.g. x86_64 needs a few hundreds nanoseconds to set the bit.
> The same slowdown problem to dirty bits when the memory is first written
> after page migration happened.
>
> Actually we can easily remember the A/D bit configuration and recover the
> information after the page is migrated. To achieve it, define a new set of
> bits in the migration swap offset field to cache the A/D bits for old pte.
> Then when removing/recovering the migration entry, we can recover the A/D
> bits even if the page changed.
>
> One thing to mention is that here we used max_swapfile_size() to detect how
> many swp offset bits we have, and we'll only enable this feature if we know
> the swp offset can be big enough to store both the PFN value and the young
~~~~~
Nitpick: A/D
> bit. Otherwise the A/D bits are dropped like before.
>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
> mm/huge_memory.c | 18 +++++++-
> mm/migrate.c | 6 ++-
> mm/migrate_device.c | 4 ++
> mm/rmap.c | 5 ++-
> 5 files changed, 128 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index e1accbcd1136..0e9579b90659 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -8,6 +8,10 @@
>
> #ifdef CONFIG_MMU
>
> +#ifdef CONFIG_SWAP
> +#include <linux/swapfile.h>
> +#endif /* CONFIG_SWAP */
I don't think we need the comment here. The #ifdef is too near. But
this isn't a big deal.
Best Regards,
Huang, Ying
On Wed, Aug 10, 2022 at 02:33:02PM +0800, Huang, Ying wrote:
> Peter Xu <[email protected]> writes:
>
> > We used to have swapfile_maximum_size() fetching a maximum value of
> > swapfile size per-arch.
> >
> > As the caller of max_swapfile_size() grows, this patch introduce a variable
> > "swapfile_maximum_size" and cache the value of old max_swapfile_size(), so
> > that we don't need to calculate the value every time.
> >
> > Caching the value in swapfile_init() is safe because when reaching the
> > phase we should have initialized all the relevant information. Here the
> > major arch to look after is x86, which defines the max size based on L1TF
> > mitigation.
> >
> > Here both X86_BUG_L1TF or l1tf_mitigation should have been setup properly
> > when reaching swapfile_init(). As a reference, the code path looks like
> > this for x86:
> >
> > - start_kernel
> > - setup_arch
> > - early_cpu_init
> > - early_identify_cpu --> setup X86_BUG_L1TF
> > - parse_early_param
> > - l1tf_cmdline --> set l1tf_mitigation
> > - check_bugs
> > - l1tf_select_mitigation --> set l1tf_mitigation
> > - arch_call_rest_init
> > - rest_init
> > - kernel_init
> > - kernel_init_freeable
> > - do_basic_setup
> > - do_initcalls --> calls swapfile_init() (initcall level 4)
> >
> > The swapfile size only depends on swp pte format on non-x86 archs, so
> > caching it is safe too.
> >
> > Since at it, rename max_swapfile_size() to arch_max_swapfile_size() because
> > arch can define its own function, so it's more straightforward to have
> > "arch_" as its prefix. At the meantime, keep the swapfile_maximum_size()
> > function to fetch the value from the cache initialized in swapfile_init().
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > arch/x86/mm/init.c | 2 +-
> > mm/swapfile.c | 10 +++++++++-
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> > index 82a042c03824..9121bc1b9453 100644
> > --- a/arch/x86/mm/init.c
> > +++ b/arch/x86/mm/init.c
> > @@ -1054,7 +1054,7 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache)
> > }
> >
> > #ifdef CONFIG_SWAP
> > -unsigned long max_swapfile_size(void)
> > +unsigned long arch_max_swapfile_size(void)
> > {
> > unsigned long pages;
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 1fdccd2f1422..794fa37bd0c3 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -63,6 +63,7 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
> > /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> > long total_swap_pages;
> > static int least_priority = -1;
> > +static unsigned long swapfile_maximum_size;
> >
> > static const char Bad_file[] = "Bad swap file entry ";
> > static const char Unused_file[] = "Unused swap file entry ";
> > @@ -2816,11 +2817,16 @@ unsigned long generic_max_swapfile_size(void)
> > }
> >
> > /* Can be overridden by an architecture for additional checks. */
> > -__weak unsigned long max_swapfile_size(void)
> > +__weak unsigned long arch_max_swapfile_size(void)
> > {
> > return generic_max_swapfile_size();
> > }
> >
> > +unsigned long max_swapfile_size(void)
> > +{
> > + return swapfile_maximum_size;
> > +}
> > +
>
> It appears unnecessary to hide a variable with a function. Why not just
> use the variable directly.
Sure.
--
Peter Xu
On Wed, Aug 10, 2022 at 02:04:32PM +0800, Huang, Ying wrote:
> Peter Xu <[email protected]> writes:
>
> > We've got a bunch of special swap entries that stores PFN inside the swap
> > offset fields. To fetch the PFN, normally the user just calls swp_offset()
> > assuming that'll be the PFN.
> >
> > Add a helper swp_offset_pfn() to fetch the PFN instead, fetching only the
> > max possible length of a PFN on the host, meanwhile doing proper check with
> > MAX_PHYSMEM_BITS to make sure the swap offsets can actually store the PFNs
> > properly always using the BUILD_BUG_ON() in is_pfn_swap_entry().
> >
> > One reason to do so is we never tried to sanitize whether swap offset can
> > really fit for storing PFN. At the meantime, this patch also prepares us
> > with the future possibility to store more information inside the swp offset
> > field, so assuming "swp_offset(entry)" to be the PFN will not stand any
> > more very soon.
> >
> > Replace many of the swp_offset() callers to use swp_offset_pfn() where
> > proper. Note that many of the existing users are not candidates for the
> > replacement, e.g.:
> >
> > (1) When the swap entry is not a pfn swap entry at all, or,
> > (2) when we wanna keep the whole swp_offset but only change the swp type.
> >
> > For the latter, it can happen when fork() triggered on a write-migration
> > swap entry pte, we may want to only change the migration type from
> > write->read but keep the rest, so it's not "fetching PFN" but "changing
> > swap type only". They're left aside so that when there're more information
> > within the swp offset they'll be carried over naturally in those cases.
> >
> > Since at it, dropping hwpoison_entry_to_pfn() because that's exactly what
> > the new swp_offset_pfn() is about.
> >
> > Signed-off-by: Peter Xu <[email protected]>
>
> The patch itself looks good. But I searched swp_entry() in kernel
> source code, and found that we need to do more.
>
> For example, in pte_to_pagemap_entry()
>
> frame = swp_type(entry) |
> (swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>
> If it's a migration entry, we need
>
> frame = swp_type(entry) |
> (swp_offset_pfn(entry) << MAX_SWAPFILES_SHIFT);
>
> So I think you need to search all swp_offset() calling in the kernel
> source and check whether they need to be changed.
Yeah, I actually looked at all of them and explicitly left this one alone
since I wanted to dump the whole swp entry - even if it's called
"show_pfn" it has always dumped the whole entry; e.g., for genuine swap
entries it's not a PFN that is stored in the swp offset, so it's about the
swp offset itself rather than the PFN, IMHO.
But on second thought I agree it should be specially handled here, because
a user app could be relying on the offset being the pfn for migration
entries.  The other thing is I'm not sure whether the encoding of pagemap
entries can always fit both the pfn and the A/D bits (mainly,
PM_PFRAME_MASK) even if the arch swap pte fits; it needs more math.  So
unless necessary, it would be good to keep the A/D bits internal to the
kernel too.
Thanks for the careful review, I'll fix that.
--
Peter Xu
On Wed, Aug 10, 2022 at 02:24:33PM +0800, Huang, Ying wrote:
> Peter Xu <[email protected]> writes:
>
> > Carry over the dirty bit from pmd to pte when a huge pmd splits. It
> > shouldn't be a correctness issue since when pmd_dirty() we'll have the page
> > marked dirty anyway, however having dirty bit carried over helps the next
> > initial writes of split ptes on some archs like x86.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > mm/huge_memory.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 0611b2fd145a..e8e78d1bac5f 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2005,7 +2005,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > pgtable_t pgtable;
> > pmd_t old_pmd, _pmd;
> > bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> > - bool anon_exclusive = false;
> > + bool anon_exclusive = false, dirty = false;
> > unsigned long addr;
> > int i;
> >
> > @@ -2098,6 +2098,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > SetPageDirty(page);
> > write = pmd_write(old_pmd);
> > young = pmd_young(old_pmd);
> > + dirty = pmd_dirty(old_pmd);
>
> Nitpick: This can be put under
>
> if (pmd_dirty(old_pmd))
> SetPageDirty(page);
>
> Not a big deal.
>
> Reviewed-by: "Huang, Ying" <[email protected]>
Yeah will do, thanks.
--
Peter Xu
On Wed, Aug 10, 2022 at 02:30:33PM +0800, Huang, Ying wrote:
> Peter Xu <[email protected]> writes:
>
> > When page migration happens, we always ignore the young/dirty bit settings
> > in the old pgtable, marking the page as old in the new page table using
> > either pte_mkold() or pmd_mkold(), and keeping the pte clean.
> >
> > That's fine functionally, but it's not friendly to page reclaim because the
> > page being moved can be actively accessed within the procedure. Not to
> > mention that hardware setting the young bit can bring quite some overhead on
> > some systems, e.g. x86_64 needs a few hundred nanoseconds to set the bit.
> > The same slowdown applies to the dirty bit when the memory is first written
> > after the migration happened.
> >
> > Actually we can easily remember the A/D bit configuration and recover the
> > information after the page is migrated. To achieve it, define a new set of
> > bits in the migration swap offset field to cache the A/D bits for old pte.
> > Then when removing/recovering the migration entry, we can recover the A/D
> > bits even if the page changed.
> >
> > One thing to mention is that here we used max_swapfile_size() to detect how
> > many swp offset bits we have, and we'll only enable this feature if we know
> > the swp offset can be big enough to store both the PFN value and the young
> ~~~~~
> Nitpick: A/D
Fixed.
>
> > bit. Otherwise the A/D bits are dropped like before.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/swapops.h | 99 +++++++++++++++++++++++++++++++++++++++++
> > mm/huge_memory.c | 18 +++++++-
> > mm/migrate.c | 6 ++-
> > mm/migrate_device.c | 4 ++
> > mm/rmap.c | 5 ++-
> > 5 files changed, 128 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index e1accbcd1136..0e9579b90659 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -8,6 +8,10 @@
> >
> > #ifdef CONFIG_MMU
> >
> > +#ifdef CONFIG_SWAP
> > +#include <linux/swapfile.h>
> > +#endif /* CONFIG_SWAP */
>
> I don't think we need the comment here. The #ifdef is too near. But
> this isn't a big deal.
I'd slightly prefer keeping it (especially since Nadav used to complain about
missing comments on ifdefs in previous versions..) since any ifdef can grow as
code gets added into it. Then it'll be hard to justify how to define "near",
and hence who should be adding the comment later if it's not me.
Thanks,
--
Peter Xu
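For reference, roughly how the patch quoted above encodes the extra bits on top
of the PFN inside a migration entry's swp offset. This is a simplified sketch:
SWP_PFN_BITS is assumed to be the PFN field width introduced earlier in the
series, and the real header carries more helpers and comments:

#ifdef CONFIG_MIGRATION
/*
 * Migration swp offset layout (sketch):
 *
 *   | dirty | young |          PFN (SWP_PFN_BITS)          |
 */
#define SWP_MIG_YOUNG_BIT	(SWP_PFN_BITS)
#define SWP_MIG_DIRTY_BIT	(SWP_PFN_BITS + 1)
#define SWP_MIG_TOTAL_BITS	(SWP_PFN_BITS + 2)

static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
{
	if (migration_entry_supports_ad())
		return swp_entry(swp_type(entry),
				 swp_offset(entry) | BIT(SWP_MIG_YOUNG_BIT));
	return entry;
}

static inline bool is_migration_entry_young(swp_entry_t entry)
{
	if (migration_entry_supports_ad())
		return swp_offset(entry) & BIT(SWP_MIG_YOUNG_BIT);
	/* Keep the old behavior of treating the page as old */
	return false;
}
#endif /* CONFIG_MIGRATION */

The dirty bit gets the same pair of helpers, and the restore side (e.g.
remove_migration_pte()) would consult them roughly as:

	if (is_migration_entry_young(entry))
		pte = pte_mkyoung(pte);
	if (is_migration_entry_dirty(entry))
		pte = pte_mkdirty(pte);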
On Wed, Aug 10, 2022 at 02:37:40PM +0800, Huang, Ying wrote:
> Peter Xu <[email protected]> writes:
>
> > Introduce a variable swap_migration_ad_supported to cache whether the arch
> > supports swap migration A/D bits.
> >
> > Here one thing to mention is that SWP_MIG_TOTAL_BITS will internally
> > reference the other macro MAX_PHYSMEM_BITS, which is a function call on
> > x86 (constant on all the rest of archs).
> >
> > It's safe to reference it in swapfile_init() because by the time we reach
> > here we're already at initcall level 4, so the 5-level pgtable for x86_64
> > must have been initialized (right after early_identify_cpu() finishes).
> >
> > - start_kernel
> > - setup_arch
> > - early_cpu_init
> > - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
> > - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
> > - arch_call_rest_init
> > - rest_init
> > - kernel_init
> > - kernel_init_freeable
> > - do_basic_setup
> > - do_initcalls --> calls swapfile_init() (initcall level 4)
> >
> > This should slightly speed up the migration swap entry handling.
> >
> > Signed-off-by: Peter Xu <[email protected]>
> > ---
> > include/linux/swapfile.h | 1 +
> > include/linux/swapops.h | 7 +------
> > mm/swapfile.c | 8 ++++++++
> > 3 files changed, 10 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/swapfile.h b/include/linux/swapfile.h
> > index 54078542134c..87ec5e2cdb02 100644
> > --- a/include/linux/swapfile.h
> > +++ b/include/linux/swapfile.h
> > @@ -9,5 +9,6 @@
> > extern struct swap_info_struct *swap_info[];
> > extern unsigned long generic_max_swapfile_size(void);
> > extern unsigned long max_swapfile_size(void);
> > +extern bool swap_migration_ad_supported;
> >
> > #endif /* _LINUX_SWAPFILE_H */
> > diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> > index 0e9579b90659..e6afc77c51ad 100644
> > --- a/include/linux/swapops.h
> > +++ b/include/linux/swapops.h
> > @@ -301,13 +301,8 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
> > */
> > static inline bool migration_entry_supports_ad(void)
> > {
> > - /*
> > - * max_swapfile_size() returns the max supported swp-offset plus 1.
> > - * We can support the migration A/D bits iff the pfn swap entry has
> > - * the offset large enough to cover all of them (PFN, A & D bits).
> > - */
> > #ifdef CONFIG_SWAP
> > - return max_swapfile_size() >= (1UL << SWP_MIG_TOTAL_BITS);
> > + return swap_migration_ad_supported;
> > #else /* CONFIG_SWAP */
> > return false;
> > #endif /* CONFIG_SWAP */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 794fa37bd0c3..c49cf25f0d08 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -64,6 +64,9 @@ EXPORT_SYMBOL_GPL(nr_swap_pages);
> > long total_swap_pages;
> > static int least_priority = -1;
> > static unsigned long swapfile_maximum_size;
> > +#ifdef CONFIG_MIGRATION
> > +bool swap_migration_ad_supported;
> > +#endif /* CONFIG_MIGRATION */
> >
> > static const char Bad_file[] = "Bad swap file entry ";
> > static const char Unused_file[] = "Unused swap file entry ";
> > @@ -3685,6 +3688,11 @@ static int __init swapfile_init(void)
> >
> > swapfile_maximum_size = arch_max_swapfile_size();
> >
> > +#ifdef CONFIG_MIGRATION
> > + if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
> > + swap_migration_ad_supported = true;
> > +#endif /* CONFIG_MIGRATION */
> > +
> > return 0;
> > }
> > subsys_initcall(swapfile_init);
>
> I don't think it's necessary to add a variable for such a simple
> function and it's not a super hot path. But I don't have strong
> opinions here.
Logically, referencing SWP_MIG_TOTAL_BITS requires checking
MAX_PHYSMEM_BITS, which on x86_64 is defined as:
# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
Then, since swapfile.c doesn't have USE_EARLY_PGTABLE_L5 defined:
#define pgtable_l5_enabled() cpu_feature_enabled(X86_FEATURE_LA57)
Then,
#define cpu_feature_enabled(bit) \
(__builtin_constant_p(bit) && DISABLED_MASK_BIT_SET(bit) ? 0 : static_cpu_has(bit))
I think LA57 shouldn't be in DISABLED_MASK_BIT_SET() at all; in our case
the relevant disable mask is:
#define DISABLED_MASK16 (DISABLE_PKU|DISABLE_OSPKE|DISABLE_LA57|DISABLE_UMIP| \
DISABLE_ENQCMD)
Here we should have:
#ifdef CONFIG_X86_5LEVEL
# define DISABLE_LA57 0
#else
# define DISABLE_LA57 (1<<(X86_FEATURE_LA57 & 31))
#endif
So DISABLE_LA57 should be 0 when 5-level is enabled (true in my case). Then we
really should land in static_cpu_has().
I checked the generated code and surprisingly it's fairly fast indeed
(after fetching swapfile_maximum_size() and putting it into %rax; I'll change
that into a variable soon..):
0xffffffff83932e41 <+185>: mov $0x1,%edx
0xffffffff83932e46 <+190>: shl $0x24,%rdx
0xffffffff83932e4a <+194>: xor %r8d,%r8d
0xffffffff83932e4d <+197>: cmp %rdx,%rax
0xffffffff83932e50 <+200>: jb 0xffffffff83932e59 <swapfile_init+209>
0xffffffff83932e52 <+202>: movb $0x1,0xeab897(%rip) # 0xffffffff847de6f0 <swap_migration_ad_supported>
0xffffffff83932e59 <+209>: mov %r8d,%eax
Obviously on my test host SWP_MIG_TOTAL_BITS is directly set to $0x24
(which reflects a 4-level pgtable), but frankly I cannot tell how it did
that without checking the boot cpu's x86_capability flags.. I'm pretty sure my
kernel config has CONFIG_X86_5LEVEL=y.
It'd be great if anyone already knows why it can be optimized into a
constant, but even so I'm not confident it'll be a constant on all
hosts, or whether static_cpu_has() will still consume some insns.
Since the change is fairly simple on top of the previous patch, I think it'll
be nice to keep it too.
Thanks,
--
Peter Xu
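Summarizing the macro chain above in one place (a sketch; the 52/46 values are
the x86_64 MAX_PHYSMEM_BITS definition quoted earlier, and SWP_PFN_BITS is
assumed to be defined as (MAX_PHYSMEM_BITS - PAGE_SHIFT), as earlier in the
series):

	SWP_MIG_TOTAL_BITS
	  = SWP_PFN_BITS + 2                              /* PFN + young + dirty */
	  = (MAX_PHYSMEM_BITS - PAGE_SHIFT) + 2
	  = ((pgtable_l5_enabled() ? 52 : 46) - 12) + 2   /* x86_64, 4K pages */
	  = 42 (5-level) or 36 (0x24, 4-level)

So without the cached boolean, each migration_entry_supports_ad() call can end
up behind a runtime cpu_feature_enabled(X86_FEATURE_LA57) check, which is what
the cached swap_migration_ad_supported variable avoids.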
On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 27fb37d65476..699f821b8443 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> else
> entry = make_readable_migration_entry(
> page_to_pfn(page));
> + if (pte_young(pte))
> + entry = make_migration_entry_young(entry);
> + if (pte_dirty(pte))
> + entry = make_migration_entry_dirty(entry);
> swp_pte = swp_entry_to_pte(entry);
> if (pte_present(pte)) {
> if (pte_soft_dirty(pte))
This change needs to be wrapped with pte_present() at least..
I also just noticed that this change probably won't help anyway because:
(1) When ram->device, the pte will finally be replaced with a device
private entry, and device private entries do not yet support A/D, which
means A/D will be dropped again,
(2) When device->ram, we are missing information on the A/D bits either way;
even if device private entries start to support A/D, it's still not
clear whether we should also take device reads/writes into consideration
for the page A/D bits to be accurate.
I think I'll probably keep the code there for completeness, but I think it
won't really help much until more things are done.
--
Peter Xu
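A sketch of the pte_present() wrapping mentioned above (illustrative only; per
the discussion the hunk may still be reworked or dropped):

	if (pte_present(pte)) {
		/* Only a present pte carries valid hardware A/D bits */
		if (pte_young(pte))
			entry = make_migration_entry_young(entry);
		if (pte_dirty(pte))
			entry = make_migration_entry_dirty(entry);
	}
	swp_pte = swp_entry_to_pte(entry);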
Peter Xu <[email protected]> writes:
> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 27fb37d65476..699f821b8443 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>> else
>> entry = make_readable_migration_entry(
>> page_to_pfn(page));
>> + if (pte_young(pte))
>> + entry = make_migration_entry_young(entry);
>> + if (pte_dirty(pte))
>> + entry = make_migration_entry_dirty(entry);
>> swp_pte = swp_entry_to_pte(entry);
>> if (pte_present(pte)) {
>> if (pte_soft_dirty(pte))
>
> This change needs to be wrapped with pte_present() at least..
>
> I also just noticed that this change probably won't help anyway because:
>
> (1) When ram->device, the pte will finally be replaced with a device
> private entry, and device private entry does not yet support A/D, it
> means A/D will be dropped again,
>
> (2) When device->ram, we are missing information on either A/D bits, or
> even if device private entries start to suport A/D, it's still not
> clear whether we should take device read/write into considerations
> too on the page A/D bits to be accurate.
>
> I think I'll probably keep the code there for completeness, but I think it
> won't really help much until more things are done.
It appears that there are more issues. Between "pte = *ptep" and the pte
clear, the CPU may set the A/D bits in the PTE, so we may need to update the
pte when clearing the PTE. And I don't find the TLB is flushed in some cases
after PTE is cleared.
Best Regards,
Huang, Ying
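To illustrate the first point, a hedged sketch of re-reading the pte at clear
time so that any A/D updates the CPU makes after the earlier "pte = *ptep"
snapshot are not lost (ptep_get_and_clear() is the generic helper; the actual
fix may look different):

	/* Atomically clear the pte and fetch its final value */
	pte = ptep_get_and_clear(mm, addr, ptep);
	if (pte_young(pte))
		entry = make_migration_entry_young(entry);
	if (pte_dirty(pte))
		entry = make_migration_entry_dirty(entry);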
On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
> Peter Xu <[email protected]> writes:
>
> > On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
> >> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> >> index 27fb37d65476..699f821b8443 100644
> >> --- a/mm/migrate_device.c
> >> +++ b/mm/migrate_device.c
> >> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >> else
> >> entry = make_readable_migration_entry(
> >> page_to_pfn(page));
> >> + if (pte_young(pte))
> >> + entry = make_migration_entry_young(entry);
> >> + if (pte_dirty(pte))
> >> + entry = make_migration_entry_dirty(entry);
> >> swp_pte = swp_entry_to_pte(entry);
> >> if (pte_present(pte)) {
> >> if (pte_soft_dirty(pte))
> >
> > This change needs to be wrapped with pte_present() at least..
> >
> > I also just noticed that this change probably won't help anyway because:
> >
> > (1) When ram->device, the pte will finally be replaced with a device
> > private entry, and device private entry does not yet support A/D, it
> > means A/D will be dropped again,
> >
> > (2) When device->ram, we are missing information on either A/D bits, or
> > even if device private entries start to suport A/D, it's still not
> > clear whether we should take device read/write into considerations
> > too on the page A/D bits to be accurate.
> >
> > I think I'll probably keep the code there for completeness, but I think it
> > won't really help much until more things are done.
>
> It appears that there are more issues. Between "pte = *ptep" and pte
> clear, CPU may set A/D bit in PTE, so we may need to update pte when
> clearing PTE.
Agreed, I didn't see it as a huge problem with the current code, but it would
be better that way.
> And I don't find the TLB is flushed in some cases after PTE is cleared.
I think it's okay to not flush the TLB if the pte is not present. But maybe
you're talking about something else?
Thanks,
--
Peter Xu
On Aug 15, 2022, at 1:52 PM, Nadav Amit <[email protected]> wrote:
> On Aug 15, 2022, at 12:18 PM, Peter Xu <[email protected]> wrote:
>
>> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
>>> Peter Xu <[email protected]> writes:
>>>
>>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 27fb37d65476..699f821b8443 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> else
>>>>> entry = make_readable_migration_entry(
>>>>> page_to_pfn(page));
>>>>> + if (pte_young(pte))
>>>>> + entry = make_migration_entry_young(entry);
>>>>> + if (pte_dirty(pte))
>>>>> + entry = make_migration_entry_dirty(entry);
>>>>> swp_pte = swp_entry_to_pte(entry);
>>>>> if (pte_present(pte)) {
>>>>> if (pte_soft_dirty(pte))
>>>>
>>>> This change needs to be wrapped with pte_present() at least..
>>>>
>>>> I also just noticed that this change probably won't help anyway because:
>>>>
>>>> (1) When ram->device, the pte will finally be replaced with a device
>>>> private entry, and device private entry does not yet support A/D, it
>>>> means A/D will be dropped again,
>>>>
>>>> (2) When device->ram, we are missing information on either A/D bits, or
>>>> even if device private entries start to suport A/D, it's still not
>>>> clear whether we should take device read/write into considerations
>>>> too on the page A/D bits to be accurate.
>>>>
>>>> I think I'll probably keep the code there for completeness, but I think it
>>>> won't really help much until more things are done.
>>>
>>> It appears that there are more issues. Between "pte = *ptep" and pte
>>> clear, CPU may set A/D bit in PTE, so we may need to update pte when
>>> clearing PTE.
>>
>> Agreed, I didn't see it a huge problem with current code, but it should be
>> better in that way.
>>
>>> And I don't find the TLB is flushed in some cases after PTE is cleared.
>>
>> I think it's okay to not flush tlb if pte not present. But maybe you're
>> talking about something else?
>
> I think Huang refers to situation in which the PTE is cleared, still not
> flushed, and then A/D is being set by the hardware.
>
> At least on x86, the hardware is not supposed to do so. The only case I
> remember (and sometimes misremembers) is with KNL erratum, which perhaps
> needs to be considered:
>
> https://lore.kernel.org/all/[email protected]/
I keep not remembering this erratum correctly. IIRC, the erratum says that
the access/dirty might be set, but it does not mean that a write is possible
after the PTE is cleared (i.e., the dirty/access might be set on the
non-present PTE, but the access itself would fail). So it is not an issue in
this case - losing A/D would not impact correctness since the access should
fail.
Dave Hansen hates it when I get confused by this one, so I cc him in case he
wants to confirm.
[ Having said all of that, in general the lack of regard for
mm->tlb_flush_pending is always concerning in such functions. ]
On Aug 15, 2022, at 12:18 PM, Peter Xu <[email protected]> wrote:
> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
>> Peter Xu <[email protected]> writes:
>>
>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 27fb37d65476..699f821b8443 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>> else
>>>> entry = make_readable_migration_entry(
>>>> page_to_pfn(page));
>>>> + if (pte_young(pte))
>>>> + entry = make_migration_entry_young(entry);
>>>> + if (pte_dirty(pte))
>>>> + entry = make_migration_entry_dirty(entry);
>>>> swp_pte = swp_entry_to_pte(entry);
>>>> if (pte_present(pte)) {
>>>> if (pte_soft_dirty(pte))
>>>
>>> This change needs to be wrapped with pte_present() at least..
>>>
>>> I also just noticed that this change probably won't help anyway because:
>>>
>>> (1) When ram->device, the pte will finally be replaced with a device
>>> private entry, and device private entry does not yet support A/D, it
>>> means A/D will be dropped again,
>>>
>>> (2) When device->ram, we are missing information on either A/D bits, or
>>> even if device private entries start to suport A/D, it's still not
>>> clear whether we should take device read/write into considerations
>>> too on the page A/D bits to be accurate.
>>>
>>> I think I'll probably keep the code there for completeness, but I think it
>>> won't really help much until more things are done.
>>
>> It appears that there are more issues. Between "pte = *ptep" and pte
>> clear, CPU may set A/D bit in PTE, so we may need to update pte when
>> clearing PTE.
>
> Agreed, I didn't see it a huge problem with current code, but it should be
> better in that way.
>
>> And I don't find the TLB is flushed in some cases after PTE is cleared.
>
> I think it's okay to not flush tlb if pte not present. But maybe you're
> talking about something else?
I think Huang refers to a situation in which the PTE is cleared but still not
flushed, and then A/D is set by the hardware.
At least on x86, the hardware is not supposed to do so. The only case I
remember (and sometimes misremember) is the KNL erratum, which perhaps
needs to be considered:
https://lore.kernel.org/all/[email protected]/
Nadav Amit <[email protected]> writes:
> On Aug 15, 2022, at 12:18 PM, Peter Xu <[email protected]> wrote:
>
>> On Fri, Aug 12, 2022 at 10:32:48AM +0800, Huang, Ying wrote:
>>> Peter Xu <[email protected]> writes:
>>>
>>>> On Tue, Aug 09, 2022 at 06:00:58PM -0400, Peter Xu wrote:
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 27fb37d65476..699f821b8443 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -221,6 +221,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>> else
>>>>> entry = make_readable_migration_entry(
>>>>> page_to_pfn(page));
>>>>> + if (pte_young(pte))
>>>>> + entry = make_migration_entry_young(entry);
>>>>> + if (pte_dirty(pte))
>>>>> + entry = make_migration_entry_dirty(entry);
>>>>> swp_pte = swp_entry_to_pte(entry);
>>>>> if (pte_present(pte)) {
>>>>> if (pte_soft_dirty(pte))
>>>>
>>>> This change needs to be wrapped with pte_present() at least..
>>>>
>>>> I also just noticed that this change probably won't help anyway because:
>>>>
>>>> (1) When ram->device, the pte will finally be replaced with a device
>>>> private entry, and device private entry does not yet support A/D, it
>>>> means A/D will be dropped again,
>>>>
>>>> (2) When device->ram, we are missing information on either A/D bits, or
>>>> even if device private entries start to suport A/D, it's still not
>>>> clear whether we should take device read/write into considerations
>>>> too on the page A/D bits to be accurate.
>>>>
>>>> I think I'll probably keep the code there for completeness, but I think it
>>>> won't really help much until more things are done.
>>>
>>> It appears that there are more issues. Between "pte = *ptep" and pte
>>> clear, CPU may set A/D bit in PTE, so we may need to update pte when
>>> clearing PTE.
>>
>> Agreed, I didn't see it a huge problem with current code, but it should be
>> better in that way.
>>
>>> And I don't find the TLB is flushed in some cases after PTE is cleared.
>>
>> I think it's okay to not flush tlb if pte not present. But maybe you're
>> talking about something else?
>
> I think Huang refers to situation in which the PTE is cleared, still not
> flushed, and then A/D is being set by the hardware.
No. The situation I have in mind is that a PTE with A/D set is cleared but not
flushed. Then a parallel mprotect or munmap may cause race conditions.
As Alistair pointed out in another thread [1], the TLB flushing is done after
the PTL is unlocked, but I think we need to flush the TLB before unlocking.
This has been fixed in Alistair's latest version [2].
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/6e77914685ede036c419fa65b6adc27f25a6c3e9.1660635033.git-series.apopple@nvidia.com/
Best Regards,
Huang, Ying
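As a sketch of the ordering being argued for in migrate_vma_collect_pmd()
(variable names assumed from that function; the actual fix is in Alistair's
series linked above): flush the TLB for the range while still holding the page
table lock, so that once the PTL is dropped a parallel mprotect or munmap
cannot proceed while other CPUs still hold stale TLB entries for the cleared
ptes:

	if (unmapped)
		flush_tlb_range(walk->vma, start, end);
	/* only drop the PTL after the flush */
	pte_unmap_unlock(ptep - 1, ptl);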
On 8/15/22 14:03, Nadav Amit wrote:
>>
>> At least on x86, the hardware is not supposed to do so. The only case I
>> remember (and sometimes misremembers) is with KNL erratum, which perhaps
>> needs to be considered:
>>
>> https://lore.kernel.org/all/[email protected]/
> I keep not remembering this erratum correctly. IIRC, the erratum says that
> the access/dirty might be set, but it does not mean that a write is possible
> after the PTE is cleared (i.e., the dirty/access might be set on the
> non-present PTE, but the access itself would fail). So it is not an issue in
> this case - losing A/D would not impact correctness since the access should
> fail.
>
> Dave Hansen hates when I get confused with this one, but I cc him if he
> wants to confirm.
Right.
The issue is strictly with the page walker setting Accessed/Dirty in a
racy way. The TLB still has accurate contents at all times.