[Marking as RFC; only x86 is supported for now, plan to add a few more
archs when there's a formal version]
Problem
=======
When migrating a page, right now we always mark the migrated page as old.
The reason is probably that we don't really know whether the page is hot or
cold, so defaulting to cold looked like the safer choice.
However, that leads to at least two problems:
(1) We lose the real hot/cold information even though we could have
preserved it.  That information shouldn't change just because the backing
page changes after the migration.
(2) There is always extra overhead on the immediate next access to any
migrated page, because the hardware MMU needs extra cycles to set the young
bit again (on MMUs that support it).
Recent upstream work has shown that (2) is not trivial and is in fact very
measurable.  In my test case, reading a 1G chunk of memory - jumping in
page-size intervals - takes 99ms on a generic x86_64 system just because of
the extra work to set the young bit, compared to 4ms when the young bit is
already set.
This issue was originally reported by Andrea Arcangeli.
Solution
========
To solve this problem, this patchset tries to remember the young bit in the
migration entries and carry it over when recovering the ptes.
We have the chance to do so because on many systems the swap offset is not
really fully used.  Migration entries use the swp offset to store the PFN
only, and the PFN normally needs fewer bits than the swp offset provides.
That means we do have some free bits in the swp offset that we can use to
store things like the young bit, and that's how this series approaches the
problem.
One tricky thing here is that even though we're embedding the information
into the swap entry, which looks like a very generic data structure, the
number of free bits is still arch dependent - not only because the size of
swp_entry_t differs, but also because of the different layouts of swap ptes
on different archs.
Hence, this series requires each arch to define an extra macro,
__ARCH_SWP_OFFSET_BITS, which gives the width of the swp offset field.
With that information, the swap logic knows whether there are extra bits to
use and will remember the young bit whenever possible.  By default (i.e.
without the macro), it keeps the old behavior of marking all migrated pages
cold.
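To make that concrete, the migration swp offset layout used by patch 2
boils down to the sketch below (simplified; SWP_PFN_BITS, SWP_MIG_YOUNG_BIT
and SWP_MIG_OFFSET_BITS are the names this series introduces):
/*
 * Migration entry swp offset layout (see patch 2):
 *
 *   bits [0, SWP_PFN_BITS)           PFN of the page being migrated
 *   bit  SWP_PFN_BITS                SWP_MIG_YOUNG_BIT (old pte was young)
 *   bits [SWP_MIG_OFFSET_BITS, ...)  unused
 *
 * The young bit is only stored and consulted when __ARCH_SWP_OFFSET_BITS
 * leaves spare bits above the PFN field; otherwise we fall back to the
 * old "always cold after migration" behavior.
 */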
Tests
=====
With the patchset applied, the immediate read access test [1] on the 1G
chunk described above shrinks from 99ms to 4ms after migration.  The test
moves 1G worth of pages from node 0->1->0 and then reads the chunk in
page-size jumps.
Currently __ARCH_SWP_OFFSET_BITS is only defined on x86 for this series,
and it has only been tested on x86_64 with an Intel(R) Xeon(R) CPU E5-2630
v4 @ 2.20GHz.
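For reference, a minimal sketch of such a test follows.  This is not the
actual program in [1]; it is a simplified, hypothetical version which
assumes two NUMA nodes with the buffer starting on node 0, uses
move_pages(2) via libnuma's <numaif.h>, and must be linked with -lnuma:
#include <numaif.h>        /* move_pages(), MPOL_MF_MOVE */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK_SIZE (1UL << 30)  /* 1G */

/* Move every page of buf to the given node with move_pages(2) */
static void move_to_node(char *buf, long psize, int node)
{
	unsigned long i, npages = CHUNK_SIZE / psize;
	void **pages = malloc(npages * sizeof(void *));
	int *nodes = malloc(npages * sizeof(int));
	int *status = malloc(npages * sizeof(int));

	for (i = 0; i < npages; i++) {
		pages[i] = buf + i * psize;
		nodes[i] = node;
	}
	/* pid 0 means the calling process */
	if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	free(pages);
	free(nodes);
	free(status);
}

int main(void)
{
	long psize = sysconf(_SC_PAGESIZE);
	char *buf = aligned_alloc(psize, CHUNK_SIZE);
	struct timespec t1, t2;
	unsigned long i, sum = 0;

	memset(buf, 1, CHUNK_SIZE);   /* populate; ptes start out young */
	move_to_node(buf, psize, 1);  /* node 0 -> 1 */
	move_to_node(buf, psize, 0);  /* node 1 -> 0 */

	/* One read per page: each access may need the young bit set again */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	for (i = 0; i < CHUNK_SIZE; i += psize)
		sum += buf[i];
	clock_gettime(CLOCK_MONOTONIC, &t2);

	printf("strided read: %ld ms (sum %lu)\n",
	       (t2.tv_sec - t1.tv_sec) * 1000 +
	       (t2.tv_nsec - t1.tv_nsec) / 1000000, sum);
	return 0;
}
The timed strided read is where the 99ms vs 4ms difference above shows up.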
Patch Layout
============
Patch 1: Add swp_offset_pfn() and use it for all pfn swap entries; we
should also stop treating swp_offset() as a PFN, because it can carry
more information starting from the next patch.
Patch 2: The core patch to remember young bit in swap offsets.
Patch 3: A cleanup for x86 32 bits pgtable.h.
Patch 4: Define __ARCH_SWP_OFFSET_BITS on x86, enable young bit for migration
Please review, thanks.
[1] https://github.com/xzpeter/clibs/blob/master/misc/swap-young.c
Peter Xu (4):
mm/swap: Add swp_offset_pfn() to fetch PFN from swap entry
mm: Remember young bit for page migrations
mm/x86: Use SWP_TYPE_BITS in 3-level swap macros
mm/x86: Define __ARCH_SWP_OFFSET_BITS
arch/arm64/mm/hugetlbpage.c | 2 +-
arch/x86/include/asm/pgtable-2level.h | 6 ++
arch/x86/include/asm/pgtable-3level.h | 15 +++--
arch/x86/include/asm/pgtable_64.h | 5 ++
include/linux/swapops.h | 85 +++++++++++++++++++++++++--
mm/hmm.c | 2 +-
mm/huge_memory.c | 10 +++-
mm/memory-failure.c | 2 +-
mm/migrate.c | 4 +-
mm/migrate_device.c | 2 +
mm/page_vma_mapped.c | 6 +-
mm/rmap.c | 3 +-
12 files changed, 122 insertions(+), 20 deletions(-)
--
2.32.0
Replace all the magic "5"s with the SWP_TYPE_BITS macro.
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24..28421a887209 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -256,10 +256,10 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
/* We always extract/encode the offset by shifting it all the way up, and then down again */
#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT + SWP_TYPE_BITS)
-#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > 5)
-#define __swp_type(x) (((x).val) & 0x1f)
-#define __swp_offset(x) ((x).val >> 5)
-#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << 5})
+#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
+#define __swp_type(x) (((x).val) & ((1UL << SWP_TYPE_BITS) - 1))
+#define __swp_offset(x) ((x).val >> SWP_TYPE_BITS)
+#define __swp_entry(type, offset) ((swp_entry_t){(type) | (offset) << SWP_TYPE_BITS})
/*
* Normally, __swp_entry() converts from arch-independent swp_entry_t to
--
2.32.0
This enables the new migration young bit on all x86 systems, both 32-bit
and 64-bit (including PAE).
Signed-off-by: Peter Xu <[email protected]>
---
arch/x86/include/asm/pgtable-2level.h | 6 ++++++
arch/x86/include/asm/pgtable-3level.h | 7 +++++++
arch/x86/include/asm/pgtable_64.h | 5 +++++
3 files changed, 18 insertions(+)
diff --git a/arch/x86/include/asm/pgtable-2level.h b/arch/x86/include/asm/pgtable-2level.h
index 60d0f9015317..6e70833feb69 100644
--- a/arch/x86/include/asm/pgtable-2level.h
+++ b/arch/x86/include/asm/pgtable-2level.h
@@ -95,6 +95,12 @@ static inline unsigned long pte_bitop(unsigned long value, unsigned int rightshi
#define __pte_to_swp_entry(pte) ((swp_entry_t) { (pte).pte_low })
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })
+/*
+ * This defines how many bits we have in the arch specific swp offset.
+ * For vanilla 32-bit systems the pte and the swap entry have the same size.
+ */
+#define __ARCH_SWP_OFFSET_BITS (sizeof(swp_entry_t) * BITS_PER_BYTE - SWP_TYPE_BITS)
+
/* No inverted PFNs on 2 level page tables */
static inline u64 protnone_mask(u64 val)
diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index 28421a887209..8dbf29b51f8b 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -287,6 +287,13 @@ static inline pud_t native_pudp_get_and_clear(pud_t *pudp)
#define __pte_to_swp_entry(pte) (__swp_entry(__pteval_swp_type(pte), \
__pteval_swp_offset(pte)))
+/*
+ * This defines how many bits we have in the arch specific swp offset.
+ * Here, since we're putting the 32-bit swap entry into a 64-bit pte, the
+ * limit is the 32-bit swap entry minus the swap type field.
+ */
+#define __ARCH_SWP_OFFSET_BITS (sizeof(swp_entry_t) * BITS_PER_BYTE - SWP_TYPE_BITS)
+
#include <asm/pgtable-invert.h>
#endif /* _ASM_X86_PGTABLE_3LEVEL_H */
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index e479491da8d5..1714f0ded1db 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -217,6 +217,11 @@ static inline void native_pgd_clear(pgd_t *pgd)
/* We always extract/encode the offset by shifting it all the way up, and then down again */
#define SWP_OFFSET_SHIFT (SWP_OFFSET_FIRST_BIT+SWP_TYPE_BITS)
+/*
+ * This defines how many bits we have in the arch specific swp offset.  On
+ * 64-bit systems both swp_entry_t and the pte are 64 bits wide.
+ */
+#define __ARCH_SWP_OFFSET_BITS (BITS_PER_LONG - SWP_OFFSET_SHIFT)
#define MAX_SWAPFILES_CHECK() BUILD_BUG_ON(MAX_SWAPFILES_SHIFT > SWP_TYPE_BITS)
--
2.32.0
When page migration happens, we always ignore the young bit setting in the
old pgtable and mark the page as old in the new page table using either
pte_mkold() or pmd_mkold().
That's fine functionally, but it's not friendly to page reclaim, because
the page being moved may be actively accessed during the procedure.
Actually we can easily remember the young bit and recover that information
when the page is migrated.  To achieve it, define a new bit in the
migration swap offset field that records whether the old pte had the young
bit set.  Then, when removing/recovering the migration entry, we can
restore the young bit even though the backing page has changed.
One thing to mention is that the whole feature is based on an arch specific
macro, __ARCH_SWP_OFFSET_BITS, which needs to be defined per arch.  The
macro tells how many bits are available in the arch specific swp offset
field.  When that macro is not defined, we assume there are no free bits in
the migration swap entry offset, so we can't persist the young bit.
So far this patch should bring no functional change at all, since no arch
has defined __ARCH_SWP_OFFSET_BITS yet.
Signed-off-by: Peter Xu <[email protected]>
---
include/linux/swapops.h | 57 +++++++++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 10 ++++++--
mm/migrate.c | 4 ++-
mm/migrate_device.c | 2 ++
mm/rmap.c | 3 ++-
5 files changed, 72 insertions(+), 4 deletions(-)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 5378f77860fb..3bbb57aa6742 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -31,6 +31,28 @@
#define SWP_PFN_BITS (MAX_PHYSMEM_BITS - PAGE_SHIFT)
#define SWP_PFN_MASK ((1UL << SWP_PFN_BITS) - 1)
+#ifdef __ARCH_SWP_OFFSET_BITS
+#define SWP_PFN_OFFSET_FREE_BITS (__ARCH_SWP_OFFSET_BITS - SWP_PFN_BITS)
+#else
+/*
+ * If __ARCH_SWP_OFFSET_BITS is not defined, assume we don't have free bits,
+ * to be on the safe side.
+ */
+#define SWP_PFN_OFFSET_FREE_BITS 0
+#endif
+
+/**
+ * Migration swap entry specific bitfield definitions.
+ *
+ * @SWP_MIG_YOUNG_BIT: Whether the page used to have young bit set
+ *
+ * Note: these bits will be used only if there are free bits in the arch
+ * specific swp offset field.  An arch needs to define __ARCH_SWP_OFFSET_BITS
+ * to use the bits/features.
+ */
+#define SWP_MIG_YOUNG_BIT (1UL << SWP_PFN_BITS)
+#define SWP_MIG_OFFSET_BITS (SWP_PFN_BITS + 1)
+
/* Clear all flags but only keep swp_entry_t related information */
static inline pte_t pte_swp_clear_flags(pte_t pte)
{
@@ -258,6 +280,30 @@ static inline swp_entry_t make_writable_migration_entry(pgoff_t offset)
return swp_entry(SWP_MIGRATION_WRITE, offset);
}
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+ /*
+ * Due to a limitation on x86_64 we can't use #ifdef here, as the
+ * SWP_PFN_OFFSET_FREE_BITS value can change at runtime depending on
+ * whether 4 or 5 level pgtables are in use.  For all the non-x86_64
+ * archs (where the macro MAX_PHYSMEM_BITS is constant) this branch
+ * should be optimized out by the compiler.
+ */
+ if (SWP_PFN_OFFSET_FREE_BITS)
+ return swp_entry(swp_type(entry),
+ swp_offset(entry) | SWP_MIG_YOUNG_BIT);
+ return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+ /* Please refer to comment in make_migration_entry_young() */
+ if (SWP_PFN_OFFSET_FREE_BITS)
+ return swp_offset(entry) & SWP_MIG_YOUNG_BIT;
+ /* Keep the old behavior of aging page after migration */
+ return false;
+}
+
extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
spinlock_t *ptl);
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
@@ -304,6 +350,16 @@ static inline int is_readable_migration_entry(swp_entry_t entry)
return 0;
}
+static inline swp_entry_t make_migration_entry_young(swp_entry_t entry)
+{
+ return entry;
+}
+
+static inline bool is_migration_entry_young(swp_entry_t entry)
+{
+ return false;
+}
+
#endif
typedef unsigned long pte_marker;
@@ -407,6 +463,7 @@ static inline bool is_pfn_swap_entry(swp_entry_t entry)
{
/* Make sure the swp offset can always store the needed fields */
BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+ BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_MIG_OFFSET_BITS);
return is_migration_entry(entry) || is_device_private_entry(entry) ||
is_device_exclusive_entry(entry);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 29e3628687a6..131fe5754d8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2088,7 +2088,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
write = is_writable_migration_entry(entry);
if (PageAnon(page))
anon_exclusive = is_readable_exclusive_migration_entry(entry);
- young = false;
+ young = is_migration_entry_young(entry);
soft_dirty = pmd_swp_soft_dirty(old_pmd);
uffd_wp = pmd_swp_uffd_wp(old_pmd);
} else {
@@ -2146,6 +2146,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
else
swp_entry = make_readable_migration_entry(
page_to_pfn(page + i));
+ if (young)
+ swp_entry = make_migration_entry_young(swp_entry);
entry = swp_entry_to_pte(swp_entry);
if (soft_dirty)
entry = pte_swp_mksoft_dirty(entry);
@@ -3148,6 +3150,8 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
entry = make_readable_exclusive_migration_entry(page_to_pfn(page));
else
entry = make_readable_migration_entry(page_to_pfn(page));
+ if (pmd_young(pmdval))
+ entry = make_migration_entry_young(entry);
pmdswp = swp_entry_to_pmd(entry);
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
@@ -3173,13 +3177,15 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
entry = pmd_to_swp_entry(*pvmw->pmd);
get_page(new);
- pmde = pmd_mkold(mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot)));
+ pmde = mk_huge_pmd(new, READ_ONCE(vma->vm_page_prot));
if (pmd_swp_soft_dirty(*pvmw->pmd))
pmde = pmd_mksoft_dirty(pmde);
if (is_writable_migration_entry(entry))
pmde = maybe_pmd_mkwrite(pmde, vma);
if (pmd_swp_uffd_wp(*pvmw->pmd))
pmde = pmd_wrprotect(pmd_mkuffd_wp(pmde));
+ if (!is_migration_entry_young(entry))
+ pmde = pmd_mkold(pmde);
if (PageAnon(new)) {
rmap_t rmap_flags = RMAP_COMPOUND;
diff --git a/mm/migrate.c b/mm/migrate.c
index 1649270bc1a7..62cb3a9451de 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -199,7 +199,7 @@ static bool remove_migration_pte(struct folio *folio,
#endif
folio_get(folio);
- pte = pte_mkold(mk_pte(new, READ_ONCE(vma->vm_page_prot)));
+ pte = mk_pte(new, READ_ONCE(vma->vm_page_prot));
if (pte_swp_soft_dirty(*pvmw.pte))
pte = pte_mksoft_dirty(pte);
@@ -207,6 +207,8 @@ static bool remove_migration_pte(struct folio *folio,
* Recheck VMA as permissions can change since migration started
*/
entry = pte_to_swp_entry(*pvmw.pte);
+ if (!is_migration_entry_young(entry))
+ pte = pte_mkold(pte);
if (is_writable_migration_entry(entry))
pte = maybe_mkwrite(pte, vma);
else if (pte_swp_uffd_wp(*pvmw.pte))
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 7feeb447e3b9..fd8daf45c1a6 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -221,6 +221,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
else
entry = make_readable_migration_entry(
page_to_pfn(page));
+ if (pte_young(pte))
+ entry = make_migration_entry_young(entry);
swp_pte = swp_entry_to_pte(entry);
if (pte_present(pte)) {
if (pte_soft_dirty(pte))
diff --git a/mm/rmap.c b/mm/rmap.c
index af775855e58f..605fb37ae95e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2065,7 +2065,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
else
entry = make_readable_migration_entry(
page_to_pfn(subpage));
-
+ if (pte_young(pteval))
+ entry = make_migration_entry_young(entry);
swp_pte = swp_entry_to_pte(entry);
if (pte_soft_dirty(pteval))
swp_pte = pte_swp_mksoft_dirty(swp_pte);
--
2.32.0
On Jul 28, 2022, at 6:40 PM, Peter Xu <[email protected]> wrote:
> [Marking as RFC; only x86 is supported for now, plan to add a few more
> archs when there's a formal version]
>
> Problem
> =======
>
> When migrating a page, right now we always mark the migrated page as old.
> The reason is probably that we don't really know whether the page is hot or
> cold, so defaulting to cold looked like the safer choice.
Looks good to me.
I just wonder whether the order of the patches should be different. I always
understood that separating the “enabling” patch from the others is not a
good practice, since it complicates bisection. I guess it is more of a minor
issue for such a small patch-set…
On Fri, Jul 29, 2022 at 10:07:02AM -0700, Nadav Amit wrote:
> On Jul 28, 2022, at 6:40 PM, Peter Xu <[email protected]> wrote:
>
> > [Marking as RFC; only x86 is supported for now, plan to add a few more
> > archs when there's a formal version]
> >
> > Problem
> > =======
> >
> > When migrating a page, right now we always mark the migrated page as old.
> > The reason is probably that we don't really know whether the page is hot or
> > cold, so defaulting to cold looked like the safer choice.
>
> Looks good to me.
Thanks for the quick review comment, Nadav.
>
> I just wonder whether the order of the patches should be different. I always
> understood that separating the “enabling” patch from the others is not a
> good practice, since it complicates bisection. I guess it is more of a minor
> issue for such a small patch-set…
Yeah, I guess you mean that when there are a bunch of patches forming one
feature, we may want to be able to know which part of the feature breaks
something.  But as you mentioned, this feature is mostly implemented in
patch 2 only.
I can squash the enablement patch into the same patch, but when comes to
more archs it also means I'll squash all the archs into the same patch.
I'm just afraid it'll complicate that patch too much - I'd expect the swp
offset calculation for each arch may not be that straightforward, so it'll
be good if they can be reviewed separately and carefully.
--
Peter Xu
Peter Xu <[email protected]> writes:
> [...]
> Recent upstream work has shown that (2) is not trivial and is in fact very
> measurable.  In my test case, reading a 1G chunk of memory - jumping in
> page-size intervals - takes 99ms on a generic x86_64 system just because of
> the extra work to set the young bit, compared to 4ms when the young bit is
> already set.
LKP has observed that before too, as in the following reports and
discussion.
https://lore.kernel.org/all/[email protected]/t/
Best Regards,
Huang, Ying
Peter Xu <[email protected]> writes:
> [...]
> One tricky thing here is that even though we're embedding the information
> into the swap entry, which looks like a very generic data structure, the
> number of free bits is still arch dependent - not only because the size of
> swp_entry_t differs, but also because of the different layouts of swap ptes
> on different archs.
If my understanding were correct, max_swapfile_size() provides a
mechanism to identify the available bits with swp_entry_t and swap PTE
considered. We may take advantage of that?
And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
file size to MAX_PA/2"), the highest bit of swap offset needs to be 0 if
L1TF mitigation is enabled.
Cced Andi for confirmation.
Best Regards,
Huang, Ying
On 29.07.22 03:40, Peter Xu wrote:
> [...]
I played with a similar idea when working on pte_swp_exclusive() but
gave up, because it ended up looking too hacky. Looking at patch #2, I
get the same feeling again. Kind of hacky.
If we mostly only care about x86_64, and it's a performance improvement
after all, why not simply do it like
pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
--
Thanks,
David / dhildenb
On Mon, Aug 01, 2022 at 01:33:28PM +0800, Huang, Ying wrote:
> If my understanding were correct, max_swapfile_size() provides a
> mechanism to identify the available bits with swp_entry_t and swap PTE
> considered. We may take advantage of that?
That's an interesting trick, I'll have a closer look, thanks for the
pointer!
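If it indeed reflects the swap pte offset capacity (with things like the
x86 L1TF clamp already applied), the check might be as simple as something
like below - rough sketch only, untested, and the helper name is made up:
static bool migration_entry_supports_young(void)
{
	/*
	 * max_swapfile_size() returns the max number of swap pages the
	 * arch can address, i.e. roughly 1UL << (usable swp offset bits).
	 * The young bit sits right above the PFN, so we'd need at least
	 * SWP_MIG_OFFSET_BITS bits of offset space.
	 */
	return max_swapfile_size() >= (1UL << SWP_MIG_OFFSET_BITS);
}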
>
> And according to commit 377eeaa8e11f ("x86/speculation/l1tf: Limit swap
> file size to MAX_PA/2"), the highest bit of swap offset needs to be 0 if
> L1TF mitigation is enabled.
>
> Cced Andi for confirmation.
Yeah it'll be great to have a confirmation, hopefully max_swapfile_size()
should have covered that case.
--
Peter Xu
On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> On 29.07.22 03:40, Peter Xu wrote:
> > [...]
>
>
> I played with a similar idea when working on pte_swp_exclusive() but
> gave up, because it ended up looking too hacky. Looking at patch #2, I
> get the same feeling again. Kind of hacky.
Could you explain what's the "hacky" part you mentioned?
I used swap entry to avoid per-arch operations. I failed to figure out a
common way to know swp offset length myself so unluckily in this RFC I
still needed one macro per-arch. Ying's suggestion seems to be a good fit
here to me to remove the last arch-specific dependency.
>
>
> If we mostly only care about x86_64, and it's a performance improvement
> after all, why not simply do it like
> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
Page migration works for most archs, I want to have it work for all archs
that can easily benefit from it.
Besides I actually have a question on the anon exclusive bit in the swap
pte: since we have that anyway, why we need a specific migration type for
anon exclusive pages? Can it be simply read migration entries with anon
exclusive bit set?
Thanks,
--
Peter Xu
On 02.08.22 00:35, Peter Xu wrote:
> On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
>> On 29.07.22 03:40, Peter Xu wrote:
>>> [...]
>>
>>
>> I played with a similar idea when working on pte_swp_exclusive() but
>> gave up, because it ended up looking too hacky. Looking at patch #2, I
>> get the same feeling again. Kind of hacky.
>
> Could you explain what's the "hacky" part you mentioned?
SWP_PFN_OFFSET_FREE_BITS :)
It's a PFN offset and we're mangling in random other bits. That's hacky
IMHO.
I played with the idea of converting all code to store bits in addition
to the type + offset. But that requires digging through a lot of arch
code to teach that code about additional flags, so I discarded that idea
when working on the COW fixes.
>
> I used swap entry to avoid per-arch operations. I failed to figure out a
> common way to know swp offset length myself so unluckily in this RFC I
> still needed one macro per-arch. Ying's suggestion seems to be a good fit
> here to me to remove the last arch-specific dependency.
Instead of mangling this into the PFN offset and let the arch tell you
which bits of the PFN offset are unused ... rather remove the bits from
the offset and define them manually to have a certain meaning. That's
exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed to be
handled on architectures that want to support it.
I hope I could make it clearer what the hacky part is IMHO :)
>
>>
>>
>> If we mostly only care about x86_64, and it's a performance improvement
>> after all, why not simply do it like
>> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
>
> Page migration works for most archs, I want to have it work for all archs
> that can easily benefit from it.
Yet we only care about x86-64 IIUC regarding performance, just the way
the dirty bit is handled?
>
> Besides I actually have a question on the anon exclusive bit in the swap
> pte: since we have that anyway, why we need a specific migration type for
> anon exclusive pages? Can it be simply read migration entries with anon
> exclusive bit set?
Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
PTEs, you could even reuse that bit for migration entries and get at
at least the most relevant 64bit architectures supported easily.
--
Thanks,
David / dhildenb
> I don't think we only care about x86_64? Should other archs have the same
> issue as long as there's the hardware young bit?
>
> Even without it, it'll affect page reclaim logic too, and that's also not
> x86 only.
Okay, reading the cover letter and looking at the code my understanding
was that x86-64 is the real focus.
>>
>>>
>>> Besides I actually have a question on the anon exclusive bit in the swap
>>> pte: since we have that anyway, why we need a specific migration type for
>>> anon exclusive pages? Can it be simply read migration entries with anon
>>> exclusive bit set?
>>
>> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
>>
>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
>> PTEs, you could even reuse that bit for migration entries and get at
>> at least the most relevant 64bit architectures supported easily.
>
> Yes, but I think having two mechanisms for the single problem can confuse
> people.
>
It would be one bit with two different meanings depending on the swp type.
> IIUC the swap bit is already defined in major archs anyway, and since anon
> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
It kind-of is best effort, but the goal is to have all archs support it.
... just like the young bit here?
> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> only servicing that very minority.. which seems to be a pity to waste the
I have a big item on my todo list to support all, but I have different
priorities right now.
If there is no free bit, simply steal one from the offset ... which is
the same thing your approach would do, just in a different way, no?
> swp type on all archs even if the archs defined swp pte bits just for anon
> exclusive.
Why do we care? We talk about one type, not one bit.
--
Thanks,
David / dhildenb
On Tue, Aug 02, 2022 at 02:06:37PM +0200, David Hildenbrand wrote:
> On 02.08.22 00:35, Peter Xu wrote:
> > On Mon, Aug 01, 2022 at 10:21:32AM +0200, David Hildenbrand wrote:
> >> On 29.07.22 03:40, Peter Xu wrote:
> >>> [...]
> >>
> >>
> >> I played with a similar idea when working on pte_swp_exclusive() but
> >> gave up, because it ended up looking too hacky. Looking at patch #2, I
> >> get the same feeling again. Kind of hacky.
> >
> > Could you explain what's the "hacky" part you mentioned?
>
> SWP_PFN_OFFSET_FREE_BITS :)
>
> It's a PFN offset and we're mangling in random other bits. That's hacky
> IMHO.
>
> I played with the idea of converting all code to store bits in addition
> to the type + offset. But that requires digging through a lot of arch
> code to teach that code about additional flags, so I discarded that idea
> when working on the COW fixes.
Having SWP_PFN_OFFSET_FREE_BITS was the cleanest approach I could think of
before I knew the max_swapfile_size() trick.  It only needs the arch to
define this one macro, and swapops.h will take action accordingly.
OTOH, I don't want to use swap pte bits, not only because there aren't a
lot left on some archs (x86_64 only has bit 4 left, afaict), but also
because this is a page migration issue, so it'll be nicer to me if it can
be solved at the swp entry level, not the pte level.
>
> >
> > I used swap entry to avoid per-arch operations. I failed to figure out a
> > common way to know swp offset length myself so unluckily in this RFC I
> > still needed one macro per-arch. Ying's suggestion seems to be a good fit
> > here to me to remove the last arch-specific dependency.
>
> Instead of mangling this into the PFN offset and let the arch tell you
> which bits of the PFN offset are unused ... rather remove the bits from
> the offset and define them manually to have a certain meaning. That's
> exactly how pte_swp_mkexclusive/pte_swp_exclusive/ is supposed to be
> handled on architectures that want to support it.
>
> I hope I could make it clearer what the hacky part is IMHO :)
>
> >
> >>
> >>
> >> If we mostly only care about x86_64, and it's a performance improvement
> >> after all, why not simply do it like
> >> pte_swp_mkexclusive/pte_swp_exclusive/ ... and reuse a spare PTE bit?
> >
> > Page migration works for most archs, I want to have it work for all archs
> > that can easily benefit from it.
>
> Yet we only care about x86-64 IIUC regarding performance, just the way
> the dirty bit is handled?
I don't think we only care about x86_64? Should other archs have the same
issue as long as there's the hardware young bit?
Even without it, it'll affect page reclaim logic too, and that's also not
x86 only.
>
> >
> > Besides I actually have a question on the anon exclusive bit in the swap
> > pte: since we have that anyway, why we need a specific migration type for
> > anon exclusive pages? Can it be simply read migration entries with anon
> > exclusive bit set?
>
> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
>
> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> PTEs, you could even reuse that bit for migration entries and get at
> at least the most relevant 64bit architectures supported easily.
Yes, but I think having two mechanisms for the single problem can confuse
people.
IIUC the swap bit is already defined in major archs anyway, and since anon
exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
only servicing that very minority.. which seems to be a pity to waste the
swp type on all archs even if the archs defined swp pte bits just for anon
exclusive.
Thanks,
--
Peter Xu
On 02.08.22 22:35, Peter Xu wrote:
> On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
>>> I don't think we only care about x86_64? Should other archs have the same
>>> issue as long as there's the hardware young bit?
>>>
>>> Even without it, it'll affect page reclaim logic too, and that's also not
>>> x86 only.
>>
>> Okay, reading the cover letter and looking at the code my understanding
>> was that x86-64 is the real focus.
>>
>>>>
>>>>>
>>>>> Besides I actually have a question on the anon exclusive bit in the swap
>>>>> pte: since we have that anyway, why we need a specific migration type for
>>>>> anon exclusive pages? Can it be simply read migration entries with anon
>>>>> exclusive bit set?
>>>>
>>>> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
>>>>
>>>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
>>>> PTEs, you could even reuse that bit for migration entries and get at
>>>> at least the most relevant 64bit architectures supported easily.
>>>
>>> Yes, but I think having two mechanisms for the single problem can confuse
>>> people.
>>>
>>
>> It would be one bit with two different meanings depending on the swp type.
>>
>>> IIUC the swap bit is already defined in major archs anyway, and since anon
>>> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
>>
>> It kind-of is best effort, but the goal is to have all archs support it.
>>
>> ... just like the young bit here?
>
> Exactly, so I'm also wondering whether we can move the swp pte anon
> exclusive bit into swp entry. It just sounds weird to have them defined in
> two ways.
I'd argue it's just the swp vs. nonswp difference that are in fact two
different concepts (device+offset vs. type+pte). And some dirty details
how swp entries are actually used.
With swp entries you have to be very careful, for example, take a look
at radix_to_swp_entry() and swp_to_radix_entry(). That made me refrain
from touching anything inside actual swp entries and instead store it in
the pte.
>
>>
>>> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
>>> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
>>> only servicing that very minority.. which seems to be a pity to waste the
>>
>> I have a big item on my todo list to support all, but I have different
>> priorities right now.
>>
>> If there is no free bit, simply steal one from the offset ... which is
>> the same thing your approach would do, just in a different way, no?
>>
>>> swp type on all archs even if the archs defined swp pte bits just for anon
>>> exclusive.
>>
>> Why do we care? We talk about one type, not one bit.
>
> The swap type address space is still limited, I'd say we should save when
> possible. I believe people caring about swapping care about the limit of
> swap devices too. If the offset can keep it, I think it's better than the
Ehm, last time I did the math I came to the conclusion that nobody
cares. Let me redo the math:
MAX_SWAPFILES = 1<<5 - 1 - 1 - 4 - 3 - 1 = 22
Which is the worst case right now with all kinds of oddity compiled in
(sorry CONFIG_DEVICE_PRIVATE).
So far nobody complaint.
> swap type. De-dup either the type or the swap pte bit would be nicer, imho.
>
If you manage bits in the pte manually, you might be able to get a
better packing density, if bits are scattered around. Just take a look
at the x86_64 location of _PAGE_SWP_EXCLUSIVE.
What I'm rooting for is something like
#define pte_nonswp_mkyoung pte_swp_mkexclusive
Eventually with some VM_BUG_ONs to make sure people call it on the right
swp ptes.
If we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE (so people
can have 23 swap devices), and eventually have separate bits for both.
For now it's not necessary.
--
Thanks,
David / dhildenb
On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> > I don't think we only care about x86_64? Should other archs have the same
> > issue as long as there's the hardware young bit?
> >
> > Even without it, it'll affect page reclaim logic too, and that's also not
> > x86 only.
>
> Okay, reading the cover letter and looking at the code my understanding
> was that x86-64 is the real focus.
>
> >>
> >>>
> >>> Besides I actually have a question on the anon exclusive bit in the swap
> >>> pte: since we have that anyway, why we need a specific migration type for
> >>> anon exclusive pages? Can it be simply read migration entries with anon
> >>> exclusive bit set?
> >>
> >> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
> >>
> >> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> >> PTEs, you could even reuse that bit for migration entries and get at
> >> at least the most relevant 64bit architectures supported easily.
> >
> > Yes, but I think having two mechanisms for the single problem can confuse
> > people.
> >
>
> It would be one bit with two different meanings depending on the swp type.
>
> > IIUC the swap bit is already defined in major archs anyway, and since anon
> > exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
>
> It kind-of is best effort, but the goal is to have all archs support it.
>
> ... just like the young bit here?
Exactly, so I'm also wondering whether we can move the swp pte anon
exclusive bit into swp entry. It just sounds weird to have them defined in
two ways.
>
> > archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> > migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> > only servicing that very minority.. which seems to be a pity to waste the
>
> I have a big item on my todo list to support all, but I have different
> priorities right now.
>
> If there is no free bit, simply steal one from the offset ... which is
> the same thing your approach would do, just in a different way, no?
>
> > swp type on all archs even if the archs defined swp pte bits just for anon
> > exclusive.
>
> Why do we care? We talk about one type, not one bit.
The swap type address space is still limited, I'd say we should save when
possible. I believe people caring about swapping care about the limit of
swap devices too. If the offset can keep it, I think it's better than the
swap type. De-dup either the type or the swap pte bit would be nicer, imho.
--
Peter Xu
On Tue, Aug 02, 2022 at 10:59:42PM +0200, David Hildenbrand wrote:
> On 02.08.22 22:35, Peter Xu wrote:
> > On Tue, Aug 02, 2022 at 10:23:49PM +0200, David Hildenbrand wrote:
> >>> I don't think we only care about x86_64? Should other archs have the same
> >>> issue as long as there's the hardware young bit?
> >>>
> >>> Even without it, it'll affect page reclaim logic too, and that's also not
> >>> x86 only.
> >>
> >> Okay, reading the cover letter and looking at the code my understanding
> >> was that x86-64 is the real focus.
> >>
> >>>>
> >>>>>
> >>>>> Besides I actually have a question on the anon exclusive bit in the swap
> >>>>> pte: since we have that anyway, why we need a specific migration type for
> >>>>> anon exclusive pages? Can it be simply read migration entries with anon
> >>>>> exclusive bit set?
> >>>>
> >>>> Not before all arch support pte_swp_mkexclusive/pte_swp_exclusive/.
> >>>>
> >>>> As pte_swp_mkexclusive/pte_swp_exclusive/ only applies to actual swap
> >>>> PTEs, you could even reuse that bit for migration entries and get at
> >>>> at least the most relevant 64bit architectures supported easily.
> >>>
> >>> Yes, but I think having two mechanisms for the single problem can confuse
> >>> people.
> >>>
> >>
> >> It would be one bit with two different meanings depending on the swp type.
> >>
> >>> IIUC the swap bit is already defined in major archs anyway, and since anon
> >>> exclusive bit is best-effort (or am I wrong?..), I won't worry too much on
> >>
> >> It kind-of is best effort, but the goal is to have all archs support it.
> >>
> >> ... just like the young bit here?
> >
> > Exactly, so I'm also wondering whether we can move the swp pte anon
> > exclusive bit into swp entry. It just sounds weird to have them defined in
> > two ways.
>
> I'd argue it's just the swp vs. nonswp difference that are in fact two
> different concepts (device+offset vs. type+pte). And some dirty details
> how swp entries are actually used.
>
> With swp entries you have to be very careful, for example, take a look
> at radix_to_swp_entry() and swp_to_radix_entry(). That made me refrain
> from touching anything inside actual swp entries and instead store it in
> the pte.
I don't really see any risk - it neither touches the swp entry nor does
anything complicated around it (shift by 1 and set bit 0 to 1).  Please
feel free to share your concern in case I overlooked something, though.
>
> >
> >>
> >>> archs outside x86/arm/ppc/s390 on having anon exclusive bit lost during
> >>> migrations, because afaict the whole swap type of ANON_EXCLUSIVE_READ is
> >>> only servicing that very minority.. which seems to be a pity to waste the
> >>
> >> I have a big item on my todo list to support all, but I have different
> >> priorities right now.
> >>
> >> If there is no free bit, simply steal one from the offset ... which is
> >> the same thing your approach would do, just in a different way, no?
> >>
> >>> swp type on all archs even if the archs defined swp pte bits just for anon
> >>> exclusive.
> >>
> >> Why do we care? We talk about one type, not one bit.
> >
> > The swap type address space is still limited, I'd say we should save when
> > possible. I believe people caring about swapping care about the limit of
> > swap devices too. If the offset can keep it, I think it's better than the
>
> Ehm, last time I did the math I came to the conclusion that nobody
> cares. Let me redo the math:
>
> MAX_SWAPFILES = 1<<5 - 1 - 1 - 4 - 3 - 1 = 22
>
> Which is the worst case right now with all kinds of oddity compiled in
> (sorry CONFIG_DEVICE_PRIVATE).
>
> So far nobody complaint.
Yeah.  To me using one bit of it is fine, especially if that's the best we
can do.  What confuses me here is that we have two ways to represent
"whether this page is anon exclusive" in a swap pte: either we encode it in
the type or we use the bit.  The tricky part is that whether the swap pte
bit makes sense depends on the swp type in the first place, while the swap
type can itself be "anon exclusive read migration".
>
> > swap type. De-dup either the type or the swap pte bit would be nicer, imho.
> >
>
> If you manage bits in the pte manually, you might be able to get a
> better packing density, if bits are scattered around. Just take a look
> at the x86_64 location of _PAGE_SWP_EXCLUSIVE.
>
> What I'm rooting for is something like
>
> #define pte_nonswp_mkyoung pte_swp_mkexclusive
>
> Eventually with some VM_BUG_ONs to make sure people call it on the right
> swp ptes.
>
> If we ever want to get rid of SWP_MIGRATION_READ_EXCLUSIVE (so people
> can have 23 swap devices), and eventually have separate bits for both.
> For now it's not necessary.
Okay, but that's probably the last thing I'd like to try - I don't want to
further complicate the anon exclusive bit in swap pte.. I'd think cleaning
that up somehow would be nice, but as you mentioned if no one complains I
have no strong opinion either.
Thanks,
--
Peter Xu