2023-03-15 05:16:45

by Matthew Wilcox

Subject: [PATCH v4 00/36] New page table range API

This patchset changes the API used by the MM to set up page table entries.
The four APIs are:
set_ptes(mm, addr, ptep, pte, nr)
update_mmu_cache_range(vma, addr, ptep, nr)
flush_dcache_folio(folio)
flush_icache_pages(vma, page, nr)

flush_dcache_folio() isn't technically new, but no architecture
implemented it, so I've done that for you. The old APIs remain around
but are mostly implemented by calling the new interfaces.

The new APIs are based around setting up N page table entries at once.
The N entries belong to the same PMD, the same folio and the same VMA,
so ptep++ is a legitimate operation, and locking is taken care of for
you. Some architectures can do a better job of it than just a loop,
but I have hesitated to make too deep a change to architectures I don't
understand well.
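
As a rough illustration, the fallback on most architectures is just a
loop. Here is a simplified sketch, modelled on the sparc64 set_ptes()
later in this series; __set_pte_at() is sparc64's existing single-entry
helper, and how the pte value is advanced to the next page is
architecture-specific (sparc64 can simply add PAGE_SIZE):

static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
		pte_t *ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		/* the architecture's existing single-entry helper */
		__set_pte_at(mm, addr, ptep, pte, 0);
		if (--nr == 0)
			break;
		ptep++;				/* all entries share one PMD */
		pte_val(pte) += PAGE_SIZE;	/* advance to the next page */
		addr += PAGE_SIZE;
	}
}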

One thing I have changed in every architecture is that PG_arch_1 is now
a per-folio bit instead of a per-page bit. This would have had to happen
eventually, and it makes more sense to do it now than to iterate over
every page involved in a cache flush and figure out whether each one
needs to be flushed.
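
This ties into the deferred-flush pattern described in cachetlb.rst:
the architecture marks the folio at flush_dcache_folio() time and does
the real flush later in update_mmu_cache_range(). A purely illustrative
sketch of that pattern, now tracked per folio (arch_flush_dcache_folio()
is a made-up stand-in for whatever the architecture actually does):

void flush_dcache_folio(struct folio *folio)
{
	struct address_space *mapping = folio_flush_mapping(folio);

	if (mapping && !mapping_mapped(mapping)) {
		/* No user mappings yet: defer the flush. */
		set_bit(PG_arch_1, &folio->flags);
		return;
	}
	arch_flush_dcache_folio(folio);
}

void update_mmu_cache_range(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, unsigned int nr)
{
	struct folio *folio = page_folio(pfn_to_page(pte_pfn(*ptep)));

	if (test_and_clear_bit(PG_arch_1, &folio->flags))
		arch_flush_dcache_folio(folio);
}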

The point of all this is better performance, and Fengwei Yin has
measured improvement on x86. I suspect you'll see improvement on
your architecture too. Try the new will-it-scale test mentioned here:
https://lore.kernel.org/linux-mm/[email protected]/
You'll need to run it on an XFS filesystem and have
CONFIG_TRANSPARENT_HUGEPAGE set.

For testing, I've only run the code on x86. Where an x86->foo
cross-compiler exists in Debian, I've built the defconfig for that
architecture. I'm relying on the buildbots to tell me what I missed,
and on people who actually have the hardware to tell me whether it
actually works.

I'd like to get this into the MM tree soon, so quick feedback would
be appreciated.

v4:
- Fix a few compile errors (mostly Mike Rapoport)
- Incorporate Mike's suggestion to avoid having to define set_ptes()
or set_pte_at() on the majority of architectures
- Optimise m68k's __flush_pages_to_ram (Geert Uytterhoeven)
- Fix sun3 (me)
- Fix sparc32 (me)
- Pick up a few more Ack/Reviewed tags

v3:
- Reinstate flush_dcache_icache_phys() on PowerPC
- Fix folio_flush_mapping(). The documentation was correct and the
implementation was completely wrong
- Change the flush_dcache_page() documentation to describe
flush_dcache_folio() instead
- Split ARM from ARC. I messed up my git commands
- Remove page_mapping_file()
- Rationalise how flush_icache_pages() and flush_icache_page() are defined
- Use flush_icache_pages() in do_set_pmd()
- Pick up Guo Ren's Ack for csky

Matthew Wilcox (Oracle) (32):
mm: Convert page_table_check_pte_set() to page_table_check_ptes_set()
mm: Add generic flush_icache_pages() and documentation
mm: Add folio_flush_mapping()
mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
mm: Add default definition of set_ptes()
alpha: Implement the new page table range API
arc: Implement the new page table range API
arm: Implement the new page table range API
arm64: Implement the new page table range API
csky: Implement the new page table range API
hexagon: Implement the new page table range API
ia64: Implement the new page table range API
loongarch: Implement the new page table range API
m68k: Implement the new page table range API
microblaze: Implement the new page table range API
mips: Implement the new page table range API
nios2: Implement the new page table range API
openrisc: Implement the new page table range API
parisc: Implement the new page table range API
powerpc: Implement the new page table range API
riscv: Implement the new page table range API
s390: Implement the new page table range API
superh: Implement the new page table range API
sparc32: Implement the new page table range API
sparc64: Implement the new page table range API
um: Implement the new page table range API
x86: Implement the new page table range API
xtensa: Implement the new page table range API
mm: Remove page_mapping_file()
mm: Rationalise flush_icache_pages() and flush_icache_page()
mm: Tidy up set_ptes definition
mm: Use flush_icache_pages() in do_set_pmd()

Yin Fengwei (4):
filemap: Add filemap_map_folio_range()
rmap: add folio_add_file_rmap_range()
mm: Convert do_set_pte() to set_pte_range()
filemap: Batch PTE mappings

Documentation/core-api/cachetlb.rst | 51 +++++-----
Documentation/filesystems/locking.rst | 2 +-
arch/alpha/include/asm/cacheflush.h | 13 ++-
arch/alpha/include/asm/pgtable.h | 9 +-
arch/arc/include/asm/cacheflush.h | 14 +--
arch/arc/include/asm/pgtable-bits-arcv2.h | 11 +--
arch/arc/include/asm/pgtable-levels.h | 1 +
arch/arc/mm/cache.c | 61 +++++++-----
arch/arc/mm/tlb.c | 18 ++--
arch/arm/include/asm/cacheflush.h | 29 +++---
arch/arm/include/asm/pgtable.h | 5 +-
arch/arm/include/asm/tlbflush.h | 13 ++-
arch/arm/mm/copypage-v4mc.c | 5 +-
arch/arm/mm/copypage-v6.c | 5 +-
arch/arm/mm/copypage-xscale.c | 5 +-
arch/arm/mm/dma-mapping.c | 24 ++---
arch/arm/mm/fault-armv.c | 14 +--
arch/arm/mm/flush.c | 99 +++++++++++--------
arch/arm/mm/mm.h | 2 +-
arch/arm/mm/mmu.c | 14 ++-
arch/arm64/include/asm/cacheflush.h | 4 +-
arch/arm64/include/asm/pgtable.h | 25 +++--
arch/arm64/mm/flush.c | 36 +++----
arch/csky/abiv1/cacheflush.c | 32 ++++---
arch/csky/abiv1/inc/abi/cacheflush.h | 3 +-
arch/csky/abiv2/cacheflush.c | 32 +++----
arch/csky/abiv2/inc/abi/cacheflush.h | 11 ++-
arch/csky/include/asm/pgtable.h | 8 +-
arch/hexagon/include/asm/cacheflush.h | 9 +-
arch/hexagon/include/asm/pgtable.h | 9 +-
arch/ia64/hp/common/sba_iommu.c | 26 ++---
arch/ia64/include/asm/cacheflush.h | 14 ++-
arch/ia64/include/asm/pgtable.h | 4 +-
arch/ia64/mm/init.c | 28 ++++--
arch/loongarch/include/asm/cacheflush.h | 2 +-
arch/loongarch/include/asm/pgtable-bits.h | 4 +-
arch/loongarch/include/asm/pgtable.h | 28 +++---
arch/loongarch/mm/pgtable.c | 2 +-
arch/loongarch/mm/tlb.c | 2 +-
arch/m68k/include/asm/cacheflush_mm.h | 26 +++--
arch/m68k/include/asm/mcf_pgtable.h | 1 +
arch/m68k/include/asm/motorola_pgtable.h | 1 +
arch/m68k/include/asm/pgtable_mm.h | 9 +-
arch/m68k/include/asm/sun3_pgtable.h | 1 +
arch/m68k/mm/motorola.c | 2 +-
arch/microblaze/include/asm/cacheflush.h | 8 ++
arch/microblaze/include/asm/pgtable.h | 15 +--
arch/microblaze/include/asm/tlbflush.h | 4 +-
arch/mips/bcm47xx/prom.c | 2 +-
arch/mips/include/asm/cacheflush.h | 32 ++++---
arch/mips/include/asm/pgtable-32.h | 10 +-
arch/mips/include/asm/pgtable-64.h | 6 +-
arch/mips/include/asm/pgtable-bits.h | 6 +-
arch/mips/include/asm/pgtable.h | 44 +++++----
arch/mips/mm/c-r4k.c | 5 +-
arch/mips/mm/cache.c | 56 +++++------
arch/mips/mm/init.c | 21 ++--
arch/mips/mm/pgtable-32.c | 2 +-
arch/mips/mm/pgtable-64.c | 2 +-
arch/mips/mm/tlbex.c | 2 +-
arch/nios2/include/asm/cacheflush.h | 6 +-
arch/nios2/include/asm/pgtable.h | 28 ++++--
arch/nios2/mm/cacheflush.c | 62 ++++++------
arch/openrisc/include/asm/cacheflush.h | 8 +-
arch/openrisc/include/asm/pgtable.h | 14 ++-
arch/openrisc/mm/cache.c | 12 ++-
arch/parisc/include/asm/cacheflush.h | 14 ++-
arch/parisc/include/asm/pgtable.h | 37 +++++---
arch/parisc/kernel/cache.c | 101 ++++++++++++++------
arch/powerpc/include/asm/book3s/pgtable.h | 10 +-
arch/powerpc/include/asm/cacheflush.h | 14 ++-
arch/powerpc/include/asm/kvm_ppc.h | 10 +-
arch/powerpc/include/asm/nohash/pgtable.h | 13 +--
arch/powerpc/include/asm/pgtable.h | 6 ++
arch/powerpc/mm/book3s64/hash_utils.c | 11 ++-
arch/powerpc/mm/cacheflush.c | 40 +++-----
arch/powerpc/mm/nohash/e500_hugetlbpage.c | 3 +-
arch/powerpc/mm/pgtable.c | 51 +++++-----
arch/riscv/include/asm/cacheflush.h | 19 ++--
arch/riscv/include/asm/pgtable.h | 26 +++--
arch/riscv/mm/cacheflush.c | 11 +--
arch/s390/include/asm/pgtable.h | 33 +++++--
arch/sh/include/asm/cacheflush.h | 21 ++--
arch/sh/include/asm/pgtable.h | 6 +-
arch/sh/include/asm/pgtable_32.h | 5 +-
arch/sh/mm/cache-j2.c | 4 +-
arch/sh/mm/cache-sh4.c | 26 +++--
arch/sh/mm/cache-sh7705.c | 26 +++--
arch/sh/mm/cache.c | 52 +++++-----
arch/sh/mm/kmap.c | 3 +-
arch/sparc/include/asm/cacheflush_32.h | 9 +-
arch/sparc/include/asm/cacheflush_64.h | 19 ++--
arch/sparc/include/asm/pgtable_32.h | 8 +-
arch/sparc/include/asm/pgtable_64.h | 24 ++++-
arch/sparc/kernel/smp_64.c | 56 +++++++----
arch/sparc/mm/init_32.c | 13 ++-
arch/sparc/mm/init_64.c | 78 ++++++++-------
arch/sparc/mm/tlb.c | 5 +-
arch/um/include/asm/pgtable.h | 7 +-
arch/x86/include/asm/pgtable.h | 13 ++-
arch/xtensa/include/asm/cacheflush.h | 11 ++-
arch/xtensa/include/asm/pgtable.h | 17 ++--
arch/xtensa/mm/cache.c | 83 +++++++++-------
include/asm-generic/cacheflush.h | 7 --
include/linux/cacheflush.h | 13 ++-
include/linux/mm.h | 3 +-
include/linux/page_table_check.h | 14 +--
include/linux/pagemap.h | 28 ++++--
include/linux/pgtable.h | 31 ++++++
include/linux/rmap.h | 2 +
mm/filemap.c | 111 +++++++++++++---------
mm/memory.c | 31 +++---
mm/page_table_check.c | 14 +--
mm/rmap.c | 60 +++++++++---
mm/util.c | 2 +-
115 files changed, 1344 insertions(+), 916 deletions(-)

--
2.39.2



2023-03-15 05:16:50

by Matthew Wilcox

Subject: [PATCH v4 30/36] mm: Rationalise flush_icache_pages() and flush_icache_page()

Move the default (no-op) implementation of flush_icache_pages()
from <asm-generic/cacheflush.h> to <linux/cacheflush.h>.
Remove the flush_icache_page() wrapper from each architecture
and provide a single generic definition in <linux/cacheflush.h>.
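
After this patch, an architecture that really needs an icache flush
only declares the range version and defines its own name; the
single-page wrapper comes for free from <linux/cacheflush.h>. A sketch
of the arch side (this is what the parisc, sh and nios2 headers below
end up with):

/* arch/<arch>/include/asm/cacheflush.h */
void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
		unsigned int nr);
#define flush_icache_pages flush_icache_pages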

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
arch/alpha/include/asm/cacheflush.h | 5 +----
arch/arc/include/asm/cacheflush.h | 9 ---------
arch/arm/include/asm/cacheflush.h | 7 -------
arch/csky/abiv1/inc/abi/cacheflush.h | 1 -
arch/csky/abiv2/inc/abi/cacheflush.h | 1 -
arch/hexagon/include/asm/cacheflush.h | 2 +-
arch/loongarch/include/asm/cacheflush.h | 2 --
arch/m68k/include/asm/cacheflush_mm.h | 1 -
arch/mips/include/asm/cacheflush.h | 6 ------
arch/nios2/include/asm/cacheflush.h | 2 +-
arch/nios2/mm/cacheflush.c | 1 +
arch/parisc/include/asm/cacheflush.h | 2 +-
arch/sh/include/asm/cacheflush.h | 2 +-
arch/sparc/include/asm/cacheflush_32.h | 2 --
arch/sparc/include/asm/cacheflush_64.h | 3 ---
arch/xtensa/include/asm/cacheflush.h | 4 ----
include/asm-generic/cacheflush.h | 12 ------------
include/linux/cacheflush.h | 9 +++++++++
18 files changed, 15 insertions(+), 56 deletions(-)

diff --git a/arch/alpha/include/asm/cacheflush.h b/arch/alpha/include/asm/cacheflush.h
index 3956460e69e2..36a7e924c3b9 100644
--- a/arch/alpha/include/asm/cacheflush.h
+++ b/arch/alpha/include/asm/cacheflush.h
@@ -53,10 +53,6 @@ extern void flush_icache_user_page(struct vm_area_struct *vma,
#define flush_icache_user_page flush_icache_user_page
#endif /* CONFIG_SMP */

-/* This is used only in __do_fault and do_swap_page. */
-#define flush_icache_page(vma, page) \
- flush_icache_user_page((vma), (page), 0, 0)
-
/*
* Both implementations of flush_icache_user_page flush the entire
* address space, so one call, no matter how many pages.
@@ -66,6 +62,7 @@ static inline void flush_icache_pages(struct vm_area_struct *vma,
{
flush_icache_user_page(vma, page, 0, 0);
}
+#define flush_icache_pages flush_icache_pages

#include <asm-generic/cacheflush.h>

diff --git a/arch/arc/include/asm/cacheflush.h b/arch/arc/include/asm/cacheflush.h
index 04f65f588510..bd5b1a9a0544 100644
--- a/arch/arc/include/asm/cacheflush.h
+++ b/arch/arc/include/asm/cacheflush.h
@@ -18,15 +18,6 @@
#include <linux/mm.h>
#include <asm/shmparam.h>

-/*
- * Semantically we need this because icache doesn't snoop dcache/dma.
- * However ARC Cache flush requires paddr as well as vaddr, latter not available
- * in the flush_icache_page() API. So we no-op it but do the equivalent work
- * in update_mmu_cache()
- */
-#define flush_icache_page(vma, page)
-#define flush_icache_pages(vma, page, nr)
-
void flush_cache_all(void);

void flush_icache_range(unsigned long kstart, unsigned long kend);
diff --git a/arch/arm/include/asm/cacheflush.h b/arch/arm/include/asm/cacheflush.h
index 841e268d2374..f6181f69577f 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -321,13 +321,6 @@ static inline void flush_anon_page(struct vm_area_struct *vma,
#define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages)
#define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages)

-/*
- * We don't appear to need to do anything here. In fact, if we did, we'd
- * duplicate cache flushing elsewhere performed by flush_dcache_page().
- */
-#define flush_icache_page(vma,page) do { } while (0)
-#define flush_icache_pages(vma, page, nr) do { } while (0)
-
/*
* flush_cache_vmap() is used when creating mappings (eg, via vmap,
* vmalloc, ioremap etc) in kernel space for pages. On non-VIPT
diff --git a/arch/csky/abiv1/inc/abi/cacheflush.h b/arch/csky/abiv1/inc/abi/cacheflush.h
index 0d6cb65624c4..908d8b0bc4fd 100644
--- a/arch/csky/abiv1/inc/abi/cacheflush.h
+++ b/arch/csky/abiv1/inc/abi/cacheflush.h
@@ -45,7 +45,6 @@ extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, u
#define flush_cache_vmap(start, end) cache_wbinv_all()
#define flush_cache_vunmap(start, end) cache_wbinv_all()

-#define flush_icache_page(vma, page) do {} while (0);
#define flush_icache_range(start, end) cache_wbinv_range(start, end)
#define flush_icache_mm_range(mm, start, end) cache_wbinv_range(start, end)
#define flush_icache_deferred(mm) do {} while (0);
diff --git a/arch/csky/abiv2/inc/abi/cacheflush.h b/arch/csky/abiv2/inc/abi/cacheflush.h
index 9c728933a776..40be16907267 100644
--- a/arch/csky/abiv2/inc/abi/cacheflush.h
+++ b/arch/csky/abiv2/inc/abi/cacheflush.h
@@ -33,7 +33,6 @@ static inline void flush_dcache_page(struct page *page)

#define flush_dcache_mmap_lock(mapping) do { } while (0)
#define flush_dcache_mmap_unlock(mapping) do { } while (0)
-#define flush_icache_page(vma, page) do { } while (0)

#define flush_icache_range(start, end) cache_wbinv_range(start, end)

diff --git a/arch/hexagon/include/asm/cacheflush.h b/arch/hexagon/include/asm/cacheflush.h
index 63ca314ede89..bdacf72d97e1 100644
--- a/arch/hexagon/include/asm/cacheflush.h
+++ b/arch/hexagon/include/asm/cacheflush.h
@@ -18,7 +18,7 @@
* - flush_cache_range(vma, start, end) flushes a range of pages
* - flush_icache_range(start, end) flush a range of instructions
* - flush_dcache_page(pg) flushes(wback&invalidates) a page for dcache
- * - flush_icache_page(vma, pg) flushes(invalidates) a page for icache
+ * - flush_icache_pages(vma, pg, nr) flushes(invalidates) nr pages for icache
*
* Need to doublecheck which one is really needed for ptrace stuff to work.
*/
diff --git a/arch/loongarch/include/asm/cacheflush.h b/arch/loongarch/include/asm/cacheflush.h
index 7907eb42bfbd..326ac6f1b27c 100644
--- a/arch/loongarch/include/asm/cacheflush.h
+++ b/arch/loongarch/include/asm/cacheflush.h
@@ -46,8 +46,6 @@ void local_flush_icache_range(unsigned long start, unsigned long end);
#define flush_cache_page(vma, vmaddr, pfn) do { } while (0)
#define flush_cache_vmap(start, end) do { } while (0)
#define flush_cache_vunmap(start, end) do { } while (0)
-#define flush_icache_page(vma, page) do { } while (0)
-#define flush_icache_pages(vma, page) do { } while (0)
#define flush_icache_user_page(vma, page, addr, len) do { } while (0)
#define flush_dcache_page(page) do { } while (0)
#define flush_dcache_folio(folio) do { } while (0)
diff --git a/arch/m68k/include/asm/cacheflush_mm.h b/arch/m68k/include/asm/cacheflush_mm.h
index 88eb85e81ef6..ed12358c4783 100644
--- a/arch/m68k/include/asm/cacheflush_mm.h
+++ b/arch/m68k/include/asm/cacheflush_mm.h
@@ -261,7 +261,6 @@ static inline void __flush_pages_to_ram(void *vaddr, unsigned int nr)
#define flush_dcache_mmap_unlock(mapping) do { } while (0)
#define flush_icache_pages(vma, page, nr) \
__flush_pages_to_ram(page_address(page), nr)
-#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1)

extern void flush_icache_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long addr, int len);
diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h
index 2683cade42ef..043e50effc62 100644
--- a/arch/mips/include/asm/cacheflush.h
+++ b/arch/mips/include/asm/cacheflush.h
@@ -82,12 +82,6 @@ static inline void flush_anon_page(struct vm_area_struct *vma,
__flush_anon_page(page, vmaddr);
}

-static inline void flush_icache_pages(struct vm_area_struct *vma,
- struct page *page, unsigned int nr)
-{
-}
-#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1)
-
extern void (*flush_icache_range)(unsigned long start, unsigned long end);
extern void (*local_flush_icache_range)(unsigned long start, unsigned long end);
extern void (*__flush_icache_user_range)(unsigned long start,
diff --git a/arch/nios2/include/asm/cacheflush.h b/arch/nios2/include/asm/cacheflush.h
index 8624ca83cffe..7c48c5213fb7 100644
--- a/arch/nios2/include/asm/cacheflush.h
+++ b/arch/nios2/include/asm/cacheflush.h
@@ -35,7 +35,7 @@ void flush_dcache_folio(struct folio *folio);
extern void flush_icache_range(unsigned long start, unsigned long end);
void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
unsigned int nr);
-#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1);
+#define flush_icache_pages flush_icache_pages

#define flush_cache_vmap(start, end) flush_dcache_range(start, end)
#define flush_cache_vunmap(start, end) flush_dcache_range(start, end)
diff --git a/arch/nios2/mm/cacheflush.c b/arch/nios2/mm/cacheflush.c
index 471485a84b2c..2565767b98a3 100644
--- a/arch/nios2/mm/cacheflush.c
+++ b/arch/nios2/mm/cacheflush.c
@@ -147,6 +147,7 @@ void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
__flush_dcache(start, end);
__flush_icache(start, end);
}
+#define flush_icache_pages flush_icache_pages

void flush_cache_page(struct vm_area_struct *vma, unsigned long vmaddr,
unsigned long pfn)
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index 2cdc0ea562d6..cd0bfbd244db 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -56,7 +56,7 @@ static inline void flush_dcache_page(struct page *page)

void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
unsigned int nr);
-#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1)
+#define flush_icache_pages flush_icache_pages

#define flush_icache_range(s,e) do { \
flush_kernel_dcache_range_asm(s,e); \
diff --git a/arch/sh/include/asm/cacheflush.h b/arch/sh/include/asm/cacheflush.h
index 9fceef6f3e00..878b6b551bd2 100644
--- a/arch/sh/include/asm/cacheflush.h
+++ b/arch/sh/include/asm/cacheflush.h
@@ -53,7 +53,7 @@ extern void flush_icache_range(unsigned long start, unsigned long end);
#define flush_icache_user_range flush_icache_range
void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
unsigned int nr);
-#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1)
+#define flush_icache_pages flush_icache_pages
extern void flush_cache_sigtramp(unsigned long address);

struct flusher_data {
diff --git a/arch/sparc/include/asm/cacheflush_32.h b/arch/sparc/include/asm/cacheflush_32.h
index 8dba35d63328..21f6c918238b 100644
--- a/arch/sparc/include/asm/cacheflush_32.h
+++ b/arch/sparc/include/asm/cacheflush_32.h
@@ -15,8 +15,6 @@
#define flush_cache_page(vma,addr,pfn) \
sparc32_cachetlb_ops->cache_page(vma, addr)
#define flush_icache_range(start, end) do { } while (0)
-#define flush_icache_page(vma, pg) do { } while (0)
-#define flush_icache_pages(vma, pg, nr) do { } while (0)

#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
do { \
diff --git a/arch/sparc/include/asm/cacheflush_64.h b/arch/sparc/include/asm/cacheflush_64.h
index a9a719f04d06..0e879004efff 100644
--- a/arch/sparc/include/asm/cacheflush_64.h
+++ b/arch/sparc/include/asm/cacheflush_64.h
@@ -53,9 +53,6 @@ static inline void flush_dcache_page(struct page *page)
flush_dcache_folio(page_folio(page));
}

-#define flush_icache_page(vma, pg) do { } while(0)
-#define flush_icache_pages(vma, pg, nr) do { } while(0)
-
void flush_ptrace_access(struct vm_area_struct *, struct page *,
unsigned long uaddr, void *kaddr,
unsigned long len, int write);
diff --git a/arch/xtensa/include/asm/cacheflush.h b/arch/xtensa/include/asm/cacheflush.h
index 35153f6725e4..785a00ce83c1 100644
--- a/arch/xtensa/include/asm/cacheflush.h
+++ b/arch/xtensa/include/asm/cacheflush.h
@@ -160,10 +160,6 @@ void local_flush_cache_page(struct vm_area_struct *vma,
__invalidate_icache_range(start,(end) - (start)); \
} while (0)

-/* This is not required, see Documentation/core-api/cachetlb.rst */
-#define flush_icache_page(vma,page) do { } while (0)
-#define flush_icache_pages(vma, page, nr) do { } while (0)
-
#define flush_dcache_mmap_lock(mapping) do { } while (0)
#define flush_dcache_mmap_unlock(mapping) do { } while (0)

diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
index 09d51a680765..84ec53ccc450 100644
--- a/include/asm-generic/cacheflush.h
+++ b/include/asm-generic/cacheflush.h
@@ -77,18 +77,6 @@ static inline void flush_icache_range(unsigned long start, unsigned long end)
#define flush_icache_user_range flush_icache_range
#endif

-#ifndef flush_icache_page
-static inline void flush_icache_pages(struct vm_area_struct *vma,
- struct page *page, unsigned int nr)
-{
-}
-
-static inline void flush_icache_page(struct vm_area_struct *vma,
- struct page *page)
-{
-}
-#endif
-
#ifndef flush_icache_user_page
static inline void flush_icache_user_page(struct vm_area_struct *vma,
struct page *page,
diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
index 82136f3fcf54..55f297b2c23f 100644
--- a/include/linux/cacheflush.h
+++ b/include/linux/cacheflush.h
@@ -17,4 +17,13 @@ static inline void flush_dcache_folio(struct folio *folio)
#define flush_dcache_folio flush_dcache_folio
#endif /* ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE */

+#ifndef flush_icache_pages
+static inline void flush_icache_pages(struct vm_area_struct *vma,
+ struct page *page, unsigned int nr)
+{
+}
+#endif
+
+#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1)
+
#endif /* _LINUX_CACHEFLUSH_H */
--
2.39.2


2023-03-15 05:16:54

by Matthew Wilcox

Subject: [PATCH v4 32/36] mm: Use flush_icache_pages() in do_set_pmd()

Push the iteration over each page down to the architectures (many
can flush the entire THP without iteration).

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/memory.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index c5f1bf906d0c..6aa21e8f3753 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4209,7 +4209,6 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
bool write = vmf->flags & FAULT_FLAG_WRITE;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
pmd_t entry;
- int i;
vm_fault_t ret = VM_FAULT_FALLBACK;

if (!transhuge_vma_suitable(vma, haddr))
@@ -4242,8 +4241,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
if (unlikely(!pmd_none(*vmf->pmd)))
goto out;

- for (i = 0; i < HPAGE_PMD_NR; i++)
- flush_icache_page(vma, page + i);
+ flush_icache_pages(vma, page, HPAGE_PMD_NR);

entry = mk_huge_pmd(page, vma->vm_page_prot);
if (write)
--
2.39.2


2023-03-15 05:16:58

by Matthew Wilcox

Subject: [PATCH v4 25/36] sparc64: Implement the new page table range API

Add set_ptes(), update_mmu_cache_range(), flush_dcache_folio() and
flush_icache_pages(). Convert the PG_dcache_dirty flag from being
per-page to per-folio.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
---
arch/sparc/include/asm/cacheflush_64.h | 18 ++++--
arch/sparc/include/asm/pgtable_64.h | 24 ++++++--
arch/sparc/kernel/smp_64.c | 56 +++++++++++-------
arch/sparc/mm/init_64.c | 78 +++++++++++++++-----------
arch/sparc/mm/tlb.c | 5 +-
5 files changed, 116 insertions(+), 65 deletions(-)

diff --git a/arch/sparc/include/asm/cacheflush_64.h b/arch/sparc/include/asm/cacheflush_64.h
index b9341836597e..a9a719f04d06 100644
--- a/arch/sparc/include/asm/cacheflush_64.h
+++ b/arch/sparc/include/asm/cacheflush_64.h
@@ -35,20 +35,26 @@ void flush_icache_range(unsigned long start, unsigned long end);
void __flush_icache_page(unsigned long);

void __flush_dcache_page(void *addr, int flush_icache);
-void flush_dcache_page_impl(struct page *page);
+void flush_dcache_folio_impl(struct folio *folio);
#ifdef CONFIG_SMP
-void smp_flush_dcache_page_impl(struct page *page, int cpu);
-void flush_dcache_page_all(struct mm_struct *mm, struct page *page);
+void smp_flush_dcache_folio_impl(struct folio *folio, int cpu);
+void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio);
#else
-#define smp_flush_dcache_page_impl(page,cpu) flush_dcache_page_impl(page)
-#define flush_dcache_page_all(mm,page) flush_dcache_page_impl(page)
+#define smp_flush_dcache_folio_impl(folio, cpu) flush_dcache_folio_impl(folio)
+#define flush_dcache_folio_all(mm, folio) flush_dcache_folio_impl(folio)
#endif

void __flush_dcache_range(unsigned long start, unsigned long end);
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-void flush_dcache_page(struct page *page);
+void flush_dcache_folio(struct folio *folio);
+#define flush_dcache_folio flush_dcache_folio
+static inline void flush_dcache_page(struct page *page)
+{
+ flush_dcache_folio(page_folio(page));
+}

#define flush_icache_page(vma, pg) do { } while(0)
+#define flush_icache_pages(vma, pg, nr) do { } while(0)

void flush_ptrace_access(struct vm_area_struct *, struct page *,
unsigned long uaddr, void *kaddr,
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 2dc8d4641734..49c37000e1b1 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -911,8 +911,19 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
}

-#define set_pte_at(mm,addr,ptep,pte) \
- __set_pte_at((mm), (addr), (ptep), (pte), 0)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ for (;;) {
+ __set_pte_at(mm, addr, ptep, pte, 0);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte_val(pte) += PAGE_SIZE;
+ addr += PAGE_SIZE;
+ }
+}
+#define set_ptes set_ptes

#define pte_clear(mm,addr,ptep) \
set_pte_at((mm), (addr), (ptep), __pte(0UL))
@@ -931,8 +942,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
\
if (pfn_valid(this_pfn) && \
(((old_addr) ^ (new_addr)) & (1 << 13))) \
- flush_dcache_page_all(current->mm, \
- pfn_to_page(this_pfn)); \
+ flush_dcache_folio_all(current->mm, \
+ page_folio(pfn_to_page(this_pfn))); \
} \
newpte; \
})
@@ -947,7 +958,10 @@ struct seq_file;
void mmu_info(struct seq_file *);

struct vm_area_struct;
-void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *);
+void update_mmu_cache_range(struct vm_area_struct *, unsigned long addr,
+ pte_t *ptep, unsigned int nr);
+#define update_mmu_cache(vma, addr, ptep) \
+ update_mmu_cache_range(vma, addr, ptep, 1)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd);
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index a55295d1b924..90ef8677ac89 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -921,20 +921,26 @@ extern unsigned long xcall_flush_dcache_page_cheetah;
#endif
extern unsigned long xcall_flush_dcache_page_spitfire;

-static inline void __local_flush_dcache_page(struct page *page)
+static inline void __local_flush_dcache_folio(struct folio *folio)
{
+ unsigned int i, nr = folio_nr_pages(folio);
+
#ifdef DCACHE_ALIASING_POSSIBLE
- __flush_dcache_page(page_address(page),
+ for (i = 0; i < nr; i++)
+ __flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
((tlb_type == spitfire) &&
- page_mapping_file(page) != NULL));
+ folio_flush_mapping(folio) != NULL));
#else
- if (page_mapping_file(page) != NULL &&
- tlb_type == spitfire)
- __flush_icache_page(__pa(page_address(page)));
+ if (folio_flush_mapping(folio) != NULL &&
+ tlb_type == spitfire) {
+ unsigned long pfn = folio_pfn(folio);
+ for (i = 0; i < nr; i++)
+ __flush_icache_page((pfn + i) * PAGE_SIZE);
+ }
#endif
}

-void smp_flush_dcache_page_impl(struct page *page, int cpu)
+void smp_flush_dcache_folio_impl(struct folio *folio, int cpu)
{
int this_cpu;

@@ -948,14 +954,14 @@ void smp_flush_dcache_page_impl(struct page *page, int cpu)
this_cpu = get_cpu();

if (cpu == this_cpu) {
- __local_flush_dcache_page(page);
+ __local_flush_dcache_folio(folio);
} else if (cpu_online(cpu)) {
- void *pg_addr = page_address(page);
+ void *pg_addr = folio_address(folio);
u64 data0 = 0;

if (tlb_type == spitfire) {
data0 = ((u64)&xcall_flush_dcache_page_spitfire);
- if (page_mapping_file(page) != NULL)
+ if (folio_flush_mapping(folio) != NULL)
data0 |= ((u64)1 << 32);
} else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
#ifdef DCACHE_ALIASING_POSSIBLE
@@ -963,18 +969,23 @@ void smp_flush_dcache_page_impl(struct page *page, int cpu)
#endif
}
if (data0) {
- xcall_deliver(data0, __pa(pg_addr),
- (u64) pg_addr, cpumask_of(cpu));
+ unsigned int i, nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ xcall_deliver(data0, __pa(pg_addr),
+ (u64) pg_addr, cpumask_of(cpu));
#ifdef CONFIG_DEBUG_DCFLUSH
- atomic_inc(&dcpage_flushes_xcall);
+ atomic_inc(&dcpage_flushes_xcall);
#endif
+ pg_addr += PAGE_SIZE;
+ }
}
}

put_cpu();
}

-void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
+void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio)
{
void *pg_addr;
u64 data0;
@@ -988,10 +999,10 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
atomic_inc(&dcpage_flushes);
#endif
data0 = 0;
- pg_addr = page_address(page);
+ pg_addr = folio_address(folio);
if (tlb_type == spitfire) {
data0 = ((u64)&xcall_flush_dcache_page_spitfire);
- if (page_mapping_file(page) != NULL)
+ if (folio_flush_mapping(folio) != NULL)
data0 |= ((u64)1 << 32);
} else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
#ifdef DCACHE_ALIASING_POSSIBLE
@@ -999,13 +1010,18 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
#endif
}
if (data0) {
- xcall_deliver(data0, __pa(pg_addr),
- (u64) pg_addr, cpu_online_mask);
+ unsigned int i, nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ xcall_deliver(data0, __pa(pg_addr),
+ (u64) pg_addr, cpu_online_mask);
#ifdef CONFIG_DEBUG_DCFLUSH
- atomic_inc(&dcpage_flushes_xcall);
+ atomic_inc(&dcpage_flushes_xcall);
#endif
+ pg_addr += PAGE_SIZE;
+ }
}
- __local_flush_dcache_page(page);
+ __local_flush_dcache_folio(folio);

preempt_enable();
}
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..ab9aacbaf43c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -195,21 +195,26 @@ atomic_t dcpage_flushes_xcall = ATOMIC_INIT(0);
#endif
#endif

-inline void flush_dcache_page_impl(struct page *page)
+inline void flush_dcache_folio_impl(struct folio *folio)
{
+ unsigned int i, nr = folio_nr_pages(folio);
+
BUG_ON(tlb_type == hypervisor);
#ifdef CONFIG_DEBUG_DCFLUSH
atomic_inc(&dcpage_flushes);
#endif

#ifdef DCACHE_ALIASING_POSSIBLE
- __flush_dcache_page(page_address(page),
- ((tlb_type == spitfire) &&
- page_mapping_file(page) != NULL));
+ for (i = 0; i < nr; i++)
+ __flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
+ ((tlb_type == spitfire) &&
+ folio_flush_mapping(folio) != NULL));
#else
- if (page_mapping_file(page) != NULL &&
- tlb_type == spitfire)
- __flush_icache_page(__pa(page_address(page)));
+ if (folio_flush_mapping(folio) != NULL &&
+ tlb_type == spitfire) {
+ for (i = 0; i < nr; i++)
+ __flush_icache_page((folio_pfn(folio) + i) * PAGE_SIZE);
+ }
#endif
}

@@ -218,10 +223,10 @@ inline void flush_dcache_page_impl(struct page *page)
#define PG_dcache_cpu_mask \
((1UL<<ilog2(roundup_pow_of_two(NR_CPUS)))-1UL)

-#define dcache_dirty_cpu(page) \
- (((page)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)
+#define dcache_dirty_cpu(folio) \
+ (((folio)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)

-static inline void set_dcache_dirty(struct page *page, int this_cpu)
+static inline void set_dcache_dirty(struct folio *folio, int this_cpu)
{
unsigned long mask = this_cpu;
unsigned long non_cpu_bits;
@@ -238,11 +243,11 @@ static inline void set_dcache_dirty(struct page *page, int this_cpu)
"bne,pn %%xcc, 1b\n\t"
" nop"
: /* no outputs */
- : "r" (mask), "r" (non_cpu_bits), "r" (&page->flags)
+ : "r" (mask), "r" (non_cpu_bits), "r" (&folio->flags)
: "g1", "g7");
}

-static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
+static inline void clear_dcache_dirty_cpu(struct folio *folio, unsigned long cpu)
{
unsigned long mask = (1UL << PG_dcache_dirty);

@@ -260,7 +265,7 @@ static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
" nop\n"
"2:"
: /* no outputs */
- : "r" (cpu), "r" (mask), "r" (&page->flags),
+ : "r" (cpu), "r" (mask), "r" (&folio->flags),
"i" (PG_dcache_cpu_mask),
"i" (PG_dcache_cpu_shift)
: "g1", "g7");
@@ -284,9 +289,10 @@ static void flush_dcache(unsigned long pfn)

page = pfn_to_page(pfn);
if (page) {
+ struct folio *folio = page_folio(page);
unsigned long pg_flags;

- pg_flags = page->flags;
+ pg_flags = folio->flags;
if (pg_flags & (1UL << PG_dcache_dirty)) {
int cpu = ((pg_flags >> PG_dcache_cpu_shift) &
PG_dcache_cpu_mask);
@@ -296,11 +302,11 @@ static void flush_dcache(unsigned long pfn)
* in the SMP case.
*/
if (cpu == this_cpu)
- flush_dcache_page_impl(page);
+ flush_dcache_folio_impl(folio);
else
- smp_flush_dcache_page_impl(page, cpu);
+ smp_flush_dcache_folio_impl(folio, cpu);

- clear_dcache_dirty_cpu(page, cpu);
+ clear_dcache_dirty_cpu(folio, cpu);

put_cpu();
}
@@ -388,12 +394,14 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
}
#endif /* CONFIG_HUGETLB_PAGE */

-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep)
+void update_mmu_cache_range(struct vm_area_struct *vma, unsigned long address,
+ pte_t *ptep, unsigned int nr)
{
struct mm_struct *mm;
unsigned long flags;
bool is_huge_tsb;
pte_t pte = *ptep;
+ unsigned int i;

if (tlb_type != hypervisor) {
unsigned long pfn = pte_pfn(pte);
@@ -440,15 +448,21 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *
}
}
#endif
- if (!is_huge_tsb)
- __update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
- address, pte_val(pte));
+ if (!is_huge_tsb) {
+ for (i = 0; i < nr; i++) {
+ __update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
+ address, pte_val(pte));
+ address += PAGE_SIZE;
+ pte_val(pte) += PAGE_SIZE;
+ }
+ }

spin_unlock_irqrestore(&mm->context.lock, flags);
}

-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
{
+ unsigned long pfn = folio_pfn(folio);
struct address_space *mapping;
int this_cpu;

@@ -459,35 +473,35 @@ void flush_dcache_page(struct page *page)
* is merely the zero page. The 'bigcore' testcase in GDB
* causes this case to run millions of times.
*/
- if (page == ZERO_PAGE(0))
+ if (is_zero_pfn(pfn))
return;

this_cpu = get_cpu();

- mapping = page_mapping_file(page);
+ mapping = folio_flush_mapping(folio);
if (mapping && !mapping_mapped(mapping)) {
- int dirty = test_bit(PG_dcache_dirty, &page->flags);
+ bool dirty = test_bit(PG_dcache_dirty, &folio->flags);
if (dirty) {
- int dirty_cpu = dcache_dirty_cpu(page);
+ int dirty_cpu = dcache_dirty_cpu(folio);

if (dirty_cpu == this_cpu)
goto out;
- smp_flush_dcache_page_impl(page, dirty_cpu);
+ smp_flush_dcache_folio_impl(folio, dirty_cpu);
}
- set_dcache_dirty(page, this_cpu);
+ set_dcache_dirty(folio, this_cpu);
} else {
/* We could delay the flush for the !page_mapping
* case too. But that case is for exec env/arg
* pages and those are %99 certainly going to get
* faulted into the tlb (and thus flushed) anyways.
*/
- flush_dcache_page_impl(page);
+ flush_dcache_folio_impl(folio);
}

out:
put_cpu();
}
-EXPORT_SYMBOL(flush_dcache_page);
+EXPORT_SYMBOL(flush_dcache_folio);

void __kprobes flush_icache_range(unsigned long start, unsigned long end)
{
@@ -2280,10 +2294,10 @@ void __init paging_init(void)
setup_page_offset();

/* These build time checkes make sure that the dcache_dirty_cpu()
- * page->flags usage will work.
+ * folio->flags usage will work.
*
* When a page gets marked as dcache-dirty, we store the
- * cpu number starting at bit 32 in the page->flags. Also,
+ * cpu number starting at bit 32 in the folio->flags. Also,
* functions like clear_dcache_dirty_cpu use the cpu mask
* in 13-bit signed-immediate instruction fields.
*/
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..3fa6a070912d 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -118,6 +118,7 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
unsigned long paddr, pfn = pte_pfn(orig);
struct address_space *mapping;
struct page *page;
+ struct folio *folio;

if (!pfn_valid(pfn))
goto no_cache_flush;
@@ -127,13 +128,13 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
goto no_cache_flush;

/* A real file page? */
- mapping = page_mapping_file(page);
+ mapping = folio_flush_mapping(folio);
if (!mapping)
goto no_cache_flush;

paddr = (unsigned long) page_address(page);
if ((paddr ^ vaddr) & (1 << 13))
- flush_dcache_page_all(mm, page);
+ flush_dcache_folio_all(mm, folio);
}

no_cache_flush:
--
2.39.2


2023-03-15 05:17:01

by Matthew Wilcox

Subject: [PATCH v4 36/36] filemap: Batch PTE mappings

From: Yin Fengwei <[email protected]>

Call set_pte_range() once per contiguous range of the folio instead
of once per page. This batches the updates to mm counters and the
rmap.
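
The resulting loop can be hard to read in diff form. Its shape is
roughly the following sketch, where can_map() stands in for the real
PageHWPoison()/pte_none() checks and the mmap_miss/VM_FAULT_NOPAGE
handling is omitted:

	unsigned int count = 0;

	do {
		if (!can_map(page + count, &vmf->pte[count])) {
			/* emit the run accumulated so far ... */
			if (count) {
				set_pte_range(vmf, folio, page, count, addr);
				folio_ref_add(folio, count);
			}
			/* ... then skip past it and the unmappable page */
			count++;
			page += count;
			vmf->pte += count;
			addr += count * PAGE_SIZE;
			count = 0;
			continue;
		}
		count++;
	} while (--nr_pages > 0);

	if (count) {	/* final run reaching the end of the range */
		set_pte_range(vmf, folio, page, count, addr);
		folio_ref_add(folio, count);
	}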

With a will-it-scale.page_fault3-like app (modified to test file read
faults instead of write faults; submitted to will-it-scale at [1]),
this gives a 15% performance gain on a 48C/96T Cascade Lake test box
with 96 processes running against xfs.

Perf data collected before (first call graph below) and after (second
call graph) the change:
18.73%--page_add_file_rmap
|
--11.60%--__mod_lruvec_page_state
|
|--7.40%--__mod_memcg_lruvec_state
| |
| --5.58%--cgroup_rstat_updated
|
--2.53%--__mod_lruvec_state
|
--1.48%--__mod_node_page_state

9.93%--page_add_file_rmap_range
|
--2.67%--__mod_lruvec_page_state
|
|--1.95%--__mod_memcg_lruvec_state
| |
| --1.57%--cgroup_rstat_updated
|
--0.61%--__mod_lruvec_state
|
--0.54%--__mod_node_page_state

The running time of __mod_lruvec_page_state() is reduced by about 9%.

[1]: https://github.com/antonblanchard/will-it-scale/pull/37

Signed-off-by: Yin Fengwei <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 36 +++++++++++++++++++++++++-----------
1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index e2317623dcbf..7a1534460b55 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3483,11 +3483,12 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct file *file = vma->vm_file;
struct page *page = folio_page(folio, start);
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
- unsigned int ref_count = 0, count = 0;
+ unsigned int count = 0;
+ pte_t *old_ptep = vmf->pte;

do {
- if (PageHWPoison(page))
- continue;
+ if (PageHWPoison(page + count))
+ goto skip;

if (mmap_miss > 0)
mmap_miss--;
@@ -3497,20 +3498,33 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
* handled in the specific fault path, and it'll prohibit the
* fault-around logic.
*/
- if (!pte_none(*vmf->pte))
- continue;
+ if (!pte_none(vmf->pte[count]))
+ goto skip;

if (vmf->address == addr)
ret = VM_FAULT_NOPAGE;

- ref_count++;
- set_pte_range(vmf, folio, page, 1, addr);
- } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+ count++;
+ continue;
+skip:
+ if (count) {
+ set_pte_range(vmf, folio, page, count, addr);
+ folio_ref_add(folio, count);
+ }

- /* Restore the vmf->pte */
- vmf->pte -= nr_pages;
+ count++;
+ page += count;
+ vmf->pte += count;
+ addr += count * PAGE_SIZE;
+ count = 0;
+ } while (--nr_pages > 0);
+
+ if (count) {
+ set_pte_range(vmf, folio, page, count, addr);
+ folio_ref_add(folio, count);
+ }

- folio_ref_add(folio, ref_count);
+ vmf->pte = old_ptep;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);

return ret;
--
2.39.2


2023-03-15 05:17:05

by Matthew Wilcox

Subject: [PATCH v4 02/36] mm: Add generic flush_icache_pages() and documentation

flush_icache_page() is deprecated but not yet removed, so add
a range version of it. Change the documentation to refer to
update_mmu_cache_range() instead of update_mmu_cache().

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
Documentation/core-api/cachetlb.rst | 35 +++++++++++++++--------------
include/asm-generic/cacheflush.h | 5 +++++
2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index 5c0552e78c58..d4c9e2a28d36 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -88,13 +88,13 @@ changes occur:

This is used primarily during fault processing.

-5) ``void update_mmu_cache(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep)``
+5) ``void update_mmu_cache_range(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, unsigned int nr)``

- At the end of every page fault, this routine is invoked to
- tell the architecture specific code that a translation
- now exists at virtual address "address" for address space
- "vma->vm_mm", in the software page tables.
+ At the end of every page fault, this routine is invoked to tell
+ the architecture specific code that translations now exist
+ in the software page tables for address space "vma->vm_mm"
+ at virtual address "address" for "nr" consecutive pages.

A port may use this information in any way it so chooses.
For example, it could use this event to pre-load TLB
@@ -306,17 +306,18 @@ maps this page at its virtual address.
private". The kernel guarantees that, for pagecache pages, it will
clear this bit when such a page first enters the pagecache.

- This allows these interfaces to be implemented much more efficiently.
- It allows one to "defer" (perhaps indefinitely) the actual flush if
- there are currently no user processes mapping this page. See sparc64's
- flush_dcache_page and update_mmu_cache implementations for an example
- of how to go about doing this.
+ This allows these interfaces to be implemented much more
+ efficiently. It allows one to "defer" (perhaps indefinitely) the
+ actual flush if there are currently no user processes mapping this
+ page. See sparc64's flush_dcache_page and update_mmu_cache_range
+ implementations for an example of how to go about doing this.

- The idea is, first at flush_dcache_page() time, if page_file_mapping()
- returns a mapping, and mapping_mapped on that mapping returns %false,
- just mark the architecture private page flag bit. Later, in
- update_mmu_cache(), a check is made of this flag bit, and if set the
- flush is done and the flag bit is cleared.
+ The idea is, first at flush_dcache_page() time, if
+ page_file_mapping() returns a mapping, and mapping_mapped on that
+ mapping returns %false, just mark the architecture private page
+ flag bit. Later, in update_mmu_cache_range(), a check is made
+ of this flag bit, and if set the flush is done and the flag bit
+ is cleared.

.. important::

@@ -369,7 +370,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``

All the functionality of flush_icache_page can be implemented in
- flush_dcache_page and update_mmu_cache. In the future, the hope
+ flush_dcache_page and update_mmu_cache_range. In the future, the hope
is to remove this interface completely.

The final category of APIs is for I/O to deliberately aliased address
diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
index f46258d1a080..09d51a680765 100644
--- a/include/asm-generic/cacheflush.h
+++ b/include/asm-generic/cacheflush.h
@@ -78,6 +78,11 @@ static inline void flush_icache_range(unsigned long start, unsigned long end)
#endif

#ifndef flush_icache_page
+static inline void flush_icache_pages(struct vm_area_struct *vma,
+ struct page *page, unsigned int nr)
+{
+}
+
static inline void flush_icache_page(struct vm_area_struct *vma,
struct page *page)
{
--
2.39.2


2023-03-15 05:17:10

by Matthew Wilcox

Subject: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

From: Yin Fengwei <[email protected]>

set_pte_range() allows setting up page table entries for a specific
range. It takes advantage of batched rmap updates for large folios
and now takes care of calling update_mmu_cache_range() itself.
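
For a single page the conversion at a call site is mechanical; from
the filemap.c hunk below:

	/* before */
	do_set_pte(vmf, page, addr);
	update_mmu_cache(vma, addr, vmf->pte);

	/* after: one call sets the PTE, updates the rmap and the MMU cache */
	set_pte_range(vmf, folio, page, 1, addr);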

Signed-off-by: Yin Fengwei <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
Documentation/filesystems/locking.rst | 2 +-
include/linux/mm.h | 3 ++-
mm/filemap.c | 3 +--
mm/memory.c | 27 +++++++++++++++------------
4 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index 7de7a7272a5e..922886fefb7f 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -663,7 +663,7 @@ locked. The VM will unlock the page.
Filesystem should find and map pages associated with offsets from "start_pgoff"
till "end_pgoff". ->map_pages() is called with page table locked and must
not block. If it's not possible to reach a page without blocking,
-filesystem should skip it. Filesystem should use do_set_pte() to setup
+filesystem should skip it. Filesystem should use set_pte_range() to setup
page table entry. Pointer to entry associated with the page is passed in
"pte" field in vm_fault structure. Pointers to entries for other offsets
should be calculated relative to "pte".
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee755bb4e1c1..81788c985a8c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1299,7 +1299,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
}

vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
+void set_pte_range(struct vm_fault *vmf, struct folio *folio,
+ struct page *page, unsigned int nr, unsigned long addr);

vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
diff --git a/mm/filemap.c b/mm/filemap.c
index 6e2b0778db45..e2317623dcbf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3504,8 +3504,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
ret = VM_FAULT_NOPAGE;

ref_count++;
- do_set_pte(vmf, page, addr);
- update_mmu_cache(vma, addr, vmf->pte);
+ set_pte_range(vmf, folio, page, 1, addr);
} while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);

/* Restore the vmf->pte */
diff --git a/mm/memory.c b/mm/memory.c
index 6aa21e8f3753..9a654802f104 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4274,7 +4274,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
}
#endif

-void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
+void set_pte_range(struct vm_fault *vmf, struct folio *folio,
+ struct page *page, unsigned int nr, unsigned long addr)
{
struct vm_area_struct *vma = vmf->vma;
bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
@@ -4282,7 +4283,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
bool prefault = vmf->address != addr;
pte_t entry;

- flush_icache_page(vma, page);
+ flush_icache_pages(vma, page, nr);
entry = mk_pte(page, vma->vm_page_prot);

if (prefault && arch_wants_old_prefaulted_pte())
@@ -4296,14 +4297,18 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
entry = pte_mkuffd_wp(entry);
/* copy-on-write page */
if (write && !(vma->vm_flags & VM_SHARED)) {
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
- page_add_new_anon_rmap(page, vma, addr);
- lru_cache_add_inactive_or_unevictable(page, vma);
+ add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
+ VM_BUG_ON_FOLIO(nr != 1, folio);
+ folio_add_new_anon_rmap(folio, vma, addr);
+ folio_add_lru_vma(folio, vma);
} else {
- inc_mm_counter(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, vma, false);
+ add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
+ folio_add_file_rmap_range(folio, page, nr, vma, false);
}
- set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
+ set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
+
+ /* no need to invalidate: a not-present page won't be cached */
+ update_mmu_cache_range(vma, addr, vmf->pte, nr);
}

static bool vmf_pte_changed(struct vm_fault *vmf)
@@ -4376,11 +4381,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)

/* Re-check under ptl */
if (likely(!vmf_pte_changed(vmf))) {
- do_set_pte(vmf, page, vmf->address);
-
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, vmf->address, vmf->pte);
+ struct folio *folio = page_folio(page);

+ set_pte_range(vmf, folio, page, 1, vmf->address);
ret = 0;
} else {
update_mmu_tlb(vma, vmf->address, vmf->pte);
--
2.39.2


2023-03-15 05:17:15

by Matthew Wilcox

Subject: [PATCH v4 04/36] mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO

Current best practice is to reuse the name of the function as a define
to indicate that the function is implemented by the architecture.
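
For example, an architecture that provides its own implementation now
signals it like this (a sketch of the convention):

/* in the architecture's <asm/cacheflush.h>: */
void flush_dcache_folio(struct folio *folio);
#define flush_dcache_folio flush_dcache_folio

and generic code simply tests "#ifndef flush_dcache_folio" to decide
whether to supply the fallback, as <linux/cacheflush.h> and mm/util.c
do below.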

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
Documentation/core-api/cachetlb.rst | 24 +++++++++---------------
include/linux/cacheflush.h | 4 ++--
mm/util.c | 2 +-
3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index d4c9e2a28d36..770008afd409 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -269,7 +269,7 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, these two routines may
simply call memcpy/memset directly and do nothing more.

- ``void flush_dcache_page(struct page *page)``
+ ``void flush_dcache_folio(struct folio *folio)``

This routines must be called when:

@@ -277,7 +277,7 @@ maps this page at its virtual address.
and / or in high memory
b) the kernel is about to read from a page cache page and user space
shared/writable mappings of this page potentially exist. Note
- that {get,pin}_user_pages{_fast} already call flush_dcache_page
+ that {get,pin}_user_pages{_fast} already call flush_dcache_folio
on any page found in the user address space and thus driver
code rarely needs to take this into account.

@@ -291,7 +291,7 @@ maps this page at its virtual address.

The phrase "kernel writes to a page cache page" means, specifically,
that the kernel executes store instructions that dirty data in that
- page at the page->virtual mapping of that page. It is important to
+ page at the kernel virtual mapping of that page. It is important to
flush here to handle D-cache aliasing, to make sure these kernel stores
are visible to user space mappings of that page.

@@ -302,18 +302,18 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, this routine may simply be defined
as a nop on that architecture.

- There is a bit set aside in page->flags (PG_arch_1) as "architecture
+ There is a bit set aside in folio->flags (PG_arch_1) as "architecture
private". The kernel guarantees that, for pagecache pages, it will
clear this bit when such a page first enters the pagecache.

This allows these interfaces to be implemented much more
efficiently. It allows one to "defer" (perhaps indefinitely) the
actual flush if there are currently no user processes mapping this
- page. See sparc64's flush_dcache_page and update_mmu_cache_range
+ page. See sparc64's flush_dcache_folio and update_mmu_cache_range
implementations for an example of how to go about doing this.

- The idea is, first at flush_dcache_page() time, if
- page_file_mapping() returns a mapping, and mapping_mapped on that
+ The idea is, first at flush_dcache_folio() time, if
+ folio_flush_mapping() returns a mapping, and mapping_mapped() on that
mapping returns %false, just mark the architecture private page
flag bit. Later, in update_mmu_cache_range(), a check is made
of this flag bit, and if set the flush is done and the flag bit
@@ -327,12 +327,6 @@ maps this page at its virtual address.
dirty. Again, see sparc64 for examples of how
to deal with this.

- ``void flush_dcache_folio(struct folio *folio)``
- This function is called under the same circumstances as
- flush_dcache_page(). It allows the architecture to
- optimise for flushing the entire folio of pages instead
- of flushing one page at a time.
-
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
@@ -353,7 +347,7 @@ maps this page at its virtual address.

When the kernel needs to access the contents of an anonymous
page, it calls this function (currently only
- get_user_pages()). Note: flush_dcache_page() deliberately
+ get_user_pages()). Note: flush_dcache_folio() deliberately
doesn't work for an anonymous page. The default
implementation is a nop (and should remain so for all coherent
architectures). For incoherent architectures, it should flush
@@ -370,7 +364,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``

All the functionality of flush_icache_page can be implemented in
- flush_dcache_page and update_mmu_cache_range. In the future, the hope
+ flush_dcache_folio and update_mmu_cache_range. In the future, the hope
is to remove this interface completely.

The final category of APIs is for I/O to deliberately aliased address
diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
index a6189d21f2ba..82136f3fcf54 100644
--- a/include/linux/cacheflush.h
+++ b/include/linux/cacheflush.h
@@ -7,14 +7,14 @@
struct folio;

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
+#ifndef flush_dcache_folio
void flush_dcache_folio(struct folio *folio);
#endif
#else
static inline void flush_dcache_folio(struct folio *folio)
{
}
-#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 0
+#define flush_dcache_folio flush_dcache_folio
#endif /* ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE */

#endif /* _LINUX_CACHEFLUSH_H */
diff --git a/mm/util.c b/mm/util.c
index dd12b9531ac4..98ce51b01627 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1125,7 +1125,7 @@ void page_offline_end(void)
}
EXPORT_SYMBOL(page_offline_end);

-#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
+#ifndef flush_dcache_folio
void flush_dcache_folio(struct folio *folio)
{
long i, nr = folio_nr_pages(folio);
--
2.39.2


2023-03-15 05:17:17

by Matthew Wilcox

Subject: [PATCH v4 33/36] filemap: Add filemap_map_folio_range()

From: Yin Fengwei <[email protected]>

filemap_map_folio_range() maps a partial or full folio. Compared to the
original filemap_map_pages(), it updates the refcount once per folio
instead of once per page, which gives a minor performance improvement
for large folios.

With a will-it-scale.page_fault3-like app (modified to test file read
faults instead of write faults; submitted to will-it-scale at [1]),
this gives a 2% performance gain on a 48C/96T Cascade Lake test box
with 96 processes running against xfs.

[1]: https://github.com/antonblanchard/will-it-scale/pull/37

Signed-off-by: Yin Fengwei <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 98 +++++++++++++++++++++++++++++-----------------------
1 file changed, 54 insertions(+), 44 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a34abfe8c654..6e2b0778db45 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2199,16 +2199,6 @@ unsigned filemap_get_folios(struct address_space *mapping, pgoff_t *start,
}
EXPORT_SYMBOL(filemap_get_folios);

-static inline
-bool folio_more_pages(struct folio *folio, pgoff_t index, pgoff_t max)
-{
- if (!folio_test_large(folio) || folio_test_hugetlb(folio))
- return false;
- if (index >= max)
- return false;
- return index < folio->index + folio_nr_pages(folio) - 1;
-}
-
/**
* filemap_get_folios_contig - Get a batch of contiguous folios
* @mapping: The address_space to search
@@ -3480,6 +3470,53 @@ static inline struct folio *next_map_page(struct address_space *mapping,
mapping, xas, end_pgoff);
}

+/*
+ * Map page range [start_page, start_page + nr_pages) of folio.
+ * start_page is gotten from start by folio_page(folio, start)
+ */
+static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
+ struct folio *folio, unsigned long start,
+ unsigned long addr, unsigned int nr_pages)
+{
+ vm_fault_t ret = 0;
+ struct vm_area_struct *vma = vmf->vma;
+ struct file *file = vma->vm_file;
+ struct page *page = folio_page(folio, start);
+ unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
+ unsigned int ref_count = 0, count = 0;
+
+ do {
+ if (PageHWPoison(page))
+ continue;
+
+ if (mmap_miss > 0)
+ mmap_miss--;
+
+ /*
+ * NOTE: If there're PTE markers, we'll leave them to be
+ * handled in the specific fault path, and it'll prohibit the
+ * fault-around logic.
+ */
+ if (!pte_none(*vmf->pte))
+ continue;
+
+ if (vmf->address == addr)
+ ret = VM_FAULT_NOPAGE;
+
+ ref_count++;
+ do_set_pte(vmf, page, addr);
+ update_mmu_cache(vma, addr, vmf->pte);
+ } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+
+ /* Restore the vmf->pte */
+ vmf->pte -= nr_pages;
+
+ folio_ref_add(folio, ref_count);
+ WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
+
+ return ret;
+}
+
vm_fault_t filemap_map_pages(struct vm_fault *vmf,
pgoff_t start_pgoff, pgoff_t end_pgoff)
{
@@ -3490,9 +3527,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct folio *folio;
- struct page *page;
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
vm_fault_t ret = 0;
+ int nr_pages = 0;

rcu_read_lock();
folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3507,45 +3544,18 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
do {
-again:
- page = folio_file_page(folio, xas.xa_index);
- if (PageHWPoison(page))
- goto unlock;
-
- if (mmap_miss > 0)
- mmap_miss--;
+ unsigned long end;

addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
+ end = folio->index + folio_nr_pages(folio) - 1;
+ nr_pages = min(end, end_pgoff) - xas.xa_index + 1;

- /*
- * NOTE: If there're PTE markers, we'll leave them to be
- * handled in the specific fault path, and it'll prohibit the
- * fault-around logic.
- */
- if (!pte_none(*vmf->pte))
- goto unlock;
+ ret |= filemap_map_folio_range(vmf, folio,
+ xas.xa_index - folio->index, addr, nr_pages);
+ xas.xa_index += nr_pages;

- /* We're about to handle the fault */
- if (vmf->address == addr)
- ret = VM_FAULT_NOPAGE;
-
- do_set_pte(vmf, page, addr);
- /* no need to invalidate: a not-present page won't be cached */
- update_mmu_cache(vma, addr, vmf->pte);
- if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
- xas.xa_index++;
- folio_ref_inc(folio);
- goto again;
- }
- folio_unlock(folio);
- continue;
-unlock:
- if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
- xas.xa_index++;
- goto again;
- }
folio_unlock(folio);
folio_put(folio);
} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
--
2.39.2
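
To make the nr_pages arithmetic in the new filemap_map_pages() loop concrete, here is a worked example with invented numbers:

	/*
	 * Worked example (invented numbers): a 16-page folio at index 0,
	 * fault-around window ending at end_pgoff = 10, current
	 * xas.xa_index = 3.
	 *
	 *	end      = 0 + 16 - 1          = 15
	 *	nr_pages = min(15, 10) - 3 + 1 = 8
	 *
	 * so pages 3..10 of the folio are mapped by one
	 * filemap_map_folio_range() call and xa_index advances to 11.
	 */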


2023-03-15 09:28:15

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 02/36] mm: Add generic flush_icache_pages() and documentation

On Wed, Mar 15, 2023 at 05:14:10AM +0000, Matthew Wilcox (Oracle) wrote:
> flush_icache_page() is deprecated but not yet removed, so add
> a range version of it. Change the documentation to refer to
> update_mmu_cache_range() instead of update_mmu_cache().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Acked-by: Mike Rapoport (IBM) <[email protected]>

> ---
> Documentation/core-api/cachetlb.rst | 35 +++++++++++++++--------------
> include/asm-generic/cacheflush.h | 5 +++++
> 2 files changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
> index 5c0552e78c58..d4c9e2a28d36 100644
> --- a/Documentation/core-api/cachetlb.rst
> +++ b/Documentation/core-api/cachetlb.rst
> @@ -88,13 +88,13 @@ changes occur:
>
> This is used primarily during fault processing.
>
> -5) ``void update_mmu_cache(struct vm_area_struct *vma,
> - unsigned long address, pte_t *ptep)``
> +5) ``void update_mmu_cache_range(struct vm_area_struct *vma,
> + unsigned long address, pte_t *ptep, unsigned int nr)``
>
> - At the end of every page fault, this routine is invoked to
> - tell the architecture specific code that a translation
> - now exists at virtual address "address" for address space
> - "vma->vm_mm", in the software page tables.
> + At the end of every page fault, this routine is invoked to tell
> + the architecture specific code that translations now exists
> + in the software page tables for address space "vma->vm_mm"
> + at virtual address "address" for "nr" consecutive pages.
>
> A port may use this information in any way it so chooses.
> For example, it could use this event to pre-load TLB
> @@ -306,17 +306,18 @@ maps this page at its virtual address.
> private". The kernel guarantees that, for pagecache pages, it will
> clear this bit when such a page first enters the pagecache.
>
> - This allows these interfaces to be implemented much more efficiently.
> - It allows one to "defer" (perhaps indefinitely) the actual flush if
> - there are currently no user processes mapping this page. See sparc64's
> - flush_dcache_page and update_mmu_cache implementations for an example
> - of how to go about doing this.
> + This allows these interfaces to be implemented much more
> + efficiently. It allows one to "defer" (perhaps indefinitely) the
> + actual flush if there are currently no user processes mapping this
> + page. See sparc64's flush_dcache_page and update_mmu_cache_range
> + implementations for an example of how to go about doing this.
>
> - The idea is, first at flush_dcache_page() time, if page_file_mapping()
> - returns a mapping, and mapping_mapped on that mapping returns %false,
> - just mark the architecture private page flag bit. Later, in
> - update_mmu_cache(), a check is made of this flag bit, and if set the
> - flush is done and the flag bit is cleared.
> + The idea is, first at flush_dcache_page() time, if
> + page_file_mapping() returns a mapping, and mapping_mapped on that
> + mapping returns %false, just mark the architecture private page
> + flag bit. Later, in update_mmu_cache_range(), a check is made
> + of this flag bit, and if set the flush is done and the flag bit
> + is cleared.
>
> .. important::
>
> @@ -369,7 +370,7 @@ maps this page at its virtual address.
> ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
>
> All the functionality of flush_icache_page can be implemented in
> - flush_dcache_page and update_mmu_cache. In the future, the hope
> + flush_dcache_page and update_mmu_cache_range. In the future, the hope
> is to remove this interface completely.
>
> The final category of APIs is for I/O to deliberately aliased address
> diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
> index f46258d1a080..09d51a680765 100644
> --- a/include/asm-generic/cacheflush.h
> +++ b/include/asm-generic/cacheflush.h
> @@ -78,6 +78,11 @@ static inline void flush_icache_range(unsigned long start, unsigned long end)
> #endif
>
> #ifndef flush_icache_page
> +static inline void flush_icache_pages(struct vm_area_struct *vma,
> + struct page *page, unsigned int nr)
> +{
> +}
> +
> static inline void flush_icache_page(struct vm_area_struct *vma,
> struct page *page)
> {
> --
> 2.39.2
>
>

--
Sincerely yours,
Mike.
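
To illustrate the new hook, a sketch (not taken from the series) of how an architecture with an incoherent instruction cache might wire flush_icache_pages() up to its existing per-page flush; it assumes the common case where the struct pages of a folio are contiguous in the memmap:

	#define flush_icache_pages flush_icache_pages
	static inline void flush_icache_pages(struct vm_area_struct *vma,
			struct page *page, unsigned int nr)
	{
		unsigned int i;

		/* Pages of one folio sit next to each other in the memmap,
		 * so page + i is the i-th page of the range. */
		for (i = 0; i < nr; i++)
			flush_icache_page(vma, page + i);
	}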

2023-03-15 09:29:26

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 04/36] mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO

On Wed, Mar 15, 2023 at 05:14:12AM +0000, Matthew Wilcox (Oracle) wrote:
> Current best practice is to reuse the name of the function as a define
> to indicate that the function is implemented by the architecture.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Acked-by: Mike Rapoport (IBM) <[email protected]>

> ---
> Documentation/core-api/cachetlb.rst | 24 +++++++++---------------
> include/linux/cacheflush.h | 4 ++--
> mm/util.c | 2 +-
> 3 files changed, 12 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
> index d4c9e2a28d36..770008afd409 100644
> --- a/Documentation/core-api/cachetlb.rst
> +++ b/Documentation/core-api/cachetlb.rst
> @@ -269,7 +269,7 @@ maps this page at its virtual address.
> If D-cache aliasing is not an issue, these two routines may
> simply call memcpy/memset directly and do nothing more.
>
> - ``void flush_dcache_page(struct page *page)``
> + ``void flush_dcache_folio(struct folio *folio)``
>
> This routines must be called when:
>
> @@ -277,7 +277,7 @@ maps this page at its virtual address.
> and / or in high memory
> b) the kernel is about to read from a page cache page and user space
> shared/writable mappings of this page potentially exist. Note
> - that {get,pin}_user_pages{_fast} already call flush_dcache_page
> + that {get,pin}_user_pages{_fast} already call flush_dcache_folio
> on any page found in the user address space and thus driver
> code rarely needs to take this into account.
>
> @@ -291,7 +291,7 @@ maps this page at its virtual address.
>
> The phrase "kernel writes to a page cache page" means, specifically,
> that the kernel executes store instructions that dirty data in that
> - page at the page->virtual mapping of that page. It is important to
> + page at the kernel virtual mapping of that page. It is important to
> flush here to handle D-cache aliasing, to make sure these kernel stores
> are visible to user space mappings of that page.
>
> @@ -302,18 +302,18 @@ maps this page at its virtual address.
> If D-cache aliasing is not an issue, this routine may simply be defined
> as a nop on that architecture.
>
> - There is a bit set aside in page->flags (PG_arch_1) as "architecture
> + There is a bit set aside in folio->flags (PG_arch_1) as "architecture
> private". The kernel guarantees that, for pagecache pages, it will
> clear this bit when such a page first enters the pagecache.
>
> This allows these interfaces to be implemented much more
> efficiently. It allows one to "defer" (perhaps indefinitely) the
> actual flush if there are currently no user processes mapping this
> - page. See sparc64's flush_dcache_page and update_mmu_cache_range
> + page. See sparc64's flush_dcache_folio and update_mmu_cache_range
> implementations for an example of how to go about doing this.
>
> - The idea is, first at flush_dcache_page() time, if
> - page_file_mapping() returns a mapping, and mapping_mapped on that
> + The idea is, first at flush_dcache_folio() time, if
> + folio_flush_mapping() returns a mapping, and mapping_mapped() on that
> mapping returns %false, just mark the architecture private page
> flag bit. Later, in update_mmu_cache_range(), a check is made
> of this flag bit, and if set the flush is done and the flag bit
> @@ -327,12 +327,6 @@ maps this page at its virtual address.
> dirty. Again, see sparc64 for examples of how
> to deal with this.
>
> - ``void flush_dcache_folio(struct folio *folio)``
> - This function is called under the same circumstances as
> - flush_dcache_page(). It allows the architecture to
> - optimise for flushing the entire folio of pages instead
> - of flushing one page at a time.
> -
> ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
> unsigned long user_vaddr, void *dst, void *src, int len)``
> ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
> @@ -353,7 +347,7 @@ maps this page at its virtual address.
>
> When the kernel needs to access the contents of an anonymous
> page, it calls this function (currently only
> - get_user_pages()). Note: flush_dcache_page() deliberately
> + get_user_pages()). Note: flush_dcache_folio() deliberately
> doesn't work for an anonymous page. The default
> implementation is a nop (and should remain so for all coherent
> architectures). For incoherent architectures, it should flush
> @@ -370,7 +364,7 @@ maps this page at its virtual address.
> ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
>
> All the functionality of flush_icache_page can be implemented in
> - flush_dcache_page and update_mmu_cache_range. In the future, the hope
> + flush_dcache_folio and update_mmu_cache_range. In the future, the hope
> is to remove this interface completely.
>
> The final category of APIs is for I/O to deliberately aliased address
> diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
> index a6189d21f2ba..82136f3fcf54 100644
> --- a/include/linux/cacheflush.h
> +++ b/include/linux/cacheflush.h
> @@ -7,14 +7,14 @@
> struct folio;
>
> #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
> -#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
> +#ifndef flush_dcache_folio
> void flush_dcache_folio(struct folio *folio);
> #endif
> #else
> static inline void flush_dcache_folio(struct folio *folio)
> {
> }
> -#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 0
> +#define flush_dcache_folio flush_dcache_folio
> #endif /* ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE */
>
> #endif /* _LINUX_CACHEFLUSH_H */
> diff --git a/mm/util.c b/mm/util.c
> index dd12b9531ac4..98ce51b01627 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1125,7 +1125,7 @@ void page_offline_end(void)
> }
> EXPORT_SYMBOL(page_offline_end);
>
> -#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
> +#ifndef flush_dcache_folio
> void flush_dcache_folio(struct folio *folio)
> {
> long i, nr = folio_nr_pages(folio);
> --
> 2.39.2
>
>

--
Sincerely yours,
Mike.
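
The deferral scheme described in the quoted documentation boils down to the following skeleton (illustrative only; __flush_folio_now() is a made-up stand-in, and real implementations such as sparc64, quoted next, add SMP and D-cache aliasing details):

	void flush_dcache_folio(struct folio *folio)
	{
		struct address_space *mapping = folio_flush_mapping(folio);

		/* Pagecache folio with no user mappings yet: mark the
		 * arch-private bit and defer the flush. */
		if (mapping && !mapping_mapped(mapping)) {
			set_bit(PG_arch_1, &folio->flags);
			return;
		}
		__flush_folio_now(folio);		/* hypothetical helper */
	}

	void update_mmu_cache_range(struct vm_area_struct *vma,
			unsigned long addr, pte_t *ptep, unsigned int nr)
	{
		unsigned long pfn = pte_pfn(*ptep);
		struct folio *folio;

		if (!pfn_valid(pfn))
			return;
		folio = page_folio(pfn_to_page(pfn));
		if (test_and_clear_bit(PG_arch_1, &folio->flags))
			__flush_folio_now(folio);	/* deferred flush happens here */
	}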

2023-03-15 10:12:54

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH v4 25/36] sparc64: Implement the new page table range API

On Wed, Mar 15, 2023 at 05:14:33AM +0000, Matthew Wilcox (Oracle) wrote:
> Add set_ptes(), update_mmu_cache_range(), flush_dcache_folio() and
> flush_icache_pages(). Convert the PG_dcache_dirty flag from being
> per-page to per-folio.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> Cc: "David S. Miller" <[email protected]>
> Cc: [email protected]

Acked-by: Mike Rapoport (IBM) <[email protected]>

> ---
> arch/sparc/include/asm/cacheflush_64.h | 18 ++++--
> arch/sparc/include/asm/pgtable_64.h | 24 ++++++--
> arch/sparc/kernel/smp_64.c | 56 +++++++++++-------
> arch/sparc/mm/init_64.c | 78 +++++++++++++++-----------
> arch/sparc/mm/tlb.c | 5 +-
> 5 files changed, 116 insertions(+), 65 deletions(-)
>
> diff --git a/arch/sparc/include/asm/cacheflush_64.h b/arch/sparc/include/asm/cacheflush_64.h
> index b9341836597e..a9a719f04d06 100644
> --- a/arch/sparc/include/asm/cacheflush_64.h
> +++ b/arch/sparc/include/asm/cacheflush_64.h
> @@ -35,20 +35,26 @@ void flush_icache_range(unsigned long start, unsigned long end);
> void __flush_icache_page(unsigned long);
>
> void __flush_dcache_page(void *addr, int flush_icache);
> -void flush_dcache_page_impl(struct page *page);
> +void flush_dcache_folio_impl(struct folio *folio);
> #ifdef CONFIG_SMP
> -void smp_flush_dcache_page_impl(struct page *page, int cpu);
> -void flush_dcache_page_all(struct mm_struct *mm, struct page *page);
> +void smp_flush_dcache_folio_impl(struct folio *folio, int cpu);
> +void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio);
> #else
> -#define smp_flush_dcache_page_impl(page,cpu) flush_dcache_page_impl(page)
> -#define flush_dcache_page_all(mm,page) flush_dcache_page_impl(page)
> +#define smp_flush_dcache_folio_impl(folio, cpu) flush_dcache_folio_impl(folio)
> +#define flush_dcache_folio_all(mm, folio) flush_dcache_folio_impl(folio)
> #endif
>
> void __flush_dcache_range(unsigned long start, unsigned long end);
> #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
> -void flush_dcache_page(struct page *page);
> +void flush_dcache_folio(struct folio *folio);
> +#define flush_dcache_folio flush_dcache_folio
> +static inline void flush_dcache_page(struct page *page)
> +{
> + flush_dcache_folio(page_folio(page));
> +}
>
> #define flush_icache_page(vma, pg) do { } while(0)
> +#define flush_icache_pages(vma, pg, nr) do { } while(0)
>
> void flush_ptrace_access(struct vm_area_struct *, struct page *,
> unsigned long uaddr, void *kaddr,
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 2dc8d4641734..49c37000e1b1 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -911,8 +911,19 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
> maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
> }
>
> -#define set_pte_at(mm,addr,ptep,pte) \
> - __set_pte_at((mm), (addr), (ptep), (pte), 0)
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> + for (;;) {
> + __set_pte_at(mm, addr, ptep, pte, 0);
> + if (--nr == 0)
> + break;
> + ptep++;
> + pte_val(pte) += PAGE_SIZE;
> + addr += PAGE_SIZE;
> + }
> +}
> +#define set_ptes set_ptes
>
> #define pte_clear(mm,addr,ptep) \
> set_pte_at((mm), (addr), (ptep), __pte(0UL))
> @@ -931,8 +942,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
> \
> if (pfn_valid(this_pfn) && \
> (((old_addr) ^ (new_addr)) & (1 << 13))) \
> - flush_dcache_page_all(current->mm, \
> - pfn_to_page(this_pfn)); \
> + flush_dcache_folio_all(current->mm, \
> + page_folio(pfn_to_page(this_pfn))); \
> } \
> newpte; \
> })
> @@ -947,7 +958,10 @@ struct seq_file;
> void mmu_info(struct seq_file *);
>
> struct vm_area_struct;
> -void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *);
> +void update_mmu_cache_range(struct vm_area_struct *, unsigned long addr,
> + pte_t *ptep, unsigned int nr);
> +#define update_mmu_cache(vma, addr, ptep) \
> + update_mmu_cache_range(vma, addr, ptep, 1)
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd);
> diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
> index a55295d1b924..90ef8677ac89 100644
> --- a/arch/sparc/kernel/smp_64.c
> +++ b/arch/sparc/kernel/smp_64.c
> @@ -921,20 +921,26 @@ extern unsigned long xcall_flush_dcache_page_cheetah;
> #endif
> extern unsigned long xcall_flush_dcache_page_spitfire;
>
> -static inline void __local_flush_dcache_page(struct page *page)
> +static inline void __local_flush_dcache_folio(struct folio *folio)
> {
> + unsigned int i, nr = folio_nr_pages(folio);
> +
> #ifdef DCACHE_ALIASING_POSSIBLE
> - __flush_dcache_page(page_address(page),
> + for (i = 0; i < nr; i++)
> + __flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
> ((tlb_type == spitfire) &&
> - page_mapping_file(page) != NULL));
> + folio_flush_mapping(folio) != NULL));
> #else
> - if (page_mapping_file(page) != NULL &&
> - tlb_type == spitfire)
> - __flush_icache_page(__pa(page_address(page)));
> + if (folio_flush_mapping(folio) != NULL &&
> + tlb_type == spitfire) {
> + unsigned long pfn = folio_pfn(folio);
> + for (i = 0; i < nr; i++)
> + __flush_icache_page((pfn + i) * PAGE_SIZE);
> + }
> #endif
> }
>
> -void smp_flush_dcache_page_impl(struct page *page, int cpu)
> +void smp_flush_dcache_folio_impl(struct folio *folio, int cpu)
> {
> int this_cpu;
>
> @@ -948,14 +954,14 @@ void smp_flush_dcache_page_impl(struct page *page, int cpu)
> this_cpu = get_cpu();
>
> if (cpu == this_cpu) {
> - __local_flush_dcache_page(page);
> + __local_flush_dcache_folio(folio);
> } else if (cpu_online(cpu)) {
> - void *pg_addr = page_address(page);
> + void *pg_addr = folio_address(folio);
> u64 data0 = 0;
>
> if (tlb_type == spitfire) {
> data0 = ((u64)&xcall_flush_dcache_page_spitfire);
> - if (page_mapping_file(page) != NULL)
> + if (folio_flush_mapping(folio) != NULL)
> data0 |= ((u64)1 << 32);
> } else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
> #ifdef DCACHE_ALIASING_POSSIBLE
> @@ -963,18 +969,23 @@ void smp_flush_dcache_page_impl(struct page *page, int cpu)
> #endif
> }
> if (data0) {
> - xcall_deliver(data0, __pa(pg_addr),
> - (u64) pg_addr, cpumask_of(cpu));
> + unsigned int i, nr = folio_nr_pages(folio);
> +
> + for (i = 0; i < nr; i++) {
> + xcall_deliver(data0, __pa(pg_addr),
> + (u64) pg_addr, cpumask_of(cpu));
> #ifdef CONFIG_DEBUG_DCFLUSH
> - atomic_inc(&dcpage_flushes_xcall);
> + atomic_inc(&dcpage_flushes_xcall);
> #endif
> + pg_addr += PAGE_SIZE;
> + }
> }
> }
>
> put_cpu();
> }
>
> -void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
> +void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio)
> {
> void *pg_addr;
> u64 data0;
> @@ -988,10 +999,10 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
> atomic_inc(&dcpage_flushes);
> #endif
> data0 = 0;
> - pg_addr = page_address(page);
> + pg_addr = folio_address(folio);
> if (tlb_type == spitfire) {
> data0 = ((u64)&xcall_flush_dcache_page_spitfire);
> - if (page_mapping_file(page) != NULL)
> + if (folio_flush_mapping(folio) != NULL)
> data0 |= ((u64)1 << 32);
> } else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
> #ifdef DCACHE_ALIASING_POSSIBLE
> @@ -999,13 +1010,18 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
> #endif
> }
> if (data0) {
> - xcall_deliver(data0, __pa(pg_addr),
> - (u64) pg_addr, cpu_online_mask);
> + unsigned int i, nr = folio_nr_pages(folio);
> +
> + for (i = 0; i < nr; i++) {
> + xcall_deliver(data0, __pa(pg_addr),
> + (u64) pg_addr, cpu_online_mask);
> #ifdef CONFIG_DEBUG_DCFLUSH
> - atomic_inc(&dcpage_flushes_xcall);
> + atomic_inc(&dcpage_flushes_xcall);
> #endif
> + pg_addr += PAGE_SIZE;
> + }
> }
> - __local_flush_dcache_page(page);
> + __local_flush_dcache_folio(folio);
>
> preempt_enable();
> }
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 04f9db0c3111..ab9aacbaf43c 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -195,21 +195,26 @@ atomic_t dcpage_flushes_xcall = ATOMIC_INIT(0);
> #endif
> #endif
>
> -inline void flush_dcache_page_impl(struct page *page)
> +inline void flush_dcache_folio_impl(struct folio *folio)
> {
> + unsigned int i, nr = folio_nr_pages(folio);
> +
> BUG_ON(tlb_type == hypervisor);
> #ifdef CONFIG_DEBUG_DCFLUSH
> atomic_inc(&dcpage_flushes);
> #endif
>
> #ifdef DCACHE_ALIASING_POSSIBLE
> - __flush_dcache_page(page_address(page),
> - ((tlb_type == spitfire) &&
> - page_mapping_file(page) != NULL));
> + for (i = 0; i < nr; i++)
> + __flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
> + ((tlb_type == spitfire) &&
> + folio_flush_mapping(folio) != NULL));
> #else
> - if (page_mapping_file(page) != NULL &&
> - tlb_type == spitfire)
> - __flush_icache_page(__pa(page_address(page)));
> + if (folio_flush_mapping(folio) != NULL &&
> + tlb_type == spitfire) {
> + for (i = 0; i < nr; i++)
> + __flush_icache_page(__pa(folio_address(folio)) + i * PAGE_SIZE);
> + }
> #endif
> }
>
> @@ -218,10 +223,10 @@ inline void flush_dcache_page_impl(struct page *page)
> #define PG_dcache_cpu_mask \
> ((1UL<<ilog2(roundup_pow_of_two(NR_CPUS)))-1UL)
>
> -#define dcache_dirty_cpu(page) \
> - (((page)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)
> +#define dcache_dirty_cpu(folio) \
> + (((folio)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)
>
> -static inline void set_dcache_dirty(struct page *page, int this_cpu)
> +static inline void set_dcache_dirty(struct folio *folio, int this_cpu)
> {
> unsigned long mask = this_cpu;
> unsigned long non_cpu_bits;
> @@ -238,11 +243,11 @@ static inline void set_dcache_dirty(struct page *page, int this_cpu)
> "bne,pn %%xcc, 1b\n\t"
> " nop"
> : /* no outputs */
> - : "r" (mask), "r" (non_cpu_bits), "r" (&page->flags)
> + : "r" (mask), "r" (non_cpu_bits), "r" (&folio->flags)
> : "g1", "g7");
> }
>
> -static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
> +static inline void clear_dcache_dirty_cpu(struct folio *folio, unsigned long cpu)
> {
> unsigned long mask = (1UL << PG_dcache_dirty);
>
> @@ -260,7 +265,7 @@ static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
> " nop\n"
> "2:"
> : /* no outputs */
> - : "r" (cpu), "r" (mask), "r" (&page->flags),
> + : "r" (cpu), "r" (mask), "r" (&folio->flags),
> "i" (PG_dcache_cpu_mask),
> "i" (PG_dcache_cpu_shift)
> : "g1", "g7");
> @@ -284,9 +289,10 @@ static void flush_dcache(unsigned long pfn)
>
> page = pfn_to_page(pfn);
> if (page) {
> + struct folio *folio = page_folio(page);
> unsigned long pg_flags;
>
> - pg_flags = page->flags;
> + pg_flags = folio->flags;
> if (pg_flags & (1UL << PG_dcache_dirty)) {
> int cpu = ((pg_flags >> PG_dcache_cpu_shift) &
> PG_dcache_cpu_mask);
> @@ -296,11 +302,11 @@ static void flush_dcache(unsigned long pfn)
> * in the SMP case.
> */
> if (cpu == this_cpu)
> - flush_dcache_page_impl(page);
> + flush_dcache_folio_impl(folio);
> else
> - smp_flush_dcache_page_impl(page, cpu);
> + smp_flush_dcache_folio_impl(folio, cpu);
>
> - clear_dcache_dirty_cpu(page, cpu);
> + clear_dcache_dirty_cpu(folio, cpu);
>
> put_cpu();
> }
> @@ -388,12 +394,14 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
> }
> #endif /* CONFIG_HUGETLB_PAGE */
>
> -void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep)
> +void update_mmu_cache_range(struct vm_area_struct *vma, unsigned long address,
> + pte_t *ptep, unsigned int nr)
> {
> struct mm_struct *mm;
> unsigned long flags;
> bool is_huge_tsb;
> pte_t pte = *ptep;
> + unsigned int i;
>
> if (tlb_type != hypervisor) {
> unsigned long pfn = pte_pfn(pte);
> @@ -440,15 +448,21 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *
> }
> }
> #endif
> - if (!is_huge_tsb)
> - __update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
> - address, pte_val(pte));
> + if (!is_huge_tsb) {
> + for (i = 0; i < nr; i++) {
> + __update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
> + address, pte_val(pte));
> + address += PAGE_SIZE;
> + pte_val(pte) += PAGE_SIZE;
> + }
> + }
>
> spin_unlock_irqrestore(&mm->context.lock, flags);
> }
>
> -void flush_dcache_page(struct page *page)
> +void flush_dcache_folio(struct folio *folio)
> {
> + unsigned long pfn = folio_pfn(folio);
> struct address_space *mapping;
> int this_cpu;
>
> @@ -459,35 +473,35 @@ void flush_dcache_page(struct page *page)
> * is merely the zero page. The 'bigcore' testcase in GDB
> * causes this case to run millions of times.
> */
> - if (page == ZERO_PAGE(0))
> + if (is_zero_pfn(pfn))
> return;
>
> this_cpu = get_cpu();
>
> - mapping = page_mapping_file(page);
> + mapping = folio_flush_mapping(folio);
> if (mapping && !mapping_mapped(mapping)) {
> - int dirty = test_bit(PG_dcache_dirty, &page->flags);
> + bool dirty = test_bit(PG_dcache_dirty, &folio->flags);
> if (dirty) {
> - int dirty_cpu = dcache_dirty_cpu(page);
> + int dirty_cpu = dcache_dirty_cpu(folio);
>
> if (dirty_cpu == this_cpu)
> goto out;
> - smp_flush_dcache_page_impl(page, dirty_cpu);
> + smp_flush_dcache_folio_impl(folio, dirty_cpu);
> }
> - set_dcache_dirty(page, this_cpu);
> + set_dcache_dirty(folio, this_cpu);
> } else {
> /* We could delay the flush for the !page_mapping
> * case too. But that case is for exec env/arg
> * pages and those are %99 certainly going to get
> * faulted into the tlb (and thus flushed) anyways.
> */
> - flush_dcache_page_impl(page);
> + flush_dcache_folio_impl(folio);
> }
>
> out:
> put_cpu();
> }
> -EXPORT_SYMBOL(flush_dcache_page);
> +EXPORT_SYMBOL(flush_dcache_folio);
>
> void __kprobes flush_icache_range(unsigned long start, unsigned long end)
> {
> @@ -2280,10 +2294,10 @@ void __init paging_init(void)
> setup_page_offset();
>
> /* These build time checkes make sure that the dcache_dirty_cpu()
> - * page->flags usage will work.
> + * folio->flags usage will work.
> *
> * When a page gets marked as dcache-dirty, we store the
> - * cpu number starting at bit 32 in the page->flags. Also,
> + * cpu number starting at bit 32 in the folio->flags. Also,
> * functions like clear_dcache_dirty_cpu use the cpu mask
> * in 13-bit signed-immediate instruction fields.
> */
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index 9a725547578e..3fa6a070912d 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -118,6 +118,7 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
> unsigned long paddr, pfn = pte_pfn(orig);
> struct address_space *mapping;
> struct page *page;
> + struct folio *folio;
>
> if (!pfn_valid(pfn))
> goto no_cache_flush;
> @@ -127,13 +128,13 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
> goto no_cache_flush;
>
> /* A real file page? */
> - mapping = page_mapping_file(page);
> + mapping = folio_flush_mapping(folio);
> if (!mapping)
> goto no_cache_flush;
>
> paddr = (unsigned long) page_address(page);
> if ((paddr ^ vaddr) & (1 << 13))
> - flush_dcache_page_all(mm, page);
> + flush_dcache_folio_all(mm, folio);
> }
>
> no_cache_flush:
> --
> 2.39.2
>
>

--
Sincerely yours,
Mike.
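
Seen from the caller's side, the sparc64 conversion above supports usage along these lines (a hypothetical helper, not code from the series, assuming the folio is locked and fits within one PMD):

	static void map_whole_folio(struct vm_area_struct *vma, struct folio *folio,
				    unsigned long addr, pte_t *ptep)
	{
		unsigned int nr = folio_nr_pages(folio);
		pte_t pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);

		flush_icache_pages(vma, folio_page(folio, 0), nr);
		/* One call installs nr consecutive entries; the architecture
		 * bumps the pfn by one page per entry, as in the sparc64 loop. */
		set_ptes(vma->vm_mm, addr, ptep, pte, nr);
		update_mmu_cache_range(vma, addr, ptep, nr);
	}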

2023-03-15 15:26:25

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 15/03/2023 05:14, Matthew Wilcox (Oracle) wrote:
> From: Yin Fengwei <[email protected]>
>
> set_pte_range() allows to setup page table entries for a specific
> range. It takes advantage of batched rmap update for large folio.
> It now takes care of calling update_mmu_cache_range().
>
> Signed-off-by: Yin Fengwei <[email protected]>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> Documentation/filesystems/locking.rst | 2 +-
> include/linux/mm.h | 3 ++-
> mm/filemap.c | 3 +--
> mm/memory.c | 27 +++++++++++++++------------
> 4 files changed, 19 insertions(+), 16 deletions(-)
>
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index 7de7a7272a5e..922886fefb7f 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -663,7 +663,7 @@ locked. The VM will unlock the page.
> Filesystem should find and map pages associated with offsets from "start_pgoff"
> till "end_pgoff". ->map_pages() is called with page table locked and must
> not block. If it's not possible to reach a page without blocking,
> -filesystem should skip it. Filesystem should use do_set_pte() to setup
> +filesystem should skip it. Filesystem should use set_pte_range() to setup
> page table entry. Pointer to entry associated with the page is passed in
> "pte" field in vm_fault structure. Pointers to entries for other offsets
> should be calculated relative to "pte".
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ee755bb4e1c1..81788c985a8c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1299,7 +1299,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> }
>
> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
> + struct page *page, unsigned int nr, unsigned long addr);
>
> vm_fault_t finish_fault(struct vm_fault *vmf);
> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 6e2b0778db45..e2317623dcbf 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3504,8 +3504,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
> ret = VM_FAULT_NOPAGE;
>
> ref_count++;
> - do_set_pte(vmf, page, addr);
> - update_mmu_cache(vma, addr, vmf->pte);
> + set_pte_range(vmf, folio, page, 1, addr);
> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>
> /* Restore the vmf->pte */
> diff --git a/mm/memory.c b/mm/memory.c
> index 6aa21e8f3753..9a654802f104 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4274,7 +4274,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> }
> #endif
>
> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
> + struct page *page, unsigned int nr, unsigned long addr)
> {
> struct vm_area_struct *vma = vmf->vma;
> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
> @@ -4282,7 +4283,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> bool prefault = vmf->address != addr;

I think you are changing behavior here - is this intentional? Previously this
would be evaluated per page; now it's evaluated once for the whole range. The
intention below is that directly faulted pages are mapped young and prefaulted
pages are mapped old. But now a whole range will be mapped the same.

Thanks,
Ryan
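
To spell the difference out, a hypothetical sketch for the nr > 1 case (not code from the patch; addr is the start of the batch):

	bool range_prefault = vmf->address != addr;	/* evaluated once, at batch start */
	unsigned int i;

	for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
		bool page_prefault = vmf->address != addr;	/* what do_set_pte() tested */

		/*
		 * Old behaviour: this page's entry was made old only when
		 * page_prefault was true, so the directly faulted page came
		 * out young.  New behaviour: every entry in the batch follows
		 * range_prefault instead.
		 */
	}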

> pte_t entry;
>
> - flush_icache_page(vma, page);
> + flush_icache_pages(vma, page, nr);
> entry = mk_pte(page, vma->vm_page_prot);
>
> if (prefault && arch_wants_old_prefaulted_pte())
> @@ -4296,14 +4297,18 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
> entry = pte_mkuffd_wp(entry);
> /* copy-on-write page */
> if (write && !(vma->vm_flags & VM_SHARED)) {
> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> - page_add_new_anon_rmap(page, vma, addr);
> - lru_cache_add_inactive_or_unevictable(page, vma);
> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
> + VM_BUG_ON_FOLIO(nr != 1, folio);
> + folio_add_new_anon_rmap(folio, vma, addr);
> + folio_add_lru_vma(folio, vma);
> } else {
> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
> - page_add_file_rmap(page, vma, false);
> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
> + folio_add_file_rmap_range(folio, page, nr, vma, false);
> }
> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
> +
> + /* no need to invalidate: a not-present page won't be cached */
> + update_mmu_cache_range(vma, addr, vmf->pte, nr);
> }
>
> static bool vmf_pte_changed(struct vm_fault *vmf)
> @@ -4376,11 +4381,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>
> /* Re-check under ptl */
> if (likely(!vmf_pte_changed(vmf))) {
> - do_set_pte(vmf, page, vmf->address);
> -
> - /* no need to invalidate: a not-present page won't be cached */
> - update_mmu_cache(vma, vmf->address, vmf->pte);
> + struct folio *folio = page_folio(page);
>
> + set_pte_range(vmf, folio, page, 1, vmf->address);
> ret = 0;
> } else {
> update_mmu_tlb(vma, vmf->address, vmf->pte);


2023-03-16 16:26:18

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/15/2023 11:26 PM, Ryan Roberts wrote:
> On 15/03/2023 05:14, Matthew Wilcox (Oracle) wrote:
>> From: Yin Fengwei <[email protected]>
>>
>> set_pte_range() allows to setup page table entries for a specific
>> range. It takes advantage of batched rmap update for large folio.
>> It now takes care of calling update_mmu_cache_range().
>>
>> Signed-off-by: Yin Fengwei <[email protected]>
>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>> ---
>> Documentation/filesystems/locking.rst | 2 +-
>> include/linux/mm.h | 3 ++-
>> mm/filemap.c | 3 +--
>> mm/memory.c | 27 +++++++++++++++------------
>> 4 files changed, 19 insertions(+), 16 deletions(-)
>>
>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>> index 7de7a7272a5e..922886fefb7f 100644
>> --- a/Documentation/filesystems/locking.rst
>> +++ b/Documentation/filesystems/locking.rst
>> @@ -663,7 +663,7 @@ locked. The VM will unlock the page.
>> Filesystem should find and map pages associated with offsets from "start_pgoff"
>> till "end_pgoff". ->map_pages() is called with page table locked and must
>> not block. If it's not possible to reach a page without blocking,
>> -filesystem should skip it. Filesystem should use do_set_pte() to setup
>> +filesystem should skip it. Filesystem should use set_pte_range() to setup
>> page table entry. Pointer to entry associated with the page is passed in
>> "pte" field in vm_fault structure. Pointers to entries for other offsets
>> should be calculated relative to "pte".
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ee755bb4e1c1..81788c985a8c 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1299,7 +1299,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>> }
>>
>> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> + struct page *page, unsigned int nr, unsigned long addr);
>>
>> vm_fault_t finish_fault(struct vm_fault *vmf);
>> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 6e2b0778db45..e2317623dcbf 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3504,8 +3504,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>> ret = VM_FAULT_NOPAGE;
>>
>> ref_count++;
>> - do_set_pte(vmf, page, addr);
>> - update_mmu_cache(vma, addr, vmf->pte);
>> + set_pte_range(vmf, folio, page, 1, addr);
>> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>
>> /* Restore the vmf->pte */
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 6aa21e8f3753..9a654802f104 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4274,7 +4274,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>> }
>> #endif
>>
>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>> + struct page *page, unsigned int nr, unsigned long addr)
>> {
>> struct vm_area_struct *vma = vmf->vma;
>> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>> @@ -4282,7 +4283,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> bool prefault = vmf->address != addr;
>
> I think you are changing behavior here - is this intentional? Previously this
> would be evaluated per page, now its evaluated once for the whole range. The
> intention below is that directly faulted pages are mapped young and prefaulted
> pages are mapped old. But now a whole range will be mapped the same.

Yes. You are right here.

Looking at prefault and cpu_has_hw_af for ARM64, it looks like we
can avoid handling vmf->address == addr specially. It's OK to
drop prefault and change the logic here a little bit to:

	if (arch_wants_old_prefaulted_pte())
		entry = pte_mkold(entry);
	else
		entry = pte_sw_mkyoung(entry);

It's not necessary to use pte_sw_mkyoung for vmf->address == addr
because the HW will set the ACCESS bit in the page table entry.

Add Will Deacon in case I missed something here. Thanks.


Regards
Yin, Fengwei

>
> Thanks,
> Ryan
>
>> pte_t entry;
>>
>> - flush_icache_page(vma, page);
>> + flush_icache_pages(vma, page, nr);
>> entry = mk_pte(page, vma->vm_page_prot);
>>
>> if (prefault && arch_wants_old_prefaulted_pte())
>> @@ -4296,14 +4297,18 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>> entry = pte_mkuffd_wp(entry);
>> /* copy-on-write page */
>> if (write && !(vma->vm_flags & VM_SHARED)) {
>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>> - page_add_new_anon_rmap(page, vma, addr);
>> - lru_cache_add_inactive_or_unevictable(page, vma);
>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>> + VM_BUG_ON_FOLIO(nr != 1, folio);
>> + folio_add_new_anon_rmap(folio, vma, addr);
>> + folio_add_lru_vma(folio, vma);
>> } else {
>> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>> - page_add_file_rmap(page, vma, false);
>> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>> + folio_add_file_rmap_range(folio, page, nr, vma, false);
>> }
>> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>> +
>> + /* no need to invalidate: a not-present page won't be cached */
>> + update_mmu_cache_range(vma, addr, vmf->pte, nr);
>> }
>>
>> static bool vmf_pte_changed(struct vm_fault *vmf)
>> @@ -4376,11 +4381,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>
>> /* Re-check under ptl */
>> if (likely(!vmf_pte_changed(vmf))) {
>> - do_set_pte(vmf, page, vmf->address);
>> -
>> - /* no need to invalidate: a not-present page won't be cached */
>> - update_mmu_cache(vma, vmf->address, vmf->pte);
>> + struct folio *folio = page_folio(page);
>>
>> + set_pte_range(vmf, folio, page, 1, vmf->address);
>> ret = 0;
>> } else {
>> update_mmu_tlb(vma, vmf->address, vmf->pte);
>

2023-03-16 16:40:40

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 16/03/2023 16:23, Yin, Fengwei wrote:
>
>
> On 3/15/2023 11:26 PM, Ryan Roberts wrote:
>> On 15/03/2023 05:14, Matthew Wilcox (Oracle) wrote:
>>> From: Yin Fengwei <[email protected]>
>>>
>>> set_pte_range() allows to setup page table entries for a specific
>>> range. It takes advantage of batched rmap update for large folio.
>>> It now takes care of calling update_mmu_cache_range().
>>>
>>> Signed-off-by: Yin Fengwei <[email protected]>
>>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>>> ---
>>> Documentation/filesystems/locking.rst | 2 +-
>>> include/linux/mm.h | 3 ++-
>>> mm/filemap.c | 3 +--
>>> mm/memory.c | 27 +++++++++++++++------------
>>> 4 files changed, 19 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>>> index 7de7a7272a5e..922886fefb7f 100644
>>> --- a/Documentation/filesystems/locking.rst
>>> +++ b/Documentation/filesystems/locking.rst
>>> @@ -663,7 +663,7 @@ locked. The VM will unlock the page.
>>> Filesystem should find and map pages associated with offsets from "start_pgoff"
>>> till "end_pgoff". ->map_pages() is called with page table locked and must
>>> not block. If it's not possible to reach a page without blocking,
>>> -filesystem should skip it. Filesystem should use do_set_pte() to setup
>>> +filesystem should skip it. Filesystem should use set_pte_range() to setup
>>> page table entry. Pointer to entry associated with the page is passed in
>>> "pte" field in vm_fault structure. Pointers to entries for other offsets
>>> should be calculated relative to "pte".
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index ee755bb4e1c1..81788c985a8c 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1299,7 +1299,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>> }
>>>
>>> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>> + struct page *page, unsigned int nr, unsigned long addr);
>>>
>>> vm_fault_t finish_fault(struct vm_fault *vmf);
>>> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 6e2b0778db45..e2317623dcbf 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -3504,8 +3504,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>> ret = VM_FAULT_NOPAGE;
>>>
>>> ref_count++;
>>> - do_set_pte(vmf, page, addr);
>>> - update_mmu_cache(vma, addr, vmf->pte);
>>> + set_pte_range(vmf, folio, page, 1, addr);
>>> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>>
>>> /* Restore the vmf->pte */
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 6aa21e8f3753..9a654802f104 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4274,7 +4274,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>> }
>>> #endif
>>>
>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>> + struct page *page, unsigned int nr, unsigned long addr)
>>> {
>>> struct vm_area_struct *vma = vmf->vma;
>>> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>> @@ -4282,7 +4283,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>> bool prefault = vmf->address != addr;
>>
>> I think you are changing behavior here - is this intentional? Previously this
>> would be evaluated per page, now its evaluated once for the whole range. The
>> intention below is that directly faulted pages are mapped young and prefaulted
>> pages are mapped old. But now a whole range will be mapped the same.
>
> Yes. You are right here.
>
> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
> can avoid to handle vmf->address == addr specially. It's OK to
> drop prefault and change the logic here a little bit to:
> if (arch_wants_old_prefaulted_pte())
> entry = pte_mkold(entry);
> else
> entry = pte_sw_mkyong(entry);
>
> It's not necessary to use pte_sw_mkyong for vmf->address == addr
> because HW will set the ACCESS bit in page table entry.
>
> Add Will Deacon in case I missed something here. Thanks.

I'll defer to Will's response, but not all arm HW supports HW access flag
management. In that case it's done by SW, so I would imagine that by setting
this to old initially, we will get a second fault to set the access bit, which
will slow things down. I wonder if you will need to split this into (up to) 3
calls to set_ptes()?
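
A rough sketch of that split (hypothetical; it assumes the usual fault-path locals mm, addr, ptep, folio and nr from set_pte_range(), and make_entry() is a made-up stand-in for building page i's entry with the wanted young/old bit):

	unsigned long idx = (vmf->address - addr) >> PAGE_SHIFT;

	if (idx)				/* prefaulted pages before the fault */
		set_ptes(mm, addr, ptep, make_entry(folio, 0, false), idx);

	/* the page actually being faulted: map it young */
	set_ptes(mm, addr + idx * PAGE_SIZE, ptep + idx,
		 make_entry(folio, idx, true), 1);

	if (idx + 1 < nr)			/* prefaulted pages after the fault */
		set_ptes(mm, addr + (idx + 1) * PAGE_SIZE, ptep + idx + 1,
			 make_entry(folio, idx + 1, false), nr - idx - 1);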

>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Ryan
>>
>>> pte_t entry;
>>>
>>> - flush_icache_page(vma, page);
>>> + flush_icache_pages(vma, page, nr);
>>> entry = mk_pte(page, vma->vm_page_prot);
>>>
>>> if (prefault && arch_wants_old_prefaulted_pte())
>>> @@ -4296,14 +4297,18 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>> entry = pte_mkuffd_wp(entry);
>>> /* copy-on-write page */
>>> if (write && !(vma->vm_flags & VM_SHARED)) {
>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>> - page_add_new_anon_rmap(page, vma, addr);
>>> - lru_cache_add_inactive_or_unevictable(page, vma);
>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>>> + VM_BUG_ON_FOLIO(nr != 1, folio);
>>> + folio_add_new_anon_rmap(folio, vma, addr);
>>> + folio_add_lru_vma(folio, vma);
>>> } else {
>>> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>>> - page_add_file_rmap(page, vma, false);
>>> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>> + folio_add_file_rmap_range(folio, page, nr, vma, false);
>>> }
>>> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>> +
>>> + /* no need to invalidate: a not-present page won't be cached */
>>> + update_mmu_cache_range(vma, addr, vmf->pte, nr);
>>> }
>>>
>>> static bool vmf_pte_changed(struct vm_fault *vmf)
>>> @@ -4376,11 +4381,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>
>>> /* Re-check under ptl */
>>> if (likely(!vmf_pte_changed(vmf))) {
>>> - do_set_pte(vmf, page, vmf->address);
>>> -
>>> - /* no need to invalidate: a not-present page won't be cached */
>>> - update_mmu_cache(vma, vmf->address, vmf->pte);
>>> + struct folio *folio = page_folio(page);
>>>
>>> + set_pte_range(vmf, folio, page, 1, vmf->address);
>>> ret = 0;
>>> } else {
>>> update_mmu_tlb(vma, vmf->address, vmf->pte);
>>


2023-03-16 16:43:07

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/17/2023 12:38 AM, Ryan Roberts wrote:
> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>
>>
>> On 3/15/2023 11:26 PM, Ryan Roberts wrote:
>>> On 15/03/2023 05:14, Matthew Wilcox (Oracle) wrote:
>>>> From: Yin Fengwei <[email protected]>
>>>>
>>>> set_pte_range() allows to setup page table entries for a specific
>>>> range. It takes advantage of batched rmap update for large folio.
>>>> It now takes care of calling update_mmu_cache_range().
>>>>
>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>>>> ---
>>>> Documentation/filesystems/locking.rst | 2 +-
>>>> include/linux/mm.h | 3 ++-
>>>> mm/filemap.c | 3 +--
>>>> mm/memory.c | 27 +++++++++++++++------------
>>>> 4 files changed, 19 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>>>> index 7de7a7272a5e..922886fefb7f 100644
>>>> --- a/Documentation/filesystems/locking.rst
>>>> +++ b/Documentation/filesystems/locking.rst
>>>> @@ -663,7 +663,7 @@ locked. The VM will unlock the page.
>>>> Filesystem should find and map pages associated with offsets from "start_pgoff"
>>>> till "end_pgoff". ->map_pages() is called with page table locked and must
>>>> not block. If it's not possible to reach a page without blocking,
>>>> -filesystem should skip it. Filesystem should use do_set_pte() to setup
>>>> +filesystem should skip it. Filesystem should use set_pte_range() to setup
>>>> page table entry. Pointer to entry associated with the page is passed in
>>>> "pte" field in vm_fault structure. Pointers to entries for other offsets
>>>> should be calculated relative to "pte".
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index ee755bb4e1c1..81788c985a8c 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -1299,7 +1299,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>> }
>>>>
>>>> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>> + struct page *page, unsigned int nr, unsigned long addr);
>>>>
>>>> vm_fault_t finish_fault(struct vm_fault *vmf);
>>>> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>> index 6e2b0778db45..e2317623dcbf 100644
>>>> --- a/mm/filemap.c
>>>> +++ b/mm/filemap.c
>>>> @@ -3504,8 +3504,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>> ret = VM_FAULT_NOPAGE;
>>>>
>>>> ref_count++;
>>>> - do_set_pte(vmf, page, addr);
>>>> - update_mmu_cache(vma, addr, vmf->pte);
>>>> + set_pte_range(vmf, folio, page, 1, addr);
>>>> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>>>
>>>> /* Restore the vmf->pte */
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 6aa21e8f3753..9a654802f104 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4274,7 +4274,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>>> }
>>>> #endif
>>>>
>>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>> + struct page *page, unsigned int nr, unsigned long addr)
>>>> {
>>>> struct vm_area_struct *vma = vmf->vma;
>>>> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>>> @@ -4282,7 +4283,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>> bool prefault = vmf->address != addr;
>>>
>>> I think you are changing behavior here - is this intentional? Previously this
>>> would be evaluated per page, now its evaluated once for the whole range. The
>>> intention below is that directly faulted pages are mapped young and prefaulted
>>> pages are mapped old. But now a whole range will be mapped the same.
>>
>> Yes. You are right here.
>>
>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>> can avoid to handle vmf->address == addr specially. It's OK to
>> drop prefault and change the logic here a little bit to:
>> if (arch_wants_old_prefaulted_pte())
>> entry = pte_mkold(entry);
>> else
>> entry = pte_sw_mkyong(entry);
>>
>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>> because HW will set the ACCESS bit in page table entry.
>>
>> Add Will Deacon in case I missed something here. Thanks.
>
> I'll defer to Will's response, but not all arm HW supports HW access flag
> management. In that case it's done by SW, so I would imagine that by setting
> this to old initially, we will get a second fault to set the access bit, which
> will slow things down. I wonder if you will need to split this into (up to) 3
> calls to set_ptes()?
If there is no HW access flag, arch_wants_old_prefaulted_pte() will return
false, so the path will go to pte_sw_mkyoung(entry). Right?
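
For reference, the pieces behind that observation look roughly like this (reproduced from memory of the contemporary tree, so treat it as a sketch rather than a verbatim quote):

	/* arch/arm64/include/asm/pgtable.h (approximate): only prefer old
	 * prefaulted ptes when the CPU can set the Access Flag itself. */
	#define arch_wants_old_prefaulted_pte	cpu_has_hw_af

	/* generic fallback (approximate) when an arch does not define it: */
	#ifndef arch_wants_old_prefaulted_pte
	static inline bool arch_wants_old_prefaulted_pte(void)
	{
		return false;
	}
	#endif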


Regards
Yin, Fengwei

>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Thanks,
>>> Ryan
>>>
>>>> pte_t entry;
>>>>
>>>> - flush_icache_page(vma, page);
>>>> + flush_icache_pages(vma, page, nr);
>>>> entry = mk_pte(page, vma->vm_page_prot);
>>>>
>>>> if (prefault && arch_wants_old_prefaulted_pte())
>>>> @@ -4296,14 +4297,18 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>> entry = pte_mkuffd_wp(entry);
>>>> /* copy-on-write page */
>>>> if (write && !(vma->vm_flags & VM_SHARED)) {
>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>> - page_add_new_anon_rmap(page, vma, addr);
>>>> - lru_cache_add_inactive_or_unevictable(page, vma);
>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>>>> + VM_BUG_ON_FOLIO(nr != 1, folio);
>>>> + folio_add_new_anon_rmap(folio, vma, addr);
>>>> + folio_add_lru_vma(folio, vma);
>>>> } else {
>>>> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>>>> - page_add_file_rmap(page, vma, false);
>>>> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>>> + folio_add_file_rmap_range(folio, page, nr, vma, false);
>>>> }
>>>> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>>>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>>> +
>>>> + /* no need to invalidate: a not-present page won't be cached */
>>>> + update_mmu_cache_range(vma, addr, vmf->pte, nr);
>>>> }
>>>>
>>>> static bool vmf_pte_changed(struct vm_fault *vmf)
>>>> @@ -4376,11 +4381,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>
>>>> /* Re-check under ptl */
>>>> if (likely(!vmf_pte_changed(vmf))) {
>>>> - do_set_pte(vmf, page, vmf->address);
>>>> -
>>>> - /* no need to invalidate: a not-present page won't be cached */
>>>> - update_mmu_cache(vma, vmf->address, vmf->pte);
>>>> + struct folio *folio = page_folio(page);
>>>>
>>>> + set_pte_range(vmf, folio, page, 1, vmf->address);
>>>> ret = 0;
>>>> } else {
>>>> update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>
>

2023-03-16 16:50:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 16/03/2023 16:41, Yin, Fengwei wrote:
>
>
> On 3/17/2023 12:38 AM, Ryan Roberts wrote:
>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/15/2023 11:26 PM, Ryan Roberts wrote:
>>>> On 15/03/2023 05:14, Matthew Wilcox (Oracle) wrote:
>>>>> From: Yin Fengwei <[email protected]>
>>>>>
>>>>> set_pte_range() allows to setup page table entries for a specific
>>>>> range. It takes advantage of batched rmap update for large folio.
>>>>> It now takes care of calling update_mmu_cache_range().
>>>>>
>>>>> Signed-off-by: Yin Fengwei <[email protected]>
>>>>> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
>>>>> ---
>>>>> Documentation/filesystems/locking.rst | 2 +-
>>>>> include/linux/mm.h | 3 ++-
>>>>> mm/filemap.c | 3 +--
>>>>> mm/memory.c | 27 +++++++++++++++------------
>>>>> 4 files changed, 19 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>>>>> index 7de7a7272a5e..922886fefb7f 100644
>>>>> --- a/Documentation/filesystems/locking.rst
>>>>> +++ b/Documentation/filesystems/locking.rst
>>>>> @@ -663,7 +663,7 @@ locked. The VM will unlock the page.
>>>>> Filesystem should find and map pages associated with offsets from "start_pgoff"
>>>>> till "end_pgoff". ->map_pages() is called with page table locked and must
>>>>> not block. If it's not possible to reach a page without blocking,
>>>>> -filesystem should skip it. Filesystem should use do_set_pte() to setup
>>>>> +filesystem should skip it. Filesystem should use set_pte_range() to setup
>>>>> page table entry. Pointer to entry associated with the page is passed in
>>>>> "pte" field in vm_fault structure. Pointers to entries for other offsets
>>>>> should be calculated relative to "pte".
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index ee755bb4e1c1..81788c985a8c 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -1299,7 +1299,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>>> }
>>>>>
>>>>> vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page);
>>>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr);
>>>>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>>> + struct page *page, unsigned int nr, unsigned long addr);
>>>>>
>>>>> vm_fault_t finish_fault(struct vm_fault *vmf);
>>>>> vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index 6e2b0778db45..e2317623dcbf 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -3504,8 +3504,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
>>>>> ret = VM_FAULT_NOPAGE;
>>>>>
>>>>> ref_count++;
>>>>> - do_set_pte(vmf, page, addr);
>>>>> - update_mmu_cache(vma, addr, vmf->pte);
>>>>> + set_pte_range(vmf, folio, page, 1, addr);
>>>>> } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
>>>>>
>>>>> /* Restore the vmf->pte */
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 6aa21e8f3753..9a654802f104 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -4274,7 +4274,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>>>>> }
>>>>> #endif
>>>>>
>>>>> -void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>>> +void set_pte_range(struct vm_fault *vmf, struct folio *folio,
>>>>> + struct page *page, unsigned int nr, unsigned long addr)
>>>>> {
>>>>> struct vm_area_struct *vma = vmf->vma;
>>>>> bool uffd_wp = vmf_orig_pte_uffd_wp(vmf);
>>>>> @@ -4282,7 +4283,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>>> bool prefault = vmf->address != addr;
>>>>
>>>> I think you are changing behavior here - is this intentional? Previously this
>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>
>>> Yes. You are right here.
>>>
>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>> can avoid to handle vmf->address == addr specially. It's OK to
>>> drop prefault and change the logic here a little bit to:
>>> if (arch_wants_old_prefaulted_pte())
>>> entry = pte_mkold(entry);
>>> else
>>> entry = pte_sw_mkyong(entry);
>>>
>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>> because HW will set the ACCESS bit in page table entry.
>>>
>>> Add Will Deacon in case I missed something here. Thanks.
>>
>> I'll defer to Will's response, but not all arm HW supports HW access flag
>> management. In that case it's done by SW, so I would imagine that by setting
>> this to old initially, we will get a second fault to set the access bit, which
>> will slow things down. I wonder if you will need to split this into (up to) 3
>> calls to set_ptes()?
> If no HW access flag, arch_wants_old_prefaulted_pte() will return false. So
> path will goto pte_sw_mkyong(entry). Right?

Oops... yes, I agree with you - disregard my previous comment.

>
>
> Regards
> Yin, Fengwei
>
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>> pte_t entry;
>>>>>
>>>>> - flush_icache_page(vma, page);
>>>>> + flush_icache_pages(vma, page, nr);
>>>>> entry = mk_pte(page, vma->vm_page_prot);
>>>>>
>>>>> if (prefault && arch_wants_old_prefaulted_pte())
>>>>> @@ -4296,14 +4297,18 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
>>>>> entry = pte_mkuffd_wp(entry);
>>>>> /* copy-on-write page */
>>>>> if (write && !(vma->vm_flags & VM_SHARED)) {
>>>>> - inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
>>>>> - page_add_new_anon_rmap(page, vma, addr);
>>>>> - lru_cache_add_inactive_or_unevictable(page, vma);
>>>>> + add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr);
>>>>> + VM_BUG_ON_FOLIO(nr != 1, folio);
>>>>> + folio_add_new_anon_rmap(folio, vma, addr);
>>>>> + folio_add_lru_vma(folio, vma);
>>>>> } else {
>>>>> - inc_mm_counter(vma->vm_mm, mm_counter_file(page));
>>>>> - page_add_file_rmap(page, vma, false);
>>>>> + add_mm_counter(vma->vm_mm, mm_counter_file(page), nr);
>>>>> + folio_add_file_rmap_range(folio, page, nr, vma, false);
>>>>> }
>>>>> - set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
>>>>> + set_ptes(vma->vm_mm, addr, vmf->pte, entry, nr);
>>>>> +
>>>>> + /* no need to invalidate: a not-present page won't be cached */
>>>>> + update_mmu_cache_range(vma, addr, vmf->pte, nr);
>>>>> }
>>>>>
>>>>> static bool vmf_pte_changed(struct vm_fault *vmf)
>>>>> @@ -4376,11 +4381,9 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>>
>>>>> /* Re-check under ptl */
>>>>> if (likely(!vmf_pte_changed(vmf))) {
>>>>> - do_set_pte(vmf, page, vmf->address);
>>>>> -
>>>>> - /* no need to invalidate: a not-present page won't be cached */
>>>>> - update_mmu_cache(vma, vmf->address, vmf->pte);
>>>>> + struct folio *folio = page_folio(page);
>>>>>
>>>>> + set_pte_range(vmf, folio, page, 1, vmf->address);
>>>>> ret = 0;
>>>>> } else {
>>>>> update_mmu_tlb(vma, vmf->address, vmf->pte);
>>>>
>>


2023-03-16 17:53:14

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
> On 16/03/2023 16:23, Yin, Fengwei wrote:
> >> I think you are changing behavior here - is this intentional? Previously this
> >> would be evaluated per page, now its evaluated once for the whole range. The
> >> intention below is that directly faulted pages are mapped young and prefaulted
> >> pages are mapped old. But now a whole range will be mapped the same.
> >
> > Yes. You are right here.
> >
> > Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
> > can avoid to handle vmf->address == addr specially. It's OK to
> > drop prefault and change the logic here a little bit to:
> > if (arch_wants_old_prefaulted_pte())
> > entry = pte_mkold(entry);
> > else
> > entry = pte_sw_mkyong(entry);
> >
> > It's not necessary to use pte_sw_mkyong for vmf->address == addr
> > because HW will set the ACCESS bit in page table entry.
> >
> > Add Will Deacon in case I missed something here. Thanks.
>
> I'll defer to Will's response, but not all arm HW supports HW access flag
> management. In that case it's done by SW, so I would imagine that by setting
> this to old initially, we will get a second fault to set the access bit, which
> will slow things down. I wonder if you will need to split this into (up to) 3
> calls to set_ptes()?

I don't think we should do that. The limited information I have from
various microarchitectures is that the PTEs must differ only in their
PFN bits in order to use larger TLB entries. That includes the Accessed
bit (or equivalent). So we should mkyoung all the PTEs in the same
folio, at least initially.

That said, we should still do this conditionally. We'll prefault some
other folios too. So I think this should be:

bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
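
For illustration only, a minimal sketch of how that per-range check might sit in
set_pte_range(); the helper name here is hypothetical, and it assumes nr counts pages,
so the exclusive upper bound of the range is addr + nr * PAGE_SIZE:

/*
 * Hypothetical helper, not from the patch: the fault is a prefault for
 * this range iff vmf->address lies outside [addr, addr + nr * PAGE_SIZE).
 */
static inline bool range_is_prefault(struct vm_fault *vmf,
				     unsigned long addr, unsigned int nr)
{
	return addr > vmf->address ||
	       addr + nr * PAGE_SIZE <= vmf->address;
}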


2023-03-17 01:58:43

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>> I think you are changing behavior here - is this intentional? Previously this
>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>
>>> Yes. You are right here.
>>>
>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>> can avoid to handle vmf->address == addr specially. It's OK to
>>> drop prefault and change the logic here a little bit to:
>>> if (arch_wants_old_prefaulted_pte())
>>> entry = pte_mkold(entry);
>>> else
>>> entry = pte_sw_mkyong(entry);
>>>
>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>> because HW will set the ACCESS bit in page table entry.
>>>
>>> Add Will Deacon in case I missed something here. Thanks.
>>
>> I'll defer to Will's response, but not all arm HW supports HW access flag
>> management. In that case it's done by SW, so I would imagine that by setting
>> this to old initially, we will get a second fault to set the access bit, which
>> will slow things down. I wonder if you will need to split this into (up to) 3
>> calls to set_ptes()?
>
> I don't think we should do that. The limited information I have from
> various microarchitectures is that the PTEs must differ only in their
> PFN bits in order to use larger TLB entries. That includes the Accessed
> bit (or equivalent). So we should mkyoung all the PTEs in the same
> folio, at least initially.
>
> That said, we should still do this conditionally. We'll prefault some
> other folios too. So I think this should be:
>
> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>
According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if the hardware access
flag is supported on ARM64, there is a benefit to setting prefault PTEs "old".
If we change prefault as above, the PTEs are set "young", which loses that benefit
on ARM64 with the hardware access flag.

OTOH, if going from "old" to "young" is cheap, why not leave all the PTEs of the folio
"old" and let the hardware update them to "young"?

Regards
Yin, Fengwei

2023-03-17 03:45:19

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>
>
> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
> > On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
> >> On 16/03/2023 16:23, Yin, Fengwei wrote:
> >>>> I think you are changing behavior here - is this intentional? Previously this
> >>>> would be evaluated per page, now its evaluated once for the whole range. The
> >>>> intention below is that directly faulted pages are mapped young and prefaulted
> >>>> pages are mapped old. But now a whole range will be mapped the same.
> >>>
> >>> Yes. You are right here.
> >>>
> >>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
> >>> can avoid to handle vmf->address == addr specially. It's OK to
> >>> drop prefault and change the logic here a little bit to:
> >>> if (arch_wants_old_prefaulted_pte())
> >>> entry = pte_mkold(entry);
> >>> else
> >>> entry = pte_sw_mkyong(entry);
> >>>
> >>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
> >>> because HW will set the ACCESS bit in page table entry.
> >>>
> >>> Add Will Deacon in case I missed something here. Thanks.
> >>
> >> I'll defer to Will's response, but not all arm HW supports HW access flag
> >> management. In that case it's done by SW, so I would imagine that by setting
> >> this to old initially, we will get a second fault to set the access bit, which
> >> will slow things down. I wonder if you will need to split this into (up to) 3
> >> calls to set_ptes()?
> >
> > I don't think we should do that. The limited information I have from
> > various microarchitectures is that the PTEs must differ only in their
> > PFN bits in order to use larger TLB entries. That includes the Accessed
> > bit (or equivalent). So we should mkyoung all the PTEs in the same
> > folio, at least initially.
> >
> > That said, we should still do this conditionally. We'll prefault some
> > other folios too. So I think this should be:
> >
> > bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
> >
> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
> If we change prefault like above, the PTEs is set as "yong" which loose benefit
> on ARM64 with hardware access flag.
>
> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
> and let hardware to update it to "yong"?

Because we're tracking the entire folio as a single entity. So we're
better off avoiding the extra pagefaults to update the accessed bit,
which won't actually give us any information (vmscan needs to know "were
any of the accessed bits set", not "how many of them were set").

Anyway, hopefully Ryan can test this and let us know if it fixes the
regression he sees.

2023-03-17 06:34:10

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>
>>>>> Yes. You are right here.
>>>>>
>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>> can avoid to handle vmf->address == addr specially. It's OK to
>>>>> drop prefault and change the logic here a little bit to:
>>>>> if (arch_wants_old_prefaulted_pte())
>>>>> entry = pte_mkold(entry);
>>>>> else
>>>>> entry = pte_sw_mkyong(entry);
>>>>>
>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>>>> because HW will set the ACCESS bit in page table entry.
>>>>>
>>>>> Add Will Deacon in case I missed something here. Thanks.
>>>>
>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>> this to old initially, we will get a second fault to set the access bit, which
>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>> calls to set_ptes()?
>>>
>>> I don't think we should do that. The limited information I have from
>>> various microarchitectures is that the PTEs must differ only in their
>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>> folio, at least initially.
>>>
>>> That said, we should still do this conditionally. We'll prefault some
>>> other folios too. So I think this should be:
>>>
>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>>>
>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
>> If we change prefault like above, the PTEs is set as "yong" which loose benefit
>> on ARM64 with hardware access flag.
>>
>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
>> and let hardware to update it to "yong"?
>
> Because we're tracking the entire folio as a single entity. So we're
> better off avoiding the extra pagefaults to update the accessed bit,
> which won't actually give us any information (vmscan needs to know "were
> any of the accessed bits set", not "how many of them were set").
There need not be extra page faults to update the accessed bit. There are three cases here:
1. hardware supports the access flag and going from "old" to "young" is cheap (no extra fault)
2. hardware supports the access flag but going from "old" to "young" is expensive (still no extra fault)
3. no hardware support for the access flag (extra page faults to go from "old" to "young"; expensive)

For #2 and #3, going from "old" to "young" is expensive, so we always set the PTEs "young"
at page fault time.
For #1, going from "old" to "young" is cheap, so it's OK to set the PTEs "old" at page fault
time and let the hardware set them "young" on access. Actually, ARM64 with the hardware
access bit requires the PTEs to be set "old".
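
For illustration only, this is how the three cases above line up with the existing hook
around the mk_pte() code quoted earlier; the branch comments are a sketch, with the arm64
detail paraphrased from commit 46bdb4277f98, not the final code:

	if (prefault && arch_wants_old_prefaulted_pte())
		entry = pte_mkold(entry);	/* case #1: arm64 ties this hook to the
						 * hardware access flag, where old->young
						 * is cheap, so prefaulted PTEs start old */
	else
		entry = pte_sw_mkyoung(entry);	/* cases #2 and #3: the hook returns false,
						 * so everything is mapped young up front */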

>
> Anyway, hopefully Ryan can test this and let us know if it fixes the
> regression he sees.
I highly suspect the regression Ryan saw is not related to this but to some other
stupid work of mine. I will send out the testing patch soon. Thanks.


Regards
Yin, Fengwei

2023-03-17 08:00:59

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 17/03/2023 06:33, Yin, Fengwei wrote:
>
>
> On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
>> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>>
>>>>>> Yes. You are right here.
>>>>>>
>>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>>> can avoid to handle vmf->address == addr specially. It's OK to
>>>>>> drop prefault and change the logic here a little bit to:
>>>>>> if (arch_wants_old_prefaulted_pte())
>>>>>> entry = pte_mkold(entry);
>>>>>> else
>>>>>> entry = pte_sw_mkyong(entry);
>>>>>>
>>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>>>>> because HW will set the ACCESS bit in page table entry.
>>>>>>
>>>>>> Add Will Deacon in case I missed something here. Thanks.
>>>>>
>>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>>> this to old initially, we will get a second fault to set the access bit, which
>>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>>> calls to set_ptes()?
>>>>
>>>> I don't think we should do that. The limited information I have from
>>>> various microarchitectures is that the PTEs must differ only in their
>>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>>> folio, at least initially.
>>>>
>>>> That said, we should still do this conditionally. We'll prefault some
>>>> other folios too. So I think this should be:
>>>>
>>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>>>>
>>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
>>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
>>> If we change prefault like above, the PTEs is set as "yong" which loose benefit
>>> on ARM64 with hardware access flag.
>>>
>>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
>>> and let hardware to update it to "yong"?
>>
>> Because we're tracking the entire folio as a single entity. So we're
>> better off avoiding the extra pagefaults to update the accessed bit,
>> which won't actually give us any information (vmscan needs to know "were
>> any of the accessed bits set", not "how many of them were set").
> There is no extra pagefaults to update the accessed bit. There are three cases here:
> 1. hardware support access flag and cheap from "old" to "yong" without extra fault
> 2. hardware support access flag and expensive from "old" to "yong" without extra fault
> 3. no hardware support access flag (extra pagefaults from "old" to "yong". Expensive)
>
> For #2 and #3, it's expensive from "old" to "yong", so we always set PTEs "yong" in
> page fault.
> For #1, It's cheap from "old" to "yong", so it's OK to set PTEs "old" in page fault.
> And hardware will set it to "yong" when access memory. Actually, ARM64 with hardware
> access bit requires to set PTEs "old".

Your logic makes sense, but it doesn't take into account the HPA
micro-architectural feature present in some ARM CPUs. HPA can transparently
coalesce multiple pages into a single TLB entry when certain conditions are met
(roughly: up to 4 pages physically and virtually contiguous and all within a
4-page natural alignment). But as Matthew says, this works out better when all
pte attributes (including access and dirty) match. Given the reason for setting
the prefault pages to old is so that vmscan can do a better job of finding cold
pages, and given vmscan will now be looking for folios and not individual pages
(I assume?), I agree with Matthew that we should make whole folios young or old.
It will marginally increase our chances of the access and dirty bits being
consistent across the whole 4-page block that the HW tries to coalesce. If we
unconditionally make everything old, the hw will set accessed for the single
page that faulted, and we therefore don't have consistency for that 4-page block.

>
>>
>> Anyway, hopefully Ryan can test this and let us know if it fixes the
>> regression he sees.
> I highly suspect the regression Ryan saw is not related with this but another my
> stupid work. I will send out the testing patch soon. Thanks.

I tested a version of this where I made everything unconditionally young,
thinking it might be the source of the perf regression, before I reported it. It
doesn't make any difference. So I agree the regression is somewhere else.

Thanks,
Ryan

>
>
> Regards
> Yin, Fengwei


2023-03-17 08:20:53

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/17/2023 4:00 PM, Ryan Roberts wrote:
> On 17/03/2023 06:33, Yin, Fengwei wrote:
>>
>>
>> On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
>>> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>>>
>>>>>>> Yes. You are right here.
>>>>>>>
>>>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>>>> can avoid to handle vmf->address == addr specially. It's OK to
>>>>>>> drop prefault and change the logic here a little bit to:
>>>>>>> if (arch_wants_old_prefaulted_pte())
>>>>>>> entry = pte_mkold(entry);
>>>>>>> else
>>>>>>> entry = pte_sw_mkyong(entry);
>>>>>>>
>>>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>>>>>> because HW will set the ACCESS bit in page table entry.
>>>>>>>
>>>>>>> Add Will Deacon in case I missed something here. Thanks.
>>>>>>
>>>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>>>> this to old initially, we will get a second fault to set the access bit, which
>>>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>>>> calls to set_ptes()?
>>>>>
>>>>> I don't think we should do that. The limited information I have from
>>>>> various microarchitectures is that the PTEs must differ only in their
>>>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>>>> folio, at least initially.
>>>>>
>>>>> That said, we should still do this conditionally. We'll prefault some
>>>>> other folios too. So I think this should be:
>>>>>
>>>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>>>>>
>>>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
>>>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
>>>> If we change prefault like above, the PTEs is set as "yong" which loose benefit
>>>> on ARM64 with hardware access flag.
>>>>
>>>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
>>>> and let hardware to update it to "yong"?
>>>
>>> Because we're tracking the entire folio as a single entity. So we're
>>> better off avoiding the extra pagefaults to update the accessed bit,
>>> which won't actually give us any information (vmscan needs to know "were
>>> any of the accessed bits set", not "how many of them were set").
>> There is no extra pagefaults to update the accessed bit. There are three cases here:
>> 1. hardware support access flag and cheap from "old" to "yong" without extra fault
>> 2. hardware support access flag and expensive from "old" to "yong" without extra fault
>> 3. no hardware support access flag (extra pagefaults from "old" to "yong". Expensive)
>>
>> For #2 and #3, it's expensive from "old" to "yong", so we always set PTEs "yong" in
>> page fault.
>> For #1, It's cheap from "old" to "yong", so it's OK to set PTEs "old" in page fault.
>> And hardware will set it to "yong" when access memory. Actually, ARM64 with hardware
>> access bit requires to set PTEs "old".
>
> Your logic makes sense, but it doesn't take into account the HPA
> micro-architectural feature present in some ARM CPUs. HPA can transparently
> coalesce multiple pages into a single TLB entry when certain conditions are met
> (roughly; upto 4 pages physically and virtually contiguous and all within a
> 4-page natural alignment). But as Matthew says, this works out better when all
> pte attributes (including access and dirty) match. Given the reason for setting
> the prefault pages to old is so that vmscan can do a better job of finding cold
> pages, and given vmscan will now be looking for folios and not individual pages
> (I assume?), I agree with Matthew that we should make whole folios young or old.
> It will marginally increase our chances of the access and dirty bits being
> consistent across the whole 4-page block that the HW tries to coalesce. If we
> unconditionally make everything old, the hw will set accessed for the single
> page that faulted, and we therefore don't have consistency for that 4-page block.
My concern was that the benefit of "old" PTEs for ARM64 with the hardware access bit
will be lost. The workloads (application launch latency and direct reclaim, according
to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80) could show a regression with this
series. Thanks.

BTW, with the TLB coalescing feature, shouldn't the hardware update the access bits of
the coalesced pages together? Otherwise it's unavoidable that, in the end, only one
page's access bit gets set by the hardware.

Regards
Yin, Fengwei

>
>>
>>>
>>> Anyway, hopefully Ryan can test this and let us know if it fixes the
>>> regression he sees.
>> I highly suspect the regression Ryan saw is not related with this but another my
>> stupid work. I will send out the testing patch soon. Thanks.
>
> I tested a version of this where I made everything unconditionally young,
> thinking it might be the source of the perf regression, before I reported it. It
> doesn't make any difference. So I agree the regression is somewhere else.
>
> Thanks,
> Ryan
>
>>
>>
>> Regards
>> Yin, Fengwei
>

2023-03-17 13:00:53

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 17/03/2023 08:19, Yin, Fengwei wrote:
>
>
> On 3/17/2023 4:00 PM, Ryan Roberts wrote:
>> On 17/03/2023 06:33, Yin, Fengwei wrote:
>>>
>>>
>>> On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
>>>> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>>>>
>>>>>
>>>>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>>>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>>>>
>>>>>>>> Yes. You are right here.
>>>>>>>>
>>>>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>>>>> can avoid to handle vmf->address == addr specially. It's OK to
>>>>>>>> drop prefault and change the logic here a little bit to:
>>>>>>>> if (arch_wants_old_prefaulted_pte())
>>>>>>>> entry = pte_mkold(entry);
>>>>>>>> else
>>>>>>>> entry = pte_sw_mkyong(entry);
>>>>>>>>
>>>>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>>>>>>> because HW will set the ACCESS bit in page table entry.
>>>>>>>>
>>>>>>>> Add Will Deacon in case I missed something here. Thanks.
>>>>>>>
>>>>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>>>>> this to old initially, we will get a second fault to set the access bit, which
>>>>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>>>>> calls to set_ptes()?
>>>>>>
>>>>>> I don't think we should do that. The limited information I have from
>>>>>> various microarchitectures is that the PTEs must differ only in their
>>>>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>>>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>>>>> folio, at least initially.
>>>>>>
>>>>>> That said, we should still do this conditionally. We'll prefault some
>>>>>> other folios too. So I think this should be:
>>>>>>
>>>>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>>>>>>
>>>>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
>>>>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
>>>>> If we change prefault like above, the PTEs is set as "yong" which loose benefit
>>>>> on ARM64 with hardware access flag.
>>>>>
>>>>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
>>>>> and let hardware to update it to "yong"?
>>>>
>>>> Because we're tracking the entire folio as a single entity. So we're
>>>> better off avoiding the extra pagefaults to update the accessed bit,
>>>> which won't actually give us any information (vmscan needs to know "were
>>>> any of the accessed bits set", not "how many of them were set").
>>> There is no extra pagefaults to update the accessed bit. There are three cases here:
>>> 1. hardware support access flag and cheap from "old" to "yong" without extra fault
>>> 2. hardware support access flag and expensive from "old" to "yong" without extra fault
>>> 3. no hardware support access flag (extra pagefaults from "old" to "yong". Expensive)
>>>
>>> For #2 and #3, it's expensive from "old" to "yong", so we always set PTEs "yong" in
>>> page fault.
>>> For #1, It's cheap from "old" to "yong", so it's OK to set PTEs "old" in page fault.
>>> And hardware will set it to "yong" when access memory. Actually, ARM64 with hardware
>>> access bit requires to set PTEs "old".
>>
>> Your logic makes sense, but it doesn't take into account the HPA
>> micro-architectural feature present in some ARM CPUs. HPA can transparently
>> coalesce multiple pages into a single TLB entry when certain conditions are met
>> (roughly; upto 4 pages physically and virtually contiguous and all within a
>> 4-page natural alignment). But as Matthew says, this works out better when all
>> pte attributes (including access and dirty) match. Given the reason for setting
>> the prefault pages to old is so that vmscan can do a better job of finding cold
>> pages, and given vmscan will now be looking for folios and not individual pages
>> (I assume?), I agree with Matthew that we should make whole folios young or old.
>> It will marginally increase our chances of the access and dirty bits being
>> consistent across the whole 4-page block that the HW tries to coalesce. If we
>> unconditionally make everything old, the hw will set accessed for the single
>> page that faulted, and we therefore don't have consistency for that 4-page block.
> My concern was that the benefit of "old" PTEs for ARM64 with hardware access bit
> will be lost. The workloads (application launch latency and direct reclaim according
> to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80) can show regression with this
> series. Thanks.

My (potentially incorrect) understanding is that the reason for marking the
prefaulted ptes as old was that it made it easier/quicker for vmscan to
identify those prefaulted pages and reclaim them under memory pressure. I
_assume_ that now we have large folios, vmscan will be trying to pick
folios for reclaim, not individual subpages within the folio? In which case,
vmscan will only consider the folio as old if _all_ pages within it are old. So
marking all the pages of a folio young vs marking 1 page in the folio young
won't make a difference from this perspective. But it will make a difference
from the perspective of HPA. (Please Matthew or somebody else, correct me if my
understanding is incorrect!)

>
> BTW, with TLB merge feature, should hardware update coalesce multiple pages access
> bit together? otherwise, it's avoidable that only one page access is set by hardware
> finally.

No, the HW will only update the access flag for the single page that is
accessed. So yes, in the long run the value of the flags across the 4-page block
will diverge - that's why I said "marginal" above.

>
> Regards
> Yin, Fengwei
>
>>
>>>
>>>>
>>>> Anyway, hopefully Ryan can test this and let us know if it fixes the
>>>> regression he sees.
>>> I highly suspect the regression Ryan saw is not related with this but another my
>>> stupid work. I will send out the testing patch soon. Thanks.
>>
>> I tested a version of this where I made everything unconditionally young,
>> thinking it might be the source of the perf regression, before I reported it. It
>> doesn't make any difference. So I agree the regression is somewhere else.
>>
>> Thanks,
>> Ryan
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>


2023-03-17 13:45:33

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/17/2023 9:00 PM, Ryan Roberts wrote:
> On 17/03/2023 08:19, Yin, Fengwei wrote:
>>
>>
>> On 3/17/2023 4:00 PM, Ryan Roberts wrote:
>>> On 17/03/2023 06:33, Yin, Fengwei wrote:
>>>>
>>>>
>>>> On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
>>>>> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>>>>>
>>>>>>
>>>>>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>>>>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>>>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>>>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>>>>>
>>>>>>>>> Yes. You are right here.
>>>>>>>>>
>>>>>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>>>>>> can avoid to handle vmf->address == addr specially. It's OK to
>>>>>>>>> drop prefault and change the logic here a little bit to:
>>>>>>>>> if (arch_wants_old_prefaulted_pte())
>>>>>>>>> entry = pte_mkold(entry);
>>>>>>>>> else
>>>>>>>>> entry = pte_sw_mkyong(entry);
>>>>>>>>>
>>>>>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>>>>>>>> because HW will set the ACCESS bit in page table entry.
>>>>>>>>>
>>>>>>>>> Add Will Deacon in case I missed something here. Thanks.
>>>>>>>>
>>>>>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>>>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>>>>>> this to old initially, we will get a second fault to set the access bit, which
>>>>>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>>>>>> calls to set_ptes()?
>>>>>>>
>>>>>>> I don't think we should do that. The limited information I have from
>>>>>>> various microarchitectures is that the PTEs must differ only in their
>>>>>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>>>>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>>>>>> folio, at least initially.
>>>>>>>
>>>>>>> That said, we should still do this conditionally. We'll prefault some
>>>>>>> other folios too. So I think this should be:
>>>>>>>
>>>>>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>>>>>>>
>>>>>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
>>>>>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
>>>>>> If we change prefault like above, the PTEs is set as "yong" which loose benefit
>>>>>> on ARM64 with hardware access flag.
>>>>>>
>>>>>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
>>>>>> and let hardware to update it to "yong"?
>>>>>
>>>>> Because we're tracking the entire folio as a single entity. So we're
>>>>> better off avoiding the extra pagefaults to update the accessed bit,
>>>>> which won't actually give us any information (vmscan needs to know "were
>>>>> any of the accessed bits set", not "how many of them were set").
>>>> There is no extra pagefaults to update the accessed bit. There are three cases here:
>>>> 1. hardware support access flag and cheap from "old" to "yong" without extra fault
>>>> 2. hardware support access flag and expensive from "old" to "yong" without extra fault
>>>> 3. no hardware support access flag (extra pagefaults from "old" to "yong". Expensive)
>>>>
>>>> For #2 and #3, it's expensive from "old" to "yong", so we always set PTEs "yong" in
>>>> page fault.
>>>> For #1, It's cheap from "old" to "yong", so it's OK to set PTEs "old" in page fault.
>>>> And hardware will set it to "yong" when access memory. Actually, ARM64 with hardware
>>>> access bit requires to set PTEs "old".
>>>
>>> Your logic makes sense, but it doesn't take into account the HPA
>>> micro-architectural feature present in some ARM CPUs. HPA can transparently
>>> coalesce multiple pages into a single TLB entry when certain conditions are met
>>> (roughly; upto 4 pages physically and virtually contiguous and all within a
>>> 4-page natural alignment). But as Matthew says, this works out better when all
>>> pte attributes (including access and dirty) match. Given the reason for setting
>>> the prefault pages to old is so that vmscan can do a better job of finding cold
>>> pages, and given vmscan will now be looking for folios and not individual pages
>>> (I assume?), I agree with Matthew that we should make whole folios young or old.
>>> It will marginally increase our chances of the access and dirty bits being
>>> consistent across the whole 4-page block that the HW tries to coalesce. If we
>>> unconditionally make everything old, the hw will set accessed for the single
>>> page that faulted, and we therefore don't have consistency for that 4-page block.
>> My concern was that the benefit of "old" PTEs for ARM64 with hardware access bit
>> will be lost. The workloads (application launch latency and direct reclaim according
>> to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80) can show regression with this
>> series. Thanks.
>
> My (potentially incorrect) understanding of the reason that marking the
> prefaulted ptes as old was because it made it easier/quicker for vmscan to
> identify those prefaulted pages and reclaim them under memory pressure. I
> _assume_ now that we have large folios, that vmscan will be trying to pick
> folios for reclaim, not individual subpages within the folio? In which case,
> vmscan will only consider the folio as old if _all_ pages within are old. So
> marking all the pages of a folio young vs marking 1 page in the folio young
> won't make a difference from this perspective. But it will make a difference
> from the perspective a HPA. (Please Matthew or somebody else, correct me if my
> understanding is incorrect!)
Thanks a lot for your patient explanation. I get the point now. On the first
access, we mark all the PTEs of the folio "young", so later accesses can use the large TLB entry.


Regards
Yin, Fengwei

>
>>
>> BTW, with TLB merge feature, should hardware update coalesce multiple pages access
>> bit together? otherwise, it's avoidable that only one page access is set by hardware
>> finally.
>
> No, the HW will only update the access flag for the single page that is
> accessed. So yes, in the long run the value of the flags across the 4-page block
> will diverge - that's why I said "marginal" above.
>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>>>
>>>>> Anyway, hopefully Ryan can test this and let us know if it fixes the
>>>>> regression he sees.
>>>> I highly suspect the regression Ryan saw is not related with this but another my
>>>> stupid work. I will send out the testing patch soon. Thanks.
>>>
>>> I tested a version of this where I made everything unconditionally young,
>>> thinking it might be the source of the perf regression, before I reported it. It
>>> doesn't make any difference. So I agree the regression is somewhere else.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>
>

2023-03-20 13:39:32

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

Hi Matthew,

On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
>>
>>
>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
>>>>>> I think you are changing behavior here - is this intentional? Previously this
>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
>>>>>> pages are mapped old. But now a whole range will be mapped the same.
>>>>>
>>>>> Yes. You are right here.
>>>>>
>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
>>>>> can avoid to handle vmf->address == addr specially. It's OK to
>>>>> drop prefault and change the logic here a little bit to:
>>>>> if (arch_wants_old_prefaulted_pte())
>>>>> entry = pte_mkold(entry);
>>>>> else
>>>>> entry = pte_sw_mkyong(entry);
>>>>>
>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
>>>>> because HW will set the ACCESS bit in page table entry.
>>>>>
>>>>> Add Will Deacon in case I missed something here. Thanks.
>>>>
>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
>>>> management. In that case it's done by SW, so I would imagine that by setting
>>>> this to old initially, we will get a second fault to set the access bit, which
>>>> will slow things down. I wonder if you will need to split this into (up to) 3
>>>> calls to set_ptes()?
>>>
>>> I don't think we should do that. The limited information I have from
>>> various microarchitectures is that the PTEs must differ only in their
>>> PFN bits in order to use larger TLB entries. That includes the Accessed
>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
>>> folio, at least initially.
>>>
>>> That said, we should still do this conditionally. We'll prefault some
>>> other folios too. So I think this should be:
>>>
>>> bool prefault = (addr > vmf->address) || ((addr + nr) < vmf->address);
>>>
>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if hardware access
>> flag is supported on ARM64, there is benefit if prefault PTEs is set as "old".
>> If we change prefault like above, the PTEs is set as "yong" which loose benefit
>> on ARM64 with hardware access flag.
>>
>> ITOH, if from "old" to "yong" is cheap, why not leave all PTEs of folio as "old"
>> and let hardware to update it to "yong"?
>
> Because we're tracking the entire folio as a single entity. So we're
> better off avoiding the extra pagefaults to update the accessed bit,
> which won't actually give us any information (vmscan needs to know "were
> any of the accessed bits set", not "how many of them were set").
>
> Anyway, hopefully Ryan can test this and let us know if it fixes the
> regression he sees.

Thanks a lot to Ryan for helping to test the debug patch I made.

Ryan confirmed that the following change could fix the kernel build regression:
diff --git a/mm/filemap.c b/mm/filemap.c
index db86e459dde6..343d6ff36b2c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3557,7 +3557,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,

ret |= filemap_map_folio_range(vmf, folio,
xas.xa_index - folio->index, addr, nr_pages);
- xas.xa_index += nr_pages;
+ xas.xa_index += folio_test_large(folio) ? nr_pages : 0;

folio_unlock(folio);
folio_put(folio);

I will make the upstream-able change "xas.xa_index += nr_pages - 1;" instead.
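
A rough reading of why the -1 is the right adjustment, assuming next_map_page() ends up
in xas_next_entry(), which advances the index by one itself before searching:

-	xas.xa_index += nr_pages;	/* already points at the first unmapped index,
					 * so the extra advance inside xas_next_entry()
					 * skips it; with order-0 folios (nr_pages == 1)
					 * every other page gets missed */
+	xas.xa_index += nr_pages - 1;	/* park on the last page just mapped, so
					 * xas_next_entry() steps exactly onto the
					 * first unmapped index */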

Ryan and I also identified some other changes that are needed. I am not sure how to
integrate those changes into this series. Maybe as an add-on patch after this
series? Thanks.

Regards
Yin, Fengwei

2023-03-20 14:08:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On Mon, Mar 20, 2023 at 09:38:55PM +0800, Yin, Fengwei wrote:
> Thanks a lot to Ryan for helping to test the debug patch I made.
>
> Ryan confirmed that the following change could fix the kernel build regression:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index db86e459dde6..343d6ff36b2c 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3557,7 +3557,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>
> ret |= filemap_map_folio_range(vmf, folio,
> xas.xa_index - folio->index, addr, nr_pages);
> - xas.xa_index += nr_pages;
> + xas.xa_index += folio_test_large(folio) ? nr_pages : 0;
>
> folio_unlock(folio);
> folio_put(folio);
>
> I will make upstream-able change as "xas.xa_index += nr_pages - 1;"

Thanks to both of you!

Really, we shouldn't need to interfere with xas.xa_index at all.
Does this work?

diff --git a/mm/filemap.c b/mm/filemap.c
index 8e4f95c5b65a..e40c967dde5f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3420,10 +3420,10 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
return false;
}

-static struct folio *next_uptodate_page(struct folio *folio,
- struct address_space *mapping,
- struct xa_state *xas, pgoff_t end_pgoff)
+static struct folio *next_uptodate_folio(struct xa_state *xas,
+ struct address_space *mapping, pgoff_t end_pgoff)
{
+ struct folio *folio = xas_next_entry(xas, end_pgoff);
unsigned long max_idx;

do {
@@ -3461,22 +3461,6 @@ static struct folio *next_uptodate_page(struct folio *folio,
return NULL;
}

-static inline struct folio *first_map_page(struct address_space *mapping,
- struct xa_state *xas,
- pgoff_t end_pgoff)
-{
- return next_uptodate_page(xas_find(xas, end_pgoff),
- mapping, xas, end_pgoff);
-}
-
-static inline struct folio *next_map_page(struct address_space *mapping,
- struct xa_state *xas,
- pgoff_t end_pgoff)
-{
- return next_uptodate_page(xas_next_entry(xas, end_pgoff),
- mapping, xas, end_pgoff);
-}
-
/*
* Map page range [start_page, start_page + nr_pages) of folio.
* start_page is gotten from start by folio_page(folio, start)
@@ -3552,7 +3536,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
int nr_pages = 0;

rcu_read_lock();
- folio = first_map_page(mapping, &xas, end_pgoff);
+ folio = next_uptodate_folio(&xas, mapping, end_pgoff);
if (!folio)
goto out;

@@ -3574,11 +3558,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,

ret |= filemap_map_folio_range(vmf, folio,
xas.xa_index - folio->index, addr, nr_pages);
- xas.xa_index += nr_pages;

folio_unlock(folio);
folio_put(folio);
- } while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
+ folio = next_uptodate_folio(&xas, mapping, end_pgoff);
+ } while (folio);
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
rcu_read_unlock();

> Ryan and I also identify some other changes needed. I am not sure how to
> integrate those changes to this series. Maybe an add-on patch after this
> series? Thanks.

Up to you; I'm happy to integrate fixup patches into the current series
or add on new ones.

2023-03-21 01:58:55

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()



On 3/20/2023 10:08 PM, Matthew Wilcox wrote:
> On Mon, Mar 20, 2023 at 09:38:55PM +0800, Yin, Fengwei wrote:
>> Thanks a lot to Ryan for helping to test the debug patch I made.
>>
>> Ryan confirmed that the following change could fix the kernel build regression:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index db86e459dde6..343d6ff36b2c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3557,7 +3557,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>
>> ret |= filemap_map_folio_range(vmf, folio,
>> xas.xa_index - folio->index, addr, nr_pages);
>> - xas.xa_index += nr_pages;
>> + xas.xa_index += folio_test_large(folio) ? nr_pages : 0;
>>
>> folio_unlock(folio);
>> folio_put(folio);
>>
>> I will make upstream-able change as "xas.xa_index += nr_pages - 1;"
>
> Thanks to both of you!
>
> Really, we shouldn't need to interfere with xas.xa_index at all.
> Does this work?
I will give this a try and let you know the result.

>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8e4f95c5b65a..e40c967dde5f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3420,10 +3420,10 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
> return false;
> }
>
> -static struct folio *next_uptodate_page(struct folio *folio,
> - struct address_space *mapping,
> - struct xa_state *xas, pgoff_t end_pgoff)
> +static struct folio *next_uptodate_folio(struct xa_state *xas,
> + struct address_space *mapping, pgoff_t end_pgoff)
> {
> + struct folio *folio = xas_next_entry(xas, end_pgoff);
> unsigned long max_idx;
>
> do {
> @@ -3461,22 +3461,6 @@ static struct folio *next_uptodate_page(struct folio *folio,
> return NULL;
> }
>
> -static inline struct folio *first_map_page(struct address_space *mapping,
> - struct xa_state *xas,
> - pgoff_t end_pgoff)
> -{
> - return next_uptodate_page(xas_find(xas, end_pgoff),
> - mapping, xas, end_pgoff);
> -}
> -
> -static inline struct folio *next_map_page(struct address_space *mapping,
> - struct xa_state *xas,
> - pgoff_t end_pgoff)
> -{
> - return next_uptodate_page(xas_next_entry(xas, end_pgoff),
> - mapping, xas, end_pgoff);
> -}
> -
> /*
> * Map page range [start_page, start_page + nr_pages) of folio.
> * start_page is gotten from start by folio_page(folio, start)
> @@ -3552,7 +3536,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> int nr_pages = 0;
>
> rcu_read_lock();
> - folio = first_map_page(mapping, &xas, end_pgoff);
> + folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> if (!folio)
> goto out;
>
> @@ -3574,11 +3558,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>
> ret |= filemap_map_folio_range(vmf, folio,
> xas.xa_index - folio->index, addr, nr_pages);
> - xas.xa_index += nr_pages;
>
> folio_unlock(folio);
> folio_put(folio);
> - } while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
> + folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> + } while (folio);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> rcu_read_unlock();
>
>> Ryan and I also identify some other changes needed. I am not sure how to
>> integrate those changes to this series. Maybe an add-on patch after this
>> series? Thanks.
>
> Up to you; I'm happy to integrate fixup patches into the current series
> or add on new ones.
Integrating into the current series would be better, as it doesn't impact bisect
operations. I will share the changes Ryan and I have once I have verified the
change you proposed above. Thanks.


Regards
Yin, Fengwei


2023-03-21 05:17:45

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 3/20/23 22:08, Matthew Wilcox wrote:
> On Mon, Mar 20, 2023 at 09:38:55PM +0800, Yin, Fengwei wrote:
>> Thanks a lot to Ryan for helping to test the debug patch I made.
>>
>> Ryan confirmed that the following change could fix the kernel build regression:
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index db86e459dde6..343d6ff36b2c 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -3557,7 +3557,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>>
>> ret |= filemap_map_folio_range(vmf, folio,
>> xas.xa_index - folio->index, addr, nr_pages);
>> - xas.xa_index += nr_pages;
>> + xas.xa_index += folio_test_large(folio) ? nr_pages : 0;
>>
>> folio_unlock(folio);
>> folio_put(folio);
>>
>> I will make upstream-able change as "xas.xa_index += nr_pages - 1;"
>
> Thanks to both of you!
>
> Really, we shouldn't need to interfere with xas.xa_index at all.
> Does this work?
Yes. This works perfectly on my side. Thanks.

Regards
Yin, Fengwei

>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 8e4f95c5b65a..e40c967dde5f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3420,10 +3420,10 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
> return false;
> }
>
> -static struct folio *next_uptodate_page(struct folio *folio,
> - struct address_space *mapping,
> - struct xa_state *xas, pgoff_t end_pgoff)
> +static struct folio *next_uptodate_folio(struct xa_state *xas,
> + struct address_space *mapping, pgoff_t end_pgoff)
> {
> + struct folio *folio = xas_next_entry(xas, end_pgoff);
> unsigned long max_idx;
>
> do {
> @@ -3461,22 +3461,6 @@ static struct folio *next_uptodate_page(struct folio *folio,
> return NULL;
> }
>
> -static inline struct folio *first_map_page(struct address_space *mapping,
> - struct xa_state *xas,
> - pgoff_t end_pgoff)
> -{
> - return next_uptodate_page(xas_find(xas, end_pgoff),
> - mapping, xas, end_pgoff);
> -}
> -
> -static inline struct folio *next_map_page(struct address_space *mapping,
> - struct xa_state *xas,
> - pgoff_t end_pgoff)
> -{
> - return next_uptodate_page(xas_next_entry(xas, end_pgoff),
> - mapping, xas, end_pgoff);
> -}
> -
> /*
> * Map page range [start_page, start_page + nr_pages) of folio.
> * start_page is gotten from start by folio_page(folio, start)
> @@ -3552,7 +3536,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> int nr_pages = 0;
>
> rcu_read_lock();
> - folio = first_map_page(mapping, &xas, end_pgoff);
> + folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> if (!folio)
> goto out;
>
> @@ -3574,11 +3558,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>
> ret |= filemap_map_folio_range(vmf, folio,
> xas.xa_index - folio->index, addr, nr_pages);
> - xas.xa_index += nr_pages;
>
> folio_unlock(folio);
> folio_put(folio);
> - } while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
> + folio = next_uptodate_folio(&xas, mapping, end_pgoff);
> + } while (folio);
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> out:
> rcu_read_unlock();
>
>> Ryan and I also identify some other changes needed. I am not sure how to
>> integrate those changes to this series. Maybe an add-on patch after this
>> series? Thanks.
>
> Up to you; I'm happy to integrate fixup patches into the current series
> or add on new ones.


2023-03-24 15:06:32

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On Fri, Mar 17, 2023 at 04:19:44PM +0800, Yin, Fengwei wrote:
>
>
> On 3/17/2023 4:00 PM, Ryan Roberts wrote:
> > On 17/03/2023 06:33, Yin, Fengwei wrote:
> >>
> >>
> >> On 3/17/2023 11:44 AM, Matthew Wilcox wrote:
> >>> On Fri, Mar 17, 2023 at 09:58:17AM +0800, Yin, Fengwei wrote:
> >>>>
> >>>>
> >>>> On 3/17/2023 1:52 AM, Matthew Wilcox wrote:
> >>>>> On Thu, Mar 16, 2023 at 04:38:58PM +0000, Ryan Roberts wrote:
> >>>>>> On 16/03/2023 16:23, Yin, Fengwei wrote:
> >>>>>>>> I think you are changing behavior here - is this intentional? Previously this
> >>>>>>>> would be evaluated per page, now its evaluated once for the whole range. The
> >>>>>>>> intention below is that directly faulted pages are mapped young and prefaulted
> >>>>>>>> pages are mapped old. But now a whole range will be mapped the same.
> >>>>>>>
> >>>>>>> Yes. You are right here.
> >>>>>>>
> >>>>>>> Look at the prefault and cpu_has_hw_af for ARM64, it looks like we
> >>>>>>> can avoid to handle vmf->address == addr specially. It's OK to
> >>>>>>> drop prefault and change the logic here a little bit to:
> >>>>>>> if (arch_wants_old_prefaulted_pte())
> >>>>>>> entry = pte_mkold(entry);
> >>>>>>> else
> >>>>>>> entry = pte_sw_mkyong(entry);
> >>>>>>>
> >>>>>>> It's not necessary to use pte_sw_mkyong for vmf->address == addr
> >>>>>>> because HW will set the ACCESS bit in page table entry.
> >>>>>>>
> >>>>>>> Add Will Deacon in case I missed something here. Thanks.
> >>>>>>
> >>>>>> I'll defer to Will's response, but not all arm HW supports HW access flag
> >>>>>> management. In that case it's done by SW, so I would imagine that by setting
> >>>>>> this to old initially, we will get a second fault to set the access bit, which
> >>>>>> will slow things down. I wonder if you will need to split this into (up to) 3
> >>>>>> calls to set_ptes()?
> >>>>>
> >>>>> I don't think we should do that. The limited information I have from
> >>>>> various microarchitectures is that the PTEs must differ only in their
> >>>>> PFN bits in order to use larger TLB entries. That includes the Accessed
> >>>>> bit (or equivalent). So we should mkyoung all the PTEs in the same
> >>>>> folio, at least initially.
> >>>>>
> >>>>> That said, we should still do this conditionally. We'll prefault some
> >>>>> other folios too. So I think this should be:
> >>>>>
> >>>>> bool prefault = (addr > vmf->address) || ((addr + nr * PAGE_SIZE) <= vmf->address);
> >>>>>
> >>>> According to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80, if the hardware access
> >>>> flag is supported on ARM64, there is a benefit if prefaulted PTEs are set as "old".
> >>>> If we change prefault like above, the PTEs are set as "young", which loses that
> >>>> benefit on ARM64 with the hardware access flag.
> >>>>
> >>>> OTOH, if going from "old" to "young" is cheap, why not leave all PTEs of the folio
> >>>> as "old" and let the hardware update them to "young"?
> >>>
> >>> Because we're tracking the entire folio as a single entity. So we're
> >>> better off avoiding the extra pagefaults to update the accessed bit,
> >>> which won't actually give us any information (vmscan needs to know "were
> >>> any of the accessed bits set", not "how many of them were set").
> >> There are no extra pagefaults to update the accessed bit. There are three cases here:
> >> 1. hardware supports the access flag and "old" to "young" is cheap, with no extra fault
> >> 2. hardware supports the access flag but "old" to "young" is expensive, with no extra fault
> >> 3. no hardware access flag support (extra pagefaults to go from "old" to "young"; expensive)
> >>
> >> For #2 and #3, going from "old" to "young" is expensive, so we always set PTEs "young"
> >> at page fault time.
> >> For #1, going from "old" to "young" is cheap, so it's OK to set PTEs "old" at page fault
> >> time, and the hardware will set them to "young" when the memory is accessed. Actually,
> >> ARM64 with the hardware access bit wants prefaulted PTEs to be set "old".
> >
> > Your logic makes sense, but it doesn't take into account the HPA
> > micro-architectural feature present in some ARM CPUs. HPA can transparently
> > coalesce multiple pages into a single TLB entry when certain conditions are met
> > (roughly; up to 4 pages physically and virtually contiguous and all within a
> > 4-page natural alignment). But as Matthew says, this works out better when all
> > pte attributes (including access and dirty) match. Given the reason for setting
> > the prefault pages to old is so that vmscan can do a better job of finding cold
> > pages, and given vmscan will now be looking for folios and not individual pages
> > (I assume?), I agree with Matthew that we should make whole folios young or old.
> > It will marginally increase our chances of the access and dirty bits being
> > consistent across the whole 4-page block that the HW tries to coalesce. If we
> > unconditionally make everything old, the hw will set accessed for the single
> > page that faulted, and we therefore don't have consistency for that 4-page block.
> My concern was that the benefit of "old" PTEs for ARM64 with the hardware access bit
> will be lost. The workloads (application launch latency and direct reclaim, according
> to commit 46bdb4277f98e70d0c91f4289897ade533fe9e80) could show a regression with this
> series. Thanks.

Yes, please don't fault everything in as young as it has caused horrible
vmscan behaviour leading to app-startup slowdown in the past:

https://lore.kernel.org/all/20210111140149.GB7642@willie-the-truck/

If we have to use the same value for all the ptes, then just base them
all on arch_wants_old_prefaulted_pte() as iirc hardware AF was pretty
cheap in practice for us.

Cheers,

Will

2023-03-24 15:15:21

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On Fri, Mar 24, 2023 at 02:58:29PM +0000, Will Deacon wrote:
> Yes, please don't fault everything in as young as it has caused horrible
> vmscan behaviour leading to app-startup slowdown in the past:
>
> https://lore.kernel.org/all/20210111140149.GB7642@willie-the-truck/
>
> If we have to use the same value for all the ptes, then just base them
> all on arch_wants_old_prefaulted_pte() as iirc hardware AF was pretty
> cheap in practice for us.

I think that's wrong, because this is a different scenario.

Before:

We faulted in N single-page folios. Each page/folio is tracked
independently. That's N entries on whatever LRU list it ends up on.
The prefaulted ones _should_ be marked old -- they haven't been
accessed; we've just decided to put them in the page tables to
speed up faultaround. The unaccessed pages need to fall off the LRU
list as quickly as possible; keeping them around only hurts if the
workload has no locality of reference.

After:

We fault in N folios, some possibly consisting of multiple pages.
Each folio is tracked separately, but individual pages in the folio
are not tracked; they belong to their folio. In this scenario, if
the other PTEs for pages in the same folio are marked as young or old
doesn't matter; the entire folio will be tracked as young, because we
referenced one of the pages in this folio. Marking the other PTEs as
young actually helps because we don't take pagefaults on them (whether
we have a HW or SW accessed bit).

(can i just say that i dislike how we mix up our old/young accessed/not
terminology here?)

We should still mark the PTEs referencing unaccessed folios as old.
No argument there, and this patch does that. But it's fine for all the
PTEs referencing the accessed folio to have the young bit, at least as
far as I can tell.
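
Taken together with the arch_wants_old_prefaulted_pte() discussion above, the
policy being argued for amounts to roughly the following sketch. This is
illustrative only, written against the names used in this thread (addr is the
first address being mapped, nr the number of pages, page the first page of the
folio); it is not the exact hunk that was merged.

	/* Sketch: one young/old decision per folio-sized range. */
	bool prefault = (addr > vmf->address) ||
			((addr + nr * PAGE_SIZE) <= vmf->address);
	pte_t entry = mk_pte(page, vma->vm_page_prot);

	if (prefault && arch_wants_old_prefaulted_pte())
		entry = pte_mkold(entry);	/* folio never accessed */
	else
		entry = pte_sw_mkyoung(entry);	/* one page was accessed */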

2023-03-24 17:31:38

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On Fri, Mar 24, 2023 at 03:11:00PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 24, 2023 at 02:58:29PM +0000, Will Deacon wrote:
> > Yes, please don't fault everything in as young as it has caused horrible
> > vmscan behaviour leading to app-startup slowdown in the past:
> >
> > https://lore.kernel.org/all/20210111140149.GB7642@willie-the-truck/
> >
> > If we have to use the same value for all the ptes, then just base them
> > all on arch_wants_old_prefaulted_pte() as iirc hardware AF was pretty
> > cheap in practice for us.
>
> I think that's wrong, because this is a different scenario.
>
> Before:
>
> We faulted in N single-page folios. Each page/folio is tracked
> independently. That's N entries on whatever LRU list it ends up on.
> The prefaulted ones _should_ be marked old -- they haven't been
> accessed; we've just decided to put them in the page tables to
> speed up faultaround. The unaccessed pages need to fall off the LRU
> list as quickly as possible; keeping them around only hurts if the
> workload has no locality of reference.
>
> After:
>
> We fault in N folios, some possibly consisting of multiple pages.
> Each folio is tracked separately, but individual pages in the folio
> are not tracked; they belong to their folio. In this scenario, if
> the other PTEs for pages in the same folio are marked as young or old
> doesn't matter; the entire folio will be tracked as young, because we
> referenced one of the pages in this folio. Marking the other PTEs as
> young actually helps because we don't take pagefaults on them (whether
> we have a HW or SW accessed bit).
>
> (can i just say that i dislike how we mix up our old/young accessed/not
> terminology here?)
>
> We should still mark the PTEs referencing unaccessed folios as old.
> No argument there, and this patch does that. But it's fine for all the
> PTEs referencing the accessed folio to have the young bit, at least as
> far as I can tell.

Ok, thanks for the explanation. So as long as
arch_wants_old_prefaulted_pte() is taken into account for the unaccessed
folios, then I think we should be good? Unconditionally marking those
PTEs as old probably hurts x86.

Will

2023-03-27 01:32:07

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH v4 35/36] mm: Convert do_set_pte() to set_pte_range()

On 3/25/23 01:23, Will Deacon wrote:
> On Fri, Mar 24, 2023 at 03:11:00PM +0000, Matthew Wilcox wrote:
>> On Fri, Mar 24, 2023 at 02:58:29PM +0000, Will Deacon wrote:
>>> Yes, please don't fault everything in as young as it has caused horrible
>>> vmscan behaviour leading to app-startup slowdown in the past:
>>>
>>> https://lore.kernel.org/all/20210111140149.GB7642@willie-the-truck/
>>>
>>> If we have to use the same value for all the ptes, then just base them
>>> all on arch_wants_old_prefaulted_pte() as iirc hardware AF was pretty
>>> cheap in practice for us.
>>
>> I think that's wrong, because this is a different scenario.
>>
>> Before:
>>
>> We faulted in N single-page folios. Each page/folio is tracked
>> independently. That's N entries on whatever LRU list it ends up on.
>> The prefaulted ones _should_ be marked old -- they haven't been
>> accessed; we've just decided to put them in the page tables to
>> speed up faultaround. The unaccessed pages need to fall off the LRU
>> list as quickly as possible; keeping them around only hurts if the
>> workload has no locality of reference.
>>
>> After:
>>
>> We fault in N folios, some possibly consisting of multiple pages.
>> Each folio is tracked separately, but individual pages in the folio
>> are not tracked; they belong to their folio. In this scenario, if
>> the other PTEs for pages in the same folio are marked as young or old
>> doesn't matter; the entire folio will be tracked as young, because we
>> referenced one of the pages in this folio. Marking the other PTEs as
>> young actually helps because we don't take pagefaults on them (whether
>> we have a HW or SW accessed bit).
>>
>> (can i just say that i dislike how we mix up our old/young accessed/not
>> terminology here?)
>>
>> We should still mark the PTEs referencing unaccessed folios as old.
>> No argument there, and this patch does that. But it's fine for all the
>> PTEs referencing the accessed folio to have the young bit, at least as
>> far as I can tell.
>
> Ok, thanks for the explanation. So as long as
> arch_wants_old_prefaulted_pte() is taken into account for the unaccessed
> folios, then I think we should be good? Unconditionally marking those
> PTEs as old probably hurts x86.
Yes. We only mark PTEs old on systems where arch_wants_old_prefaulted_pte()
is true. Thanks.


Regards
Yin, Fengwei

>
> Will

2023-05-25 02:47:45

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v4 02/36] mm: Add generic flush_icache_pages() and documentation



On 3/15/23 10:44, Matthew Wilcox (Oracle) wrote:
> flush_icache_page() is deprecated but not yet removed, so add
> a range version of it. Change the documentation to refer to
> update_mmu_cache_range() instead of update_mmu_cache().
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Reviewed-by: Anshuman Khandual <[email protected]>

> ---
> Documentation/core-api/cachetlb.rst | 35 +++++++++++++++--------------
> include/asm-generic/cacheflush.h | 5 +++++
> 2 files changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
> index 5c0552e78c58..d4c9e2a28d36 100644
> --- a/Documentation/core-api/cachetlb.rst
> +++ b/Documentation/core-api/cachetlb.rst
> @@ -88,13 +88,13 @@ changes occur:
>
> This is used primarily during fault processing.
>
> -5) ``void update_mmu_cache(struct vm_area_struct *vma,
> - unsigned long address, pte_t *ptep)``
> +5) ``void update_mmu_cache_range(struct vm_area_struct *vma,
> + unsigned long address, pte_t *ptep, unsigned int nr)``
>
> - At the end of every page fault, this routine is invoked to
> - tell the architecture specific code that a translation
> - now exists at virtual address "address" for address space
> - "vma->vm_mm", in the software page tables.
> + At the end of every page fault, this routine is invoked to tell
> + the architecture specific code that translations now exists
> + in the software page tables for address space "vma->vm_mm"
> + at virtual address "address" for "nr" consecutive pages.
>
> A port may use this information in any way it so chooses.
> For example, it could use this event to pre-load TLB
> @@ -306,17 +306,18 @@ maps this page at its virtual address.
> private". The kernel guarantees that, for pagecache pages, it will
> clear this bit when such a page first enters the pagecache.
>
> - This allows these interfaces to be implemented much more efficiently.
> - It allows one to "defer" (perhaps indefinitely) the actual flush if
> - there are currently no user processes mapping this page. See sparc64's
> - flush_dcache_page and update_mmu_cache implementations for an example
> - of how to go about doing this.
> + This allows these interfaces to be implemented much more
> + efficiently. It allows one to "defer" (perhaps indefinitely) the
> + actual flush if there are currently no user processes mapping this
> + page. See sparc64's flush_dcache_page and update_mmu_cache_range
> + implementations for an example of how to go about doing this.
>
> - The idea is, first at flush_dcache_page() time, if page_file_mapping()
> - returns a mapping, and mapping_mapped on that mapping returns %false,
> - just mark the architecture private page flag bit. Later, in
> - update_mmu_cache(), a check is made of this flag bit, and if set the
> - flush is done and the flag bit is cleared.
> + The idea is, first at flush_dcache_page() time, if
> + page_file_mapping() returns a mapping, and mapping_mapped on that
> + mapping returns %false, just mark the architecture private page
> + flag bit. Later, in update_mmu_cache_range(), a check is made
> + of this flag bit, and if set the flush is done and the flag bit
> + is cleared.
>
> .. important::
>
> @@ -369,7 +370,7 @@ maps this page at its virtual address.
> ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
>
> All the functionality of flush_icache_page can be implemented in
> - flush_dcache_page and update_mmu_cache. In the future, the hope
> + flush_dcache_page and update_mmu_cache_range. In the future, the hope
> is to remove this interface completely.
>
> The final category of APIs is for I/O to deliberately aliased address
> diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
> index f46258d1a080..09d51a680765 100644
> --- a/include/asm-generic/cacheflush.h
> +++ b/include/asm-generic/cacheflush.h
> @@ -78,6 +78,11 @@ static inline void flush_icache_range(unsigned long start, unsigned long end)
> #endif
>
> #ifndef flush_icache_page
> +static inline void flush_icache_pages(struct vm_area_struct *vma,
> + struct page *page, unsigned int nr)
> +{
> +}
> +
> static inline void flush_icache_page(struct vm_area_struct *vma,
> struct page *page)
> {
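
For architectures that do need I-cache maintenance, one plausible way to back
the new range call is a simple loop over a per-page primitive. The helper name
below is a placeholder, not a real kernel API; this is a sketch, not any
architecture's actual implementation.

	/*
	 * Sketch only: back the range API with a per-page flush.
	 * my_arch_sync_icache_page() stands in for whatever single-page
	 * primitive the architecture already has.
	 */
	static inline void flush_icache_pages(struct vm_area_struct *vma,
			struct page *page, unsigned int nr)
	{
		unsigned int i;

		for (i = 0; i < nr; i++)
			my_arch_sync_icache_page(vma, page + i);
	}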

2023-05-25 03:11:24

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v4 04/36] mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO



On 3/15/23 10:44, Matthew Wilcox (Oracle) wrote:
> Current best practice is to reuse the name of the function as a define
> to indicate that the function is implemented by the architecture.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Reviewed-by: Anshuman Khandual <[email protected]>

> ---
> Documentation/core-api/cachetlb.rst | 24 +++++++++---------------
> include/linux/cacheflush.h | 4 ++--
> mm/util.c | 2 +-
> 3 files changed, 12 insertions(+), 18 deletions(-)
>
> diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
> index d4c9e2a28d36..770008afd409 100644
> --- a/Documentation/core-api/cachetlb.rst
> +++ b/Documentation/core-api/cachetlb.rst
> @@ -269,7 +269,7 @@ maps this page at its virtual address.
> If D-cache aliasing is not an issue, these two routines may
> simply call memcpy/memset directly and do nothing more.
>
> - ``void flush_dcache_page(struct page *page)``
> + ``void flush_dcache_folio(struct folio *folio)``
>
> This routines must be called when:
>
> @@ -277,7 +277,7 @@ maps this page at its virtual address.
> and / or in high memory
> b) the kernel is about to read from a page cache page and user space
> shared/writable mappings of this page potentially exist. Note
> - that {get,pin}_user_pages{_fast} already call flush_dcache_page
> + that {get,pin}_user_pages{_fast} already call flush_dcache_folio
> on any page found in the user address space and thus driver
> code rarely needs to take this into account.
>
> @@ -291,7 +291,7 @@ maps this page at its virtual address.
>
> The phrase "kernel writes to a page cache page" means, specifically,
> that the kernel executes store instructions that dirty data in that
> - page at the page->virtual mapping of that page. It is important to
> + page at the kernel virtual mapping of that page. It is important to
> flush here to handle D-cache aliasing, to make sure these kernel stores
> are visible to user space mappings of that page.
>
> @@ -302,18 +302,18 @@ maps this page at its virtual address.
> If D-cache aliasing is not an issue, this routine may simply be defined
> as a nop on that architecture.
>
> - There is a bit set aside in page->flags (PG_arch_1) as "architecture
> + There is a bit set aside in folio->flags (PG_arch_1) as "architecture
> private". The kernel guarantees that, for pagecache pages, it will
> clear this bit when such a page first enters the pagecache.
>
> This allows these interfaces to be implemented much more
> efficiently. It allows one to "defer" (perhaps indefinitely) the
> actual flush if there are currently no user processes mapping this
> - page. See sparc64's flush_dcache_page and update_mmu_cache_range
> + page. See sparc64's flush_dcache_folio and update_mmu_cache_range
> implementations for an example of how to go about doing this.
>
> - The idea is, first at flush_dcache_page() time, if
> - page_file_mapping() returns a mapping, and mapping_mapped on that
> + The idea is, first at flush_dcache_folio() time, if
> + folio_flush_mapping() returns a mapping, and mapping_mapped() on that
> mapping returns %false, just mark the architecture private page
> flag bit. Later, in update_mmu_cache_range(), a check is made
> of this flag bit, and if set the flush is done and the flag bit
> @@ -327,12 +327,6 @@ maps this page at its virtual address.
> dirty. Again, see sparc64 for examples of how
> to deal with this.
>
> - ``void flush_dcache_folio(struct folio *folio)``
> - This function is called under the same circumstances as
> - flush_dcache_page(). It allows the architecture to
> - optimise for flushing the entire folio of pages instead
> - of flushing one page at a time.
> -
> ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
> unsigned long user_vaddr, void *dst, void *src, int len)``
> ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
> @@ -353,7 +347,7 @@ maps this page at its virtual address.
>
> When the kernel needs to access the contents of an anonymous
> page, it calls this function (currently only
> - get_user_pages()). Note: flush_dcache_page() deliberately
> + get_user_pages()). Note: flush_dcache_folio() deliberately
> doesn't work for an anonymous page. The default
> implementation is a nop (and should remain so for all coherent
> architectures). For incoherent architectures, it should flush
> @@ -370,7 +364,7 @@ maps this page at its virtual address.
> ``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
>
> All the functionality of flush_icache_page can be implemented in
> - flush_dcache_page and update_mmu_cache_range. In the future, the hope
> + flush_dcache_folio and update_mmu_cache_range. In the future, the hope
> is to remove this interface completely.
>
> The final category of APIs is for I/O to deliberately aliased address
> diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
> index a6189d21f2ba..82136f3fcf54 100644
> --- a/include/linux/cacheflush.h
> +++ b/include/linux/cacheflush.h
> @@ -7,14 +7,14 @@
> struct folio;
>
> #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
> -#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
> +#ifndef flush_dcache_folio
> void flush_dcache_folio(struct folio *folio);
> #endif
> #else
> static inline void flush_dcache_folio(struct folio *folio)
> {
> }
> -#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 0
> +#define flush_dcache_folio flush_dcache_folio
> #endif /* ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE */
>
> #endif /* _LINUX_CACHEFLUSH_H */
> diff --git a/mm/util.c b/mm/util.c
> index dd12b9531ac4..98ce51b01627 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1125,7 +1125,7 @@ void page_offline_end(void)
> }
> EXPORT_SYMBOL(page_offline_end);
>
> -#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
> +#ifndef flush_dcache_folio
> void flush_dcache_folio(struct folio *folio)
> {
> long i, nr = folio_nr_pages(folio);
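
For reference, under the function-name-as-define convention this patch adopts,
an architecture that provides its own implementation would advertise it in its
asm/cacheflush.h roughly as below. This is an illustrative sketch, not copied
from any particular architecture.

	#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
	void flush_dcache_page(struct page *page);

	void flush_dcache_folio(struct folio *folio);
	#define flush_dcache_folio flush_dcache_folio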

2023-05-25 07:12:38

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH v4 32/36] mm: Use flush_icache_pages() in do_set_pmd()



On 3/15/23 10:44, Matthew Wilcox (Oracle) wrote:
> Push the iteration over each page down to the architectures (many
> can flush the entire THP without iteration).
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>

Reviewed-by: Anshuman Khandual <[email protected]>

> ---
> mm/memory.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index c5f1bf906d0c..6aa21e8f3753 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4209,7 +4209,6 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
> pmd_t entry;
> - int i;
> vm_fault_t ret = VM_FAULT_FALLBACK;
>
> if (!transhuge_vma_suitable(vma, haddr))
> @@ -4242,8 +4241,7 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> if (unlikely(!pmd_none(*vmf->pmd)))
> goto out;
>
> - for (i = 0; i < HPAGE_PMD_NR; i++)
> - flush_icache_page(vma, page + i);
> + flush_icache_pages(vma, page, HPAGE_PMD_NR);
>
> entry = mk_huge_pmd(page, vma->vm_page_prot);
> if (write)

2023-05-30 08:14:39

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 0/4] New page table range API fixup patches

These are fixup patches for Matthew's New page table range API
patchset.

Many thanks to Matthew and Ryan for helping with these fixup patches.
I originally sent the patches to Matthew and Ryan in private mail; later,
I realized that private mail should be avoided.


Yin Fengwei (4):
filemap: avoid interfere with xas.xa_index
rmap: fix typo in folio_add_file_rmap_range()
mm: mark PTEs referencing the accessed folio young
filemap: Check address range in filemap_map_folio_range()

mm/filemap.c | 39 ++++++++++++---------------------------
mm/memory.c | 2 +-
mm/rmap.c | 2 +-
3 files changed, 14 insertions(+), 29 deletions(-)

--
2.30.2


2023-05-30 08:20:56

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 4/4] filemap: Check address range in filemap_map_folio_range()

With filemap_map_folio_range(), addr is advanced along with the range.
An address range check is needed to make sure the correct return value
(VM_FAULT_NOPAGE) is reported when vmf->address is handled.

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/filemap.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index fdb3e0a339b3..0f4baba1cd31 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3488,15 +3488,15 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
if (!pte_none(vmf->pte[count]))
goto skip;

- if (vmf->address == addr)
- ret = VM_FAULT_NOPAGE;
-
count++;
continue;
skip:
if (count) {
set_pte_range(vmf, folio, page, count, addr);
folio_ref_add(folio, count);
+ if ((vmf->address < (addr + count * PAGE_SIZE)) &&
+ (vmf->address >= addr))
+ ret = VM_FAULT_NOPAGE;
}

count++;
@@ -3509,6 +3509,9 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
if (count) {
set_pte_range(vmf, folio, page, count, addr);
folio_ref_add(folio, count);
+ if ((vmf->address < (addr + count * PAGE_SIZE)) &&
+ (vmf->address >= addr))
+ ret = VM_FAULT_NOPAGE;
}

vmf->pte = old_ptep;
--
2.30.2
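
The check added twice above is simply asking whether the faulting address falls
inside the batch of PTEs just installed. A hypothetical helper, not part of the
patch, makes the intent explicit:

	/* Hypothetical helper: did this batch of PTEs map vmf->address? */
	static inline bool range_covers_fault(struct vm_fault *vmf,
			unsigned long addr, unsigned int count)
	{
		return vmf->address >= addr &&
		       vmf->address < addr + count * PAGE_SIZE;
	}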


2023-05-30 08:25:30

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 3/4] mm: mark PTEs referencing the accessed folio young

To allow using larger TLB entries, it's better to mark all the PTEs of
the same folio accessed when setting up the PTEs.

Reported-by: Ryan Roberts <[email protected]>
Signed-off-by: Yin Fengwei <[email protected]>
---
mm/memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index c359fb8643e5..2615ea552613 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4259,7 +4259,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
struct vm_area_struct *vma = vmf->vma;
bool uffd_wp = pte_marker_uffd_wp(vmf->orig_pte);
bool write = vmf->flags & FAULT_FLAG_WRITE;
- bool prefault = vmf->address != addr;
+ bool prefault = (addr > vmf->address) || ((addr + nr * PAGE_SIZE) <= vmf->address);
pte_t entry;

flush_icache_pages(vma, page, nr);
--
2.30.2


2023-05-30 08:25:58

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 1/4] filemap: avoid interfere with xas.xa_index

Ryan noticed a 1% performance regression for kernel build with the ranged
file map on an ext4 file system. It was later identified as a wrong
xas.xa_index update in filemap_map_pages() when the folio is not a large
folio.

Matthew suggested using the XArray API instead of touching xas.xa_index
directly at [1].

[1] https://lore.kernel.org/linux-mm/ZBho6Q6Xq%[email protected]/

Signed-off-by: Yin Fengwei <[email protected]>
Suggested-by: Matthew Wilcox <[email protected]>
---
mm/filemap.c | 30 ++++++------------------------
1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 40be33b5ee46..fdb3e0a339b3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3416,10 +3416,10 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
return false;
}

-static struct folio *next_uptodate_page(struct folio *folio,
- struct address_space *mapping,
- struct xa_state *xas, pgoff_t end_pgoff)
+static struct folio *next_uptodate_folio(struct xa_state *xas,
+ struct address_space *mapping, pgoff_t end_pgoff)
{
+ struct folio *folio = xas_next_entry(xas, end_pgoff);
unsigned long max_idx;

do {
@@ -3457,22 +3457,6 @@ static struct folio *next_uptodate_page(struct folio *folio,
return NULL;
}

-static inline struct folio *first_map_page(struct address_space *mapping,
- struct xa_state *xas,
- pgoff_t end_pgoff)
-{
- return next_uptodate_page(xas_find(xas, end_pgoff),
- mapping, xas, end_pgoff);
-}
-
-static inline struct folio *next_map_page(struct address_space *mapping,
- struct xa_state *xas,
- pgoff_t end_pgoff)
-{
- return next_uptodate_page(xas_next_entry(xas, end_pgoff),
- mapping, xas, end_pgoff);
-}
-
/*
* Map page range [start_page, start_page + nr_pages) of folio.
* start_page is gotten from start by folio_page(folio, start)
@@ -3543,12 +3527,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
unsigned long addr;
XA_STATE(xas, &mapping->i_pages, start_pgoff);
struct folio *folio;
- unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
vm_fault_t ret = 0;
int nr_pages = 0;

rcu_read_lock();
- folio = first_map_page(mapping, &xas, end_pgoff);
+ folio = next_uptodate_folio(&xas, mapping, end_pgoff);
if (!folio)
goto out;

@@ -3570,15 +3553,14 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,

ret |= filemap_map_folio_range(vmf, folio,
xas.xa_index - folio->index, addr, nr_pages);
- xas.xa_index += nr_pages;

folio_unlock(folio);
folio_put(folio);
- } while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);
+ folio = next_uptodate_folio(&xas, mapping, end_pgoff);
+ } while (folio);
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
rcu_read_unlock();
- WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);
return ret;
}
EXPORT_SYMBOL(filemap_map_pages);
--
2.30.2
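
For reference, the resulting loop leaves all cursor movement to the XArray:
both the first and every subsequent lookup go through next_uptodate_folio(),
and nothing adjusts xas.xa_index by hand. The sketch below is simplified and
omits the locking, PMD and mapping checks in the real function.

	XA_STATE(xas, &mapping->i_pages, start_pgoff);
	struct folio *folio;

	rcu_read_lock();
	folio = next_uptodate_folio(&xas, mapping, end_pgoff);
	while (folio) {
		/* map this folio's pages; do not touch xas.xa_index here */
		folio_unlock(folio);
		folio_put(folio);
		folio = next_uptodate_folio(&xas, mapping, end_pgoff);
	}
	rcu_read_unlock();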


2023-05-30 08:42:06

by Yin, Fengwei

[permalink] [raw]
Subject: [PATCH 2/4] rmap: fix typo in folio_add_file_rmap_range()

The "first" should be used to compare with COMPOUND_MAPPED
instead of "nr".

Signed-off-by: Yin Fengwei <[email protected]>
---
mm/rmap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index ec52d7f264aa..b352c14da16c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1330,7 +1330,7 @@ void folio_add_file_rmap_range(struct folio *folio, struct page *page,
first = atomic_inc_and_test(&page->_mapcount);
if (first && folio_test_large(folio)) {
first = atomic_inc_return_relaxed(mapped);
- first = (nr < COMPOUND_MAPPED);
+ first = (first < COMPOUND_MAPPED);
}

if (first)
--
2.30.2
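
As I read it, the corrected hunk does the following; the code below mirrors
the fixed lines, and only the comments are mine:

	/* Per-page mapcount: did this page go from unmapped to mapped? */
	first = atomic_inc_and_test(&page->_mapcount);
	if (first && folio_test_large(folio)) {
		/*
		 * Count it in the folio-wide total as well, and only keep
		 * "first" set if the folio is not already compound-mapped
		 * (i.e. the new folio-wide count is below COMPOUND_MAPPED).
		 */
		first = atomic_inc_return_relaxed(mapped);
		first = (first < COMPOUND_MAPPED);
	}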