2023-02-28 21:39:41

by Matthew Wilcox

Subject: [PATCH v3 00/34] New page table range API

This patchset changes the API used by the MM to set up page table entries.
The four APIs are:
set_ptes(mm, addr, ptep, pte, nr)
update_mmu_cache_range(vma, addr, ptep, nr)
flush_dcache_folio(folio)
flush_icache_pages(vma, page, nr)

flush_dcache_folio() isn't technically new, but no architecture
implemented it, so I've done that for you. The old APIs remain around
but are mostly implemented by calling the new interfaces.

The new APIs are based around setting up N page table entries at once.
The N entries belong to the same PMD, the same folio and the same VMA,
so ptep++ is a legitimate operation, and locking is taken care of for
you. Some architectures can do a better job of it than just a loop,
but I have hesitated to make too deep a change to architectures I don't
understand well.
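
For an architecture with no cache or TLB quirks, the implementation is
just a loop, and that is what most of the per-architecture patches below
add. As a sketch (the increment differs slightly on architectures that
store the PFN at a different shift in the PTE):

static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
		pte_t *ptep, pte_t pte, unsigned int nr)
{
	for (;;) {
		set_pte(ptep, pte);		/* write one entry */
		if (--nr == 0)
			break;
		ptep++;				/* same PMD, so this is safe */
		pte_val(pte) += PAGE_SIZE;	/* next page of the folio */
	}
}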

One thing I have changed in every architecture is that PG_arch_1 is now a
per-folio bit instead of a per-page bit. This was something that would
have to happen eventually, and it makes sense to do it now rather than
iterate over every page involved in a cache flush and figure out if it
needs to happen.

The point of all this is better performance, and Fengwei Yin has
measured improvement on x86. I suspect you'll see improvement on
your architecture too. Try the new will-it-scale test mentioned here:
https://lore.kernel.org/linux-mm/[email protected]/
You'll need to run it on an XFS filesystem and have
CONFIG_TRANSPARENT_HUGEPAGE set.

For testing, I've only run the code on x86. If an x86->foo compiler
exists in Debian, I've built defconfig. I'm relying on the buildbots
to tell me what I missed, and people who actually have the hardware to
tell me if it actually works.

I'd like to get this into the MM tree soon after the current merge window
closes, so quick feedback would be appreciated.

v3:
- Reinstate flush_dcache_icache_phys() on PowerPC
- Fix folio_flush_mapping(). The documentation was correct and the
implementation was completely wrong
- Change the flush_dcache_page() documentation to describe
flush_dcache_folio() instead
- Split ARM from ARC. I messed up my git commands
- Remove page_mapping_file()
- Rationalise how flush_icache_pages() and flush_icache_page() are defined
- Use flush_icache_pages() in do_set_pmd()
- Pick up Guo Ren's Ack for csky

Matthew Wilcox (Oracle) (30):
mm: Convert page_table_check_pte_set() to page_table_check_ptes_set()
mm: Add generic flush_icache_pages() and documentation
mm: Add folio_flush_mapping()
mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
alpha: Implement the new page table range API
arc: Implement the new page table range API
arm: Implement the new page table range API
arm64: Implement the new page table range API
csky: Implement the new page table range API
hexagon: Implement the new page table range API
ia64: Implement the new page table range API
loongarch: Implement the new page table range API
m68k: Implement the new page table range API
microblaze: Implement the new page table range API
mips: Implement the new page table range API
nios2: Implement the new page table range API
openrisc: Implement the new page table range API
parisc: Implement the new page table range API
powerpc: Implement the new page table range API
riscv: Implement the new page table range API
s390: Implement the new page table range API
superh: Implement the new page table range API
sparc32: Implement the new page table range API
sparc64: Implement the new page table range API
um: Implement the new page table range API
x86: Implement the new page table range API
xtensa: Implement the new page table range API
mm: Remove page_mapping_file()
mm: Rationalise flush_icache_pages() and flush_icache_page()
mm: Use flush_icache_pages() in do_set_pmd()

Yin Fengwei (4):
filemap: Add filemap_map_folio_range()
rmap: add folio_add_file_rmap_range()
mm: Convert do_set_pte() to set_pte_range()
filemap: Batch PTE mappings

Documentation/core-api/cachetlb.rst | 51 +++++-----
Documentation/filesystems/locking.rst | 2 +-
arch/alpha/include/asm/cacheflush.h | 13 ++-
arch/alpha/include/asm/pgtable.h | 18 +++-
arch/arc/include/asm/cacheflush.h | 14 +--
arch/arc/include/asm/pgtable-bits-arcv2.h | 20 +++-
arch/arc/mm/cache.c | 61 +++++++-----
arch/arc/mm/tlb.c | 18 ++--
arch/arm/include/asm/cacheflush.h | 29 +++---
arch/arm/include/asm/pgtable.h | 5 +-
arch/arm/include/asm/tlbflush.h | 13 ++-
arch/arm/mm/copypage-v4mc.c | 5 +-
arch/arm/mm/copypage-v6.c | 5 +-
arch/arm/mm/copypage-xscale.c | 5 +-
arch/arm/mm/dma-mapping.c | 24 ++---
arch/arm/mm/fault-armv.c | 14 +--
arch/arm/mm/flush.c | 99 +++++++++++--------
arch/arm/mm/mm.h | 2 +-
arch/arm/mm/mmu.c | 14 ++-
arch/arm64/include/asm/cacheflush.h | 4 +-
arch/arm64/include/asm/pgtable.h | 25 +++--
arch/arm64/mm/flush.c | 36 +++----
arch/csky/abiv1/cacheflush.c | 32 ++++---
arch/csky/abiv1/inc/abi/cacheflush.h | 3 +-
arch/csky/abiv2/cacheflush.c | 30 +++---
arch/csky/abiv2/inc/abi/cacheflush.h | 11 ++-
arch/csky/include/asm/pgtable.h | 21 +++-
arch/hexagon/include/asm/cacheflush.h | 9 +-
arch/hexagon/include/asm/pgtable.h | 16 +++-
arch/ia64/hp/common/sba_iommu.c | 26 ++---
arch/ia64/include/asm/cacheflush.h | 14 ++-
arch/ia64/include/asm/pgtable.h | 14 ++-
arch/ia64/mm/init.c | 29 ++++--
arch/loongarch/include/asm/cacheflush.h | 2 +-
arch/loongarch/include/asm/pgtable.h | 30 ++++--
arch/m68k/include/asm/cacheflush_mm.h | 25 +++--
arch/m68k/include/asm/pgtable_mm.h | 21 +++-
arch/m68k/mm/motorola.c | 2 +-
arch/microblaze/include/asm/cacheflush.h | 8 ++
arch/microblaze/include/asm/pgtable.h | 17 +++-
arch/microblaze/include/asm/tlbflush.h | 4 +-
arch/mips/include/asm/cacheflush.h | 32 ++++---
arch/mips/include/asm/pgtable.h | 36 ++++---
arch/mips/mm/c-r4k.c | 5 +-
arch/mips/mm/cache.c | 56 +++++------
arch/mips/mm/init.c | 17 ++--
arch/nios2/include/asm/cacheflush.h | 6 +-
arch/nios2/include/asm/pgtable.h | 27 ++++--
arch/nios2/mm/cacheflush.c | 62 ++++++------
arch/openrisc/include/asm/cacheflush.h | 8 +-
arch/openrisc/include/asm/pgtable.h | 27 +++++-
arch/openrisc/mm/cache.c | 12 ++-
arch/parisc/include/asm/cacheflush.h | 14 ++-
arch/parisc/include/asm/pgtable.h | 28 ++++--
arch/parisc/kernel/cache.c | 101 ++++++++++++++------
arch/powerpc/include/asm/book3s/pgtable.h | 10 +-
arch/powerpc/include/asm/cacheflush.h | 14 ++-
arch/powerpc/include/asm/kvm_ppc.h | 10 +-
arch/powerpc/include/asm/nohash/pgtable.h | 13 +--
arch/powerpc/include/asm/pgtable.h | 6 ++
arch/powerpc/mm/book3s64/hash_utils.c | 11 ++-
arch/powerpc/mm/cacheflush.c | 40 +++-----
arch/powerpc/mm/nohash/e500_hugetlbpage.c | 3 +-
arch/powerpc/mm/pgtable.c | 51 +++++-----
arch/riscv/include/asm/cacheflush.h | 19 ++--
arch/riscv/include/asm/pgtable.h | 26 +++--
arch/riscv/mm/cacheflush.c | 11 +--
arch/s390/include/asm/pgtable.h | 34 +++++--
arch/sh/include/asm/cacheflush.h | 21 ++--
arch/sh/include/asm/pgtable.h | 6 +-
arch/sh/include/asm/pgtable_32.h | 16 +++-
arch/sh/mm/cache-j2.c | 4 +-
arch/sh/mm/cache-sh4.c | 26 +++--
arch/sh/mm/cache-sh7705.c | 26 +++--
arch/sh/mm/cache.c | 54 ++++++-----
arch/sh/mm/kmap.c | 3 +-
arch/sparc/include/asm/cacheflush_32.h | 9 +-
arch/sparc/include/asm/cacheflush_64.h | 19 ++--
arch/sparc/include/asm/pgtable_32.h | 15 ++-
arch/sparc/include/asm/pgtable_64.h | 25 ++++-
arch/sparc/kernel/smp_64.c | 56 +++++++----
arch/sparc/mm/init_32.c | 13 ++-
arch/sparc/mm/init_64.c | 78 ++++++++-------
arch/sparc/mm/tlb.c | 5 +-
arch/um/include/asm/pgtable.h | 15 ++-
arch/x86/include/asm/pgtable.h | 21 +++-
arch/xtensa/include/asm/cacheflush.h | 11 ++-
arch/xtensa/include/asm/pgtable.h | 24 +++--
arch/xtensa/mm/cache.c | 83 +++++++++-------
include/asm-generic/cacheflush.h | 7 --
include/linux/cacheflush.h | 13 ++-
include/linux/mm.h | 3 +-
include/linux/page_table_check.h | 14 +--
include/linux/pagemap.h | 28 ++++--
include/linux/rmap.h | 2 +
mm/filemap.c | 111 +++++++++++++---------
mm/memory.c | 30 +++---
mm/page_table_check.c | 14 +--
mm/rmap.c | 60 +++++++++---
mm/util.c | 2 +-
100 files changed, 1436 insertions(+), 848 deletions(-)

--
2.39.1



2023-02-28 21:39:51

by Matthew Wilcox

Subject: [PATCH v3 32/34] rmap: add folio_add_file_rmap_range()

From: Yin Fengwei <[email protected]>

folio_add_file_rmap_range() allows adding pte mappings to a specific
range of a file folio. Compared to page_add_file_rmap(), it batches the
__lruvec_stat updates for large folios.

Signed-off-by: Yin Fengwei <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/rmap.h | 2 ++
mm/rmap.c | 60 +++++++++++++++++++++++++++++++++-----------
2 files changed, 48 insertions(+), 14 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..a3825ce81102 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -198,6 +198,8 @@ void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
unsigned long address);
void page_add_file_rmap(struct page *, struct vm_area_struct *,
bool compound);
+void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
+ struct vm_area_struct *, bool compound);
void page_remove_rmap(struct page *, struct vm_area_struct *,
bool compound);

diff --git a/mm/rmap.c b/mm/rmap.c
index bacdb795d5ee..fffdb85a3b3d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1303,31 +1303,39 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
}

/**
- * page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
+ * folio_add_file_rmap_range - add pte mapping to page range of a folio
+ * @folio: The folio to add the mapping to
+ * @page: The first page to add
+ * @nr_pages: The number of pages which will be mapped
* @vma: the vm area in which the mapping is added
* @compound: charge the page as compound or small page
*
+ * The page range of folio is defined by [first_page, first_page + nr_pages)
+ *
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
- bool compound)
+void folio_add_file_rmap_range(struct folio *folio, struct page *page,
+ unsigned int nr_pages, struct vm_area_struct *vma,
+ bool compound)
{
- struct folio *folio = page_folio(page);
atomic_t *mapped = &folio->_nr_pages_mapped;
- int nr = 0, nr_pmdmapped = 0;
- bool first;
+ unsigned int nr_pmdmapped = 0, first;
+ int nr = 0;

- VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
+ VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);

/* Is page being mapped by PTE? Is this its first map to be added? */
if (likely(!compound)) {
- first = atomic_inc_and_test(&page->_mapcount);
- nr = first;
- if (first && folio_test_large(folio)) {
- nr = atomic_inc_return_relaxed(mapped);
- nr = (nr < COMPOUND_MAPPED);
- }
+ do {
+ first = atomic_inc_and_test(&page->_mapcount);
+ if (first && folio_test_large(folio)) {
+ first = atomic_inc_return_relaxed(mapped);
+ first = (nr < COMPOUND_MAPPED);
+ }
+
+ if (first)
+ nr++;
+ } while (page++, --nr_pages > 0);
} else if (folio_test_pmd_mappable(folio)) {
/* That test is redundant: it's for safety or to optimize out */

@@ -1356,6 +1364,30 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
mlock_vma_folio(folio, vma, compound);
}

+/**
+ * page_add_file_rmap - add pte mapping to a file page
+ * @page: the page to add the mapping to
+ * @vma: the vm area in which the mapping is added
+ * @compound: charge the page as compound or small page
+ *
+ * The caller needs to hold the pte lock.
+ */
+void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
+ bool compound)
+{
+ struct folio *folio = page_folio(page);
+ unsigned int nr_pages;
+
+ VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);
+
+ if (likely(!compound))
+ nr_pages = 1;
+ else
+ nr_pages = folio_nr_pages(folio);
+
+ folio_add_file_rmap_range(folio, page, nr_pages, vma, compound);
+}
+
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
--
2.39.1
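
A hypothetical caller, for illustration only (set_pte_range(), added
later in this series, is the real caller): map pages [idx, idx + nr) of
a file-backed folio into a VMA while holding the pte lock. The function
and variable names here are made up.

static void map_folio_range(struct vm_area_struct *vma, struct folio *folio,
		unsigned long idx, unsigned int nr)
{
	struct page *page = folio_page(folio, idx);

	/* One call adjusts _mapcount for each page but batches the
	 * __lruvec_stat update for the whole range. */
	folio_add_file_rmap_range(folio, page, nr, vma, false);
}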


2023-02-28 21:39:51

by Matthew Wilcox

Subject: [PATCH v3 24/34] sparc64: Implement the new page table range API

Add set_ptes(), update_mmu_cache_range(), flush_dcache_folio() and
flush_icache_pages(). Convert the PG_dcache_dirty flag from being
per-page to per-folio.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
---
arch/sparc/include/asm/cacheflush_64.h | 18 ++++--
arch/sparc/include/asm/pgtable_64.h | 25 +++++++--
arch/sparc/kernel/smp_64.c | 56 +++++++++++-------
arch/sparc/mm/init_64.c | 78 +++++++++++++++-----------
arch/sparc/mm/tlb.c | 5 +-
5 files changed, 117 insertions(+), 65 deletions(-)

diff --git a/arch/sparc/include/asm/cacheflush_64.h b/arch/sparc/include/asm/cacheflush_64.h
index b9341836597e..a9a719f04d06 100644
--- a/arch/sparc/include/asm/cacheflush_64.h
+++ b/arch/sparc/include/asm/cacheflush_64.h
@@ -35,20 +35,26 @@ void flush_icache_range(unsigned long start, unsigned long end);
void __flush_icache_page(unsigned long);

void __flush_dcache_page(void *addr, int flush_icache);
-void flush_dcache_page_impl(struct page *page);
+void flush_dcache_folio_impl(struct folio *folio);
#ifdef CONFIG_SMP
-void smp_flush_dcache_page_impl(struct page *page, int cpu);
-void flush_dcache_page_all(struct mm_struct *mm, struct page *page);
+void smp_flush_dcache_folio_impl(struct folio *folio, int cpu);
+void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio);
#else
-#define smp_flush_dcache_page_impl(page,cpu) flush_dcache_page_impl(page)
-#define flush_dcache_page_all(mm,page) flush_dcache_page_impl(page)
+#define smp_flush_dcache_folio_impl(folio, cpu) flush_dcache_folio_impl(folio)
+#define flush_dcache_folio_all(mm, folio) flush_dcache_folio_impl(folio)
#endif

void __flush_dcache_range(unsigned long start, unsigned long end);
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-void flush_dcache_page(struct page *page);
+void flush_dcache_folio(struct folio *folio);
+#define flush_dcache_folio flush_dcache_folio
+static inline void flush_dcache_page(struct page *page)
+{
+ flush_dcache_folio(page_folio(page));
+}

#define flush_icache_page(vma, pg) do { } while(0)
+#define flush_icache_pages(vma, pg, nr) do { } while(0)

void flush_ptrace_access(struct vm_area_struct *, struct page *,
unsigned long uaddr, void *kaddr,
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 2dc8d4641734..d5c0088e0c6a 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -911,8 +911,20 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
}

-#define set_pte_at(mm,addr,ptep,pte) \
- __set_pte_at((mm), (addr), (ptep), (pte), 0)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ for (;;) {
+ __set_pte_at(mm, addr, ptep, pte, 0);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte_val(pte) += PAGE_SIZE;
+ addr += PAGE_SIZE;
+ }
+}
+
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

#define pte_clear(mm,addr,ptep) \
set_pte_at((mm), (addr), (ptep), __pte(0UL))
@@ -931,8 +943,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
\
if (pfn_valid(this_pfn) && \
(((old_addr) ^ (new_addr)) & (1 << 13))) \
- flush_dcache_page_all(current->mm, \
- pfn_to_page(this_pfn)); \
+ flush_dcache_folio_all(current->mm, \
+ page_folio(pfn_to_page(this_pfn))); \
} \
newpte; \
})
@@ -947,7 +959,10 @@ struct seq_file;
void mmu_info(struct seq_file *);

struct vm_area_struct;
-void update_mmu_cache(struct vm_area_struct *, unsigned long, pte_t *);
+void update_mmu_cache_range(struct vm_area_struct *, unsigned long addr,
+ pte_t *ptep, unsigned int nr);
+#define update_mmu_cache(vma, addr, ptep) \
+ update_mmu_cache_range(vma, addr, ptep, 1)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd);
diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index a55295d1b924..90ef8677ac89 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -921,20 +921,26 @@ extern unsigned long xcall_flush_dcache_page_cheetah;
#endif
extern unsigned long xcall_flush_dcache_page_spitfire;

-static inline void __local_flush_dcache_page(struct page *page)
+static inline void __local_flush_dcache_folio(struct folio *folio)
{
+ unsigned int i, nr = folio_nr_pages(folio);
+
#ifdef DCACHE_ALIASING_POSSIBLE
- __flush_dcache_page(page_address(page),
+ for (i = 0; i < nr; i++)
+ __flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
((tlb_type == spitfire) &&
- page_mapping_file(page) != NULL));
+ folio_flush_mapping(folio) != NULL));
#else
- if (page_mapping_file(page) != NULL &&
- tlb_type == spitfire)
- __flush_icache_page(__pa(page_address(page)));
+ if (folio_flush_mapping(folio) != NULL &&
+ tlb_type == spitfire) {
+ unsigned long pfn = folio_pfn(folio);
+ for (i = 0; i < nr; i++)
+ __flush_icache_page((pfn + i) * PAGE_SIZE);
+ }
#endif
}

-void smp_flush_dcache_page_impl(struct page *page, int cpu)
+void smp_flush_dcache_folio_impl(struct folio *folio, int cpu)
{
int this_cpu;

@@ -948,14 +954,14 @@ void smp_flush_dcache_page_impl(struct page *page, int cpu)
this_cpu = get_cpu();

if (cpu == this_cpu) {
- __local_flush_dcache_page(page);
+ __local_flush_dcache_folio(folio);
} else if (cpu_online(cpu)) {
- void *pg_addr = page_address(page);
+ void *pg_addr = folio_address(folio);
u64 data0 = 0;

if (tlb_type == spitfire) {
data0 = ((u64)&xcall_flush_dcache_page_spitfire);
- if (page_mapping_file(page) != NULL)
+ if (folio_flush_mapping(folio) != NULL)
data0 |= ((u64)1 << 32);
} else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
#ifdef DCACHE_ALIASING_POSSIBLE
@@ -963,18 +969,23 @@ void smp_flush_dcache_page_impl(struct page *page, int cpu)
#endif
}
if (data0) {
- xcall_deliver(data0, __pa(pg_addr),
- (u64) pg_addr, cpumask_of(cpu));
+ unsigned int i, nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ xcall_deliver(data0, __pa(pg_addr),
+ (u64) pg_addr, cpumask_of(cpu));
#ifdef CONFIG_DEBUG_DCFLUSH
- atomic_inc(&dcpage_flushes_xcall);
+ atomic_inc(&dcpage_flushes_xcall);
#endif
+ pg_addr += PAGE_SIZE;
+ }
}
}

put_cpu();
}

-void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
+void flush_dcache_folio_all(struct mm_struct *mm, struct folio *folio)
{
void *pg_addr;
u64 data0;
@@ -988,10 +999,10 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
atomic_inc(&dcpage_flushes);
#endif
data0 = 0;
- pg_addr = page_address(page);
+ pg_addr = folio_address(folio);
if (tlb_type == spitfire) {
data0 = ((u64)&xcall_flush_dcache_page_spitfire);
- if (page_mapping_file(page) != NULL)
+ if (folio_flush_mapping(folio) != NULL)
data0 |= ((u64)1 << 32);
} else if (tlb_type == cheetah || tlb_type == cheetah_plus) {
#ifdef DCACHE_ALIASING_POSSIBLE
@@ -999,13 +1010,18 @@ void flush_dcache_page_all(struct mm_struct *mm, struct page *page)
#endif
}
if (data0) {
- xcall_deliver(data0, __pa(pg_addr),
- (u64) pg_addr, cpu_online_mask);
+ unsigned int i, nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++) {
+ xcall_deliver(data0, __pa(pg_addr),
+ (u64) pg_addr, cpu_online_mask);
#ifdef CONFIG_DEBUG_DCFLUSH
- atomic_inc(&dcpage_flushes_xcall);
+ atomic_inc(&dcpage_flushes_xcall);
#endif
+ pg_addr += PAGE_SIZE;
+ }
}
- __local_flush_dcache_page(page);
+ __local_flush_dcache_folio(folio);

preempt_enable();
}
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..ab9aacbaf43c 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -195,21 +195,26 @@ atomic_t dcpage_flushes_xcall = ATOMIC_INIT(0);
#endif
#endif

-inline void flush_dcache_page_impl(struct page *page)
+inline void flush_dcache_folio_impl(struct folio *folio)
{
+ unsigned int i, nr = folio_nr_pages(folio);
+
BUG_ON(tlb_type == hypervisor);
#ifdef CONFIG_DEBUG_DCFLUSH
atomic_inc(&dcpage_flushes);
#endif

#ifdef DCACHE_ALIASING_POSSIBLE
- __flush_dcache_page(page_address(page),
- ((tlb_type == spitfire) &&
- page_mapping_file(page) != NULL));
+ for (i = 0; i < nr; i++)
+ __flush_dcache_page(folio_address(folio) + i * PAGE_SIZE,
+ ((tlb_type == spitfire) &&
+ folio_flush_mapping(folio) != NULL));
#else
- if (page_mapping_file(page) != NULL &&
- tlb_type == spitfire)
- __flush_icache_page(__pa(page_address(page)));
+ if (folio_flush_mapping(folio) != NULL &&
+ tlb_type == spitfire) {
+ for (i = 0; i < nr; i++)
+ __flush_icache_page(__pa(folio_address(folio)) + i * PAGE_SIZE);
+ }
#endif
}

@@ -218,10 +223,10 @@ inline void flush_dcache_page_impl(struct page *page)
#define PG_dcache_cpu_mask \
((1UL<<ilog2(roundup_pow_of_two(NR_CPUS)))-1UL)

-#define dcache_dirty_cpu(page) \
- (((page)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)
+#define dcache_dirty_cpu(folio) \
+ (((folio)->flags >> PG_dcache_cpu_shift) & PG_dcache_cpu_mask)

-static inline void set_dcache_dirty(struct page *page, int this_cpu)
+static inline void set_dcache_dirty(struct folio *folio, int this_cpu)
{
unsigned long mask = this_cpu;
unsigned long non_cpu_bits;
@@ -238,11 +243,11 @@ static inline void set_dcache_dirty(struct page *page, int this_cpu)
"bne,pn %%xcc, 1b\n\t"
" nop"
: /* no outputs */
- : "r" (mask), "r" (non_cpu_bits), "r" (&page->flags)
+ : "r" (mask), "r" (non_cpu_bits), "r" (&folio->flags)
: "g1", "g7");
}

-static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
+static inline void clear_dcache_dirty_cpu(struct folio *folio, unsigned long cpu)
{
unsigned long mask = (1UL << PG_dcache_dirty);

@@ -260,7 +265,7 @@ static inline void clear_dcache_dirty_cpu(struct page *page, unsigned long cpu)
" nop\n"
"2:"
: /* no outputs */
- : "r" (cpu), "r" (mask), "r" (&page->flags),
+ : "r" (cpu), "r" (mask), "r" (&folio->flags),
"i" (PG_dcache_cpu_mask),
"i" (PG_dcache_cpu_shift)
: "g1", "g7");
@@ -284,9 +289,10 @@ static void flush_dcache(unsigned long pfn)

page = pfn_to_page(pfn);
if (page) {
+ struct folio *folio = page_folio(page);
unsigned long pg_flags;

- pg_flags = page->flags;
+ pg_flags = folio->flags;
if (pg_flags & (1UL << PG_dcache_dirty)) {
int cpu = ((pg_flags >> PG_dcache_cpu_shift) &
PG_dcache_cpu_mask);
@@ -296,11 +302,11 @@ static void flush_dcache(unsigned long pfn)
* in the SMP case.
*/
if (cpu == this_cpu)
- flush_dcache_page_impl(page);
+ flush_dcache_folio_impl(folio);
else
- smp_flush_dcache_page_impl(page, cpu);
+ smp_flush_dcache_folio_impl(folio, cpu);

- clear_dcache_dirty_cpu(page, cpu);
+ clear_dcache_dirty_cpu(folio, cpu);

put_cpu();
}
@@ -388,12 +394,14 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
}
#endif /* CONFIG_HUGETLB_PAGE */

-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *ptep)
+void update_mmu_cache_range(struct vm_area_struct *vma, unsigned long address,
+ pte_t *ptep, unsigned int nr)
{
struct mm_struct *mm;
unsigned long flags;
bool is_huge_tsb;
pte_t pte = *ptep;
+ unsigned int i;

if (tlb_type != hypervisor) {
unsigned long pfn = pte_pfn(pte);
@@ -440,15 +448,21 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long address, pte_t *
}
}
#endif
- if (!is_huge_tsb)
- __update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
- address, pte_val(pte));
+ if (!is_huge_tsb) {
+ for (i = 0; i < nr; i++) {
+ __update_mmu_tsb_insert(mm, MM_TSB_BASE, PAGE_SHIFT,
+ address, pte_val(pte));
+ address += PAGE_SIZE;
+ pte_val(pte) += PAGE_SIZE;
+ }
+ }

spin_unlock_irqrestore(&mm->context.lock, flags);
}

-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
{
+ unsigned long pfn = folio_pfn(folio);
struct address_space *mapping;
int this_cpu;

@@ -459,35 +473,35 @@ void flush_dcache_page(struct page *page)
* is merely the zero page. The 'bigcore' testcase in GDB
* causes this case to run millions of times.
*/
- if (page == ZERO_PAGE(0))
+ if (is_zero_pfn(pfn))
return;

this_cpu = get_cpu();

- mapping = page_mapping_file(page);
+ mapping = folio_flush_mapping(folio);
if (mapping && !mapping_mapped(mapping)) {
- int dirty = test_bit(PG_dcache_dirty, &page->flags);
+ bool dirty = test_bit(PG_dcache_dirty, &folio->flags);
if (dirty) {
- int dirty_cpu = dcache_dirty_cpu(page);
+ int dirty_cpu = dcache_dirty_cpu(folio);

if (dirty_cpu == this_cpu)
goto out;
- smp_flush_dcache_page_impl(page, dirty_cpu);
+ smp_flush_dcache_folio_impl(folio, dirty_cpu);
}
- set_dcache_dirty(page, this_cpu);
+ set_dcache_dirty(folio, this_cpu);
} else {
/* We could delay the flush for the !page_mapping
* case too. But that case is for exec env/arg
* pages and those are %99 certainly going to get
* faulted into the tlb (and thus flushed) anyways.
*/
- flush_dcache_page_impl(page);
+ flush_dcache_folio_impl(folio);
}

out:
put_cpu();
}
-EXPORT_SYMBOL(flush_dcache_page);
+EXPORT_SYMBOL(flush_dcache_folio);

void __kprobes flush_icache_range(unsigned long start, unsigned long end)
{
@@ -2280,10 +2294,10 @@ void __init paging_init(void)
setup_page_offset();

/* These build time checks make sure that the dcache_dirty_cpu()
- * page->flags usage will work.
+ * folio->flags usage will work.
*
* When a page gets marked as dcache-dirty, we store the
- * cpu number starting at bit 32 in the page->flags. Also,
+ * cpu number starting at bit 32 in the folio->flags. Also,
* functions like clear_dcache_dirty_cpu use the cpu mask
* in 13-bit signed-immediate instruction fields.
*/
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..3fa6a070912d 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -118,6 +118,7 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
unsigned long paddr, pfn = pte_pfn(orig);
struct address_space *mapping;
struct page *page;
+ struct folio *folio;

if (!pfn_valid(pfn))
goto no_cache_flush;
@@ -127,13 +128,13 @@ void tlb_batch_add(struct mm_struct *mm, unsigned long vaddr,
goto no_cache_flush;

/* A real file page? */
- mapping = page_mapping_file(page);
+ mapping = folio_flush_mapping(folio);
if (!mapping)
goto no_cache_flush;

paddr = (unsigned long) page_address(page);
if ((paddr ^ vaddr) & (1 << 13))
- flush_dcache_page_all(mm, page);
+ flush_dcache_folio_all(mm, folio);
}

no_cache_flush:
--
2.39.1


2023-02-28 21:39:51

by Matthew Wilcox

Subject: [PATCH v3 23/34] sparc32: Implement the new page table range API

Add set_ptes(), update_mmu_cache_range(), flush_dcache_folio() and
flush_icache_pages().

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: [email protected]
---
arch/sparc/include/asm/cacheflush_32.h | 9 +++++++--
arch/sparc/include/asm/pgtable_32.h | 15 ++++++++++++++-
arch/sparc/mm/init_32.c | 13 +++++++++++--
3 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/include/asm/cacheflush_32.h b/arch/sparc/include/asm/cacheflush_32.h
index adb6991d0455..8dba35d63328 100644
--- a/arch/sparc/include/asm/cacheflush_32.h
+++ b/arch/sparc/include/asm/cacheflush_32.h
@@ -16,6 +16,7 @@
sparc32_cachetlb_ops->cache_page(vma, addr)
#define flush_icache_range(start, end) do { } while (0)
#define flush_icache_page(vma, pg) do { } while (0)
+#define flush_icache_pages(vma, pg, nr) do { } while (0)

#define copy_to_user_page(vma, page, vaddr, dst, src, len) \
do { \
@@ -35,11 +36,15 @@
#define flush_page_for_dma(addr) \
sparc32_cachetlb_ops->page_for_dma(addr)

-struct page;
void sparc_flush_page_to_ram(struct page *page);
+void sparc_flush_folio_to_ram(struct folio *folio);

#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-#define flush_dcache_page(page) sparc_flush_page_to_ram(page)
+#define flush_dcache_folio(folio) sparc_flush_folio_to_ram(folio)
+static inline void flush_dcache_page(struct page *page)
+{
+ flush_dcache_folio(page_folio(page));
+}
#define flush_dcache_mmap_lock(mapping) do { } while (0)
#define flush_dcache_mmap_unlock(mapping) do { } while (0)

diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index d4330e3c57a6..47ae55ea1837 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -101,7 +101,19 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
srmmu_swap((unsigned long *)ptep, pte_val(pteval));
}

-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ for (;;) {
+ set_pte(ptep, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte_val(pte) += PAGE_SIZE;
+ }
+}
+
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

static inline int srmmu_device_memory(unsigned long x)
{
@@ -318,6 +330,7 @@ void mmu_info(struct seq_file *m);
#define FAULT_CODE_USER 0x4

#define update_mmu_cache(vma, address, ptep) do { } while (0)
+#define update_mmu_cache_range(vma, address, ptep, nr) do { } while (0)

void srmmu_mapiorange(unsigned int bus, unsigned long xpa,
unsigned long xva, unsigned int len);
diff --git a/arch/sparc/mm/init_32.c b/arch/sparc/mm/init_32.c
index 9c0ea457bdf0..d96a14ffceeb 100644
--- a/arch/sparc/mm/init_32.c
+++ b/arch/sparc/mm/init_32.c
@@ -297,11 +297,20 @@ void sparc_flush_page_to_ram(struct page *page)
{
unsigned long vaddr = (unsigned long)page_address(page);

- if (vaddr)
- __flush_page_to_ram(vaddr);
+ __flush_page_to_ram(vaddr);
}
EXPORT_SYMBOL(sparc_flush_page_to_ram);

+void sparc_flush_folio_to_ram(struct folio *folio)
+{
+ unsigned long vaddr = (unsigned long)folio_address(folio);
+ unsigned int i, nr = folio_nr_pages(folio);
+
+ for (i = 0; i < nr; i++)
+ __flush_page_to_ram(vaddr + i * PAGE_SIZE);
+}
+EXPORT_SYMBOL(sparc_flush_folio_to_ram);
+
static const pgprot_t protection_map[16] = {
[VM_NONE] = PAGE_NONE,
[VM_READ] = PAGE_READONLY,
--
2.39.1


2023-02-28 21:39:53

by Matthew Wilcox

Subject: [PATCH v3 04/34] mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO

Current best practice is to reuse the name of the function as a define
to indicate that the function is implemented by the architecture.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
Documentation/core-api/cachetlb.rst | 24 +++++++++---------------
include/linux/cacheflush.h | 4 ++--
mm/util.c | 2 +-
3 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/Documentation/core-api/cachetlb.rst b/Documentation/core-api/cachetlb.rst
index d4c9e2a28d36..770008afd409 100644
--- a/Documentation/core-api/cachetlb.rst
+++ b/Documentation/core-api/cachetlb.rst
@@ -269,7 +269,7 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, these two routines may
simply call memcpy/memset directly and do nothing more.

- ``void flush_dcache_page(struct page *page)``
+ ``void flush_dcache_folio(struct folio *folio)``

This routine must be called when:

@@ -277,7 +277,7 @@ maps this page at its virtual address.
and / or in high memory
b) the kernel is about to read from a page cache page and user space
shared/writable mappings of this page potentially exist. Note
- that {get,pin}_user_pages{_fast} already call flush_dcache_page
+ that {get,pin}_user_pages{_fast} already call flush_dcache_folio
on any page found in the user address space and thus driver
code rarely needs to take this into account.

@@ -291,7 +291,7 @@ maps this page at its virtual address.

The phrase "kernel writes to a page cache page" means, specifically,
that the kernel executes store instructions that dirty data in that
- page at the page->virtual mapping of that page. It is important to
+ page at the kernel virtual mapping of that page. It is important to
flush here to handle D-cache aliasing, to make sure these kernel stores
are visible to user space mappings of that page.

@@ -302,18 +302,18 @@ maps this page at its virtual address.
If D-cache aliasing is not an issue, this routine may simply be defined
as a nop on that architecture.

- There is a bit set aside in page->flags (PG_arch_1) as "architecture
+ There is a bit set aside in folio->flags (PG_arch_1) as "architecture
private". The kernel guarantees that, for pagecache pages, it will
clear this bit when such a page first enters the pagecache.

This allows these interfaces to be implemented much more
efficiently. It allows one to "defer" (perhaps indefinitely) the
actual flush if there are currently no user processes mapping this
- page. See sparc64's flush_dcache_page and update_mmu_cache_range
+ page. See sparc64's flush_dcache_folio and update_mmu_cache_range
implementations for an example of how to go about doing this.

- The idea is, first at flush_dcache_page() time, if
- page_file_mapping() returns a mapping, and mapping_mapped on that
+ The idea is, first at flush_dcache_folio() time, if
+ folio_flush_mapping() returns a mapping, and mapping_mapped() on that
mapping returns %false, just mark the architecture private page
flag bit. Later, in update_mmu_cache_range(), a check is made
of this flag bit, and if set the flush is done and the flag bit
@@ -327,12 +327,6 @@ maps this page at its virtual address.
dirty. Again, see sparc64 for examples of how
to deal with this.

- ``void flush_dcache_folio(struct folio *folio)``
- This function is called under the same circumstances as
- flush_dcache_page(). It allows the architecture to
- optimise for flushing the entire folio of pages instead
- of flushing one page at a time.
-
``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
unsigned long user_vaddr, void *dst, void *src, int len)``
``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
@@ -353,7 +347,7 @@ maps this page at its virtual address.

When the kernel needs to access the contents of an anonymous
page, it calls this function (currently only
- get_user_pages()). Note: flush_dcache_page() deliberately
+ get_user_pages()). Note: flush_dcache_folio() deliberately
doesn't work for an anonymous page. The default
implementation is a nop (and should remain so for all coherent
architectures). For incoherent architectures, it should flush
@@ -370,7 +364,7 @@ maps this page at its virtual address.
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``

All the functionality of flush_icache_page can be implemented in
- flush_dcache_page and update_mmu_cache_range. In the future, the hope
+ flush_dcache_folio and update_mmu_cache_range. In the future, the hope
is to remove this interface completely.

The final category of APIs is for I/O to deliberately aliased address
diff --git a/include/linux/cacheflush.h b/include/linux/cacheflush.h
index a6189d21f2ba..82136f3fcf54 100644
--- a/include/linux/cacheflush.h
+++ b/include/linux/cacheflush.h
@@ -7,14 +7,14 @@
struct folio;

#if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE
-#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
+#ifndef flush_dcache_folio
void flush_dcache_folio(struct folio *folio);
#endif
#else
static inline void flush_dcache_folio(struct folio *folio)
{
}
-#define ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO 0
+#define flush_dcache_folio flush_dcache_folio
#endif /* ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE */

#endif /* _LINUX_CACHEFLUSH_H */
diff --git a/mm/util.c b/mm/util.c
index b8ed9dbc7fd5..f66e0ca82d2d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1124,7 +1124,7 @@ void page_offline_end(void)
}
EXPORT_SYMBOL(page_offline_end);

-#ifndef ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
+#ifndef flush_dcache_folio
void flush_dcache_folio(struct folio *folio)
{
long i, nr = folio_nr_pages(folio);
--
2.39.1
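
To make the deferred-flush scheme described above concrete, here is a
simplified sketch. It is modelled on sparc64 (which additionally records
which CPU dirtied the folio), and __flush_dcache_folio() stands in for
the architecture's real flush primitive:

void flush_dcache_folio(struct folio *folio)
{
	struct address_space *mapping = folio_flush_mapping(folio);

	if (mapping && !mapping_mapped(mapping)) {
		/* No user mappings yet; defer by marking the folio. */
		set_bit(PG_arch_1, &folio->flags);
	} else {
		/* User mappings may exist; flush the whole folio now. */
		__flush_dcache_folio(folio);
	}
}

void update_mmu_cache_range(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, unsigned int nr)
{
	struct folio *folio = page_folio(pte_page(*ptep));

	/* Perform the deferred flush, once per folio rather than per page. */
	if (test_and_clear_bit(PG_arch_1, &folio->flags))
		__flush_dcache_folio(folio);
}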


2023-02-28 21:39:56

by Matthew Wilcox

Subject: [PATCH v3 34/34] filemap: Batch PTE mappings

From: Yin Fengwei <[email protected]>

Call set_pte_range() once per contiguous range of the folio instead
of once per page. This batches the updates to mm counters and the
rmap.

With a will-it-scale.page_fault3-like app (file write fault testing
changed to read fault testing; we are trying to upstream it to
will-it-scale at [1]), we got a 15% performance gain on a 48C/96T
Cascade Lake test box with 96 processes running against xfs.

Perf data collected before/after the change:
18.73%--page_add_file_rmap
|
--11.60%--__mod_lruvec_page_state
|
|--7.40%--__mod_memcg_lruvec_state
| |
| --5.58%--cgroup_rstat_updated
|
--2.53%--__mod_lruvec_state
|
--1.48%--__mod_node_page_state

9.93%--page_add_file_rmap_range
|
--2.67%--__mod_lruvec_page_state
|
|--1.95%--__mod_memcg_lruvec_state
| |
| --1.57%--cgroup_rstat_updated
|
--0.61%--__mod_lruvec_state
|
--0.54%--__mod_node_page_state

The running time of __mod_lruvec_page_state() is reduced by about 9%.

[1]: https://github.com/antonblanchard/will-it-scale/pull/37

Signed-off-by: Yin Fengwei <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
mm/filemap.c | 36 +++++++++++++++++++++++++-----------
1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 07ebd90967a3..40be33b5ee46 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3486,11 +3486,12 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
struct file *file = vma->vm_file;
struct page *page = folio_page(folio, start);
unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
- unsigned int ref_count = 0, count = 0;
+ unsigned int count = 0;
+ pte_t *old_ptep = vmf->pte;

do {
- if (PageHWPoison(page))
- continue;
+ if (PageHWPoison(page + count))
+ goto skip;

if (mmap_miss > 0)
mmap_miss--;
@@ -3500,20 +3501,33 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
* handled in the specific fault path, and it'll prohibit the
* fault-around logic.
*/
- if (!pte_none(*vmf->pte))
- continue;
+ if (!pte_none(vmf->pte[count]))
+ goto skip;

if (vmf->address == addr)
ret = VM_FAULT_NOPAGE;

- ref_count++;
- set_pte_range(vmf, folio, page, 1, addr);
- } while (vmf->pte++, page++, addr += PAGE_SIZE, ++count < nr_pages);
+ count++;
+ continue;
+skip:
+ if (count) {
+ set_pte_range(vmf, folio, page, count, addr);
+ folio_ref_add(folio, count);
+ }

- /* Restore the vmf->pte */
- vmf->pte -= nr_pages;
+ count++;
+ page += count;
+ vmf->pte += count;
+ addr += count * PAGE_SIZE;
+ count = 0;
+ } while (--nr_pages > 0);
+
+ if (count) {
+ set_pte_range(vmf, folio, page, count, addr);
+ folio_ref_add(folio, count);
+ }

- folio_ref_add(folio, ref_count);
+ vmf->pte = old_ptep;
WRITE_ONCE(file->f_ra.mmap_miss, mmap_miss);

return ret;
--
2.39.1
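
Stripped of the VM details, the loop above is a run-batching pattern.
A stand-alone sketch with made-up helper names:

#include <stdbool.h>

/* Emit one map_range(first, count) call per contiguous run of usable
 * pages instead of one call per page. */
static void map_in_runs(int nr_pages, bool (*usable)(int idx),
			void (*map_range)(int first, int count))
{
	int i, count = 0;

	for (i = 0; i < nr_pages; i++) {
		if (usable(i)) {
			count++;		/* extend the current run */
			continue;
		}
		if (count)			/* emit the pending run */
			map_range(i - count, count);
		count = 0;
	}
	if (count)				/* emit the trailing run */
		map_range(nr_pages - count, count);
}

For a 16-page folio with only page 5 poisoned, this issues two calls,
map_range(0, 5) and map_range(6, 10), instead of fifteen per-page calls.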


2023-03-01 03:04:41

by Yin, Fengwei

Subject: Re: [PATCH v3 32/34] rmap: add folio_add_file_rmap_range()



On 3/1/2023 5:37 AM, Matthew Wilcox (Oracle) wrote:
> From: Yin Fengwei <[email protected]>
>
> folio_add_file_rmap_range() allows adding pte mappings to a specific
> range of a file folio. Compared to page_add_file_rmap(), it batches the
> __lruvec_stat updates for large folios.
>
> Signed-off-by: Yin Fengwei <[email protected]>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> ---
> include/linux/rmap.h | 2 ++
> mm/rmap.c | 60 +++++++++++++++++++++++++++++++++-----------
> 2 files changed, 48 insertions(+), 14 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b87d01660412..a3825ce81102 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -198,6 +198,8 @@ void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
> unsigned long address);
> void page_add_file_rmap(struct page *, struct vm_area_struct *,
> bool compound);
> +void folio_add_file_rmap_range(struct folio *, struct page *, unsigned int nr,
> + struct vm_area_struct *, bool compound);
> void page_remove_rmap(struct page *, struct vm_area_struct *,
> bool compound);
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index bacdb795d5ee..fffdb85a3b3d 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1303,31 +1303,39 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
> }
>
> /**
> - * page_add_file_rmap - add pte mapping to a file page
> - * @page: the page to add the mapping to
> + * folio_add_file_rmap_range - add pte mapping to page range of a folio
> + * @folio: The folio to add the mapping to
> + * @page: The first page to add
> + * @nr_pages: The number of pages which will be mapped
> * @vma: the vm area in which the mapping is added
> * @compound: charge the page as compound or small page
> *
> + * The page range of folio is defined by [first_page, first_page + nr_pages)
> + *
> * The caller needs to hold the pte lock.
> */
> -void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> - bool compound)
> +void folio_add_file_rmap_range(struct folio *folio, struct page *page,
> + unsigned int nr_pages, struct vm_area_struct *vma,
> + bool compound)
> {
> - struct folio *folio = page_folio(page);
> atomic_t *mapped = &folio->_nr_pages_mapped;
> - int nr = 0, nr_pmdmapped = 0;
> - bool first;
> + unsigned int nr_pmdmapped = 0, first;
> + int nr = 0;
>
> - VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
> + VM_WARN_ON_FOLIO(compound && !folio_test_pmd_mappable(folio), folio);
>
> /* Is page being mapped by PTE? Is this its first map to be added? */
> if (likely(!compound)) {
> - first = atomic_inc_and_test(&page->_mapcount);
> - nr = first;
> - if (first && folio_test_large(folio)) {
> - nr = atomic_inc_return_relaxed(mapped);
> - nr = (nr < COMPOUND_MAPPED);
> - }
> + do {
> + first = atomic_inc_and_test(&page->_mapcount);
> + if (first && folio_test_large(folio)) {
> + first = atomic_inc_return_relaxed(mapped);
> + first = (nr < COMPOUND_MAPPED);
Sorry, I just noticed this typo, which was my bad. This should be:
first = (first < COMPOUND_MAPPED);


Regards
Yin, Fengwei

> + }
> +
> + if (first)
> + nr++;
> + } while (page++, --nr_pages > 0);
> } else if (folio_test_pmd_mappable(folio)) {
> /* That test is redundant: it's for safety or to optimize out */
>
> @@ -1356,6 +1364,30 @@ void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> mlock_vma_folio(folio, vma, compound);
> }
>
> +/**
> + * page_add_file_rmap - add pte mapping to a file page
> + * @page: the page to add the mapping to
> + * @vma: the vm area in which the mapping is added
> + * @compound: charge the page as compound or small page
> + *
> + * The caller needs to hold the pte lock.
> + */
> +void page_add_file_rmap(struct page *page, struct vm_area_struct *vma,
> + bool compound)
> +{
> + struct folio *folio = page_folio(page);
> + unsigned int nr_pages;
> +
> + VM_WARN_ON_ONCE_PAGE(compound && !PageTransHuge(page), page);
> +
> + if (likely(!compound))
> + nr_pages = 1;
> + else
> + nr_pages = folio_nr_pages(folio);
> +
> + folio_add_file_rmap_range(folio, page, nr_pages, vma, compound);
> +}
> +
> /**
> * page_remove_rmap - take down pte mapping from a page
> * @page: page to remove mapping from

2023-03-03 14:20:03

by Mike Rapoport

Subject: Re: [PATCH v3 00/34] New page table range API

On Tue, Feb 28, 2023 at 09:37:03PM +0000, Matthew Wilcox (Oracle) wrote:
> This patchset changes the API used by the MM to set up page table entries.
> The four APIs are:
> set_ptes(mm, addr, ptep, pte, nr)
> update_mmu_cache_range(vma, addr, ptep, nr)
> flush_dcache_folio(folio)
> flush_icache_pages(vma, page, nr)
>
> flush_dcache_folio() isn't technically new, but no architecture
> implemented it, so I've done that for you. The old APIs remain around
> but are mostly implemented by calling the new interfaces.
>
> The new APIs are based around setting up N page table entries at once.
> The N entries belong to the same PMD, the same folio and the same VMA,
> so ptep++ is a legitimate operation, and locking is taken care of for
> you. Some architectures can do a better job of it than just a loop,
> but I have hesitated to make too deep a change to architectures I don't
> understand well.

The new set_ptes() looks unnecessarily duplicated all over arch/.
What do you say about adding the patch below on top of the series?
Ideally it should be split into per-arch bits, but I can send it separately
as a cleanup on top.

diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 1e3354e9731b..65fb9e66675d 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -37,6 +37,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_val(pte) += 1UL << 32;
}
}
+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

/* PMD_SHIFT determines the size of the area a second-level page table can map */
diff --git a/arch/arc/include/asm/pgtable-bits-arcv2.h b/arch/arc/include/asm/pgtable-bits-arcv2.h
index 4a1b2ce204c6..06d8039180c0 100644
--- a/arch/arc/include/asm/pgtable-bits-arcv2.h
+++ b/arch/arc/include/asm/pgtable-bits-arcv2.h
@@ -100,19 +100,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
void update_mmu_cache_range(struct vm_area_struct *vma, unsigned long address,
pte_t *ptep, unsigned int nr);

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index 6525ac82bd50..0d326b201797 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -209,6 +209,7 @@ extern void __sync_icache_dcache(pte_t pteval);

void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval, unsigned int nr);
+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4d1b79dbff16..a8d6460c5c9f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -369,6 +369,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_val(pte) += PAGE_SIZE;
}
}
+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

/*
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index a30ae048233e..e426f1820deb 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -91,20 +91,6 @@ static inline void set_pte(pte_t *p, pte_t pte)
smp_mb();
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
static inline pte_t *pmd_page_vaddr(pmd_t pmd)
{
unsigned long ptr;
diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h
index f58f1d920769..67ab91662e83 100644
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -345,26 +345,6 @@ static inline int pte_exec(pte_t pte)
#define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT)
#define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))

-/*
- * set_ptes - update page table and do whatever magic may be
- * necessary to make the underlying hardware/firmware take note.
- *
- * VM may require a virtual instruction to alert the MMU.
- */
-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
static inline unsigned long pmd_page_vaddr(pmd_t pmd)
{
return (unsigned long)__va(pmd_val(pmd) & PAGE_MASK);
diff --git a/arch/ia64/include/asm/pgtable.h b/arch/ia64/include/asm/pgtable.h
index 0c2be4ea664b..65a6e3b30721 100644
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -303,19 +303,6 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
*ptep = pteval;
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, add, ptep, pte, 1)
-
/*
* Make page protection values cacheable, uncacheable, or write-
* combining. Note that "protection" is really a misnomer here as the
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 9154d317ffb4..d4b0ca7b4bf7 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -346,6 +346,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
}
}

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
diff --git a/arch/m68k/include/asm/pgtable_mm.h b/arch/m68k/include/asm/pgtable_mm.h
index 400206c17c97..8c2db20abdb6 100644
--- a/arch/m68k/include/asm/pgtable_mm.h
+++ b/arch/m68k/include/asm/pgtable_mm.h
@@ -32,20 +32,6 @@
*(pteptr) = (pteval); \
} while(0)

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
/* PMD_SHIFT determines the size of the area a second-level page table can map */
#if CONFIG_PGTABLE_LEVELS == 3
#define PMD_SHIFT 18
diff --git a/arch/microblaze/include/asm/pgtable.h b/arch/microblaze/include/asm/pgtable.h
index a01e1369b486..3e7643a986ad 100644
--- a/arch/microblaze/include/asm/pgtable.h
+++ b/arch/microblaze/include/asm/pgtable.h
@@ -335,20 +335,6 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
*ptep = pte;
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += 1 << PFN_SHIFT_OFFSET;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep)
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 0cf0455e6ae8..18b77567ef72 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -108,6 +108,7 @@ do { \
static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, unsigned int nr);

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

#if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 8a77821a17a5..2a994b225a41 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -193,6 +193,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
}
}

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

static inline int pmd_none(pmd_t pmd)
diff --git a/arch/openrisc/include/asm/pgtable.h b/arch/openrisc/include/asm/pgtable.h
index 1a7077150d7b..8f27730a9ab7 100644
--- a/arch/openrisc/include/asm/pgtable.h
+++ b/arch/openrisc/include/asm/pgtable.h
@@ -47,20 +47,6 @@ extern void paging_init(void);
*/
#define set_pte(pteptr, pteval) ((*(pteptr)) = (pteval))

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
/*
* (pmds are folded into pgds so this doesn't get actually called,
* but the define is needed for a generic inline function.)
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index 78ee9816f423..cd04e85cb012 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -73,6 +73,7 @@ extern void __update_cache(pte_t pte);
mb(); \
} while(0)

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

#endif /* !__ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index bf1263ff7e67..f10b6c2f8ade 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -43,6 +43,7 @@ struct mm_struct;

void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
pte_t pte, unsigned int nr);
+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
#define update_mmu_cache(vma, addr, ptep) \
update_mmu_cache_range(vma, addr, ptep, 1)
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 3a3a776fc047..8bc49496f8a6 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -473,6 +473,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte_val(pteval) += 1 << _PAGE_PFN_SHIFT;
}
}
+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

static inline void pte_clear(struct mm_struct *mm,
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 46bf475116f1..2fc20558af6b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1346,6 +1346,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
}
}

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

/*
diff --git a/arch/sh/include/asm/pgtable_32.h b/arch/sh/include/asm/pgtable_32.h
index 03ba1834e126..d2f17e944bea 100644
--- a/arch/sh/include/asm/pgtable_32.h
+++ b/arch/sh/include/asm/pgtable_32.h
@@ -319,6 +319,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
}
}

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

/*
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 47ae55ea1837..7fbc7772a9b7 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -101,20 +101,6 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
srmmu_swap((unsigned long *)ptep, pte_val(pteval));
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
static inline int srmmu_device_memory(unsigned long x)
{
return ((x & 0xF0000000) != 0);
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index d5c0088e0c6a..fddca662ba1b 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -924,6 +924,7 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
}
}

+#define set_ptes set_ptes
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)

#define pte_clear(mm,addr,ptep) \
diff --git a/arch/um/include/asm/pgtable.h b/arch/um/include/asm/pgtable.h
index ca78c90ae74f..60d2b20ff218 100644
--- a/arch/um/include/asm/pgtable.h
+++ b/arch/um/include/asm/pgtable.h
@@ -242,20 +242,6 @@ static inline void set_pte(pte_t *pteptr, pte_t pteval)
if(pte_present(*pteptr)) *pteptr = pte_mknewprot(*pteptr);
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
#define __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f424371ea143..1e5fd352880d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1019,22 +1019,6 @@ static inline pud_t native_local_pudp_get_and_clear(pud_t *pudp)
return res;
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- page_table_check_ptes_set(mm, addr, ptep, pte, nr);
-
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte = __pte(pte_val(pte) + PAGE_SIZE);
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
{
diff --git a/arch/xtensa/include/asm/pgtable.h b/arch/xtensa/include/asm/pgtable.h
index 293101530541..adeee96518b9 100644
--- a/arch/xtensa/include/asm/pgtable.h
+++ b/arch/xtensa/include/asm/pgtable.h
@@ -306,20 +306,6 @@ static inline void set_pte(pte_t *ptep, pte_t pte)
update_pte(ptep, pte);
}

-static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte, unsigned int nr)
-{
- for (;;) {
- set_pte(ptep, pte);
- if (--nr == 0)
- break;
- ptep++;
- pte_val(pte) += PAGE_SIZE;
- }
-}
-
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-
static inline void
set_pmd(pmd_t *pmdp, pmd_t pmdval)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c63cd44777ec..ef204712eda3 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -172,6 +172,24 @@ static inline int pmd_young(pmd_t pmd)
}
#endif

+#ifndef set_ptes
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ page_table_check_ptes_set(mm, addr, ptep, pte, nr);
+
+ for (;;) {
+ set_pte(ptep, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte = __pte(pte_val(pte) + PAGE_SIZE);
+ }
+}
+#define set_ptes set_ptes
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
extern int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,

--
Sincerely yours,
Mike.
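
With the generic fallback in place, a hypothetical new architecture only
has to provide set_pte(); everything else comes from <linux/pgtable.h>.
A sketch of an imaginary arch/foo/include/asm/pgtable.h fragment,
illustration only:

/* Imaginary architecture with no cache or TLB side effects to manage;
 * set_ptes() and set_pte_at() are inherited from <linux/pgtable.h>. */
static inline void set_pte(pte_t *ptep, pte_t pte)
{
	*ptep = pte;
}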

2023-03-05 10:16:20

by Geert Uytterhoeven

Subject: Re: [PATCH v3 00/34] New page table range API

Hi Willy,

On Tue, Feb 28, 2023 at 10:40 PM Matthew Wilcox (Oracle)
<[email protected]> wrote:
> This patchset changes the API used by the MM to set up page table entries.
> The four APIs are:
> set_ptes(mm, addr, ptep, pte, nr)
> update_mmu_cache_range(vma, addr, ptep, nr)
> flush_dcache_folio(folio)
> flush_icache_pages(vma, page, nr)
>
> flush_dcache_folio() isn't technically new, but no architecture
> implemented it, so I've done that for you. The old APIs remain around
> but are mostly implemented by calling the new interfaces.
>
> The new APIs are based around setting up N page table entries at once.
> The N entries belong to the same PMD, the same folio and the same VMA,
> so ptep++ is a legitimate operation, and locking is taken care of for
> you. Some architectures can do a better job of it than just a loop,
> but I have hesitated to make too deep a change to architectures I don't
> understand well.
>
> One thing I have changed in every architecture is that PG_arch_1 is now a
> per-folio bit instead of a per-page bit. This was something that would
> have to happen eventually, and it makes sense to do it now rather than
> iterate over every page involved in a cache flush and figure out if it
> needs to happen.
>
> The point of all this is better performance, and Fengwei Yin has
> measured improvement on x86. I suspect you'll see improvement on
> your architecture too. Try the new will-it-scale test mentioned here:
> https://lore.kernel.org/linux-mm/[email protected]/
> You'll need to run it on an XFS filesystem and have
> CONFIG_TRANSPARENT_HUGEPAGE set.

Thanks for your series!

> For testing, I've only run the code on x86. If an x86->foo compiler
> exists in Debian, I've built defconfig. I'm relying on the buildbots
> to tell me what I missed, and people who actually have the hardware to
> tell me if it actually works.

Seems to work fine on ARAnyM and qemu-system-m68k/virt, so
Tested-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-03-09 11:14:20

by Ryan Roberts

Subject: Re: [PATCH v3 00/34] New page table range API

On 28/02/2023 21:37, Matthew Wilcox (Oracle) wrote:
> This patchset changes the API used by the MM to set up page table entries.
> The four APIs are:
> set_ptes(mm, addr, ptep, pte, nr)
> update_mmu_cache_range(vma, addr, ptep, nr)
> flush_dcache_folio(folio)
> flush_icache_pages(vma, page, nr)
>
> flush_dcache_folio() isn't technically new, but no architecture
> implemented it, so I've done that for you. The old APIs remain around
> but are mostly implemented by calling the new interfaces.
>
> The new APIs are based around setting up N page table entries at once.
> The N entries belong to the same PMD, the same folio and the same VMA,
> so ptep++ is a legitimate operation, and locking is taken care of for
> you. Some architectures can do a better job of it than just a loop,
> but I have hesitated to make too deep a change to architectures I don't
> understand well.
>
> One thing I have changed in every architecture is that PG_arch_1 is now a
> per-folio bit instead of a per-page bit. This was something that would
> have to happen eventually, and it makes sense to do it now rather than
> iterate over every page involved in a cache flush and figure out if it
> needs to happen.
>
> The point of all this is better performance, and Fengwei Yin has
> measured improvement on x86. I suspect you'll see improvement on
> your architecture too. Try the new will-it-scale test mentioned here:
> https://lore.kernel.org/linux-mm/[email protected]/
> You'll need to run it on an XFS filesystem and have
> CONFIG_TRANSPARENT_HUGEPAGE set.
>
> For testing, I've only run the code on x86. If an x86->foo compiler
> exists in Debian, I've built defconfig. I'm relying on the buildbots
> to tell me what I missed, and people who actually have the hardware to
> tell me if it actually works.
>
> I'd like to get this into the MM tree soon after the current merge window
> closes, so quick feedback would be appreciated.

I've boot-tested the series (with Yin's typo fix for patch 32) on arm64 FVP
and Ampere Altra. On the Altra, I also ran page_fault4 from will-it-scale, and
saw ~35% improvement from this series. So:

Tested-by: Ryan Roberts <[email protected]>

Thanks,
Ryan