2023-07-10 21:13:23

by Matthew Wilcox

Subject: [PATCH v5 00/38] New page table range API

This patchset changes the API used by the MM to set up page table entries.
The four APIs are:
set_ptes(mm, addr, ptep, pte, nr)
update_mmu_cache_range(vma, addr, ptep, nr)
flush_dcache_folio(folio)
flush_icache_pages(vma, page, nr)

flush_dcache_folio() isn't technically new, but no architecture
implemented it, so I've done that for them. The old APIs remain around
but are mostly implemented by calling the new interfaces.

The new APIs are based around setting up N page table entries at once.
The N entries belong to the same PMD, the same folio and the same VMA,
so ptep++ is a legitimate operation, and locking is taken care of for
you. Some architectures can do a better job of it than just a loop,
but I have hesitated to make too deep a change to architectures I don't
understand well.
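
As a quick illustration of the intended calling convention, here is a
minimal sketch; it is not taken from any patch in this series, and the
helper name map_folio_pages() and its arguments are made up for the
example.  Note that in v5 update_mmu_cache_range() also gained a
struct vm_fault pointer as its first argument (see the changelog
below); NULL is what the non-fault paths pass.

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <linux/cacheflush.h>

/*
 * Illustrative sketch only -- not part of this series.  Map 'nr'
 * consecutive pages of one folio with the batched calls.  The caller
 * holds the page table lock, and all 'nr' entries lie in the same
 * PMD, folio and VMA, which is the contract the new API relies on.
 */
static void map_folio_pages(struct vm_area_struct *vma, unsigned long addr,
			    pte_t *ptep, struct page *page, unsigned int nr)
{
	pte_t pte = mk_pte(page, vma->vm_page_prot);

	flush_icache_pages(vma, page, nr);		/* batched icache maintenance */
	set_ptes(vma->vm_mm, addr, ptep, pte, nr);	/* one call sets nr entries */
	update_mmu_cache_range(NULL, vma, addr, ptep, nr);
}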

One thing I have changed in every architecture is that PG_arch_1 is now a
per-folio bit instead of a per-page bit. This was something that would
have to happen eventually, and it makes sense to do it now rather than
iterate over every page involved in a cache flush and figure out if it
needs to happen.

The point of all this is better performance, and Fengwei Yin has
measured improvement on x86. I suspect you'll see improvement on
your architecture too. Try the new will-it-scale test mentioned here:
https://lore.kernel.org/linux-mm/[email protected]/
You'll need to run it on an XFS filesystem and have
CONFIG_TRANSPARENT_HUGEPAGE set.

This patchset is the basis for much of the anonymous large folio work
being done by Ryan Roberts, so it's received quite a lot of testing over
the last few months. My thanks to Ryan & Fengwei Yin for all their help
with this patchset.

v5:
- Add in_range() macro
- Fix numerous compilation problems on minority architectures (thanks LKP!)
- Add the 'vmf' argument to update_mmu_cache_range() to help MIPS
and other architectures that insert TLB entries in software,
rather than using a hardware page table walker
- Get rid of first_map_page() and next_map_page(); use
next_uptodate_folio() directly
- Actually move the mmap_miss accounting in filemap.c
- Add kernel-doc for set_pte_range()
- Correct determination of prefaulting in set_pte_range()
- More Acked & Reviewed tags

v4:
- Fix a few compile errors (mostly Mike Rapoport)
- Incorporate Mike's suggestion to avoid having to define set_ptes()
or set_pte_at() on the majority of architectures
- Optimise m68k's __flush_pages_to_ram (Geert Uytterhoeven)
- Fix sun3 (me)
- Fix sparc32 (me)
- Pick up a few more Ack/Reviewed tags

v3:
- Reinstate flush_dcache_icache_phys() on PowerPC
- Fix folio_flush_mapping(). The documentation was correct and the
implementation was completely wrong
- Change the flush_dcache_page() documentation to describe
flush_dcache_folio() instead
- Split ARM from ARC. I messed up my git commands
- Remove page_mapping_file()
- Rationalise how flush_icache_pages() and flush_icache_page() are defined
- Use flush_icache_pages() in do_set_pmd()
- Pick up Guo Ren's Ack for csky

Matthew Wilcox (Oracle) (34):
minmax: Add in_range() macro
mm: Convert page_table_check_pte_set() to page_table_check_ptes_set()
mm: Add generic flush_icache_pages() and documentation
mm: Add folio_flush_mapping()
mm: Remove ARCH_IMPLEMENTS_FLUSH_DCACHE_FOLIO
mm: Add default definition of set_ptes()
alpha: Implement the new page table range API
arc: Implement the new page table range API
arm: Implement the new page table range API
arm64: Implement the new page table range API
csky: Implement the new page table range API
hexagon: Implement the new page table range API
ia64: Implement the new page table range API
loongarch: Implement the new page table range API
m68k: Implement the new page table range API
microblaze: Implement the new page table range API
mips: Implement the new page table range API
nios2: Implement the new page table range API
openrisc: Implement the new page table range API
parisc: Implement the new page table range API
powerpc: Implement the new page table range API
riscv: Implement the new page table range API
s390: Implement the new page table range API
sh: Implement the new page table range API
sparc32: Implement the new page table range API
sparc64: Implement the new page table range API
um: Implement the new page table range API
x86: Implement the new page table range API
xtensa: Implement the new page table range API
mm: Remove page_mapping_file()
mm: Rationalise flush_icache_pages() and flush_icache_page()
mm: Tidy up set_ptes definition
mm: Use flush_icache_pages() in do_set_pmd()
mm: Call update_mmu_cache_range() in more page fault handling paths

Yin Fengwei (4):
filemap: Add filemap_map_folio_range()
rmap: add folio_add_file_rmap_range()
mm: Convert do_set_pte() to set_pte_range()
filemap: Batch PTE mappings

Documentation/core-api/cachetlb.rst | 55 ++++-----
Documentation/filesystems/locking.rst | 2 +-
arch/alpha/include/asm/cacheflush.h | 13 +-
arch/alpha/include/asm/pgtable.h | 10 +-
arch/arc/include/asm/cacheflush.h | 14 +--
arch/arc/include/asm/pgtable-bits-arcv2.h | 12 +-
arch/arc/include/asm/pgtable-levels.h | 1 +
arch/arc/mm/cache.c | 61 +++++----
arch/arc/mm/tlb.c | 18 +--
arch/arm/include/asm/cacheflush.h | 29 +++--
arch/arm/include/asm/pgtable.h | 5 +-
arch/arm/include/asm/tlbflush.h | 14 ++-
arch/arm/mm/copypage-v4mc.c | 5 +-
arch/arm/mm/copypage-v6.c | 5 +-
arch/arm/mm/copypage-xscale.c | 5 +-
arch/arm/mm/dma-mapping.c | 24 ++--
arch/arm/mm/fault-armv.c | 16 +--
arch/arm/mm/flush.c | 99 +++++++++------
arch/arm/mm/mm.h | 2 +-
arch/arm/mm/mmu.c | 14 ++-
arch/arm/mm/nommu.c | 6 +
arch/arm64/include/asm/cacheflush.h | 4 +-
arch/arm64/include/asm/pgtable.h | 26 ++--
arch/arm64/mm/flush.c | 36 +++---
arch/csky/abiv1/cacheflush.c | 32 +++--
arch/csky/abiv1/inc/abi/cacheflush.h | 3 +-
arch/csky/abiv2/cacheflush.c | 32 ++---
arch/csky/abiv2/inc/abi/cacheflush.h | 11 +-
arch/csky/include/asm/pgtable.h | 8 +-
arch/hexagon/include/asm/cacheflush.h | 10 +-
arch/hexagon/include/asm/pgtable.h | 9 +-
arch/ia64/hp/common/sba_iommu.c | 26 ++--
arch/ia64/include/asm/cacheflush.h | 14 ++-
arch/ia64/include/asm/pgtable.h | 4 +-
arch/ia64/mm/init.c | 28 +++--
arch/loongarch/include/asm/cacheflush.h | 1 -
arch/loongarch/include/asm/pgtable-bits.h | 4 +-
arch/loongarch/include/asm/pgtable.h | 33 ++---
arch/loongarch/mm/pgtable.c | 2 +-
arch/loongarch/mm/tlb.c | 2 +-
arch/m68k/include/asm/cacheflush_mm.h | 26 ++--
arch/m68k/include/asm/mcf_pgtable.h | 1 +
arch/m68k/include/asm/motorola_pgtable.h | 1 +
arch/m68k/include/asm/pgtable_mm.h | 10 +-
arch/m68k/include/asm/sun3_pgtable.h | 1 +
arch/m68k/mm/motorola.c | 2 +-
arch/microblaze/include/asm/cacheflush.h | 8 ++
arch/microblaze/include/asm/pgtable.h | 15 +--
arch/microblaze/include/asm/tlbflush.h | 4 +-
arch/mips/bcm47xx/prom.c | 2 +-
arch/mips/include/asm/cacheflush.h | 32 ++---
arch/mips/include/asm/pgtable-32.h | 10 +-
arch/mips/include/asm/pgtable-64.h | 6 +-
arch/mips/include/asm/pgtable-bits.h | 6 +-
arch/mips/include/asm/pgtable.h | 63 ++++++----
arch/mips/mm/c-r4k.c | 5 +-
arch/mips/mm/cache.c | 56 ++++-----
arch/mips/mm/init.c | 21 ++--
arch/mips/mm/pgtable-32.c | 2 +-
arch/mips/mm/pgtable-64.c | 2 +-
arch/mips/mm/tlbex.c | 2 +-
arch/nios2/include/asm/cacheflush.h | 6 +-
arch/nios2/include/asm/pgtable.h | 28 +++--
arch/nios2/mm/cacheflush.c | 79 ++++++------
arch/openrisc/include/asm/cacheflush.h | 8 +-
arch/openrisc/include/asm/pgtable.h | 15 ++-
arch/openrisc/mm/cache.c | 12 +-
arch/parisc/include/asm/cacheflush.h | 14 ++-
arch/parisc/include/asm/pgtable.h | 37 +++---
arch/parisc/kernel/cache.c | 107 +++++++++++-----
arch/powerpc/include/asm/book3s/32/pgtable.h | 5 -
arch/powerpc/include/asm/book3s/64/pgtable.h | 6 +-
arch/powerpc/include/asm/book3s/pgtable.h | 11 +-
arch/powerpc/include/asm/cacheflush.h | 14 ++-
arch/powerpc/include/asm/kvm_ppc.h | 10 +-
arch/powerpc/include/asm/nohash/pgtable.h | 16 +--
arch/powerpc/include/asm/pgtable.h | 12 ++
arch/powerpc/mm/book3s64/hash_utils.c | 11 +-
arch/powerpc/mm/cacheflush.c | 40 ++----
arch/powerpc/mm/nohash/e500_hugetlbpage.c | 3 +-
arch/powerpc/mm/pgtable.c | 51 ++++----
arch/riscv/include/asm/cacheflush.h | 19 ++-
arch/riscv/include/asm/pgtable.h | 38 ++++--
arch/riscv/mm/cacheflush.c | 13 +-
arch/s390/include/asm/pgtable.h | 33 +++--
arch/sh/include/asm/cacheflush.h | 21 ++--
arch/sh/include/asm/pgtable.h | 7 +-
arch/sh/include/asm/pgtable_32.h | 5 +-
arch/sh/mm/cache-j2.c | 4 +-
arch/sh/mm/cache-sh4.c | 26 ++--
arch/sh/mm/cache-sh7705.c | 26 ++--
arch/sh/mm/cache.c | 52 ++++----
arch/sh/mm/kmap.c | 3 +-
arch/sparc/include/asm/cacheflush_32.h | 9 +-
arch/sparc/include/asm/cacheflush_64.h | 19 +--
arch/sparc/include/asm/pgtable_32.h | 8 +-
arch/sparc/include/asm/pgtable_64.h | 24 +++-
arch/sparc/kernel/smp_64.c | 56 ++++++---
arch/sparc/mm/init_32.c | 13 +-
arch/sparc/mm/init_64.c | 78 +++++++-----
arch/sparc/mm/tlb.c | 5 +-
arch/um/include/asm/pgtable.h | 7 +-
arch/x86/include/asm/pgtable.h | 14 +--
arch/xtensa/include/asm/cacheflush.h | 11 +-
arch/xtensa/include/asm/pgtable.h | 18 ++-
arch/xtensa/mm/cache.c | 83 +++++++------
include/asm-generic/cacheflush.h | 7 --
include/linux/cacheflush.h | 13 +-
include/linux/minmax.h | 26 ++++
include/linux/mm.h | 3 +-
include/linux/page_table_check.h | 14 +--
include/linux/pagemap.h | 28 +++--
include/linux/pgtable.h | 31 +++++
include/linux/rmap.h | 2 +
mm/filemap.c | 123 +++++++++++--------
mm/memory.c | 56 +++++----
mm/page_table_check.c | 14 ++-
mm/rmap.c | 60 ++++++---
mm/util.c | 2 +-
119 files changed, 1452 insertions(+), 974 deletions(-)

--
2.39.2



2023-07-10 21:14:10

by Matthew Wilcox

Subject: [PATCH v5 01/38] minmax: Add in_range() macro

Determine if a value lies within a range more efficiently (subtraction +
comparison vs two comparisons and an AND). It also has useful (under
some circumstances) behaviour if the range exceeds the maximum value of
the type.
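
As a stand-alone illustration (not part of the patch), the userspace
snippet below shows the ordinary case where both forms agree, and a
wrap-around case where they give different answers, which is exactly
the caveat documented in the kernel-doc comment:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same trick as the patch: one subtraction and one comparison. */
static bool in_range32(uint32_t val, uint32_t start, uint32_t len)
{
	return (val - start) < len;
}

/* Naive two-comparison form; the addition wraps modulo 2^32. */
static bool naive32(uint32_t val, uint32_t start, uint32_t len)
{
	return start <= val && val < (uint32_t)(start + len);
}

int main(void)
{
	/* Ordinary case: both return true. */
	printf("%d %d\n", in_range32(15, 10, 10), naive32(15, 10, 10));

	/*
	 * start + len wraps past UINT32_MAX: in_range32() treats the
	 * range as wrapping and accepts 5 (5 - 0xFFFFFFF0 == 0x15 < 0x20),
	 * while the naive form rejects it because start <= 5 is false.
	 */
	printf("%d %d\n", in_range32(5, 0xFFFFFFF0u, 0x20),
			  naive32(5, 0xFFFFFFF0u, 0x20));
	return 0;
}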

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
include/linux/minmax.h | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)

diff --git a/include/linux/minmax.h b/include/linux/minmax.h
index 396df1121bff..028069a1f7ef 100644
--- a/include/linux/minmax.h
+++ b/include/linux/minmax.h
@@ -158,6 +158,32 @@
*/
#define clamp_val(val, lo, hi) clamp_t(typeof(val), val, lo, hi)

+static inline bool in_range64(u64 val, u64 start, u64 len)
+{
+ return (val - start) < len;
+}
+
+static inline bool in_range32(u32 val, u32 start, u32 len)
+{
+ return (val - start) < len;
+}
+
+/**
+ * in_range - Determine if a value lies within a range.
+ * @val: Value to test.
+ * @start: First value in range.
+ * @len: Number of values in range.
+ *
+ * This is more efficient than "if (start <= val && val < (start + len))".
+ * It also gives a different answer if @start + @len overflows the size of
+ * the type by a sufficient amount to encompass @val. Decide for yourself
+ * which behaviour you want, or prove that start + len never overflow.
+ * Do not blindly replace one form with the other.
+ */
+#define in_range(val, start, len) \
+ sizeof(start) <= sizeof(u32) ? in_range32(val, start, len) : \
+ in_range64(val, start, len)
+
/**
* swap - swap values of @a and @b
* @a: first value
--
2.39.2


2023-07-10 21:14:11

by Matthew Wilcox

Subject: [PATCH v5 23/38] s390: Implement the new page table range API

Add set_ptes() and update_mmu_cache_range().

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Gerald Schaefer <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Vasily Gorbik <[email protected]>
Cc: Alexander Gordeev <[email protected]>
Cc: Christian Borntraeger <[email protected]>
Cc: Sven Schnelle <[email protected]>
Cc: [email protected]
---
arch/s390/include/asm/pgtable.h | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index c55f3c3365af..02973c740a5b 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -47,6 +47,7 @@ static inline void update_page_count(int level, long count)
* tables contain all the necessary information.
*/
#define update_mmu_cache(vma, address, ptep) do { } while (0)
+#define update_mmu_cache_range(vmf, vma, addr, ptep, nr) do { } while (0)
#define update_mmu_cache_pmd(vma, address, ptep) do { } while (0)

/*
@@ -1316,20 +1317,34 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
pgprot_t pgprot_writethrough(pgprot_t prot);

/*
- * Certain architectures need to do special things when PTEs
- * within a page table are directly modified. Thus, the following
- * hook is made available.
+ * Set multiple PTEs to consecutive pages with a single call. All PTEs
+ * are within the same folio, PMD and VMA.
*/
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t entry)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t entry, unsigned int nr)
{
if (pte_present(entry))
entry = clear_pte_bit(entry, __pgprot(_PAGE_UNUSED));
- if (mm_has_pgste(mm))
- ptep_set_pte_at(mm, addr, ptep, entry);
- else
- set_pte(ptep, entry);
+ if (mm_has_pgste(mm)) {
+ for (;;) {
+ ptep_set_pte_at(mm, addr, ptep, entry);
+ if (--nr == 0)
+ break;
+ ptep++;
+ entry = __pte(pte_val(entry) + PAGE_SIZE);
+ addr += PAGE_SIZE;
+ }
+ } else {
+ for (;;) {
+ set_pte(ptep, entry);
+ if (--nr == 0)
+ break;
+ ptep++;
+ entry = __pte(pte_val(entry) + PAGE_SIZE);
+ }
+ }
}
+#define set_ptes set_ptes

/*
* Conversion functions: convert a page and protection to a page entry,
--
2.39.2


2023-07-10 21:14:13

by Matthew Wilcox

Subject: [PATCH v5 20/38] parisc: Implement the new page table range API

Add set_ptes(), update_mmu_cache_range(), flush_dcache_folio()
and flush_icache_pages(). Change the PG_arch_1 (aka PG_dcache_dirty) flag
from being per-page to per-folio.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: [email protected]
---
arch/parisc/include/asm/cacheflush.h | 14 ++--
arch/parisc/include/asm/pgtable.h | 37 +++++----
arch/parisc/kernel/cache.c | 107 ++++++++++++++++++---------
3 files changed, 105 insertions(+), 53 deletions(-)

diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index c8b6928cee1e..b77c3e0c37d3 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -43,8 +43,13 @@ void invalidate_kernel_vmap_range(void *vaddr, int size);
#define flush_cache_vmap(start, end) flush_cache_all()
#define flush_cache_vunmap(start, end) flush_cache_all()

+void flush_dcache_folio(struct folio *folio);
+#define flush_dcache_folio flush_dcache_folio
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-void flush_dcache_page(struct page *page);
+static inline void flush_dcache_page(struct page *page)
+{
+ flush_dcache_folio(page_folio(page));
+}

#define flush_dcache_mmap_lock(mapping) xa_lock_irq(&mapping->i_pages)
#define flush_dcache_mmap_unlock(mapping) xa_unlock_irq(&mapping->i_pages)
@@ -53,10 +58,9 @@ void flush_dcache_page(struct page *page);
#define flush_dcache_mmap_unlock_irqrestore(mapping, flags) \
xa_unlock_irqrestore(&mapping->i_pages, flags)

-#define flush_icache_page(vma,page) do { \
- flush_kernel_dcache_page_addr(page_address(page)); \
- flush_kernel_icache_page(page_address(page)); \
-} while (0)
+void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
+ unsigned int nr);
+#define flush_icache_page(vma, page) flush_icache_pages(vma, page, 1)

#define flush_icache_range(s,e) do { \
flush_kernel_dcache_range_asm(s,e); \
diff --git a/arch/parisc/include/asm/pgtable.h b/arch/parisc/include/asm/pgtable.h
index 5656395c95ee..ce38bb375b60 100644
--- a/arch/parisc/include/asm/pgtable.h
+++ b/arch/parisc/include/asm/pgtable.h
@@ -73,15 +73,6 @@ extern void __update_cache(pte_t pte);
mb(); \
} while(0)

-#define set_pte_at(mm, addr, pteptr, pteval) \
- do { \
- if (pte_present(pteval) && \
- pte_user(pteval)) \
- __update_cache(pteval); \
- *(pteptr) = (pteval); \
- purge_tlb_entries(mm, addr); \
- } while (0)
-
#endif /* !__ASSEMBLY__ */

#define pte_ERROR(e) \
@@ -285,7 +276,7 @@ extern unsigned long *empty_zero_page;
#define pte_none(x) (pte_val(x) == 0)
#define pte_present(x) (pte_val(x) & _PAGE_PRESENT)
#define pte_user(x) (pte_val(x) & _PAGE_USER)
-#define pte_clear(mm, addr, xp) set_pte_at(mm, addr, xp, __pte(0))
+#define pte_clear(mm, addr, xp) set_pte(xp, __pte(0))

#define pmd_flag(x) (pmd_val(x) & PxD_FLAG_MASK)
#define pmd_address(x) ((unsigned long)(pmd_val(x) &~ PxD_FLAG_MASK) << PxD_VALUE_SHIFT)
@@ -391,11 +382,29 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)

extern void paging_init (void);

+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ if (pte_present(pte) && pte_user(pte))
+ __update_cache(pte);
+ for (;;) {
+ *ptep = pte;
+ purge_tlb_entries(mm, addr);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte_val(pte) += 1 << PFN_PTE_SHIFT;
+ addr += PAGE_SIZE;
+ }
+}
+#define set_ptes set_ptes
+
/* Used for deferring calls to flush_dcache_page() */

#define PG_dcache_dirty PG_arch_1

-#define update_mmu_cache(vms,addr,ptep) __update_cache(*ptep)
+#define update_mmu_cache_range(vmf, vma, addr, ptep, nr) __update_cache(*ptep)
+#define update_mmu_cache(vma, addr, ptep) __update_cache(*ptep)

/*
* Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
@@ -450,7 +459,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, unsigned
if (!pte_young(pte)) {
return 0;
}
- set_pte_at(vma->vm_mm, addr, ptep, pte_mkold(pte));
+ set_pte(ptep, pte_mkold(pte));
return 1;
}

@@ -460,14 +469,14 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
pte_t old_pte;

old_pte = *ptep;
- set_pte_at(mm, addr, ptep, __pte(0));
+ set_pte(ptep, __pte(0));

return old_pte;
}

static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
- set_pte_at(mm, addr, ptep, pte_wrprotect(*ptep));
+ set_pte(ptep, pte_wrprotect(*ptep));
}

#define pte_same(A,B) (pte_val(A) == pte_val(B))
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index b55b35c89d6a..442109a48940 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -94,11 +94,11 @@ static inline void flush_data_cache(void)
/* Kernel virtual address of pfn. */
#define pfn_va(pfn) __va(PFN_PHYS(pfn))

-void
-__update_cache(pte_t pte)
+void __update_cache(pte_t pte)
{
unsigned long pfn = pte_pfn(pte);
- struct page *page;
+ struct folio *folio;
+ unsigned int nr;

/* We don't have pte special. As a result, we can be called with
an invalid pfn and we don't need to flush the kernel dcache page.
@@ -106,13 +106,17 @@ __update_cache(pte_t pte)
if (!pfn_valid(pfn))
return;

- page = pfn_to_page(pfn);
- if (page_mapping_file(page) &&
- test_bit(PG_dcache_dirty, &page->flags)) {
- flush_kernel_dcache_page_addr(pfn_va(pfn));
- clear_bit(PG_dcache_dirty, &page->flags);
+ folio = page_folio(pfn_to_page(pfn));
+ pfn = folio_pfn(folio);
+ nr = folio_nr_pages(folio);
+ if (folio_flush_mapping(folio) &&
+ test_bit(PG_dcache_dirty, &folio->flags)) {
+ while (nr--)
+ flush_kernel_dcache_page_addr(pfn_va(pfn + nr));
+ clear_bit(PG_dcache_dirty, &folio->flags);
} else if (parisc_requires_coherency())
- flush_kernel_dcache_page_addr(pfn_va(pfn));
+ while (nr--)
+ flush_kernel_dcache_page_addr(pfn_va(pfn + nr));
}

void
@@ -366,6 +370,20 @@ static void flush_user_cache_page(struct vm_area_struct *vma, unsigned long vmad
preempt_enable();
}

+void flush_icache_pages(struct vm_area_struct *vma, struct page *page,
+ unsigned int nr)
+{
+ void *kaddr = page_address(page);
+
+ for (;;) {
+ flush_kernel_dcache_page_addr(kaddr);
+ flush_kernel_icache_page(kaddr);
+ if (--nr == 0)
+ break;
+ kaddr += PAGE_SIZE;
+ }
+}
+
static inline pte_t *get_ptep(struct mm_struct *mm, unsigned long addr)
{
pte_t *ptep = NULL;
@@ -394,27 +412,30 @@ static inline bool pte_needs_flush(pte_t pte)
== (_PAGE_PRESENT | _PAGE_ACCESSED);
}

-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
{
- struct address_space *mapping = page_mapping_file(page);
- struct vm_area_struct *mpnt;
- unsigned long offset;
+ struct address_space *mapping = folio_flush_mapping(folio);
+ struct vm_area_struct *vma;
unsigned long addr, old_addr = 0;
+ void *kaddr;
unsigned long count = 0;
- unsigned long flags;
+ unsigned long i, nr, flags;
pgoff_t pgoff;

if (mapping && !mapping_mapped(mapping)) {
- set_bit(PG_dcache_dirty, &page->flags);
+ set_bit(PG_dcache_dirty, &folio->flags);
return;
}

- flush_kernel_dcache_page_addr(page_address(page));
+ nr = folio_nr_pages(folio);
+ kaddr = folio_address(folio);
+ for (i = 0; i < nr; i++)
+ flush_kernel_dcache_page_addr(kaddr + i * PAGE_SIZE);

if (!mapping)
return;

- pgoff = page->index;
+ pgoff = folio->index;

/*
* We have carefully arranged in arch_get_unmapped_area() that
@@ -424,20 +445,33 @@ void flush_dcache_page(struct page *page)
* on machines that support equivalent aliasing
*/
flush_dcache_mmap_lock_irqsave(mapping, flags);
- vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
- offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
- addr = mpnt->vm_start + offset;
- if (parisc_requires_coherency()) {
- bool needs_flush = false;
- pte_t *ptep;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff + nr - 1) {
+ unsigned long offset = pgoff - vma->vm_pgoff;
+ unsigned long pfn = folio_pfn(folio);
+
+ addr = vma->vm_start;
+ nr = folio_nr_pages(folio);
+ if (offset > -nr) {
+ pfn -= offset;
+ nr += offset;
+ } else {
+ addr += offset * PAGE_SIZE;
+ }
+ if (addr + nr * PAGE_SIZE > vma->vm_end)
+ nr = (vma->vm_end - addr) / PAGE_SIZE;

- ptep = get_ptep(mpnt->vm_mm, addr);
- if (ptep) {
- needs_flush = pte_needs_flush(*ptep);
+ if (parisc_requires_coherency()) {
+ for (i = 0; i < nr; i++) {
+ pte_t *ptep = get_ptep(vma->vm_mm,
+ addr + i * PAGE_SIZE);
+ if (!ptep)
+ continue;
+ if (pte_needs_flush(*ptep))
+ flush_user_cache_page(vma,
+ addr + i * PAGE_SIZE);
+ /* Optimise accesses to the same table? */
pte_unmap(ptep);
}
- if (needs_flush)
- flush_user_cache_page(mpnt, addr);
} else {
/*
* The TLB is the engine of coherence on parisc:
@@ -450,27 +484,32 @@ void flush_dcache_page(struct page *page)
* in (until the user or kernel specifically
* accesses it, of course)
*/
- flush_tlb_page(mpnt, addr);
+ for (i = 0; i < nr; i++)
+ flush_tlb_page(vma, addr + i * PAGE_SIZE);
if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1))
!= (addr & (SHM_COLOUR - 1))) {
- __flush_cache_page(mpnt, addr, page_to_phys(page));
+ for (i = 0; i < nr; i++)
+ __flush_cache_page(vma,
+ addr + i * PAGE_SIZE,
+ (pfn + i) * PAGE_SIZE);
/*
* Software is allowed to have any number
* of private mappings to a page.
*/
- if (!(mpnt->vm_flags & VM_SHARED))
+ if (!(vma->vm_flags & VM_SHARED))
continue;
if (old_addr)
pr_err("INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n",
- old_addr, addr, mpnt->vm_file);
- old_addr = addr;
+ old_addr, addr, vma->vm_file);
+ if (nr == folio_nr_pages(folio))
+ old_addr = addr;
}
}
WARN_ON(++count == 4096);
}
flush_dcache_mmap_unlock_irqrestore(mapping, flags);
}
-EXPORT_SYMBOL(flush_dcache_page);
+EXPORT_SYMBOL(flush_dcache_folio);

/* Defined in arch/parisc/kernel/pacache.S */
EXPORT_SYMBOL(flush_kernel_dcache_range_asm);
--
2.39.2


2023-07-10 21:14:52

by Matthew Wilcox

Subject: [PATCH v5 11/38] csky: Implement the new page table range API

Add PFN_PTE_SHIFT, update_mmu_cache_range() and flush_dcache_folio().
Change the PG_dcache_clean flag from being per-page to per-folio.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Guo Ren <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: [email protected]
---
arch/csky/abiv1/cacheflush.c | 32 +++++++++++++++++-----------
arch/csky/abiv1/inc/abi/cacheflush.h | 2 ++
arch/csky/abiv2/cacheflush.c | 32 ++++++++++++++--------------
arch/csky/abiv2/inc/abi/cacheflush.h | 10 +++++++--
arch/csky/include/asm/pgtable.h | 8 ++++---
5 files changed, 50 insertions(+), 34 deletions(-)

diff --git a/arch/csky/abiv1/cacheflush.c b/arch/csky/abiv1/cacheflush.c
index 94fbc03cbe70..171e8fb32285 100644
--- a/arch/csky/abiv1/cacheflush.c
+++ b/arch/csky/abiv1/cacheflush.c
@@ -15,45 +15,51 @@

#define PG_dcache_clean PG_arch_1

-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
{
struct address_space *mapping;

- if (page == ZERO_PAGE(0))
+ if (is_zero_pfn(folio_pfn(folio)))
return;

- mapping = page_mapping_file(page);
+ mapping = folio_flush_mapping(folio);

- if (mapping && !page_mapcount(page))
- clear_bit(PG_dcache_clean, &page->flags);
+ if (mapping && !folio_mapped(folio))
+ clear_bit(PG_dcache_clean, &folio->flags);
else {
dcache_wbinv_all();
if (mapping)
icache_inv_all();
- set_bit(PG_dcache_clean, &page->flags);
+ set_bit(PG_dcache_clean, &folio->flags);
}
}
+EXPORT_SYMBOL(flush_dcache_folio);
+
+void flush_dcache_page(struct page *page)
+{
+ flush_dcache_folio(page_folio(page));
+}
EXPORT_SYMBOL(flush_dcache_page);

-void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
- pte_t *ptep)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr)
{
unsigned long pfn = pte_pfn(*ptep);
- struct page *page;
+ struct folio *folio;

flush_tlb_page(vma, addr);

if (!pfn_valid(pfn))
return;

- page = pfn_to_page(pfn);
- if (page == ZERO_PAGE(0))
+ if (is_zero_pfn(pfn))
return;

- if (!test_and_set_bit(PG_dcache_clean, &page->flags))
+ folio = page_folio(pfn_to_page(pfn));
+ if (!test_and_set_bit(PG_dcache_clean, &folio->flags))
dcache_wbinv_all();

- if (page_mapping_file(page)) {
+ if (folio_flush_mapping(folio)) {
if (vma->vm_flags & VM_EXEC)
icache_inv_all();
}
diff --git a/arch/csky/abiv1/inc/abi/cacheflush.h b/arch/csky/abiv1/inc/abi/cacheflush.h
index ed62e2066ba7..0d6cb65624c4 100644
--- a/arch/csky/abiv1/inc/abi/cacheflush.h
+++ b/arch/csky/abiv1/inc/abi/cacheflush.h
@@ -9,6 +9,8 @@

#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
extern void flush_dcache_page(struct page *);
+void flush_dcache_folio(struct folio *);
+#define flush_dcache_folio flush_dcache_folio

#define flush_cache_mm(mm) dcache_wbinv_all()
#define flush_cache_page(vma, page, pfn) cache_wbinv_all()
diff --git a/arch/csky/abiv2/cacheflush.c b/arch/csky/abiv2/cacheflush.c
index 9923cd24db58..d05a551af5d5 100644
--- a/arch/csky/abiv2/cacheflush.c
+++ b/arch/csky/abiv2/cacheflush.c
@@ -7,32 +7,32 @@
#include <asm/cache.h>
#include <asm/tlbflush.h>

-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
- pte_t *pte)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+ unsigned long address, pte_t *pte, unsigned int nr)
{
- unsigned long addr;
- struct page *page;
+ unsigned long pfn = pte_pfn(*pte);
+ struct folio *folio;
+ unsigned int i;

flush_tlb_page(vma, address);

- if (!pfn_valid(pte_pfn(*pte)))
+ if (!pfn_valid(pfn))
return;

- page = pfn_to_page(pte_pfn(*pte));
- if (page == ZERO_PAGE(0))
- return;
+ folio = page_folio(pfn_to_page(pfn));

- if (test_and_set_bit(PG_dcache_clean, &page->flags))
+ if (test_and_set_bit(PG_dcache_clean, &folio->flags))
return;

- addr = (unsigned long) kmap_atomic(page);
-
- dcache_wb_range(addr, addr + PAGE_SIZE);
+ for (i = 0; i < folio_nr_pages(folio); i++) {
+ unsigned long addr = (unsigned long) kmap_local_folio(folio,
+ i * PAGE_SIZE);

- if (vma->vm_flags & VM_EXEC)
- icache_inv_range(addr, addr + PAGE_SIZE);
-
- kunmap_atomic((void *) addr);
+ dcache_wb_range(addr, addr + PAGE_SIZE);
+ if (vma->vm_flags & VM_EXEC)
+ icache_inv_range(addr, addr + PAGE_SIZE);
+ kunmap_local((void *) addr);
+ }
}

void flush_icache_deferred(struct mm_struct *mm)
diff --git a/arch/csky/abiv2/inc/abi/cacheflush.h b/arch/csky/abiv2/inc/abi/cacheflush.h
index a565e00c3f70..9c728933a776 100644
--- a/arch/csky/abiv2/inc/abi/cacheflush.h
+++ b/arch/csky/abiv2/inc/abi/cacheflush.h
@@ -18,11 +18,17 @@

#define PG_dcache_clean PG_arch_1

+static inline void flush_dcache_folio(struct folio *folio)
+{
+ if (test_bit(PG_dcache_clean, &folio->flags))
+ clear_bit(PG_dcache_clean, &folio->flags);
+}
+#define flush_dcache_folio flush_dcache_folio
+
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
static inline void flush_dcache_page(struct page *page)
{
- if (test_bit(PG_dcache_clean, &page->flags))
- clear_bit(PG_dcache_clean, &page->flags);
+ flush_dcache_folio(page_folio(page));
}

#define flush_dcache_mmap_lock(mapping) do { } while (0)
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index d4042495febc..42405037c871 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -28,6 +28,7 @@
#define pgd_ERROR(e) \
pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))

+#define PFN_PTE_SHIFT PAGE_SHIFT
#define pmd_pfn(pmd) (pmd_phys(pmd) >> PAGE_SHIFT)
#define pmd_page(pmd) (pfn_to_page(pmd_phys(pmd) >> PAGE_SHIFT))
#define pte_clear(mm, addr, ptep) set_pte((ptep), \
@@ -90,7 +91,6 @@ static inline void set_pte(pte_t *p, pte_t pte)
/* prevent out of order excution */
smp_mb();
}
-#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)

static inline pte_t *pmd_page_vaddr(pmd_t pmd)
{
@@ -263,8 +263,10 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
extern void paging_init(void);

-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
- pte_t *pte);
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+ unsigned long address, pte_t *pte, unsigned int nr);
+#define update_mmu_cache(vma, addr, ptep) \
+ update_mmu_cache_range(NULL, vma, addr, ptep, 1)

#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
remap_pfn_range(vma, vaddr, pfn, size, prot)
--
2.39.2


2023-07-10 21:15:45

by Matthew Wilcox

Subject: [PATCH v5 06/38] mm: Add default definition of set_ptes()

Most architectures can just define set_pte() and PFN_PTE_SHIFT to
use this definition. It's also a handy spot to document the guarantees
provided by the MM.
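
For illustration, the snippet below sketches what a typical architecture
now provides in its asm/pgtable.h so that this generic definition applies.
The PTE layout shown (PFN starting at PAGE_SHIFT) and the WRITE_ONCE()
store are assumptions for the example, not taken from any particular
architecture patch in this series.

#define PFN_PTE_SHIFT	PAGE_SHIFT	/* bit position of the PFN in a PTE */

static inline void set_pte(pte_t *ptep, pte_t pte)
{
	WRITE_ONCE(*ptep, pte);		/* plain store of a single entry */
}

/*
 * Nothing else is needed: the generic set_ptes() below loops over
 * set_pte(), stepping the PFN by 1 << PFN_PTE_SHIFT each iteration,
 * and set_pte_at() falls back to the nr == 1 case.
 */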

Suggested-by: Mike Rapoport (IBM) <[email protected]>
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Mike Rapoport (IBM) <[email protected]>
---
include/linux/pgtable.h | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..22f48f9997d5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -180,6 +180,43 @@ static inline int pmd_young(pmd_t pmd)
}
#endif

+#ifndef set_ptes
+#ifdef PFN_PTE_SHIFT
+/**
+ * set_ptes - Map consecutive pages to a contiguous range of addresses.
+ * @mm: Address space to map the pages into.
+ * @addr: Address to map the first page at.
+ * @ptep: Page table pointer for the first entry.
+ * @pte: Page table entry for the first page.
+ * @nr: Number of pages to map.
+ *
+ * May be overridden by the architecture, or the architecture can define
+ * set_pte() and PFN_PTE_SHIFT.
+ *
+ * Context: The caller holds the page table lock. The pages all belong
+ * to the same folio. The PTEs are all in the same PMD.
+ */
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ page_table_check_ptes_set(mm, addr, ptep, pte, nr);
+
+ for (;;) {
+ set_pte(ptep, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
+ }
+}
+#ifndef set_pte_at
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+#endif
+#else
+#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
extern int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
--
2.39.2


2023-07-10 21:16:00

by Matthew Wilcox

Subject: [PATCH v5 14/38] loongarch: Implement the new page table range API

Add update_mmu_cache_range() and change _PFN_SHIFT to PFN_PTE_SHIFT.
It would probably be more efficient to implement __update_tlb() by
flushing the entire folio instead of calling __update_tlb() N times,
but I'll leave that for someone who understands the architecture better.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: Huacai Chen <[email protected]>
Cc: WANG Xuerui <[email protected]>
Cc: [email protected]
---
arch/loongarch/include/asm/cacheflush.h | 1 +
arch/loongarch/include/asm/pgtable-bits.h | 4 +--
arch/loongarch/include/asm/pgtable.h | 33 ++++++++++++-----------
arch/loongarch/mm/pgtable.c | 2 +-
arch/loongarch/mm/tlb.c | 2 +-
5 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/loongarch/include/asm/cacheflush.h b/arch/loongarch/include/asm/cacheflush.h
index 0681788eb474..88a44da50a3b 100644
--- a/arch/loongarch/include/asm/cacheflush.h
+++ b/arch/loongarch/include/asm/cacheflush.h
@@ -47,6 +47,7 @@ void local_flush_icache_range(unsigned long start, unsigned long end);
#define flush_cache_vmap(start, end) do { } while (0)
#define flush_cache_vunmap(start, end) do { } while (0)
#define flush_icache_page(vma, page) do { } while (0)
+#define flush_icache_pages(vma, page, nr) do { } while (0)
#define flush_icache_user_page(vma, page, addr, len) do { } while (0)
#define flush_dcache_page(page) do { } while (0)
#define flush_dcache_mmap_lock(mapping) do { } while (0)
diff --git a/arch/loongarch/include/asm/pgtable-bits.h b/arch/loongarch/include/asm/pgtable-bits.h
index de46a6b1e9f1..35348d4c4209 100644
--- a/arch/loongarch/include/asm/pgtable-bits.h
+++ b/arch/loongarch/include/asm/pgtable-bits.h
@@ -50,12 +50,12 @@
#define _PAGE_NO_EXEC (_ULCAST_(1) << _PAGE_NO_EXEC_SHIFT)
#define _PAGE_RPLV (_ULCAST_(1) << _PAGE_RPLV_SHIFT)
#define _CACHE_MASK (_ULCAST_(3) << _CACHE_SHIFT)
-#define _PFN_SHIFT (PAGE_SHIFT - 12 + _PAGE_PFN_SHIFT)
+#define PFN_PTE_SHIFT (PAGE_SHIFT - 12 + _PAGE_PFN_SHIFT)

#define _PAGE_USER (PLV_USER << _PAGE_PLV_SHIFT)
#define _PAGE_KERN (PLV_KERN << _PAGE_PLV_SHIFT)

-#define _PFN_MASK (~((_ULCAST_(1) << (_PFN_SHIFT)) - 1) & \
+#define _PFN_MASK (~((_ULCAST_(1) << (PFN_PTE_SHIFT)) - 1) & \
((_ULCAST_(1) << (_PAGE_PFN_END_SHIFT)) - 1))

/*
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 38afeb7dd58b..e7cf25e452c0 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -237,9 +237,9 @@ extern pmd_t mk_pmd(struct page *page, pgprot_t prot);
extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd);

#define pte_page(x) pfn_to_page(pte_pfn(x))
-#define pte_pfn(x) ((unsigned long)(((x).pte & _PFN_MASK) >> _PFN_SHIFT))
-#define pfn_pte(pfn, prot) __pte(((pfn) << _PFN_SHIFT) | pgprot_val(prot))
-#define pfn_pmd(pfn, prot) __pmd(((pfn) << _PFN_SHIFT) | pgprot_val(prot))
+#define pte_pfn(x) ((unsigned long)(((x).pte & _PFN_MASK) >> PFN_PTE_SHIFT))
+#define pfn_pte(pfn, prot) __pte(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
+#define pfn_pmd(pfn, prot) __pmd(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))

/*
* Initialize a new pgd / pud / pmd table with invalid pointers.
@@ -334,19 +334,13 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
}
}

-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pteval)
-{
- set_pte(ptep, pteval);
-}
-
static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
{
/* Preserve global status for the pair */
if (pte_val(*ptep_buddy(ptep)) & _PAGE_GLOBAL)
- set_pte_at(mm, addr, ptep, __pte(_PAGE_GLOBAL));
+ set_pte(ptep, __pte(_PAGE_GLOBAL));
else
- set_pte_at(mm, addr, ptep, __pte(0));
+ set_pte(ptep, __pte(0));
}

#define PGD_T_LOG2 (__builtin_ffs(sizeof(pgd_t)) - 1)
@@ -445,11 +439,20 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
extern void __update_tlb(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep);

-static inline void update_mmu_cache(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+ struct vm_area_struct *vma, unsigned long address,
+ pte_t *ptep, unsigned int nr)
{
- __update_tlb(vma, address, ptep);
+ for (;;) {
+ __update_tlb(vma, address, ptep);
+ if (--nr == 0)
+ break;
+ address += PAGE_SIZE;
+ ptep++;
+ }
}
+#define update_mmu_cache(vma, addr, ptep) \
+ update_mmu_cache_range(NULL, vma, addr, ptep, 1)

#define __HAVE_ARCH_UPDATE_MMU_TLB
#define update_mmu_tlb update_mmu_cache
@@ -462,7 +465,7 @@ static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,

static inline unsigned long pmd_pfn(pmd_t pmd)
{
- return (pmd_val(pmd) & _PFN_MASK) >> _PFN_SHIFT;
+ return (pmd_val(pmd) & _PFN_MASK) >> PFN_PTE_SHIFT;
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/arch/loongarch/mm/pgtable.c b/arch/loongarch/mm/pgtable.c
index 36a6dc0148ae..1260cf30e3ee 100644
--- a/arch/loongarch/mm/pgtable.c
+++ b/arch/loongarch/mm/pgtable.c
@@ -107,7 +107,7 @@ pmd_t mk_pmd(struct page *page, pgprot_t prot)
{
pmd_t pmd;

- pmd_val(pmd) = (page_to_pfn(page) << _PFN_SHIFT) | pgprot_val(prot);
+ pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot);

return pmd;
}
diff --git a/arch/loongarch/mm/tlb.c b/arch/loongarch/mm/tlb.c
index 00bb563e3c89..eb8572e201ea 100644
--- a/arch/loongarch/mm/tlb.c
+++ b/arch/loongarch/mm/tlb.c
@@ -252,7 +252,7 @@ static void output_pgtable_bits_defines(void)
pr_define("_PAGE_WRITE_SHIFT %d\n", _PAGE_WRITE_SHIFT);
pr_define("_PAGE_NO_READ_SHIFT %d\n", _PAGE_NO_READ_SHIFT);
pr_define("_PAGE_NO_EXEC_SHIFT %d\n", _PAGE_NO_EXEC_SHIFT);
- pr_define("_PFN_SHIFT %d\n", _PFN_SHIFT);
+ pr_define("PFN_PTE_SHIFT %d\n", PFN_PTE_SHIFT);
pr_debug("\n");
}

--
2.39.2


2023-07-10 21:16:50

by Matthew Wilcox

Subject: [PATCH v5 10/38] arm64: Implement the new page table range API

Add set_ptes(), update_mmu_cache_range() and flush_dcache_folio().
Change the PG_dcache_clean flag from being per-page to per-folio.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Acked-by: Mike Rapoport (IBM) <[email protected]>
Cc: [email protected]
---
arch/arm64/include/asm/cacheflush.h | 4 +++-
arch/arm64/include/asm/pgtable.h | 26 +++++++++++++++------
arch/arm64/mm/flush.c | 36 +++++++++++------------------
3 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 37185e978aeb..d115451ed263 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -114,7 +114,7 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *,
#define copy_to_user_page copy_to_user_page

/*
- * flush_dcache_page is used when the kernel has written to the page
+ * flush_dcache_folio is used when the kernel has written to the page
* cache page at virtual address page->virtual.
*
* If this page isn't mapped (ie, page_mapping == NULL), or it might
@@ -127,6 +127,8 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *,
*/
#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
extern void flush_dcache_page(struct page *);
+void flush_dcache_folio(struct folio *);
+#define flush_dcache_folio flush_dcache_folio

static __always_inline void icache_inval_all_pou(void)
{
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a44a150e0318..c1c4abf75217 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -345,12 +345,21 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
set_pte(ptep, pte);
}

-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
- pte_t *ptep, pte_t pte)
-{
- page_table_check_ptes_set(mm, addr, ptep, pte, 1);
- return __set_pte_at(mm, addr, ptep, pte);
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+ pte_t *ptep, pte_t pte, unsigned int nr)
+{
+ page_table_check_ptes_set(mm, addr, ptep, pte, nr);
+
+ for (;;) {
+ __set_pte_at(mm, addr, ptep, pte);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ pte_val(pte) += PAGE_SIZE;
+ }
}
+#define set_ptes set_ptes

/*
* Huge pte definitions.
@@ -1049,8 +1058,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
/*
* On AArch64, the cache coherency is handled via the set_pte_at() function.
*/
-static inline void update_mmu_cache(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+ struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
/*
* We don't do anything here, so there's a very small chance of
@@ -1059,6 +1069,8 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
*/
}

+#define update_mmu_cache(vma, addr, ptep) \
+ update_mmu_cache_range(NULL, vma, addr, ptep, 1)
#define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)

#ifdef CONFIG_ARM64_PA_BITS_52
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 4e6476094952..013eead9b695 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -51,20 +51,13 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,

void __sync_icache_dcache(pte_t pte)
{
- struct page *page = pte_page(pte);
+ struct folio *folio = page_folio(pte_page(pte));

- /*
- * HugeTLB pages are always fully mapped, so only setting head page's
- * PG_dcache_clean flag is enough.
- */
- if (PageHuge(page))
- page = compound_head(page);
-
- if (!test_bit(PG_dcache_clean, &page->flags)) {
- sync_icache_aliases((unsigned long)page_address(page),
- (unsigned long)page_address(page) +
- page_size(page));
- set_bit(PG_dcache_clean, &page->flags);
+ if (!test_bit(PG_dcache_clean, &folio->flags)) {
+ sync_icache_aliases((unsigned long)folio_address(folio),
+ (unsigned long)folio_address(folio) +
+ folio_size(folio));
+ set_bit(PG_dcache_clean, &folio->flags);
}
}
EXPORT_SYMBOL_GPL(__sync_icache_dcache);
@@ -74,17 +67,16 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
* it as dirty for later flushing when mapped in user space (if executable,
* see __sync_icache_dcache).
*/
-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
{
- /*
- * HugeTLB pages are always fully mapped and only head page will be
- * set PG_dcache_clean (see comments in __sync_icache_dcache()).
- */
- if (PageHuge(page))
- page = compound_head(page);
+ if (test_bit(PG_dcache_clean, &folio->flags))
+ clear_bit(PG_dcache_clean, &folio->flags);
+}
+EXPORT_SYMBOL(flush_dcache_folio);

- if (test_bit(PG_dcache_clean, &page->flags))
- clear_bit(PG_dcache_clean, &page->flags);
+void flush_dcache_page(struct page *page)
+{
+ flush_dcache_folio(page_folio(page));
}
EXPORT_SYMBOL(flush_dcache_page);

--
2.39.2


2023-07-10 21:17:40

by Matthew Wilcox

Subject: [PATCH v5 30/38] mm: Remove page_mapping_file()

This function has no more users.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
---
include/linux/pagemap.h | 8 --------
1 file changed, 8 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 794e4e55dc38..71dd79b4ae0a 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -414,14 +414,6 @@ static inline struct address_space *page_file_mapping(struct page *page)
return folio_file_mapping(page_folio(page));
}

-/*
- * For file cache pages, return the address_space, otherwise return NULL
- */
-static inline struct address_space *page_mapping_file(struct page *page)
-{
- return folio_flush_mapping(page_folio(page));
-}
-
/**
* folio_inode - Get the host inode for this folio.
* @folio: The folio.
--
2.39.2


2023-07-10 21:20:35

by Matthew Wilcox

Subject: [PATCH v5 32/38] mm: Tidy up set_ptes definition

Now that all architectures are converted, we can remove the
PFN_PTE_SHIFT ifdef and we can define set_pte_at() unconditionally.

Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
---
include/linux/pgtable.h | 6 ------
1 file changed, 6 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 22f48f9997d5..e2a0bd5941be 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -181,7 +181,6 @@ static inline int pmd_young(pmd_t pmd)
#endif

#ifndef set_ptes
-#ifdef PFN_PTE_SHIFT
/**
* set_ptes - Map consecutive pages to a contiguous range of addresses.
* @mm: Address space to map the pages into.
@@ -209,13 +208,8 @@ static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
pte = __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
}
}
-#ifndef set_pte_at
-#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-#endif
#endif
-#else
#define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)
-#endif

#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
extern int ptep_set_access_flags(struct vm_area_struct *vma,
--
2.39.2


2023-07-11 00:24:24

by Andrew Morton

Subject: Re: [PATCH v5 01/38] minmax: Add in_range() macro

On Mon, 10 Jul 2023 21:43:02 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:

> Determine if a value lies within a range more efficiently (subtraction +
> comparison vs two comparisons and an AND). It also has useful (under
> some circumstances) behaviour if the range exceeds the maximum value of
> the type.
>
> Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> --- a/include/linux/minmax.h
> +++ b/include/linux/minmax.h
> @@ -158,6 +158,32 @@
> */
> #define clamp_val(val, lo, hi) clamp_t(typeof(val), val, lo, hi)
>
> +static inline bool in_range64(u64 val, u64 start, u64 len)
> +{
> + return (val - start) < len;
> +}
> +
> +static inline bool in_range32(u32 val, u32 start, u32 len)
> +{
> + return (val - start) < len;
> +}
> +
> +/**
> + * in_range - Determine if a value lies within a range.
> + * @val: Value to test.
> + * @start: First value in range.
> + * @len: Number of values in range.
> + *
> + * This is more efficient than "if (start <= val && val < (start + len))".
> + * It also gives a different answer if @start + @len overflows the size of
> + * the type by a sufficient amount to encompass @val. Decide for yourself
> + * which behaviour you want, or prove that start + len never overflow.
> + * Do not blindly replace one form with the other.
> + */
> +#define in_range(val, start, len) \
> + sizeof(start) <= sizeof(u32) ? in_range32(val, start, len) : \
> + in_range64(val, start, len)

There's nothing here to prevent callers from passing a mixture of
32-bit and 64-bit values, possibly resulting in truncation of `val' or
`len'.

Obviously caller is being dumb, but I think it's cost-free to check all
three of the arguments for 64-bitness?

Or do a min()/max()-style check for consistently typed arguments?


2023-07-11 02:35:55

by Matthew Wilcox

Subject: Re: [PATCH v5 01/38] minmax: Add in_range() macro

On Mon, Jul 10, 2023 at 04:13:41PM -0700, Andrew Morton wrote:
> > +/**
> > + * in_range - Determine if a value lies within a range.
> > + * @val: Value to test.
> > + * @start: First value in range.
> > + * @len: Number of values in range.
> > + *
> > + * This is more efficient than "if (start <= val && val < (start + len))".
> > + * It also gives a different answer if @start + @len overflows the size of
> > + * the type by a sufficient amount to encompass @val. Decide for yourself
> > + * which behaviour you want, or prove that start + len never overflow.
> > + * Do not blindly replace one form with the other.
> > + */
> > +#define in_range(val, start, len) \
> > + sizeof(start) <= sizeof(u32) ? in_range32(val, start, len) : \
> > + in_range64(val, start, len)
>
> There's nothing here to prevent callers from passing a mixture of
> 32-bit and 64-bit values, possibly resulting in truncation of `val' or
> `len'.
>
> Obviously caller is being dumb, but I think it's cost-free to check all
> three of the arguments for 64-bitness?
>
> Or do a min()/max()-style check for consistently typed arguments?

How about

#define in_range(val, start, len) \
(sizeof(val) | sizeof(start) | sizeof(len)) <= sizeof(u32) ? \
in_range32(val, start, len) : in_range64(val, start, len)


2023-07-11 06:05:48

by Christoph Hellwig

Subject: Re: [PATCH v5 01/38] minmax: Add in_range() macro

On Mon, Jul 10, 2023 at 09:43:02PM +0100, Matthew Wilcox (Oracle) wrote:
> Determine if a value lies within a range more efficiently (subtraction +
> comparison vs two comparisons and an AND). It also has useful (under
> some circumstances) behaviour if the range exceeds the maximum value of
> the type.

Should this also drop existing versions of in_range()? E.g. btrfs
already has its own.

2023-07-11 09:18:08

by Christian Borntraeger

Subject: Re: [PATCH v5 00/38] New page table range API

Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> This patchset changes the API used by the MM to set up page table entries.
> The four APIs are:
> set_ptes(mm, addr, ptep, pte, nr)
> update_mmu_cache_range(vma, addr, ptep, nr)
> flush_dcache_folio(folio)
> flush_icache_pages(vma, page, nr)
>
> flush_dcache_folio() isn't technically new, but no architecture
> implemented it, so I've done that for them. The old APIs remain around
> but are mostly implemented by calling the new interfaces.
>
> The new APIs are based around setting up N page table entries at once.
> The N entries belong to the same PMD, the same folio and the same VMA,
> so ptep++ is a legitimate operation, and locking is taken care of for
> you. Some architectures can do a better job of it than just a loop,
> but I have hesitated to make too deep a change to architectures I don't
> understand well.
>
> One thing I have changed in every architecture is that PG_arch_1 is now a
> per-folio bit instead of a per-page bit. This was something that would
> have to happen eventually, and it makes sense to do it now rather than
> iterate over every page involved in a cache flush and figure out if it
> needs to happen.

I think we do use PG_arch_1 on s390 for our secure page handling and
making this per folio instead of per physical page really seems wrong
and it probably breaks this code.

Claudio, can you have a look?



2023-07-11 13:47:39

by Matthew Wilcox

Subject: Re: [PATCH v5 00/38] New page table range API

On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
> Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> > This patchset changes the API used by the MM to set up page table entries.
> > The four APIs are:
> > set_ptes(mm, addr, ptep, pte, nr)
> > update_mmu_cache_range(vma, addr, ptep, nr)
> > flush_dcache_folio(folio)
> > flush_icache_pages(vma, page, nr)
> >
> > flush_dcache_folio() isn't technically new, but no architecture
> > implemented it, so I've done that for them. The old APIs remain around
> > but are mostly implemented by calling the new interfaces.
> >
> > The new APIs are based around setting up N page table entries at once.
> > The N entries belong to the same PMD, the same folio and the same VMA,
> > so ptep++ is a legitimate operation, and locking is taken care of for
> > you. Some architectures can do a better job of it than just a loop,
> > but I have hesitated to make too deep a change to architectures I don't
> > understand well.
> >
> > One thing I have changed in every architecture is that PG_arch_1 is now a
> > per-folio bit instead of a per-page bit. This was something that would
> > have to happen eventually, and it makes sense to do it now rather than
> > iterate over every page involved in a cache flush and figure out if it
> > needs to happen.
>
> I think we do use PG_arch_1 on s390 for our secure page handling and
> making this per folio instead of per physical page really seems wrong
> and it probably breaks this code.

Per-page flags are going away in the next few years, so you're going to
need a new design. s390 seems to do a lot of unusual things. I wish
you'd talk to the rest of us more.

2023-07-11 15:37:42

by Claudio Imbrenda

Subject: Re: [PATCH v5 00/38] New page table range API

On Tue, 11 Jul 2023 13:36:27 +0100
Matthew Wilcox <[email protected]> wrote:

> On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
> > Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> > > This patchset changes the API used by the MM to set up page table entries.
> > > The four APIs are:
> > > set_ptes(mm, addr, ptep, pte, nr)
> > > update_mmu_cache_range(vma, addr, ptep, nr)
> > > flush_dcache_folio(folio)
> > > flush_icache_pages(vma, page, nr)
> > >
> > > flush_dcache_folio() isn't technically new, but no architecture
> > > implemented it, so I've done that for them. The old APIs remain around
> > > but are mostly implemented by calling the new interfaces.
> > >
> > > The new APIs are based around setting up N page table entries at once.
> > > The N entries belong to the same PMD, the same folio and the same VMA,
> > > so ptep++ is a legitimate operation, and locking is taken care of for
> > > you. Some architectures can do a better job of it than just a loop,
> > > but I have hesitated to make too deep a change to architectures I don't
> > > understand well.
> > >
> > > One thing I have changed in every architecture is that PG_arch_1 is now a
> > > per-folio bit instead of a per-page bit. This was something that would
> > > have to happen eventually, and it makes sense to do it now rather than
> > > iterate over every page involved in a cache flush and figure out if it
> > > needs to happen.
> >
> > I think we do use PG_arch_1 on s390 for our secure page handling and
> > making this per folio instead of per physical page really seems wrong
> > and it probably breaks this code.
>
> Per-page flags are going away in the next few years, so you're going to

For each 4k physical page frame, we need to keep track of whether it is
secure or not.

A bit in struct page seems the most logical choice. If that's not
possible anymore, how would you propose we do it?

> need a new design. s390 seems to do a lot of unusual things. I wish

s390 is an unusual architecture. we are working on un-weirding our
code, but it takes time

> you'd talk to the rest of us more.


2023-07-11 16:10:56

by Andrew Morton

Subject: Re: [PATCH v5 01/38] minmax: Add in_range() macro

On Tue, 11 Jul 2023 03:14:44 +0100 Matthew Wilcox <[email protected]> wrote:

> On Mon, Jul 10, 2023 at 04:13:41PM -0700, Andrew Morton wrote:
> > > +/**
> > > + * in_range - Determine if a value lies within a range.
> > > + * @val: Value to test.
> > > + * @start: First value in range.
> > > + * @len: Number of values in range.
> > > + *
> > > + * This is more efficient than "if (start <= val && val < (start + len))".
> > > + * It also gives a different answer if @start + @len overflows the size of
> > > + * the type by a sufficient amount to encompass @val. Decide for yourself
> > > + * which behaviour you want, or prove that start + len never overflow.
> > > + * Do not blindly replace one form with the other.
> > > + */
> > > +#define in_range(val, start, len) \
> > > + sizeof(start) <= sizeof(u32) ? in_range32(val, start, len) : \
> > > + in_range64(val, start, len)
> >
> > There's nothing here to prevent callers from passing a mixture of
> > 32-bit and 64-bit values, possibly resulting in truncation of `val' or
> > `len'.
> >
> > Obviously caller is being dumb, but I think it's cost-free to check all
> > three of the arguments for 64-bitness?
> >
> > Or do a min()/max()-style check for consistently typed arguments?
>
> How about
>
> #define in_range(val, start, len) \
> (sizeof(val) | sizeof(start) | sizeof(len)) <= sizeof(u32) ? \
> in_range32(val, start, len) : in_range64(val, start, len)

It saves some typing ;) sizeof(val+start+len)? <no>





2023-07-11 17:25:49

by Andrew Morton

Subject: Re: [PATCH v5 00/38] New page table range API

On Tue, 11 Jul 2023 17:24:40 +0200 Claudio Imbrenda <[email protected]> wrote:

> On Tue, 11 Jul 2023 13:36:27 +0100
> Matthew Wilcox <[email protected]> wrote:
>
> > On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
> > > Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> > > > This patchset changes the API used by the MM to set up page table entries.
> > > > The four APIs are:
> > > > set_ptes(mm, addr, ptep, pte, nr)
> > > > update_mmu_cache_range(vma, addr, ptep, nr)
> > > > flush_dcache_folio(folio)
> > > > flush_icache_pages(vma, page, nr)
> > > >
> > > > flush_dcache_folio() isn't technically new, but no architecture
> > > > implemented it, so I've done that for them. The old APIs remain around
> > > > but are mostly implemented by calling the new interfaces.
> > > >
> > > > The new APIs are based around setting up N page table entries at once.
> > > > The N entries belong to the same PMD, the same folio and the same VMA,
> > > > so ptep++ is a legitimate operation, and locking is taken care of for
> > > > you. Some architectures can do a better job of it than just a loop,
> > > > but I have hesitated to make too deep a change to architectures I don't
> > > > understand well.
> > > >
> > > > One thing I have changed in every architecture is that PG_arch_1 is now a
> > > > per-folio bit instead of a per-page bit. This was something that would
> > > > have to happen eventually, and it makes sense to do it now rather than
> > > > iterate over every page involved in a cache flush and figure out if it
> > > > needs to happen.
> > >
> > > I think we do use PG_arch_1 on s390 for our secure page handling and
> > > making this perf folio instead of physical page really seems wrong
> > > and it probably breaks this code.
> >
> > Per-page flags are going away in the next few years, so you're going to
>
> For each 4k physical page frame, we need to keep track whether it is
> secure or not.
>
> A bit in struct page seems the most logical choice. If that's not
> possible anymore, how would you propose we should do?
>
> > need a new design. s390 seems to do a lot of unusual things. I wish
>
> s390 is an unusual architecture. we are working on un-weirding our
> code, but it takes time
>

This issue sounds fatal for this version of this patchset?

2023-07-11 22:19:59

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API

On Tue, Jul 11, 2023 at 09:52:33AM -0700, Andrew Morton wrote:
> On Tue, 11 Jul 2023 17:24:40 +0200 Claudio Imbrenda <[email protected]> wrote:
>
> > On Tue, 11 Jul 2023 13:36:27 +0100
> > Matthew Wilcox <[email protected]> wrote:
> >
> > > On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
> > > > Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> > > > > This patchset changes the API used by the MM to set up page table entries.
> > > > > The four APIs are:
> > > > > set_ptes(mm, addr, ptep, pte, nr)
> > > > > update_mmu_cache_range(vma, addr, ptep, nr)
> > > > > flush_dcache_folio(folio)
> > > > > flush_icache_pages(vma, page, nr)
> > > > >
> > > > > flush_dcache_folio() isn't technically new, but no architecture
> > > > > implemented it, so I've done that for them. The old APIs remain around
> > > > > but are mostly implemented by calling the new interfaces.
> > > > >
> > > > > The new APIs are based around setting up N page table entries at once.
> > > > > The N entries belong to the same PMD, the same folio and the same VMA,
> > > > > so ptep++ is a legitimate operation, and locking is taken care of for
> > > > > you. Some architectures can do a better job of it than just a loop,
> > > > > but I have hesitated to make too deep a change to architectures I don't
> > > > > understand well.
> > > > >
> > > > > One thing I have changed in every architecture is that PG_arch_1 is now a
> > > > > per-folio bit instead of a per-page bit. This was something that would
> > > > > have to happen eventually, and it makes sense to do it now rather than
> > > > > iterate over every page involved in a cache flush and figure out if it
> > > > > needs to happen.
> > > >
> > > > I think we do use PG_arch_1 on s390 for our secure page handling and
> > > > making this perf folio instead of physical page really seems wrong
> > > > and it probably breaks this code.
> > >
> > > Per-page flags are going away in the next few years, so you're going to
> >
> > For each 4k physical page frame, we need to keep track whether it is
> > secure or not.
> >
> > A bit in struct page seems the most logical choice. If that's not
> > possible anymore, how would you propose we should do?
> >
> > > need a new design. s390 seems to do a lot of unusual things. I wish
> >
> > s390 is an unusual architecture. we are working on un-weirding our
> > code, but it takes time
> >
>
> This issue sounds fatal for this version of this patchset?

It's only declared as being per-folio in the cover letter to this
patchset. I haven't done anything that will prohibit s390 from using it
the way they do now. So it's not fatal, but it sounds like the
in_range() macro might be ...

2023-07-12 06:02:31

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API

On Tue, Jul 11, 2023 at 05:24:40PM +0200, Claudio Imbrenda wrote:
> On Tue, 11 Jul 2023 13:36:27 +0100
> Matthew Wilcox <[email protected]> wrote:
> > > I think we do use PG_arch_1 on s390 for our secure page handling and
> > > making this perf folio instead of physical page really seems wrong
> > > and it probably breaks this code.
> >
> > Per-page flags are going away in the next few years, so you're going to
>
> For each 4k physical page frame, we need to keep track whether it is
> secure or not.

Do you? Wouldn't it make more sense to track that per allocation instead
of per page? ie if we allocate a 16kB anon folio for a VMA, don't you
want the entire folio to be marked as secure vs insecure?

I don't really know what secure means in this context. I think it has
something to do with which of the VM or the hypervisor can access it, but
it feels like something new that I've never had properly explained to me.

> A bit in struct page seems the most logical choice. If that's not
> possible anymore, how would you propose we should do?

The plan is to shrink struct page down to a single pointer (which
includes a few tag bits to say what type that pointer is -- a page
table, anon mem, file mem, slab, etc). So there won't be any bits
available for something like "secure or not". You could use a side
structure if you really need to keep track on a per page basis.
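As a minimal sketch of the "side structure" suggestion, assuming a simple
bitmap indexed by page frame number (every identifier below is invented for
illustration and is not existing s390 code):

#include <linux/bitmap.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/types.h>

/* One bit per 4k page frame; set means "secure". Hypothetical only. */
static unsigned long *secure_pfn_bitmap;

static int secure_pfn_bitmap_init(unsigned long nr_pfns)
{
	secure_pfn_bitmap = bitmap_zalloc(nr_pfns, GFP_KERNEL);
	return secure_pfn_bitmap ? 0 : -ENOMEM;
}

static void set_pfn_secure(unsigned long pfn, bool secure)
{
	if (secure)
		set_bit(pfn, secure_pfn_bitmap);
	else
		clear_bit(pfn, secure_pfn_bitmap);
}

static bool pfn_is_secure(unsigned long pfn)
{
	return test_bit(pfn, secure_pfn_bitmap);
}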

2023-07-12 08:42:13

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API

On Wed, 12 Jul 2023 06:29:21 +0100
Matthew Wilcox <[email protected]> wrote:

> On Tue, Jul 11, 2023 at 05:24:40PM +0200, Claudio Imbrenda wrote:
> > On Tue, 11 Jul 2023 13:36:27 +0100
> > Matthew Wilcox <[email protected]> wrote:
> > > > I think we do use PG_arch_1 on s390 for our secure page handling and
> > > > making this perf folio instead of physical page really seems wrong
> > > > and it probably breaks this code.
> > >
> > > Per-page flags are going away in the next few years, so you're going to
> >
> > For each 4k physical page frame, we need to keep track whether it is
> > secure or not.
>
> Do you? Wouldn't it make more sense to track that per allocation instead

no

> of per page? ie if we allocate a 16kB anon folio for a VMA, don't you
> want the entire folio to be marked as secure vs insecure?

If we allocate a 16k folio, it would actually be initially marked as
non-secure until the guest touches any of it; then only those 4k pages
that are needed get marked as secure.

The guest can also share the pages with the host, in which case the
individual 4k pages get marked as non-secure once I/O is attempted on
them (e.g. direct I/O).

Userspace (i.e. QEMU) can also try to look into the guest, causing
individual pages to be exported (securely encrypted and then marked as
non-secure) if they were secure and not shared.

I/O cannot trigger exports, it will just fail, and that should not
happen, because in some cases it can bring down the whole system. That
is one of the main reasons why we need to keep track of the state.

>
> I don't really know what secure means in this context. I think it has
> something to do with which of the VM or the hypervisor can access it, but
> it feels like something new that I've never had properly explained to me.

Secure means it belongs to a secure guest (confidential VM,
protected virtualisation, Secure Execution, there are many names...).

Hardware will prevent the host (or any other entity except for the
secure guest itself) from accessing those 4k physical page frames,
regardless of how the host might try. An exception will be presented
for any attempts.

I/O will not trigger any exception, and will instead just fail.

I hope this explains why we need to track the property for each 4k
physical page frame.

>
> > A bit in struct page seems the most logical choice. If that's not
> > possible anymore, how would you propose we should do?
>
> The plan is to shrink struct page down to a single pointer (which

interesting

> includes a few tag bits to say what type that pointer is -- a page
> table, anon mem, file mem, slab, etc). So there won't be any bits
> available for something like "secure or not". You could use a side
> structure if you really need to keep track on a per page basis.

I guess that's something we will need to work on
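To make the per-4k state machine described above concrete, here is a purely
hypothetical sketch; the state names and transitions are invented to
summarise the explanation, not how s390 actually implements it:

#include <linux/types.h>

/* Hypothetical per-4k-frame state, invented to summarise the above. */
enum frame_state {
	FRAME_NON_SECURE,	/* initial state; the host may access it */
	FRAME_SECURE,		/* guest touched it; host access faults */
	FRAME_SHARED,		/* guest shared it back with the host */
};

/* Guest touches a page: only that 4k frame becomes secure. */
static enum frame_state guest_touch(enum frame_state s)
{
	return s == FRAME_NON_SECURE ? FRAME_SECURE : s;
}

/* Host I/O on a shared frame makes it non-secure; I/O on a secure frame
 * never triggers an export, it simply fails. */
static enum frame_state host_io(enum frame_state s, bool *ok)
{
	*ok = (s != FRAME_SECURE);
	return s == FRAME_SHARED ? FRAME_NON_SECURE : s;
}

/* Userspace (e.g. QEMU) peeking at a secure, unshared frame exports it:
 * the contents are encrypted and the frame is marked non-secure. */
static enum frame_state userspace_peek(enum frame_state s)
{
	return s == FRAME_SECURE ? FRAME_NON_SECURE : s;
}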

2023-07-13 11:06:53

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API



Am 11.07.23 um 14:36 schrieb Matthew Wilcox:
> On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
>> Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
>>> This patchset changes the API used by the MM to set up page table entries.
>>> The four APIs are:
>>> set_ptes(mm, addr, ptep, pte, nr)
>>> update_mmu_cache_range(vma, addr, ptep, nr)
>>> flush_dcache_folio(folio)
>>> flush_icache_pages(vma, page, nr)
>>>
>>> flush_dcache_folio() isn't technically new, but no architecture
>>> implemented it, so I've done that for them. The old APIs remain around
>>> but are mostly implemented by calling the new interfaces.
>>>
>>> The new APIs are based around setting up N page table entries at once.
>>> The N entries belong to the same PMD, the same folio and the same VMA,
>>> so ptep++ is a legitimate operation, and locking is taken care of for
>>> you. Some architectures can do a better job of it than just a loop,
>>> but I have hesitated to make too deep a change to architectures I don't
>>> understand well.
>>>
>>> One thing I have changed in every architecture is that PG_arch_1 is now a
>>> per-folio bit instead of a per-page bit. This was something that would
>>> have to happen eventually, and it makes sense to do it now rather than
>>> iterate over every page involved in a cache flush and figure out if it
>>> needs to happen.
>>
>> I think we do use PG_arch_1 on s390 for our secure page handling and
>> making this perf folio instead of physical page really seems wrong
>> and it probably breaks this code.
>
> Per-page flags are going away in the next few years, so you're going to
> need a new design. s390 seems to do a lot of unusual things. I wish
> you'd talk to the rest of us more.

I understand your point from a logical point of view, but a 4k page frame
is also a hardware-defined memory region, and I think not only for us.
How do you want to implement hardware poisoning, for example?
Marking the whole folio with PG_hwpoison seems wrong.

2023-07-13 14:13:44

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API

On Thu, Jul 13, 2023 at 12:42:44PM +0200, Christian Borntraeger wrote:
>
>
> Am 11.07.23 um 14:36 schrieb Matthew Wilcox:
> > On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
> > > Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> > > > This patchset changes the API used by the MM to set up page table entries.
> > > > The four APIs are:
> > > > set_ptes(mm, addr, ptep, pte, nr)
> > > > update_mmu_cache_range(vma, addr, ptep, nr)
> > > > flush_dcache_folio(folio)
> > > > flush_icache_pages(vma, page, nr)
> > > >
> > > > flush_dcache_folio() isn't technically new, but no architecture
> > > > implemented it, so I've done that for them. The old APIs remain around
> > > > but are mostly implemented by calling the new interfaces.
> > > >
> > > > The new APIs are based around setting up N page table entries at once.
> > > > The N entries belong to the same PMD, the same folio and the same VMA,
> > > > so ptep++ is a legitimate operation, and locking is taken care of for
> > > > you. Some architectures can do a better job of it than just a loop,
> > > > but I have hesitated to make too deep a change to architectures I don't
> > > > understand well.
> > > >
> > > > One thing I have changed in every architecture is that PG_arch_1 is now a
> > > > per-folio bit instead of a per-page bit. This was something that would
> > > > have to happen eventually, and it makes sense to do it now rather than
> > > > iterate over every page involved in a cache flush and figure out if it
> > > > needs to happen.
> > >
> > > I think we do use PG_arch_1 on s390 for our secure page handling and
> > > making this perf folio instead of physical page really seems wrong
> > > and it probably breaks this code.
> >
> > Per-page flags are going away in the next few years, so you're going to
> > need a new design. s390 seems to do a lot of unusual things. I wish
> > you'd talk to the rest of us more.
>
> I understand you point from a logical point of view, but a 4k page frame
> is also a hardware defined memory region. And I think not only for us.
> How do you want to implement hardware poisoning for example?
> Marking the whole folio with PG_hwpoison seems wrong.

For hardware poison, we can't use the page for any other purpose any more.
So one of the 16 types of pointer is for hardware poison. That doesn't
seem like it's a solution that could work for secure/insecure pages?

But what I'm really wondering is why you need to transition pages
between secure/insecure on a 4kB boundary. What's the downside to doing
it on a 16kB or 64kB boundary, or whatever size has been allocated?
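Purely as an illustration of the "single pointer with a type tag" idea
mentioned earlier in the thread (the struct name, the 4-bit tag and the
encoding below are all invented; this is not the actual planned layout):

#include <linux/types.h>

/* Hypothetical shrunken page descriptor: a pointer whose low bits are
 * reused as a type tag, with one tag reserved for hardware poison. */
enum memdesc_type {
	MEMDESC_PAGE_TABLE,
	MEMDESC_ANON,
	MEMDESC_FILE,
	MEMDESC_SLAB,
	MEMDESC_HWPOISON,	/* poisoned frames get a dedicated type */
	/* ... up to 16 types fit in the low 4 bits */
};

#define MEMDESC_TYPE_BITS	4
#define MEMDESC_TYPE_MASK	((1UL << MEMDESC_TYPE_BITS) - 1)

struct tiny_page {			/* stand-in for a shrunken struct page */
	unsigned long memdesc;		/* descriptor pointer | type tag */
};

static inline enum memdesc_type page_type_tag(const struct tiny_page *p)
{
	return (enum memdesc_type)(p->memdesc & MEMDESC_TYPE_MASK);
}

static inline void *page_descriptor(const struct tiny_page *p)
{
	return (void *)(p->memdesc & ~MEMDESC_TYPE_MASK);
}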


2023-07-13 20:52:46

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API



Am 13.07.23 um 15:42 schrieb Matthew Wilcox:
> On Thu, Jul 13, 2023 at 12:42:44PM +0200, Christian Borntraeger wrote:
>>
>>
>> Am 11.07.23 um 14:36 schrieb Matthew Wilcox:
>>> On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
>>>> Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
>>>>> This patchset changes the API used by the MM to set up page table entries.
>>>>> The four APIs are:
>>>>> set_ptes(mm, addr, ptep, pte, nr)
>>>>> update_mmu_cache_range(vma, addr, ptep, nr)
>>>>> flush_dcache_folio(folio)
>>>>> flush_icache_pages(vma, page, nr)
>>>>>
>>>>> flush_dcache_folio() isn't technically new, but no architecture
>>>>> implemented it, so I've done that for them. The old APIs remain around
>>>>> but are mostly implemented by calling the new interfaces.
>>>>>
>>>>> The new APIs are based around setting up N page table entries at once.
>>>>> The N entries belong to the same PMD, the same folio and the same VMA,
>>>>> so ptep++ is a legitimate operation, and locking is taken care of for
>>>>> you. Some architectures can do a better job of it than just a loop,
>>>>> but I have hesitated to make too deep a change to architectures I don't
>>>>> understand well.
>>>>>
>>>>> One thing I have changed in every architecture is that PG_arch_1 is now a
>>>>> per-folio bit instead of a per-page bit. This was something that would
>>>>> have to happen eventually, and it makes sense to do it now rather than
>>>>> iterate over every page involved in a cache flush and figure out if it
>>>>> needs to happen.
>>>>
>>>> I think we do use PG_arch_1 on s390 for our secure page handling and
>>>> making this perf folio instead of physical page really seems wrong
>>>> and it probably breaks this code.
>>>
>>> Per-page flags are going away in the next few years, so you're going to
>>> need a new design. s390 seems to do a lot of unusual things. I wish
>>> you'd talk to the rest of us more.
>>
>> I understand you point from a logical point of view, but a 4k page frame
>> is also a hardware defined memory region. And I think not only for us.
>> How do you want to implement hardware poisoning for example?
>> Marking the whole folio with PG_hwpoison seems wrong.
>
> For hardware poison, we can't use the page for any other purpose any more.
> So one of the 16 types of pointer is for hardware poison. That doesn't
> seem like it's a solution that could work for secure/insecure pages?
>
> But what I'm really wondering is why you need to transition pages
> between secure/insecure on a 4kB boundary. What's the downside to doing
> it on a 16kB or 64kB boundary, or whatever size has been allocated?

The export and import of more pages will be more expensive, but I assume that
we would then also use the larger chunks (e.g. for paging). The more interesting
problem is that the guest can make a page shared/non-shared at 4kB granularity.

Stupid question: can folios be split into folio, single page, folio when needed?

2023-07-13 22:19:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v5 00/38] New page table range API

On Thu, Jul 13, 2023 at 10:27:21PM +0200, Christian Borntraeger wrote:
>
>
> Am 13.07.23 um 15:42 schrieb Matthew Wilcox:
> > On Thu, Jul 13, 2023 at 12:42:44PM +0200, Christian Borntraeger wrote:
> > >
> > >
> > > Am 11.07.23 um 14:36 schrieb Matthew Wilcox:
> > > > On Tue, Jul 11, 2023 at 11:07:06AM +0200, Christian Borntraeger wrote:
> > > > > Am 10.07.23 um 22:43 schrieb Matthew Wilcox (Oracle):
> > > > > > This patchset changes the API used by the MM to set up page table entries.
> > > > > > The four APIs are:
> > > > > > set_ptes(mm, addr, ptep, pte, nr)
> > > > > > update_mmu_cache_range(vma, addr, ptep, nr)
> > > > > > flush_dcache_folio(folio)
> > > > > > flush_icache_pages(vma, page, nr)
> > > > > >
> > > > > > flush_dcache_folio() isn't technically new, but no architecture
> > > > > > implemented it, so I've done that for them. The old APIs remain around
> > > > > > but are mostly implemented by calling the new interfaces.
> > > > > >
> > > > > > The new APIs are based around setting up N page table entries at once.
> > > > > > The N entries belong to the same PMD, the same folio and the same VMA,
> > > > > > so ptep++ is a legitimate operation, and locking is taken care of for
> > > > > > you. Some architectures can do a better job of it than just a loop,
> > > > > > but I have hesitated to make too deep a change to architectures I don't
> > > > > > understand well.
> > > > > >
> > > > > > One thing I have changed in every architecture is that PG_arch_1 is now a
> > > > > > per-folio bit instead of a per-page bit. This was something that would
> > > > > > have to happen eventually, and it makes sense to do it now rather than
> > > > > > iterate over every page involved in a cache flush and figure out if it
> > > > > > needs to happen.
> > > > >
> > > > > I think we do use PG_arch_1 on s390 for our secure page handling and
> > > > > making this perf folio instead of physical page really seems wrong
> > > > > and it probably breaks this code.
> > > >
> > > > Per-page flags are going away in the next few years, so you're going to
> > > > need a new design. s390 seems to do a lot of unusual things. I wish
> > > > you'd talk to the rest of us more.
> > >
> > > I understand you point from a logical point of view, but a 4k page frame
> > > is also a hardware defined memory region. And I think not only for us.
> > > How do you want to implement hardware poisoning for example?
> > > Marking the whole folio with PG_hwpoison seems wrong.
> >
> > For hardware poison, we can't use the page for any other purpose any more.
> > So one of the 16 types of pointer is for hardware poison. That doesn't
> > seem like it's a solution that could work for secure/insecure pages?
> >
> > But what I'm really wondering is why you need to transition pages
> > between secure/insecure on a 4kB boundary. What's the downside to doing
> > it on a 16kB or 64kB boundary, or whatever size has been allocated?
>
> The export and import for more pages will be more expensive, but I assume that
> we would then also use the larger chunks (e.g. for paging). The more interesting
> problem is that the guest can make a page shared/non-shared on a 4kb granularity.
>
> Stupid question: can folios be split into folio,single page,folio when needed?

If that's a stupid question, you're going to find the answer utterly
moronic ...

Yes, we have split_folio() today. However, it can fail if somebody else
has a reference to it, and if it does succeed, we don't really have a
join_folio() operation (we have khugepaged which walks around looking
for small folios it can replace with large folios, but that's not really
what you want).
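For context, a minimal sketch of using the existing split_folio() call,
assuming the caller already holds a reference and has the folio locked; a
non-zero return means the split failed, typically because of that extra
reference:

	int err = 0;

	if (folio_test_large(folio))
		err = split_folio(folio);
	if (err)
		return err;	/* could not split; handle the whole folio */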

In the MM of, let's say, 2025, I do intend to support what we might
call a hole in a folio, precisely for hwpoison, and it's beginning to
sound a bit like it might work for you too. So you'd do something like
...

Allocate a 256MB folio for your VM (probably one of many allocations
you'd do to give your VM some memory). That sets 65536 page pointers
to the same value. Then you "secure" all 256MB of it so the
VM can use it all. Then the VM wants the host to read/write a 16kB
chunk of that, so you allocate a "struct insecure_mem" and set four
of the page pointers to point to that instead (it probably contains
a copy of the original page pointer). We'd mark the folio as containing
a hole so that the MM knows something unusual is going on. When you're
done reading/writing the memory, you re-secure it, set the page pointers
back to point to the original folio and free the struct insecure_mem.

Would something like that work for you? Details TBD, of course.
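Purely to make the sketch above concrete, here is what such a side object
might look like; the name "struct insecure_mem" is taken from the description
above, but the fields are invented for illustration:

/* Hypothetical side object describing a "hole" of insecure pages inside
 * a large secure folio, as outlined above. Illustration only. */
struct insecure_mem {
	struct folio *folio;		/* the large folio the hole lives in */
	unsigned long first_idx;	/* index of the first 4kB page covered */
	unsigned int nr_pages;		/* number of 4kB pages de-secured */
	/* plus, per the description, a copy of the original page pointer */
};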


2023-07-21 10:51:10

by Ryan Roberts

[permalink] [raw]
Subject: Re: [PATCH v5 01/38] minmax: Add in_range() macro

On 10/07/2023 21:43, Matthew Wilcox (Oracle) wrote:
> Determine if a value lies within a range more efficiently (subtraction +
> comparison vs two comparisons and an AND). It also has useful (under
> some circumstances) behaviour if the range exceeds the maximum value of
> the type.

Sorry it's taken me a while to get to looking at this.

I'm getting a lot of warnings about in_range() being redefined when building
arm64 (defconfig-ish) with this patch set on top of v6.5-rc2.

Looks like there are multiple existing implementations.

Thanks,
Ryan


2023-07-24 15:30:00

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v5 01/38] minmax: Add in_range() macro

From: Andrew Morton
> Sent: 11 July 2023 00:14
> To: Matthew Wilcox (Oracle) <[email protected]>
> Cc: [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH v5 01/38] minmax: Add in_range() macro
>
> On Mon, 10 Jul 2023 21:43:02 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:
>
> > Determine if a value lies within a range more efficiently (subtraction +
> > comparison vs two comparisons and an AND). It also has useful (under
> > some circumstances) behaviour if the range exceeds the maximum value of
> > the type.
> >
> > Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
> > --- a/include/linux/minmax.h
> > +++ b/include/linux/minmax.h
> > @@ -158,6 +158,32 @@
> > */
> > #define clamp_val(val, lo, hi) clamp_t(typeof(val), val, lo, hi)
> >
> > +static inline bool in_range64(u64 val, u64 start, u64 len)
> > +{
> > + return (val - start) < len;
> > +}
> > +
> > +static inline bool in_range32(u32 val, u32 start, u32 len)
> > +{
> > + return (val - start) < len;
> > +}
> > +
> > +/**
> > + * in_range - Determine if a value lies within a range.
> > + * @val: Value to test.
> > + * @start: First value in range.
> > + * @len: Number of values in range.
> > + *
> > + * This is more efficient than "if (start <= val && val < (start + len))".
> > + * It also gives a different answer if @start + @len overflows the size of
> > + * the type by a sufficient amount to encompass @val. Decide for yourself
> > + * which behaviour you want, or prove that start + len never overflow.
> > + * Do not blindly replace one form with the other.
> > + */
> > +#define in_range(val, start, len) \
> > + sizeof(start) <= sizeof(u32) ? in_range32(val, start, len) : \
> > + in_range64(val, start, len)
>
> There's nothing here to prevent callers from passing a mixture of
> 32-bit and 64-bit values, possibly resulting in truncation of `val' or
> `len'.
>
> Obviously caller is being dumb, but I think it's cost-free to check all
> three of the arguments for 64-bitness?
>
> Or do a min()/max()-style check for consistently typed arguments?

Just use integer promotions to extend everything to 'unsigned long long'.

#define in_range(val, start, len) (((val) + 0ull - (start)) < (len))

If all the values are unsigned 32-bit, the compiler will discard
all the zero extensions.

If values might be signed types (with non-negative values)
you might want to do explicit ((xxx) + 0u + 0ul + 0ull) to avoid
any potentially expensive sign extensions.

David
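Purely as a self-contained userspace illustration of the promotion-based form
suggested above:

#include <stdint.h>
#include <stdio.h>

/* "+ 0ull" widens every operand to unsigned long long before the
 * subtraction, so mixed 32/64-bit arguments are widened, not truncated. */
#define in_range(val, start, len) \
	(((val) + 0ull - (start)) < (len))

int main(void)
{
	uint32_t start = 0xfffffff0u;	/* near the top of the u32 range */
	uint64_t len = 0x20;		/* the range wraps past UINT32_MAX */

	printf("%d\n", in_range(0x100000000ull, start, len));	/* prints 1 */
	printf("%d\n", in_range(0x10u, start, len));		/* prints 0 */
	return 0;
}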
