The series is based on v6.8-rc1 and can be cloned with:
$ git clone https://gitlab.arm.com/linux-arm/linux-ae.git \
-b arm-mte-dynamic-carveout-rfc-v3
Changelog
=========
The changes from the previous version [1] are extensive, so I'll list them
first. Only the major changes are below; individual patches will have their
own changelogs.
I would like to point out that patch #31 ("khugepaged: arm64: Don't
collapse MTE enabled VMAs") might be controversial. Please have a look.
Changes since rfc v2 [1]:
- Patches #5 ("mm: cma: Don't append newline when generating CMA area
name") and #16 ("KVM: arm64: Don't deny VM_PFNMAP VMAs when kvm_has_mte()")
are new and they are fixes. I think they can be merged independently of the
rest of the series.
- Tag storage now uses the CMA API to allocate and free tag storage pages
(David Hildenbrand).
- Tag storage is now described as subnode of 'reserved-memory' (Rob
Herring).
- KVM now has support for dynamic tag storage reuse, added in patches #32
("KVM: arm64: mte: Reserve tag storage for VMs with MTE") and #33 ("KVM:
arm64: mte: Introduce VM_MTE_KVM VMA flag").
- Reserving tag storage when a tagged page is allocated is now a best
effort approach instead of being mandatory. If tag storage cannot be
reserved, the page is marked as protnone and tag storage is reserved when
the fault is taken on the next userspace access to the address.
- ptrace support for pages without tag storage has been added, implemented
in patch #30 ("arm64: mte: ptrace: Handle pages with missing tag storage").
- The following patches have been dropped: #4 ("mm: migrate/mempolicy: Add hook
to modify migration target gfp"), to make the series shorter, and #5 ("mm:
page_alloc: Add an arch hook to allow prep_new_page() to fail"), because
reserving tag storage is now best effort.
- Also dropped patch #13 ("arm64: mte: Make tag storage depend on
ARCH_KEEP_MEMBLOCK") and added a BUILD_BUG_ON() instead (David
Hildenbrand).
- Dropped patch #15 ("arm64: mte: Check that tag storage blocks are in the
same zone") because it's not needed anymore,
cma_init_reserved_areas->cma_activate_area() already does that (David
Hildenbrand).
- Moved patches #1 ("arm64: mte: Rework naming for tag manipulation functions")
and #2 ("arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED") after the changes
to the common code and before tag storage is discovered.
- Patch #12 ("arm64: mte: Add tag storage pages to the MIGRATE_CMA
migratetype") was replaced with patch #20 ("arm64: mte: Add tag storage
memory to CMA") (David Hildenbrand).
- Split patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for
mprotect(PROT_MTE)") into an arch independent part (patch #13, "mm: memory:
Introduce fault-on-access mechanism for pages") and into an arm64 patch (patch
#26, "arm64: mte: Use fault-on-access to reserve missing tag storage"). The
arm64 code is much smaller because of this (David Hildenbrand).
[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
Introduction
============
The Memory Tagging Extension (MTE) is currently implemented with a static
carve-out of the DRAM to store the allocation tags (a.k.a. memory colour).
This is what we call the tag storage. Each 16 bytes of data need 4 bits of
tags, which means 1/32 of the DRAM, roughly 3%, is used for the tag storage. This is
done transparently by the hardware/interconnect (with firmware setup) and
normally hidden from the OS. So a checked memory access to location X
generates a tag fetch from location Y in the carve-out and this tag is
compared with the bits 59:56 in the pointer. The correspondence from X to Y
is linear (subject to a minimum block size to deal with some address
interleaving). The software doesn't need to know about this correspondence
as we have specific instructions like STG/LDG to location X that lead to a
tag store/load to Y.
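For illustration only, the linear correspondence can be pictured as below;
dram_base and tag_region_base are assumptions for the sketch, not values
taken from this series:

/*
 * Every 16-byte granule of data needs 4 bits of tag, so every 32 bytes of
 * data need 1 byte of tag storage.
 */
static phys_addr_t tag_storage_addr(phys_addr_t data_addr, phys_addr_t dram_base,
				    phys_addr_t tag_region_base)
{
	return tag_region_base + (data_addr - dram_base) / 32;
}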
Not all memory used by applications is tagged (mmap(PROT_MTE)). For
example, some large allocations may not use PROT_MTE at all or only for the
first and last page since initialising the tags takes time. And executable
memory is never tagged. The side effect is that of this 3% of DRAM, only
part of it, say 1%, is effectively used.
The series aims to take that unused tag storage and release it to the page
allocator for normal data usage.
The first complication is that a PROT_MTE page allocation at address X will
need to reserve the tag storage page at location Y (and migrate any data in
that page if it is in use).
To make things more complicated, pages in the tag storage/carve-out range
cannot use PROT_MTE themselves on current hardware, so this adds the second
complication - a heterogeneous memory layout. The kernel needs to know
where to allocate a PROT_MTE page from or migrate a current page if it
becomes PROT_MTE (mprotect()) and the range it is in does not support
tagging.
Some other complications are arm64-specific like cache coherency between
tags and data accesses. There is a draft architecture spec which will be
released soon, detailing how the hardware behaves.
All of this will be entirely transparent to userspace. As with the current
kernel (without this dynamic tag storage), a user only needs to ask for
PROT_MTE mappings to get tagged pages.
Implementation
==============
MTE tag storage reuse is accomplished with the following changes to the
Linux kernel:
1. The tag storage memory is exposed to the memory allocator as
MIGRATE_CMA. arm64 uses the newly added function cma_alloc_range() to
reserve tag storage when the associated page is allocated as tagged.
One limitation of this approach is that none of the MIGRATE_CMA memory can
be used for tagged allocations, even if not all of it is tag storage.
2. mprotect(PROT_MTE) is implemented by adding a fault-on-access mechanism
for existing pages. When a page is next accessed, a fault is taken and the
corresponding tag storage is reserved.
3. When the code tries to copy tags to a page that doesn't have tag storage
reserved (when swapping in a newly allocated page, or during migration/THP
collapse), the tags are copied to an xarray and restored when tag storage is
reserved for the destination page; see the sketch after this list.
4. KVM allows VMAs without MTE enabled to represent the memory of a virtual
machine with MTE enabled. Even though the host treats the pages that
represent guest memory as untagged, they have tags associated with them,
which are used by the guest. To make dynamic tag storage work with KVM, two
changes were necessary: first, try to reserve tag storage when a guest
accesses an address for the first time and, if that is not possible, migrate
the page to replace it with one that has tag storage reserved; second, add a
new VMA flag, VM_MTE_KVM, so the page allocator will not use tag storage
pages (which cannot be tagged) for VM memory. The second change is a
performance optimization.
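As a rough illustration of point 3, the tags can be parked in an xarray
indexed by pfn and restored once tag storage has been reserved. The names
below are illustrative (modelled on the helpers renamed in patch #17), not
the actual code from the series:

static DEFINE_XARRAY(tags_by_pfn);

/* Park a previously filled tag buffer for a page without tag storage. */
static int save_tags_for_pfn(unsigned long pfn, void *tags)
{
	return xa_err(xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL));
}

/* Restore and free the parked tags once tag storage has been reserved. */
static void restore_tags_for_page(struct page *page)
{
	void *tags = xa_erase(&tags_by_pfn, page_to_pfn(page));

	if (tags) {
		mte_copy_page_tags_from_buf(page_address(page), tags);
		mte_free_tag_buf(tags);
	}
}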
Testing
=======
To enable MTE dynamic tag storage:
- CONFIG_ARM64_MTE_TAG_STORAGE=y
- system_supports_mte() returns true
- kasan_hw_tags_enabled() returns false
- correct DTB node. For an example that works with FVP, have a look at
patch #35 ("HACK! Add fake tag storage to fvp-base-revc.dts")
Check dmesg for the message "MTE tag storage region management enabled".
Alexandru Elisei (35):
mm: page_alloc: Add gfp_flags parameter to arch_alloc_page()
mm: page_alloc: Add an arch hook early in free_pages_prepare()
mm: page_alloc: Add an arch hook to filter MIGRATE_CMA allocations
mm: page_alloc: Partially revert "mm: page_alloc: remove stale CMA
guard code"
mm: cma: Don't append newline when generating CMA area name
mm: cma: Make CMA_ALLOC_SUCCESS/FAIL count the number of pages
mm: cma: Add CMA_RELEASE_{SUCCESS,FAIL} events
mm: cma: Introduce cma_alloc_range()
mm: cma: Introduce cma_remove_mem()
mm: cma: Fast track allocating memory when the pages are free
mm: Allow an arch to hook into folio allocation when VMA is known
mm: Call arch_swap_prepare_to_restore() before arch_swap_restore()
mm: memory: Introduce fault-on-access mechanism for pages
of: fdt: Return the region size in of_flat_dt_translate_address()
of: fdt: Add of_flat_read_u32()
KVM: arm64: Don't deny VM_PFNMAP VMAs when kvm_has_mte()
arm64: mte: Rework naming for tag manipulation functions
arm64: mte: Rename __GFP_ZEROTAGS to __GFP_TAGGED
arm64: mte: Discover tag storage memory
arm64: mte: Add tag storage memory to CMA
arm64: mte: Disable dynamic tag storage management if HW KASAN is
enabled
arm64: mte: Enable tag storage if CMA areas have been activated
arm64: mte: Try to reserve tag storage in arch_alloc_page()
arm64: mte: Perform CMOs for tag blocks
arm64: mte: Reserve tag block for the zero page
arm64: mte: Use fault-on-access to reserve missing tag storage
arm64: mte: Handle tag storage pages mapped in an MTE VMA
arm64: mte: swap: Handle tag restoring when missing tag storage
arm64: mte: copypage: Handle tag restoring when missing tag storage
arm64: mte: ptrace: Handle pages with missing tag storage
khugepaged: arm64: Don't collapse MTE enabled VMAs
KVM: arm64: mte: Reserve tag storage for virtual machines with MTE
KVM: arm64: mte: Introduce VM_MTE_KVM VMA flag
arm64: mte: Enable dynamic tag storage management
HACK! Add fake tag storage to fvp-base-revc.dts
.../reserved-memory/arm,mte-tag-storage.yaml | 78 +++
arch/arm64/Kconfig | 14 +
arch/arm64/boot/dts/arm/fvp-base-revc.dts | 42 +-
arch/arm64/include/asm/assembler.h | 10 +
arch/arm64/include/asm/mte-def.h | 16 +-
arch/arm64/include/asm/mte.h | 43 +-
arch/arm64/include/asm/mte_tag_storage.h | 83 +++
arch/arm64/include/asm/page.h | 10 +-
arch/arm64/include/asm/pgtable-prot.h | 2 +
arch/arm64/include/asm/pgtable.h | 93 ++-
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/elfcore.c | 14 +-
arch/arm64/kernel/hibernate.c | 46 +-
arch/arm64/kernel/mte.c | 37 +-
arch/arm64/kernel/mte_tag_storage.c | 643 ++++++++++++++++++
arch/arm64/kvm/mmu.c | 128 +++-
arch/arm64/lib/mte.S | 34 +-
arch/arm64/mm/copypage.c | 56 ++
arch/arm64/mm/fault.c | 133 +++-
arch/arm64/mm/init.c | 3 +
arch/arm64/mm/mteswap.c | 160 ++++-
arch/s390/include/asm/page.h | 2 +-
arch/s390/mm/page-states.c | 2 +-
arch/sh/kernel/cpu/sh2/probe.c | 2 +-
drivers/of/fdt.c | 21 +
drivers/of/fdt_address.c | 12 +-
drivers/tty/serial/earlycon.c | 2 +-
fs/proc/page.c | 1 +
include/linux/cma.h | 3 +
include/linux/gfp.h | 2 +-
include/linux/gfp_types.h | 6 +-
include/linux/huge_mm.h | 4 +-
include/linux/kernel-page-flags.h | 1 +
include/linux/khugepaged.h | 5 +
include/linux/memcontrol.h | 2 +
include/linux/migrate.h | 8 +-
include/linux/migrate_mode.h | 1 +
include/linux/mm.h | 2 +
include/linux/of_fdt.h | 4 +-
include/linux/page-flags.h | 16 +-
include/linux/pgtable.h | 72 +-
include/linux/vm_event_item.h | 2 +
include/trace/events/cma.h | 59 ++
include/trace/events/mmflags.h | 5 +-
mm/Kconfig | 8 +
mm/cma.c | 166 ++++-
mm/huge_memory.c | 37 +-
mm/internal.h | 6 -
mm/khugepaged.c | 4 +
mm/memory-failure.c | 8 +-
mm/memory.c | 55 +-
mm/mempolicy.c | 1 +
mm/page_alloc.c | 46 +-
mm/shmem.c | 14 +-
mm/swapfile.c | 5 +
mm/vmstat.c | 2 +
56 files changed, 2016 insertions(+), 216 deletions(-)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml
create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
create mode 100644 arch/arm64/kernel/mte_tag_storage.c
base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
--
2.43.0
Extend the usefulness of arch_alloc_page() by adding the gfp_flags
parameter.
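For example, an architecture can then make allocation-time decisions based
on the gfp flags. A simplified sketch of the kind of arm64 user this enables
(reserve_tag_storage() is a placeholder and __GFP_TAGGED is only introduced
later in the series; this is not part of this patch):

void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags)
{
	/* Reserve the tag storage backing the page if it was allocated tagged. */
	if (system_supports_mte() && (gfp_flags & __GFP_TAGGED))
		reserve_tag_storage(page, order, gfp_flags);
}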
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch.
arch/s390/include/asm/page.h | 2 +-
arch/s390/mm/page-states.c | 2 +-
include/linux/gfp.h | 2 +-
mm/page_alloc.c | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
index 73b9c3bf377f..859f0958c574 100644
--- a/arch/s390/include/asm/page.h
+++ b/arch/s390/include/asm/page.h
@@ -163,7 +163,7 @@ static inline int page_reset_referenced(unsigned long addr)
struct page;
void arch_free_page(struct page *page, int order);
-void arch_alloc_page(struct page *page, int order);
+void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags);
static inline int devmem_is_allowed(unsigned long pfn)
{
diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
index 01f9b39e65f5..b986c8b158e3 100644
--- a/arch/s390/mm/page-states.c
+++ b/arch/s390/mm/page-states.c
@@ -21,7 +21,7 @@ void arch_free_page(struct page *page, int order)
__set_page_unused(page_to_virt(page), 1UL << order);
}
-void arch_alloc_page(struct page *page, int order)
+void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags)
{
if (!cmma_flag)
return;
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index de292a007138..9e8aa3d144db 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -172,7 +172,7 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
static inline void arch_free_page(struct page *page, int order) { }
#endif
#ifndef HAVE_ARCH_ALLOC_PAGE
-static inline void arch_alloc_page(struct page *page, int order) { }
+static inline void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags) { }
#endif
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 150d4f23b010..2c140abe5ee6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1485,7 +1485,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_private(page, 0);
set_page_refcounted(page);
- arch_alloc_page(page, order);
+ arch_alloc_page(page, order, gfp_flags);
debug_pagealloc_map_pages(page, 1 << order);
/*
--
2.43.0
The arm64 MTE code uses the PG_arch_2 page flag, which it renames to
PG_mte_tagged, to track if a page has been mapped with tagging enabled.
That flag is cleared by free_pages_prepare() by doing:
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
When tag storage management is added, tag storage will be reserved for a
page if and only if the page is mapped as tagged (the page flag
PG_mte_tagged is set). Likewise, when a page is freed, the code will have
to look at the page flags to determine if the page has tag storage
reserved, which should also be freed.
For this purpose, add an arch_free_pages_prepare() hook that is called
before the page flags are cleared. The function arch_free_page() has also
been considered for this purpose, but it is called after the flags are
cleared.
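A simplified sketch of the kind of arm64 user this enables
(free_tag_storage() is a placeholder for the helper added later in the
series, not part of this patch):

void arch_free_pages_prepare(struct page *page, int order)
{
	/* PG_mte_tagged has not been wiped yet, so the check below still works. */
	if (page_mte_tagged(page))
		free_tag_storage(page, order);
}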
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* Expanded commit message (David Hildenbrand).
include/linux/pgtable.h | 4 ++++
mm/page_alloc.c | 1 +
2 files changed, 5 insertions(+)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948..6d98d5fdd697 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -901,6 +901,10 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
}
#endif
+#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
+static inline void arch_free_pages_prepare(struct page *page, int order) { }
+#endif
+
#ifndef __HAVE_ARCH_UNMAP_ONE
/*
* Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c140abe5ee6..27282a1c82fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1092,6 +1092,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
trace_mm_page_free(page, order);
kmsan_free_page(page, order);
+ arch_free_pages_prepare(page, order);
if (memcg_kmem_online() && PageMemcgKmem(page))
__memcg_kmem_uncharge_page(page, order);
--
2.43.0
The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
because CMA is always allowed when __GFP_MOVABLE is set.
With the introduction of the arch_alloc_cma() function, the above is no
longer true, so bring back the filter.
This is a partial revert, because the stale comment remains removed.
Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/page_alloc.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a96d47a6393e..0fa34bcfb1af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2897,10 +2897,17 @@ struct page *rmqueue(struct zone *preferred_zone,
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
if (likely(pcp_allowed_order(order))) {
- page = rmqueue_pcplist(preferred_zone, zone, order,
- migratetype, alloc_flags);
- if (likely(page))
- goto out;
+ /*
+ * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
+ * we need to skip it when CMA area isn't allowed.
+ */
+ if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
+ migratetype != MIGRATE_MOVABLE) {
+ page = rmqueue_pcplist(preferred_zone, zone, order,
+ migratetype, alloc_flags);
+ if (likely(page))
+ goto out;
+ }
}
page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
--
2.43.0
cma->name is displayed in several CMA messages. When the name is generated
by the CMA code, don't append a newline to avoid breaking the text across
two lines.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch. This is a fix, and can be merged independently of the other
patches.
mm/cma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/cma.c b/mm/cma.c
index 7c09c47e530b..f49c95f8ee37 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -204,7 +204,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
if (name)
snprintf(cma->name, CMA_MAX_NAME, name);
else
- snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count);
+ snprintf(cma->name, CMA_MAX_NAME, "cma%d", cma_area_count);
cma->base_pfn = PFN_DOWN(base);
cma->count = size >> PAGE_SHIFT;
--
2.43.0
As an architecture might have specific requirements around the allocation
of CMA pages, add an arch hook that can disable allocations from
MIGRATE_CMA, if the allocation was otherwise allowed.
This will be used by arm64, which will put tag storage pages on the
MIGRATE_CMA list, and tag storage pages cannot be tagged. The filter will
be used to deny using MIGRATE_CMA for __GFP_TAGGED allocations.
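A sketch of the arm64 filter described above (simplified; the actual arm64
implementation comes later in the series):

static inline bool arch_alloc_cma(gfp_t gfp_mask)
{
	/* Tag storage pages cannot themselves be tagged. */
	if (gfp_mask & __GFP_TAGGED)
		return false;
	return true;
}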
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/pgtable.h | 7 +++++++
mm/page_alloc.c | 3 ++-
2 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 6d98d5fdd697..c5ddec6b5305 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -905,6 +905,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
static inline void arch_free_pages_prepare(struct page *page, int order) { }
#endif
+#ifndef __HAVE_ARCH_ALLOC_CMA
+static inline bool arch_alloc_cma(gfp_t gfp)
+{
+ return true;
+}
+#endif
+
#ifndef __HAVE_ARCH_UNMAP_ONE
/*
* Some architectures support metadata associated with a page. When a
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 27282a1c82fe..a96d47a6393e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3157,7 +3157,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
unsigned int alloc_flags)
{
#ifdef CONFIG_CMA
- if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
+ if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
+ arch_alloc_cma(gfp_mask))
alloc_flags |= ALLOC_CMA;
#endif
return alloc_flags;
--
2.43.0
Similar to the two events that relate to CMA allocations, add the
CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages
are freed.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch.
include/linux/vm_event_item.h | 2 ++
mm/cma.c | 6 +++++-
mm/vmstat.c | 2 ++
3 files changed, 9 insertions(+), 1 deletion(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 747943bc8cc2..aba5c5bf8127 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -83,6 +83,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#ifdef CONFIG_CMA
CMA_ALLOC_SUCCESS,
CMA_ALLOC_FAIL,
+ CMA_RELEASE_SUCCESS,
+ CMA_RELEASE_FAIL,
#endif
UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
diff --git a/mm/cma.c b/mm/cma.c
index dbf7fe8cb1bd..543bb6b3be8e 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -562,8 +562,10 @@ bool cma_release(struct cma *cma, const struct page *pages,
{
unsigned long pfn;
- if (!cma_pages_valid(cma, pages, count))
+ if (!cma_pages_valid(cma, pages, count)) {
+ count_vm_events(CMA_RELEASE_FAIL, count);
return false;
+ }
pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count);
@@ -575,6 +577,8 @@ bool cma_release(struct cma *cma, const struct page *pages,
cma_clear_bitmap(cma, pfn, count);
trace_cma_release(cma->name, pfn, pages, count);
+ count_vm_events(CMA_RELEASE_SUCCESS, count);
+
return true;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index db79935e4a54..eebfd5c6c723 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1340,6 +1340,8 @@ const char * const vmstat_text[] = {
#ifdef CONFIG_CMA
"cma_alloc_success",
"cma_alloc_fail",
+ "cma_release_success",
+ "cma_release_fail",
#endif
"unevictable_pgs_culled",
"unevictable_pgs_scanned",
--
2.43.0
The CMA_ALLOC_SUCCESS and CMA_ALLOC_FAIL counters are each increased by one
after a cma_alloc() function call. This is done even though cma_alloc()
can allocate an arbitrary number of CMA pages. When looking at
/proc/vmstat, the number of successful (or failed) cma_alloc() calls
doesn't tell much with regards to how many CMA pages were allocated via
cma_alloc() versus via the page allocator (regular allocation request or
PCP lists refill).
This can also be rather confusing to a user who isn't familiar with the
code, since the unit of measurement for nr_free_cma is the number of pages,
but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
function calls.
Let's make this consistent, and arguably more useful, by having
CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
allocate.
For users that wish to track the number of cma_alloc() calls, tracepoints
already exist for that purpose.
Signed-off-by: Alexandru Elisei <[email protected]>
---
mm/cma.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index f49c95f8ee37..dbf7fe8cb1bd 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
pr_debug("%s(): returned %p\n", __func__, page);
out:
if (page) {
- count_vm_event(CMA_ALLOC_SUCCESS);
+ count_vm_events(CMA_ALLOC_SUCCESS, count);
cma_sysfs_account_success_pages(cma, count);
} else {
- count_vm_event(CMA_ALLOC_FAIL);
+ count_vm_events(CMA_ALLOC_FAIL, count);
if (cma)
cma_sysfs_account_fail_pages(cma, count);
}
--
2.43.0
Today, cma_alloc() is used to allocate a contiguous memory region. The
function allows the caller to specify the number of pages to allocate, but
not the starting address. cma_alloc() will walk over the entire CMA region
trying to allocate the first available range of the specified size.
Introduce cma_alloc_range(), which makes CMA more versatile by allowing the
caller to specify a particular range in the CMA region, defined by the
start pfn and the size.
arm64 will make use of this function when tag storage management is
implemented: cma_alloc_range() will be used to reserve the tag storage
associated with a tagged page.
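As a usage sketch (tag_storage_cma, block_pfn and pages_per_block are
assumed names, not from this patch):

/* Try to reserve the tag storage block backing a newly allocated tagged page. */
static int reserve_tag_block(unsigned long block_pfn, unsigned long pages_per_block)
{
	return cma_alloc_range(tag_storage_cma, block_pfn, pages_per_block,
			       5, GFP_KERNEL);
}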
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch.
include/linux/cma.h | 2 +
include/trace/events/cma.h | 59 ++++++++++++++++++++++++++
mm/cma.c | 86 ++++++++++++++++++++++++++++++++++++++
3 files changed, 147 insertions(+)
diff --git a/include/linux/cma.h b/include/linux/cma.h
index 63873b93deaa..e32559da6942 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
struct cma **res_cma);
extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
bool no_warn);
+extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
+ unsigned tries, gfp_t gfp);
extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h
index 25103e67737c..a89af313a572 100644
--- a/include/trace/events/cma.h
+++ b/include/trace/events/cma.h
@@ -36,6 +36,65 @@ TRACE_EVENT(cma_release,
__entry->count)
);
+TRACE_EVENT(cma_alloc_range_start,
+
+ TP_PROTO(const char *name, unsigned long start, unsigned long count,
+ unsigned tries),
+
+ TP_ARGS(name, start, count, tries),
+
+ TP_STRUCT__entry(
+ __string(name, name)
+ __field(unsigned long, start)
+ __field(unsigned long, count)
+ __field(unsigned, tries)
+ ),
+
+ TP_fast_assign(
+ __assign_str(name, name);
+ __entry->start = start;
+ __entry->count = count;
+ __entry->tries = tries;
+ ),
+
+ TP_printk("name=%s start=%lx count=%lu tries=%u",
+ __get_str(name),
+ __entry->start,
+ __entry->count,
+ __entry->tries)
+);
+
+TRACE_EVENT(cma_alloc_range_finish,
+
+ TP_PROTO(const char *name, unsigned long start, unsigned long count,
+ unsigned attempts, int err),
+
+ TP_ARGS(name, start, count, attempts, err),
+
+ TP_STRUCT__entry(
+ __string(name, name)
+ __field(unsigned long, start)
+ __field(unsigned long, count)
+ __field(unsigned, attempts)
+ __field(int, err)
+ ),
+
+ TP_fast_assign(
+ __assign_str(name, name);
+ __entry->start = start;
+ __entry->count = count;
+ __entry->attempts = attempts;
+ __entry->err = err;
+ ),
+
+ TP_printk("name=%s start=%lx count=%lu attempts=%u err=%d",
+ __get_str(name),
+ __entry->start,
+ __entry->count,
+ __entry->attempts,
+ __entry->err)
+);
+
TRACE_EVENT(cma_alloc_start,
TP_PROTO(const char *name, unsigned long count, unsigned int align),
diff --git a/mm/cma.c b/mm/cma.c
index 543bb6b3be8e..4a0f68b9443b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -416,6 +416,92 @@ static void cma_debug_show_areas(struct cma *cma)
static inline void cma_debug_show_areas(struct cma *cma) { }
#endif
+/**
+ * cma_alloc_range() - allocate pages in a specific range
+ * @cma: Contiguous memory region for which the allocation is performed.
+ * @start: Starting pfn of the allocation.
+ * @count: Requested number of pages
+ * @tries: Number of tries if the range is busy
+ * @gfp: Allocation flags passed to alloc_contig_range()
+ *
+ * This function allocates part of contiguous memory from a specific contiguous
+ * memory area, from the specified starting address. The 'start' pfn and the
+ * 'count' number of pages must be aligned to the CMA bitmap order per bit.
+ */
+int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
+ unsigned tries, gfp_t gfp)
+{
+ unsigned long bitmap_maxno, bitmap_no, bitmap_start, bitmap_count;
+ unsigned long i = 0;
+ struct page *page;
+ int err = -EINVAL;
+
+ if (!cma || !cma->count || !cma->bitmap)
+ goto out_stats;
+
+ trace_cma_alloc_range_start(cma->name, start, count, tries);
+
+ if (!count || start < cma->base_pfn ||
+ start + count > cma->base_pfn + cma->count)
+ goto out_stats;
+
+ if (!IS_ALIGNED(start | count, 1 << cma->order_per_bit))
+ goto out_stats;
+
+ bitmap_start = (start - cma->base_pfn) >> cma->order_per_bit;
+ bitmap_maxno = cma_bitmap_maxno(cma);
+ bitmap_count = cma_bitmap_pages_to_bits(cma, count);
+
+ spin_lock_irq(&cma->lock);
+ bitmap_no = bitmap_find_next_zero_area(cma->bitmap, bitmap_maxno,
+ bitmap_start, bitmap_count, 0);
+ if (bitmap_no != bitmap_start) {
+ spin_unlock_irq(&cma->lock);
+ err = -EEXIST;
+ goto out_stats;
+ }
+ bitmap_set(cma->bitmap, bitmap_start, bitmap_count);
+ spin_unlock_irq(&cma->lock);
+
+ for (i = 0; i < tries; i++) {
+ mutex_lock(&cma_mutex);
+ err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp);
+ mutex_unlock(&cma_mutex);
+
+ if (err != -EBUSY)
+ break;
+ }
+
+ if (err) {
+ cma_clear_bitmap(cma, start, count);
+ } else {
+ page = pfn_to_page(start);
+
+ /*
+ * CMA can allocate multiple page blocks, which results in
+ * different blocks being marked with different tags. Reset the
+ * tags to ignore those page blocks.
+ */
+ for (i = 0; i < count; i++)
+ page_kasan_tag_reset(nth_page(page, i));
+ }
+
+out_stats:
+ trace_cma_alloc_range_finish(cma->name, start, count, i, err);
+
+ if (err) {
+ count_vm_events(CMA_ALLOC_FAIL, count);
+ if (cma)
+ cma_sysfs_account_fail_pages(cma, count);
+ } else {
+ count_vm_events(CMA_ALLOC_SUCCESS, count);
+ cma_sysfs_account_success_pages(cma, count);
+ }
+
+ return err;
+}
+
+
/**
* cma_alloc() - allocate pages from contiguous area
* @cma: Contiguous memory region for which the allocation is performed.
--
2.43.0
Memory is added to CMA with cma_declare_contiguous_nid() and
cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in
cma_init_reserved_areas(), where the page allocator can make use of it.
If a device manages multiple CMA areas, and there's an error when one of
the areas is added to CMA, there is no mechanism for the device to prevent
the rest of the areas, which were added before the error occurred, from
being later added to the MIGRATE_CMA list.
Add cma_remove_mem(), which allows a previously reserved CMA area to be
removed so that it cannot be used by the page allocator.
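A hypothetical error path in a driver that registers several CMA areas might
look like this (all names and bounds are illustrative):

static int __init register_tag_storage_areas(void)
{
	struct cma *areas[MAX_AREAS];
	int i, ret;

	for (i = 0; i < nr_areas; i++) {
		ret = cma_init_reserved_mem(base[i], size[i], 0, names[i], &areas[i]);
		if (ret)
			goto out_remove;
	}
	return 0;

out_remove:
	/* Keep the already reserved areas off the MIGRATE_CMA list. */
	while (i--)
		cma_remove_mem(&areas[i]);
	return ret;
}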
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch.
include/linux/cma.h | 1 +
mm/cma.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 30 insertions(+), 1 deletion(-)
diff --git a/include/linux/cma.h b/include/linux/cma.h
index e32559da6942..787cbec1702e 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
unsigned int order_per_bit,
const char *name,
struct cma **res_cma);
+extern void cma_remove_mem(struct cma **res_cma);
extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
bool no_warn);
extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
diff --git a/mm/cma.c b/mm/cma.c
index 4a0f68b9443b..2881bab12b01 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void)
{
int i;
- for (i = 0; i < cma_area_count; i++)
+ for (i = 0; i < cma_area_count; i++) {
+ /* Region was removed. */
+ if (!cma_areas[i].count)
+ continue;
cma_activate_area(&cma_areas[i]);
+ }
return 0;
}
@@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
return 0;
}
+/**
+ * cma_remove_mem() - remove cma area
+ * @res_cma: Pointer to the cma region.
+ *
+ * This function removes a cma region created with cma_init_reserved_mem(). The
+ * ->count is set to 0.
+ */
+void __init cma_remove_mem(struct cma **res_cma)
+{
+ struct cma *cma;
+
+ if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
+ return;
+
+ cma = *res_cma;
+ if (WARN_ON_ONCE(!cma->count))
+ return;
+
+ totalcma_pages -= cma->count;
+ cma->count = 0;
+
+ *res_cma = NULL;
+}
+
/**
* cma_declare_contiguous_nid() - reserve custom contiguous area
* @base: Base address of the reserved area optional, use 0 for any
--
2.43.0
arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
set in vma_alloc_zeroed_movable_folio().
Expand this to be more generic by adding an arch hook that modifies the gfp
flags for an allocation when the VMA is known.
Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
is also set; from that point of view, the current behaviour is unchanged,
even though the arm64 flag is set in more places. When arm64 gains support
for reusing the tag storage for data allocations, the uses of the
__GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
to reserve the corresponding tag storage for the pages being allocated.
The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by
the caller; this has been done to keep an architecture from modifying the
flags already set by the core memory management code; this is similar to
how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been
implemented. This can be revisited in the future if there's a need to do
so.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/page.h | 5 ++---
arch/arm64/include/asm/pgtable.h | 3 +++
arch/arm64/mm/fault.c | 19 ++++++-------------
include/linux/pgtable.h | 7 +++++++
mm/mempolicy.c | 1 +
mm/shmem.c | 5 ++++-
6 files changed, 23 insertions(+), 17 deletions(-)
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 2312e6ee595f..88bab032a493 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from,
void copy_highpage(struct page *to, struct page *from);
#define __HAVE_ARCH_COPY_HIGHPAGE
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
- unsigned long vaddr);
-#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
+#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
+ vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
void tag_clear_highpage(struct page *to);
#define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..08f0904dbfc2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#endif /* CONFIG_ARM64_MTE */
+#define __HAVE_ARCH_CALC_VMA_GFP
+gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp);
+
/*
* On AArch64, the cache coherency is handled via the set_pte_at() function.
*/
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 55f6455a8284..4d3f0a870ad8 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -937,22 +937,15 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
NOKPROBE_SYMBOL(do_debug_exception);
/*
- * Used during anonymous page fault handling.
+ * If this is called during anonymous page fault handling, and the page is
+ * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as this
+ * is usually faster than separate DC ZVA and STGM.
*/
-struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
- unsigned long vaddr)
+gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
{
- gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
-
- /*
- * If the page is mapped with PROT_MTE, initialise the tags at the
- * point of allocation and page zeroing as this is usually faster than
- * separate DC ZVA and STGM.
- */
if (vma->vm_flags & VM_MTE)
- flags |= __GFP_ZEROTAGS;
-
- return vma_alloc_folio(flags, 0, vma, vaddr, false);
+ return __GFP_ZEROTAGS;
+ return 0;
}
void tag_clear_highpage(struct page *page)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5ddec6b5305..98f81ca08cbe 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -901,6 +901,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
}
#endif
+#ifndef __HAVE_ARCH_CALC_VMA_GFP
+static inline gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
+{
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
static inline void arch_free_pages_prepare(struct page *page, int order) { }
#endif
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..f7ef52760b32 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2168,6 +2168,7 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
pgoff_t ilx;
struct page *page;
+ gfp |= arch_calc_vma_gfp(vma, gfp);
pol = get_vma_policy(vma, addr, order, &ilx);
page = alloc_pages_mpol(gfp | __GFP_COMP, order,
pol, ilx, numa_node_id());
diff --git a/mm/shmem.c b/mm/shmem.c
index d7c84ff62186..14427e9982f9 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1585,7 +1585,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
*/
static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
{
- gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
+ gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_ZEROTAGS;
gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
@@ -2038,6 +2038,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
gfp_t huge_gfp;
huge_gfp = vma_thp_gfp_mask(vma);
+ huge_gfp |= arch_calc_vma_gfp(vma, huge_gfp);
huge_gfp = limit_gfp_mask(huge_gfp, gfp);
folio = shmem_alloc_and_add_folio(huge_gfp,
inode, index, fault_mm, true);
@@ -2214,6 +2215,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
vm_fault_t ret = 0;
int err;
+ gfp |= arch_calc_vma_gfp(vmf->vma, gfp);
+
/*
* Trinity finds that probing a hole which tmpfs is punching can
* prevent the hole-punch from ever completing: noted in i_private.
--
2.43.0
If the pages to be allocated are free, take them directly off the buddy
allocator instead of going through alloc_contig_range(), thus avoiding
costly calls to lru_cache_disable().
Only allocations of the same size as the CMA region order are considered,
to avoid taking the zone spinlock for too long.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track
reserving tag storage when the block is free") (David Hildenbrand).
include/linux/page-flags.h | 15 ++++++++++++--
mm/Kconfig | 5 +++++
mm/cma.c | 42 ++++++++++++++++++++++++++++++++++----
mm/memory-failure.c | 8 ++++----
mm/page_alloc.c | 23 ++++++++++++---------
5 files changed, 73 insertions(+), 20 deletions(-)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 735cddc13d20..b7237bce7446 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
#define MAGIC_HWPOISON 0x48575053U /* HWPS */
extern void SetPageHWPoisonTakenOff(struct page *page);
extern void ClearPageHWPoisonTakenOff(struct page *page);
-extern bool take_page_off_buddy(struct page *page);
-extern bool put_page_back_buddy(struct page *page);
+extern bool PageHWPoisonTakenOff(struct page *page);
#else
PAGEFLAG_FALSE(HWPoison, hwpoison)
+TESTSCFLAG_FALSE(HWPoison, hwpoison)
#define __PG_HWPOISON 0
+static inline void SetPageHWPoisonTakenOff(struct page *page) { }
+static inline void ClearPageHWPoisonTakenOff(struct page *page) { }
+static inline bool PageHWPoisonTakenOff(struct page *page)
+{
+ return false;
+}
+#endif
+
+#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
+extern bool take_page_off_buddy(struct page *page, bool poison);
+extern bool put_page_back_buddy(struct page *page, bool unpoison);
#endif
#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..341cf53898db 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR
config ARCH_SUPPORTS_MEMORY_FAILURE
bool
+config WANTS_TAKE_PAGE_OFF_BUDDY
+ bool
+
config MEMORY_FAILURE
depends on MMU
depends on ARCH_SUPPORTS_MEMORY_FAILURE
bool "Enable recovery from hardware memory errors"
select MEMORY_ISOLATION
select RAS
+ select WANTS_TAKE_PAGE_OFF_BUDDY
help
Enables code to recover from some memory failures on systems
with MCA recovery. This allows a system to continue running
@@ -891,6 +895,7 @@ config CMA
depends on MMU
select MIGRATION
select MEMORY_ISOLATION
+ select WANTS_TAKE_PAGE_OFF_BUDDY
help
This enables the Contiguous Memory Allocator which allows other
subsystems to allocate big physically-contiguous blocks of memory.
diff --git a/mm/cma.c b/mm/cma.c
index 2881bab12b01..15663f95d77b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma)
static inline void cma_debug_show_areas(struct cma *cma) { }
#endif
+/* Called with the cma mutex held. */
+static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start,
+ unsigned long end)
+{
+ bool success = false;
+ unsigned long i, j;
+
+ /* Avoid contention on the zone lock. */
+ if (end - start != 1 << cma->order_per_bit)
+ return -EINVAL;
+
+ for (i = start; i < end; i++) {
+ if (!is_free_buddy_page(pfn_to_page(i)))
+ break;
+ success = take_page_off_buddy(pfn_to_page(i), false);
+ if (!success)
+ break;
+ }
+
+ if (i == end)
+ return 0;
+
+ for (j = start; j < i; j++)
+ put_page_back_buddy(pfn_to_page(j), false);
+
+ return -EBUSY;
+}
+
/**
* cma_alloc_range() - allocate pages in a specific range
* @cma: Contiguous memory region for which the allocation is performed.
@@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
for (i = 0; i < tries; i++) {
mutex_lock(&cma_mutex);
- err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp);
+ err = cma_alloc_pages_fastpath(cma, start, start + count);
+ if (err) {
+ err = alloc_contig_range(start, start + count,
+ MIGRATE_CMA, gfp);
+ }
mutex_unlock(&cma_mutex);
if (err != -EBUSY)
@@ -529,7 +561,6 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
return err;
}
-
/**
* cma_alloc() - allocate pages from contiguous area
* @cma: Contiguous memory region for which the allocation is performed.
@@ -589,8 +620,11 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
mutex_lock(&cma_mutex);
- ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
- GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+ ret = cma_alloc_pages_fastpath(cma, pfn, pfn + count);
+ if (ret) {
+ ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
+ GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+ }
mutex_unlock(&cma_mutex);
if (ret == 0) {
page = pfn_to_page(pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4f9b61f4a668..b87b533a9871 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -157,7 +157,7 @@ static int __page_handle_poison(struct page *page)
zone_pcp_disable(page_zone(page));
ret = dissolve_free_huge_page(page);
if (!ret)
- ret = take_page_off_buddy(page);
+ ret = take_page_off_buddy(page, true);
zone_pcp_enable(page_zone(page));
return ret;
@@ -1353,7 +1353,7 @@ static int page_action(struct page_state *ps, struct page *p,
return action_result(pfn, ps->type, result);
}
-static inline bool PageHWPoisonTakenOff(struct page *page)
+bool PageHWPoisonTakenOff(struct page *page)
{
return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
}
@@ -2247,7 +2247,7 @@ int memory_failure(unsigned long pfn, int flags)
res = get_hwpoison_page(p, flags);
if (!res) {
if (is_free_buddy_page(p)) {
- if (take_page_off_buddy(p)) {
+ if (take_page_off_buddy(p, true)) {
page_ref_inc(p);
res = MF_RECOVERED;
} else {
@@ -2578,7 +2578,7 @@ int unpoison_memory(unsigned long pfn)
ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;
} else if (ghp < 0) {
if (ghp == -EHWPOISON) {
- ret = put_page_back_buddy(p) ? 0 : -EBUSY;
+ ret = put_page_back_buddy(p, true) ? 0 : -EBUSY;
} else {
ret = ghp;
unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0fa34bcfb1af..502ee3eb8583 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6655,7 +6655,7 @@ bool is_free_buddy_page(struct page *page)
}
EXPORT_SYMBOL(is_free_buddy_page);
-#ifdef CONFIG_MEMORY_FAILURE
+#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
/*
* Break down a higher-order page in sub-pages, and keep our target out of
* buddy allocator.
@@ -6687,9 +6687,9 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
}
/*
- * Take a page that will be marked as poisoned off the buddy allocator.
+ * Take a page off the buddy allocator, and optionally mark it as poisoned.
*/
-bool take_page_off_buddy(struct page *page)
+bool take_page_off_buddy(struct page *page, bool poison)
{
struct zone *zone = page_zone(page);
unsigned long pfn = page_to_pfn(page);
@@ -6710,7 +6710,8 @@ bool take_page_off_buddy(struct page *page)
del_page_from_free_list(page_head, zone, page_order);
break_down_buddy_pages(zone, page_head, page, 0,
page_order, migratetype);
- SetPageHWPoisonTakenOff(page);
+ if (poison)
+ SetPageHWPoisonTakenOff(page);
if (!is_migrate_isolate(migratetype))
__mod_zone_freepage_state(zone, -1, migratetype);
ret = true;
@@ -6724,9 +6725,10 @@ bool take_page_off_buddy(struct page *page)
}
/*
- * Cancel takeoff done by take_page_off_buddy().
+ * Cancel takeoff done by take_page_off_buddy(), and optionally unpoison the
+ * page.
*/
-bool put_page_back_buddy(struct page *page)
+bool put_page_back_buddy(struct page *page, bool unpoison)
{
struct zone *zone = page_zone(page);
unsigned long pfn = page_to_pfn(page);
@@ -6736,17 +6738,18 @@ bool put_page_back_buddy(struct page *page)
spin_lock_irqsave(&zone->lock, flags);
if (put_page_testzero(page)) {
- ClearPageHWPoisonTakenOff(page);
+ VM_WARN_ON_ONCE(PageHWPoisonTakenOff(page) && !unpoison);
+ if (unpoison)
+ ClearPageHWPoisonTakenOff(page);
__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
- if (TestClearPageHWPoison(page)) {
+ if (!unpoison || (unpoison && TestClearPageHWPoison(page)))
ret = true;
- }
}
spin_unlock_irqrestore(&zone->lock, flags);
return ret;
}
-#endif
+#endif /* CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY */
#ifdef CONFIG_ZONE_DMA
bool has_managed_dma(void)
--
2.43.0
arm64 uses arch_swap_restore() to restore saved tags before the page is
swapped in and it's called in atomic context (with the ptl lock held).
Introduce arch_swap_prepare_to_restore() that will allow an architecture to
perform extra work during swap in and outside of a critical section.
This will be used by arm64 to allocate a buffer in memory in which to
temporarily save the tags if tag storage is not available for the page being
swapped in.
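A simplified sketch of the arm64 use described above (folio_has_tag_storage()
and save_tag_buf_for_folio() are placeholders, not the actual patch;
mte_allocate_tag_buf() follows the rename done in patch #17):

vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio)
{
	void *tags;

	/* Nothing to do if the folio already has tag storage reserved. */
	if (folio_has_tag_storage(folio))
		return 0;

	/* Allocate the temporary buffer outside of the ptl critical section. */
	tags = mte_allocate_tag_buf();
	if (!tags)
		return VM_FAULT_OOM;

	save_tag_buf_for_folio(folio, tags);
	return 0;
}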
Signed-off-by: Alexandru Elisei <[email protected]>
---
include/linux/pgtable.h | 7 +++++++
mm/memory.c | 4 ++++
mm/shmem.c | 9 +++++++++
mm/swapfile.c | 5 +++++
4 files changed, 25 insertions(+)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 98f81ca08cbe..2d0f04042f62 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -959,6 +959,13 @@ static inline void arch_swap_invalidate_area(int type)
}
#endif
+#ifndef __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry, struct folio *folio)
+{
+ return 0;
+}
+#endif
+
#ifndef __HAVE_ARCH_SWAP_RESTORE
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463a..8a421e168b57 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3975,6 +3975,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_throttle_swaprate(folio, GFP_KERNEL);
+ ret = arch_swap_prepare_to_restore(entry, folio);
+ if (ret)
+ goto out_page;
+
/*
* Back out if somebody else already faulted in this pte.
*/
diff --git a/mm/shmem.c b/mm/shmem.c
index 14427e9982f9..621fabc3b8c6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1855,6 +1855,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct swap_info_struct *si;
struct folio *folio = NULL;
swp_entry_t swap;
+ vm_fault_t ret;
int error;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
@@ -1903,6 +1904,14 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
folio_wait_writeback(folio);
+ ret = arch_swap_prepare_to_restore(swap, folio);
+ if (ret) {
+ if (fault_type)
+ *fault_type = ret;
+ error = -EINVAL;
+ goto unlock;
+ }
+
/*
* Some architectures may have to restore extra metadata to the
* folio after reading from swap.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 556ff7347d5f..49425598f778 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1785,6 +1785,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
goto setpte;
}
+ if (arch_swap_prepare_to_restore(entry, folio)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
--
2.43.0
Introduce a mechanism that allows an architecture to trigger a page fault,
and add the infrastructure to handle that fault accordingly. To make use of
this, an arch is expected to mark the page table entry as PAGE_NONE (which
will cause a fault next time it is accessed) and to implement an
arch-specific method (like a software bit) for recognizing that the fault
needs to be handled by the arch code.
arm64 will use this approach to reserve tag storage for pages which are
mapped in an MTE enabled VMA, but for which the storage needed to hold the tags isn't
reserved (for example, because of an mprotect(PROT_MTE) call on a VMA with
existing pages).
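A minimal sketch of the arch side of the mechanism (PTE_TAG_STORAGE_NONE is
an assumed software bit, not a name from this patch):

static inline bool arch_fault_on_access_pte(pte_t pte)
{
	/* A PAGE_NONE entry with the software bit set means "reserve tag storage". */
	return pte_protnone(pte) && (pte_val(pte) & PTE_TAG_STORAGE_NONE);
}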
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch. Split from patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS
for mprotect(PROT_MTE)") (David Hildenbrand).
include/linux/huge_mm.h | 4 ++--
include/linux/pgtable.h | 47 +++++++++++++++++++++++++++++++++++--
mm/Kconfig | 3 +++
mm/huge_memory.c | 36 +++++++++++++++++++++--------
mm/memory.c | 51 ++++++++++++++++++++++++++---------------
5 files changed, 109 insertions(+), 32 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..4678a0a5e6a8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -346,7 +346,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
pud_t *pud, int flags, struct dev_pagemap **pgmap);
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
+vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf);
extern struct page *huge_zero_page;
extern unsigned long huge_zero_pfn;
@@ -476,7 +476,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
return NULL;
}
-static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
+static inline vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
{
return 0;
}
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2d0f04042f62..81a21be855a2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1455,7 +1455,7 @@ static inline int pud_trans_unstable(pud_t *pud)
return 0;
}
-#ifndef CONFIG_NUMA_BALANCING
+#if !defined(CONFIG_NUMA_BALANCING) && !defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS)
/*
* In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
* perfectly valid to indicate "no" in that case, which is why our default
@@ -1477,7 +1477,50 @@ static inline int pmd_protnone(pmd_t pmd)
{
return 0;
}
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* !CONFIG_NUMA_BALANCING && !CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
+
+#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool arch_fault_on_access_pte(pte_t pte)
+{
+ return false;
+}
+
+static inline bool arch_fault_on_access_pmd(pmd_t pmd)
+{
+ return false;
+}
+
+/*
+ * The function is called with the fault lock held and an elevated reference on
+ * the folio.
+ *
+ * Rules that an arch implementation of the function must follow:
+ *
+ * 1. The function must return with the elevated reference dropped.
+ *
+ * 2. If the return value contains VM_FAULT_RETRY or VM_FAULT_COMPLETED then:
+ *
+ * - if FAULT_FLAG_RETRY_NOWAIT is not set, the function must return with the
+ * correct fault lock released, which can be accomplished with
+ * release_fault_lock(vmf). Note that release_fault_lock() doesn't check if
+ * FAULT_FLAG_RETRY_NOWAIT is set before releasing the mmap_lock.
+ *
+ * - if FAULT_FLAG_RETRY_NOWAIT is set, then the function must not release the
+ * mmap_lock. The flag should be set only if the mmap_lock is held.
+ *
+ * 3. If the return value contains neither of the above, the function must not
+ * release the fault lock; the generic fault handler will take care of releasing
+ * the correct lock.
+ */
+static inline vm_fault_t arch_handle_folio_fault_on_access(struct folio *folio,
+ struct vm_fault *vmf,
+ bool *map_pte)
+{
+ *map_pte = false;
+
+ return VM_FAULT_SIGBUS;
+}
+#endif
#endif /* CONFIG_MMU */
diff --git a/mm/Kconfig b/mm/Kconfig
index 341cf53898db..153df67221f1 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1006,6 +1006,9 @@ config IDLE_PAGE_TRACKING
config ARCH_HAS_CACHE_LINE_SIZE
bool
+config ARCH_HAS_FAULT_ON_ACCESS
+ bool
+
config ARCH_HAS_CURRENT_STACK_POINTER
bool
help
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94ef5c02b459..2bad63a7ec16 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1698,7 +1698,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
}
/* NUMA hinting page fault entry point for trans huge pmds */
-vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
+vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
pmd_t oldpmd = vmf->orig_pmd;
@@ -1708,6 +1708,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
int nid = NUMA_NO_NODE;
int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
bool migrated = false, writable = false;
+ vm_fault_t ret;
int flags = 0;
vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -1731,6 +1732,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
if (!folio)
goto out_map;
+ folio_get(folio);
+ vma_set_access_pid_bit(vma);
+
+ if (arch_fault_on_access_pmd(oldpmd)) {
+ bool map_pte = false;
+
+ spin_unlock(vmf->ptl);
+ ret = arch_handle_folio_fault_on_access(folio, vmf, &map_pte);
+ if (ret || !map_pte)
+ return ret;
+ writable = false;
+ goto out_lock_and_map;
+ }
+
/* See similar comment in do_numa_page for explanation */
if (!writable)
flags |= TNF_NO_GROUP;
@@ -1755,15 +1770,18 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
if (migrated) {
flags |= TNF_MIGRATED;
nid = target_nid;
- } else {
- flags |= TNF_MIGRATE_FAIL;
- vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
- if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
- spin_unlock(vmf->ptl);
- goto out;
- }
- goto out_map;
+ goto out;
+ }
+
+ flags |= TNF_MIGRATE_FAIL;
+
+out_lock_and_map:
+ vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
+ if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
+ spin_unlock(vmf->ptl);
+ goto out;
}
+ goto out_map;
out:
if (nid != NUMA_NO_NODE)
diff --git a/mm/memory.c b/mm/memory.c
index 8a421e168b57..110fe2224277 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4886,11 +4886,6 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr, int page_nid, int *flags)
{
- folio_get(folio);
-
- /* Record the current PID acceesing VMA */
- vma_set_access_pid_bit(vma);
-
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == numa_node_id()) {
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -4900,13 +4895,14 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
return mpol_misplaced(folio, vma, addr);
}
-static vm_fault_t do_numa_page(struct vm_fault *vmf)
+static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct folio *folio = NULL;
int nid = NUMA_NO_NODE;
bool writable = false;
int last_cpupid;
+ vm_fault_t ret;
int target_nid;
pte_t pte, old_pte;
int flags = 0;
@@ -4939,6 +4935,20 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (!folio || folio_is_zone_device(folio))
goto out_map;
+ folio_get(folio);
+ /* Record the current PID acceesing VMA */
+ vma_set_access_pid_bit(vma);
+
+ if (arch_fault_on_access_pte(old_pte)) {
+ bool map_pte = false;
+
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ ret = arch_handle_folio_fault_on_access(folio, vmf, &map_pte);
+ if (ret || !map_pte)
+ return ret;
+ goto out_lock_and_map;
+ }
+
/* TODO: handle PTE-mapped THP */
if (folio_test_large(folio))
goto out_map;
@@ -4983,18 +4993,21 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
if (migrate_misplaced_folio(folio, vma, target_nid)) {
nid = target_nid;
flags |= TNF_MIGRATED;
- } else {
- flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
- vmf->address, &vmf->ptl);
- if (unlikely(!vmf->pte))
- goto out;
- if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
- pte_unmap_unlock(vmf->pte, vmf->ptl);
- goto out;
- }
- goto out_map;
+ goto out;
+ }
+
+ flags |= TNF_MIGRATE_FAIL;
+
+out_lock_and_map:
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ goto out;
+ if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ goto out;
}
+ goto out_map;
out:
if (nid != NUMA_NO_NODE)
@@ -5151,7 +5164,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
return do_swap_page(vmf);
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
- return do_numa_page(vmf);
+ return handle_pte_protnone(vmf);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
@@ -5272,7 +5285,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
}
if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
- return do_huge_pmd_numa_page(&vmf);
+ return handle_huge_pmd_protnone(&vmf);
if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
!pmd_write(vmf.orig_pmd)) {
--
2.43.0
The tag save/restore/copy functions could be more explicit about where the
tags come from and where they are copied to. Rename the functions to make
it easier to understand what they are doing:
- Rename the mte_clear_page_tags() 'addr' parameter to 'page_addr', to
match the other functions that take a page address as parameter.
- Rename mte_save/restore_tags() to
mte_save/restore_page_tags_by_swp_entry() to make it clear that they are
saved in a collection indexed by swp_entry (this will become important
when they are also saved in a collection indexed by page pfn). Same
applies to mte_invalidate_tags{,_area}_by_swp_entry().
- Rename mte_save/restore_page_tags() to make it clear where the tags are
saved to and where they are restored from - a previously allocated memory
buffer, not the xarray used when tags are saved for swapping. Rename the
action to 'copy' instead of 'save'/'restore' to match the copy from user
functions, which also copy tags to memory.
- Rename mte_allocate/free_tag_storage() to mte_allocate/free_tag_buf() to
make it clear the functions have nothing to do with the memory where the
corresponding tags for a page live. Change the parameter type for
mte_free_tag_buf() to void *, to match the return value of
mte_allocate_tag_buf(), and because that memory is opaque and not meant to
be directly dereferenced.
For consistency, rename local variables from 'tag_storage' to 'tags'.
Give a similar treatment to the hibernation code that saves and restores
the tags for all tagged pages.
In the same spirit, rename MTE_PAGE_TAG_STORAGE to
MTE_PAGE_TAG_STORAGE_SIZE to make it clear that it relates to the size of
the memory needed to save the tags for a page. Opportunistically rename
MTE_TAG_SIZE to MTE_TAG_SIZE_BITS to make it clear it is measured in bits,
not in bytes like the rest of the size definitions in the same header file.
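As a quick sanity check of the new names (an illustration only, not part of
the change): with 4K pages, MTE_GRANULES_PER_PAGE = 4096 / 16 = 256, so
MTE_PAGE_TAG_STORAGE_SIZE = 256 * MTE_TAG_SIZE_BITS / 8 = 256 * 4 / 8 = 128
bytes - the size of the buffer returned by mte_allocate_tag_buf() for that
page size.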
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte-def.h | 16 +++++-----
arch/arm64/include/asm/mte.h | 23 +++++++++------
arch/arm64/include/asm/pgtable.h | 8 ++---
arch/arm64/kernel/elfcore.c | 14 ++++-----
arch/arm64/kernel/hibernate.c | 46 ++++++++++++++---------------
arch/arm64/lib/mte.S | 18 ++++++------
arch/arm64/mm/mteswap.c | 50 ++++++++++++++++----------------
7 files changed, 90 insertions(+), 85 deletions(-)
diff --git a/arch/arm64/include/asm/mte-def.h b/arch/arm64/include/asm/mte-def.h
index 14ee86b019c2..eb0d76a6bdcf 100644
--- a/arch/arm64/include/asm/mte-def.h
+++ b/arch/arm64/include/asm/mte-def.h
@@ -5,14 +5,14 @@
#ifndef __ASM_MTE_DEF_H
#define __ASM_MTE_DEF_H
-#define MTE_GRANULE_SIZE UL(16)
-#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1))
-#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE)
-#define MTE_TAG_SHIFT 56
-#define MTE_TAG_SIZE 4
-#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE - 1)), MTE_TAG_SHIFT)
-#define MTE_PAGE_TAG_STORAGE (MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE / 8)
+#define MTE_GRANULE_SIZE UL(16)
+#define MTE_GRANULE_MASK (~(MTE_GRANULE_SIZE - 1))
+#define MTE_GRANULES_PER_PAGE (PAGE_SIZE / MTE_GRANULE_SIZE)
+#define MTE_TAG_SHIFT 56
+#define MTE_TAG_SIZE_BITS 4
+#define MTE_TAG_MASK GENMASK((MTE_TAG_SHIFT + (MTE_TAG_SIZE_BITS - 1)), MTE_TAG_SHIFT)
+#define MTE_PAGE_TAG_STORAGE_SIZE (MTE_GRANULES_PER_PAGE * MTE_TAG_SIZE_BITS / 8)
-#define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
+#define __MTE_PREAMBLE ARM64_ASM_PREAMBLE ".arch_extension memtag\n"
#endif /* __ASM_MTE_DEF_H */
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 91fbd5c8a391..8034695b3dd7 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -18,19 +18,24 @@
#include <asm/pgtable-types.h>
-void mte_clear_page_tags(void *addr);
+void mte_clear_page_tags(void *page_addr);
+
unsigned long mte_copy_tags_from_user(void *to, const void __user *from,
unsigned long n);
unsigned long mte_copy_tags_to_user(void __user *to, void *from,
unsigned long n);
-int mte_save_tags(struct page *page);
-void mte_save_page_tags(const void *page_addr, void *tag_storage);
-void mte_restore_tags(swp_entry_t entry, struct page *page);
-void mte_restore_page_tags(void *page_addr, const void *tag_storage);
-void mte_invalidate_tags(int type, pgoff_t offset);
-void mte_invalidate_tags_area(int type);
-void *mte_allocate_tag_storage(void);
-void mte_free_tag_storage(char *storage);
+
+int mte_save_page_tags_by_swp_entry(struct page *page);
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page);
+
+void mte_copy_page_tags_to_buf(const void *page_addr, void *to);
+void mte_copy_page_tags_from_buf(void *page_addr, const void *from);
+
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset);
+void mte_invalidate_tags_area_by_swp_entry(int type);
+
+void *mte_allocate_tag_buf(void);
+void mte_free_tag_buf(void *buf);
#ifdef CONFIG_ARM64_MTE
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 08f0904dbfc2..2499cc4fa4f2 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1045,7 +1045,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
static inline int arch_prepare_to_swap(struct page *page)
{
if (system_supports_mte())
- return mte_save_tags(page);
+ return mte_save_page_tags_by_swp_entry(page);
return 0;
}
@@ -1053,20 +1053,20 @@ static inline int arch_prepare_to_swap(struct page *page)
static inline void arch_swap_invalidate_page(int type, pgoff_t offset)
{
if (system_supports_mte())
- mte_invalidate_tags(type, offset);
+ mte_invalidate_tags_by_swp_entry(type, offset);
}
static inline void arch_swap_invalidate_area(int type)
{
if (system_supports_mte())
- mte_invalidate_tags_area(type);
+ mte_invalidate_tags_area_by_swp_entry(type);
}
#define __HAVE_ARCH_SWAP_RESTORE
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
if (system_supports_mte())
- mte_restore_tags(entry, &folio->page);
+ mte_restore_page_tags_by_swp_entry(entry, &folio->page);
}
#endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2e94d20c4ac7..e9ae00dacad8 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -17,7 +17,7 @@
static unsigned long mte_vma_tag_dump_size(struct core_vma_metadata *m)
{
- return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE;
+ return (m->dump_size >> PAGE_SHIFT) * MTE_PAGE_TAG_STORAGE_SIZE;
}
/* Derived from dump_user_range(); start/end must be page-aligned */
@@ -38,7 +38,7 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
* have been all zeros.
*/
if (!page) {
- dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+ dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
continue;
}
@@ -48,12 +48,12 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
*/
if (!page_mte_tagged(page)) {
put_page(page);
- dump_skip(cprm, MTE_PAGE_TAG_STORAGE);
+ dump_skip(cprm, MTE_PAGE_TAG_STORAGE_SIZE);
continue;
}
if (!tags) {
- tags = mte_allocate_tag_storage();
+ tags = mte_allocate_tag_buf();
if (!tags) {
put_page(page);
ret = 0;
@@ -61,16 +61,16 @@ static int mte_dump_tag_range(struct coredump_params *cprm,
}
}
- mte_save_page_tags(page_address(page), tags);
+ mte_copy_page_tags_to_buf(page_address(page), tags);
put_page(page);
- if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE)) {
+ if (!dump_emit(cprm, tags, MTE_PAGE_TAG_STORAGE_SIZE)) {
ret = 0;
break;
}
}
if (tags)
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
return ret;
}
diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c
index 02870beb271e..a3b0e7b32457 100644
--- a/arch/arm64/kernel/hibernate.c
+++ b/arch/arm64/kernel/hibernate.c
@@ -215,41 +215,41 @@ static int create_safe_exec_page(void *src_start, size_t length,
#ifdef CONFIG_ARM64_MTE
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_pfn);
-static int save_tags(struct page *page, unsigned long pfn)
+static int save_page_tags_by_pfn(struct page *page, unsigned long pfn)
{
- void *tag_storage, *ret;
+ void *tags, *ret;
- tag_storage = mte_allocate_tag_storage();
- if (!tag_storage)
+ tags = mte_allocate_tag_buf();
+ if (!tags)
return -ENOMEM;
- mte_save_page_tags(page_address(page), tag_storage);
+ mte_copy_page_tags_to_buf(page_address(page), tags);
- ret = xa_store(&mte_pages, pfn, tag_storage, GFP_KERNEL);
+ ret = xa_store(&tags_by_pfn, pfn, tags, GFP_KERNEL);
if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
- mte_free_tag_storage(tag_storage);
+ mte_free_tag_buf(tags);
return xa_err(ret);
} else if (WARN(ret, "swsusp: %s: Duplicate entry", __func__)) {
- mte_free_tag_storage(ret);
+ mte_free_tag_buf(ret);
}
return 0;
}
-static void swsusp_mte_free_storage(void)
+static void swsusp_mte_free_tags(void)
{
- XA_STATE(xa_state, &mte_pages, 0);
+ XA_STATE(xa_state, &tags_by_pfn, 0);
void *tags;
- xa_lock(&mte_pages);
+ xa_lock(&tags_by_pfn);
xas_for_each(&xa_state, tags, ULONG_MAX) {
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
}
- xa_unlock(&mte_pages);
+ xa_unlock(&tags_by_pfn);
- xa_destroy(&mte_pages);
+ xa_destroy(&tags_by_pfn);
}
static int swsusp_mte_save_tags(void)
@@ -273,9 +273,9 @@ static int swsusp_mte_save_tags(void)
if (!page_mte_tagged(page))
continue;
- ret = save_tags(page, pfn);
+ ret = save_page_tags_by_pfn(page, pfn);
if (ret) {
- swsusp_mte_free_storage();
+ swsusp_mte_free_tags();
goto out;
}
@@ -290,25 +290,25 @@ static int swsusp_mte_save_tags(void)
static void swsusp_mte_restore_tags(void)
{
- XA_STATE(xa_state, &mte_pages, 0);
+ XA_STATE(xa_state, &tags_by_pfn, 0);
int n = 0;
void *tags;
- xa_lock(&mte_pages);
+ xa_lock(&tags_by_pfn);
xas_for_each(&xa_state, tags, ULONG_MAX) {
unsigned long pfn = xa_state.xa_index;
struct page *page = pfn_to_online_page(pfn);
- mte_restore_page_tags(page_address(page), tags);
+ mte_copy_page_tags_from_buf(page_address(page), tags);
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
n++;
}
- xa_unlock(&mte_pages);
+ xa_unlock(&tags_by_pfn);
pr_info("Restored %d MTE pages\n", n);
- xa_destroy(&mte_pages);
+ xa_destroy(&tags_by_pfn);
}
#else /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 5018ac03b6bf..9f623e9da09f 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -119,7 +119,7 @@ SYM_FUNC_START(mte_copy_tags_to_user)
cbz x2, 2f
1:
ldg x4, [x1]
- ubfx x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE
+ ubfx x4, x4, #MTE_TAG_SHIFT, #MTE_TAG_SIZE_BITS
USER(2f, sttrb w4, [x0])
add x0, x0, #1
add x1, x1, #MTE_GRANULE_SIZE
@@ -132,11 +132,11 @@ USER(2f, sttrb w4, [x0])
SYM_FUNC_END(mte_copy_tags_to_user)
/*
- * Save the tags in a page
+ * Copy the tags in a page to a buffer
* x0 - page address
- * x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ * x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
*/
-SYM_FUNC_START(mte_save_page_tags)
+SYM_FUNC_START(mte_copy_page_tags_to_buf)
multitag_transfer_size x7, x5
1:
mov x2, #0
@@ -153,14 +153,14 @@ SYM_FUNC_START(mte_save_page_tags)
b.ne 1b
ret
-SYM_FUNC_END(mte_save_page_tags)
+SYM_FUNC_END(mte_copy_page_tags_to_buf)
/*
- * Restore the tags in a page
+ * Restore the tags in a page from a buffer
* x0 - page address
- * x1 - tag storage, MTE_PAGE_TAG_STORAGE bytes
+ * x1 - memory buffer, MTE_PAGE_TAG_STORAGE_SIZE bytes
*/
-SYM_FUNC_START(mte_restore_page_tags)
+SYM_FUNC_START(mte_copy_page_tags_from_buf)
multitag_transfer_size x7, x5
1:
ldr x2, [x1], #8
@@ -174,4 +174,4 @@ SYM_FUNC_START(mte_restore_page_tags)
b.ne 1b
ret
-SYM_FUNC_END(mte_restore_page_tags)
+SYM_FUNC_END(mte_copy_page_tags_from_buf)
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index a31833e3ddc5..2a43746b803f 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -7,79 +7,79 @@
#include <linux/swapops.h>
#include <asm/mte.h>
-static DEFINE_XARRAY(mte_pages);
+static DEFINE_XARRAY(tags_by_swp_entry);
-void *mte_allocate_tag_storage(void)
+void *mte_allocate_tag_buf(void)
{
/* tags granule is 16 bytes, 2 tags stored per byte */
- return kmalloc(MTE_PAGE_TAG_STORAGE, GFP_KERNEL);
+ return kmalloc(MTE_PAGE_TAG_STORAGE_SIZE, GFP_KERNEL);
}
-void mte_free_tag_storage(char *storage)
+void mte_free_tag_buf(void *buf)
{
- kfree(storage);
+ kfree(buf);
}
-int mte_save_tags(struct page *page)
+int mte_save_page_tags_by_swp_entry(struct page *page)
{
- void *tag_storage, *ret;
+ void *tags, *ret;
if (!page_mte_tagged(page))
return 0;
- tag_storage = mte_allocate_tag_storage();
- if (!tag_storage)
+ tags = mte_allocate_tag_buf();
+ if (!tags)
return -ENOMEM;
- mte_save_page_tags(page_address(page), tag_storage);
+ mte_copy_page_tags_to_buf(page_address(page), tags);
/* lookup the swap entry.val from the page */
- ret = xa_store(&mte_pages, page_swap_entry(page).val, tag_storage,
+ ret = xa_store(&tags_by_swp_entry, page_swap_entry(page).val, tags,
GFP_KERNEL);
if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
- mte_free_tag_storage(tag_storage);
+ mte_free_tag_buf(tags);
return xa_err(ret);
} else if (ret) {
/* Entry is being replaced, free the old entry */
- mte_free_tag_storage(ret);
+ mte_free_tag_buf(ret);
}
return 0;
}
-void mte_restore_tags(swp_entry_t entry, struct page *page)
+void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
{
- void *tags = xa_load(&mte_pages, entry.val);
+ void *tags = xa_load(&tags_by_swp_entry, entry.val);
if (!tags)
return;
if (try_page_mte_tagging(page)) {
- mte_restore_page_tags(page_address(page), tags);
+ mte_copy_page_tags_from_buf(page_address(page), tags);
set_page_mte_tagged(page);
}
}
-void mte_invalidate_tags(int type, pgoff_t offset)
+void mte_invalidate_tags_by_swp_entry(int type, pgoff_t offset)
{
swp_entry_t entry = swp_entry(type, offset);
- void *tags = xa_erase(&mte_pages, entry.val);
+ void *tags = xa_erase(&tags_by_swp_entry, entry.val);
- mte_free_tag_storage(tags);
+ mte_free_tag_buf(tags);
}
-void mte_invalidate_tags_area(int type)
+void mte_invalidate_tags_area_by_swp_entry(int type)
{
swp_entry_t entry = swp_entry(type, 0);
swp_entry_t last_entry = swp_entry(type + 1, 0);
void *tags;
- XA_STATE(xa_state, &mte_pages, entry.val);
+ XA_STATE(xa_state, &tags_by_swp_entry, entry.val);
- xa_lock(&mte_pages);
+ xa_lock(&tags_by_swp_entry);
xas_for_each(&xa_state, tags, last_entry.val - 1) {
- __xa_erase(&mte_pages, xa_state.xa_index);
- mte_free_tag_storage(tags);
+ __xa_erase(&tags_by_swp_entry, xa_state.xa_index);
+ mte_free_tag_buf(tags);
}
- xa_unlock(&mte_pages);
+ xa_unlock(&tags_by_swp_entry);
}
--
2.43.0
__GFP_ZEROTAGS is used to instruct the page allocator to zero the tags at
the same time as the physical frame is zeroed. The name can be slightly
misleading, because it doesn't mean that the code will zero the tags
unconditionally, but that the tags will be zeroed if and only if the
physical frame is also zeroed (either __GFP_ZERO is set or init_on_alloc is
1).
Rename it to __GFP_TAGGED, in preparation for it to be used by the page
allocator to recognize when an allocation is tagged (has metadata).
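Putting the two users together (an illustrative sketch only, mirroring the
hunks below rather than adding new code, with the call sites simplified): a
tagged VMA contributes __GFP_TAGGED through arch_calc_vma_gfp(), and
post_alloc_hook() only zeroes the tags when the data is zeroed as well:

	/* Sketch, not part of the patch: how the flag flows. */
	gfp_t gfp_flags = GFP_HIGHUSER_MOVABLE | arch_calc_vma_gfp(vma, 0);
	...
	/* Later, in post_alloc_hook(): tags are zeroed iff the frame is zeroed. */
	bool zero_tags = init && (gfp_flags & __GFP_TAGGED);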
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/mm/fault.c | 2 +-
include/linux/gfp_types.h | 6 +++---
include/trace/events/mmflags.h | 2 +-
mm/page_alloc.c | 2 +-
mm/shmem.c | 2 +-
5 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 4d3f0a870ad8..c022e473c17c 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -944,7 +944,7 @@ NOKPROBE_SYMBOL(do_debug_exception);
gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
{
if (vma->vm_flags & VM_MTE)
- return __GFP_ZEROTAGS;
+ return __GFP_TAGGED;
return 0;
}
diff --git a/include/linux/gfp_types.h b/include/linux/gfp_types.h
index 1b6053da8754..f638353ebdc7 100644
--- a/include/linux/gfp_types.h
+++ b/include/linux/gfp_types.h
@@ -45,7 +45,7 @@ typedef unsigned int __bitwise gfp_t;
#define ___GFP_HARDWALL 0x100000u
#define ___GFP_THISNODE 0x200000u
#define ___GFP_ACCOUNT 0x400000u
-#define ___GFP_ZEROTAGS 0x800000u
+#define ___GFP_TAGGED 0x800000u
#ifdef CONFIG_KASAN_HW_TAGS
#define ___GFP_SKIP_ZERO 0x1000000u
#define ___GFP_SKIP_KASAN 0x2000000u
@@ -226,7 +226,7 @@ typedef unsigned int __bitwise gfp_t;
*
* %__GFP_ZERO returns a zeroed page on success.
*
- * %__GFP_ZEROTAGS zeroes memory tags at allocation time if the memory itself
+ * %__GFP_TAGGED zeroes memory tags at allocation time if the memory itself
* is being zeroed (either via __GFP_ZERO or via init_on_alloc, provided that
* __GFP_SKIP_ZERO is not set). This flag is intended for optimization: setting
* memory tags at the same time as zeroing memory has minimal additional
@@ -241,7 +241,7 @@ typedef unsigned int __bitwise gfp_t;
#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)
-#define __GFP_ZEROTAGS ((__force gfp_t)___GFP_ZEROTAGS)
+#define __GFP_TAGGED ((__force gfp_t)___GFP_TAGGED)
#define __GFP_SKIP_ZERO ((__force gfp_t)___GFP_SKIP_ZERO)
#define __GFP_SKIP_KASAN ((__force gfp_t)___GFP_SKIP_KASAN)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index d801409b33cf..6ca0d5ed46c0 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -50,7 +50,7 @@
gfpflag_string(__GFP_RECLAIM), \
gfpflag_string(__GFP_DIRECT_RECLAIM), \
gfpflag_string(__GFP_KSWAPD_RECLAIM), \
- gfpflag_string(__GFP_ZEROTAGS)
+ gfpflag_string(__GFP_TAGGED)
#ifdef CONFIG_KASAN_HW_TAGS
#define __def_gfpflag_names_kasan , \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 502ee3eb8583..0a0118612a13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1480,7 +1480,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
{
bool init = !want_init_on_free() && want_init_on_alloc(gfp_flags) &&
!should_skip_init(gfp_flags);
- bool zero_tags = init && (gfp_flags & __GFP_ZEROTAGS);
+ bool zero_tags = init && (gfp_flags & __GFP_TAGGED);
int i;
set_page_private(page, 0);
diff --git a/mm/shmem.c b/mm/shmem.c
index 621fabc3b8c6..3e28357b0a40 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1585,7 +1585,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
*/
static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
{
- gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_ZEROTAGS;
+ gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_TAGGED;
gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
--
2.43.0
Allow the kernel to get the base address, size, block size and associated
memory node for tag storage from the device tree blob.
A tag storage region represents the smallest contiguous memory region that
holds all the tags for the associated contiguous memory region which can be
tagged. For example, for 32GB of contiguous tagged memory the corresponding
tag storage region is exactly 1GB of contiguous memory, not two adjacent
512MB regions of tag storage, nor one 2GB tag storage region.
Tag storage is described as reserved memory; future patches will teach the
kernel how to make use of it for data (non-tagged) allocations.
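The size relationship follows directly from MTE's 4 bits of tag per 16-byte
granule, i.e. tag storage is 1/32 of the memory it covers. A minimal sketch
of that calculation (for illustration only; the helper below is not part of
the patch):

	/* Illustration: bytes of tag storage needed for 'mem_size' bytes of
	 * tagged memory (4 bits per 16-byte granule -> a 32:1 ratio).
	 */
	static inline u64 tag_storage_size(u64 mem_size)
	{
		return mem_size / 32;
	}

For the example above, 32GB / 32 = 1GB of tag storage.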
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* Reworked from rfc v2 patch #11 ("arm64: mte: Reserve tag storage memory").
* Added device tree schema (Rob Herring)
* Tag storage memory is now described in the "reserved-memory" node (Rob
Herring).
.../reserved-memory/arm,mte-tag-storage.yaml | 78 +++++++++
arch/arm64/Kconfig | 12 ++
arch/arm64/include/asm/mte_tag_storage.h | 16 ++
arch/arm64/kernel/Makefile | 1 +
arch/arm64/kernel/mte_tag_storage.c | 158 ++++++++++++++++++
arch/arm64/mm/init.c | 3 +
6 files changed, 268 insertions(+)
create mode 100644 Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml
create mode 100644 arch/arm64/include/asm/mte_tag_storage.h
create mode 100644 arch/arm64/kernel/mte_tag_storage.c
diff --git a/Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml b/Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml
new file mode 100644
index 000000000000..a99aaa1e8b6e
--- /dev/null
+++ b/Documentation/devicetree/bindings/reserved-memory/arm,mte-tag-storage.yaml
@@ -0,0 +1,78 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/reserved-memory/arm,mte-tag-storage.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Tag storage memory for Memory Tagging Extension
+
+description: |
+ Description of the tag storage memory region that Linux can use to store
+ data when the associated memory is not tagged.
+
+ The reserved memory described by the node must also be described by a
+ standalone 'memory' node.
+
+maintainers:
+ - Alexandru Elisei <[email protected]>
+
+allOf:
+ - $ref: reserved-memory.yaml
+
+properties:
+ compatible:
+ const: arm,mte-tag-storage
+
+ reg:
+ description: |
+ Specifies the memory region that MTE uses for tag storage. The size of the
+ region must be equal to the size needed to store all the tags for the
+ associated tagged memory.
+
+ block-size:
+ description: |
+ Specifies the minimum multiple of 4K bytes of tag storage where all the
+ tags stored in the block correspond to a contiguous memory region. This
+ is needed for platforms where the memory controller interleaves tag
+ writes to memory.
+
+ For example, if the memory controller interleaves tag writes for 256KB
+ of contiguous memory across 8K of tag storage (2-way interleave), then
+ the correct value for 'block-size' is 0x2000.
+
+ This value is a hardware property, independent of the selected kernel page
+ size.
+ $ref: /schemas/types.yaml#/definitions/uint32
+
+ tagged-memory:
+ description: |
+ Specifies the memory node, as a phandle, for which all the tags are
+ stored in the tag storage region.
+
+ The memory node must describe one contiguous memory region (i.e, the
+ 'ranges' property of the memory node must have exactly one entry).
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+unevaluatedProperties: false
+
+required:
+ - compatible
+ - reg
+ - block-size
+ - tagged-memory
+ - reusable
+
+examples:
+ - |
+ reserved-memory {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ tags0: tag-storage@8f8000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x08 0xf8000000 0x00 0x4000000>;
+ block-size = <0x1000>;
+ tagged-memory = <&memory0>;
+ reusable;
+ };
+ };
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index aa7c1d435139..92d97930b56e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2082,6 +2082,18 @@ config ARM64_MTE
Documentation/arch/arm64/memory-tagging-extension.rst.
+if ARM64_MTE
+config ARM64_MTE_TAG_STORAGE
+ bool
+ help
+ Adds support for dynamic management of the memory used by the hardware
+ for storing MTE tags. This memory, unlike normal memory, cannot be
+ tagged. When it is used to store tags for another memory location it
+ cannot be used for any type of allocation.
+
+ If unsure, say N
+endif # ARM64_MTE
+
endmenu # "ARMv8.5 architectural features"
menu "ARMv8.7 architectural features"
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
new file mode 100644
index 000000000000..3c2cd29e053e
--- /dev/null
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2023 ARM Ltd.
+ */
+#ifndef __ASM_MTE_TAG_STORAGE_H
+#define __ASM_MTE_TAG_STORAGE_H
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+void mte_init_tag_storage(void);
+#else
+static inline void mte_init_tag_storage(void)
+{
+}
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+
+#endif /* __ASM_MTE_TAG_STORAGE_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index e5d03a7039b4..89c28b538908 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -70,6 +70,7 @@ obj-$(CONFIG_CRASH_CORE) += crash_core.o
obj-$(CONFIG_ARM_SDE_INTERFACE) += sdei.o
obj-$(CONFIG_ARM64_PTR_AUTH) += pointer_auth.o
obj-$(CONFIG_ARM64_MTE) += mte.o
+obj-$(CONFIG_ARM64_MTE_TAG_STORAGE) += mte_tag_storage.o
obj-y += vdso-wrap.o
obj-$(CONFIG_COMPAT_VDSO) += vdso32-wrap.o
obj-$(CONFIG_UNWIND_PATCH_PAC_INTO_SCS) += patch-scs.o
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
new file mode 100644
index 000000000000..2f32265d8ad8
--- /dev/null
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support for dynamic tag storage.
+ *
+ * Copyright (C) 2023 ARM Ltd.
+ */
+
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/of.h>
+#include <linux/of_address.h>
+#include <linux/of_fdt.h>
+#include <linux/of_reserved_mem.h>
+#include <linux/range.h>
+#include <linux/string.h>
+#include <linux/xarray.h>
+
+#include <asm/mte_tag_storage.h>
+
+struct tag_region {
+ struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
+ struct range tag_range; /* Tag storage memory, in PFNs. */
+ u32 block_size_pages; /* Tag block size, in pages. */
+ phandle mem_phandle; /* phandle for the associated memory node. */
+};
+
+#define MAX_TAG_REGIONS 32
+
+static struct tag_region tag_regions[MAX_TAG_REGIONS];
+static int num_tag_regions;
+
+static u32 __init get_block_size_pages(u32 block_size_bytes)
+{
+ u32 a = PAGE_SIZE;
+ u32 b = block_size_bytes;
+ u32 r;
+
+ /* Find greatest common divisor using the Euclidian algorithm. */
+ do {
+ r = a % b;
+ a = b;
+ b = r;
+ } while (b != 0);
+
+ return PHYS_PFN(PAGE_SIZE * block_size_bytes / a);
+}
+
+int __init tag_storage_probe(struct reserved_mem *rmem)
+{
+ struct tag_region *region;
+ u32 block_size_bytes;
+ int ret;
+
+ if (num_tag_regions == MAX_TAG_REGIONS) {
+ pr_err("Exceeded maximum number of tag storage regions");
+ goto out_err;
+ }
+
+ region = &tag_regions[num_tag_regions];
+ region->tag_range.start = PHYS_PFN(rmem->base);
+ region->tag_range.end = PHYS_PFN(rmem->base + rmem->size - 1);
+
+ ret = of_flat_read_u32(rmem->fdt_node, "block-size", &block_size_bytes);
+ if (ret || block_size_bytes == 0) {
+ pr_err("Invalid or missing 'block-size' property");
+ goto out_err;
+ }
+
+ region->block_size_pages = get_block_size_pages(block_size_bytes);
+ if (range_len(&region->tag_range) % region->block_size_pages != 0) {
+ pr_err("Tag storage region size 0x%llx pages is not a multiple of block size 0x%x pages",
+ range_len(&region->tag_range), region->block_size_pages);
+ goto out_err;
+ }
+
+ ret = of_flat_read_u32(rmem->fdt_node, "tagged-memory", &region->mem_phandle);
+ if (ret) {
+ pr_err("Invalid or missing 'tagged-memory' property");
+ goto out_err;
+ }
+
+ num_tag_regions++;
+ return 0;
+
+out_err:
+ num_tag_regions = 0;
+ return -EINVAL;
+}
+RESERVEDMEM_OF_DECLARE(tag_storage, "arm,mte-tag-storage", tag_storage_probe);
+
+static int __init mte_find_tagged_memory_regions(void)
+{
+ struct device_node *mem_dev;
+ struct tag_region *region;
+ struct range *mem_range;
+ const __be32 *reg;
+ u64 addr, size;
+ int i;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ region = &tag_regions[i];
+ mem_range = &region->mem_range;
+
+ mem_dev = of_find_node_by_phandle(region->mem_phandle);
+ if (!mem_dev) {
+ pr_err("Cannot find tagged memory node");
+ goto out;
+ }
+
+ reg = of_get_property(mem_dev, "reg", NULL);
+ if (!reg) {
+ pr_err("Invalid tagged memory node");
+ goto out_put_mem;
+ }
+
+ addr = of_translate_address(mem_dev, reg);
+ if (addr == OF_BAD_ADDR) {
+ pr_err("Invalid memory address");
+ goto out_put_mem;
+ }
+
+ size = of_read_number(reg + of_n_addr_cells(mem_dev), of_n_size_cells(mem_dev));
+ if (!size) {
+ pr_err("Invalid memory size");
+ goto out_put_mem;
+ }
+
+ mem_range->start = PHYS_PFN(addr);
+ mem_range->end = PHYS_PFN(addr + size - 1);
+
+ of_node_put(mem_dev);
+ }
+
+ return 0;
+
+out_put_mem:
+ of_node_put(mem_dev);
+out:
+ return -EINVAL;
+}
+
+void __init mte_init_tag_storage(void)
+{
+ int ret;
+
+ if (num_tag_regions == 0)
+ return;
+
+ ret = mte_find_tagged_memory_regions();
+ if (ret)
+ goto out_disabled;
+
+ return;
+
+out_disabled:
+ num_tag_regions = 0;
+ pr_info("MTE tag storage region management disabled");
+}
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..2ccc0c294a13 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -39,6 +39,7 @@
#include <asm/kernel-pgtable.h>
#include <asm/kvm_host.h>
#include <asm/memory.h>
+#include <asm/mte_tag_storage.h>
#include <asm/numa.h>
#include <asm/sections.h>
#include <asm/setup.h>
@@ -386,6 +387,8 @@ void __init mem_init(void)
/* this will put all unused low memory onto the freelists */
memblock_free_all();
+ mte_init_tag_storage();
+
/*
* Check boundaries twice: Some fundamental inconsistencies can be
* detected at build time already.
--
2.43.0
Add the MTE tag storage pages to CMA, which allows the page allocator to
manage them like regular pages.
The CMA migratetype lends the tag storage pages some very desirable
properties:
* They cannot be longterm pinned, meaning they should always be migratable.
* The pages can be allocated explicitly by using their PFN (with
alloc_cma_range()) when they are needed to store tags.
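To make the 'allocated by PFN' point concrete, the mapping from a data page
to the tag storage block that covers it is linear within a region. A sketch
under that assumption (the helper name is made up and the code is
illustrative, not part of the patch; it only uses the tag_region fields
introduced earlier in the series):

	/* Illustration: first PFN of the tag storage block holding the tags
	 * for data page 'data_pfn', assuming a linear 32:1 correspondence.
	 */
	static unsigned long data_pfn_to_tag_block(struct tag_region *region,
						   unsigned long data_pfn)
	{
		unsigned long tag_offset;

		/* 32 data pages are covered by one tag storage page. */
		tag_offset = (data_pfn - region->mem_range.start) / 32;
		/* Tag storage is reserved in blocks of block_size_pages pages. */
		tag_offset -= tag_offset % region->block_size_pages;

		return region->tag_range.start + tag_offset;
	}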
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since v2:
* Reworked from rfc v2 patch #12 ("arm64: mte: Add tag storage pages to the
MIGRATE_CMA migratetype").
* Tag storage memory is now added to the cma_areas array and will be managed
like a regular CMA region (David Hildenbrand).
* If a tag storage region spans multiple zones, CMA won't be able to activate
the region. Split such regions into multiple tag storage regions (Hyesoo Yu).
arch/arm64/Kconfig | 1 +
arch/arm64/kernel/mte_tag_storage.c | 150 +++++++++++++++++++++++++++-
2 files changed, 150 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 92d97930b56e..6f65e9005dc9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2085,6 +2085,7 @@ config ARM64_MTE
if ARM64_MTE
config ARM64_MTE_TAG_STORAGE
bool
+ select CONFIG_CMA
help
Adds support for dynamic management of the memory used by the hardware
for storing MTE tags. This memory, unlike normal memory, cannot be
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 2f32265d8ad8..90b157132efa 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -5,6 +5,8 @@
* Copyright (C) 2023 ARM Ltd.
*/
+#include <linux/cma.h>
+#include <linux/log2.h>
#include <linux/memblock.h>
#include <linux/mm.h>
#include <linux/of.h>
@@ -22,6 +24,7 @@ struct tag_region {
struct range tag_range; /* Tag storage memory, in PFNs. */
u32 block_size_pages; /* Tag block size, in pages. */
phandle mem_phandle; /* phandle for the associated memory node. */
+ struct cma *cma; /* CMA cookie */
};
#define MAX_TAG_REGIONS 32
@@ -139,9 +142,88 @@ static int __init mte_find_tagged_memory_regions(void)
return -EINVAL;
}
+static void __init mte_split_tag_region(struct tag_region *region, unsigned long last_tag_pfn)
+{
+ struct tag_region *new_region;
+ unsigned long last_mem_pfn;
+
+ new_region = &tag_regions[num_tag_regions];
+ last_mem_pfn = region->mem_range.start + (last_tag_pfn - region->tag_range.start) * 32;
+
+ new_region->mem_range.start = last_mem_pfn + 1;
+ new_region->mem_range.end = region->mem_range.end;
+ region->mem_range.end = last_mem_pfn;
+
+ new_region->tag_range.start = last_tag_pfn + 1;
+ new_region->tag_range.end = region->tag_range.end;
+ region->tag_range.end = last_tag_pfn;
+
+ new_region->block_size_pages = region->block_size_pages;
+
+ num_tag_regions++;
+}
+
+/*
+ * Split any tag region that spans multiple zones - CMA will fail if that
+ * happens.
+ */
+static int __init mte_split_tag_regions(void)
+{
+ struct tag_region *region;
+ struct range *tag_range;
+ struct zone *zone;
+ unsigned long pfn;
+ int i;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ region = &tag_regions[i];
+ tag_range = &region->tag_range;
+ zone = page_zone(pfn_to_page(tag_range->start));
+
+ for (pfn = tag_range->start + 1; pfn <= tag_range->end; pfn++) {
+ if (page_zone(pfn_to_page(pfn)) == zone)
+ continue;
+
+ if (WARN_ON_ONCE(pfn % region->block_size_pages))
+ goto out_err;
+
+ if (num_tag_regions == MAX_TAG_REGIONS)
+ goto out_err;
+
+ mte_split_tag_region(&tag_regions[i], pfn - 1);
+ /* Move on to the next region. */
+ break;
+ }
+ }
+
+ return 0;
+
+out_err:
+ pr_err("Error splitting tag storage region 0x%llx-0x%llx spanning multiple zones",
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end + 1) - 1);
+ return -EINVAL;
+}
+
void __init mte_init_tag_storage(void)
{
- int ret;
+ unsigned long long mem_end;
+ struct tag_region *region;
+ unsigned long pfn, order;
+ u64 start, end;
+ int i, j, ret;
+
+ /*
+ * Tag storage memory requires that tag storage pages in use for data
+ * are always migratable when they need to be repurposed to store tags.
+ * If ARCH_KEEP_MEMBLOCK is enabled, kexec will not scan reserved
+ * memblocks when trying to find a suitable location for the kernel
+ * image. This means that kexec will not use tag storage pages for
+ * copying the kernel, and the pages will remain migratable.
+ *
+ * Add the check in case arm64 stops selecting ARCH_KEEP_MEMBLOCK by
+ * default.
+ */
+ BUILD_BUG_ON(!IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK));
if (num_tag_regions == 0)
return;
@@ -150,6 +232,72 @@ void __init mte_init_tag_storage(void)
if (ret)
goto out_disabled;
+ mem_end = PHYS_PFN(memblock_end_of_DRAM());
+
+ /*
+ * MTE is disabled, tag storage pages can be used like any other pages.
+ * The only restriction is that the pages cannot be used by kexec
+ * because the memory remains marked as reserved in the memblock
+ * allocator.
+ */
+ if (!system_supports_mte()) {
+ for (i = 0; i< num_tag_regions; i++) {
+ start = tag_regions[i].tag_range.start;
+ end = tag_regions[i].tag_range.end;
+
+ /* end is inclusive, mem_end is not */
+ if (end >= mem_end)
+ end = mem_end - 1;
+ if (end < start)
+ continue;
+ for (pfn = start; pfn <= end; pfn++)
+ free_reserved_page(pfn_to_page(pfn));
+ }
+ goto out_disabled;
+ }
+
+ /*
+ * Check that tag storage is addressable by the kernel.
+ * cma_init_reserved_mem(), unlike cma_declare_contiguous_nid(), doesn't
+ * perform this check.
+ */
+ for (i = 0; i< num_tag_regions; i++) {
+ start = tag_regions[i].tag_range.start;
+ end = tag_regions[i].tag_range.end;
+
+ if (end >= mem_end) {
+ pr_err("Tag region 0x%llx-0x%llx outside addressable memory",
+ PFN_PHYS(start), PFN_PHYS(end + 1) - 1);
+ goto out_disabled;
+ }
+ }
+
+ ret = mte_split_tag_regions();
+ if (ret)
+ goto out_disabled;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ region = &tag_regions[i];
+
+ /* Tag storage pages are managed in block_size_pages chunks. */
+ if (is_power_of_2(region->block_size_pages))
+ order = ilog2(region->block_size_pages);
+ else
+ order = 0;
+
+ ret = cma_init_reserved_mem(PFN_PHYS(region->tag_range.start),
+ PFN_PHYS(range_len(&region->tag_range)),
+ order, NULL, &region->cma);
+ if (ret) {
+ for (j = 0; j < i; j++)
+ cma_remove_mem(&region->cma);
+ goto out_disabled;
+ }
+
+ /* Keep pages reserved if activation fails. */
+ cma_reserve_pages_on_error(region->cma);
+ }
+
return;
out_disabled:
--
2.43.0
Reserving the tag storage associated with a tagged page requires that the
tag storage can be migrated if it is in use for data.
The kernel allocates pages in non-preemptible contexts, which makes
migration impossible. The only user of tagged pages in the kernel is HW
KASAN, so don't use tag storage pages if HW KASAN is enabled.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* Expanded commit message (David Hildenbrand)
arch/arm64/kernel/mte_tag_storage.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 90b157132efa..9a1a8a45171e 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -256,6 +256,16 @@ void __init mte_init_tag_storage(void)
goto out_disabled;
}
+ /*
+ * The kernel allocates memory in non-preemptible contexts, which makes
+ * migration impossible when reserving the associated tag storage. The
+ * only in-kernel user of tagged pages is HW KASAN.
+ */
+ if (kasan_hw_tags_enabled()) {
+ pr_info("KASAN HW tags incompatible with MTE tag storage management");
+ goto out_disabled;
+ }
+
/*
* Check that tag storage is addressable by the kernel.
* cma_init_reserved_mem(), unlike cma_declare_contiguous_nid(), doesn't
--
2.43.0
Before enabling MTE tag storage management, make sure that the CMA areas
have been successfully activated. If a CMA area fails activation, the pages
are kept as reserved. Reserved pages are never used by the page allocator.
If this happens, the kernel would have to manage tag storage only for some
of the memory, but not for all memory, and that would make the code
unreasonably complicated.
Choose to disable tag storage management altogether if a CMA area fails to
be activated.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since v2:
* New patch.
arch/arm64/include/asm/mte_tag_storage.h | 12 ++++++
arch/arm64/kernel/mte_tag_storage.c | 50 ++++++++++++++++++++++++
2 files changed, 62 insertions(+)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 3c2cd29e053e..7b3f6bff8e6f 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -6,8 +6,20 @@
#define __ASM_MTE_TAG_STORAGE_H
#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+
+static inline bool tag_storage_enabled(void)
+{
+ return static_branch_likely(&tag_storage_enabled_key);
+}
+
void mte_init_tag_storage(void);
#else
+static inline bool tag_storage_enabled(void)
+{
+ return false;
+}
static inline void mte_init_tag_storage(void)
{
}
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 9a1a8a45171e..d58c68b4a849 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -19,6 +19,8 @@
#include <asm/mte_tag_storage.h>
+__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+
struct tag_region {
struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
struct range tag_range; /* Tag storage memory, in PFNs. */
@@ -314,3 +316,51 @@ void __init mte_init_tag_storage(void)
num_tag_regions = 0;
pr_info("MTE tag storage region management disabled");
}
+
+static int __init mte_enable_tag_storage(void)
+{
+ struct range *tag_range;
+ struct cma *cma;
+ int i, ret;
+
+ if (num_tag_regions == 0)
+ return 0;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ cma = tag_regions[i].cma;
+ /*
+ * CMA will keep the pages as reserved when the region fails
+ * activation.
+ */
+ if (PageReserved(pfn_to_page(tag_range->start)))
+ goto out_disabled;
+ }
+
+ static_branch_enable(&tag_storage_enabled_key);
+ pr_info("MTE tag storage region management enabled");
+
+ return 0;
+
+out_disabled:
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ cma = tag_regions[i].cma;
+
+ if (PageReserved(pfn_to_page(tag_range->start)))
+ continue;
+
+ /* Try really hard to reserve the tag storage. */
+ ret = cma_alloc(cma, range_len(tag_range), 8, true);
+ /*
+ * Tag storage is still in use for data, memory and/or tag
+ * corruption will ensue.
+ */
+ WARN_ON_ONCE(ret);
+ }
+ num_tag_regions = 0;
+ pr_info("MTE tag storage region management disabled");
+
+ return -EINVAL;
+}
+arch_initcall(mte_enable_tag_storage);
--
2.43.0
Make sure the contents of the tag storage block are not corrupted by
performing:
1. A tag dcache inval when the associated tagged pages are freed, to avoid
dirty tag cache lines being evicted and corrupting the tag storage
block when it's being used to store data.
2. A data cache inval when the tag storage block is being reserved, to
ensure that no dirty data cache lines are present, which would
trigger a writeback that could corrupt the tags stored in the block.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/assembler.h | 10 ++++++++++
arch/arm64/include/asm/mte_tag_storage.h | 2 ++
arch/arm64/kernel/mte_tag_storage.c | 11 +++++++++++
arch/arm64/lib/mte.S | 16 ++++++++++++++++
4 files changed, 39 insertions(+)
diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h
index 513787e43329..65fe88cce72b 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -310,6 +310,16 @@ alternative_cb_end
lsl \reg, \reg, \tmp // actual cache line size
.endm
+/*
+ * tcache_line_size - get the safe tag cache line size across all CPUs
+ */
+ .macro tcache_line_size, reg, tmp
+ read_ctr \tmp
+ ubfm \tmp, \tmp, #32, #37 // tag cache line size encoding
+ mov \reg, #4 // bytes per word
+ lsl \reg, \reg, \tmp // actual tag cache line size
+ .endm
+
/*
* raw_icache_line_size - get the minimum I-cache line size on this CPU
* from the CTR register.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 09f1318d924e..423b19e0cc46 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -11,6 +11,8 @@
#include <asm/mte.h>
+extern void dcache_inval_tags_poc(unsigned long start, unsigned long end);
+
#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 762c7c803a70..8c347f4855e4 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -17,6 +17,7 @@
#include <linux/string.h>
#include <linux/xarray.h>
+#include <asm/cacheflush.h>
#include <asm/mte_tag_storage.h>
__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
@@ -421,8 +422,13 @@ static bool tag_storage_block_is_reserved(unsigned long block)
static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
{
+ unsigned long block_va;
int ret;
+ block_va = (unsigned long)page_to_virt(pfn_to_page(block));
+ /* Avoid writeback of dirty data cache lines corrupting tags. */
+ dcache_inval_poc(block_va, block_va + region->block_size_pages * PAGE_SIZE);
+
ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
if (!ret)
block_ref_add(block, region, order);
@@ -570,6 +576,7 @@ void free_tag_storage(struct page *page, int order)
{
unsigned long block, start_block, end_block;
struct tag_region *region;
+ unsigned long page_va;
unsigned long flags;
int ret;
@@ -577,6 +584,10 @@ void free_tag_storage(struct page *page, int order)
if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
return;
+ page_va = (unsigned long)page_to_virt(page);
+ /* Avoid writeback of dirty tag cache lines corrupting data. */
+ dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
+
end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/lib/mte.S b/arch/arm64/lib/mte.S
index 9f623e9da09f..bc02b4e95062 100644
--- a/arch/arm64/lib/mte.S
+++ b/arch/arm64/lib/mte.S
@@ -175,3 +175,19 @@ SYM_FUNC_START(mte_copy_page_tags_from_buf)
ret
SYM_FUNC_END(mte_copy_page_tags_from_buf)
+
+/*
+ * dcache_inval_tags_poc(start, end)
+ *
+ * Ensure that any tags in the D-cache for the interval [start, end)
+ * are invalidated to PoC.
+ *
+ * - start - virtual start address of region
+ * - end - virtual end address of region
+ */
+SYM_FUNC_START(__pi_dcache_inval_tags_poc)
+ tcache_line_size x2, x3
+ dcache_by_myline_op igvac, sy, x0, x1, x2, x3
+ ret
+SYM_FUNC_END(__pi_dcache_inval_tags_poc)
+SYM_FUNC_ALIAS(dcache_inval_tags_poc, __pi_dcache_inval_tags_poc)
--
2.43.0
On arm64, when a page is mapped as tagged, its tags are zeroed for two
reasons:
* To prevent leakage of tags to userspace.
* To allow userspace to access the contents of the page without having to set
the tags explicitly (bits 59:56 of a userspace pointer are zero, which
corresponds to tag 0b0000).
The zero page receives special treatment, as the tags for the zero page are
zeroed when the MTE feature is being enabled. This is done for performance
reasons - the tags are zeroed once, instead of every time the page is
mapped.
When the tags for the zero page are zeroed, tag storage is not yet enabled.
Reserve tag storage for the page immediately after tag storage management
becomes enabled.
Note that zeroing tags before tag storage management is enabled is safe to
do because the tag storage pages are reserved at that point.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* Expanded commit message (David Hildenbrand)
arch/arm64/kernel/mte_tag_storage.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 8c347f4855e4..1c8469781870 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -363,6 +363,8 @@ static int __init mte_enable_tag_storage(void)
goto out_disabled;
}
+ reserve_tag_storage(ZERO_PAGE(0), 0, GFP_HIGHUSER);
+
static_branch_enable(&tag_storage_enabled_key);
pr_info("MTE tag storage region management enabled");
--
2.43.0
There are three situations in which a page that is to be mapped as
tagged doesn't have the corresponding tag storage reserved:
* reserve_tag_storage() failed.
* The allocation didn't specify __GFP_TAGGED (this can happen during
migration, for example).
* The page was mapped in a non-MTE enabled VMA, then an mprotect(PROT_MTE)
enabled MTE.
If a page that is about to be mapped as tagged doesn't have tag storage
reserved, map it with the PAGE_FAULT_ON_ACCESS protection to trigger a
fault the next time it is accessed, and then reserve tag storage when the
fault is handled. If tag storage cannot be reserved, the page is
migrated out of the VMA.
Tag storage pages (which cannot be tagged) mapped in an MTE enabled VMA
will be handled in a subsequent patch.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch, loosely based on the arm64 code from the rfc v2 patch #19 ("mm:
mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)")
* All the common code has been moved back to the arch independent function
handle_{huge_pmd,pte}_protnone() (David Hildenbrand).
* Page is migrated if tag storage cannot be reserved after exhausting all
attempts (Hyesoo Yu).
* Moved folio_isolate_lru() declaration and struct migration_target_control to
headers in include/linux (Peter Collingbourne).
arch/arm64/Kconfig | 1 +
arch/arm64/include/asm/mte.h | 4 +-
arch/arm64/include/asm/mte_tag_storage.h | 3 +
arch/arm64/include/asm/pgtable-prot.h | 2 +
arch/arm64/include/asm/pgtable.h | 44 ++++++++---
arch/arm64/kernel/mte.c | 11 ++-
arch/arm64/mm/fault.c | 98 ++++++++++++++++++++++++
include/linux/memcontrol.h | 2 +
include/linux/migrate.h | 8 +-
include/linux/migrate_mode.h | 1 +
mm/internal.h | 6 --
11 files changed, 156 insertions(+), 24 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6f65e9005dc9..088e30fc6d12 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2085,6 +2085,7 @@ config ARM64_MTE
if ARM64_MTE
config ARM64_MTE_TAG_STORAGE
bool
+ select ARCH_HAS_FAULT_ON_ACCESS
select CONFIG_CMA
help
Adds support for dynamic management of the memory used by the hardware
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 6457b7899207..70dc2e409070 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
}
void mte_zero_clear_page_tags(void *addr);
-void mte_sync_tags(pte_t pte, unsigned int nr_pages);
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
void mte_copy_page_tags(void *kto, const void *kfrom);
void mte_thread_init_user(void);
void mte_thread_switch(struct task_struct *next);
@@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
static inline void mte_zero_clear_page_tags(void *addr)
{
}
-static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
{
}
static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 423b19e0cc46..6d0f6ffcfdd6 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -32,6 +32,9 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
void free_tag_storage(struct page *page, int order);
bool page_tag_storage_reserved(struct page *page);
+
+vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
+ bool *map_pte);
#else
static inline bool tag_storage_enabled(void)
{
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 483dbfa39c4c..1820e29244f8 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,7 @@
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
#define PTE_PROT_NONE (_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_TAG_STORAGE_NONE (_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
/*
* This bit indicates that the entry is present i.e. pmd_page()
@@ -96,6 +97,7 @@ extern bool arm64_use_ng_mappings;
})
#define PAGE_NONE __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define PAGE_FAULT_ON_ACCESS __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
/* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
#define PAGE_SHARED __pgprot(_PAGE_SHARED)
#define PAGE_SHARED_EXEC __pgprot(_PAGE_SHARED_EXEC)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f30466199a9b..0174e292f890 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
__func__, pte_val(old_pte), pte_val(pte));
}
-static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
+static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
{
- if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
- __sync_icache_dcache(pte);
+ if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
+ __sync_icache_dcache(*pteval);
/*
* If the PTE would provide user space access to the tags associated
@@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
* pte_access_permitted() returns false for exec only mappings, they
* don't expose tags (instruction fetches don't check tags).
*/
- if (system_supports_mte() && pte_access_permitted(pte, false) &&
- !pte_special(pte) && pte_tagged(pte))
- mte_sync_tags(pte, nr_pages);
+ if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
+ !pte_special(*pteval) && pte_tagged(*pteval))
+ mte_sync_tags(pteval, nr_pages);
}
static inline void set_ptes(struct mm_struct *mm,
@@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
pte_t *ptep, pte_t pte, unsigned int nr)
{
page_table_check_ptes_set(mm, ptep, pte, nr);
- __sync_cache_and_tags(pte, nr);
+ __sync_cache_and_tags(&pte, nr);
for (;;) {
__check_safe_pte_update(mm, ptep, pte);
@@ -444,7 +444,7 @@ static inline pgprot_t pte_pgprot(pte_t pte)
return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
}
-#ifdef CONFIG_NUMA_BALANCING
+#if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS)
/*
* See the comment in include/linux/pgtable.h
*/
@@ -459,6 +459,28 @@ static inline int pmd_protnone(pmd_t pmd)
}
#endif
+#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool arch_fault_on_access_pte(pte_t pte)
+{
+ return pte_protnone(pte) && (pte_val(pte) & PTE_TAG_STORAGE_NONE);
+}
+
+static inline bool arch_fault_on_access_pmd(pmd_t pmd)
+{
+ return arch_fault_on_access_pte(pmd_pte(pmd));
+}
+
+static inline vm_fault_t arch_handle_folio_fault_on_access(struct folio *folio,
+ struct vm_fault *vmf,
+ bool *map_pte)
+{
+ if (tag_storage_enabled())
+ return handle_folio_missing_tag_storage(folio, vmf, map_pte);
+
+ return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
+
#define pmd_present_invalid(pmd) (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
static inline int pmd_present(pmd_t pmd)
@@ -533,7 +555,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
unsigned long __always_unused addr,
pte_t *ptep, pte_t pte, unsigned int nr)
{
- __sync_cache_and_tags(pte, nr);
+ __sync_cache_and_tags(&pte, nr);
__check_safe_pte_update(mm, ptep, pte);
set_pte(ptep, pte);
}
@@ -828,8 +850,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
* in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
*/
const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
- PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
- PTE_ATTRINDX_MASK;
+ PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
+ PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
/* preserve the hardware dirty information */
if (pte_hw_dirty(pte))
pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..faf09da3400a 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -35,13 +35,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
#endif
-void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
{
- struct page *page = pte_page(pte);
+ struct page *page = pte_page(*pteval);
unsigned int i;
- /* if PG_mte_tagged is set, tags have already been initialised */
for (i = 0; i < nr_pages; i++, page++) {
+ if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
+ *pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
+ continue;
+ }
+
+ /* if PG_mte_tagged is set, tags have already been initialised */
if (try_page_mte_tagging(page)) {
mte_clear_page_tags(page_address(page));
set_page_mte_tagged(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 1ffaeccecda2..1db3adb6499f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -12,6 +12,8 @@
#include <linux/extable.h>
#include <linux/kfence.h>
#include <linux/signal.h>
+#include <linux/memcontrol.h>
+#include <linux/migrate.h>
#include <linux/mm.h>
#include <linux/hardirq.h>
#include <linux/init.h>
@@ -19,6 +21,7 @@
#include <linux/kprobes.h>
#include <linux/uaccess.h>
#include <linux/page-flags.h>
+#include <linux/page-isolation.h>
#include <linux/sched/signal.h>
#include <linux/sched/debug.h>
#include <linux/highmem.h>
@@ -962,3 +965,98 @@ void tag_clear_highpage(struct page *page)
mte_zero_clear_page_tags(page_address(page));
set_page_mte_tagged(page);
}
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define MR_TAG_STORAGE MR_ARCH_1
+
+/*
+ * Called with an elevated reference on the folio.
+ * Returns with the elevated reference dropped.
+ */
+static int replace_folio_with_tagged(struct folio *folio)
+{
+ struct migration_target_control mtc = {
+ .nid = NUMA_NO_NODE,
+ .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_TAGGED,
+ };
+ LIST_HEAD(foliolist);
+ int ret, tries;
+
+ lru_cache_disable();
+
+ if (!folio_isolate_lru(folio)) {
+ lru_cache_enable();
+ folio_put(folio);
+ return -EAGAIN;
+ }
+
+ /* Isolate just grabbed another reference, drop ours. */
+ folio_put(folio);
+ list_add_tail(&folio->lru, &foliolist);
+
+ tries = 3;
+ while (tries--) {
+ ret = migrate_pages(&foliolist, alloc_migration_target, NULL, (unsigned long)&mtc,
+ MIGRATE_SYNC, MR_TAG_STORAGE, NULL);
+ if (ret != -EBUSY)
+ break;
+ }
+
+ if (ret != 0)
+ putback_movable_pages(&foliolist);
+
+ lru_cache_enable();
+
+ return ret;
+}
+
+vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
+ bool *map_pte)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ int ret = 0;
+
+ *map_pte = false;
+
+ /*
+ * This should never happen: once a VMA has been marked as tagged, that
+ * cannot be changed.
+ */
+ if (WARN_ON_ONCE(!(vma->vm_flags & VM_MTE)))
+ goto out_map;
+
+ /*
+ * The folio is probably being isolated for migration, replay the fault
+ * to give time for the entry to be replaced by a migration pte.
+ */
+ if (unlikely(is_migrate_isolate_page(folio_page(folio, 0))))
+ goto out_retry;
+
+ ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio), GFP_HIGHUSER_MOVABLE);
+ if (ret) {
+ /* replace_folio_with_tagged() is expensive, try to avoid it. */
+ if (fault_flag_allow_retry_first(vmf->flags))
+ goto out_retry;
+
+ replace_folio_with_tagged(folio);
+ return 0;
+ }
+
+out_map:
+ folio_put(folio);
+ *map_pte = true;
+ return 0;
+
+out_retry:
+ folio_put(folio);
+ if (fault_flag_allow_retry_first(vmf->flags)) {
+ /* Flag set by GUP. */
+ if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT))
+ release_fault_lock(vmf);
+ return VM_FAULT_RETRY;
+ }
+ /* Replay the fault. */
+ return 0;
+}
+#endif
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ff87f8e001..9c0b559f54f5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1633,6 +1633,8 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
}
#endif /* CONFIG_MEMCG */
+bool folio_isolate_lru(struct folio *folio);
+
static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
{
__mod_lruvec_kmem_state(p, idx, 1);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 2ce13e8a309b..f954e19bd9d1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -10,8 +10,6 @@
typedef struct folio *new_folio_t(struct folio *folio, unsigned long private);
typedef void free_folio_t(struct folio *folio, unsigned long private);
-struct migration_target_control;
-
/*
* Return values from addresss_space_operations.migratepage():
* - negative errno on page migration failure;
@@ -57,6 +55,12 @@ struct movable_operations {
void (*putback_page)(struct page *);
};
+struct migration_target_control {
+ int nid; /* preferred node id */
+ nodemask_t *nmask;
+ gfp_t gfp_mask;
+};
+
/* Defined in mm/debug.c: */
extern const char *migrate_reason_names[MR_TYPES];
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index f37cc03f9369..c6c5c7726d26 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -29,6 +29,7 @@ enum migrate_reason {
MR_CONTIG_RANGE,
MR_LONGTERM_PIN,
MR_DEMOTION,
+ MR_ARCH_1,
MR_TYPES
};
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..cb76cf0928f5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -952,12 +952,6 @@ static inline bool is_migrate_highatomic_page(struct page *page)
void setup_zone_pageset(struct zone *zone);
-struct migration_target_control {
- int nid; /* preferred node id */
- nodemask_t *nmask;
- gfp_t gfp_mask;
-};
-
/*
* mm/filemap.c
*/
--
2.43.0
There are several situations where copy_highpage() can end up copying
tags to a page which doesn't have its tag storage reserved.
One situation involves migration racing with mprotect(PROT_MTE): VMA is
initially untagged, migration starts and destination page is allocated
as untagged, mprotect(PROT_MTE) changes the VMA to tagged and userspace
accesses the source page, thus making it tagged. The migration code
then calls copy_highpage(), which will copy the tags from the source
page (now tagged) to the destination page (allocated as untagged).
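To make the race concrete, one possible interleaving (illustrative only):

	CPU 0 (migration)                      CPU 1 (userspace)
	-----------------------------------    -----------------------------------
	allocate destination page; the VMA
	is untagged, so no __GFP_TAGGED and
	no tag storage is reserved
	                                       mprotect(PROT_MTE) on the VMA
	                                       access to the source page tags it
	copy_highpage(dst, src) must now
	copy tags, but the destination has
	no tag storage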
Yet another situation can happen during THP collapse. The huge page that
will replace the HPAGE_PMD_NR contiguously mapped pages is allocated with
__GFP_TAGGED not set. copy_highpage() will copy the tags from the pages
being replaced to the huge page which doesn't have tag storage reserved.
The situation gets even more complicated when the replacement huge page
is a tag storage page. The tag storage huge page will be migrated after
a fault on access, but the tags from the original pages must be copied
over to the huge page that will be replacing the tag storage huge page.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/mm/copypage.c | 56 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/arch/arm64/mm/copypage.c b/arch/arm64/mm/copypage.c
index a7bb20055ce0..e991ccb43fb7 100644
--- a/arch/arm64/mm/copypage.c
+++ b/arch/arm64/mm/copypage.c
@@ -13,6 +13,59 @@
#include <asm/cacheflush.h>
#include <asm/cpufeature.h>
#include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static inline bool try_transfer_saved_tags(struct page *from, struct page *to)
+{
+ void *tags;
+ bool saved;
+
+ VM_WARN_ON_ONCE(!preemptible());
+
+ if (page_mte_tagged(from)) {
+ if (page_tag_storage_reserved(to))
+ return false;
+
+ tags = mte_allocate_tag_buf();
+ if (WARN_ON(!tags))
+ return true;
+
+ mte_copy_page_tags_to_buf(page_address(from), tags);
+ saved = mte_save_tags_for_pfn(tags, page_to_pfn(to));
+ if (!saved)
+ mte_free_tag_buf(tags);
+
+ return saved;
+ }
+
+ tags_by_pfn_lock();
+ tags = mte_erase_tags_for_pfn(page_to_pfn(from));
+ tags_by_pfn_unlock();
+
+ if (likely(!tags))
+ return false;
+
+ if (page_tag_storage_reserved(to)) {
+ WARN_ON_ONCE(!try_page_mte_tagging(to));
+ mte_copy_page_tags_from_buf(page_address(to), tags);
+ set_page_mte_tagged(to);
+ mte_free_tag_buf(tags);
+ return true;
+ }
+
+ saved = mte_save_tags_for_pfn(tags, page_to_pfn(to));
+ if (!saved)
+ mte_free_tag_buf(tags);
+
+ return saved;
+}
+#else
+static inline bool try_transfer_saved_tags(struct page *from, struct page *to)
+{
+ return false;
+}
+#endif
void copy_highpage(struct page *to, struct page *from)
{
@@ -24,6 +77,9 @@ void copy_highpage(struct page *to, struct page *from)
if (kasan_hw_tags_enabled())
page_kasan_tag_reset(to);
+ if (tag_storage_enabled() && try_transfer_saved_tags(from, to))
+ return;
+
if (system_supports_mte() && page_mte_tagged(from)) {
/* It's a new page, shouldn't have been tagged yet */
WARN_ON_ONCE(!try_page_mte_tagging(to));
--
2.43.0
A page can end up mapped in an MTE enabled VMA without the corresponding tag
storage block reserved. Tag accesses made by ptrace in this case can lead
to the wrong tags being read or memory corruption for the process that is
using the tag storage memory as data.
Reserve tag storage by treating ptrace accesses like a fault.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch, issue reported by Peter Collingbourne.
arch/arm64/kernel/mte.c | 26 ++++++++++++++++++++++++--
1 file changed, 24 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index faf09da3400a..b1fa02dad4fd 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
while (len) {
struct vm_area_struct *vma;
unsigned long tags, offset;
+ unsigned int fault_flags;
+ struct page *page;
+ vm_fault_t ret;
void *maddr;
- struct page *page = get_user_page_vma_remote(mm, addr,
- gup_flags, &vma);
+get_page:
+ page = get_user_page_vma_remote(mm, addr, gup_flags, &vma);
if (IS_ERR(page)) {
err = PTR_ERR(page);
break;
@@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
put_page(page);
break;
}
+
+ if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
+ fault_flags = FAULT_FLAG_DEFAULT | \
+ FAULT_FLAG_USER | \
+ FAULT_FLAG_REMOTE | \
+ FAULT_FLAG_ALLOW_RETRY | \
+ FAULT_FLAG_RETRY_NOWAIT;
+ if (write)
+ fault_flags |= FAULT_FLAG_WRITE;
+
+ put_page(page);
+ ret = handle_mm_fault(vma, addr, fault_flags, NULL);
+ if (ret & VM_FAULT_ERROR) {
+ err = -EFAULT;
+ break;
+ }
+ goto get_page;
+ }
+
WARN_ON_ONCE(!page_mte_tagged(page));
/* limit access to the end of the page */
--
2.43.0
KVM allows MTE enabled VMs to be created when the backing VMA does not have
MTE enabled. As a result, pages allocated for the virtual machine's memory
won't have tag storage reserved. Try to reserve tag storage the first time
the page is accessed by the guest. This is similar to how pages mapped
without tag storage in an MTE VMA are handled.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch.
arch/arm64/include/asm/mte_tag_storage.h | 10 ++++++
arch/arm64/include/asm/pgtable.h | 7 +++-
arch/arm64/kvm/mmu.c | 43 ++++++++++++++++++++++++
arch/arm64/mm/fault.c | 2 +-
4 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 40590a8c3748..32940ef7bcdf 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -34,6 +34,8 @@ void free_tag_storage(struct page *page, int order);
bool page_tag_storage_reserved(struct page *page);
bool page_is_tag_storage(struct page *page);
+int replace_folio_with_tagged(struct folio *folio);
+
vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
bool *map_pte);
vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
@@ -67,6 +69,14 @@ static inline bool page_tag_storage_reserved(struct page *page)
{
return true;
}
+static inline bool page_is_tag_storage(struct page *page)
+{
+ return false;
+}
+static inline int replace_folio_with_tagged(struct folio *folio)
+{
+ return -EINVAL;
+}
#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
#endif /* !__ASSEMBLY__ */
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index d0473538c926..7f89606ad617 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1108,7 +1108,12 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
#define __HAVE_ARCH_FREE_PAGES_PREPARE
static inline void arch_free_pages_prepare(struct page *page, int order)
{
- if (tag_storage_enabled() && page_mte_tagged(page))
+ /*
+ * KVM can free a page after tag storage has been reserved and before it is
+ * marked as tagged, hence use page_tag_storage_reserved() instead of
+ * page_mte_tagged() to check for tag storage.
+ */
+ if (tag_storage_enabled() && page_tag_storage_reserved(page))
free_tag_storage(page, order);
}
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index b7517c4a19c4..986a9544228d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1361,6 +1361,8 @@ static void sanitise_mte_tags(struct kvm *kvm, kvm_pfn_t pfn,
if (!kvm_has_mte(kvm))
return;
+ WARN_ON_ONCE(tag_storage_enabled() && !page_tag_storage_reserved(pfn_to_page(pfn)));
+
for (i = 0; i < nr_pages; i++, page++) {
if (try_page_mte_tagging(page)) {
mte_clear_page_tags(page_address(page));
@@ -1374,6 +1376,39 @@ static bool kvm_vma_mte_allowed(struct vm_area_struct *vma)
return vma->vm_flags & VM_MTE_ALLOWED;
}
+/*
+ * Called with an elevated reference on the pfn. If successful, the reference
+ * count is not changed. If it returns an error, the elevated reference is
+ * dropped.
+ */
+static int kvm_mte_reserve_tag_storage(kvm_pfn_t pfn)
+{
+ struct folio *folio;
+ int ret;
+
+ folio = page_folio(pfn_to_page(pfn));
+
+ if (page_tag_storage_reserved(folio_page(folio, 0)))
+ return 0;
+
+ if (page_is_tag_storage(folio_page(folio, 0)))
+ goto migrate;
+
+ ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio),
+ GFP_HIGHUSER_MOVABLE);
+ if (!ret)
+ return 0;
+
+migrate:
+ replace_folio_with_tagged(folio);
+ /*
+ * If migration succeeds, the fault needs to be replayed because 'pfn'
+ * has been unmapped. If migration fails, KVM will try to reserve tag
+ * storage again by replaying the fault.
+ */
+ return -EAGAIN;
+}
+
static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_memory_slot *memslot, unsigned long hva,
bool fault_is_perm)
@@ -1488,6 +1523,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
write_fault, &writable, NULL);
+
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
@@ -1518,6 +1554,13 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (exec_fault && device)
return -ENOEXEC;
+ if (tag_storage_enabled() && !fault_is_perm && !device &&
+ kvm_has_mte(kvm) && mte_allowed) {
+ ret = kvm_mte_reserve_tag_storage(pfn);
+ if (ret)
+ return ret == -EAGAIN ? 0 : ret;
+ }
+
read_lock(&kvm->mmu_lock);
pgt = vcpu->arch.hw_mmu->pgt;
if (mmu_invalidate_retry(kvm, mmu_seq))
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 01450ab91a87..5c12232bdf0b 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -974,7 +974,7 @@ void tag_clear_highpage(struct page *page)
* Called with an elevated reference on the folio.
* Returns with the elevated reference dropped.
*/
-static int replace_folio_with_tagged(struct folio *folio)
+int replace_folio_with_tagged(struct folio *folio)
{
struct migration_target_control mtc = {
.nid = NUMA_NO_NODE,
--
2.43.0
Tag storage pages mapped by the host in a VM with MTE enabled are migrated
when they are first accessed by the guest. This introduces latency spikes
for memory accesses made by the guest.
Tag storage pages can be mapped in the guest memory when the VM_MTE VMA
flag is not set. Introduce a new VMA flag, VM_MTE_KVM, to stop tag storage
pages from being mapped in a VM with MTE enabled.
The flag is different from VM_MTE, because the pages from the VMA won't be
mapped as tagged in the host, and host's userspace can continue to access
the guest memory as Untagged. The flag's only function is to instruct the
page allocator to treat the allocation as tagged, so tag storage pages
aren't used. The page allocator will also try to reserve tag storage for
the new page, which can speed up stage 2 aborts further if the VMM has
accessed the memory before the guest. For example, qemu and kvmtool will
benefit from this change because the guest image is copied after the
memslot is created.
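For illustration, a hypothetical VMM flow that benefits from this change
(standard KVM userspace API; includes and error handling omitted; assumes the
VM already has the MTE capability enabled):

	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	struct kvm_userspace_memory_region region = {
		.slot = 0,
		.guest_phys_addr = 0,
		.memory_size = size,
		.userspace_addr = (__u64)mem,
	};

	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
	/*
	 * kvm_arch_commit_memory_region() has set VM_MTE_KVM on the VMA, so
	 * the copy below faults in pages allocated with __GFP_TAGGED and
	 * reserves their tag storage before the guest first runs.
	 */
	memcpy(mem, guest_image, image_size);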
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch.
arch/arm64/kvm/mmu.c | 77 ++++++++++++++++++++++++++++++++++++++++++-
arch/arm64/mm/fault.c | 2 +-
include/linux/mm.h | 2 ++
3 files changed, 79 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 986a9544228d..45c57c4b9fe2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1420,7 +1420,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
unsigned long mmu_seq;
struct kvm *kvm = vcpu->kvm;
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma, *old_vma;
short vma_shift;
gfn_t gfn;
kvm_pfn_t pfn;
@@ -1428,6 +1428,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+ bool vma_has_kvm_mte = false;
if (fault_is_perm)
fault_granule = kvm_vcpu_trap_get_perm_fault_granule(vcpu);
@@ -1506,6 +1507,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
gfn = fault_ipa >> PAGE_SHIFT;
mte_allowed = kvm_vma_mte_allowed(vma);
+ vma_has_kvm_mte = !!(vma->vm_flags & VM_MTE_KVM);
+ old_vma = vma;
/* Don't use the VMA after the unlock -- it may have vanished */
vma = NULL;
@@ -1521,6 +1524,27 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);
+ /*
+ * If the VMA was created after the memslot, it doesn't have the
+ * VM_MTE_KVM flag set.
+ */
+ if (unlikely(tag_storage_enabled() && !fault_is_perm &&
+ kvm_has_mte(kvm) && mte_allowed && !vma_has_kvm_mte)) {
+ mmap_write_lock(current->mm);
+ vma = vma_lookup(current->mm, hva);
+ /* The VMA was changed, replay the fault. */
+ if (vma != old_vma) {
+ mmap_write_unlock(current->mm);
+ return 0;
+ }
+ if (!(vma->vm_flags & VM_MTE_KVM)) {
+ vma_start_write(vma);
+ vm_flags_reset(vma, vma->vm_flags | VM_MTE_KVM);
+ }
+ vma = NULL;
+ mmap_write_unlock(current->mm);
+ }
+
pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
write_fault, &writable, NULL);
@@ -1986,6 +2010,40 @@ int __init kvm_mmu_init(u32 *hyp_va_bits)
return err;
}
+static int kvm_set_clear_kvm_mte_vma(const struct kvm_memory_slot *memslot, bool set)
+{
+ struct vm_area_struct *vma;
+ hva_t hva, memslot_end;
+ int ret = 0;
+
+ hva = memslot->userspace_addr;
+ memslot_end = hva + (memslot->npages << PAGE_SHIFT);
+
+ mmap_write_lock(current->mm);
+
+ do {
+ vma = find_vma_intersection(current->mm, hva, memslot_end);
+ if (!vma)
+ break;
+ if (!kvm_vma_mte_allowed(vma))
+ continue;
+ if (set) {
+ if (!(vma->vm_flags & VM_MTE_KVM)) {
+ vma_start_write(vma);
+ vm_flags_reset(vma, vma->vm_flags | VM_MTE_KVM);
+ }
+ } else if (vma->vm_flags & VM_MTE_KVM) {
+ vma_start_write(vma);
+ vm_flags_reset(vma, vma->vm_flags & ~VM_MTE_KVM);
+ }
+ hva = min(memslot_end, vma->vm_end);
+ } while (hva < memslot_end);
+
+ mmap_write_unlock(current->mm);
+
+ return ret;
+}
+
void kvm_arch_commit_memory_region(struct kvm *kvm,
struct kvm_memory_slot *old,
const struct kvm_memory_slot *new,
@@ -1993,6 +2051,23 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
{
bool log_dirty_pages = new && new->flags & KVM_MEM_LOG_DIRTY_PAGES;
+ if (kvm_has_mte(kvm) && change != KVM_MR_FLAGS_ONLY) {
+ switch (change) {
+ case KVM_MR_CREATE:
+ kvm_set_clear_kvm_mte_vma(new, true);
+ break;
+ case KVM_MR_DELETE:
+ kvm_set_clear_kvm_mte_vma(old, false);
+ break;
+ case KVM_MR_MOVE:
+ kvm_set_clear_kvm_mte_vma(old, false);
+ kvm_set_clear_kvm_mte_vma(new, true);
+ break;
+ default:
+ WARN(true, "Unknown memslot change");
+ }
+ }
+
/*
* At this point memslot has been committed and there is an
* allocated dirty_bitmap[], dirty pages will be tracked while the
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 5c12232bdf0b..f4ca3ba8dde7 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -947,7 +947,7 @@ NOKPROBE_SYMBOL(do_debug_exception);
*/
gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
{
- if (vma->vm_flags & VM_MTE)
+ if (vma->vm_flags & (VM_MTE | VM_MTE_KVM))
return __GFP_TAGGED;
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..924aa7c26ec9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -375,9 +375,11 @@ extern unsigned int kobjsize(const void *objp);
#if defined(CONFIG_ARM64_MTE)
# define VM_MTE VM_HIGH_ARCH_0 /* Use Tagged memory for access control */
# define VM_MTE_ALLOWED VM_HIGH_ARCH_1 /* Tagged memory permitted */
+# define VM_MTE_KVM VM_HIGH_ARCH_2 /* VMA is mapped in a virtual machine with MTE */
#else
# define VM_MTE VM_NONE
# define VM_MTE_ALLOWED VM_NONE
+# define VM_MTE_KVM VM_NONE
#endif
#ifndef VM_GROWSUP
--
2.43.0
Everything is in place; enable tag storage management.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 088e30fc6d12..95c153705a2c 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2084,7 +2084,7 @@ config ARM64_MTE
if ARM64_MTE
config ARM64_MTE_TAG_STORAGE
- bool
+ bool "MTE tag storage management"
select ARCH_HAS_FAULT_ON_ACCESS
select CONFIG_CMA
help
--
2.43.0
Alongside the base address, arm64 will also need to know the size of a
tag storage region. Teach of_flat_dt_translate_address() to parse and
return the size.
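A sketch of a caller that makes use of the new out_size parameter (the caller
itself is hypothetical; OF_BAD_ADDR is the existing error value):

	u64 size;
	u64 addr = of_flat_dt_translate_address(node, &size);

	if (addr == OF_BAD_ADDR)
		return -EINVAL;
	/* The region spans [addr, addr + size). */

Callers that don't need the size pass NULL, as the earlycon and sh2 updates
below do.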
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch, suggested by Rob Herring.
arch/sh/kernel/cpu/sh2/probe.c | 2 +-
drivers/of/fdt_address.c | 12 +++++++++---
drivers/tty/serial/earlycon.c | 2 +-
include/linux/of_fdt.h | 2 +-
4 files changed, 12 insertions(+), 6 deletions(-)
diff --git a/arch/sh/kernel/cpu/sh2/probe.c b/arch/sh/kernel/cpu/sh2/probe.c
index 70a07f4f2142..fa8904e8f390 100644
--- a/arch/sh/kernel/cpu/sh2/probe.c
+++ b/arch/sh/kernel/cpu/sh2/probe.c
@@ -21,7 +21,7 @@ static int __init scan_cache(unsigned long node, const char *uname,
if (!of_flat_dt_is_compatible(node, "jcore,cache"))
return 0;
- j2_ccr_base = ioremap(of_flat_dt_translate_address(node), 4);
+ j2_ccr_base = ioremap(of_flat_dt_translate_address(node, NULL), 4);
return 1;
}
diff --git a/drivers/of/fdt_address.c b/drivers/of/fdt_address.c
index 1dc15ab78b10..4c077778d710 100644
--- a/drivers/of/fdt_address.c
+++ b/drivers/of/fdt_address.c
@@ -160,7 +160,8 @@ static int __init fdt_translate_one(const void *blob, int parent,
* that can be mapped to a cpu physical address). This is not really specified
* that way, but this is traditionally the way IBM at least do things
*/
-static u64 __init fdt_translate_address(const void *blob, int node_offset)
+static u64 __init fdt_translate_address(const void *blob, int node_offset,
+ u64 *out_size)
{
int parent, len;
const struct of_bus *bus, *pbus;
@@ -193,6 +194,9 @@ static u64 __init fdt_translate_address(const void *blob, int node_offset)
goto bail;
}
memcpy(addr, reg, na * 4);
+ /* The size of the region doesn't need translating. */
+ if (out_size)
+ *out_size = of_read_number(reg + na, ns);
pr_debug("bus (na=%d, ns=%d) on %s\n",
na, ns, fdt_get_name(blob, parent, NULL));
@@ -242,8 +246,10 @@ static u64 __init fdt_translate_address(const void *blob, int node_offset)
/**
* of_flat_dt_translate_address - translate DT addr into CPU phys addr
* @node: node in the flat blob
+ * @out_size: size of the region, can be NULL if not needed
+ * @return: the address, OF_BAD_ADDR in case of error
*/
-u64 __init of_flat_dt_translate_address(unsigned long node)
+u64 __init of_flat_dt_translate_address(unsigned long node, u64 *out_size)
{
- return fdt_translate_address(initial_boot_params, node);
+ return fdt_translate_address(initial_boot_params, node, out_size);
}
diff --git a/drivers/tty/serial/earlycon.c b/drivers/tty/serial/earlycon.c
index a5fbb6ed38ae..e941cf786232 100644
--- a/drivers/tty/serial/earlycon.c
+++ b/drivers/tty/serial/earlycon.c
@@ -265,7 +265,7 @@ int __init of_setup_earlycon(const struct earlycon_id *match,
spin_lock_init(&port->lock);
port->iotype = UPIO_MEM;
- addr = of_flat_dt_translate_address(node);
+ addr = of_flat_dt_translate_address(node, NULL);
if (addr == OF_BAD_ADDR) {
pr_warn("[%s] bad address\n", match->name);
return -ENXIO;
diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
index d69ad5bb1eb1..0e26f8c3b10e 100644
--- a/include/linux/of_fdt.h
+++ b/include/linux/of_fdt.h
@@ -36,7 +36,7 @@ extern char __dtb_start[];
extern char __dtb_end[];
/* Other Prototypes */
-extern u64 of_flat_dt_translate_address(unsigned long node);
+extern u64 of_flat_dt_translate_address(unsigned long node, u64 *out_size);
extern void of_fdt_limit_memory(int limit);
#endif /* CONFIG_OF_FLATTREE */
--
2.43.0
Faking a tag storage region for FVP is useful for testing.
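As a sanity check on the sizes used below: tag storage is 1/32 of the memory
it covers. memory0 is 0x80000000 bytes (2 GiB), so tags0 is 0x80000000 / 32 =
0x4000000 bytes (64 MiB); memory1 is 0x78000000 bytes (1920 MiB), so tags1 is
0x78000000 / 32 = 0x3c00000 bytes (60 MiB).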
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch, not intended to be merged.
arch/arm64/boot/dts/arm/fvp-base-revc.dts | 42 +++++++++++++++++++++--
1 file changed, 39 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/boot/dts/arm/fvp-base-revc.dts b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
index 60472d65a355..e9f44420cb62 100644
--- a/arch/arm64/boot/dts/arm/fvp-base-revc.dts
+++ b/arch/arm64/boot/dts/arm/fvp-base-revc.dts
@@ -165,10 +165,30 @@ C1_L2: l2-cache1 {
};
};
- memory@80000000 {
+ memory0: memory@80000000 {
device_type = "memory";
- reg = <0x00000000 0x80000000 0 0x80000000>,
- <0x00000008 0x80000000 0 0x80000000>;
+ reg = <0x00 0x80000000 0x00 0x80000000>;
+ numa-node-id = <0x00>;
+ };
+
+ /* tags0 */
+ tags_memory0: memory@8f8000000 {
+ device_type = "memory";
+ reg = <0x08 0xf8000000 0x00 0x4000000>;
+ numa-node-id = <0x00>;
+ };
+
+ memory1: memory@880000000 {
+ device_type = "memory";
+ reg = <0x08 0x80000000 0x00 0x78000000>;
+ numa-node-id = <0x01>;
+ };
+
+ /* tags1 */
+ tags_memory1: memory@8fc000000 {
+ device_type = "memory";
+ reg = <0x08 0xfc000000 0x00 0x3c00000>;
+ numa-node-id = <0x01>;
};
reserved-memory {
@@ -183,6 +203,22 @@ vram: vram@18000000 {
reg = <0x00000000 0x18000000 0 0x00800000>;
no-map;
};
+
+ tags0: tag-storage@8f8000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x08 0xf8000000 0x00 0x4000000>;
+ block-size = <0x1000>;
+ tagged-memory = <&memory0>;
+ reusable;
+ };
+
+ tags1: tag-storage@8fc000000 {
+ compatible = "arm,mte-tag-storage";
+ reg = <0x08 0xfc000000 0x00 0x3c00000>;
+ block-size = <0x1000>;
+ tagged-memory = <&memory1>;
+ reusable;
+ };
};
gic: interrupt-controller@2f000000 {
--
2.43.0
Add the function of_flat_read_u32() to return the value of a property as
a u32.
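A hypothetical caller, reading the "block-size" property used by the tag
storage device tree node elsewhere in this series:

	u32 block_size;

	if (of_flat_read_u32(node, "block-size", &block_size))
		return -EINVAL;
	/* block_size now holds the tag storage block size in bytes. */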
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch, suggested by Rob Herring.
drivers/of/fdt.c | 21 +++++++++++++++++++++
include/linux/of_fdt.h | 2 ++
2 files changed, 23 insertions(+)
diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index bf502ba8da95..dfcd79fd5fd9 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -755,6 +755,27 @@ const void *__init of_get_flat_dt_prop(unsigned long node, const char *name,
return fdt_getprop(initial_boot_params, node, name, size);
}
+/*
+ * of_flat_read_u32 - Return the value of the given property as a u32.
+ *
+ * @node: device node from which the property value is to be read
+ * @propname: name of the property
+ * @out_value: the value of the property
+ * @return: 0 on success, -EINVAL if property does not exist
+ */
+int __init of_flat_read_u32(unsigned long node, const char *propname,
+ u32 *out_value)
+{
+ const __be32 *reg;
+
+ reg = of_get_flat_dt_prop(node, propname, NULL);
+ if (!reg)
+ return -EINVAL;
+
+ *out_value = be32_to_cpup(reg);
+ return 0;
+}
+
/**
* of_fdt_is_compatible - Return true if given node from the given blob has
* compat in its compatible list
diff --git a/include/linux/of_fdt.h b/include/linux/of_fdt.h
index 0e26f8c3b10e..d7901699061b 100644
--- a/include/linux/of_fdt.h
+++ b/include/linux/of_fdt.h
@@ -57,6 +57,8 @@ extern const void *of_get_flat_dt_prop(unsigned long node, const char *name,
extern int of_flat_dt_is_compatible(unsigned long node, const char *name);
extern unsigned long of_get_flat_dt_root(void);
extern uint32_t of_get_flat_dt_phandle(unsigned long node);
+extern int of_flat_read_u32(unsigned long node, const char *propname,
+ u32 *out_value);
extern int early_init_dt_scan_chosen(char *cmdline);
extern int early_init_dt_scan_memory(void);
--
2.43.0
According to ARM DDI 0487J.a, page D10-5976, a memory location which
doesn't have the Normal memory attribute is considered Untagged, and
accesses are Tag Unchecked. Tag reads from an Untagged address return
0b0000, and writes are ignored.
Linux uses VM_PFNMAP VMAs to represent device memory, and doesn't set the
VM_MTE_ALLOWED flag for these VMAs.
In user_mem_abort(), KVM requires that all VMAs that back guest memory must
allow tagging (VM_MTE_ALLOWED flag set), except for VMAs that represent
device memory. When a memslot is created or changed, KVM enforces a
different behaviour: **all** VMAs that intersect the memslot must allow
tagging, even those that represent device memory. This is too restrictive,
and can lead to inconsistent behaviour: a VM_PFNMAP VMA that is present
when a memslot is created causes KVM_SET_USER_MEMORY_REGION to fail, but if
such a VMA is created after the memslot has been created, the virtual
machine will run without errors.
Change kvm_arch_prepare_memory_region() to allow VM_PFNMAP VMAs when the VM
has the MTE capability enabled.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes from rfc v2:
* New patch. It's a fix, and can be taken independently of the series.
arch/arm64/kvm/mmu.c | 8 +++-----
1 file changed, 3 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d14504821b79..b7517c4a19c4 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -2028,17 +2028,15 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
if (!vma)
break;
- if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) {
- ret = -EINVAL;
- break;
- }
-
if (vma->vm_flags & VM_PFNMAP) {
/* IO region dirty page logging not allowed */
if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
ret = -EINVAL;
break;
}
+ } else if (kvm_has_mte(kvm) && !kvm_vma_mte_allowed(vma)) {
+ ret = -EINVAL;
+ break;
}
hva = min(reg_end, vma->vm_end);
} while (hva < reg_end);
--
2.43.0
Reserve tag storage for a page that is being allocated as tagged. This
is a best effort approach, and failing to reserve tag storage is
allowed.
When all the associated tagged pages have been freed, return the tag
storage pages back to the page allocator, where they can be used again for
data allocations.
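For reference, a standalone sketch of the sizing arithmetic the reservation
code relies on; the helper mirrors order_to_num_blocks() from this patch and
the worked numbers are illustrative. One tag storage page covers 32 data
pages, because MTE stores 4 bits of tag per 16 bytes of data.

	/* Uses the kernel's max() and DIV_ROUND_UP() helpers. */
	static int order_to_num_blocks(int order, u32 block_size_pages)
	{
		int num_tag_storage_pages = max((1 << order) / 32, 1);

		return DIV_ROUND_UP(num_tag_storage_pages, block_size_pages);
	}

	/*
	 * With a block size of one page:
	 *  - order 0 (one 4K page):  max(0, 1) = 1 tag page   -> 1 block
	 *  - order 9 (2M THP):       512 / 32  = 16 tag pages -> 16 blocks
	 */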
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page
allocation").
* Fixed calculation of the number of associated tag storage blocks (Hyesoo
Yu).
* Tag storage is reserved in arch_alloc_page() instead of
arch_prep_new_page().
arch/arm64/include/asm/mte.h | 16 +-
arch/arm64/include/asm/mte_tag_storage.h | 31 +++
arch/arm64/include/asm/page.h | 5 +
arch/arm64/include/asm/pgtable.h | 19 ++
arch/arm64/kernel/mte_tag_storage.c | 234 +++++++++++++++++++++++
arch/arm64/mm/fault.c | 7 +
fs/proc/page.c | 1 +
include/linux/kernel-page-flags.h | 1 +
include/linux/page-flags.h | 1 +
include/trace/events/mmflags.h | 3 +-
mm/huge_memory.c | 1 +
11 files changed, 316 insertions(+), 3 deletions(-)
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 8034695b3dd7..6457b7899207 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
#ifdef CONFIG_ARM64_MTE
/* track which pages have valid allocation tags */
-#define PG_mte_tagged PG_arch_2
+#define PG_mte_tagged PG_arch_2
/* simple lock to avoid multiple threads tagging the same page */
-#define PG_mte_lock PG_arch_3
+#define PG_mte_lock PG_arch_3
+/* Track if a tagged page has tag storage reserved */
+#define PG_tag_storage_reserved PG_arch_4
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
+extern bool page_tag_storage_reserved(struct page *page);
+#endif
static inline void set_page_mte_tagged(struct page *page)
{
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+ /* Open code tag_storage_enabled() */
+ WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
+ !page_tag_storage_reserved(page));
+#endif
/*
* Ensure that the tags written prior to this function are visible
* before the page flags update.
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 7b3f6bff8e6f..09f1318d924e 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -5,6 +5,12 @@
#ifndef __ASM_MTE_TAG_STORAGE_H
#define __ASM_MTE_TAG_STORAGE_H
+#ifndef __ASSEMBLY__
+
+#include <linux/mm_types.h>
+
+#include <asm/mte.h>
+
#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
@@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void)
}
void mte_init_tag_storage(void);
+
+static inline bool alloc_requires_tag_storage(gfp_t gfp)
+{
+ return gfp & __GFP_TAGGED;
+}
+int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
+void free_tag_storage(struct page *page, int order);
+
+bool page_tag_storage_reserved(struct page *page);
#else
static inline bool tag_storage_enabled(void)
{
@@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void)
static inline void mte_init_tag_storage(void)
{
}
+static inline bool alloc_requires_tag_storage(gfp_t gfp)
+{
+ return false;
+}
+static inline int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
+{
+ return 0;
+}
+static inline void free_tag_storage(struct page *page, int order)
+{
+}
+static inline bool page_tag_storage_reserved(struct page *page)
+{
+ return true;
+}
#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
+#endif /* !__ASSEMBLY__ */
#endif /* __ASM_MTE_TAG_STORAGE_H */
diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
index 88bab032a493..3a656492f34a 100644
--- a/arch/arm64/include/asm/page.h
+++ b/arch/arm64/include/asm/page.h
@@ -35,6 +35,11 @@ void copy_highpage(struct page *to, struct page *from);
void tag_clear_highpage(struct page *to);
#define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+void arch_alloc_page(struct page *, int order, gfp_t gfp);
+#define HAVE_ARCH_ALLOC_PAGE
+#endif
+
#define clear_user_page(page, vaddr, pg) clear_page(page)
#define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 2499cc4fa4f2..f30466199a9b 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -10,6 +10,7 @@
#include <asm/memory.h>
#include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
#include <asm/pgtable-hwdef.h>
#include <asm/pgtable-prot.h>
#include <asm/tlbflush.h>
@@ -1069,6 +1070,24 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
mte_restore_page_tags_by_swp_entry(entry, &folio->page);
}
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+
+#define __HAVE_ARCH_FREE_PAGES_PREPARE
+static inline void arch_free_pages_prepare(struct page *page, int order)
+{
+ if (tag_storage_enabled() && page_mte_tagged(page))
+ free_tag_storage(page, order);
+}
+
+#define __HAVE_ARCH_ALLOC_CMA
+static inline bool arch_alloc_cma(gfp_t gfp_mask)
+{
+ if (tag_storage_enabled() && alloc_requires_tag_storage(gfp_mask))
+ return false;
+ return true;
+}
+
+#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
#endif /* CONFIG_ARM64_MTE */
#define __HAVE_ARCH_CALC_VMA_GFP
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index d58c68b4a849..762c7c803a70 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -34,6 +34,31 @@ struct tag_region {
static struct tag_region tag_regions[MAX_TAG_REGIONS];
static int num_tag_regions;
+/*
+ * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
+ * because alloc_contig_range() might sleep.
+ *
+ * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
+ * because pages can be freed from non-preemptible contexts, including from an
+ * interrupt handler.
+ *
+ * Because tag storage can be freed from interrupt contexts, the xarray is
+ * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
+ * xa_store(). This is done to prevent a deadlock with free_tag_storage() being
+ * called from an interrupt raised before xa_store() releases the xa_lock.
+ *
+ * All of the above means that reserve_tag_storage() cannot run concurrently
+ * with itself (no concurrent insertions), but it can run at the same time as
+ * free_tag_storage(). The first thing that reserve_tag_storage() does after
+ * taking the mutex is increase the refcount on all present tag storage blocks
+ * with the xa_lock held, to serialize against freeing the blocks. This is an
+ * optimization to avoid taking and releasing the xa_lock after each iteration
+ * if the refcount operation was moved inside the loop, where it would have had
+ * to be executed for each block.
+ */
+static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
+static DEFINE_MUTEX(tag_blocks_lock);
+
static u32 __init get_block_size_pages(u32 block_size_bytes)
{
u32 a = PAGE_SIZE;
@@ -364,3 +389,212 @@ static int __init mte_enable_tag_storage(void)
return -EINVAL;
}
arch_initcall(mte_enable_tag_storage);
+
+static void page_set_tag_storage_reserved(struct page *page, int order)
+{
+ int i;
+
+ for (i = 0; i < (1 << order); i++)
+ set_bit(PG_tag_storage_reserved, &(page + i)->flags);
+}
+
+static void block_ref_add(unsigned long block, struct tag_region *region, int order)
+{
+ int count;
+
+ count = min(1u << order, 32 * region->block_size_pages);
+ page_ref_add(pfn_to_page(block), count);
+}
+
+static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
+{
+ int count;
+
+ count = min(1u << order, 32 * region->block_size_pages);
+ return page_ref_sub_return(pfn_to_page(block), count);
+}
+
+static bool tag_storage_block_is_reserved(unsigned long block)
+{
+ return xa_load(&tag_blocks_reserved, block) != NULL;
+}
+
+static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
+{
+ int ret;
+
+ ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
+ if (!ret)
+ block_ref_add(block, region, order);
+
+ return ret;
+}
+
+static int order_to_num_blocks(int order, u32 block_size_pages)
+{
+ int num_tag_storage_pages = max((1 << order) / 32, 1);
+
+ return DIV_ROUND_UP(num_tag_storage_pages, block_size_pages);
+}
+
+static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
+ struct tag_region *region)
+{
+ struct range *tag_range = &region->tag_range;
+ struct range *mem_range = &region->mem_range;
+ u64 page_pfn = page_to_pfn(page);
+ u64 block, block_offset;
+
+ if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
+ return -ERANGE;
+
+ block_offset = (page_pfn - mem_range->start) / 32;
+ block = tag_range->start + rounddown(block_offset, region->block_size_pages);
+
+ if (block + region->block_size_pages - 1 > tag_range->end) {
+ pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
+ PFN_PHYS(block), PFN_PHYS(block + region->block_size_pages + 1) - 1,
+ PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end + 1) - 1);
+ return -ERANGE;
+ }
+ *blockp = block;
+
+ return 0;
+
+}
+
+static int tag_storage_find_block(struct page *page, unsigned long *block,
+ struct tag_region **region)
+{
+ int i, ret;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
+ if (ret == 0) {
+ *region = &tag_regions[i];
+ return 0;
+ }
+ }
+
+ return -EINVAL;
+}
+
+bool page_tag_storage_reserved(struct page *page)
+{
+ return test_bit(PG_tag_storage_reserved, &page->flags);
+}
+
+int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
+{
+ unsigned long start_block, end_block;
+ struct tag_region *region;
+ unsigned long block;
+ unsigned long flags;
+ int ret = 0;
+
+ VM_WARN_ON_ONCE(!preemptible());
+
+ if (page_tag_storage_reserved(page))
+ return 0;
+
+ /*
+ * __alloc_contig_migrate_range() ignores gfp when allocating the
+ * destination page for migration. Regardless, massage gfp flags and
+ * remove __GFP_TAGGED to avoid recursion in case gfp stops being
+ * ignored.
+ */
+ gfp &= ~__GFP_TAGGED;
+ if (!(gfp & __GFP_NORETRY))
+ gfp |= __GFP_RETRY_MAYFAIL;
+
+ ret = tag_storage_find_block(page, &start_block, &region);
+ if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+ return -EINVAL;
+ end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
+
+ mutex_lock(&tag_blocks_lock);
+
+ /* Check again, this time with the lock held. */
+ if (page_tag_storage_reserved(page))
+ goto out_unlock;
+
+ /* Make sure existing entries are not freed from under our feet. */
+ xa_lock_irqsave(&tag_blocks_reserved, flags);
+ for (block = start_block; block < end_block; block += region->block_size_pages) {
+ if (tag_storage_block_is_reserved(block))
+ block_ref_add(block, region, order);
+ }
+ xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+ for (block = start_block; block < end_block; block += region->block_size_pages) {
+ /* Refcount incremented above. */
+ if (tag_storage_block_is_reserved(block))
+ continue;
+
+ ret = cma_alloc_range(region->cma, block, region->block_size_pages, 3, gfp);
+ /* Should never happen. */
+ VM_WARN_ON_ONCE(ret == -EEXIST);
+ if (ret)
+ goto out_error;
+
+ ret = tag_storage_reserve_block(block, region, order);
+ if (ret) {
+ cma_release(region->cma, pfn_to_page(block), region->block_size_pages);
+ goto out_error;
+ }
+ }
+
+ page_set_tag_storage_reserved(page, order);
+out_unlock:
+ mutex_unlock(&tag_blocks_lock);
+
+ return 0;
+
+out_error:
+ xa_lock_irqsave(&tag_blocks_reserved, flags);
+ for (block = start_block; block < end_block; block += region->block_size_pages) {
+ if (tag_storage_block_is_reserved(block) &&
+ block_ref_sub_return(block, region, order) == 1) {
+ __xa_erase(&tag_blocks_reserved, block);
+ cma_release(region->cma, pfn_to_page(block), region->block_size_pages);
+ }
+ }
+ xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+
+ mutex_unlock(&tag_blocks_lock);
+
+ return ret;
+}
+
+void free_tag_storage(struct page *page, int order)
+{
+ unsigned long block, start_block, end_block;
+ struct tag_region *region;
+ unsigned long flags;
+ int ret;
+
+ ret = tag_storage_find_block(page, &start_block, &region);
+ if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
+ return;
+
+ end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
+
+ xa_lock_irqsave(&tag_blocks_reserved, flags);
+ for (block = start_block; block < end_block; block += region->block_size_pages) {
+ if (WARN_ONCE(!tag_storage_block_is_reserved(block),
+ "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
+ continue;
+
+ if (block_ref_sub_return(block, region, order) == 1) {
+ __xa_erase(&tag_blocks_reserved, block);
+ cma_release(region->cma, pfn_to_page(block), region->block_size_pages);
+ }
+ }
+ xa_unlock_irqrestore(&tag_blocks_reserved, flags);
+}
+
+void arch_alloc_page(struct page *page, int order, gfp_t gfp)
+{
+ if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
+ reserve_tag_storage(page, order, gfp);
+}
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index c022e473c17c..1ffaeccecda2 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -37,6 +37,7 @@
#include <asm/esr.h>
#include <asm/kprobes.h>
#include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
#include <asm/processor.h>
#include <asm/sysreg.h>
#include <asm/system_misc.h>
@@ -950,6 +951,12 @@ gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
void tag_clear_highpage(struct page *page)
{
+ if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
+ /* Don't zero the tags if tag storage is not reserved */
+ clear_page(page_address(page));
+ return;
+ }
+
/* Newly allocated page, shouldn't have been tagged yet */
WARN_ON_ONCE(!try_page_mte_tagging(page));
mte_zero_clear_page_tags(page_address(page));
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 195b077c0fac..e7eb584a9234 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
u |= kpf_copy_bit(k, KPF_ARCH_2, PG_arch_2);
u |= kpf_copy_bit(k, KPF_ARCH_3, PG_arch_3);
+ u |= kpf_copy_bit(k, KPF_ARCH_4, PG_arch_4);
#endif
return u;
diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
index 859f4b0c1b2b..4a0d719ffdd4 100644
--- a/include/linux/kernel-page-flags.h
+++ b/include/linux/kernel-page-flags.h
@@ -19,5 +19,6 @@
#define KPF_SOFTDIRTY 40
#define KPF_ARCH_2 41
#define KPF_ARCH_3 42
+#define KPF_ARCH_4 43
#endif /* LINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b7237bce7446..03f03e6d735e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -135,6 +135,7 @@ enum pageflags {
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
PG_arch_2,
PG_arch_3,
+ PG_arch_4,
#endif
__NR_PAGEFLAGS,
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 6ca0d5ed46c0..ba962fd10a2c 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison) \
IF_HAVE_PG_IDLE(idle) \
IF_HAVE_PG_IDLE(young) \
IF_HAVE_PG_ARCH_X(arch_2) \
-IF_HAVE_PG_ARCH_X(arch_3)
+IF_HAVE_PG_ARCH_X(arch_3) \
+IF_HAVE_PG_ARCH_X(arch_4)
#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2bad63a7ec16..47932539cc50 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2804,6 +2804,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
#ifdef CONFIG_ARCH_USES_PG_ARCH_X
(1L << PG_arch_2) |
(1L << PG_arch_3) |
+ (1L << PG_arch_4) |
#endif
(1L << PG_dirty) |
LRU_GEN_MASK | LRU_REFS_MASK));
--
2.43.0
Tag storage pages cannot be tagged. When such a page is mapped in an
MTE-enabled VMA, migrate it out directly and don't try to reserve tag
storage for it.
Signed-off-by: Alexandru Elisei <[email protected]>
---
arch/arm64/include/asm/mte_tag_storage.h | 1 +
arch/arm64/kernel/mte_tag_storage.c | 15 +++++++++++++++
arch/arm64/mm/fault.c | 11 +++++++++--
3 files changed, 25 insertions(+), 2 deletions(-)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6d0f6ffcfdd6..50bdae94cf71 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -32,6 +32,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
void free_tag_storage(struct page *page, int order);
bool page_tag_storage_reserved(struct page *page);
+bool page_is_tag_storage(struct page *page);
vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
bool *map_pte);
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index 1c8469781870..afe2bb754879 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -492,6 +492,21 @@ bool page_tag_storage_reserved(struct page *page)
return test_bit(PG_tag_storage_reserved, &page->flags);
}
+bool page_is_tag_storage(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct range *tag_range;
+ int i;
+
+ for (i = 0; i < num_tag_regions; i++) {
+ tag_range = &tag_regions[i].tag_range;
+ if (tag_range->start <= pfn && pfn <= tag_range->end)
+ return true;
+ }
+
+ return false;
+}
+
int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
{
unsigned long start_block, end_block;
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 1db3adb6499f..01450ab91a87 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -1014,6 +1014,7 @@ static int replace_folio_with_tagged(struct folio *folio)
vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
bool *map_pte)
{
+ bool is_tag_storage = page_is_tag_storage(folio_page(folio, 0));
struct vm_area_struct *vma = vmf->vma;
int ret = 0;
@@ -1033,12 +1034,18 @@ vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault
if (unlikely(is_migrate_isolate_page(folio_page(folio, 0))))
goto out_retry;
- ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio), GFP_HIGHUSER_MOVABLE);
- if (ret) {
+ if (!is_tag_storage) {
+ ret = reserve_tag_storage(folio_page(folio, 0), folio_order(folio),
+ GFP_HIGHUSER_MOVABLE);
+ if (!ret)
+ goto out_map;
+
/* replace_folio_with_tagged() is expensive, try to avoid it. */
if (fault_flag_allow_retry_first(vmf->flags))
goto out_retry;
+ }
+ if (ret || is_tag_storage) {
replace_folio_with_tagged(folio);
return 0;
}
--
2.43.0
Linux restores tags when a page is swapped in and there are tags associated
with the swap entry which the new page will replace. The saved tags are
restored even if the page will not be mapped as tagged, to protect against
cases where the page is shared between different VMAs, and is tagged in
some, but untagged in others. By using this approach, the process can still
access the correct tags following an mprotect(PROT_MTE) on the non-MTE
enabled VMA.
But this poses a challenge for managing tag storage: in the scenario above,
when a new page is allocated to be swapped in for the process where it will
be mapped as untagged, the corresponding tag storage block is not reserved.
mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
overwrite data in the tag storage block associated with the new page,
leading to data corruption if the block is in use by a process.
Get around this issue by saving the tags in a new xarray, this time indexed
by the page pfn, and then restoring them when tag storage is reserved for
the page.
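Schematically, with the helpers added by this patch, the swap-in path for
such a page becomes:

	arch_swap_prepare_to_restore()
	  mte_try_transfer_swap_tags()      /* tags_by_swp_entry -> tags_by_pfn */
	...later, when tag storage is reserved for the new page...
	reserve_tag_storage()
	  mte_restore_tags_for_pfn()        /* tags_by_pfn -> page, marks it PG_mte_tagged */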
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* Restore saved tags **before** setting the PG_tag_storage_reserved bit to
eliminate a brief window of opportunity where userspace can access uninitialized
tags (Peter Collingbourne).
arch/arm64/include/asm/mte_tag_storage.h | 8 ++
arch/arm64/include/asm/pgtable.h | 11 +++
arch/arm64/kernel/mte_tag_storage.c | 12 ++-
arch/arm64/mm/mteswap.c | 110 +++++++++++++++++++++++
4 files changed, 140 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 50bdae94cf71..40590a8c3748 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
bool *map_pte);
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
+
+void tags_by_pfn_lock(void);
+void tags_by_pfn_unlock(void);
+
+void *mte_erase_tags_for_pfn(unsigned long pfn);
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
#else
static inline bool tag_storage_enabled(void)
{
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0174e292f890..87ae59436162 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
mte_invalidate_tags_area_by_swp_entry(type);
}
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
+static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
+ struct folio *folio)
+{
+ if (tag_storage_enabled())
+ return mte_try_transfer_swap_tags(entry, &folio->page);
+ return 0;
+}
+#endif
+
#define __HAVE_ARCH_SWAP_RESTORE
static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
{
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index afe2bb754879..ac7b9c9c585c 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
}
}
+ mte_restore_tags_for_pfn(page_to_pfn(page), order);
page_set_tag_storage_reserved(page, order);
out_unlock:
mutex_unlock(&tag_blocks_lock);
@@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
struct tag_region *region;
unsigned long page_va;
unsigned long flags;
- int ret;
+ void *tags;
+ int i, ret;
ret = tag_storage_find_block(page, &start_block, &region);
if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
@@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
/* Avoid writeback of dirty tag cache lines corrupting data. */
dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
+ tags_by_pfn_lock();
+ for (i = 0; i < (1 << order); i++) {
+ tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
+ if (unlikely(tags))
+ mte_free_tag_buf(tags);
+ }
+ tags_by_pfn_unlock();
+
end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
xa_lock_irqsave(&tag_blocks_reserved, flags);
diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
index 2a43746b803f..e11495fa3c18 100644
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf)
kfree(buf);
}
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+static DEFINE_XARRAY(tags_by_pfn);
+
+void tags_by_pfn_lock(void)
+{
+ xa_lock(&tags_by_pfn);
+}
+
+void tags_by_pfn_unlock(void)
+{
+ xa_unlock(&tags_by_pfn);
+}
+
+void *mte_erase_tags_for_pfn(unsigned long pfn)
+{
+ return __xa_erase(&tags_by_pfn, pfn);
+}
+
+bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
+{
+ void *entry;
+ int ret;
+
+ ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
+ if (ret)
+ return true;
+
+ tags_by_pfn_lock();
+
+ if (page_tag_storage_reserved(pfn_to_page(pfn))) {
+ xa_release(&tags_by_pfn, pfn);
+ tags_by_pfn_unlock();
+ return false;
+ }
+
+ entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
+ if (xa_is_err(entry)) {
+ xa_release(&tags_by_pfn, pfn);
+ goto out_unlock;
+ } else if (entry) {
+ mte_free_tag_buf(entry);
+ }
+
+out_unlock:
+ tags_by_pfn_unlock();
+ return true;
+}
+
+void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
+{
+ struct page *page = pfn_to_page(start_pfn);
+ unsigned long pfn;
+ void *tags;
+
+ tags_by_pfn_lock();
+
+ for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
+ tags = mte_erase_tags_for_pfn(pfn);
+ if (unlikely(tags)) {
+ /*
+ * Mark the page as tagged so mte_sync_tags() doesn't
+ * clear the tags.
+ */
+ WARN_ON_ONCE(!try_page_mte_tagging(page));
+ mte_copy_page_tags_from_buf(page_address(page), tags);
+ set_page_mte_tagged(page);
+ mte_free_tag_buf(tags);
+ }
+ }
+
+ tags_by_pfn_unlock();
+}
+
+/*
+ * Note on locking: swap in/out is done with the folio locked, which eliminates
+ * races with mte_save/restore_page_tags_by_swp_entry.
+ */
+vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
+{
+ void *swap_tags, *pfn_tags;
+ bool saved;
+
+ /*
+ * mte_restore_page_tags_by_swp_entry() will take care of copying the
+ * tags over.
+ */
+ if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
+ return 0;
+
+ swap_tags = xa_load(&tags_by_swp_entry, entry.val);
+ if (!swap_tags)
+ return 0;
+
+ pfn_tags = mte_allocate_tag_buf();
+ if (!pfn_tags)
+ return VM_FAULT_OOM;
+
+ memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
+ saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
+ if (!saved)
+ mte_free_tag_buf(pfn_tags);
+
+ return 0;
+}
+#endif
+
int mte_save_page_tags_by_swp_entry(struct page *page)
{
void *tags, *ret;
@@ -54,6 +160,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
if (!tags)
return;
+ /* Tags will be restored when tag storage is reserved. */
+ if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
+ return;
+
if (try_page_mte_tagging(page)) {
mte_copy_page_tags_from_buf(page_address(page), tags);
set_page_mte_tagged(page);
--
2.43.0
copy_user_highpage() will do memory allocation if there are saved tags for
the destination page, and the page is missing tag storage.
After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and
rcu_read_unlock()s"), collapse_huge_page() calls
__collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock
held, which means that copy_user_highpage() can only allocate memory using
GFP_ATOMIC or equivalent.
Get around this by refusing to collapse pages into a transparent huge page
if the VMA is MTE-enabled.
Signed-off-by: Alexandru Elisei <[email protected]>
---
Changes since rfc v2:
* New patch. I think an agreement on whether copy*_user_highpage() should be
always allowed to sleep, or should not be allowed, would be useful.
arch/arm64/include/asm/pgtable.h | 3 +++
arch/arm64/kernel/mte_tag_storage.c | 5 +++++
include/linux/khugepaged.h | 5 +++++
mm/khugepaged.c | 4 ++++
4 files changed, 17 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 87ae59436162..d0473538c926 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask)
return true;
}
+bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address);
+#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate
+
#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
#endif /* CONFIG_ARM64_MTE */
diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
index ac7b9c9c585c..a99959b70573 100644
--- a/arch/arm64/kernel/mte_tag_storage.c
+++ b/arch/arm64/kernel/mte_tag_storage.c
@@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, gfp_t gfp)
if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
reserve_tag_storage(page, order, gfp);
}
+
+bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address)
+{
+ return !(vma->vm_flags & VM_MTE);
+}
diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index f68865e19b0b..461e4322dff2 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm)
if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
__khugepaged_exit(mm);
}
+
+#ifndef arch_hugepage_vma_revalidate
+#define arch_hugepage_vma_revalidate(vma, address) 1
+#endif
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
{
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 2b219acb528e..cb9a9ddb4d86 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
*/
if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
return SCAN_PAGE_ANON;
+
+ if (!arch_hugepage_vma_revalidate(vma, address))
+ return SCAN_VMA_CHECK;
+
return SCAN_SUCCEED;
}
--
2.43.0
On Thu, 25 Jan 2024 16:42:21 +0000
Alexandru Elisei <[email protected]> wrote:
> include/trace/events/cma.h | 59 ++
> include/trace/events/mmflags.h | 5 +-
I know others like being Cc'd on every patch in a series, but I'm not about
to trudge through 35 patches to review trace events, having no idea which
patch they are in.
-- Steve
On 25/01/2024 17:42, Alexandru Elisei wrote:
> Allow the kernel to get the base address, size, block size and associated
> memory node for tag storage from the device tree blob.
>
Please use scripts/get_maintainer.pl to get a list of necessary people
and lists to CC. The command, when run on an older kernel, might give you
outdated entries, so please be sure you base your patches on a recent
Linux kernel.
Tools like b4 or scripts/get_maintainer.pl provide the proper list of
people, so fix your workflow. Tools might also fail if you work on some
ancient tree (don't, use mainline), work on a fork of the kernel (don't,
use mainline) or ignore some maintainers (really don't). Just use b4 and
all these problems go away.
You missed at least the devicetree list (maybe more), so this won't be
tested by automated tooling. Performing review on untested code might be
a waste of time, so I will skip this patch entirely until you follow the
process that allows the patch to be tested.
Please kindly resend and include all necessary To/Cc entries.
> A tag storage region represents the smallest contiguous memory region that
> holds all the tags for the associated contiguous memory region which can be
> tagged. For example, for a 32GB contiguous tagged memory the corresponding
> tag storage region is exactly 1GB of contiguous memory, not two adjacent
> 512M of tag storage memory, nor one 2GB tag storage region.
>
> Tag storage is described as reserved memory; future patches will teach the
> kernel how to make use of it for data (non-tagged) allocations.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * Reworked from rfc v2 patch #11 ("arm64: mte: Reserve tag storage memory").
> * Added device tree schema (Rob Herring)
> * Tag storage memory is now described in the "reserved-memory" node (Rob
> Herring).
>
> .../reserved-memory/arm,mte-tag-storage.yaml | 78 +++++++++
Please run scripts/checkpatch.pl and fix reported warnings. Some
warnings can be ignored, but the code here looks like it needs a fix.
Feel free to get in touch if the warning is not clear.
Best regards,
Krzysztof
Hi Krzysztof,
On Fri, Jan 26, 2024 at 09:50:58AM +0100, Krzysztof Kozlowski wrote:
> On 25/01/2024 17:42, Alexandru Elisei wrote:
> > Allow the kernel to get the base address, size, block size and associated
> > memory node for tag storage from the device tree blob.
> >
>
> Please use scripts/get_maintainer.pl to get a list of necessary people
> and lists to CC. The command, when run on an older kernel, might give you
> outdated entries, so please be sure you base your patches on a recent
> Linux kernel.
>
> Tools like b4 or scripts/get_maintainer.pl provide the proper list of
> people, so fix your workflow. Tools might also fail if you work on some
> ancient tree (don't, use mainline), work on a fork of the kernel (don't,
> use mainline) or ignore some maintainers (really don't). Just use b4 and
> all these problems go away.
>
> You missed at least the devicetree list (maybe more), so this won't be
> tested by automated tooling. Performing review on untested code might be
> a waste of time, so I will skip this patch entirely until you follow the
> process that allows the patch to be tested.
>
> Please kindly resend and include all necessary To/Cc entries.
My mistake, the previous iteration of the series didn't include a
devicetree binding and I forgot to update the To/Cc list. Thank you for the
heads-up, hopefully you can have a look after I resend the series.
>
>
> > A tag storage region represents the smallest contiguous memory region that
> > holds all the tags for the associated contiguous memory region which can be
> > tagged. For example, for a 32GB contiguous tagged memory the corresponding
> > tag storage region is exactly 1GB of contiguous memory, not two adjacent
> > 512M of tag storage memory, nor one 2GB tag storage region.
> >
> > Tag storage is described as reserved memory; future patches will teach the
> > kernel how to make use of it for data (non-tagged) allocations.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * Reworked from rfc v2 patch #11 ("arm64: mte: Reserve tag storage memory").
> > * Added device tree schema (Rob Herring)
> > * Tag storage memory is now described in the "reserved-memory" node (Rob
> > Herring).
> >
> > .../reserved-memory/arm,mte-tag-storage.yaml | 78 +++++++++
>
> Please run scripts/checkpatch.pl and fix reported warnings. Some
> warnings can be ignored, but the code here looks like it needs a fix.
> Feel free to get in touch if the warning is not clear.
Thank you for pointing it out, I'll move the binding to a separate patch.
Alex
On Thu, Jan 25, 2024 at 8:43 AM Alexandru Elisei
<[email protected]> wrote:
>
> arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
> When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
> the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
> set in vma_alloc_zeroed_movable_folio().
>
> Expand this to be more generic by adding an arch hook that modifies the gfp
> flags for an allocation when the VMA is known.
>
> Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
> is also set; from that point of view, the current behaviour is unchanged,
> even though the arm64 flag is set in more places. When arm64 will have
> support to reuse the tag storage for data allocation, the uses of the
> __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
> to reserve the corresponding tag storage for the pages being allocated.
>
> The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by
> the caller; this has been done to keep an architecture from modifying the
> flags already set by the core memory management code; this is similar to
> how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been
> implemented. This can be revisited in the future if there's a need to do
> so.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
This patch also needs to update the non-CONFIG_NUMA definition of
vma_alloc_folio in include/linux/gfp.h to call arch_calc_vma_gfp. See:
https://r.android.com/2849146
Peter
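(For reference, the kind of update suggested above would look roughly like the
sketch below, assuming the !CONFIG_NUMA fallback of vma_alloc_folio() is the
usual wrapper around folio_alloc(); this is a hedged illustration, not a
verbatim upstream hunk.)

/*
 * Sketch: apply the arch hook in the !CONFIG_NUMA fallback as well, mirroring
 * the CONFIG_NUMA path, so that VMA-derived gfp bits (e.g. __GFP_TAGGED) are
 * not silently dropped.
 */
#define vma_alloc_folio(gfp, order, vma, addr, hugepage)	\
	folio_alloc((gfp) | arch_calc_vma_gfp(vma, gfp), order)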
On 1/25/24 22:12, Alexandru Elisei wrote:
> Extend the usefulness of arch_alloc_page() by adding the gfp_flags
> parameter.
Although the change here is harmless in itself, it will definitely benefit
from some additional context explaining the rationale, taking into account
why and how arch_alloc_page() got added particularly for the s390 platform and how
it's going to be used in the present proposal.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch.
>
> arch/s390/include/asm/page.h | 2 +-
> arch/s390/mm/page-states.c | 2 +-
> include/linux/gfp.h | 2 +-
> mm/page_alloc.c | 2 +-
> 4 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> index 73b9c3bf377f..859f0958c574 100644
> --- a/arch/s390/include/asm/page.h
> +++ b/arch/s390/include/asm/page.h
> @@ -163,7 +163,7 @@ static inline int page_reset_referenced(unsigned long addr)
>
> struct page;
> void arch_free_page(struct page *page, int order);
> -void arch_alloc_page(struct page *page, int order);
> +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags);
>
> static inline int devmem_is_allowed(unsigned long pfn)
> {
> diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
> index 01f9b39e65f5..b986c8b158e3 100644
> --- a/arch/s390/mm/page-states.c
> +++ b/arch/s390/mm/page-states.c
> @@ -21,7 +21,7 @@ void arch_free_page(struct page *page, int order)
> __set_page_unused(page_to_virt(page), 1UL << order);
> }
>
> -void arch_alloc_page(struct page *page, int order)
> +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags)
> {
> if (!cmma_flag)
> return;
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index de292a007138..9e8aa3d144db 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -172,7 +172,7 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> static inline void arch_free_page(struct page *page, int order) { }
> #endif
> #ifndef HAVE_ARCH_ALLOC_PAGE
> -static inline void arch_alloc_page(struct page *page, int order) { }
> +static inline void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags) { }
> #endif
>
> struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 150d4f23b010..2c140abe5ee6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1485,7 +1485,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> set_page_private(page, 0);
> set_page_refcounted(page);
>
> - arch_alloc_page(page, order);
> + arch_alloc_page(page, order, gfp_flags);
> debug_pagealloc_map_pages(page, 1 << order);
>
> /*
Otherwise LGTM.
On 1/25/24 22:12, Alexandru Elisei wrote:
> The arm64 MTE code uses the PG_arch_2 page flag, which it renames to
> PG_mte_tagged, to track if a page has been mapped with tagging enabled.
> That flag is cleared by free_pages_prepare() by doing:
>
> page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
>
> When tag storage management is added, tag storage will be reserved for a
> page if and only if the page is mapped as tagged (the page flag
> PG_mte_tagged is set). When a page is freed, likewise, the code will have
> to look at the page flags to determine if the page has tag storage
> reserved, which should also be freed.
>
> For this purpose, add an arch_free_pages_prepare() hook that is called
> before that page flags are cleared. The function arch_free_page() has also
> been considered for this purpose, but it is called after the flags are
> cleared.
arch_free_pages_prepare() makes sense as a prologue to arch_free_page().
s/arch_free_pages_prepare/arch_free_page_prepare to match similar functions.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * Expanded commit message (David Hildenbrand).
>
> include/linux/pgtable.h | 4 ++++
> mm/page_alloc.c | 1 +
> 2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f6d0e3513948..6d98d5fdd697 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -901,6 +901,10 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
> }
> #endif
>
> +#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
I guess new __HAVE_ARCH_ constructs are not being added lately. Instead
something like '#ifndef arch_free_pages_prepare' might be better suited.
> +static inline void arch_free_pages_prepare(struct page *page, int order) { }
> +#endif
> +
> #ifndef __HAVE_ARCH_UNMAP_ONE
> /*
> * Some architectures support metadata associated with a page. When a
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2c140abe5ee6..27282a1c82fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1092,6 +1092,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
> trace_mm_page_free(page, order);
> kmsan_free_page(page, order);
> + arch_free_pages_prepare(page, order);
>
> if (memcg_kmem_online() && PageMemcgKmem(page))
> __memcg_kmem_uncharge_page(page, order);
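(For reference, the '#ifndef arch_free_pages_prepare' style suggested above
would look roughly as follows; a sketch, not the posted patch.)

/*
 * Architectures that implement the hook also #define its name; the generic
 * fallback is keyed off that definition instead of a __HAVE_ARCH_* symbol.
 */
#ifndef arch_free_pages_prepare
static inline void arch_free_pages_prepare(struct page *page, int order) { }
#endif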
On 1/25/24 22:12, Alexandru Elisei wrote:
> As an architecture might have specific requirements around the allocation
> of CMA pages, add an arch hook that can disable allocations from
> MIGRATE_CMA, if the allocation was otherwise allowed.
>
> This will be used by arm64, which will put tag storage pages on the
> MIGRATE_CMA list, and tag storage pages cannot be tagged. The filter will
> be used to deny using MIGRATE_CMA for __GFP_TAGGED allocations.
Just wondering how allocation requests would be blocked for direct
alloc_contig_range() requests ?
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> include/linux/pgtable.h | 7 +++++++
> mm/page_alloc.c | 3 ++-
> 2 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 6d98d5fdd697..c5ddec6b5305 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -905,6 +905,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
> static inline void arch_free_pages_prepare(struct page *page, int order) { }
> #endif
>
> +#ifndef __HAVE_ARCH_ALLOC_CMA
Same as last patch i.e __HAVE_ARCH_ALLOC_CMA could be avoided via
a direct check on #ifndef arch_alloc_cma instead.
> +static inline bool arch_alloc_cma(gfp_t gfp)
> +{
> + return true;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_UNMAP_ONE
> /*
> * Some architectures support metadata associated with a page. When a
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 27282a1c82fe..a96d47a6393e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3157,7 +3157,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
> unsigned int alloc_flags)
> {
> #ifdef CONFIG_CMA
> - if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
> + arch_alloc_cma(gfp_mask))
> alloc_flags |= ALLOC_CMA;
> #endif
> return alloc_flags;
On 1/25/24 22:12, Alexandru Elisei wrote:
> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
> because CMA is always allowed when __GFP_MOVABLE is set.
>
> With the introduction of the arch_alloc_cma() function, the above is not
> true anymore, so bring back the filter.
This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being
assigned to alloc_flags in gfp_to_alloc_flags_cma().
>
> This is a partial revert because the stale comment remains removed.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> mm/page_alloc.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a96d47a6393e..0fa34bcfb1af 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2897,10 +2897,17 @@ struct page *rmqueue(struct zone *preferred_zone,
> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>
> if (likely(pcp_allowed_order(order))) {
> - page = rmqueue_pcplist(preferred_zone, zone, order,
> - migratetype, alloc_flags);
> - if (likely(page))
> - goto out;
> + /*
> + * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> + * we need to skip it when CMA area isn't allowed.
> + */
> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> + migratetype != MIGRATE_MOVABLE) {
> + page = rmqueue_pcplist(preferred_zone, zone, order,
> + migratetype, alloc_flags);
> + if (likely(page))
> + goto out;
> + }
> }
>
> page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
On 1/25/24 22:12, Alexandru Elisei wrote:
> cma->name is displayed in several CMA messages. When the name is generated
> by the CMA code, don't append a newline to avoid breaking the text across
> two lines.
An example of such mis-formatted CMA output from dmesg could be added
here in the commit message to demonstrate the problem better.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
Regardless, LGTM.
Reviewed-by: Anshuman Khandual <[email protected]>
>
> Changes since rfc v2:
>
> * New patch. This is a fix, and can be merged independently of the other
> patches.
Right, need not be part of this series. Hence please send it separately to
the MM list.
>
> mm/cma.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/cma.c b/mm/cma.c
> index 7c09c47e530b..f49c95f8ee37 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -204,7 +204,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> if (name)
> snprintf(cma->name, CMA_MAX_NAME, name);
> else
> - snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count);
> + snprintf(cma->name, CMA_MAX_NAME, "cma%d", cma_area_count);
>
> cma->base_pfn = PFN_DOWN(base);
> cma->count = size >> PAGE_SHIFT;
On 1/25/24 22:12, Alexandru Elisei wrote:
> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
> after each cma_alloc() function call. This is done even though cma_alloc()
> can allocate an arbitrary number of CMA pages. When looking at
> /proc/vmstat, the number of successful (or failed) cma_alloc() calls
> doesn't tell much with regards to how many CMA pages were allocated via
> cma_alloc() versus via the page allocator (regular allocation request or
> PCP lists refill).
>
> This can also be rather confusing to a user who isn't familiar with the
> code, since the unit of measurement for nr_free_cma is the number of pages,
> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
> function calls.
>
> Let's make this consistent, and arguably more useful, by having
> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
> allocate.
>
> For users that wish to track the number of cma_alloc() calls, there are
> tracepoints for that already implemented.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> mm/cma.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/cma.c b/mm/cma.c
> index f49c95f8ee37..dbf7fe8cb1bd 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> pr_debug("%s(): returned %p\n", __func__, page);
> out:
> if (page) {
> - count_vm_event(CMA_ALLOC_SUCCESS);
> + count_vm_events(CMA_ALLOC_SUCCESS, count);
> cma_sysfs_account_success_pages(cma, count);
> } else {
> - count_vm_event(CMA_ALLOC_FAIL);
> + count_vm_events(CMA_ALLOC_FAIL, count);
> if (cma)
> cma_sysfs_account_fail_pages(cma, count);
> }
Without getting into the merits of this patch - which is actually trying to
change the semantics of /proc/vmstat - I'm wondering how this is even related
to this particular series. If required, this could be debated on its own
separately.
On 1/25/24 22:12, Alexandru Elisei wrote:
> Similar to the two events that relate to CMA allocations, add the
> CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages
> are freed.
How is this going to be beneficial towards analyzing CMA alloc/release
behaviour - particularly with respect to this series? Or is this just being
added for parity with the CMA alloc side counters? Regardless, this CMA
change too could be discussed separately.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch.
>
> include/linux/vm_event_item.h | 2 ++
> mm/cma.c | 6 +++++-
> mm/vmstat.c | 2 ++
> 3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 747943bc8cc2..aba5c5bf8127 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -83,6 +83,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #ifdef CONFIG_CMA
> CMA_ALLOC_SUCCESS,
> CMA_ALLOC_FAIL,
> + CMA_RELEASE_SUCCESS,
> + CMA_RELEASE_FAIL,
> #endif
> UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
> UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
> diff --git a/mm/cma.c b/mm/cma.c
> index dbf7fe8cb1bd..543bb6b3be8e 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -562,8 +562,10 @@ bool cma_release(struct cma *cma, const struct page *pages,
> {
> unsigned long pfn;
>
> - if (!cma_pages_valid(cma, pages, count))
> + if (!cma_pages_valid(cma, pages, count)) {
> + count_vm_events(CMA_RELEASE_FAIL, count);
> return false;
> + }
>
> pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count);
>
> @@ -575,6 +577,8 @@ bool cma_release(struct cma *cma, const struct page *pages,
> cma_clear_bitmap(cma, pfn, count);
> trace_cma_release(cma->name, pfn, pages, count);
>
> + count_vm_events(CMA_RELEASE_SUCCESS, count);
> +
> return true;
> }
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index db79935e4a54..eebfd5c6c723 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1340,6 +1340,8 @@ const char * const vmstat_text[] = {
> #ifdef CONFIG_CMA
> "cma_alloc_success",
> "cma_alloc_fail",
> + "cma_release_success",
> + "cma_release_fail",
> #endif
> "unevictable_pgs_culled",
> "unevictable_pgs_scanned",
Hi,
On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote:
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Extend the usefulness of arch_alloc_page() by adding the gfp_flags
> > parameter.
>
> Although the change here is harmless in itself, it will definitely benefit
> from some additional context explaining the rationale, taking into account
> why and how arch_alloc_page() got added particularly for the s390 platform and how
> it's going to be used in the present proposal.
arm64 will use it to reserve tag storage if the caller requested a tagged
page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but
I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename
__GFP_ZEROTAGS to __GFP_TAGGED") [1].
[1] https://lore.kernel.org/lkml/[email protected]/
Thanks,
Alex
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch.
> >
> > arch/s390/include/asm/page.h | 2 +-
> > arch/s390/mm/page-states.c | 2 +-
> > include/linux/gfp.h | 2 +-
> > mm/page_alloc.c | 2 +-
> > 4 files changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/s390/include/asm/page.h b/arch/s390/include/asm/page.h
> > index 73b9c3bf377f..859f0958c574 100644
> > --- a/arch/s390/include/asm/page.h
> > +++ b/arch/s390/include/asm/page.h
> > @@ -163,7 +163,7 @@ static inline int page_reset_referenced(unsigned long addr)
> >
> > struct page;
> > void arch_free_page(struct page *page, int order);
> > -void arch_alloc_page(struct page *page, int order);
> > +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags);
> >
> > static inline int devmem_is_allowed(unsigned long pfn)
> > {
> > diff --git a/arch/s390/mm/page-states.c b/arch/s390/mm/page-states.c
> > index 01f9b39e65f5..b986c8b158e3 100644
> > --- a/arch/s390/mm/page-states.c
> > +++ b/arch/s390/mm/page-states.c
> > @@ -21,7 +21,7 @@ void arch_free_page(struct page *page, int order)
> > __set_page_unused(page_to_virt(page), 1UL << order);
> > }
> >
> > -void arch_alloc_page(struct page *page, int order)
> > +void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags)
> > {
> > if (!cmma_flag)
> > return;
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index de292a007138..9e8aa3d144db 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -172,7 +172,7 @@ static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
> > static inline void arch_free_page(struct page *page, int order) { }
> > #endif
> > #ifndef HAVE_ARCH_ALLOC_PAGE
> > -static inline void arch_alloc_page(struct page *page, int order) { }
> > +static inline void arch_alloc_page(struct page *page, int order, gfp_t gfp_flags) { }
> > #endif
> >
> > struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 150d4f23b010..2c140abe5ee6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1485,7 +1485,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> > set_page_private(page, 0);
> > set_page_refcounted(page);
> >
> > - arch_alloc_page(page, order);
> > + arch_alloc_page(page, order, gfp_flags);
> > debug_pagealloc_map_pages(page, 1 << order);
> >
> > /*
>
> Otherwise LGTM.
Hi,
On Mon, Jan 29, 2024 at 01:49:44PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > The arm64 MTE code uses the PG_arch_2 page flag, which it renames to
> > PG_mte_tagged, to track if a page has been mapped with tagging enabled.
> > That flag is cleared by free_pages_prepare() by doing:
> >
> > page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >
> > When tag storage management is added, tag storage will be reserved for a
> > page if and only if the page is mapped as tagged (the page flag
> > PG_mte_tagged is set). When a page is freed, likewise, the code will have
> > to look at the page flags to determine if the page has tag storage
> > reserved, which should also be freed.
> >
> > For this purpose, add an arch_free_pages_prepare() hook that is called
> > before that page flags are cleared. The function arch_free_page() has also
> > been considered for this purpose, but it is called after the flags are
> > cleared.
>
> arch_free_pages_prepare() makes sense as a prologue to arch_free_page().
Thanks!
>
> s/arch_free_pages_prepare/arch_free_page_prepare to match similar functions.
The function free_pages_prepare() calls the function arch_free_pages_prepare().
I find that consistent, and it makes it easy to identify from where
arch_free_pages_prepare() is called.
Thanks,
Alex
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * Expanded commit message (David Hildenbrand).
> >
> > include/linux/pgtable.h | 4 ++++
> > mm/page_alloc.c | 1 +
> > 2 files changed, 5 insertions(+)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index f6d0e3513948..6d98d5fdd697 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -901,6 +901,10 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
> > }
> > #endif
> >
> > +#ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
>
> I guess new __HAVE_ARCH_ constructs are not being added lately. Instead
> something like '#ifndef arch_free_pages_prepare' might be better suited.
>
> > +static inline void arch_free_pages_prepare(struct page *page, int order) { }
> > +#endif
> > +
> > #ifndef __HAVE_ARCH_UNMAP_ONE
> > /*
> > * Some architectures support metadata associated with a page. When a
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 2c140abe5ee6..27282a1c82fe 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1092,6 +1092,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >
> > trace_mm_page_free(page, order);
> > kmsan_free_page(page, order);
> > + arch_free_pages_prepare(page, order);
> >
> > if (memcg_kmem_online() && PageMemcgKmem(page))
> > __memcg_kmem_uncharge_page(page, order);
Hi,
On Mon, Jan 29, 2024 at 02:14:16PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > As an architecture might have specific requirements around the allocation
> > of CMA pages, add an arch hook that can disable allocations from
> > MIGRATE_CMA, if the allocation was otherwise allowed.
> >
> > This will be used by arm64, which will put tag storage pages on the
> > MIGRATE_CMA list, and tag storage pages cannot be tagged. The filter will
> > be used to deny using MIGRATE_CMA for __GFP_TAGGED allocations.
>
> Just wondering how allocation requests would be blocked for direct
> alloc_contig_range() requests ?
alloc_contig_range() does page allocation in __alloc_contig_migrate_range()
-> alloc_migration_target(); __alloc_contig_migrate_range() ignores the
gfp_mask parameter passed to alloc_contig_range() when building struct
migration_target_control, even though it's available in the struct
compact_control argument. That looks like a bug to me, as the description
for the gfp_mask parameter says: "GFP mask to use during compaction".
Regardless, when tag storage page T1 is migrated so that it can be used to
store tags, it doesn't matter if it is replaced by another tag storage
page T2 or by a regular page, as long as the replacement isn't also tagged. If
the replacement is also tagged, the code to reserve tag storage would
recurse and deadlock. See patch #16 ("KVM: arm64: Don't deny VM_PFNMAP VMAs
when kvm_has_mte()") [1] for the code.
Does that make sense?
[1] https://lore.kernel.org/linux-mm/[email protected]/
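(For context, the defensive gfp handling referred to above, quoted in full in
the reserve_tag_storage() hunk later in this thread, boils down to the
following.)

/*
 * Strip __GFP_TAGGED so that, should __alloc_contig_migrate_range() ever
 * start honouring the gfp mask, allocating the migration target cannot
 * recurse into another tag storage reservation.
 */
gfp &= ~__GFP_TAGGED;
if (!(gfp & __GFP_NORETRY))
	gfp |= __GFP_RETRY_MAYFAIL;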
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > include/linux/pgtable.h | 7 +++++++
> > mm/page_alloc.c | 3 ++-
> > 2 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 6d98d5fdd697..c5ddec6b5305 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -905,6 +905,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
> > static inline void arch_free_pages_prepare(struct page *page, int order) { }
> > #endif
> >
> > +#ifndef __HAVE_ARCH_ALLOC_CMA
>
> Same as last patch i.e __HAVE_ARCH_ALLOC_CMA could be avoided via
> a direct check on #ifndef arch_alloc_cma instead.
include/linux/pgtable.h uses __HAVE_ARCH_*, and I would rather keep it
consistent.
Thanks,
Alex
>
> > +static inline bool arch_alloc_cma(gfp_t gfp)
> > +{
> > + return true;
> > +}
> > +#endif
> > +
> > #ifndef __HAVE_ARCH_UNMAP_ONE
> > /*
> > * Some architectures support metadata associated with a page. When a
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 27282a1c82fe..a96d47a6393e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3157,7 +3157,8 @@ static inline unsigned int gfp_to_alloc_flags_cma(gfp_t gfp_mask,
> > unsigned int alloc_flags)
> > {
> > #ifdef CONFIG_CMA
> > - if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
> > + if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE &&
> > + arch_alloc_cma(gfp_mask))
> > alloc_flags |= ALLOC_CMA;
> > #endif
> > return alloc_flags;
Hi,
On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
> > removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
> > because CMA is always allowed when __GFP_MOVABLE is set.
> >
> > With the introduction of the arch_alloc_cma() function, the above is not
> > true anymore, so bring back the filter.
>
> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being
> assigned to alloc_flags in gfp_to_alloc_flags_cma().
Can I add your Reviewed-by tag then?
Thanks,
Alex
>
> >
> > This is a partial revert because the stale comment remains removed.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > mm/page_alloc.c | 15 +++++++++++----
> > 1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a96d47a6393e..0fa34bcfb1af 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2897,10 +2897,17 @@ struct page *rmqueue(struct zone *preferred_zone,
> > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >
> > if (likely(pcp_allowed_order(order))) {
> > - page = rmqueue_pcplist(preferred_zone, zone, order,
> > - migratetype, alloc_flags);
> > - if (likely(page))
> > - goto out;
> > + /*
> > + * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
> > + * we need to skip it when CMA area isn't allowed.
> > + */
> > + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
> > + migratetype != MIGRATE_MOVABLE) {
> > + page = rmqueue_pcplist(preferred_zone, zone, order,
> > + migratetype, alloc_flags);
> > + if (likely(page))
> > + goto out;
> > + }
> > }
> >
> > page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
Hi,
On Mon, Jan 29, 2024 at 02:43:08PM +0530, Anshuman Khandual wrote:
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > cma->name is displayed in several CMA messages. When the name is generated
> > by the CMA code, don't append a newline to avoid breaking the text across
> > two lines.
>
> An example of such mis-formatted CMA output from dmesg could be added
> here in the commit message to demonstrate the problem better.
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
>
> Regardless, LGTM.
>
> Reviewed-by: Anshuman Khandual <[email protected]>
Thanks!
>
> >
> > Changes since rfc v2:
> >
> > * New patch. This is a fix, and can be merged independently of the other
> > patches.
>
> Right, need not be part of this series. Hence please send it separately to
> the MM list.
Will do!
Alex
>
> >
> > mm/cma.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 7c09c47e530b..f49c95f8ee37 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -204,7 +204,7 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> > if (name)
> > snprintf(cma->name, CMA_MAX_NAME, name);
> > else
> > - snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count);
> > + snprintf(cma->name, CMA_MAX_NAME, "cma%d", cma_area_count);
> >
> > cma->base_pfn = PFN_DOWN(base);
> > cma->count = size >> PAGE_SHIFT;
Hi,
On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
> > after each cma_alloc() function call. This is done even though cma_alloc()
> > can allocate an arbitrary number of CMA pages. When looking at
> > /proc/vmstat, the number of successful (or failed) cma_alloc() calls
> > doesn't tell much with regards to how many CMA pages were allocated via
> > cma_alloc() versus via the page allocator (regular allocation request or
> > PCP lists refill).
> >
> > This can also be rather confusing to a user who isn't familiar with the
> > code, since the unit of measurement for nr_free_cma is the number of pages,
> > but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
> > function calls.
> >
> > Let's make this consistent, and arguably more useful, by having
> > CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
> > CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
> > allocate.
> >
> > For users that wish to track the number of cma_alloc() calls, there are
> > tracepoints for that already implemented.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > mm/cma.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/cma.c b/mm/cma.c
> > index f49c95f8ee37..dbf7fe8cb1bd 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> > pr_debug("%s(): returned %p\n", __func__, page);
> > out:
> > if (page) {
> > - count_vm_event(CMA_ALLOC_SUCCESS);
> > + count_vm_events(CMA_ALLOC_SUCCESS, count);
> > cma_sysfs_account_success_pages(cma, count);
> > } else {
> > - count_vm_event(CMA_ALLOC_FAIL);
> > + count_vm_events(CMA_ALLOC_FAIL, count);
> > if (cma)
> > cma_sysfs_account_fail_pages(cma, count);
> > }
>
> Without getting into the merits of this patch - which is actually trying to
> change the semantics of /proc/vmstat - I'm wondering how this is even related
> to this particular series. If required, this could be debated on its own
> separately.
Having the number of CMA pages allocated and the number of CMA pages freed
allows someone to infer how many tagged pages are in use at a given time:
(allocated CMA pages - CMA pages allocated by drivers* - CMA pages
released) * 32. That is valuable information for software and hardware
designers.
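As a worked example, assuming 4K pages: if cma_alloc_success minus
cma_release_success is 8192 tag storage pages (32MB), and no driver made CMA
allocations in that interval, that corresponds to 8192 * 32 = 262144 tagged
pages, i.e. 1GB of tagged memory in use.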
Besides that, for every iteration of the series, this has proven invaluable
for discovering bugs with freeing and/or reserving tag storage pages.
*that would require userspace reading cma_alloc_success and
cma_release_success before any tagged allocations are performed.
Thanks,
Alex
Hi,
On Mon, Jan 29, 2024 at 03:01:24PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Similar to the two events that relate to CMA allocations, add the
> > CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages
> > are freed.
>
> How is this going to be beneficial towards analyzing CMA alloc/release
> behaviour - particularly with respect to this series? Or is this just being
> added for parity with the CMA alloc side counters? Regardless, this CMA
> change too could be discussed separately.
Added for parity and because it's useful for this series (see my reply to
the previous patch where I discuss how I've used the counters).
Thanks,
Alex
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch.
> >
> > include/linux/vm_event_item.h | 2 ++
> > mm/cma.c | 6 +++++-
> > mm/vmstat.c | 2 ++
> > 3 files changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> > index 747943bc8cc2..aba5c5bf8127 100644
> > --- a/include/linux/vm_event_item.h
> > +++ b/include/linux/vm_event_item.h
> > @@ -83,6 +83,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > #ifdef CONFIG_CMA
> > CMA_ALLOC_SUCCESS,
> > CMA_ALLOC_FAIL,
> > + CMA_RELEASE_SUCCESS,
> > + CMA_RELEASE_FAIL,
> > #endif
> > UNEVICTABLE_PGCULLED, /* culled to noreclaim list */
> > UNEVICTABLE_PGSCANNED, /* scanned for reclaimability */
> > diff --git a/mm/cma.c b/mm/cma.c
> > index dbf7fe8cb1bd..543bb6b3be8e 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -562,8 +562,10 @@ bool cma_release(struct cma *cma, const struct page *pages,
> > {
> > unsigned long pfn;
> >
> > - if (!cma_pages_valid(cma, pages, count))
> > + if (!cma_pages_valid(cma, pages, count)) {
> > + count_vm_events(CMA_RELEASE_FAIL, count);
> > return false;
> > + }
> >
> > pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count);
> >
> > @@ -575,6 +577,8 @@ bool cma_release(struct cma *cma, const struct page *pages,
> > cma_clear_bitmap(cma, pfn, count);
> > trace_cma_release(cma->name, pfn, pages, count);
> >
> > + count_vm_events(CMA_RELEASE_SUCCESS, count);
> > +
> > return true;
> > }
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index db79935e4a54..eebfd5c6c723 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -1340,6 +1340,8 @@ const char * const vmstat_text[] = {
> > #ifdef CONFIG_CMA
> > "cma_alloc_success",
> > "cma_alloc_fail",
> > + "cma_release_success",
> > + "cma_release_fail",
> > #endif
> > "unevictable_pgs_culled",
> > "unevictable_pgs_scanned",
Hi Peter,
On Fri, Jan 26, 2024 at 12:00:36PM -0800, Peter Collingbourne wrote:
> On Thu, Jan 25, 2024 at 8:43 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
> > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
> > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
> > set in vma_alloc_zeroed_movable_folio().
> >
> > Expand this to be more generic by adding an arch hook that modifies the gfp
> > flags for an allocation when the VMA is known.
> >
> > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
> > is also set; from that point of view, the current behaviour is unchanged,
> > even though the arm64 flag is set in more places. When arm64 will have
> > support to reuse the tag storage for data allocation, the uses of the
> > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
> > to reserve the corresponding tag storage for the pages being allocated.
> >
> > The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by
> > the caller; this has been done to keep an architecture from modifying the
> > flags already set by the core memory management code; this is similar to
> > how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been
> > implemented. This can be revisited in the future if there's a need to do
> > so.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
>
> This patch also needs to update the non-CONFIG_NUMA definition of
> vma_alloc_folio in include/linux/gfp.h to call arch_calc_vma_gfp. See:
> https://r.android.com/2849146
Of course, you've already reported this to me; I cherry-picked the version of
the patch that doesn't have the fix for this series.
Will fix.
Thanks,
Alex
>
> Peter
On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
<[email protected]> wrote:
>
> Reserve tag storage for a page that is being allocated as tagged. This
> is a best effort approach, and failing to reserve tag storage is
> allowed.
>
> When all the associated tagged pages have been freed, return the tag
> storage pages back to the page allocator, where they can be used again for
> data allocations.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page
> allocation").
> * Fixed calculation of the number of associated tag storage blocks (Hyesoo
> Yu).
> * Tag storage is reserved in arch_alloc_page() instead of
> arch_prep_new_page().
>
> arch/arm64/include/asm/mte.h | 16 +-
> arch/arm64/include/asm/mte_tag_storage.h | 31 +++
> arch/arm64/include/asm/page.h | 5 +
> arch/arm64/include/asm/pgtable.h | 19 ++
> arch/arm64/kernel/mte_tag_storage.c | 234 +++++++++++++++++++++++
> arch/arm64/mm/fault.c | 7 +
> fs/proc/page.c | 1 +
> include/linux/kernel-page-flags.h | 1 +
> include/linux/page-flags.h | 1 +
> include/trace/events/mmflags.h | 3 +-
> mm/huge_memory.c | 1 +
> 11 files changed, 316 insertions(+), 3 deletions(-)
>
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 8034695b3dd7..6457b7899207 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
> #ifdef CONFIG_ARM64_MTE
>
> /* track which pages have valid allocation tags */
> -#define PG_mte_tagged PG_arch_2
> +#define PG_mte_tagged PG_arch_2
> /* simple lock to avoid multiple threads tagging the same page */
> -#define PG_mte_lock PG_arch_3
> +#define PG_mte_lock PG_arch_3
> +/* Track if a tagged page has tag storage reserved */
> +#define PG_tag_storage_reserved PG_arch_4
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +extern bool page_tag_storage_reserved(struct page *page);
> +#endif
>
> static inline void set_page_mte_tagged(struct page *page)
> {
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> + /* Open code mte_tag_storage_enabled() */
> + WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
> + !page_tag_storage_reserved(page));
> +#endif
> /*
> * Ensure that the tags written prior to this function are visible
> * before the page flags update.
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index 7b3f6bff8e6f..09f1318d924e 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -5,6 +5,12 @@
> #ifndef __ASM_MTE_TAG_STORAGE_H
> #define __ASM_MTE_TAG_STORAGE_H
>
> +#ifndef __ASSEMBLY__
> +
> +#include <linux/mm_types.h>
> +
> +#include <asm/mte.h>
> +
> #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
>
> DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> @@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void)
> }
>
> void mte_init_tag_storage(void);
> +
> +static inline bool alloc_requires_tag_storage(gfp_t gfp)
> +{
> + return gfp & __GFP_TAGGED;
> +}
> +int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> +void free_tag_storage(struct page *page, int order);
> +
> +bool page_tag_storage_reserved(struct page *page);
> #else
> static inline bool tag_storage_enabled(void)
> {
> @@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void)
> static inline void mte_init_tag_storage(void)
> {
> }
> +static inline bool alloc_requires_tag_storage(struct page *page)
This function should take a gfp_t to match the
CONFIG_ARM64_MTE_TAG_STORAGE case.
Peter
> +{
> + return false;
> +}
> +static inline int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> +{
> + return 0;
> +}
> +static inline void free_tag_storage(struct page *page, int order)
> +{
> +}
> +static inline bool page_tag_storage_reserved(struct page *page)
> +{
> + return true;
> +}
> #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
>
> +#endif /* !__ASSEMBLY__ */
> #endif /* __ASM_MTE_TAG_STORAGE_H */
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 88bab032a493..3a656492f34a 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -35,6 +35,11 @@ void copy_highpage(struct page *to, struct page *from);
> void tag_clear_highpage(struct page *to);
> #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
>
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +void arch_alloc_page(struct page *, int order, gfp_t gfp);
> +#define HAVE_ARCH_ALLOC_PAGE
> +#endif
> +
> #define clear_user_page(page, vaddr, pg) clear_page(page)
> #define copy_user_page(to, from, vaddr, pg) copy_page(to, from)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 2499cc4fa4f2..f30466199a9b 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -10,6 +10,7 @@
>
> #include <asm/memory.h>
> #include <asm/mte.h>
> +#include <asm/mte_tag_storage.h>
> #include <asm/pgtable-hwdef.h>
> #include <asm/pgtable-prot.h>
> #include <asm/tlbflush.h>
> @@ -1069,6 +1070,24 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> mte_restore_page_tags_by_swp_entry(entry, &folio->page);
> }
>
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +
> +#define __HAVE_ARCH_FREE_PAGES_PREPARE
> +static inline void arch_free_pages_prepare(struct page *page, int order)
> +{
> + if (tag_storage_enabled() && page_mte_tagged(page))
> + free_tag_storage(page, order);
> +}
> +
> +#define __HAVE_ARCH_ALLOC_CMA
> +static inline bool arch_alloc_cma(gfp_t gfp_mask)
> +{
> + if (tag_storage_enabled() && alloc_requires_tag_storage(gfp_mask))
> + return false;
> + return true;
> +}
> +
> +#endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> #endif /* CONFIG_ARM64_MTE */
>
> #define __HAVE_ARCH_CALC_VMA_GFP
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index d58c68b4a849..762c7c803a70 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -34,6 +34,31 @@ struct tag_region {
> static struct tag_region tag_regions[MAX_TAG_REGIONS];
> static int num_tag_regions;
>
> +/*
> + * A note on locking. Reserving tag storage takes the tag_blocks_lock mutex,
> + * because alloc_contig_range() might sleep.
> + *
> + * Freeing tag storage takes the xa_lock spinlock with interrupts disabled
> + * because pages can be freed from non-preemptible contexts, including from an
> + * interrupt handler.
> + *
> + * Because tag storage can be freed from interrupt contexts, the xarray is
> + * defined with the XA_FLAGS_LOCK_IRQ flag to disable interrupts when calling
> + * xa_store(). This is done to prevent a deadlock with free_tag_storage() being
> + * called from an interrupt raised before xa_store() releases the xa_lock.
> + *
> + * All of the above means that reserve_tag_storage() cannot run concurrently
> + * with itself (no concurrent insertions), but it can run at the same time as
> + * free_tag_storage(). The first thing that reserve_tag_storage() does after
> + * taking the mutex is increase the refcount on all present tag storage blocks
> + * with the xa_lock held, to serialize against freeing the blocks. This is an
> + * optimization to avoid taking and releasing the xa_lock after each iteration
> + * if the refcount operation was moved inside the loop, where it would have had
> + * to be executed for each block.
> + */
> +static DEFINE_XARRAY_FLAGS(tag_blocks_reserved, XA_FLAGS_LOCK_IRQ);
> +static DEFINE_MUTEX(tag_blocks_lock);
> +
> static u32 __init get_block_size_pages(u32 block_size_bytes)
> {
> u32 a = PAGE_SIZE;
> @@ -364,3 +389,212 @@ static int __init mte_enable_tag_storage(void)
> return -EINVAL;
> }
> arch_initcall(mte_enable_tag_storage);
> +
> +static void page_set_tag_storage_reserved(struct page *page, int order)
> +{
> + int i;
> +
> + for (i = 0; i < (1 << order); i++)
> + set_bit(PG_tag_storage_reserved, &(page + i)->flags);
> +}
> +
> +static void block_ref_add(unsigned long block, struct tag_region *region, int order)
> +{
> + int count;
> +
> + count = min(1u << order, 32 * region->block_size_pages);
> + page_ref_add(pfn_to_page(block), count);
> +}
> +
> +static int block_ref_sub_return(unsigned long block, struct tag_region *region, int order)
> +{
> + int count;
> +
> + count = min(1u << order, 32 * region->block_size_pages);
> + return page_ref_sub_return(pfn_to_page(block), count);
> +}
> +
> +static bool tag_storage_block_is_reserved(unsigned long block)
> +{
> + return xa_load(&tag_blocks_reserved, block) != NULL;
> +}
> +
> +static int tag_storage_reserve_block(unsigned long block, struct tag_region *region, int order)
> +{
> + int ret;
> +
> + ret = xa_err(xa_store(&tag_blocks_reserved, block, pfn_to_page(block), GFP_KERNEL));
> + if (!ret)
> + block_ref_add(block, region, order);
> +
> + return ret;
> +}
> +
> +static int order_to_num_blocks(int order, u32 block_size_pages)
> +{
> + int num_tag_storage_pages = max((1 << order) / 32, 1);
> +
> + return DIV_ROUND_UP(num_tag_storage_pages, block_size_pages);
> +}
> +
> +static int tag_storage_find_block_in_region(struct page *page, unsigned long *blockp,
> + struct tag_region *region)
> +{
> + struct range *tag_range = &region->tag_range;
> + struct range *mem_range = &region->mem_range;
> + u64 page_pfn = page_to_pfn(page);
> + u64 block, block_offset;
> +
> + if (!(mem_range->start <= page_pfn && page_pfn <= mem_range->end))
> + return -ERANGE;
> +
> + block_offset = (page_pfn - mem_range->start) / 32;
> + block = tag_range->start + rounddown(block_offset, region->block_size_pages);
> +
> + if (block + region->block_size_pages - 1 > tag_range->end) {
> + pr_err("Block 0x%llx-0x%llx is outside tag region 0x%llx-0x%llx\n",
> + PFN_PHYS(block), PFN_PHYS(block + region->block_size_pages + 1) - 1,
> + PFN_PHYS(tag_range->start), PFN_PHYS(tag_range->end + 1) - 1);
> + return -ERANGE;
> + }
> + *blockp = block;
> +
> + return 0;
> +
> +}
> +
> +static int tag_storage_find_block(struct page *page, unsigned long *block,
> + struct tag_region **region)
> +{
> + int i, ret;
> +
> + for (i = 0; i < num_tag_regions; i++) {
> + ret = tag_storage_find_block_in_region(page, block, &tag_regions[i]);
> + if (ret == 0) {
> + *region = &tag_regions[i];
> + return 0;
> + }
> + }
> +
> + return -EINVAL;
> +}
> +
> +bool page_tag_storage_reserved(struct page *page)
> +{
> + return test_bit(PG_tag_storage_reserved, &page->flags);
> +}
> +
> +int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> +{
> + unsigned long start_block, end_block;
> + struct tag_region *region;
> + unsigned long block;
> + unsigned long flags;
> + int ret = 0;
> +
> + VM_WARN_ON_ONCE(!preemptible());
> +
> + if (page_tag_storage_reserved(page))
> + return 0;
> +
> + /*
> + * __alloc_contig_migrate_range() ignores gfp when allocating the
> + * destination page for migration. Regardless, massage gfp flags and
> + * remove __GFP_TAGGED to avoid recursion in case gfp stops being
> + * ignored.
> + */
> + gfp &= ~__GFP_TAGGED;
> + if (!(gfp & __GFP_NORETRY))
> + gfp |= __GFP_RETRY_MAYFAIL;
> +
> + ret = tag_storage_find_block(page, &start_block, &region);
> + if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> + return -EINVAL;
> + end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
> +
> + mutex_lock(&tag_blocks_lock);
> +
> + /* Check again, this time with the lock held. */
> + if (page_tag_storage_reserved(page))
> + goto out_unlock;
> +
> + /* Make sure existing entries are not freed out from under our feet. */
> + xa_lock_irqsave(&tag_blocks_reserved, flags);
> + for (block = start_block; block < end_block; block += region->block_size_pages) {
> + if (tag_storage_block_is_reserved(block))
> + block_ref_add(block, region, order);
> + }
> + xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> +
> + for (block = start_block; block < end_block; block += region->block_size_pages) {
> + /* Refcount incremented above. */
> + if (tag_storage_block_is_reserved(block))
> + continue;
> +
> + ret = cma_alloc_range(region->cma, block, region->block_size_pages, 3, gfp);
> + /* Should never happen. */
> + VM_WARN_ON_ONCE(ret == -EEXIST);
> + if (ret)
> + goto out_error;
> +
> + ret = tag_storage_reserve_block(block, region, order);
> + if (ret) {
> + cma_release(region->cma, pfn_to_page(block), region->block_size_pages);
> + goto out_error;
> + }
> + }
> +
> + page_set_tag_storage_reserved(page, order);
> +out_unlock:
> + mutex_unlock(&tag_blocks_lock);
> +
> + return 0;
> +
> +out_error:
> + xa_lock_irqsave(&tag_blocks_reserved, flags);
> + for (block = start_block; block < end_block; block += region->block_size_pages) {
> + if (tag_storage_block_is_reserved(block) &&
> + block_ref_sub_return(block, region, order) == 1) {
> + __xa_erase(&tag_blocks_reserved, block);
> + cma_release(region->cma, pfn_to_page(block), region->block_size_pages);
> + }
> + }
> + xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> +
> + mutex_unlock(&tag_blocks_lock);
> +
> + return ret;
> +}
> +
> +void free_tag_storage(struct page *page, int order)
> +{
> + unsigned long block, start_block, end_block;
> + struct tag_region *region;
> + unsigned long flags;
> + int ret;
> +
> + ret = tag_storage_find_block(page, &start_block, &region);
> + if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> + return;
> +
> + end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
> +
> + xa_lock_irqsave(&tag_blocks_reserved, flags);
> + for (block = start_block; block < end_block; block += region->block_size_pages) {
> + if (WARN_ONCE(!tag_storage_block_is_reserved(block),
> + "Block 0x%lx is not reserved for pfn 0x%lx", block, page_to_pfn(page)))
> + continue;
> +
> + if (block_ref_sub_return(block, region, order) == 1) {
> + __xa_erase(&tag_blocks_reserved, block);
> + cma_release(region->cma, pfn_to_page(block), region->block_size_pages);
> + }
> + }
> + xa_unlock_irqrestore(&tag_blocks_reserved, flags);
> +}
> +
> +void arch_alloc_page(struct page *page, int order, gfp_t gfp)
> +{
> + if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
> + reserve_tag_storage(page, order, gfp);
> +}
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index c022e473c17c..1ffaeccecda2 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -37,6 +37,7 @@
> #include <asm/esr.h>
> #include <asm/kprobes.h>
> #include <asm/mte.h>
> +#include <asm/mte_tag_storage.h>
> #include <asm/processor.h>
> #include <asm/sysreg.h>
> #include <asm/system_misc.h>
> @@ -950,6 +951,12 @@ gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
>
> void tag_clear_highpage(struct page *page)
> {
> + if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
> + /* Don't zero the tags if tag storage is not reserved */
> + clear_page(page_address(page));
> + return;
> + }
> +
> /* Newly allocated page, shouldn't have been tagged yet */
> WARN_ON_ONCE(!try_page_mte_tagging(page));
> mte_zero_clear_page_tags(page_address(page));
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 195b077c0fac..e7eb584a9234 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -221,6 +221,7 @@ u64 stable_page_flags(struct page *page)
> #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> u |= kpf_copy_bit(k, KPF_ARCH_2, PG_arch_2);
> u |= kpf_copy_bit(k, KPF_ARCH_3, PG_arch_3);
> + u |= kpf_copy_bit(k, KPF_ARCH_4, PG_arch_4);
> #endif
>
> return u;
> diff --git a/include/linux/kernel-page-flags.h b/include/linux/kernel-page-flags.h
> index 859f4b0c1b2b..4a0d719ffdd4 100644
> --- a/include/linux/kernel-page-flags.h
> +++ b/include/linux/kernel-page-flags.h
> @@ -19,5 +19,6 @@
> #define KPF_SOFTDIRTY 40
> #define KPF_ARCH_2 41
> #define KPF_ARCH_3 42
> +#define KPF_ARCH_4 43
>
> #endif /* LINUX_KERNEL_PAGE_FLAGS_H */
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index b7237bce7446..03f03e6d735e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -135,6 +135,7 @@ enum pageflags {
> #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> PG_arch_2,
> PG_arch_3,
> + PG_arch_4,
> #endif
> __NR_PAGEFLAGS,
>
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 6ca0d5ed46c0..ba962fd10a2c 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -125,7 +125,8 @@ IF_HAVE_PG_HWPOISON(hwpoison) \
> IF_HAVE_PG_IDLE(idle) \
> IF_HAVE_PG_IDLE(young) \
> IF_HAVE_PG_ARCH_X(arch_2) \
> -IF_HAVE_PG_ARCH_X(arch_3)
> +IF_HAVE_PG_ARCH_X(arch_3) \
> +IF_HAVE_PG_ARCH_X(arch_4)
>
> #define show_page_flags(flags) \
> (flags) ? __print_flags(flags, "|", \
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2bad63a7ec16..47932539cc50 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2804,6 +2804,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
> #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> (1L << PG_arch_2) |
> (1L << PG_arch_3) |
> + (1L << PG_arch_4) |
> #endif
> (1L << PG_dirty) |
> LRU_GEN_MASK | LRU_REFS_MASK));
> --
> 2.43.0
>
On 1/29/24 17:11, Alexandru Elisei wrote:
> Hi,
>
> On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote:
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> Extend the usefulness of arch_alloc_page() by adding the gfp_flags
>>> parameter.
>> Although the change here is harmless in itself, it will definitely benefit
>> from some additional context explaining the rationale, taking into account
>> why-how arch_alloc_page() got added particularly for s390 platform and how
>> it's going to be used in the present proposal.
> arm64 will use it to reserve tag storage if the caller requested a tagged
> page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but
> I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename
> __GFP_ZEROTAGS to __GFP_TAGGED") [1].
>
> [1] https://lore.kernel.org/lkml/[email protected]/
Makes sense, but please do update the commit message explaining how the
new gfp mask argument will be used to detect tagged page allocation
requests, which in turn require tag storage to be reserved.
On 1/29/24 17:16, Alexandru Elisei wrote:
> Hi,
>
> On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
>>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
>>> because CMA is always allowed when __GFP_MOVABLE is set.
>>>
>>> With the introduction of the arch_alloc_cma() function, the above is not
>>> true anymore, so bring back the filter.
>>
>> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being
>> assigned to alloc_flags in gfp_to_alloc_flags_cma().
>
> Can I add your Reviewed-by tag then?
I think all these changes need to be reviewed in their entirety
even though some patches do look good on their own. For example
this patch depends on whether [PATCH 03/35] is acceptable or not.
I would suggest separating out CMA patches which could be debated
and merged regardless of this series.
On 1/29/24 17:21, Alexandru Elisei wrote:
> Hi,
>
> On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
>>
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
>>> after each cma_alloc() function call. This is done even though cma_alloc()
>>> can allocate an arbitrary number of CMA pages. When looking at
>>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls
>>> doesn't tell much with regards to how many CMA pages were allocated via
>>> cma_alloc() versus via the page allocator (regular allocation request or
>>> PCP lists refill).
>>>
>>> This can also be rather confusing to a user who isn't familiar with the
>>> code, since the unit of measurement for nr_free_cma is the number of pages,
>>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
>>> function calls.
>>>
>>> Let's make this consistent, and arguably more useful, by having
>>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
>>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
>>> allocate.
>>>
>>> For users that wish to track the number of cma_alloc() calls, there are
>>> tracepoints for that already implemented.
>>>
>>> Signed-off-by: Alexandru Elisei <[email protected]>
>>> ---
>>> mm/cma.c | 4 ++--
>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/cma.c b/mm/cma.c
>>> index f49c95f8ee37..dbf7fe8cb1bd 100644
>>> --- a/mm/cma.c
>>> +++ b/mm/cma.c
>>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>>> pr_debug("%s(): returned %p\n", __func__, page);
>>> out:
>>> if (page) {
>>> - count_vm_event(CMA_ALLOC_SUCCESS);
>>> + count_vm_events(CMA_ALLOC_SUCCESS, count);
>>> cma_sysfs_account_success_pages(cma, count);
>>> } else {
>>> - count_vm_event(CMA_ALLOC_FAIL);
>>> + count_vm_events(CMA_ALLOC_FAIL, count);
>>> if (cma)
>>> cma_sysfs_account_fail_pages(cma, count);
>>> }
>>
>> Without getting into the merits of this patch - which is actually trying to do
>> semantics change to /proc/vmstat, wondering how is this even related to this
>> particular series ? If required this could be debated on it's on separately.
>
> Having the number of CMA pages allocated and the number of CMA pages freed
> allows someone to infer how many tagged pages are in use at a given time:
That should not be done in CMA, which is a generic, multi-purpose allocator.
> (allocated CMA pages - CMA pages allocated by drivers* - CMA pages
> released) * 32. That is valuable information for software and hardware
> designers.
>
> Besides that, for every iteration of the series, this has proven invaluable
> for discovering bugs with freeing and/or reserving tag storage pages.
I am afraid that might not be enough justification for getting something
merged into mainline.
>
> *that would require userspace reading cma_alloc_success and
> cma_release_success before any tagged allocations are performed.
While assuming that no other non-memory-tagged CMA based allocation and free
calls happen in the meantime? That would be on really thin ice.
I suppose arm64-specific counters for tagged memory allocations and frees
need to be created on the caller side, including in arch_free_pages_prepare().
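Something along these lines, as a very rough sketch - none of the names
below exist today, they are only meant to illustrate the idea:

/* arm64-private counters for tagged page usage (illustrative only) */
static atomic_long_t mte_tagged_pages_allocated = ATOMIC_LONG_INIT(0);
static atomic_long_t mte_tagged_pages_freed = ATOMIC_LONG_INIT(0);

/* e.g. called from arch_alloc_page(), when the allocation is tagged */
static inline void mte_account_tagged_alloc(int order)
{
	atomic_long_add(1UL << order, &mte_tagged_pages_allocated);
}

/* e.g. called from arch_free_pages_prepare(), for tagged pages */
static inline void mte_account_tagged_free(int order)
{
	atomic_long_add(1UL << order, &mte_tagged_pages_freed);
}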
On 1/25/24 22:12, Alexandru Elisei wrote:
> Today, cma_alloc() is used to allocate a contiguous memory region. The
> function allows the caller to specify the number of pages to allocate, but
> not the starting address. cma_alloc() will walk over the entire CMA region
> trying to allocate the first available range of the specified size.
>
> Introduce cma_alloc_range(), which makes CMA more versatile by allowing the
> caller to specify a particular range in the CMA region, defined by the
> start pfn and the size.
>
> arm64 will make use of this function when tag storage management will be
> implemented: cma_alloc_range() will be used to reserve the tag storage
> associated with a tagged page.
Basically, you would like to pass on a preferred start address and the
allocation could just fail if a contig range is not available from such
a starting address ?
Then why not just change cma_alloc() to take a new argument 'start_pfn'?
Why create a new but very similar allocator?
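I.e. something along these lines - just a sketch of the signature I have in
mind, the 'no preference' semantics for start_pfn would still need to be
defined:

struct page *cma_alloc(struct cma *cma, unsigned long start_pfn,
		       unsigned long count, unsigned int align,
		       bool no_warn);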
But then I am wondering why this could not be done in the arm64 platform
code itself operating on a CMA area reserved just for tag storage. Unless
this new allocator has other usage beyond MTE, this could be implemented
in the platform itself.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch.
>
> include/linux/cma.h | 2 +
> include/trace/events/cma.h | 59 ++++++++++++++++++++++++++
> mm/cma.c | 86 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 147 insertions(+)
>
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index 63873b93deaa..e32559da6942 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> struct cma **res_cma);
> extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
> bool no_warn);
> +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> + unsigned tries, gfp_t gfp);
> extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
> extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
>
> diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h
> index 25103e67737c..a89af313a572 100644
> --- a/include/trace/events/cma.h
> +++ b/include/trace/events/cma.h
> @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release,
> __entry->count)
> );
>
> +TRACE_EVENT(cma_alloc_range_start,
> +
> + TP_PROTO(const char *name, unsigned long start, unsigned long count,
> + unsigned tries),
> +
> + TP_ARGS(name, start, count, tries),
> +
> + TP_STRUCT__entry(
> + __string(name, name)
> + __field(unsigned long, start)
> + __field(unsigned long, count)
> + __field(unsigned, tries)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, name);
> + __entry->start = start;
> + __entry->count = count;
> + __entry->tries = tries;
> + ),
> +
> + TP_printk("name=%s start=%lx count=%lu tries=%u",
> + __get_str(name),
> + __entry->start,
> + __entry->count,
> + __entry->tries)
> +);
> +
> +TRACE_EVENT(cma_alloc_range_finish,
> +
> + TP_PROTO(const char *name, unsigned long start, unsigned long count,
> + unsigned attempts, int err),
> +
> + TP_ARGS(name, start, count, attempts, err),
> +
> + TP_STRUCT__entry(
> + __string(name, name)
> + __field(unsigned long, start)
> + __field(unsigned long, count)
> + __field(unsigned, attempts)
> + __field(int, err)
> + ),
> +
> + TP_fast_assign(
> + __assign_str(name, name);
> + __entry->start = start;
> + __entry->count = count;
> + __entry->attempts = attempts;
> + __entry->err = err;
> + ),
> +
> + TP_printk("name=%s start=%lx count=%lu attempts=%u err=%d",
> + __get_str(name),
> + __entry->start,
> + __entry->count,
> + __entry->attempts,
> + __entry->err)
> +);
> +
> TRACE_EVENT(cma_alloc_start,
>
> TP_PROTO(const char *name, unsigned long count, unsigned int align),
> diff --git a/mm/cma.c b/mm/cma.c
> index 543bb6b3be8e..4a0f68b9443b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -416,6 +416,92 @@ static void cma_debug_show_areas(struct cma *cma)
> static inline void cma_debug_show_areas(struct cma *cma) { }
> #endif
>
> +/**
> + * cma_alloc_range() - allocate pages in a specific range
> + * @cma: Contiguous memory region for which the allocation is performed.
> + * @start: Starting pfn of the allocation.
> + * @count: Requested number of pages
> + * @tries: Number of tries if the range is busy
> + * @gfp: GFP mask passed to alloc_contig_range()
> + *
> + * This function allocates part of contiguous memory from a specific contiguous
> + * memory area, from the specified starting address. The 'start' pfn and the
> + * 'count' number of pages must be aligned to the CMA bitmap order per bit.
> + */
> +int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> + unsigned tries, gfp_t gfp)
> +{
> + unsigned long bitmap_maxno, bitmap_no, bitmap_start, bitmap_count;
> + unsigned long i = 0;
> + struct page *page;
> + int err = -EINVAL;
> +
> + if (!cma || !cma->count || !cma->bitmap)
> + goto out_stats;
> +
> + trace_cma_alloc_range_start(cma->name, start, count, tries);
> +
> + if (!count || start < cma->base_pfn ||
> + start + count > cma->base_pfn + cma->count)
> + goto out_stats;
> +
> + if (!IS_ALIGNED(start | count, 1 << cma->order_per_bit))
> + goto out_stats;
> +
> + bitmap_start = (start - cma->base_pfn) >> cma->order_per_bit;
> + bitmap_maxno = cma_bitmap_maxno(cma);
> + bitmap_count = cma_bitmap_pages_to_bits(cma, count);
> +
> + spin_lock_irq(&cma->lock);
> + bitmap_no = bitmap_find_next_zero_area(cma->bitmap, bitmap_maxno,
> + bitmap_start, bitmap_count, 0);
> + if (bitmap_no != bitmap_start) {
> + spin_unlock_irq(&cma->lock);
> + err = -EEXIST;
> + goto out_stats;
> + }
> + bitmap_set(cma->bitmap, bitmap_start, bitmap_count);
> + spin_unlock_irq(&cma->lock);
> +
> + for (i = 0; i < tries; i++) {
> + mutex_lock(&cma_mutex);
> + err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp);
> + mutex_unlock(&cma_mutex);
> +
> + if (err != -EBUSY)
> + break;
> + }
> +
> + if (err) {
> + cma_clear_bitmap(cma, start, count);
> + } else {
> + page = pfn_to_page(start);
> +
> + /*
> + * CMA can allocate multiple page blocks, which results in
> + * different blocks being marked with different tags. Reset the
> + * tags to ignore those page blocks.
> + */
> + for (i = 0; i < count; i++)
> + page_kasan_tag_reset(nth_page(page, i));
> + }
> +
> +out_stats:
> + trace_cma_alloc_range_finish(cma->name, start, count, i, err);
> +
> + if (err) {
> + count_vm_events(CMA_ALLOC_FAIL, count);
> + if (cma)
> + cma_sysfs_account_fail_pages(cma, count);
> + } else {
> + count_vm_events(CMA_ALLOC_SUCCESS, count);
> + cma_sysfs_account_success_pages(cma, count);
> + }
> +
> + return err;
> +}
> +
> +
> /**
> * cma_alloc() - allocate pages from contiguous area
> * @cma: Contiguous memory region for which the allocation is performed.
On 1/25/24 22:12, Alexandru Elisei wrote:
> Memory is added to CMA with cma_declare_contiguous_nid() and
> cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in
> cma_init_reserved_areas(), where the page allocator can make use of it.
cma_declare_contiguous_nid() reserves memory in memblock and marks the
area for subsequent CMA usage, whereas cma_init_reserved_areas() activates
these memory areas through init_cma_reserved_pageblock(). The standard page
allocator only receives this memory via free_reserved_page() - only if
the page block activation fails.
>
> If a device manages multiple CMA areas, and there's an error when one of
> the areas is added to CMA, there is no mechanism for the device to prevent
What kind of error ? init_cma_reserved_pageblock() fails ? But that will
not happen until cma_init_reserved_areas().
> the rest of the areas, which were added before the error occured, from
> being later added to the MIGRATE_CMA list.
Why is this mechanism required ? cma_init_reserved_areas() scans over all
CMA areas and try and activate each of them sequentially. Why is not this
sufficient ?
>
> Add cma_remove_mem() which allows a previously reserved CMA area to be
> removed and thus it cannot be used by the page allocator.
Successfully activated CMA areas do not get used by the buddy allocator.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch.
>
> include/linux/cma.h | 1 +
> mm/cma.c | 30 +++++++++++++++++++++++++++++-
> 2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/cma.h b/include/linux/cma.h
> index e32559da6942..787cbec1702e 100644
> --- a/include/linux/cma.h
> +++ b/include/linux/cma.h
> @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> unsigned int order_per_bit,
> const char *name,
> struct cma **res_cma);
> +extern void cma_remove_mem(struct cma **res_cma);
> extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
> bool no_warn);
> extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> diff --git a/mm/cma.c b/mm/cma.c
> index 4a0f68b9443b..2881bab12b01 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void)
> {
> int i;
>
> - for (i = 0; i < cma_area_count; i++)
> + for (i = 0; i < cma_area_count; i++) {
> + /* Region was removed. */
> + if (!cma_areas[i].count)
> + continue;
Skip previously added CMA area (now zeroed out) ?
> cma_activate_area(&cma_areas[i]);
> + }
>
> return 0;
> }
cma_init_reserved_areas() gets called via core_initcall(). Somehow the
platform/device needs to call cma_remove_mem() before core_initcall()
gets called? This might be time sensitive.
> @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> return 0;
> }
>
> +/**
> + * cma_remove_mem() - remove cma area
> + * @res_cma: Pointer to the cma region.
> + *
> + * This function removes a cma region created with cma_init_reserved_mem(). The
> + * ->count is set to 0.
> + */
> +void __init cma_remove_mem(struct cma **res_cma)
> +{
> + struct cma *cma;
> +
> + if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
> + return;
> +
> + cma = *res_cma;
> + if (WARN_ON_ONCE(!cma->count))
> + return;
> +
> + totalcma_pages -= cma->count;
> + cma->count = 0;
> +
> + *res_cma = NULL;
> +}
> +
> /**
> * cma_declare_contiguous_nid() - reserve custom contiguous area
> * @base: Base address of the reserved area optional, use 0 for any
But first please do explain what errors the device or platform might
see on a previously marked CMA area, so that removing them becomes
necessary in order to prevent their activation via cma_init_reserved_areas().
On 1/25/24 22:12, Alexandru Elisei wrote:
> If the pages to be allocated are free, take them directly off the buddy
> allocator, instead of going through alloc_contig_range() and avoiding
> costly calls to lru_cache_disable().
>
> Only allocations of the same size as the CMA region order are considered,
> to avoid taking the zone spinlock for too long.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
This patch seems to be improving standard cma_alloc() as well as
the previously added new allocator i.e. cma_alloc_range() - via a
new helper cma_alloc_pages_fastpath().
Should not any standard cma_alloc() improvement be discussed as
an independent patch, separately and irrespective of this series? OR
is it somehow related to this series in a way that I might be missing?
> ---
>
> Changes since rfc v2:
>
> * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track
> reserving tag storage when the block is free") (David Hildenbrand).
>
> include/linux/page-flags.h | 15 ++++++++++++--
> mm/Kconfig | 5 +++++
> mm/cma.c | 42 ++++++++++++++++++++++++++++++++++----
> mm/memory-failure.c | 8 ++++----
> mm/page_alloc.c | 23 ++++++++++++---------
> 5 files changed, 73 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 735cddc13d20..b7237bce7446 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
> #define MAGIC_HWPOISON 0x48575053U /* HWPS */
> extern void SetPageHWPoisonTakenOff(struct page *page);
> extern void ClearPageHWPoisonTakenOff(struct page *page);
> -extern bool take_page_off_buddy(struct page *page);
> -extern bool put_page_back_buddy(struct page *page);
> +extern bool PageHWPoisonTakenOff(struct page *page);
> #else
> PAGEFLAG_FALSE(HWPoison, hwpoison)
> +TESTSCFLAG_FALSE(HWPoison, hwpoison)
> #define __PG_HWPOISON 0
> +static inline void SetPageHWPoisonTakenOff(struct page *page) { }
> +static inline void ClearPageHWPoisonTakenOff(struct page *page) { }
> +static inline bool PageHWPoisonTakenOff(struct page *page)
> +{
> + return false;
> +}
> +#endif
> +
> +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
> +extern bool take_page_off_buddy(struct page *page, bool poison);
> +extern bool put_page_back_buddy(struct page *page, bool unpoison);
> #endif
>
> #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
> diff --git a/mm/Kconfig b/mm/Kconfig
> index ffc3a2ba3a8c..341cf53898db 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR
> config ARCH_SUPPORTS_MEMORY_FAILURE
> bool
>
> +config WANTS_TAKE_PAGE_OFF_BUDDY
> +	bool
> +
> config MEMORY_FAILURE
> depends on MMU
> depends on ARCH_SUPPORTS_MEMORY_FAILURE
> bool "Enable recovery from hardware memory errors"
> select MEMORY_ISOLATION
> select RAS
> + select WANTS_TAKE_PAGE_OFF_BUDDY
> help
> Enables code to recover from some memory failures on systems
> with MCA recovery. This allows a system to continue running
> @@ -891,6 +895,7 @@ config CMA
> depends on MMU
> select MIGRATION
> select MEMORY_ISOLATION
> + select WANTS_TAKE_PAGE_OFF_BUDDY
> help
> This enables the Contiguous Memory Allocator which allows other
> subsystems to allocate big physically-contiguous blocks of memory.
> diff --git a/mm/cma.c b/mm/cma.c
> index 2881bab12b01..15663f95d77b 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma)
> static inline void cma_debug_show_areas(struct cma *cma) { }
> #endif
>
> +/* Called with the cma mutex held. */
> +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start,
> + unsigned long end)
> +{
> + bool success = false;
> + unsigned long i, j;
> +
> + /* Avoid contention on the zone lock. */
> +	if (end - start != 1 << cma->order_per_bit)
> + return -EINVAL;
> +
> + for (i = start; i < end; i++) {
> + if (!is_free_buddy_page(pfn_to_page(i)))
> + break;
> + success = take_page_off_buddy(pfn_to_page(i), false);
> + if (!success)
> + break;
> + }
> +
> + if (success)
> + return 0;
> +
> + for (j = start; j < i; j++)
> + put_page_back_buddy(pfn_to_page(j), false);
> +
> + return -EBUSY;
> +}
> +
> /**
> * cma_alloc_range() - allocate pages in a specific range
> * @cma: Contiguous memory region for which the allocation is performed.
> @@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
>
> for (i = 0; i < tries; i++) {
> mutex_lock(&cma_mutex);
> - err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp);
> + err = cma_alloc_pages_fastpath(cma, start, start + count);
> + if (err) {
> + err = alloc_contig_range(start, start + count,
> + MIGRATE_CMA, gfp);
> + }
> mutex_unlock(&cma_mutex);
>
> if (err != -EBUSY)
> @@ -529,7 +561,6 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> return err;
> }
>
> -
> /**
> * cma_alloc() - allocate pages from contiguous area
> * @cma: Contiguous memory region for which the allocation is performed.
> @@ -589,8 +620,11 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>
> pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> mutex_lock(&cma_mutex);
> - ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> - GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> + ret = cma_alloc_pages_fastpath(cma, pfn, pfn + count);
> + if (ret) {
> + ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> + GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> + }
> mutex_unlock(&cma_mutex);
> if (ret == 0) {
> page = pfn_to_page(pfn);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 4f9b61f4a668..b87b533a9871 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -157,7 +157,7 @@ static int __page_handle_poison(struct page *page)
> zone_pcp_disable(page_zone(page));
> ret = dissolve_free_huge_page(page);
> if (!ret)
> - ret = take_page_off_buddy(page);
> + ret = take_page_off_buddy(page, true);
> zone_pcp_enable(page_zone(page));
>
> return ret;
> @@ -1353,7 +1353,7 @@ static int page_action(struct page_state *ps, struct page *p,
> return action_result(pfn, ps->type, result);
> }
>
> -static inline bool PageHWPoisonTakenOff(struct page *page)
> +bool PageHWPoisonTakenOff(struct page *page)
> {
> return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
> }
> @@ -2247,7 +2247,7 @@ int memory_failure(unsigned long pfn, int flags)
> res = get_hwpoison_page(p, flags);
> if (!res) {
> if (is_free_buddy_page(p)) {
> - if (take_page_off_buddy(p)) {
> + if (take_page_off_buddy(p, true)) {
> page_ref_inc(p);
> res = MF_RECOVERED;
> } else {
> @@ -2578,7 +2578,7 @@ int unpoison_memory(unsigned long pfn)
> ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;
> } else if (ghp < 0) {
> if (ghp == -EHWPOISON) {
> - ret = put_page_back_buddy(p) ? 0 : -EBUSY;
> + ret = put_page_back_buddy(p, true) ? 0 : -EBUSY;
> } else {
> ret = ghp;
> unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fa34bcfb1af..502ee3eb8583 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6655,7 +6655,7 @@ bool is_free_buddy_page(struct page *page)
> }
> EXPORT_SYMBOL(is_free_buddy_page);
>
> -#ifdef CONFIG_MEMORY_FAILURE
> +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
> /*
> * Break down a higher-order page in sub-pages, and keep our target out of
> * buddy allocator.
> @@ -6687,9 +6687,9 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
> }
>
> /*
> - * Take a page that will be marked as poisoned off the buddy allocator.
> + * Take a page off the buddy allocator, and optionally mark it as poisoned.
> */
> -bool take_page_off_buddy(struct page *page)
> +bool take_page_off_buddy(struct page *page, bool poison)
> {
> struct zone *zone = page_zone(page);
> unsigned long pfn = page_to_pfn(page);
> @@ -6710,7 +6710,8 @@ bool take_page_off_buddy(struct page *page)
> del_page_from_free_list(page_head, zone, page_order);
> break_down_buddy_pages(zone, page_head, page, 0,
> page_order, migratetype);
> - SetPageHWPoisonTakenOff(page);
> + if (poison)
> + SetPageHWPoisonTakenOff(page);
> if (!is_migrate_isolate(migratetype))
> __mod_zone_freepage_state(zone, -1, migratetype);
> ret = true;
> @@ -6724,9 +6725,10 @@ bool take_page_off_buddy(struct page *page)
> }
>
> /*
> - * Cancel takeoff done by take_page_off_buddy().
> + * Cancel takeoff done by take_page_off_buddy(), and optionally unpoison the
> + * page.
> */
> -bool put_page_back_buddy(struct page *page)
> +bool put_page_back_buddy(struct page *page, bool unpoison)
> {
> struct zone *zone = page_zone(page);
> unsigned long pfn = page_to_pfn(page);
> @@ -6736,17 +6738,18 @@ bool put_page_back_buddy(struct page *page)
>
> spin_lock_irqsave(&zone->lock, flags);
> if (put_page_testzero(page)) {
> - ClearPageHWPoisonTakenOff(page);
> + VM_WARN_ON_ONCE(PageHWPoisonTakenOff(page) && !unpoison);
> + if (unpoison)
> + ClearPageHWPoisonTakenOff(page);
> __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
> - if (TestClearPageHWPoison(page)) {
> + if (!unpoison || (unpoison && TestClearPageHWPoison(page)))
> ret = true;
> - }
> }
> spin_unlock_irqrestore(&zone->lock, flags);
>
> return ret;
> }
> -#endif
> +#endif /* CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY */
>
> #ifdef CONFIG_ZONE_DMA
> bool has_managed_dma(void)
On 1/25/24 22:12, Alexandru Elisei wrote:
> arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
> When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
> the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
> set in vma_alloc_zeroed_movable_folio().
>
> Expand this to be more generic by adding an arch hook that modifes the gfp
> flags for an allocation when the VMA is known.
>
> Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
> is also set; from that point of view, the current behaviour is unchanged,
> even though the arm64 flag is set in more places. When arm64 will have
> support to reuse the tag storage for data allocation, the uses of the
> __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
> to reserve the corresponding tag storage for the pages being allocated.
Right, but how does pushing the __GFP_ZEROTAGS addition further down into
the gfp_t flags via a new arch callback, i.e. arch_calc_vma_gfp(), while
still maintaining the (vma->vm_flags & VM_MTE) conditionality, improve the
current scenario? The page allocator could still have analyzed the alloc
flags for __GFP_ZEROTAGS for any additional stuff.
OR does this just add some new core MM paths that get __GFP_ZEROTAGS via
this callback, which was not the case earlier?
>
> The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by
> the caller; this has been done to keep an architecture from modifying the
> flags already set by the core memory management code; this is similar to
> how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been
> implemented. This can be revisited in the future if there's a need to do
> so.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
> arch/arm64/include/asm/page.h | 5 ++---
> arch/arm64/include/asm/pgtable.h | 3 +++
> arch/arm64/mm/fault.c | 19 ++++++-------------
> include/linux/pgtable.h | 7 +++++++
> mm/mempolicy.c | 1 +
> mm/shmem.c | 5 ++++-
> 6 files changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> index 2312e6ee595f..88bab032a493 100644
> --- a/arch/arm64/include/asm/page.h
> +++ b/arch/arm64/include/asm/page.h
> @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from,
> void copy_highpage(struct page *to, struct page *from);
> #define __HAVE_ARCH_COPY_HIGHPAGE
>
> -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> - unsigned long vaddr);
> -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
> +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
>
> void tag_clear_highpage(struct page *to);
> #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751..08f0904dbfc2 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
>
> #endif /* CONFIG_ARM64_MTE */
>
> +#define __HAVE_ARCH_CALC_VMA_GFP
> +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp);
> +
> /*
> * On AArch64, the cache coherency is handled via the set_pte_at() function.
> */
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 55f6455a8284..4d3f0a870ad8 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
> NOKPROBE_SYMBOL(do_debug_exception);
>
> /*
> - * Used during anonymous page fault handling.
> + * If this is called during anonymous page fault handling, and the page is
> + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as this
> + * is usually faster than separate DC ZVA and STGM.
> */
> -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> - unsigned long vaddr)
> +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
> {
> - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
> -
> - /*
> - * If the page is mapped with PROT_MTE, initialise the tags at the
> - * point of allocation and page zeroing as this is usually faster than
> - * separate DC ZVA and STGM.
> - */
> if (vma->vm_flags & VM_MTE)
> - flags |= __GFP_ZEROTAGS;
> -
> - return vma_alloc_folio(flags, 0, vma, vaddr, false);
> + return __GFP_ZEROTAGS;
> + return 0;
> }
>
> void tag_clear_highpage(struct page *page)
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c5ddec6b5305..98f81ca08cbe 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -901,6 +901,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
> }
> #endif
>
> +#ifndef __HAVE_ARCH_CALC_VMA_GFP
> +static inline gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
> +{
> + return 0;
> +}
> +#endif
> +
> #ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
> static inline void arch_free_pages_prepare(struct page *page, int order) { }
> #endif
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..f7ef52760b32 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2168,6 +2168,7 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
> pgoff_t ilx;
> struct page *page;
>
> + gfp |= arch_calc_vma_gfp(vma, gfp);
> pol = get_vma_policy(vma, addr, order, &ilx);
> page = alloc_pages_mpol(gfp | __GFP_COMP, order,
> pol, ilx, numa_node_id());
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d7c84ff62186..14427e9982f9 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1585,7 +1585,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> */
> static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
> {
> - gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
> + gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_ZEROTAGS;
> gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
> gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
> gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
> @@ -2038,6 +2038,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> gfp_t huge_gfp;
>
> huge_gfp = vma_thp_gfp_mask(vma);
> + huge_gfp |= arch_calc_vma_gfp(vma, huge_gfp);
> huge_gfp = limit_gfp_mask(huge_gfp, gfp);
> folio = shmem_alloc_and_add_folio(huge_gfp,
> inode, index, fault_mm, true);
> @@ -2214,6 +2215,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> vm_fault_t ret = 0;
> int err;
>
> + gfp |= arch_calc_vma_gfp(vmf->vma, gfp);
> +
> /*
> * Trinity finds that probing a hole which tmpfs is punching can
> * prevent the hole-punch from ever completing: noted in i_private.
Hi,
I really appreciate the feedback you have given me so far. I believe the
commit message isn't clear enough and there has been a confusion.
A CMA user adds a CMA area to the cma_areas array with
cma_declare_contiguous_nid() or cma_init_reserved_mem().
cma_init_reserved_areas() then iterates over the array and activates
all the cma areas.
The function cma_remove_mem() is intended to be used to remove a cma area
from the cma_areas array **before** the area has been activated.
Usecase: a driver (in this case, the arm64 dynamic tag storage code)
manages several cma areas. The driver successfully adds the first area to
the cma_areas array. When the driver tries to add the second area, the
function fails. Without cma_remove_mem(), the driver has no way to prevent
the first area from being freed to the page allocator. cma_remove_mem() is
about providing a means to do cleanup in case of error.
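As a rough sketch of the intended usage in the tag storage driver - the
function name, the number of areas and the base[]/size[] values below are
purely illustrative:

static struct cma *tag_cma[2];

static int __init reserve_tag_storage_areas(void)
{
	int i, ret;

	for (i = 0; i < 2; i++) {
		/* base[i]/size[i]: hypothetical values parsed from the DT */
		ret = cma_init_reserved_mem(base[i], size[i], 0,
					    "tag storage", &tag_cma[i]);
		if (ret)
			goto out_remove;
	}

	return 0;

out_remove:
	/*
	 * Adding area 'i' failed; prevent the areas added before it from
	 * being activated later in cma_init_reserved_areas().
	 */
	while (--i >= 0)
		cma_remove_mem(&tag_cma[i]);

	return ret;
}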
Does that make more sense now?
On Tue, Jan 30, 2024 at 11:20:56AM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Memory is added to CMA with cma_declare_contiguous_nid() and
> > cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in
> > cma_init_reserved_areas(), where the page allocator can make use of it.
>
> cma_declare_contiguous_nid() reserves memory in memblock and marks the
You forgot about cma_init_reserved_mem(), which does the same thing,
but yes, you are right.
> for subsequent CMA usage, where as cma_init_reserved_areas() activates
> these memory areas through init_cma_reserved_pageblock(). Standard page
> allocator only receives these memory via free_reserved_page() - only if
I don't think that's correct. init_cma_reserved_pageblock() clears the
PG_reserved page flag, sets the migratetype to MIGRATE_CMA and then frees
the page. After that, the page is available to the standard page allocator
to use for allocation. Otherwise, what would be the point of the
MIGRATE_CMA migratetype?
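From memory, init_cma_reserved_pageblock() does roughly the following for
each pageblock in the area (paraphrased, not the exact code):

/* paraphrased sketch of init_cma_reserved_pageblock() */
do {
	__ClearPageReserved(p);			/* drop PG_reserved */
	set_page_count(p, 0);
} while (++p, --i);

set_pageblock_migratetype(page, MIGRATE_CMA);
set_page_refcounted(page);
__free_pages(page, pageblock_order);		/* hand the block to the buddy */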
> the page block activation fails.
For the sake of having a complete picture, I'll add that this only happens
if cma->reserve_pages_on_error is false. If the CMA user sets the field to
'true' (with cma_reserve_pages_on_error()), then the pages in the CMA
region are kept PG_reserved if activation fails.
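For reference, a CMA user opts into that behaviour with the existing
helper:

	cma_reserve_pages_on_error(cma);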
>
> >
> > If a device manages multiple CMA areas, and there's an error when one of
> > the areas is added to CMA, there is no mechanism for the device to prevent
>
> What kind of error ? init_cma_reserved_pageblock() fails ? But that will
> not happen until cma_init_reserved_areas().
I think I haven't been clear enough. When I say that "an area is added
to CMA", I mean that the memory region is added to cma_areas array, via
cma_declare_contiguous_nid() or cma_init_reserved_mem(). There are several
ways in which either function can fail.
>
> > the rest of the areas, which were added before the error occured, from
> > being later added to the MIGRATE_CMA list.
>
> Why is this mechanism required ? cma_init_reserved_areas() scans over all
> CMA areas and try and activate each of them sequentially. Why is not this
> sufficient ?
This patch is about removing a struct cma from the cma_areas array after it
has been added to the array, with cma_declare_contiguous_nid() or
cma_init_reserved_mem(), to prevent the area from being activated in
cma_init_reserved_areas(). Sorry for the confusion.
I'll add a check in cma_remove_mem() to fail if the cma area has been
activated, and a comment to the function to explain its usage.
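Probably something as simple as this (sketch; it relies on the bitmap only
being allocated when the area is activated):

	/* in cma_remove_mem(), before touching anything else */
	if (WARN_ON_ONCE(cma->bitmap))
		return;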
>
> >
> > Add cma_remove_mem() which allows a previously reserved CMA area to be
> > removed and thus it cannot be used by the page allocator.
>
> Successfully activated CMA areas do not get used by the buddy allocator.
I don't believe that is correct, see above.
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch.
> >
> > include/linux/cma.h | 1 +
> > mm/cma.c | 30 +++++++++++++++++++++++++++++-
> > 2 files changed, 30 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/cma.h b/include/linux/cma.h
> > index e32559da6942..787cbec1702e 100644
> > --- a/include/linux/cma.h
> > +++ b/include/linux/cma.h
> > @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> > unsigned int order_per_bit,
> > const char *name,
> > struct cma **res_cma);
> > +extern void cma_remove_mem(struct cma **res_cma);
> > extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
> > bool no_warn);
> > extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 4a0f68b9443b..2881bab12b01 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void)
> > {
> > int i;
> >
> > - for (i = 0; i < cma_area_count; i++)
> > + for (i = 0; i < cma_area_count; i++) {
> > + /* Region was removed. */
> > + if (!cma_areas[i].count)
> > + continue;
>
> Skip previously added CMA area (now zeroed out) ?
Yes, that's what I meant with the comment "Region was removed". Do you
think I should reword the comment?
>
> > cma_activate_area(&cma_areas[i]);
> > + }
> >
> > return 0;
> > }
>
> cma_init_reserved_areas() gets called via core_initcall(). Some how
> platform/device needs to call cma_remove_mem() before core_initcall()
> gets called ? This might be time sensitive.
I don't understand your point.
>
> > @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> > return 0;
> > }
> >
> > +/**
> > + * cma_remove_mem() - remove cma area
> > + * @res_cma: Pointer to the cma region.
> > + *
> > + * This function removes a cma region created with cma_init_reserved_mem(). The
> > + * ->count is set to 0.
> > + */
> > +void __init cma_remove_mem(struct cma **res_cma)
> > +{
> > + struct cma *cma;
> > +
> > + if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
> > + return;
> > +
> > + cma = *res_cma;
> > + if (WARN_ON_ONCE(!cma->count))
> > + return;
> > +
> > + totalcma_pages -= cma->count;
> > + cma->count = 0;
> > +
> > + *res_cma = NULL;
> > +}
> > +
> > /**
> > * cma_declare_contiguous_nid() - reserve custom contiguous area
> > * @base: Base address of the reserved area optional, use 0 for any
>
> But first please do explain what are the errors device or platform might
cma_declare_contiguous_nid() and cma_init_reserved_mem() can fail in a
number of ways; the code should be self-documenting.
> see on a previously marked CMA area so that removing them on way becomes
> necessary preventing their activation via cma_init_reserved_areas().
I've described how the function is supposed to be used at the top of my
reply.
Thanks,
Alex
Hi,
On Tue, Jan 30, 2024 at 02:48:53PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > If the pages to be allocated are free, take them directly off the buddy
> > allocator, instead of going through alloc_contig_range() and avoiding
> > costly calls to lru_cache_disable().
> >
> > Only allocations of the same size as the CMA region order are considered,
> > to avoid taking the zone spinlock for too long.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
>
> This patch seems to be improving standard cma_alloc() as well as
> the previously added new allocator i.e cma_alloc_range() - via a
> new helper cma_alloc_pages_fastpath().
Yes, that's correct.
>
> Should not any standard cma_alloc() improvement be discussed as
> an independent patch separately irrespective of this series. OR
> it is some how related to this series which I might be missing ?
Yes, it's related to this series. I wrote this patch because it fixes a
performance regression with Chrome when dynamic tag storage management is
enabled [1]. I will bring back the commit message explaining that.
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
Thanks,
Alex
>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch. Reworked from the rfc v2 patch #26 ("arm64: mte: Fast track
> > reserving tag storage when the block is free") (David Hildenbrand).
> >
> > include/linux/page-flags.h | 15 ++++++++++++--
> > mm/Kconfig | 5 +++++
> > mm/cma.c | 42 ++++++++++++++++++++++++++++++++++----
> > mm/memory-failure.c | 8 ++++----
> > mm/page_alloc.c | 23 ++++++++++++---------
> > 5 files changed, 73 insertions(+), 20 deletions(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 735cddc13d20..b7237bce7446 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -575,11 +575,22 @@ TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
> > #define MAGIC_HWPOISON 0x48575053U /* HWPS */
> > extern void SetPageHWPoisonTakenOff(struct page *page);
> > extern void ClearPageHWPoisonTakenOff(struct page *page);
> > -extern bool take_page_off_buddy(struct page *page);
> > -extern bool put_page_back_buddy(struct page *page);
> > +extern bool PageHWPoisonTakenOff(struct page *page);
> > #else
> > PAGEFLAG_FALSE(HWPoison, hwpoison)
> > +TESTSCFLAG_FALSE(HWPoison, hwpoison)
> > #define __PG_HWPOISON 0
> > +static inline void SetPageHWPoisonTakenOff(struct page *page) { }
> > +static inline void ClearPageHWPoisonTakenOff(struct page *page) { }
> > +static inline bool PageHWPoisonTakenOff(struct page *page)
> > +{
> > + return false;
> > +}
> > +#endif
> > +
> > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
> > +extern bool take_page_off_buddy(struct page *page, bool poison);
> > +extern bool put_page_back_buddy(struct page *page, bool unpoison);
> > #endif
> >
> > #if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index ffc3a2ba3a8c..341cf53898db 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -745,12 +745,16 @@ config DEFAULT_MMAP_MIN_ADDR
> > config ARCH_SUPPORTS_MEMORY_FAILURE
> > bool
> >
> > +config WANTS_TAKE_PAGE_OFF_BUDDY
> > +	bool
> > +
> > config MEMORY_FAILURE
> > depends on MMU
> > depends on ARCH_SUPPORTS_MEMORY_FAILURE
> > bool "Enable recovery from hardware memory errors"
> > select MEMORY_ISOLATION
> > select RAS
> > + select WANTS_TAKE_PAGE_OFF_BUDDY
> > help
> > Enables code to recover from some memory failures on systems
> > with MCA recovery. This allows a system to continue running
> > @@ -891,6 +895,7 @@ config CMA
> > depends on MMU
> > select MIGRATION
> > select MEMORY_ISOLATION
> > + select WANTS_TAKE_PAGE_OFF_BUDDY
> > help
> > This enables the Contiguous Memory Allocator which allows other
> > subsystems to allocate big physically-contiguous blocks of memory.
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 2881bab12b01..15663f95d77b 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -444,6 +444,34 @@ static void cma_debug_show_areas(struct cma *cma)
> > static inline void cma_debug_show_areas(struct cma *cma) { }
> > #endif
> >
> > +/* Called with the cma mutex held. */
> > +static int cma_alloc_pages_fastpath(struct cma *cma, unsigned long start,
> > + unsigned long end)
> > +{
> > + bool success = false;
> > + unsigned long i, j;
> > +
> > + /* Avoid contention on the zone lock. */
> > +	if (end - start != 1 << cma->order_per_bit)
> > + return -EINVAL;
> > +
> > + for (i = start; i < end; i++) {
> > + if (!is_free_buddy_page(pfn_to_page(i)))
> > + break;
> > + success = take_page_off_buddy(pfn_to_page(i), false);
> > + if (!success)
> > + break;
> > + }
> > +
> > + if (success)
> > + return 0;
> > +
> > + for (j = start; j < i; j++)
> > + put_page_back_buddy(pfn_to_page(j), false);
> > +
> > + return -EBUSY;
> > +}
> > +
> > /**
> > * cma_alloc_range() - allocate pages in a specific range
> > * @cma: Contiguous memory region for which the allocation is performed.
> > @@ -493,7 +521,11 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> >
> > for (i = 0; i < tries; i++) {
> > mutex_lock(&cma_mutex);
> > - err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp);
> > + err = cma_alloc_pages_fastpath(cma, start, start + count);
> > + if (err) {
> > + err = alloc_contig_range(start, start + count,
> > + MIGRATE_CMA, gfp);
> > + }
> > mutex_unlock(&cma_mutex);
> >
> > if (err != -EBUSY)
> > @@ -529,7 +561,6 @@ int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> > return err;
> > }
> >
> > -
> > /**
> > * cma_alloc() - allocate pages from contiguous area
> > * @cma: Contiguous memory region for which the allocation is performed.
> > @@ -589,8 +620,11 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> >
> > pfn = cma->base_pfn + (bitmap_no << cma->order_per_bit);
> > mutex_lock(&cma_mutex);
> > - ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> > - GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> > + ret = cma_alloc_pages_fastpath(cma, pfn, pfn + count);
> > + if (ret) {
> > + ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA,
> > + GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
> > + }
> > mutex_unlock(&cma_mutex);
> > if (ret == 0) {
> > page = pfn_to_page(pfn);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 4f9b61f4a668..b87b533a9871 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -157,7 +157,7 @@ static int __page_handle_poison(struct page *page)
> > zone_pcp_disable(page_zone(page));
> > ret = dissolve_free_huge_page(page);
> > if (!ret)
> > - ret = take_page_off_buddy(page);
> > + ret = take_page_off_buddy(page, true);
> > zone_pcp_enable(page_zone(page));
> >
> > return ret;
> > @@ -1353,7 +1353,7 @@ static int page_action(struct page_state *ps, struct page *p,
> > return action_result(pfn, ps->type, result);
> > }
> >
> > -static inline bool PageHWPoisonTakenOff(struct page *page)
> > +bool PageHWPoisonTakenOff(struct page *page)
> > {
> > return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
> > }
> > @@ -2247,7 +2247,7 @@ int memory_failure(unsigned long pfn, int flags)
> > res = get_hwpoison_page(p, flags);
> > if (!res) {
> > if (is_free_buddy_page(p)) {
> > - if (take_page_off_buddy(p)) {
> > + if (take_page_off_buddy(p, true)) {
> > page_ref_inc(p);
> > res = MF_RECOVERED;
> > } else {
> > @@ -2578,7 +2578,7 @@ int unpoison_memory(unsigned long pfn)
> > ret = folio_test_clear_hwpoison(folio) ? 0 : -EBUSY;
> > } else if (ghp < 0) {
> > if (ghp == -EHWPOISON) {
> > - ret = put_page_back_buddy(p) ? 0 : -EBUSY;
> > + ret = put_page_back_buddy(p, true) ? 0 : -EBUSY;
> > } else {
> > ret = ghp;
> > unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 0fa34bcfb1af..502ee3eb8583 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6655,7 +6655,7 @@ bool is_free_buddy_page(struct page *page)
> > }
> > EXPORT_SYMBOL(is_free_buddy_page);
> >
> > -#ifdef CONFIG_MEMORY_FAILURE
> > +#ifdef CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY
> > /*
> > * Break down a higher-order page in sub-pages, and keep our target out of
> > * buddy allocator.
> > @@ -6687,9 +6687,9 @@ static void break_down_buddy_pages(struct zone *zone, struct page *page,
> > }
> >
> > /*
> > - * Take a page that will be marked as poisoned off the buddy allocator.
> > + * Take a page off the buddy allocator, and optionally mark it as poisoned.
> > */
> > -bool take_page_off_buddy(struct page *page)
> > +bool take_page_off_buddy(struct page *page, bool poison)
> > {
> > struct zone *zone = page_zone(page);
> > unsigned long pfn = page_to_pfn(page);
> > @@ -6710,7 +6710,8 @@ bool take_page_off_buddy(struct page *page)
> > del_page_from_free_list(page_head, zone, page_order);
> > break_down_buddy_pages(zone, page_head, page, 0,
> > page_order, migratetype);
> > - SetPageHWPoisonTakenOff(page);
> > + if (poison)
> > + SetPageHWPoisonTakenOff(page);
> > if (!is_migrate_isolate(migratetype))
> > __mod_zone_freepage_state(zone, -1, migratetype);
> > ret = true;
> > @@ -6724,9 +6725,10 @@ bool take_page_off_buddy(struct page *page)
> > }
> >
> > /*
> > - * Cancel takeoff done by take_page_off_buddy().
> > + * Cancel takeoff done by take_page_off_buddy(), and optionally unpoison the
> > + * page.
> > */
> > -bool put_page_back_buddy(struct page *page)
> > +bool put_page_back_buddy(struct page *page, bool unpoison)
> > {
> > struct zone *zone = page_zone(page);
> > unsigned long pfn = page_to_pfn(page);
> > @@ -6736,17 +6738,18 @@ bool put_page_back_buddy(struct page *page)
> >
> > spin_lock_irqsave(&zone->lock, flags);
> > if (put_page_testzero(page)) {
> > - ClearPageHWPoisonTakenOff(page);
> > + VM_WARN_ON_ONCE(PageHWPoisonTakenOff(page) && !unpoison);
> > + if (unpoison)
> > + ClearPageHWPoisonTakenOff(page);
> > __free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
> > - if (TestClearPageHWPoison(page)) {
> > + if (!unpoison || (unpoison && TestClearPageHWPoison(page)))
> > ret = true;
> > - }
> > }
> > spin_unlock_irqrestore(&zone->lock, flags);
> >
> > return ret;
> > }
> > -#endif
> > +#endif /* CONFIG_WANTS_TAKE_PAGE_OFF_BUDDY */
> >
> > #ifdef CONFIG_ZONE_DMA
> > bool has_managed_dma(void)
Hi,
On Tue, Jan 30, 2024 at 03:25:20PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
> > When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
> > the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
> > set in vma_alloc_zeroed_movable_folio().
> >
> > Expand this to be more generic by adding an arch hook that modifes the gfp
> > flags for an allocation when the VMA is known.
> >
> > Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
> > is also set; from that point of view, the current behaviour is unchanged,
> > even though the arm64 flag is set in more places. When arm64 will have
> > support to reuse the tag storage for data allocation, the uses of the
> > __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
> > to reserve the corresponding tag storage for the pages being allocated.
>
> Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further
> down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining
> (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because
I'm afraid I don't follow you.
> the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS
> for any additional stuff.
>
> OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not
> the case earlier via this call back.
Before this patch: vma_alloc_zeroed_movable_folio() sets __GFP_ZEROTAGS.
After this patch: vma_alloc_folio() sets __GFP_ZEROTAGS.
This patch is about adding __GFP_ZEROTAGS for more callers.
Thanks,
Alex
>
> >
> > The flags returned by arch_calc_vma_gfp() are or'ed with the flags set by
> > the caller; this has been done to keep an architecture from modifying the
> > flags already set by the core memory management code; this is similar to
> > how do_mmap() -> calc_vm_flag_bits() -> arch_calc_vm_flag_bits() has been
> > implemented. This can be revisited in the future if there's a need to do
> > so.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> > arch/arm64/include/asm/page.h | 5 ++---
> > arch/arm64/include/asm/pgtable.h | 3 +++
> > arch/arm64/mm/fault.c | 19 ++++++-------------
> > include/linux/pgtable.h | 7 +++++++
> > mm/mempolicy.c | 1 +
> > mm/shmem.c | 5 ++++-
> > 6 files changed, 23 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/page.h b/arch/arm64/include/asm/page.h
> > index 2312e6ee595f..88bab032a493 100644
> > --- a/arch/arm64/include/asm/page.h
> > +++ b/arch/arm64/include/asm/page.h
> > @@ -29,9 +29,8 @@ void copy_user_highpage(struct page *to, struct page *from,
> > void copy_highpage(struct page *to, struct page *from);
> > #define __HAVE_ARCH_COPY_HIGHPAGE
> >
> > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> > - unsigned long vaddr);
> > -#define vma_alloc_zeroed_movable_folio vma_alloc_zeroed_movable_folio
> > +#define vma_alloc_zeroed_movable_folio(vma, vaddr) \
> > + vma_alloc_folio(GFP_HIGHUSER_MOVABLE | __GFP_ZERO, 0, vma, vaddr, false)
> >
> > void tag_clear_highpage(struct page *to);
> > #define __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 79ce70fbb751..08f0904dbfc2 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -1071,6 +1071,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >
> > #endif /* CONFIG_ARM64_MTE */
> >
> > +#define __HAVE_ARCH_CALC_VMA_GFP
> > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp);
> > +
> > /*
> > * On AArch64, the cache coherency is handled via the set_pte_at() function.
> > */
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index 55f6455a8284..4d3f0a870ad8 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -937,22 +937,15 @@ void do_debug_exception(unsigned long addr_if_watchpoint, unsigned long esr,
> > NOKPROBE_SYMBOL(do_debug_exception);
> >
> > /*
> > - * Used during anonymous page fault handling.
> > + * If this is called during anonymous page fault handling, and the page is
> > + * mapped with PROT_MTE, initialise the tags at the point of tag zeroing as this
> > + * is usually faster than separate DC ZVA and STGM.
> > */
> > -struct folio *vma_alloc_zeroed_movable_folio(struct vm_area_struct *vma,
> > - unsigned long vaddr)
> > +gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
> > {
> > - gfp_t flags = GFP_HIGHUSER_MOVABLE | __GFP_ZERO;
> > -
> > - /*
> > - * If the page is mapped with PROT_MTE, initialise the tags at the
> > - * point of allocation and page zeroing as this is usually faster than
> > - * separate DC ZVA and STGM.
> > - */
> > if (vma->vm_flags & VM_MTE)
> > - flags |= __GFP_ZEROTAGS;
> > -
> > - return vma_alloc_folio(flags, 0, vma, vaddr, false);
> > + return __GFP_ZEROTAGS;
> > + return 0;
> > }
> >
> > void tag_clear_highpage(struct page *page)
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index c5ddec6b5305..98f81ca08cbe 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -901,6 +901,13 @@ static inline void arch_do_swap_page(struct mm_struct *mm,
> > }
> > #endif
> >
> > +#ifndef __HAVE_ARCH_CALC_VMA_GFP
> > +static inline gfp_t arch_calc_vma_gfp(struct vm_area_struct *vma, gfp_t gfp)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > #ifndef __HAVE_ARCH_FREE_PAGES_PREPARE
> > static inline void arch_free_pages_prepare(struct page *page, int order) { }
> > #endif
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 10a590ee1c89..f7ef52760b32 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2168,6 +2168,7 @@ struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
> > pgoff_t ilx;
> > struct page *page;
> >
> > + gfp |= arch_calc_vma_gfp(vma, gfp);
> > pol = get_vma_policy(vma, addr, order, &ilx);
> > page = alloc_pages_mpol(gfp | __GFP_COMP, order,
> > pol, ilx, numa_node_id());
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index d7c84ff62186..14427e9982f9 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1585,7 +1585,7 @@ static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> > */
> > static gfp_t limit_gfp_mask(gfp_t huge_gfp, gfp_t limit_gfp)
> > {
> > - gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM;
> > + gfp_t allowflags = __GFP_IO | __GFP_FS | __GFP_RECLAIM | __GFP_ZEROTAGS;
> > gfp_t denyflags = __GFP_NOWARN | __GFP_NORETRY;
> > gfp_t zoneflags = limit_gfp & GFP_ZONEMASK;
> > gfp_t result = huge_gfp & ~(allowflags | GFP_ZONEMASK);
> > @@ -2038,6 +2038,7 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
> > gfp_t huge_gfp;
> >
> > huge_gfp = vma_thp_gfp_mask(vma);
> > + huge_gfp |= arch_calc_vma_gfp(vma, huge_gfp);
> > huge_gfp = limit_gfp_mask(huge_gfp, gfp);
> > folio = shmem_alloc_and_add_folio(huge_gfp,
> > inode, index, fault_mm, true);
> > @@ -2214,6 +2215,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> > vm_fault_t ret = 0;
> > int err;
> >
> > + gfp |= arch_calc_vma_gfp(vmf->vma, gfp);
> > +
> > /*
> > * Trinity finds that probing a hole which tmpfs is punching can
> > * prevent the hole-punch from ever completing: noted in i_private.
Hi,
On Tue, Jan 30, 2024 at 10:50:00AM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Today, cma_alloc() is used to allocate a contiguous memory region. The
> > function allows the caller to specify the number of pages to allocate, but
> > not the starting address. cma_alloc() will walk over the entire CMA region
> > trying to allocate the first available range of the specified size.
> >
> > Introduce cma_alloc_range(), which makes CMA more versatile by allowing the
> > caller to specify a particular range in the CMA region, defined by the
> > start pfn and the size.
> >
> > arm64 will make use of this function when tag storage management will be
> > implemented: cma_alloc_range() will be used to reserve the tag storage
> > associated with a tagged page.
>
> Basically, you would like to pass on a preferred start address and the
> allocation could just fail if a contig range is not available from such
> a starting address ?
>
> Then why not just change cma_alloc() to take a new argument 'start_pfn'.
> Why create a new but almost similar allocator ?
I tried doing that, and I gave up because:
- It made cma_alloc() even more complex and hard to follow.
- What value should 'start_pfn' be to tell cma_alloc() that it should be
ignored? Or, to put it another way, what pfn number is invalid on **all**
platforms that Linux supports?
I can give it another go if we can come up with an invalid value for
'start_pfn'.
>
> But then I am wondering why this could not be done in the arm64 platform
> code itself operating on a CMA area reserved just for tag storage. Unless
> this new allocator has other usage beyond MTE, this could be implemented
> in the platform itself.
I had the same idea in the previous iteration; David Hildenbrand suggested
this approach [1].
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
Thanks,
Alex
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch.
> >
> > include/linux/cma.h | 2 +
> > include/trace/events/cma.h | 59 ++++++++++++++++++++++++++
> > mm/cma.c | 86 ++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 147 insertions(+)
> >
> > diff --git a/include/linux/cma.h b/include/linux/cma.h
> > index 63873b93deaa..e32559da6942 100644
> > --- a/include/linux/cma.h
> > +++ b/include/linux/cma.h
> > @@ -50,6 +50,8 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> > struct cma **res_cma);
> > extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
> > bool no_warn);
> > +extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> > + unsigned tries, gfp_t gfp);
> > extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
> > extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
> >
> > diff --git a/include/trace/events/cma.h b/include/trace/events/cma.h
> > index 25103e67737c..a89af313a572 100644
> > --- a/include/trace/events/cma.h
> > +++ b/include/trace/events/cma.h
> > @@ -36,6 +36,65 @@ TRACE_EVENT(cma_release,
> > __entry->count)
> > );
> >
> > +TRACE_EVENT(cma_alloc_range_start,
> > +
> > + TP_PROTO(const char *name, unsigned long start, unsigned long count,
> > + unsigned tries),
> > +
> > + TP_ARGS(name, start, count, tries),
> > +
> > + TP_STRUCT__entry(
> > + __string(name, name)
> > + __field(unsigned long, start)
> > + __field(unsigned long, count)
> > + __field(unsigned, tries)
> > + ),
> > +
> > + TP_fast_assign(
> > + __assign_str(name, name);
> > + __entry->start = start;
> > + __entry->count = count;
> > + __entry->tries = tries;
> > + ),
> > +
> > + TP_printk("name=%s start=%lx count=%lu tries=%u",
> > + __get_str(name),
> > + __entry->start,
> > + __entry->count,
> > + __entry->tries)
> > +);
> > +
> > +TRACE_EVENT(cma_alloc_range_finish,
> > +
> > + TP_PROTO(const char *name, unsigned long start, unsigned long count,
> > + unsigned attempts, int err),
> > +
> > + TP_ARGS(name, start, count, attempts, err),
> > +
> > + TP_STRUCT__entry(
> > + __string(name, name)
> > + __field(unsigned long, start)
> > + __field(unsigned long, count)
> > + __field(unsigned, attempts)
> > + __field(int, err)
> > + ),
> > +
> > + TP_fast_assign(
> > + __assign_str(name, name);
> > + __entry->start = start;
> > + __entry->count = count;
> > + __entry->attempts = attempts;
> > + __entry->err = err;
> > + ),
> > +
> > + TP_printk("name=%s start=%lx count=%lu attempts=%u err=%d",
> > + __get_str(name),
> > + __entry->start,
> > + __entry->count,
> > + __entry->attempts,
> > + __entry->err)
> > +);
> > +
> > TRACE_EVENT(cma_alloc_start,
> >
> > TP_PROTO(const char *name, unsigned long count, unsigned int align),
> > diff --git a/mm/cma.c b/mm/cma.c
> > index 543bb6b3be8e..4a0f68b9443b 100644
> > --- a/mm/cma.c
> > +++ b/mm/cma.c
> > @@ -416,6 +416,92 @@ static void cma_debug_show_areas(struct cma *cma)
> > static inline void cma_debug_show_areas(struct cma *cma) { }
> > #endif
> >
> > +/**
> > + * cma_alloc_range() - allocate pages in a specific range
> > + * @cma: Contiguous memory region for which the allocation is performed.
> > + * @start: Starting pfn of the allocation.
> > + * @count: Requested number of pages
> > + * @tries: Number of tries if the range is busy
> > + * @no_warn: Avoid printing message about failed allocation
> > + *
> > + * This function allocates part of contiguous memory from a specific contiguous
> > + * memory area, from the specified starting address. The 'start' pfn and the
> > + * 'count' number of pages must be aligned to the CMA bitmap order per bit.
> > + */
> > +int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> > + unsigned tries, gfp_t gfp)
> > +{
> > + unsigned long bitmap_maxno, bitmap_no, bitmap_start, bitmap_count;
> > + unsigned long i = 0;
> > + struct page *page;
> > + int err = -EINVAL;
> > +
> > + if (!cma || !cma->count || !cma->bitmap)
> > + goto out_stats;
> > +
> > + trace_cma_alloc_range_start(cma->name, start, count, tries);
> > +
> > + if (!count || start < cma->base_pfn ||
> > + start + count > cma->base_pfn + cma->count)
> > + goto out_stats;
> > +
> > + if (!IS_ALIGNED(start | count, 1 << cma->order_per_bit))
> > + goto out_stats;
> > +
> > + bitmap_start = (start - cma->base_pfn) >> cma->order_per_bit;
> > + bitmap_maxno = cma_bitmap_maxno(cma);
> > + bitmap_count = cma_bitmap_pages_to_bits(cma, count);
> > +
> > + spin_lock_irq(&cma->lock);
> > + bitmap_no = bitmap_find_next_zero_area(cma->bitmap, bitmap_maxno,
> > + bitmap_start, bitmap_count, 0);
> > + if (bitmap_no != bitmap_start) {
> > + spin_unlock_irq(&cma->lock);
> > + err = -EEXIST;
> > + goto out_stats;
> > + }
> > + bitmap_set(cma->bitmap, bitmap_start, bitmap_count);
> > + spin_unlock_irq(&cma->lock);
> > +
> > + for (i = 0; i < tries; i++) {
> > + mutex_lock(&cma_mutex);
> > + err = alloc_contig_range(start, start + count, MIGRATE_CMA, gfp);
> > + mutex_unlock(&cma_mutex);
> > +
> > + if (err != -EBUSY)
> > + break;
> > + }
> > +
> > + if (err) {
> > + cma_clear_bitmap(cma, start, count);
> > + } else {
> > + page = pfn_to_page(start);
> > +
> > + /*
> > + * CMA can allocate multiple page blocks, which results in
> > + * different blocks being marked with different tags. Reset the
> > + * tags to ignore those page blocks.
> > + */
> > + for (i = 0; i < count; i++)
> > + page_kasan_tag_reset(nth_page(page, i));
> > + }
> > +
> > +out_stats:
> > + trace_cma_alloc_range_finish(cma->name, start, count, i, err);
> > +
> > + if (err) {
> > + count_vm_events(CMA_ALLOC_FAIL, count);
> > + if (cma)
> > + cma_sysfs_account_fail_pages(cma, count);
> > + } else {
> > + count_vm_events(CMA_ALLOC_SUCCESS, count);
> > + cma_sysfs_account_success_pages(cma, count);
> > + }
> > +
> > + return err;
> > +}
> > +
> > +
> > /**
> > * cma_alloc() - allocate pages from contiguous area
> > * @cma: Contiguous memory region for which the allocation is performed.
>
Hi Peter,
On Mon, Jan 29, 2024 at 04:04:18PM -0800, Peter Collingbourne wrote:
> On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Reserve tag storage for a page that is being allocated as tagged. This
> > is a best effort approach, and failing to reserve tag storage is
> > allowed.
> >
> > When all the associated tagged pages have been freed, return the tag
> > storage pages back to the page allocator, where they can be used again for
> > data allocations.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * Based on rfc v2 patch #16 ("arm64: mte: Manage tag storage on page
> > allocation").
> > * Fixed calculation of the number of associated tag storage blocks (Hyesoo
> > Yu).
> > * Tag storage is reserved in arch_alloc_page() instead of
> > arch_prep_new_page().
> >
> > arch/arm64/include/asm/mte.h | 16 +-
> > arch/arm64/include/asm/mte_tag_storage.h | 31 +++
> > arch/arm64/include/asm/page.h | 5 +
> > arch/arm64/include/asm/pgtable.h | 19 ++
> > arch/arm64/kernel/mte_tag_storage.c | 234 +++++++++++++++++++++++
> > arch/arm64/mm/fault.c | 7 +
> > fs/proc/page.c | 1 +
> > include/linux/kernel-page-flags.h | 1 +
> > include/linux/page-flags.h | 1 +
> > include/trace/events/mmflags.h | 3 +-
> > mm/huge_memory.c | 1 +
> > 11 files changed, 316 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 8034695b3dd7..6457b7899207 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -40,12 +40,24 @@ void mte_free_tag_buf(void *buf);
> > #ifdef CONFIG_ARM64_MTE
> >
> > /* track which pages have valid allocation tags */
> > -#define PG_mte_tagged PG_arch_2
> > +#define PG_mte_tagged PG_arch_2
> > /* simple lock to avoid multiple threads tagging the same page */
> > -#define PG_mte_lock PG_arch_3
> > +#define PG_mte_lock PG_arch_3
> > +/* Track if a tagged page has tag storage reserved */
> > +#define PG_tag_storage_reserved PG_arch_4
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> > +extern bool page_tag_storage_reserved(struct page *page);
> > +#endif
> >
> > static inline void set_page_mte_tagged(struct page *page)
> > {
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > + /* Open code mte_tag_storage_enabled() */
> > + WARN_ON_ONCE(static_branch_likely(&tag_storage_enabled_key) &&
> > + !page_tag_storage_reserved(page));
> > +#endif
> > /*
> > * Ensure that the tags written prior to this function are visible
> > * before the page flags update.
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index 7b3f6bff8e6f..09f1318d924e 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -5,6 +5,12 @@
> > #ifndef __ASM_MTE_TAG_STORAGE_H
> > #define __ASM_MTE_TAG_STORAGE_H
> >
> > +#ifndef __ASSEMBLY__
> > +
> > +#include <linux/mm_types.h>
> > +
> > +#include <asm/mte.h>
> > +
> > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> >
> > DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> > @@ -15,6 +21,15 @@ static inline bool tag_storage_enabled(void)
> > }
> >
> > void mte_init_tag_storage(void);
> > +
> > +static inline bool alloc_requires_tag_storage(gfp_t gfp)
> > +{
> > + return gfp & __GFP_TAGGED;
> > +}
> > +int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> > +void free_tag_storage(struct page *page, int order);
> > +
> > +bool page_tag_storage_reserved(struct page *page);
> > #else
> > static inline bool tag_storage_enabled(void)
> > {
> > @@ -23,6 +38,22 @@ static inline bool tag_storage_enabled(void)
> > static inline void mte_init_tag_storage(void)
> > {
> > }
> > +static inline bool alloc_requires_tag_storage(struct page *page)
>
> This function should take a gfp_t to match the
> CONFIG_ARM64_MTE_TAG_STORAGE case.
Ah, yes, it should. Nice catch, the compiler didn't throw an error. Will
fix, thanks!
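For the record, the !CONFIG_ARM64_MTE_TAG_STORAGE stub will presumably end up
looking something like this (just a sketch):

	static inline bool alloc_requires_tag_storage(gfp_t gfp)
	{
		return false;
	}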
Alex
Hi,
On Tue, Jan 30, 2024 at 10:04:02AM +0530, Anshuman Khandual wrote:
>
>
> On 1/29/24 17:16, Alexandru Elisei wrote:
> > Hi,
> >
> > On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote:
> >>
> >>
> >> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
> >>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
> >>> because CMA is always allowed when __GFP_MOVABLE is set.
> >>>
> >>> With the introduction of the arch_alloc_cma() function, the above is not
> >>> true anymore, so bring back the filter.
> >>
> >> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being
> >> assigned to alloc_flags in gfp_to_alloc_flags_cma().
> >
> > Can I add your Reviewed-by tag then?
>
> I think all these changes need to be reviewed in their entirety
> even though some patches do look good on their own. For example
> this patch depends on whether [PATCH 03/35] is acceptable or not.
>
> I would suggest separating out CMA patches which could be debated
> and merged regardless of this series.
Ah, I see, makes sense. Since basically all the core mm changes are there
to enable dynamic tag storage for arm64, I'll hold on until the series
stabilises before separating the core mm from the arm64 patches.
Thanks,
Alex
Hi,
On Tue, Jan 30, 2024 at 10:22:11AM +0530, Anshuman Khandual wrote:
>
>
> On 1/29/24 17:21, Alexandru Elisei wrote:
> > Hi,
> >
> > On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
> >>
> >>
> >> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
> >>> after each cma_alloc() function call. This is done even though cma_alloc()
> >>> can allocate an arbitrary number of CMA pages. When looking at
> >>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls
> >>> doesn't tell much with regards to how many CMA pages were allocated via
> >>> cma_alloc() versus via the page allocator (regular allocation request or
> >>> PCP lists refill).
> >>>
> >>> This can also be rather confusing to a user who isn't familiar with the
> >>> code, since the unit of measurement for nr_free_cma is the number of pages,
> >>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
> >>> function calls.
> >>>
> >>> Let's make this consistent, and arguably more useful, by having
> >>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
> >>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
> >>> allocate.
> >>>
> >>> For users that wish to track the number of cma_alloc() calls, there are
> >>> tracepoints for that already implemented.
> >>>
> >>> Signed-off-by: Alexandru Elisei <[email protected]>
> >>> ---
> >>> mm/cma.c | 4 ++--
> >>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/mm/cma.c b/mm/cma.c
> >>> index f49c95f8ee37..dbf7fe8cb1bd 100644
> >>> --- a/mm/cma.c
> >>> +++ b/mm/cma.c
> >>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> >>> pr_debug("%s(): returned %p\n", __func__, page);
> >>> out:
> >>> if (page) {
> >>> - count_vm_event(CMA_ALLOC_SUCCESS);
> >>> + count_vm_events(CMA_ALLOC_SUCCESS, count);
> >>> cma_sysfs_account_success_pages(cma, count);
> >>> } else {
> >>> - count_vm_event(CMA_ALLOC_FAIL);
> >>> + count_vm_events(CMA_ALLOC_FAIL, count);
> >>> if (cma)
> >>> cma_sysfs_account_fail_pages(cma, count);
> >>> }
> >>
> >> Without getting into the merits of this patch - which is actually trying to do
> >> semantics change to /proc/vmstat, wondering how is this even related to this
> >> particular series ? If required this could be debated on it's on separately.
> >
> > Having the number of CMA pages allocated and the number of CMA pages freed
> > allows someone to infer how many tagged pages are in use at a given time:
>
> That should not be done in CMA which is a generic multi purpose allocator.
Ah, ok. Let me rephrase that: Having the number of CMA pages allocated, the
number of failed CMA page allocations and the number of freed CMA pages
allows someone to infer how many CMA pages are in use at a given time.
That's valuable information for software designers and system
administrators, as it allows them to tune the number of CMA pages available
in a system.
Or put another way: what would you consider to be more useful? Knowing the
number of cma_alloc()/cma_release() calls, or knowing the number of pages
that cma_alloc()/cma_release() allocated or freed?
>
> > (allocated CMA pages - CMA pages allocated by drivers* - CMA pages
> > released) * 32. That is valuable information for software and hardware
> > designers.
> >
> > Besides that, for every iteration of the series, this has proven invaluable
> > for discovering bugs with freeing and/or reserving tag storage pages.
>
> I am afraid that might not be enough justification for getting something
> merged mainline.
>
> >
> > *that would require userspace reading cma_alloc_success and
> > cma_release_success before any tagged allocations are performed.
>
> While assuming that no other non-memory-tagged CMA based allocation amd free
> call happens in the meantime ? That would be on real thin ice.
>
> I suppose arm64 tagged memory specific allocation or free related counters
> need to be created on the caller side, including arch_free_pages_prepare().
I'll think about this. At the very least, I can add tracepoints.
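Something along these lines, for example (the event name and fields below are
only illustrative, nothing is set in stone):

	TRACE_EVENT(mte_reserve_tag_storage,

		TP_PROTO(unsigned long pfn, int order),

		TP_ARGS(pfn, order),

		TP_STRUCT__entry(
			__field(unsigned long, pfn)
			__field(int, order)
		),

		TP_fast_assign(
			__entry->pfn = pfn;
			__entry->order = order;
		),

		TP_printk("pfn=0x%lx order=%d", __entry->pfn, __entry->order)
	);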
Thanks,
Alex
Hi,
On Tue, Jan 30, 2024 at 09:56:10AM +0530, Anshuman Khandual wrote:
>
>
> On 1/29/24 17:11, Alexandru Elisei wrote:
> > Hi,
> >
> > On Mon, Jan 29, 2024 at 11:18:59AM +0530, Anshuman Khandual wrote:
> >> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>> Extend the usefulness of arch_alloc_page() by adding the gfp_flags
> >>> parameter.
> >> Although the change here is harmless in itself, it will definitely benefit
> >> from some additional context explaining the rationale, taking into account
> >> why-how arch_alloc_page() got added particularly for s390 platform and how
> >> it's going to be used in the present proposal.
> > arm64 will use it to reserve tag storage if the caller requested a tagged
> > page. Right now that means that __GFP_ZEROTAGS is set in the gfp mask, but
> > I'll rename it to __GFP_TAGGED in patch #18 ("arm64: mte: Rename
> > __GFP_ZEROTAGS to __GFP_TAGGED") [1].
> >
> > [1] https://lore.kernel.org/lkml/[email protected]/
>
> Makes sense, but please do update the commit message explaining how
> new gfp mask argument will be used to detect tagged page allocation
> requests, further requiring tag storage allocation.
Will do, thanks!
Alex
On 1/30/24 17:27, Alexandru Elisei wrote:
> Hi,
>
> On Tue, Jan 30, 2024 at 10:04:02AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 1/29/24 17:16, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Mon, Jan 29, 2024 at 02:31:23PM +0530, Anshuman Khandual wrote:
>>>>
>>>>
>>>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>>>> The patch f945116e4e19 ("mm: page_alloc: remove stale CMA guard code")
>>>>> removed the CMA filter when allocating from the MIGRATE_MOVABLE pcp list
>>>>> because CMA is always allowed when __GFP_MOVABLE is set.
>>>>>
>>>>> With the introduction of the arch_alloc_cma() function, the above is not
>>>>> true anymore, so bring back the filter.
>>>>
>>>> This makes sense as arch_alloc_cma() now might prevent ALLOC_CMA being
>>>> assigned to alloc_flags in gfp_to_alloc_flags_cma().
>>>
>>> Can I add your Reviewed-by tag then?
>>
>> I think all these changes need to be reviewed in their entirety
>> even though some patches do look good on their own. For example
>> this patch depends on whether [PATCH 03/35] is acceptable or not.
>>
>> I would suggest separating out CMA patches which could be debated
>> and merged regardless of this series.
>
> Ah, I see, makes sense. Since basically all the core mm changes are there
> to enable dynamic tag storage for arm64, I'll hold on until the series
> stabilises before separating the core mm from the arm64 patches.
Fair enough, but could you please at least separate out this particular
patch right away and send it across:
mm: cma: Don't append newline when generating CMA area name
On 1/30/24 17:28, Alexandru Elisei wrote:
> Hi,
>
> On Tue, Jan 30, 2024 at 10:22:11AM +0530, Anshuman Khandual wrote:
>>
>> On 1/29/24 17:21, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
>>>>
>>>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>>>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
>>>>> after each cma_alloc() function call. This is done even though cma_alloc()
>>>>> can allocate an arbitrary number of CMA pages. When looking at
>>>>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls
>>>>> doesn't tell much with regards to how many CMA pages were allocated via
>>>>> cma_alloc() versus via the page allocator (regular allocation request or
>>>>> PCP lists refill).
>>>>>
>>>>> This can also be rather confusing to a user who isn't familiar with the
>>>>> code, since the unit of measurement for nr_free_cma is the number of pages,
>>>>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
>>>>> function calls.
>>>>>
>>>>> Let's make this consistent, and arguably more useful, by having
>>>>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
>>>>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
>>>>> allocate.
>>>>>
>>>>> For users that wish to track the number of cma_alloc() calls, there are
>>>>> tracepoints for that already implemented.
>>>>>
>>>>> Signed-off-by: Alexandru Elisei <[email protected]>
>>>>> ---
>>>>> mm/cma.c | 4 ++--
>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/mm/cma.c b/mm/cma.c
>>>>> index f49c95f8ee37..dbf7fe8cb1bd 100644
>>>>> --- a/mm/cma.c
>>>>> +++ b/mm/cma.c
>>>>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
>>>>> pr_debug("%s(): returned %p\n", __func__, page);
>>>>> out:
>>>>> if (page) {
>>>>> - count_vm_event(CMA_ALLOC_SUCCESS);
>>>>> + count_vm_events(CMA_ALLOC_SUCCESS, count);
>>>>> cma_sysfs_account_success_pages(cma, count);
>>>>> } else {
>>>>> - count_vm_event(CMA_ALLOC_FAIL);
>>>>> + count_vm_events(CMA_ALLOC_FAIL, count);
>>>>> if (cma)
>>>>> cma_sysfs_account_fail_pages(cma, count);
>>>>> }
>>>> Without getting into the merits of this patch - which is actually trying to do
>>>> semantics change to /proc/vmstat, wondering how is this even related to this
>>>> particular series ? If required this could be debated on it's on separately.
>>> Having the number of CMA pages allocated and the number of CMA pages freed
>>> allows someone to infer how many tagged pages are in use at a given time:
>> That should not be done in CMA which is a generic multi purpose allocator.
> Ah, ok. Let me rephrase that: Having the number of CMA pages allocated, the
> number of failed CMA page allocations and the number of freed CMA pages
> allows someone to infer how many CMA pages are in use at a given time.
> That's valuable information for software designers and system
> administrators, as it allows them to tune the number of CMA pages available
> in a system.
>
> Or put another way: what would you consider to be more useful? Knowing the
> number of cma_alloc()/cma_release() calls, or knowing the number of pages
> that cma_alloc()/cma_release() allocated or freed?
There is still value in knowing how many times cma_alloc() succeeded or failed,
regardless of the cumulative number of pages involved over time. Actually, the
count helps to understand how cma_alloc() performed overall as an allocator.
But on the cma_release() path there is no chance of failure, apart from the
caller itself providing a wrong input. So there are no corresponding
CMA_RELEASE_SUCCESS/CMA_RELEASE_FAIL vmstat counters in there - for a reason!
Coming back to CMA based pages being allocated and freed, there is already an
interface via sysfs (CONFIG_CMA_SYSFS) which gets updated in the cma_alloc()
path via cma_sysfs_account_success_pages() and cma_sysfs_account_fail_pages().
#ls /sys/kernel/mm/cma/<name>
alloc_pages_fail alloc_pages_success
Why could these counters not meet your requirements? Also 'struct cma' can
be updated to add an element 'nr_pages_freed' to be tracked in cma_release(),
providing a freed pages count as well.
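As a rough sketch, purely for illustration (the field and file names below are
made up, nothing like this exists today):

	/* mm/cma.h: add a counter to struct cma */
	unsigned long nr_pages_freed;

	/* mm/cma.c: bump it on the cma_release() success path,
	 * right after free_contig_range(pfn, count)
	 */
	cma->nr_pages_freed += count;

	/* mm/cma_sysfs.c: expose it as a read-only attribute next to
	 * alloc_pages_success/alloc_pages_fail, e.g.
	 * /sys/kernel/mm/cma/<name>/release_pages_success
	 */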
There are additional debugfs based elements (CONFIG_CMA_DEBUGFS) available as well.
#ls /sys/kernel/debug/cma/<name>
alloc base_pfn bitmap count free maxchunk order_per_bit used
On 1/29/24 17:23, Alexandru Elisei wrote:
> Hi,
>
> On Mon, Jan 29, 2024 at 03:01:24PM +0530, Anshuman Khandual wrote:
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> Similar to the two events that relate to CMA allocations, add the
>>> CMA_RELEASE_SUCCESS and CMA_RELEASE_FAIL events that count when CMA pages
>>> are freed.
>> How is this is going to be beneficial towards analyzing CMA alloc/release
>> behaviour - particularly with respect to this series. OR just adding this
>> from parity perspective with CMA alloc side counters ? Regardless this
>> CMA change too could be discussed separately.
> Added for parity and because it's useful for this series (see my reply to
> the previous patch where I discuss how I've used the counters).
As mentioned earlier, a new CONFIG_CMA_SYSFS element 'cma->nr_freed_pages'
could be instrumented in cma_release()'s success path for this purpose.
But again, the failure path is not of much value, as it could only happen
when the caller provides an invalid input, i.e. when the cma_pages_valid()
check fails.
On 1/30/24 17:05, Alexandru Elisei wrote:
> Hi,
>
> On Tue, Jan 30, 2024 at 10:50:00AM +0530, Anshuman Khandual wrote:
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> Today, cma_alloc() is used to allocate a contiguous memory region. The
>>> function allows the caller to specify the number of pages to allocate, but
>>> not the starting address. cma_alloc() will walk over the entire CMA region
>>> trying to allocate the first available range of the specified size.
>>>
>>> Introduce cma_alloc_range(), which makes CMA more versatile by allowing the
>>> caller to specify a particular range in the CMA region, defined by the
>>> start pfn and the size.
>>>
>>> arm64 will make use of this function when tag storage management will be
>>> implemented: cma_alloc_range() will be used to reserve the tag storage
>>> associated with a tagged page.
>> Basically, you would like to pass on a preferred start address and the
>> allocation could just fail if a contig range is not available from such
>> a starting address ?
>>
>> Then why not just change cma_alloc() to take a new argument 'start_pfn'.
>> Why create a new but almost similar allocator ?
> I tried doing that, and I gave up because:
>
> - It made cma_alloc() even more complex and hard to follow.
>
> - What value should 'start_pfn' be to tell cma_alloc() that it should be
> ignored? Or, to put it another way, what pfn number is invalid on **all**
> platforms that Linux supports?
>
> I can give it another go if we can come up with an invalid value for
> 'start_pfn'.
Something negative might work. How about -1/-1UL ? A quick search gives
some instances such as ...
git grep "pfn == -1"
mm/mm_init.c: if (*start_pfn == -1UL)
mm/vmscan.c: if (pfn == -1)
mm/vmscan.c: if (pfn == -1)
mm/vmscan.c: if (pfn == -1)
tools/testing/selftests/mm/hugepage-vmemmap.c: if (pfn == -1UL) {
Could -1UL not be abstracted as a common macro MM_INVALID_PFN to be used in
such scenarios, including here?
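For illustration only, a minimal sketch of what that could look like in
cma_alloc() (the macro and the extra parameter are hypothetical, the names are
just placeholders):

	#define MM_INVALID_PFN	(~0UL)

	/* in cma_alloc(..., unsigned long start_pfn): */
	if (start_pfn != MM_INVALID_PFN) {
		/* The caller asked for a fixed range. */
		if (start_pfn < cma->base_pfn ||
		    start_pfn + count > cma->base_pfn + cma->count)
			return NULL;
		bitmap_start = (start_pfn - cma->base_pfn) >> cma->order_per_bit;
	} else {
		/* No preference, search the whole area as today. */
		bitmap_start = 0;
	}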
>
>> But then I am wondering why this could not be done in the arm64 platform
>> code itself operating on a CMA area reserved just for tag storage. Unless
>> this new allocator has other usage beyond MTE, this could be implemented
>> in the platform itself.
> I had the same idea in the previous iteration, David Hildenbrand suggested
> this approach [1].
>
> [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
There are two different cma_alloc() proposals here - including the next
patch, i.e. mm: cma: Fast track allocating memory when the pages are free:
1) Augment cma_alloc() or add cma_alloc_range() with a start_pfn parameter
2) Speed up cma_alloc() for small allocation requests when pages are free
The second one, if separated out from this series, could be considered on
its own, as it will help all existing cma_alloc() callers. The first one
definitely needs a use case, as provided in this series.
On 1/30/24 17:04, Alexandru Elisei wrote:
> Hi,
>
> On Tue, Jan 30, 2024 at 03:25:20PM +0530, Anshuman Khandual wrote:
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
>>> When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
>>> the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
>>> set in vma_alloc_zeroed_movable_folio().
>>>
>>> Expand this to be more generic by adding an arch hook that modifes the gfp
>>> flags for an allocation when the VMA is known.
>>>
>>> Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
>>> is also set; from that point of view, the current behaviour is unchanged,
>>> even though the arm64 flag is set in more places. When arm64 will have
>>> support to reuse the tag storage for data allocation, the uses of the
>>> __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
>>> to reserve the corresponding tag storage for the pages being allocated.
>> Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further
>> down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining
>> (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because
> I'm afraid I don't follow you.
I was just asking whether the overall scope of the __GFP_ZEROTAGS flag is being
increased to cover more core MM paths through this patch. I think you have
already answered that below.
>
>> the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS
>> for any additional stuff.
>>
>> OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not
>> the case earlier via this call back.
> Before this patch: vma_alloc_zeroed_movable_folio() sets __GFP_ZEROTAGS.
> After this patch: vma_alloc_folio() sets __GFP_ZEROTAGS.
Understood.
>
> This patch is about adding __GFP_ZEROTAGS for more callers.
Right, I guess that is the real motivation for this patch. But just wondering,
does this cover all possible anonymous fault paths for converting a VMA's
VM_MTE flag into the page allocation flag __GFP_ZEROTAGS? Aren't there any
other files besides mm/shmem.c which need to be changed to include
arch_calc_vma_gfp()?
Hi,
On Wed, Jan 31, 2024 at 12:23:51PM +0530, Anshuman Khandual wrote:
>
>
> On 1/30/24 17:04, Alexandru Elisei wrote:
> > Hi,
> >
> > On Tue, Jan 30, 2024 at 03:25:20PM +0530, Anshuman Khandual wrote:
> >>
> >> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>> arm64 uses VM_HIGH_ARCH_0 and VM_HIGH_ARCH_1 for enabling MTE for a VMA.
> >>> When VM_HIGH_ARCH_0, which arm64 renames to VM_MTE, is set for a VMA, and
> >>> the gfp flag __GFP_ZERO is present, the __GFP_ZEROTAGS gfp flag also gets
> >>> set in vma_alloc_zeroed_movable_folio().
> >>>
> >>> Expand this to be more generic by adding an arch hook that modifes the gfp
> >>> flags for an allocation when the VMA is known.
> >>>
> >>> Note that __GFP_ZEROTAGS is ignored by the page allocator unless __GFP_ZERO
> >>> is also set; from that point of view, the current behaviour is unchanged,
> >>> even though the arm64 flag is set in more places. When arm64 will have
> >>> support to reuse the tag storage for data allocation, the uses of the
> >>> __GFP_ZEROTAGS flag will be expanded to instruct the page allocator to try
> >>> to reserve the corresponding tag storage for the pages being allocated.
> >> Right but how will pushing __GFP_ZEROTAGS addition into gfp_t flags further
> >> down via a new arch call back i.e arch_calc_vma_gfp() while still maintaining
> >> (vma->vm_flags & VM_MTE) conditionality improve the current scenario. Because
> > I'm afraid I don't follow you.
>
> I was just asking whether the overall scope of __GFP_ZEROTAGS flag is being
> increased to cover more core MM paths through this patch. I think you have
> already answered that below.
>
> >
> >> the page allocator could have still analyzed alloc flags for __GFP_ZEROTAGS
> >> for any additional stuff.
> >>
> >> OR this just adds some new core MM paths to get __GFP_ZEROTAGS which was not
> >> the case earlier via this call back.
> > Before this patch: vma_alloc_zeroed_movable_folio() sets __GFP_ZEROTAGS.
> > After this patch: vma_alloc_folio() sets __GFP_ZEROTAGS.
>
> Understood.
>
> >
> > This patch is about adding __GFP_ZEROTAGS for more callers.
>
> Right, I guess that is the real motivation for this patch. But just wondering
> does this cover all possible anon fault paths for converting given vma_flag's
> VM_MTE flag into page alloc flag __GFP_ZEROTAGS ? Aren't there any other file
> besides (mm/shmem.c) which needs to be changed to include arch_calc_vma_gfp() ?
My thoughts exactly. I went through most of the fault handling code, and
from the code I read, all the allocations are performed with
vma_alloc_folio() or by shmem.
That's not to say there's no scope for improvement, there definitely is. But
since having __GFP_ZEROTAGS isn't necessary for correctness (although it's
very useful for performance, since it can avoid a page fault and a page
migration) and this series is an RFC, I settled on changing only the above,
especially since KVM support for dynamic tag storage also benefits from this
change.
The series is very big already, and I wanted to settle on an approach that is
acceptable for upstreaming before thinking too much about performance.
Thanks,
Alex
On 1/30/24 17:03, Alexandru Elisei wrote:
> Hi,
>
> I really appreciate the feedback you have given me so far. I believe the
> commit message isn't clear enough and there has been a confusion.
>
> A CMA user adds a CMA area to the cma_areas array with
> cma_declare_contiguous_nid() or cma_init_reserved_mem().
> init_cma_reserved_pageblock() then iterates over the array and activates
> all cma areas.
Agreed.
>
> The function cma_remove_mem() is intended to be used to remove a cma area
> from the cma_areas array **before** the area has been activated.
Understood.
>
> Usecase: a driver (in this case, the arm64 dynamic tag storage code)
> manages several cma areas. The driver successfully adds the first area to
> the cma_areas array. When the driver tries to adds the second area, the
> function fails. Without cma_remove_mem(), the driver has no way to prevent
> the first area from being freed to the page allocator. cma_remove_mem() is
> about providing a means to do cleanup in case of error.
>
> Does that make more sense now?
How do you ensure that cma_remove_mem() gets called by the driver before the
core_initcall()---> cma_init_reserved_areas()---> cma_activate_area() chain
happens? Otherwise cma_remove_mem() will miss clearing cma->count and the
given area will proceed to get activated as always.
>
> Ok Tue, Jan 30, 2024 at 11:20:56AM +0530, Anshuman Khandual wrote:
>>
>>
>> On 1/25/24 22:12, Alexandru Elisei wrote:
>>> Memory is added to CMA with cma_declare_contiguous_nid() and
>>> cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in
>>> cma_init_reserved_areas(), where the page allocator can make use of it.
>>
>> cma_declare_contiguous_nid() reserves memory in memblock and marks the
>
> You forgot about about cma_init_reserved_mem() which does the same thing,
> but yes, you are right.
Agreed, missed that. There are some direct cma_init_reserved_mem() calls as well.
>
>> for subsequent CMA usage, where as cma_init_reserved_areas() activates
>> these memory areas through init_cma_reserved_pageblock(). Standard page
>> allocator only receives these memory via free_reserved_page() - only if
>
> I don't think that's correct. init_cma_reserved_pageblock() clears the
> PG_reserved page flag, sets the migratetype to MIGRATE_CMA and then frees
> the page. After that, the page is available to the standard page allocator
> to use for allocation. Otherwise, what would be the point of the
> MIGRATE_CMA migratetype?
Understood and agreed.
>
>> the page block activation fails.
>
> For the sake of having a complete picture, I'll add that that only happens
> if cma->reserve_pages_on_error is false. If the CMA user sets the field to
> 'true' (with cma_reserve_pages_on_error()), then the pages in the CMA
> region are kept PG_reserved if activation fails.
Why can't you use cma_reserve_pages_on_error()?
>
>>
>>>
>>> If a device manages multiple CMA areas, and there's an error when one of
>>> the areas is added to CMA, there is no mechanism for the device to prevent
>>
>> What kind of error ? init_cma_reserved_pageblock() fails ? But that will
>> not happen until cma_init_reserved_areas().
>
> I think I haven't been clear enough. When I say that "an area is added
> to CMA", I mean that the memory region is added to cma_areas array, via
> cma_declare_contiguous_nid() or cma_init_reserved_mem(). There are several
> ways in which either function can fail.
Okay.
>
>>
>>> the rest of the areas, which were added before the error occured, from
>>> being later added to the MIGRATE_CMA list.
>>
>> Why is this mechanism required ? cma_init_reserved_areas() scans over all
>> CMA areas and try and activate each of them sequentially. Why is not this
>> sufficient ?
>
> This patch is about removing a struct cma from the cma_areas array after it
> has been added to the array, with cma_declare_contiguous_nid() or
> cma_init_reserved_mem(), to prevent the area from being activated in
> cma_init_reserved_areas(). Sorry for the confusion.
>
> I'll add a check in cma_remove_mem() to fail if the cma area has been
> activated, and a comment to the function to explain its usage.
That will be a good check.
>
>>
>>>
>>> Add cma_remove_mem() which allows a previously reserved CMA area to be
>>> removed and thus it cannot be used by the page allocator.
>>
>> Successfully activated CMA areas do not get used by the buddy allocator.
>
> I don't believe that is correct, see above.
Apologies, it's my bad.
>
>>
>>>
>>> Signed-off-by: Alexandru Elisei <[email protected]>
>>> ---
>>>
>>> Changes since rfc v2:
>>>
>>> * New patch.
>>>
>>> include/linux/cma.h | 1 +
>>> mm/cma.c | 30 +++++++++++++++++++++++++++++-
>>> 2 files changed, 30 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/cma.h b/include/linux/cma.h
>>> index e32559da6942..787cbec1702e 100644
>>> --- a/include/linux/cma.h
>>> +++ b/include/linux/cma.h
>>> @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
>>> unsigned int order_per_bit,
>>> const char *name,
>>> struct cma **res_cma);
>>> +extern void cma_remove_mem(struct cma **res_cma);
>>> extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
>>> bool no_warn);
>>> extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
>>> diff --git a/mm/cma.c b/mm/cma.c
>>> index 4a0f68b9443b..2881bab12b01 100644
>>> --- a/mm/cma.c
>>> +++ b/mm/cma.c
>>> @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void)
>>> {
>>> int i;
>>>
>>> - for (i = 0; i < cma_area_count; i++)
>>> + for (i = 0; i < cma_area_count; i++) {
>>> + /* Region was removed. */
>>> + if (!cma_areas[i].count)
>>> + continue;
>>
>> Skip previously added CMA area (now zeroed out) ?
>
> Yes, that's what I meant with the comment "Region was removed". Do you
> think I should reword the comment?
>
>>
>>> cma_activate_area(&cma_areas[i]);
>>> + }
>>>
>>> return 0;
>>> }
>>
>> cma_init_reserved_areas() gets called via core_initcall(). Some how
>> platform/device needs to call cma_remove_mem() before core_initcall()
>> gets called ? This might be time sensitive.
>
> I don't understand your point.
>
>>
>>> @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
>>> return 0;
>>> }
>>>
>>> +/**
>>> + * cma_remove_mem() - remove cma area
>>> + * @res_cma: Pointer to the cma region.
>>> + *
>>> + * This function removes a cma region created with cma_init_reserved_mem(). The
>>> + * ->count is set to 0.
>>> + */
>>> +void __init cma_remove_mem(struct cma **res_cma)
>>> +{
>>> + struct cma *cma;
>>> +
>>> + if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
>>> + return;
>>> +
>>> + cma = *res_cma;
>>> + if (WARN_ON_ONCE(!cma->count))
>>> + return;
>>> +
>>> + totalcma_pages -= cma->count;
>>> + cma->count = 0;
>>> +
>>> + *res_cma = NULL;
>>> +}
>>> +
>>> /**
>>> * cma_declare_contiguous_nid() - reserve custom contiguous area
>>> * @base: Base address of the reserved area optional, use 0 for any
>>
>> But first please do explain what are the errors device or platform might
>
> cma_declare_contiguous_nid() and cma_init_reserved_mem() can fail in a
> number of ways, the code should be self documenting.
But when they do fail, wouldn't cma->count be left as 0? Hence the proposed
check (!cma->count) in cma_init_reserved_areas() should do the trick without
requiring an explicit cma_remove_mem() call.
>
>> see on a previously marked CMA area so that removing them on way becomes
>> necessary preventing their activation via cma_init_reserved_areas().
>
> I've described how the function is supposed to be used at the top of my
> reply.
>
> Thanks,
> Alex
Hi,
On Wed, Jan 31, 2024 at 10:10:05AM +0530, Anshuman Khandual wrote:
>
>
> On 1/30/24 17:28, Alexandru Elisei wrote:
> > Hi,
> >
> > On Tue, Jan 30, 2024 at 10:22:11AM +0530, Anshuman Khandual wrote:
> >>
> >> On 1/29/24 17:21, Alexandru Elisei wrote:
> >>> Hi,
> >>>
> >>> On Mon, Jan 29, 2024 at 02:54:20PM +0530, Anshuman Khandual wrote:
> >>>>
> >>>> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>>>> The CMA_ALLOC_SUCCESS, respectively CMA_ALLOC_FAIL, are increased by one
> >>>>> after each cma_alloc() function call. This is done even though cma_alloc()
> >>>>> can allocate an arbitrary number of CMA pages. When looking at
> >>>>> /proc/vmstat, the number of successful (or failed) cma_alloc() calls
> >>>>> doesn't tell much with regards to how many CMA pages were allocated via
> >>>>> cma_alloc() versus via the page allocator (regular allocation request or
> >>>>> PCP lists refill).
> >>>>>
> >>>>> This can also be rather confusing to a user who isn't familiar with the
> >>>>> code, since the unit of measurement for nr_free_cma is the number of pages,
> >>>>> but cma_alloc_success and cma_alloc_fail count the number of cma_alloc()
> >>>>> function calls.
> >>>>>
> >>>>> Let's make this consistent, and arguably more useful, by having
> >>>>> CMA_ALLOC_SUCCESS count the number of successfully allocated CMA pages, and
> >>>>> CMA_ALLOC_FAIL count the number of pages the cma_alloc() failed to
> >>>>> allocate.
> >>>>>
> >>>>> For users that wish to track the number of cma_alloc() calls, there are
> >>>>> tracepoints for that already implemented.
> >>>>>
> >>>>> Signed-off-by: Alexandru Elisei <[email protected]>
> >>>>> ---
> >>>>> mm/cma.c | 4 ++--
> >>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/mm/cma.c b/mm/cma.c
> >>>>> index f49c95f8ee37..dbf7fe8cb1bd 100644
> >>>>> --- a/mm/cma.c
> >>>>> +++ b/mm/cma.c
> >>>>> @@ -517,10 +517,10 @@ struct page *cma_alloc(struct cma *cma, unsigned long count,
> >>>>> pr_debug("%s(): returned %p\n", __func__, page);
> >>>>> out:
> >>>>> if (page) {
> >>>>> - count_vm_event(CMA_ALLOC_SUCCESS);
> >>>>> + count_vm_events(CMA_ALLOC_SUCCESS, count);
> >>>>> cma_sysfs_account_success_pages(cma, count);
> >>>>> } else {
> >>>>> - count_vm_event(CMA_ALLOC_FAIL);
> >>>>> + count_vm_events(CMA_ALLOC_FAIL, count);
> >>>>> if (cma)
> >>>>> cma_sysfs_account_fail_pages(cma, count);
> >>>>> }
> >>>> Without getting into the merits of this patch - which is actually trying to do
> >>>> semantics change to /proc/vmstat, wondering how is this even related to this
> >>>> particular series ? If required this could be debated on it's on separately.
> >>> Having the number of CMA pages allocated and the number of CMA pages freed
> >>> allows someone to infer how many tagged pages are in use at a given time:
> >> That should not be done in CMA which is a generic multi purpose allocator.
>
> > Ah, ok. Let me rephrase that: Having the number of CMA pages allocated, the
> > number of failed CMA page allocations and the number of freed CMA pages
> > allows someone to infer how many CMA pages are in use at a given time.
> > That's valuable information for software designers and system
> > administrators, as it allows them to tune the number of CMA pages available
> > in a system.
> >
> > Or put another way: what would you consider to be more useful? Knowing the
> > number of cma_alloc()/cma_release() calls, or knowing the number of pages
> > that cma_alloc()/cma_release() allocated or freed?
>
> There is still value in knowing how many times cma_alloc() succeeded or failed
> regardless of the cumulative number pages involved over the time. Actually the
> count helps to understand how cma_alloc() performed overall as an allocator.
>
> But on the cma_release() path there is no chances of failure apart from - just
> when the caller itself provides an wrong input. So there are no corresponding
> CMA_RELEASE_SUCCESS/CMA_RELEASE_FAIL vmstat counters in there - for a reason !
>
> Coming back to CMA based pages being allocated and freed, there is already an
> interface via sysfs (CONFIG_CMA_SYSFS) which gets updated in cma_alloc() path
> via cma_sysfs_account_success_pages() and cma_sysfs_account_fail_pages().
>
> #ls /sys/kernel/mm/cma/<name>
> alloc_pages_fail alloc_pages_success
>
> Why these counters could not meet your requirements ? Also 'struct cma' can
> be updated to add an element 'nr_pages_freed' to be tracked in cma_release(),
> providing free pages count as well.
>
> There are additional debug fs based elements (CONFIG_CMA_DEBUGFS) available.
>
> #ls /sys/kernel/debug/cma/<name>
> alloc base_pfn bitmap count free maxchunk order_per_bit used
Ok, I'll have a look at those, thank you for the suggestion.
Thanks,
Alex
Hi,
On Wed, Jan 31, 2024 at 06:49:34PM +0530, Anshuman Khandual wrote:
> On 1/30/24 17:03, Alexandru Elisei wrote:
> > Hi,
> >
> > I really appreciate the feedback you have given me so far. I believe the
> > commit message isn't clear enough and there has been a confusion.
> >
> > A CMA user adds a CMA area to the cma_areas array with
> > cma_declare_contiguous_nid() or cma_init_reserved_mem().
> > init_cma_reserved_pageblock() then iterates over the array and activates
> > all cma areas.
>
> Agreed.
>
> >
> > The function cma_remove_mem() is intended to be used to remove a cma area
> > from the cma_areas array **before** the area has been activated.
>
> Understood.
>
> >
> > Usecase: a driver (in this case, the arm64 dynamic tag storage code)
> > manages several cma areas. The driver successfully adds the first area to
> > the cma_areas array. When the driver tries to adds the second area, the
> > function fails. Without cma_remove_mem(), the driver has no way to prevent
> > the first area from being freed to the page allocator. cma_remove_mem() is
> > about providing a means to do cleanup in case of error.
> >
> > Does that make more sense now?
>
> How to ensure that cma_remove_mem() should get called by the driver before
> core_initcall()---> cma_init_reserved_areas()---> cma_activate_area() chain
> happens. Else cma_remove_mem() will miss out to clear cma->count and given
> area will proceed to get activated like always.
The same way drivers today call cma_declare_contiguous_nid() and
cma_init_reserved_mem() before cma_init_reserved_areas(). For example,
have a look at kernel/dma/contiguous.c::rmem_cma_setup().
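Roughly, the boot-time ordering looks like this (simplified, for illustration):

	setup_arch()
	  early_init_fdt_scan_reserved_mem()
	    rmem_cma_setup() -> cma_init_reserved_mem()
	/* arch/driver code can call cma_remove_mem() anywhere in between */
	core_initcall()
	  cma_init_reserved_areas() -> cma_activate_area()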
As for how the series uses cma_remove_mem(), have a look at patch #20
("arm64: mte: Add tag storage memory to CMA") [1], specifically this bit:
	for (i = 0; i < num_tag_regions; i++) {
		region = &tag_regions[i];
		// code removed for clarity
		ret = cma_init_reserved_mem(PFN_PHYS(region->tag_range.start),
					    PFN_PHYS(range_len(&region->tag_range)),
					    order, NULL, &region->cma);
		if (ret) {
			for (j = 0; j < i; j++)
				cma_remove_mem(&region->cma);
			goto out_disabled;
		}
	}

	// code removed for clarity

out_disabled:
	num_tag_regions = 0;
	pr_info("MTE tag storage region management disabled");
I'll try to walk you through it. The driver manages 2 cma regions.
cma_init_reserved_mem() succeeds for the first region.
cma_init_reserved_mem() fails for the second region.
As a result, the first region will be activated (pages will be placed on
the MIGRATE_CMA list), but the second region will not be activated.
The driver can function only when **all** cma regions have been
successfully activated.
The driver removes the first region from CMA, so now no regions will be activated,
and probing fails.
In a more general sense, cma_remove_mem() is **not** about removing a
region that failed initialization or activation; it's about removing a cma
area that was added to cma_areas successfully, but which the driver no longer
wants to activate, for whatever reason (it can be because of a probing
error totally unrelated to CMA).
Does it make more sense now? I hope that this example also answers the rest
of your questions.
[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
Thanks,
Alex
>
> >
> > Ok Tue, Jan 30, 2024 at 11:20:56AM +0530, Anshuman Khandual wrote:
> >>
> >>
> >> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>> Memory is added to CMA with cma_declare_contiguous_nid() and
> >>> cma_init_reserved_mem(). This memory is then put on the MIGRATE_CMA list in
> >>> cma_init_reserved_areas(), where the page allocator can make use of it.
> >>
> >> cma_declare_contiguous_nid() reserves memory in memblock and marks the
> >
> > You forgot about about cma_init_reserved_mem() which does the same thing,
> > but yes, you are right.
>
> Agreed, missed that. There are some direct cma_init_reserved_mem() calls as well.
>
> >
> >> for subsequent CMA usage, where as cma_init_reserved_areas() activates
> >> these memory areas through init_cma_reserved_pageblock(). Standard page
> >> allocator only receives these memory via free_reserved_page() - only if
> >
> > I don't think that's correct. init_cma_reserved_pageblock() clears the
> > PG_reserved page flag, sets the migratetype to MIGRATE_CMA and then frees
> > the page. After that, the page is available to the standard page allocator
> > to use for allocation. Otherwise, what would be the point of the
> > MIGRATE_CMA migratetype?
>
> Understood and agreed.
>
> >
> >> the page block activation fails.
> >
> > For the sake of having a complete picture, I'll add that that only happens
> > if cma->reserve_pages_on_error is false. If the CMA user sets the field to
> > 'true' (with cma_reserve_pages_on_error()), then the pages in the CMA
> > region are kept PG_reserved if activation fails.
>
> Why cannot you use cma_reserve_pages_on_error() ?
>
> >
> >>
> >>>
> >>> If a device manages multiple CMA areas, and there's an error when one of
> >>> the areas is added to CMA, there is no mechanism for the device to prevent
> >>
> >> What kind of error ? init_cma_reserved_pageblock() fails ? But that will
> >> not happen until cma_init_reserved_areas().
> >
> > I think I haven't been clear enough. When I say that "an area is added
> > to CMA", I mean that the memory region is added to cma_areas array, via
> > cma_declare_contiguous_nid() or cma_init_reserved_mem(). There are several
> > ways in which either function can fail.
>
> Okay.
>
> >
> >>
> >>> the rest of the areas, which were added before the error occured, from
> >>> being later added to the MIGRATE_CMA list.
> >>
> >> Why is this mechanism required ? cma_init_reserved_areas() scans over all
> >> CMA areas and try and activate each of them sequentially. Why is not this
> >> sufficient ?
> >
> > This patch is about removing a struct cma from the cma_areas array after it
> > has been added to the array, with cma_declare_contiguous_nid() or
> > cma_init_reserved_mem(), to prevent the area from being activated in
> > cma_init_reserved_areas(). Sorry for the confusion.
> >
> > I'll add a check in cma_remove_mem() to fail if the cma area has been
> > activated, and a comment to the function to explain its usage.
>
> That will be a good check.
>
> >
> >>
> >>>
> >>> Add cma_remove_mem() which allows a previously reserved CMA area to be
> >>> removed and thus it cannot be used by the page allocator.
> >>
> >> Successfully activated CMA areas do not get used by the buddy allocator.
> >
> > I don't believe that is correct, see above.
> Apologies, it's my bad.
>
> >
> >>
> >>>
> >>> Signed-off-by: Alexandru Elisei <[email protected]>
> >>> ---
> >>>
> >>> Changes since rfc v2:
> >>>
> >>> * New patch.
> >>>
> >>> include/linux/cma.h | 1 +
> >>> mm/cma.c | 30 +++++++++++++++++++++++++++++-
> >>> 2 files changed, 30 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/include/linux/cma.h b/include/linux/cma.h
> >>> index e32559da6942..787cbec1702e 100644
> >>> --- a/include/linux/cma.h
> >>> +++ b/include/linux/cma.h
> >>> @@ -48,6 +48,7 @@ extern int cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> >>> unsigned int order_per_bit,
> >>> const char *name,
> >>> struct cma **res_cma);
> >>> +extern void cma_remove_mem(struct cma **res_cma);
> >>> extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
> >>> bool no_warn);
> >>> extern int cma_alloc_range(struct cma *cma, unsigned long start, unsigned long count,
> >>> diff --git a/mm/cma.c b/mm/cma.c
> >>> index 4a0f68b9443b..2881bab12b01 100644
> >>> --- a/mm/cma.c
> >>> +++ b/mm/cma.c
> >>> @@ -147,8 +147,12 @@ static int __init cma_init_reserved_areas(void)
> >>> {
> >>> int i;
> >>>
> >>> - for (i = 0; i < cma_area_count; i++)
> >>> + for (i = 0; i < cma_area_count; i++) {
> >>> + /* Region was removed. */
> >>> + if (!cma_areas[i].count)
> >>> + continue;
> >>
> >> Skip previously added CMA area (now zeroed out) ?
> >
> > Yes, that's what I meant with the comment "Region was removed". Do you
> > think I should reword the comment?
> >
> >>
> >>> cma_activate_area(&cma_areas[i]);
> >>> + }
> >>>
> >>> return 0;
> >>> }
> >>
> >> cma_init_reserved_areas() gets called via core_initcall(). Some how
> >> platform/device needs to call cma_remove_mem() before core_initcall()
> >> gets called ? This might be time sensitive.
> >
> > I don't understand your point.
> >
> >>
> >>> @@ -216,6 +220,30 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size,
> >>> return 0;
> >>> }
> >>>
> >>> +/**
> >>> + * cma_remove_mem() - remove cma area
> >>> + * @res_cma: Pointer to the cma region.
> >>> + *
> >>> + * This function removes a cma region created with cma_init_reserved_mem(). The
> >>> + * ->count is set to 0.
> >>> + */
> >>> +void __init cma_remove_mem(struct cma **res_cma)
> >>> +{
> >>> + struct cma *cma;
> >>> +
> >>> + if (WARN_ON_ONCE(!res_cma || !(*res_cma)))
> >>> + return;
> >>> +
> >>> + cma = *res_cma;
> >>> + if (WARN_ON_ONCE(!cma->count))
> >>> + return;
> >>> +
> >>> + totalcma_pages -= cma->count;
> >>> + cma->count = 0;
> >>> +
> >>> + *res_cma = NULL;
> >>> +}
> >>> +
> >>> /**
> >>> * cma_declare_contiguous_nid() - reserve custom contiguous area
> >>> * @base: Base address of the reserved area optional, use 0 for any
> >>
> >> But first please do explain what are the errors device or platform might
> >
> > cma_declare_contiguous_nid() and cma_init_reserved_mem() can fail in a
> > number of ways, the code should be self documenting.
>
> But when they do fail - would not cma->count be left uninitialized as 0 ?
> Hence the proposed check (!cma->count) in cma_init_reserved_areas() should
> just do the trick without requiring an explicit cma_remove_mem() call.
>
> >
> >> see on a previously marked CMA area so that removing them becomes
> >> necessary, preventing their activation via cma_init_reserved_areas().
> >
> > I've described how the function is supposed to be used at the top of my
> > reply.
> >
> > Thanks,
> > Alex
>
Hi,
On Wed, Jan 31, 2024 at 11:54:17AM +0530, Anshuman Khandual wrote:
>
>
> On 1/30/24 17:05, Alexandru Elisei wrote:
> > Hi,
> >
> > On Tue, Jan 30, 2024 at 10:50:00AM +0530, Anshuman Khandual wrote:
> >>
> >> On 1/25/24 22:12, Alexandru Elisei wrote:
> >>> Today, cma_alloc() is used to allocate a contiguous memory region. The
> >>> function allows the caller to specify the number of pages to allocate, but
> >>> not the starting address. cma_alloc() will walk over the entire CMA region
> >>> trying to allocate the first available range of the specified size.
> >>>
> >>> Introduce cma_alloc_range(), which makes CMA more versatile by allowing the
> >>> caller to specify a particular range in the CMA region, defined by the
> >>> start pfn and the size.
> >>>
> >>> arm64 will make use of this function when tag storage management will be
> >>> implemented: cma_alloc_range() will be used to reserve the tag storage
> >>> associated with a tagged page.
> >> Basically, you would like to pass on a preferred start address and the
> >> allocation could just fail if a contig range is not available from such
> >> a starting address ?
> >>
> >> Then why not just change cma_alloc() to take a new argument 'start_pfn'.
> >> Why create a new but almost similar allocator ?
> > I tried doing that, and I gave up because:
> >
> > - It made cma_alloc() even more complex and hard to follow.
> >
> > - What value should 'start_pfn' be to tell cma_alloc() that it should be
> > ignored? Or, to put it another way, what pfn number is invalid on **all**
> > platforms that Linux supports?
> >
> > I can give it another go if we can come up with an invalid value for
> > 'start_pfn'.
>
> Something negative might work. How about -1/-1UL ? A quick search gives
> some instances such as ...
>
> git grep "pfn == -1"
>
> mm/mm_init.c: if (*start_pfn == -1UL)
> mm/vmscan.c: if (pfn == -1)
> mm/vmscan.c: if (pfn == -1)
> mm/vmscan.c: if (pfn == -1)
> tools/testing/selftests/mm/hugepage-vmemmap.c: if (pfn == -1UL) {
>
> Could not -1UL be abstracted as common macro MM_INVALID_PFN to be used in
> such scenarios including here ?
Ah yes, you are right, get_pte_pfn() already uses -1 as an invalid pfn, so
I can just use that.
Will definitely give it a go on the next iteration, thanks for the
suggestion!
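To be concrete, what I'd try for the next iteration is roughly the sketch
below (untested; MM_INVALID_PFN and the two internal helpers are made-up
names for this discussion, not what the final patch will look like):

/*
 * Rough sketch only: __cma_alloc_anywhere() stands in for today's scan
 * loop and __cma_alloc_at() for the fixed-range path that this series
 * currently adds as cma_alloc_range().
 */
#define MM_INVALID_PFN	(~0UL)

struct page *cma_alloc(struct cma *cma, unsigned long start_pfn,
		       unsigned long count, unsigned int align, bool no_warn)
{
	/* No placement constraint: keep today's first-fit behaviour. */
	if (start_pfn == MM_INVALID_PFN)
		return __cma_alloc_anywhere(cma, count, align, no_warn);

	/* Only the requested range will do, e.g. a tag storage block. */
	return __cma_alloc_at(cma, start_pfn, count, no_warn);
}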
>
> >
> >> But then I am wondering why this could not be done in the arm64 platform
> >> code itself operating on a CMA area reserved just for tag storage. Unless
> >> this new allocator has other usage beyond MTE, this could be implemented
> >> in the platform itself.
> > I had the same idea in the previous iteration, David Hildenbrand suggested
> > this approach [1].
> >
> > [1] https://lore.kernel.org/linux-fsdevel/[email protected]/
>
> There are two different cma_alloc() proposals here - including the next
> patch, i.e. mm: cma: Fast track allocating memory when the pages are free
>
> 1) Augment cma_alloc() or add cma_alloc_range() with a start_pfn parameter
> 2) Speed up cma_alloc() for small allocation requests when pages are free
>
> The second one, if separated out from this series, could be considered on
> its own as it will help all existing cma_alloc() callers. The first one
> definitely needs a use case, as provided in this series.
I understand, thanks for the input!
Alex
On 1/25/24 22:12, Alexandru Elisei wrote:
> arm64 uses arch_swap_restore() to restore saved tags before the page is
> swapped in and it's called in atomic context (with the ptl lock held).
>
> Introduce arch_swap_prepare_to_restore() that will allow an architecture to
> perform extra work during swap in and outside of a critical section.
> This will be used by arm64 to allocate a buffer in memory where to
> temporarily save tags if tag storage is not available for the page being
> swapped in.
Just wondering if tag storage will always be unavailable for tagged pages
being swapped in ? OR there are cases where allocation might not even be
required ? This prepare phase needs to be outside the critical section -
only because there might be memory allocations ?
On 1/25/24 22:12, Alexandru Elisei wrote:
> Introduce a mechanism that allows an architecture to trigger a page fault,
> and add the infrastructure to handle that fault accordingly. To make use
> of this, an arch is expected to mark the table entry as PAGE_NONE (which
> will cause a fault next time it is accessed) and to implement an
> arch-specific method (like a software bit) for recognizing that the fault
> needs to be handled by the arch code.
>
> arm64 will use this approach to reserve tag storage for pages which are
> mapped in an MTE enabled VMA, but the storage needed to store tags isn't
> reserved (for example, because of an mprotect(PROT_MTE) call on a VMA with
> existing pages).
Just to summarize -
So the platform will create NUMA balancing-like page faults - via marking existing
mappings with PAGE_NONE permission; when the subsequent fault happens, identify
such cases via a software bit in the page table entry and then route the fault
to the platform code itself for special-purpose page fault handling, where the
page might come from some reserved areas instead.
Some questions
- How often PAGE_NONE is to be marked for applicable MTE VMA based mappings
- Is it periodic like NUMA balancing or just one time for tag storage
- How this is going to interact with NUMA balancing given both use PAGE_NONE
- How to differentiate these mappings from standard pte_protnone()
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch. Split from patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS
> for mprotect(PROT_MTE)") (David Hildenbrand).
>
> include/linux/huge_mm.h | 4 ++--
> include/linux/pgtable.h | 47 +++++++++++++++++++++++++++++++++++--
> mm/Kconfig | 3 +++
> mm/huge_memory.c | 36 +++++++++++++++++++++--------
> mm/memory.c | 51 ++++++++++++++++++++++++++---------------
> 5 files changed, 109 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..4678a0a5e6a8 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -346,7 +346,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> pud_t *pud, int flags, struct dev_pagemap **pgmap);
>
> -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
> +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf);
>
> extern struct page *huge_zero_page;
> extern unsigned long huge_zero_pfn;
> @@ -476,7 +476,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
> return NULL;
> }
>
> -static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> +static inline vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
> {
> return 0;
> }
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2d0f04042f62..81a21be855a2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1455,7 +1455,7 @@ static inline int pud_trans_unstable(pud_t *pud)
> return 0;
> }
>
> -#ifndef CONFIG_NUMA_BALANCING
> +#if !defined(CONFIG_NUMA_BALANCING) && !defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS)
> /*
> * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
> * perfectly valid to indicate "no" in that case, which is why our default
> @@ -1477,7 +1477,50 @@ static inline int pmd_protnone(pmd_t pmd)
> {
> return 0;
> }
> -#endif /* CONFIG_NUMA_BALANCING */
> +#endif /* !CONFIG_NUMA_BALANCING && !CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> +
> +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> +static inline bool arch_fault_on_access_pte(pte_t pte)
> +{
> + return false;
> +}
> +
> +static inline bool arch_fault_on_access_pmd(pmd_t pmd)
> +{
> + return false;
> +}
> +
> +/*
> + * The function is called with the fault lock held and an elevated reference on
> + * the folio.
> + *
> + * Rules that an arch implementation of the function must follow:
> + *
> + * 1. The function must return with the elevated reference dropped.
> + *
> + * 2. If the return value contains VM_FAULT_RETRY or VM_FAULT_COMPLETED then:
> + *
> + * - if FAULT_FLAG_RETRY_NOWAIT is not set, the function must return with the
> + * correct fault lock released, which can be accomplished with
> + * release_fault_lock(vmf). Note that release_fault_lock() doesn't check if
> + * FAULT_FLAG_RETRY_NOWAIT is set before releasing the mmap_lock.
> + *
> + * - if FAULT_FLAG_RETRY_NOWAIT is set, then the function must not release the
> + * mmap_lock. The flag should be set only if the mmap_lock is held.
> + *
> + * 3. If the return value contains neither of the above, the function must not
> + * release the fault lock; the generic fault handler will take care of releasing
> + * the correct lock.
> + */
> +static inline vm_fault_t arch_handle_folio_fault_on_access(struct folio *folio,
> + struct vm_fault *vmf,
> + bool *map_pte)
> +{
> + *map_pte = false;
> +
> + return VM_FAULT_SIGBUS;
> +}
> +#endif
>
> #endif /* CONFIG_MMU */
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 341cf53898db..153df67221f1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1006,6 +1006,9 @@ config IDLE_PAGE_TRACKING
> config ARCH_HAS_CACHE_LINE_SIZE
> bool
>
> +config ARCH_HAS_FAULT_ON_ACCESS
> + bool
> +
> config ARCH_HAS_CURRENT_STACK_POINTER
> bool
> help
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94ef5c02b459..2bad63a7ec16 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1698,7 +1698,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> }
>
> /* NUMA hinting page fault entry point for trans huge pmds */
> -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> pmd_t oldpmd = vmf->orig_pmd;
> @@ -1708,6 +1708,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> int nid = NUMA_NO_NODE;
> int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> bool migrated = false, writable = false;
> + vm_fault_t ret;
> int flags = 0;
>
> vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> @@ -1731,6 +1732,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> if (!folio)
> goto out_map;
>
> + folio_get(folio);
> + vma_set_access_pid_bit(vma);
> +
> + if (arch_fault_on_access_pmd(oldpmd)) {
> + bool map_pte = false;
> +
> + spin_unlock(vmf->ptl);
> + ret = arch_handle_folio_fault_on_access(folio, vmf, &map_pte);
> + if (ret || !map_pte)
> + return ret;
> + writable = false;
> + goto out_lock_and_map;
> + }
> +
> /* See similar comment in do_numa_page for explanation */
> if (!writable)
> flags |= TNF_NO_GROUP;
> @@ -1755,15 +1770,18 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> if (migrated) {
> flags |= TNF_MIGRATED;
> nid = target_nid;
> - } else {
> - flags |= TNF_MIGRATE_FAIL;
> - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> - if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> - spin_unlock(vmf->ptl);
> - goto out;
> - }
> - goto out_map;
> + goto out;
> + }
> +
> + flags |= TNF_MIGRATE_FAIL;
> +
> +out_lock_and_map:
> + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> + spin_unlock(vmf->ptl);
> + goto out;
> }
> + goto out_map;
>
> out:
> if (nid != NUMA_NO_NODE)
> diff --git a/mm/memory.c b/mm/memory.c
> index 8a421e168b57..110fe2224277 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4886,11 +4886,6 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
> int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
> unsigned long addr, int page_nid, int *flags)
> {
> - folio_get(folio);
> -
> - /* Record the current PID acceesing VMA */
> - vma_set_access_pid_bit(vma);
> -
> count_vm_numa_event(NUMA_HINT_FAULTS);
> if (page_nid == numa_node_id()) {
> count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> @@ -4900,13 +4895,14 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
> return mpol_misplaced(folio, vma, addr);
> }
>
> -static vm_fault_t do_numa_page(struct vm_fault *vmf)
> +static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> struct folio *folio = NULL;
> int nid = NUMA_NO_NODE;
> bool writable = false;
> int last_cpupid;
> + vm_fault_t ret;
> int target_nid;
> pte_t pte, old_pte;
> int flags = 0;
> @@ -4939,6 +4935,20 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> if (!folio || folio_is_zone_device(folio))
> goto out_map;
>
> + folio_get(folio);
> + /* Record the current PID acceesing VMA */
> + vma_set_access_pid_bit(vma);
> +
> + if (arch_fault_on_access_pte(old_pte)) {
> + bool map_pte = false;
> +
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + ret = arch_handle_folio_fault_on_access(folio, vmf, &map_pte);
> + if (ret || !map_pte)
> + return ret;
> + goto out_lock_and_map;
> + }
> +
> /* TODO: handle PTE-mapped THP */
> if (folio_test_large(folio))
> goto out_map;
> @@ -4983,18 +4993,21 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> if (migrate_misplaced_folio(folio, vma, target_nid)) {
> nid = target_nid;
> flags |= TNF_MIGRATED;
> - } else {
> - flags |= TNF_MIGRATE_FAIL;
> - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> - vmf->address, &vmf->ptl);
> - if (unlikely(!vmf->pte))
> - goto out;
> - if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
> - pte_unmap_unlock(vmf->pte, vmf->ptl);
> - goto out;
> - }
> - goto out_map;
> + goto out;
> + }
> +
> + flags |= TNF_MIGRATE_FAIL;
> +
> +out_lock_and_map:
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + if (unlikely(!vmf->pte))
> + goto out;
> + if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
> + pte_unmap_unlock(vmf->pte, vmf->ptl);
> + goto out;
> }
> + goto out_map;
>
> out:
> if (nid != NUMA_NO_NODE)
> @@ -5151,7 +5164,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> return do_swap_page(vmf);
>
> if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> - return do_numa_page(vmf);
> + return handle_pte_protnone(vmf);
>
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> @@ -5272,7 +5285,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> }
> if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> - return do_huge_pmd_numa_page(&vmf);
> + return handle_huge_pmd_protnone(&vmf);
>
> if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
> !pmd_write(vmf.orig_pmd)) {
On 1/25/24 22:12, Alexandru Elisei wrote:
> copy_user_highpage() will do memory allocation if there are saved tags for
> the destination page, and the page is missing tag storage.
>
> After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and
> rcu_read_unlock()s"), collapse_huge_page() calls
> __collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock
> held, which means that copy_user_highpage() can only allocate memory using
> GFP_ATOMIC or equivalent.
>
> Get around this by refusing to collapse pages into a transparent huge page
> if the VMA is MTE-enabled.
Makes sense when copy_user_highpage() will allocate memory for tag storage.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch. I think an agreement on whether copy*_user_highpage() should be
> always allowed to sleep, or should not be allowed, would be useful.
This is a good question ! Even after preventing the collapse of MTE VMA here,
there still might be more paths where a sleeping (i.e memory allocating)
copy*_user_highpage() becomes problematic ?
>
> arch/arm64/include/asm/pgtable.h | 3 +++
> arch/arm64/kernel/mte_tag_storage.c | 5 +++++
> include/linux/khugepaged.h | 5 +++++
> mm/khugepaged.c | 4 ++++
> 4 files changed, 17 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 87ae59436162..d0473538c926 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask)
> return true;
> }
>
> +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address);
> +#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate
> +
> #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> #endif /* CONFIG_ARM64_MTE */
>
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index ac7b9c9c585c..a99959b70573 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, gfp_t gfp)
> if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
> reserve_tag_storage(page, order, gfp);
> }
> +
> +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address)
> +{
> + return !(vma->vm_flags & VM_MTE);
> +}
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index f68865e19b0b..461e4322dff2 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm)
> if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
> __khugepaged_exit(mm);
> }
> +
> +#ifndef arch_hugepage_vma_revalidate
> +#define arch_hugepage_vma_revalidate(vma, address) 1
Please replace s/1/true as arch_hugepage_vma_revalidate() returns bool ?
> +#endif
Right, above construct is much better than __HAVE_ARCH_XXXX based one.
> +
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> {
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..cb9a9ddb4d86 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> */
> if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
> return SCAN_PAGE_ANON;
> +
> + if (!arch_hugepage_vma_revalidate(vma, address))
> + return SCAN_VMA_CHECK;
> +
> return SCAN_SUCCEED;
> }
>
Otherwise this LGTM.
On 1/25/24 22:12, Alexandru Elisei wrote:
> A page can end up mapped in a MTE enabled VMA without the corresponding tag
> storage block reserved. Tag accesses made by ptrace in this case can lead
> to the wrong tags being read or memory corruption for the process that is
> using the tag storage memory as data.
>
> Reserve tag storage by treating ptrace accesses like a fault.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * New patch, issue reported by Peter Collingbourne.
>
> arch/arm64/kernel/mte.c | 26 ++++++++++++++++++++++++--
> 1 file changed, 24 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index faf09da3400a..b1fa02dad4fd 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> while (len) {
> struct vm_area_struct *vma;
> unsigned long tags, offset;
> + unsigned int fault_flags;
> + struct page *page;
> + vm_fault_t ret;
> void *maddr;
> - struct page *page = get_user_page_vma_remote(mm, addr,
> - gup_flags, &vma);
>
> +get_page:
> + page = get_user_page_vma_remote(mm, addr, gup_flags, &vma);
But if there is a valid page returned here in the first GUP attempt, will there
still be a subsequent handle_mm_fault() on the same vma and addr ?
> if (IS_ERR(page)) {
> err = PTR_ERR(page);
> break;
> @@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> put_page(page);
> break;
> }
> +
> + if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
Should not '!page' be checked here as well ?
> + fault_flags = FAULT_FLAG_DEFAULT | \
> + FAULT_FLAG_USER | \
> + FAULT_FLAG_REMOTE | \
> + FAULT_FLAG_ALLOW_RETRY | \
> + FAULT_FLAG_RETRY_NOWAIT;
> + if (write)
> + fault_flags |= FAULT_FLAG_WRITE;
> +
> + put_page(page);
> + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> + if (ret & VM_FAULT_ERROR) {
> + err = -EFAULT;
> + break;
> + }
> + goto get_page;
> + }
> +
> WARN_ON_ONCE(!page_mte_tagged(page));
>
> /* limit access to the end of the page */
Hi,
On Thu, Feb 01, 2024 at 09:00:23AM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > arm64 uses arch_swap_restore() to restore saved tags before the page is
> > swapped in and it's called in atomic context (with the ptl lock held).
> >
> > Introduce arch_swap_prepare_to_restore() that will allow an architecture to
> > perform extra work during swap in and outside of a critical section.
> > This will be used by arm64 to allocate a buffer in memory where to
> > temporarily save tags if tag storage is not available for the page being
> > swapped in.
>
> Just wondering if tag storage will always be unavailable for tagged pages
> being swapped in ? OR there are cases where allocation might not even be
In some (probably most) situations, tag storage will be available for the
page that will be swapped in. That's because either the page will have been
taken from the swap cache (which means it hasn't been freed, so it still
has tag storage reserved) or it has been allocated with vma_alloc_folio()
(when it's swapped back in in a VMA with VM_MTE set).
I've explained a scenario where tags will be restored for a page without
tag storage in patch #28 ("mte: swap: Handle tag restoring when missing tag
storage") [1]. Basically, it's because tagged pages can be mapped as tagged
in one VMA and untagged in another VMA; and swap tags are restored the
first time a page is swapped back in, even if it's swapped in a VMA with
MTE disabled.
[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
> required ? This prepare phase needs to be outside the critical section -
> only because there might be memory allocations ?
Yes, exactly. See patch above :)
Thanks,
Alex
Hi,
On Thu, Feb 01, 2024 at 02:51:39PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > A page can end up mapped in a MTE enabled VMA without the corresponding tag
> > storage block reserved. Tag accesses made by ptrace in this case can lead
> > to the wrong tags being read or memory corruption for the process that is
> > using the tag storage memory as data.
> >
> > Reserve tag storage by treating ptrace accesses like a fault.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch, issue reported by Peter Collingbourne.
> >
> > arch/arm64/kernel/mte.c | 26 ++++++++++++++++++++++++--
> > 1 file changed, 24 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index faf09da3400a..b1fa02dad4fd 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -412,10 +412,13 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> > while (len) {
> > struct vm_area_struct *vma;
> > unsigned long tags, offset;
> > + unsigned int fault_flags;
> > + struct page *page;
> > + vm_fault_t ret;
> > void *maddr;
> > - struct page *page = get_user_page_vma_remote(mm, addr,
> > - gup_flags, &vma);
> >
> > +get_page:
> > + page = get_user_page_vma_remote(mm, addr, gup_flags, &vma);
>
> But if there is a valid page returned here in the first GUP attempt, will there
> still be a subsequent handle_mm_fault() on the same vma and addr ?
Only if the page is missing tag storage. In that case, the page has been
mapped as arch_fault_on_access_pte(), and
handle_mm_fault()->..->arch_handle_folio_fault_on_access() will either
reserve tag storage or migrate the page.
>
> > if (IS_ERR(page)) {
> > err = PTR_ERR(page);
> > break;
> > @@ -433,6 +436,25 @@ static int __access_remote_tags(struct mm_struct *mm, unsigned long addr,
> > put_page(page);
> > break;
> > }
> > +
> > + if (tag_storage_enabled() && !page_tag_storage_reserved(page)) {
>
> Should not '!page' be checked here as well ?
I was under the impression that get_user_page_vma_remote() returns an error
pointer if gup couldn't pin the page.
Thanks,
Alex
>
> > + fault_flags = FAULT_FLAG_DEFAULT | \
> > + FAULT_FLAG_USER | \
> > + FAULT_FLAG_REMOTE | \
> > + FAULT_FLAG_ALLOW_RETRY | \
> > + FAULT_FLAG_RETRY_NOWAIT;
> > + if (write)
> > + fault_flags |= FAULT_FLAG_WRITE;
> > +
> > + put_page(page);
> > + ret = handle_mm_fault(vma, addr, fault_flags, NULL);
> > + if (ret & VM_FAULT_ERROR) {
> > + err = -EFAULT;
> > + break;
> > + }
> > + goto get_page;
> > + }
> > +
> > WARN_ON_ONCE(!page_mte_tagged(page));
> >
> > /* limit access to the end of the page */
On Thu, Feb 01, 2024 at 01:42:08PM +0530, Anshuman Khandual wrote:
>
>
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > copy_user_highpage() will do memory allocation if there are saved tags for
> > the destination page, and the page is missing tag storage.
> >
> > After commit a349d72fd9ef ("mm/pgtable: add rcu_read_lock() and
> > rcu_read_unlock()s"), collapse_huge_page() calls
> > __collapse_huge_page_copy() -> .. -> copy_user_highpage() with the RCU lock
> > held, which means that copy_user_highpage() can only allocate memory using
> > GFP_ATOMIC or equivalent.
> >
> > Get around this by refusing to collapse pages into a transparent huge page
> > if the VMA is MTE-enabled.
>
> Makes sense when copy_user_highpage() will allocate memory for tag storage.
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch. I think an agreement on whether copy*_user_highpage() should be
> > always allowed to sleep, or should not be allowed, would be useful.
>
> This is a good question ! Even after preventing the collapse of MTE VMA here,
> there still might be more paths where a sleeping (i.e memory allocating)
> copy*_user_highpage() becomes problematic ?
Exactly!
>
> >
> > arch/arm64/include/asm/pgtable.h | 3 +++
> > arch/arm64/kernel/mte_tag_storage.c | 5 +++++
> > include/linux/khugepaged.h | 5 +++++
> > mm/khugepaged.c | 4 ++++
> > 4 files changed, 17 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 87ae59436162..d0473538c926 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -1120,6 +1120,9 @@ static inline bool arch_alloc_cma(gfp_t gfp_mask)
> > return true;
> > }
> >
> > +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address);
> > +#define arch_hugepage_vma_revalidate arch_hugepage_vma_revalidate
> > +
> > #endif /* CONFIG_ARM64_MTE_TAG_STORAGE */
> > #endif /* CONFIG_ARM64_MTE */
> >
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index ac7b9c9c585c..a99959b70573 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -636,3 +636,8 @@ void arch_alloc_page(struct page *page, int order, gfp_t gfp)
> > if (tag_storage_enabled() && alloc_requires_tag_storage(gfp))
> > reserve_tag_storage(page, order, gfp);
> > }
> > +
> > +bool arch_hugepage_vma_revalidate(struct vm_area_struct *vma, unsigned long address)
> > +{
> > + return !(vma->vm_flags & VM_MTE);
> > +}
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index f68865e19b0b..461e4322dff2 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -38,6 +38,11 @@ static inline void khugepaged_exit(struct mm_struct *mm)
> > if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
> > __khugepaged_exit(mm);
> > }
> > +
> > +#ifndef arch_hugepage_vma_revalidate
> > +#define arch_hugepage_vma_revalidate(vma, address) 1
>
> Please replace s/1/true as arch_hugepage_vma_revalidate() returns bool ?
Yeah, that's strange, I don't know why I used 1 there. Will change it to true,
thanks for spotting it.
>
> > +#endif
>
> Right, above construct is much better than __HAVE_ARCH_XXXX based one.
Thanks!
Alex
>
> > +
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > static inline void khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> > {
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 2b219acb528e..cb9a9ddb4d86 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -935,6 +935,10 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
> > */
> > if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
> > return SCAN_PAGE_ANON;
> > +
> > + if (!arch_hugepage_vma_revalidate(vma, address))
> > + return SCAN_VMA_CHECK;
> > +
> > return SCAN_SUCCEED;
> > }
> >
>
> Otherwise this LGTM.
Hi,
On Thu, Feb 01, 2024 at 11:22:13AM +0530, Anshuman Khandual wrote:
> On 1/25/24 22:12, Alexandru Elisei wrote:
> > Introduce a mechanism that allows an architecture to trigger a page fault,
> > and add the infrastructure to handle that fault accordingly. To make use
> > of this, an arch is expected to mark the table entry as PAGE_NONE (which
> > will cause a fault next time it is accessed) and to implement an
> > arch-specific method (like a software bit) for recognizing that the fault
> > needs to be handled by the arch code.
> >
> > arm64 will use this approach to reserve tag storage for pages which are
> > mapped in an MTE enabled VMA, but the storage needed to store tags isn't
> > reserved (for example, because of an mprotect(PROT_MTE) call on a VMA with
> > existing pages).
>
> Just to summarize -
>
> So the platform will create NUMA balancing-like page faults - via marking existing
> mappings with PAGE_NONE permission; when the subsequent fault happens, identify
> such cases via a software bit in the page table entry and then route the fault
> to the platform code itself for special-purpose page fault handling, where the
> page might come from some reserved areas instead.
Indeed. In the tag storage scenario, the page is a page that will be mapped
as tagged; if it's missing tag storage, the tag storage needs to be
reserved before the page can be mapped as tagged (and tags can be accessed).
>
> Some questions
>
> - How often PAGE_NONE is to be marked for applicable MTE VMA based mappings
>
> - Is it periodic like NUMA balancing or just one time for tag storage
It's deterministic, and only for tag storage. It's done in
set_ptes()/__set_pte_at()->..->mte_sync_tags(), if the page is going to be
mapped as tagged, but is missing tag storage. See patch #26 ("arm64: mte:
Use fault-on-access to reserve missing tag storage") [1] for the code.
[1] https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> - How this is going to interact with NUMA balancing given both use PAGE_NONE
>
> - How to differentiate these mappings from standard pte_protnone()
The only place where the difference matters is in do_numa_page(), here
renamed to handle_pte_protnone(), and in the huge page equivalent.
Userspace can access tags only if set_ptes()/__set_pte_at() maps the pte
with the PT_NORMAL_TAGGED attribute, but those functions will always map
the page as arch_fault_on_access_pte() if it's missing tag storage. That
makes it impossible for the kernel to map it as tagged behind our back.
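Concretely, the shape of the check is something like the snippet below (the
bit name and position are made up here purely for illustration; the real
definitions live in the arm64 patch):

/* arch/arm64/include/asm/pgtable.h style sketch, illustrative only. */
/* Made-up software bit; the actual patch picks a genuinely free PTE bit. */
#define PTE_SW_FAULT_ON_ACCESS	(_AT(pteval_t, 1) << 60)

static inline bool arch_fault_on_access_pte(pte_t pte)
{
	/*
	 * Require protnone *and* the arch software bit, so a pte marked
	 * PAGE_NONE by NUMA balancing is never treated as fault-on-access.
	 */
	return pte_protnone(pte) && (pte_val(pte) & PTE_SW_FAULT_ON_ACCESS);
}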
Unless you had other concerns.
Thanks,
Alex
>
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * New patch. Split from patch #19 ("mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS
> > for mprotect(PROT_MTE)") (David Hildenbrand).
> >
> > include/linux/huge_mm.h | 4 ++--
> > include/linux/pgtable.h | 47 +++++++++++++++++++++++++++++++++++--
> > mm/Kconfig | 3 +++
> > mm/huge_memory.c | 36 +++++++++++++++++++++--------
> > mm/memory.c | 51 ++++++++++++++++++++++++++---------------
> > 5 files changed, 109 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 5adb86af35fc..4678a0a5e6a8 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -346,7 +346,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
> > struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
> > pud_t *pud, int flags, struct dev_pagemap **pgmap);
> >
> > -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
> > +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf);
> >
> > extern struct page *huge_zero_page;
> > extern unsigned long huge_zero_pfn;
> > @@ -476,7 +476,7 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
> > return NULL;
> > }
> >
> > -static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> > +static inline vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
> > {
> > return 0;
> > }
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 2d0f04042f62..81a21be855a2 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1455,7 +1455,7 @@ static inline int pud_trans_unstable(pud_t *pud)
> > return 0;
> > }
> >
> > -#ifndef CONFIG_NUMA_BALANCING
> > +#if !defined(CONFIG_NUMA_BALANCING) && !defined(CONFIG_ARCH_HAS_FAULT_ON_ACCESS)
> > /*
> > * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
> > * perfectly valid to indicate "no" in that case, which is why our default
> > @@ -1477,7 +1477,50 @@ static inline int pmd_protnone(pmd_t pmd)
> > {
> > return 0;
> > }
> > -#endif /* CONFIG_NUMA_BALANCING */
> > +#endif /* !CONFIG_NUMA_BALANCING && !CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> > +
> > +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool arch_fault_on_access_pte(pte_t pte)
> > +{
> > + return false;
> > +}
> > +
> > +static inline bool arch_fault_on_access_pmd(pmd_t pmd)
> > +{
> > + return false;
> > +}
> > +
> > +/*
> > + * The function is called with the fault lock held and an elevated reference on
> > + * the folio.
> > + *
> > + * Rules that an arch implementation of the function must follow:
> > + *
> > + * 1. The function must return with the elevated reference dropped.
> > + *
> > + * 2. If the return value contains VM_FAULT_RETRY or VM_FAULT_COMPLETED then:
> > + *
> > + * - if FAULT_FLAG_RETRY_NOWAIT is not set, the function must return with the
> > + * correct fault lock released, which can be accomplished with
> > + * release_fault_lock(vmf). Note that release_fault_lock() doesn't check if
> > + * FAULT_FLAG_RETRY_NOWAIT is set before releasing the mmap_lock.
> > + *
> > + * - if FAULT_FLAG_RETRY_NOWAIT is set, then the function must not release the
> > + * mmap_lock. The flag should be set only if the mmap_lock is held.
> > + *
> > + * 3. If the return value contains neither of the above, the function must not
> > + * release the fault lock; the generic fault handler will take care of releasing
> > + * the correct lock.
> > + */
> > +static inline vm_fault_t arch_handle_folio_fault_on_access(struct folio *folio,
> > + struct vm_fault *vmf,
> > + bool *map_pte)
> > +{
> > + *map_pte = false;
> > +
> > + return VM_FAULT_SIGBUS;
> > +}
> > +#endif
> >
> > #endif /* CONFIG_MMU */
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 341cf53898db..153df67221f1 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1006,6 +1006,9 @@ config IDLE_PAGE_TRACKING
> > config ARCH_HAS_CACHE_LINE_SIZE
> > bool
> >
> > +config ARCH_HAS_FAULT_ON_ACCESS
> > + bool
> > +
> > config ARCH_HAS_CURRENT_STACK_POINTER
> > bool
> > help
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 94ef5c02b459..2bad63a7ec16 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1698,7 +1698,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> > }
> >
> > /* NUMA hinting page fault entry point for trans huge pmds */
> > -vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> > +vm_fault_t handle_huge_pmd_protnone(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > pmd_t oldpmd = vmf->orig_pmd;
> > @@ -1708,6 +1708,7 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> > int nid = NUMA_NO_NODE;
> > int target_nid, last_cpupid = (-1 & LAST_CPUPID_MASK);
> > bool migrated = false, writable = false;
> > + vm_fault_t ret;
> > int flags = 0;
> >
> > vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > @@ -1731,6 +1732,20 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> > if (!folio)
> > goto out_map;
> >
> > + folio_get(folio);
> > + vma_set_access_pid_bit(vma);
> > +
> > + if (arch_fault_on_access_pmd(oldpmd)) {
> > + bool map_pte = false;
> > +
> > + spin_unlock(vmf->ptl);
> > + ret = arch_handle_folio_fault_on_access(folio, vmf, &map_pte);
> > + if (ret || !map_pte)
> > + return ret;
> > + writable = false;
> > + goto out_lock_and_map;
> > + }
> > +
> > /* See similar comment in do_numa_page for explanation */
> > if (!writable)
> > flags |= TNF_NO_GROUP;
> > @@ -1755,15 +1770,18 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf)
> > if (migrated) {
> > flags |= TNF_MIGRATED;
> > nid = target_nid;
> > - } else {
> > - flags |= TNF_MIGRATE_FAIL;
> > - vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > - if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> > - spin_unlock(vmf->ptl);
> > - goto out;
> > - }
> > - goto out_map;
> > + goto out;
> > + }
> > +
> > + flags |= TNF_MIGRATE_FAIL;
> > +
> > +out_lock_and_map:
> > + vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
> > + if (unlikely(!pmd_same(oldpmd, *vmf->pmd))) {
> > + spin_unlock(vmf->ptl);
> > + goto out;
> > }
> > + goto out_map;
> >
> > out:
> > if (nid != NUMA_NO_NODE)
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8a421e168b57..110fe2224277 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4886,11 +4886,6 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
> > int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
> > unsigned long addr, int page_nid, int *flags)
> > {
> > - folio_get(folio);
> > -
> > - /* Record the current PID acceesing VMA */
> > - vma_set_access_pid_bit(vma);
> > -
> > count_vm_numa_event(NUMA_HINT_FAULTS);
> > if (page_nid == numa_node_id()) {
> > count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
> > @@ -4900,13 +4895,14 @@ int numa_migrate_prep(struct folio *folio, struct vm_area_struct *vma,
> > return mpol_misplaced(folio, vma, addr);
> > }
> >
> > -static vm_fault_t do_numa_page(struct vm_fault *vmf)
> > +static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > struct folio *folio = NULL;
> > int nid = NUMA_NO_NODE;
> > bool writable = false;
> > int last_cpupid;
> > + vm_fault_t ret;
> > int target_nid;
> > pte_t pte, old_pte;
> > int flags = 0;
> > @@ -4939,6 +4935,20 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> > if (!folio || folio_is_zone_device(folio))
> > goto out_map;
> >
> > + folio_get(folio);
> > + /* Record the current PID acceesing VMA */
> > + vma_set_access_pid_bit(vma);
> > +
> > + if (arch_fault_on_access_pte(old_pte)) {
> > + bool map_pte = false;
> > +
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + ret = arch_handle_folio_fault_on_access(folio, vmf, &map_pte);
> > + if (ret || !map_pte)
> > + return ret;
> > + goto out_lock_and_map;
> > + }
> > +
> > /* TODO: handle PTE-mapped THP */
> > if (folio_test_large(folio))
> > goto out_map;
> > @@ -4983,18 +4993,21 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> > if (migrate_misplaced_folio(folio, vma, target_nid)) {
> > nid = target_nid;
> > flags |= TNF_MIGRATED;
> > - } else {
> > - flags |= TNF_MIGRATE_FAIL;
> > - vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > - vmf->address, &vmf->ptl);
> > - if (unlikely(!vmf->pte))
> > - goto out;
> > - if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
> > - pte_unmap_unlock(vmf->pte, vmf->ptl);
> > - goto out;
> > - }
> > - goto out_map;
> > + goto out;
> > + }
> > +
> > + flags |= TNF_MIGRATE_FAIL;
> > +
> > +out_lock_and_map:
> > + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > + vmf->address, &vmf->ptl);
> > + if (unlikely(!vmf->pte))
> > + goto out;
> > + if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) {
> > + pte_unmap_unlock(vmf->pte, vmf->ptl);
> > + goto out;
> > }
> > + goto out_map;
> >
> > out:
> > if (nid != NUMA_NO_NODE)
> > @@ -5151,7 +5164,7 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > return do_swap_page(vmf);
> >
> > if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> > - return do_numa_page(vmf);
> > + return handle_pte_protnone(vmf);
> >
> > spin_lock(vmf->ptl);
> > entry = vmf->orig_pte;
> > @@ -5272,7 +5285,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> > }
> > if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
> > if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
> > - return do_huge_pmd_numa_page(&vmf);
> > + return handle_huge_pmd_protnone(&vmf);
> >
> > if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
> > !pmd_write(vmf.orig_pmd)) {
On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
<[email protected]> wrote:
>
> Linux restores tags when a page is swapped in and there are tags associated
> with the swap entry which the new page will replace. The saved tags are
> restored even if the page will not be mapped as tagged, to protect against
> cases where the page is shared between different VMAs, and is tagged in
> some, but untagged in others. By using this approach, the process can still
> access the correct tags following an mprotect(PROT_MTE) on the non-MTE
> enabled VMA.
>
> But this poses a challenge for managing tag storage: in the scenario above,
> when a new page is allocated to be swapped in for the process where it will
> be mapped as untagged, the corresponding tag storage block is not reserved.
> mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
> overwrite data in the tag storage block associated with the new page,
> leading to data corruption if the block is in use by a process.
>
> Get around this issue by saving the tags in a new xarray, this time indexed
> by the page pfn, and then restoring them when tag storage is reserved for
> the page.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since rfc v2:
>
> * Restore saved tags **before** setting the PG_tag_storage_reserved bit to
> eliminate a brief window of opportunity where userspace can access uninitialized
> tags (Peter Collingbourne).
>
> arch/arm64/include/asm/mte_tag_storage.h | 8 ++
> arch/arm64/include/asm/pgtable.h | 11 +++
> arch/arm64/kernel/mte_tag_storage.c | 12 ++-
> arch/arm64/mm/mteswap.c | 110 +++++++++++++++++++++++
> 4 files changed, 140 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index 50bdae94cf71..40590a8c3748 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
>
> vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
> bool *map_pte);
> +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
> +
> +void tags_by_pfn_lock(void);
> +void tags_by_pfn_unlock(void);
> +
> +void *mte_erase_tags_for_pfn(unsigned long pfn);
> +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
> +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
> #else
> static inline bool tag_storage_enabled(void)
> {
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0174e292f890..87ae59436162 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
> mte_invalidate_tags_area_by_swp_entry(type);
> }
>
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
> +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
> + struct folio *folio)
> +{
> + if (tag_storage_enabled())
> + return mte_try_transfer_swap_tags(entry, &folio->page);
> + return 0;
> +}
> +#endif
> +
> #define __HAVE_ARCH_SWAP_RESTORE
> static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> {
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index afe2bb754879..ac7b9c9c585c 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> }
> }
>
> + mte_restore_tags_for_pfn(page_to_pfn(page), order);
> page_set_tag_storage_reserved(page, order);
> out_unlock:
> mutex_unlock(&tag_blocks_lock);
> @@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
> struct tag_region *region;
> unsigned long page_va;
> unsigned long flags;
> - int ret;
> + void *tags;
> + int i, ret;
>
> ret = tag_storage_find_block(page, &start_block, ®ion);
> if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> @@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
> /* Avoid writeback of dirty tag cache lines corrupting data. */
> dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
>
> + tags_by_pfn_lock();
> + for (i = 0; i < (1 << order); i++) {
> + tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
> + if (unlikely(tags))
> + mte_free_tag_buf(tags);
> + }
> + tags_by_pfn_unlock();
> +
> end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
>
> xa_lock_irqsave(&tag_blocks_reserved, flags);
> diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> index 2a43746b803f..e11495fa3c18 100644
> --- a/arch/arm64/mm/mteswap.c
> +++ b/arch/arm64/mm/mteswap.c
> @@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf)
> kfree(buf);
> }
>
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +static DEFINE_XARRAY(tags_by_pfn);
> +
> +void tags_by_pfn_lock(void)
> +{
> + xa_lock(&tags_by_pfn);
> +}
> +
> +void tags_by_pfn_unlock(void)
> +{
> + xa_unlock(&tags_by_pfn);
> +}
> +
> +void *mte_erase_tags_for_pfn(unsigned long pfn)
> +{
> + return __xa_erase(&tags_by_pfn, pfn);
> +}
> +
> +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
> +{
> + void *entry;
> + int ret;
> +
> + ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
copy_highpage can be called from an atomic context, so it isn't
currently valid to pass GFP_KERNEL here.
To give one example of a possible atomic context call, copy_pte_range
will take a PTE spinlock and can call copy_present_pte, which can call
copy_present_page, which will call copy_user_highpage.
To give another example, __buffer_migrate_folio can call
spin_lock(&mapping->private_lock), then call folio_migrate_copy, which
will call folio_copy.
Peter
> + if (ret)
> + return true;
> +
> + tags_by_pfn_lock();
> +
> + if (page_tag_storage_reserved(pfn_to_page(pfn))) {
> + xa_release(&tags_by_pfn, pfn);
> + tags_by_pfn_unlock();
> + return false;
> + }
> +
> + entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
> + if (xa_is_err(entry)) {
> + xa_release(&tags_by_pfn, pfn);
> + goto out_unlock;
> + } else if (entry) {
> + mte_free_tag_buf(entry);
> + }
> +
> +out_unlock:
> + tags_by_pfn_unlock();
> + return true;
> +}
> +
> +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
> +{
> + struct page *page = pfn_to_page(start_pfn);
> + unsigned long pfn;
> + void *tags;
> +
> + tags_by_pfn_lock();
> +
> + for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
> + tags = mte_erase_tags_for_pfn(pfn);
> + if (unlikely(tags)) {
> + /*
> + * Mark the page as tagged so mte_sync_tags() doesn't
> + * clear the tags.
> + */
> + WARN_ON_ONCE(!try_page_mte_tagging(page));
> + mte_copy_page_tags_from_buf(page_address(page), tags);
> + set_page_mte_tagged(page);
> + mte_free_tag_buf(tags);
> + }
> + }
> +
> + tags_by_pfn_unlock();
> +}
> +
> +/*
> + * Note on locking: swap in/out is done with the folio locked, which eliminates
> + * races with mte_save/restore_page_tags_by_swp_entry.
> + */
> +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
> +{
> + void *swap_tags, *pfn_tags;
> + bool saved;
> +
> + /*
> + * mte_restore_page_tags_by_swp_entry() will take care of copying the
> + * tags over.
> + */
> + if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
> + return 0;
> +
> + swap_tags = xa_load(&tags_by_swp_entry, entry.val);
> + if (!swap_tags)
> + return 0;
> +
> + pfn_tags = mte_allocate_tag_buf();
> + if (!pfn_tags)
> + return VM_FAULT_OOM;
> +
> + memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
> + saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
> + if (!saved)
> + mte_free_tag_buf(pfn_tags);
> +
> + return 0;
> +}
> +#endif
> +
> int mte_save_page_tags_by_swp_entry(struct page *page)
> {
> void *tags, *ret;
> @@ -54,6 +160,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
> if (!tags)
> return;
>
> + /* Tags will be restored when tag storage is reserved. */
> + if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
> + return;
> +
> if (try_page_mte_tagging(page)) {
> mte_copy_page_tags_from_buf(page_address(page), tags);
> set_page_mte_tagged(page);
> --
> 2.43.0
>
Hi Peter,
On Thu, Feb 01, 2024 at 08:02:40PM -0800, Peter Collingbourne wrote:
> On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Linux restores tags when a page is swapped in and there are tags associated
> > with the swap entry which the new page will replace. The saved tags are
> > restored even if the page will not be mapped as tagged, to protect against
> > cases where the page is shared between different VMAs, and is tagged in
> > some, but untagged in others. By using this approach, the process can still
> > access the correct tags following an mprotect(PROT_MTE) on the non-MTE
> > enabled VMA.
> >
> > But this poses a challenge for managing tag storage: in the scenario above,
> > when a new page is allocated to be swapped in for the process where it will
> > be mapped as untagged, the corresponding tag storage block is not reserved.
> > mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
> > overwrite data in the tag storage block associated with the new page,
> > leading to data corruption if the block is in use by a process.
> >
> > Get around this issue by saving the tags in a new xarray, this time indexed
> > by the page pfn, and then restoring them when tag storage is reserved for
> > the page.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since rfc v2:
> >
> > * Restore saved tags **before** setting the PG_tag_storage_reserved bit to
> > eliminate a brief window of opportunity where userspace can access uninitialized
> > tags (Peter Collingbourne).
> >
> > arch/arm64/include/asm/mte_tag_storage.h | 8 ++
> > arch/arm64/include/asm/pgtable.h | 11 +++
> > arch/arm64/kernel/mte_tag_storage.c | 12 ++-
> > arch/arm64/mm/mteswap.c | 110 +++++++++++++++++++++++
> > 4 files changed, 140 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index 50bdae94cf71..40590a8c3748 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
> >
> > vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
> > bool *map_pte);
> > +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
> > +
> > +void tags_by_pfn_lock(void);
> > +void tags_by_pfn_unlock(void);
> > +
> > +void *mte_erase_tags_for_pfn(unsigned long pfn);
> > +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
> > +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
> > #else
> > static inline bool tag_storage_enabled(void)
> > {
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 0174e292f890..87ae59436162 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
> > mte_invalidate_tags_area_by_swp_entry(type);
> > }
> >
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
> > +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
> > + struct folio *folio)
> > +{
> > + if (tag_storage_enabled())
> > + return mte_try_transfer_swap_tags(entry, &folio->page);
> > + return 0;
> > +}
> > +#endif
> > +
> > #define __HAVE_ARCH_SWAP_RESTORE
> > static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > {
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index afe2bb754879..ac7b9c9c585c 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> > }
> > }
> >
> > + mte_restore_tags_for_pfn(page_to_pfn(page), order);
> > page_set_tag_storage_reserved(page, order);
> > out_unlock:
> > mutex_unlock(&tag_blocks_lock);
> > @@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
> > struct tag_region *region;
> > unsigned long page_va;
> > unsigned long flags;
> > - int ret;
> > + void *tags;
> > + int i, ret;
> >
> > ret = tag_storage_find_block(page, &start_block, ®ion);
> > if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > @@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
> > /* Avoid writeback of dirty tag cache lines corrupting data. */
> > dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
> >
> > + tags_by_pfn_lock();
> > + for (i = 0; i < (1 << order); i++) {
> > + tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
> > + if (unlikely(tags))
> > + mte_free_tag_buf(tags);
> > + }
> > + tags_by_pfn_unlock();
> > +
> > end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
> >
> > xa_lock_irqsave(&tag_blocks_reserved, flags);
> > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > index 2a43746b803f..e11495fa3c18 100644
> > --- a/arch/arm64/mm/mteswap.c
> > +++ b/arch/arm64/mm/mteswap.c
> > @@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf)
> > kfree(buf);
> > }
> >
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +static DEFINE_XARRAY(tags_by_pfn);
> > +
> > +void tags_by_pfn_lock(void)
> > +{
> > + xa_lock(&tags_by_pfn);
> > +}
> > +
> > +void tags_by_pfn_unlock(void)
> > +{
> > + xa_unlock(&tags_by_pfn);
> > +}
> > +
> > +void *mte_erase_tags_for_pfn(unsigned long pfn)
> > +{
> > + return __xa_erase(&tags_by_pfn, pfn);
> > +}
> > +
> > +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
> > +{
> > + void *entry;
> > + int ret;
> > +
> > + ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
>
> copy_highpage can be called from an atomic context, so it isn't
> currently valid to pass GFP_KERNEL here.
>
> To give one example of a possible atomic context call, copy_pte_range
> will take a PTE spinlock and can call copy_present_pte, which can call
> copy_present_page, which will call copy_user_highpage.
>
> To give another example, __buffer_migrate_folio can call
> spin_lock(&mapping->private_lock), then call folio_migrate_copy, which
> will call folio_copy.
That is very unfortunate on my part. I distinctly remember looking
precisely at copy_page_range() to double check that it doesn't call
copy_*highpage() from an atomic context; I can only assume that I missed
that it's called with the ptl lock held.
With your two examples, and the khugepaged case in patch #31 ("khugepaged:
arm64: Don't collapse MTE enabled VMAs"), it's crystal clear that the
convention for copy_*highpage() is that the function cannot sleep.
There are two issues here: allocating the buffer in memory where the tags
will be copied, and xarray allocating memory for a new entry.
One fix would be to allocate an entire page with GFP_ATOMIC, and use that
as a cache for tag buffers (storing the tags for a page uses 1/32 of a
page). From what little I know about xarray, xarray stores would still have
to be GFP_ATOMIC. This should fix the sleep-in-atomic-context bug. But the
issue I see with this is that a memory allocation can fail, while
copy_*highpage() cannot. Send a fatal signal to the process if memory
allocation fails?
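
To make the idea concrete, here is a very rough sketch of the allocation
side; all the names below are invented, and returning the buffers to the
cache is not handled at all:

	#define TAG_BUFS_PER_PAGE	(PAGE_SIZE / MTE_PAGE_TAG_STORAGE_SIZE)

	static DEFINE_SPINLOCK(tag_buf_cache_lock);
	static void *tag_buf_cache;		/* page-sized backing store */
	static unsigned int tag_buf_cache_used;	/* buffers handed out so far */

	/* Safe to call from atomic context; returns NULL if GFP_ATOMIC fails. */
	static void *mte_alloc_tag_buf_atomic(void)
	{
		void *buf = NULL;

		spin_lock(&tag_buf_cache_lock);
		if (!tag_buf_cache || tag_buf_cache_used == TAG_BUFS_PER_PAGE) {
			/* Refill the cache; GFP_ATOMIC does not sleep. */
			tag_buf_cache = (void *)__get_free_page(GFP_ATOMIC);
			tag_buf_cache_used = 0;
		}
		if (tag_buf_cache) {
			buf = tag_buf_cache +
			      tag_buf_cache_used * MTE_PAGE_TAG_STORAGE_SIZE;
			tag_buf_cache_used++;
		}
		spin_unlock(&tag_buf_cache_lock);

		return buf;
	}
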
Another approach would be to preallocate memory in a preemptible context,
something like copy_*highpage_prepare(), but that would mean a lot more
work: finding all the places where copy_*highpage() is used, adding
copy_*highpage_prepare() outside the critical section, and releasing the
memory in case of failure (like in the copy_pte_range() case - maybe
copy_*highpage_end()?). That's a pretty big maintenance burden for the MM
code. Although maybe other architectures can find a use for it?
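
Hypothetically, the hooks could look something like this, with no-op
defaults for architectures that don't need them (the guard macro and the
exact signatures are made up):

	#ifndef __HAVE_ARCH_COPY_HIGHPAGE_PREPARE
	/* Preallocate whatever copy_highpage() might need; may sleep. */
	static inline int copy_highpage_prepare(struct page *to, struct page *from)
	{
		return 0;
	}

	/* Release any preallocation that ended up unused. */
	static inline void copy_highpage_end(struct page *to)
	{
	}
	#endif

and callers would bracket the critical section roughly like so:

	if (copy_highpage_prepare(to, from))	/* preemptible context */
		return -ENOMEM;
	spin_lock(ptl);				/* whatever lock the caller takes */
	copy_highpage(to, from);		/* cannot sleep, cannot fail */
	spin_unlock(ptl);
	copy_highpage_end(to);			/* drop unused preallocation */
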
And yet another approach is to reserve the needed memory (for the buffer
and in the xarray) when the page is allocated, if it doesn't have tag
storage reserved, regardless of whether the page is allocated as tagged or
not. Then set_pte_at() would free this memory if it ends up unused. But
this would mean reserving memory for possibly all memory allocations in the
system (including for tag storage pages) even if userspace doesn't use tags
at all, though not all pages in the system will have this memory reserved
at the same time. Pretty big downside.

Out of the three, I prefer the first, but it's definitely not perfect; I'll
try to think of something else, maybe I can come up with something better.

What are your thoughts?

Thanks,
Alex
>
> Peter
>
> > + if (ret)
> > + return true;
> > +
> > + tags_by_pfn_lock();
> > +
> > + if (page_tag_storage_reserved(pfn_to_page(pfn))) {
> > + xa_release(&tags_by_pfn, pfn);
> > + tags_by_pfn_unlock();
> > + return false;
> > + }
> > +
> > + entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
> > + if (xa_is_err(entry)) {
> > + xa_release(&tags_by_pfn, pfn);
> > + goto out_unlock;
> > + } else if (entry) {
> > + mte_free_tag_buf(entry);
> > + }
> > +
> > +out_unlock:
> > + tags_by_pfn_unlock();
> > + return true;
> > +}
> > +
> > +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
> > +{
> > + struct page *page = pfn_to_page(start_pfn);
> > + unsigned long pfn;
> > + void *tags;
> > +
> > + tags_by_pfn_lock();
> > +
> > + for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
> > + tags = mte_erase_tags_for_pfn(pfn);
> > + if (unlikely(tags)) {
> > + /*
> > + * Mark the page as tagged so mte_sync_tags() doesn't
> > + * clear the tags.
> > + */
> > + WARN_ON_ONCE(!try_page_mte_tagging(page));
> > + mte_copy_page_tags_from_buf(page_address(page), tags);
> > + set_page_mte_tagged(page);
> > + mte_free_tag_buf(tags);
> > + }
> > + }
> > +
> > + tags_by_pfn_unlock();
> > +}
> > +
> > +/*
> > + * Note on locking: swap in/out is done with the folio locked, which eliminates
> > + * races with mte_save/restore_page_tags_by_swp_entry.
> > + */
> > +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
> > +{
> > + void *swap_tags, *pfn_tags;
> > + bool saved;
> > +
> > + /*
> > + * mte_restore_page_tags_by_swp_entry() will take care of copying the
> > + * tags over.
> > + */
> > + if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
> > + return 0;
> > +
> > + swap_tags = xa_load(&tags_by_swp_entry, entry.val);
> > + if (!swap_tags)
> > + return 0;
> > +
> > + pfn_tags = mte_allocate_tag_buf();
> > + if (!pfn_tags)
> > + return VM_FAULT_OOM;
> > +
> > + memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
> > + saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
> > + if (!saved)
> > + mte_free_tag_buf(pfn_tags);
> > +
> > + return 0;
> > +}
> > +#endif
> > +
> > int mte_save_page_tags_by_swp_entry(struct page *page)
> > {
> > void *tags, *ret;
> > @@ -54,6 +160,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
> > if (!tags)
> > return;
> >
> > + /* Tags will be restored when tag storage is reserved. */
> > + if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
> > + return;
> > +
> > if (try_page_mte_tagging(page)) {
> > mte_copy_page_tags_from_buf(page_address(page), tags);
> > set_page_mte_tagged(page);
> > --
> > 2.43.0
> >
On Thu, Jan 25, 2024 at 8:44 AM Alexandru Elisei
<[email protected]> wrote:
>
> Before enabling MTE tag storage management, make sure that the CMA areas
> have been successfully activated. If a CMA area fails activation, the pages
> are kept as reserved. Reserved pages are never used by the page allocator.
>
> If this happens, the kernel would have to manage tag storage only for some
> of the memory, but not for all memory, and that would make the code
> unreasonably complicated.
>
> Choose to disable tag storage management altogether if a CMA area fails to
> be activated.
>
> Signed-off-by: Alexandru Elisei <[email protected]>
> ---
>
> Changes since v2:
>
> * New patch.
>
> arch/arm64/include/asm/mte_tag_storage.h | 12 ++++++
> arch/arm64/kernel/mte_tag_storage.c | 50 ++++++++++++++++++++++++
> 2 files changed, 62 insertions(+)
>
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index 3c2cd29e053e..7b3f6bff8e6f 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -6,8 +6,20 @@
> #define __ASM_MTE_TAG_STORAGE_H
>
> #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +
> +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +
> +static inline bool tag_storage_enabled(void)
> +{
> + return static_branch_likely(&tag_storage_enabled_key);
> +}
> +
> void mte_init_tag_storage(void);
> #else
> +static inline bool tag_storage_enabled(void)
> +{
> + return false;
> +}
> static inline void mte_init_tag_storage(void)
> {
> }
> diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> index 9a1a8a45171e..d58c68b4a849 100644
> --- a/arch/arm64/kernel/mte_tag_storage.c
> +++ b/arch/arm64/kernel/mte_tag_storage.c
> @@ -19,6 +19,8 @@
>
> #include <asm/mte_tag_storage.h>
>
> +__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> +
> struct tag_region {
> struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> struct range tag_range; /* Tag storage memory, in PFNs. */
> @@ -314,3 +316,51 @@ void __init mte_init_tag_storage(void)
> num_tag_regions = 0;
> pr_info("MTE tag storage region management disabled");
> }
> +
> +static int __init mte_enable_tag_storage(void)
> +{
> + struct range *tag_range;
> + struct cma *cma;
> + int i, ret;
> +
> + if (num_tag_regions == 0)
> + return 0;
> +
> + for (i = 0; i < num_tag_regions; i++) {
> + tag_range = &tag_regions[i].tag_range;
> + cma = tag_regions[i].cma;
> + /*
> + * CMA will keep the pages as reserved when the region fails
> + * activation.
> + */
> + if (PageReserved(pfn_to_page(tag_range->start)))
> + goto out_disabled;
> + }
> +
> + static_branch_enable(&tag_storage_enabled_key);
> + pr_info("MTE tag storage region management enabled");
> +
> + return 0;
> +
> +out_disabled:
> + for (i = 0; i < num_tag_regions; i++) {
> + tag_range = &tag_regions[i].tag_range;
> + cma = tag_regions[i].cma;
> +
> + if (PageReserved(pfn_to_page(tag_range->start)))
> + continue;
> +
> + /* Try really hard to reserve the tag storage. */
> + ret = cma_alloc(cma, range_len(tag_range), 8, true);
> + /*
> + * Tag storage is still in use for data, memory and/or tag
> + * corruption will ensue.
> + */
> + WARN_ON_ONCE(ret);
cma_alloc() returns a struct page pointer, so this condition needs to be
inverted and the type of `ret` changed.
Not sure how it slipped through; this is a compile error with clang.
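For example (untested):

	struct page *pages = cma_alloc(cma, range_len(tag_range), 8, true);

	/* Tag storage is still in use for data. */
	WARN_ON_ONCE(!pages);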
> + }
> + num_tag_regions = 0;
> + pr_info("MTE tag storage region management disabled");
> +
> + return -EINVAL;
> +}
> +arch_initcall(mte_enable_tag_storage);
> --
> 2.43.0
>
On Fri, Feb 2, 2024 at 6:56 AM Alexandru Elisei
<[email protected]> wrote:
>
> Hi Peter,
>
> On Thu, Feb 01, 2024 at 08:02:40PM -0800, Peter Collingbourne wrote:
> > On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > Linux restores tags when a page is swapped in and there are tags associated
> > > with the swap entry which the new page will replace. The saved tags are
> > > restored even if the page will not be mapped as tagged, to protect against
> > > cases where the page is shared between different VMAs, and is tagged in
> > > some, but untagged in others. By using this approach, the process can still
> > > access the correct tags following an mprotect(PROT_MTE) on the non-MTE
> > > enabled VMA.
> > >
> > > But this poses a challenge for managing tag storage: in the scenario above,
> > > when a new page is allocated to be swapped in for the process where it will
> > > be mapped as untagged, the corresponding tag storage block is not reserved.
> > > mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
> > > overwrite data in the tag storage block associated with the new page,
> > > leading to data corruption if the block is in use by a process.
> > >
> > > Get around this issue by saving the tags in a new xarray, this time indexed
> > > by the page pfn, and then restoring them when tag storage is reserved for
> > > the page.
> > >
> > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > ---
> > >
> > > Changes since rfc v2:
> > >
> > > * Restore saved tags **before** setting the PG_tag_storage_reserved bit to
> > > eliminate a brief window of opportunity where userspace can access uninitialized
> > > tags (Peter Collingbourne).
> > >
> > > arch/arm64/include/asm/mte_tag_storage.h | 8 ++
> > > arch/arm64/include/asm/pgtable.h | 11 +++
> > > arch/arm64/kernel/mte_tag_storage.c | 12 ++-
> > > arch/arm64/mm/mteswap.c | 110 +++++++++++++++++++++++
> > > 4 files changed, 140 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > > index 50bdae94cf71..40590a8c3748 100644
> > > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > > @@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
> > >
> > > vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
> > > bool *map_pte);
> > > +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
> > > +
> > > +void tags_by_pfn_lock(void);
> > > +void tags_by_pfn_unlock(void);
> > > +
> > > +void *mte_erase_tags_for_pfn(unsigned long pfn);
> > > +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
> > > +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
> > > #else
> > > static inline bool tag_storage_enabled(void)
> > > {
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index 0174e292f890..87ae59436162 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
> > > mte_invalidate_tags_area_by_swp_entry(type);
> > > }
> > >
> > > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > > +#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
> > > +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
> > > + struct folio *folio)
> > > +{
> > > + if (tag_storage_enabled())
> > > + return mte_try_transfer_swap_tags(entry, &folio->page);
> > > + return 0;
> > > +}
> > > +#endif
> > > +
> > > #define __HAVE_ARCH_SWAP_RESTORE
> > > static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > > {
> > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > index afe2bb754879..ac7b9c9c585c 100644
> > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > @@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> > > }
> > > }
> > >
> > > + mte_restore_tags_for_pfn(page_to_pfn(page), order);
> > > page_set_tag_storage_reserved(page, order);
> > > out_unlock:
> > > mutex_unlock(&tag_blocks_lock);
> > > @@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
> > > struct tag_region *region;
> > > unsigned long page_va;
> > > unsigned long flags;
> > > - int ret;
> > > + void *tags;
> > > + int i, ret;
> > >
> > > ret = tag_storage_find_block(page, &start_block, &region);
> > > if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > > @@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
> > > /* Avoid writeback of dirty tag cache lines corrupting data. */
> > > dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
> > >
> > > + tags_by_pfn_lock();
> > > + for (i = 0; i < (1 << order); i++) {
> > > + tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
> > > + if (unlikely(tags))
> > > + mte_free_tag_buf(tags);
> > > + }
> > > + tags_by_pfn_unlock();
> > > +
> > > end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
> > >
> > > xa_lock_irqsave(&tag_blocks_reserved, flags);
> > > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > > index 2a43746b803f..e11495fa3c18 100644
> > > --- a/arch/arm64/mm/mteswap.c
> > > +++ b/arch/arm64/mm/mteswap.c
> > > @@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf)
> > > kfree(buf);
> > > }
> > >
> > > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > > +static DEFINE_XARRAY(tags_by_pfn);
> > > +
> > > +void tags_by_pfn_lock(void)
> > > +{
> > > + xa_lock(&tags_by_pfn);
> > > +}
> > > +
> > > +void tags_by_pfn_unlock(void)
> > > +{
> > > + xa_unlock(&tags_by_pfn);
> > > +}
> > > +
> > > +void *mte_erase_tags_for_pfn(unsigned long pfn)
> > > +{
> > > + return __xa_erase(&tags_by_pfn, pfn);
> > > +}
> > > +
> > > +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
> > > +{
> > > + void *entry;
> > > + int ret;
> > > +
> > > + ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
> >
> > copy_highpage can be called from an atomic context, so it isn't
> > currently valid to pass GFP_KERNEL here.
> >
> > To give one example of a possible atomic context call, copy_pte_range
> > will take a PTE spinlock and can call copy_present_pte, which can call
> > copy_present_page, which will call copy_user_highpage.
> >
> > To give another example, __buffer_migrate_folio can call
> > spin_lock(&mapping->private_lock), then call folio_migrate_copy, which
> > will call folio_copy.
>
> That is very unfortunate from my part. I distinctly remember looking
> precisely at copy_page_range() to double check that it doesn't call
> copy_*highpage() from an atomic context, I can only assume that I missed
> that it's called with the ptl lock held.
>
> With your two examples, and the khugepaged case in patch #31 ("khugepaged:
> arm64: Don't collapse MTE enabled VMAs"), it's crystal clear that the
> convention for copy_*highpage() is that the function cannot sleep.
>
> There are two issues here: allocating the buffer in memory where the tags
> will be copied, and xarray allocating memory for a new entry.
>
> One fix would be to allocate an entire page with __GFP_ATOMIC, and use that
> as a cache for tag buffers (storing the tags for a page uses 1/32th of a
> page). From what little I know about xarray, xarray stores would still have
> to be GFP_ATOMIC. This should fix the sleeping in atomic context bug. But
> the issue I see with this is that a memory allocation can fail, while
> copy_*highpage() cannot. Send a fatal signal to the process if memory
> allocation fails?
>
> Another approach would be to preallocate memory in a preemptible context,
> something like copy_*highpage_prepare(), but that would mean a lot more
> work: finding all the places where copy_*highpage is used and add
> copy_*highpage_prepare() outside the critical section, releasing the memory
> in case of failure (like in the copy_pte_range() case - maybe
> copy_*highpage_end()?). That's a pretty big maintenance burden for the MM
> code. Although maybe other architectures can find a use for it?
>
> And yet another approach is reserve the needed memory (for the buffer and
> in the xarray) when the page is allocated, if it doesn't have tag storage
> reserved, regardless of the page being allocated as tagged or not. Then in
> set_pte_at() free this memory if it's unused. But this would mean reserving
> memory for possibly all memory allocations in the system (including for tag
> storage pages) if userspace doesn't use tags at all, though not all pages
> in the system will have this memory reserved at the same time. Pretty big
> downside.
>
> Out of the three, I prefer the first, but it's definitely not perfect. I'll
> try to think of something else, maybe I can come up with something better.
>
> What are your thoughts?
>
> Thanks,
> Alex
>
> >
> > Peter
> >
> > > + if (ret)
> > > + return true;
> > > +
> > > + tags_by_pfn_lock();
> > > +
> > > + if (page_tag_storage_reserved(pfn_to_page(pfn))) {
> > > + xa_release(&tags_by_pfn, pfn);
> > > + tags_by_pfn_unlock();
> > > + return false;
> > > + }
> > > +
> > > + entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
> > > + if (xa_is_err(entry)) {
> > > + xa_release(&tags_by_pfn, pfn);
> > > + goto out_unlock;
> > > + } else if (entry) {
> > > + mte_free_tag_buf(entry);
> > > + }
> > > +
> > > +out_unlock:
> > > + tags_by_pfn_unlock();
> > > + return true;
> > > +}
> > > +
> > > +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
> > > +{
> > > + struct page *page = pfn_to_page(start_pfn);
> > > + unsigned long pfn;
> > > + void *tags;
> > > +
> > > + tags_by_pfn_lock();
> > > +
> > > + for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
> > > + tags = mte_erase_tags_for_pfn(pfn);
> > > + if (unlikely(tags)) {
> > > + /*
> > > + * Mark the page as tagged so mte_sync_tags() doesn't
> > > + * clear the tags.
> > > + */
> > > + WARN_ON_ONCE(!try_page_mte_tagging(page));
> > > + mte_copy_page_tags_from_buf(page_address(page), tags);
> > > + set_page_mte_tagged(page);
I hit a WARN_ON_ONCE inside `set_page_mte_tagged` at this call site,
because the page does not have PG_tag_storage_reserved yet.
Swap the order of calls in reserve_tag_storage?
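For example, in reserve_tag_storage(), setting the reserved bit first:

	page_set_tag_storage_reserved(page, order);
	mte_restore_tags_for_pfn(page_to_pfn(page), order);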
> > > + mte_free_tag_buf(tags);
> > > + }
> > > + }
> > > +
> > > + tags_by_pfn_unlock();
> > > +}
> > > +
> > > +/*
> > > + * Note on locking: swap in/out is done with the folio locked, which eliminates
> > > + * races with mte_save/restore_page_tags_by_swp_entry.
> > > + */
> > > +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
> > > +{
> > > + void *swap_tags, *pfn_tags;
> > > + bool saved;
> > > +
> > > + /*
> > > + * mte_restore_page_tags_by_swp_entry() will take care of copying the
> > > + * tags over.
> > > + */
> > > + if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
> > > + return 0;
> > > +
> > > + swap_tags = xa_load(&tags_by_swp_entry, entry.val);
> > > + if (!swap_tags)
> > > + return 0;
> > > +
> > > + pfn_tags = mte_allocate_tag_buf();
> > > + if (!pfn_tags)
> > > + return VM_FAULT_OOM;
> > > +
> > > + memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
> > > + saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
> > > + if (!saved)
> > > + mte_free_tag_buf(pfn_tags);
> > > +
> > > + return 0;
> > > +}
> > > +#endif
> > > +
> > > int mte_save_page_tags_by_swp_entry(struct page *page)
> > > {
> > > void *tags, *ret;
> > > @@ -54,6 +160,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
> > > if (!tags)
> > > return;
> > >
> > > + /* Tags will be restored when tag storage is reserved. */
> > > + if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
> > > + return;
> > > +
> > > if (try_page_mte_tagging(page)) {
> > > mte_copy_page_tags_from_buf(page_address(page), tags);
> > > set_page_mte_tagged(page);
> > > --
> > > 2.43.0
> > >
On Fri, Feb 2, 2024 at 6:56 AM Alexandru Elisei
<[email protected]> wrote:
>
> Hi Peter,
>
> On Thu, Feb 01, 2024 at 08:02:40PM -0800, Peter Collingbourne wrote:
> > On Thu, Jan 25, 2024 at 8:45 AM Alexandru Elisei
> > <[email protected]> wrote:
> > >
> > > Linux restores tags when a page is swapped in and there are tags associated
> > > with the swap entry which the new page will replace. The saved tags are
> > > restored even if the page will not be mapped as tagged, to protect against
> > > cases where the page is shared between different VMAs, and is tagged in
> > > some, but untagged in others. By using this approach, the process can still
> > > access the correct tags following an mprotect(PROT_MTE) on the non-MTE
> > > enabled VMA.
> > >
> > > But this poses a challenge for managing tag storage: in the scenario above,
> > > when a new page is allocated to be swapped in for the process where it will
> > > be mapped as untagged, the corresponding tag storage block is not reserved.
> > > mte_restore_page_tags_by_swp_entry(), when it restores the saved tags, will
> > > overwrite data in the tag storage block associated with the new page,
> > > leading to data corruption if the block is in use by a process.
> > >
> > > Get around this issue by saving the tags in a new xarray, this time indexed
> > > by the page pfn, and then restoring them when tag storage is reserved for
> > > the page.
> > >
> > > Signed-off-by: Alexandru Elisei <[email protected]>
> > > ---
> > >
> > > Changes since rfc v2:
> > >
> > > * Restore saved tags **before** setting the PG_tag_storage_reserved bit to
> > > eliminate a brief window of opportunity where userspace can access uninitialized
> > > tags (Peter Collingbourne).
> > >
> > > arch/arm64/include/asm/mte_tag_storage.h | 8 ++
> > > arch/arm64/include/asm/pgtable.h | 11 +++
> > > arch/arm64/kernel/mte_tag_storage.c | 12 ++-
> > > arch/arm64/mm/mteswap.c | 110 +++++++++++++++++++++++
> > > 4 files changed, 140 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > > index 50bdae94cf71..40590a8c3748 100644
> > > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > > @@ -36,6 +36,14 @@ bool page_is_tag_storage(struct page *page);
> > >
> > > vm_fault_t handle_folio_missing_tag_storage(struct folio *folio, struct vm_fault *vmf,
> > > bool *map_pte);
> > > +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page);
> > > +
> > > +void tags_by_pfn_lock(void);
> > > +void tags_by_pfn_unlock(void);
> > > +
> > > +void *mte_erase_tags_for_pfn(unsigned long pfn);
> > > +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn);
> > > +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order);
> > > #else
> > > static inline bool tag_storage_enabled(void)
> > > {
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index 0174e292f890..87ae59436162 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -1085,6 +1085,17 @@ static inline void arch_swap_invalidate_area(int type)
> > > mte_invalidate_tags_area_by_swp_entry(type);
> > > }
> > >
> > > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > > +#define __HAVE_ARCH_SWAP_PREPARE_TO_RESTORE
> > > +static inline vm_fault_t arch_swap_prepare_to_restore(swp_entry_t entry,
> > > + struct folio *folio)
> > > +{
> > > + if (tag_storage_enabled())
> > > + return mte_try_transfer_swap_tags(entry, &folio->page);
> > > + return 0;
> > > +}
> > > +#endif
> > > +
> > > #define __HAVE_ARCH_SWAP_RESTORE
> > > static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> > > {
> > > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > > index afe2bb754879..ac7b9c9c585c 100644
> > > --- a/arch/arm64/kernel/mte_tag_storage.c
> > > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > > @@ -567,6 +567,7 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp)
> > > }
> > > }
> > >
> > > + mte_restore_tags_for_pfn(page_to_pfn(page), order);
> > > page_set_tag_storage_reserved(page, order);
> > > out_unlock:
> > > mutex_unlock(&tag_blocks_lock);
> > > @@ -595,7 +596,8 @@ void free_tag_storage(struct page *page, int order)
> > > struct tag_region *region;
> > > unsigned long page_va;
> > > unsigned long flags;
> > > - int ret;
> > > + void *tags;
> > > + int i, ret;
> > >
> > > ret = tag_storage_find_block(page, &start_block, &region);
> > > if (WARN_ONCE(ret, "Missing tag storage block for pfn 0x%lx", page_to_pfn(page)))
> > > @@ -605,6 +607,14 @@ void free_tag_storage(struct page *page, int order)
> > > /* Avoid writeback of dirty tag cache lines corrupting data. */
> > > dcache_inval_tags_poc(page_va, page_va + (PAGE_SIZE << order));
> > >
> > > + tags_by_pfn_lock();
> > > + for (i = 0; i < (1 << order); i++) {
> > > + tags = mte_erase_tags_for_pfn(page_to_pfn(page + i));
> > > + if (unlikely(tags))
> > > + mte_free_tag_buf(tags);
> > > + }
> > > + tags_by_pfn_unlock();
> > > +
> > > end_block = start_block + order_to_num_blocks(order, region->block_size_pages);
> > >
> > > xa_lock_irqsave(&tag_blocks_reserved, flags);
> > > diff --git a/arch/arm64/mm/mteswap.c b/arch/arm64/mm/mteswap.c
> > > index 2a43746b803f..e11495fa3c18 100644
> > > --- a/arch/arm64/mm/mteswap.c
> > > +++ b/arch/arm64/mm/mteswap.c
> > > @@ -20,6 +20,112 @@ void mte_free_tag_buf(void *buf)
> > > kfree(buf);
> > > }
> > >
> > > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > > +static DEFINE_XARRAY(tags_by_pfn);
> > > +
> > > +void tags_by_pfn_lock(void)
> > > +{
> > > + xa_lock(&tags_by_pfn);
> > > +}
> > > +
> > > +void tags_by_pfn_unlock(void)
> > > +{
> > > + xa_unlock(&tags_by_pfn);
> > > +}
> > > +
> > > +void *mte_erase_tags_for_pfn(unsigned long pfn)
> > > +{
> > > + return __xa_erase(&tags_by_pfn, pfn);
> > > +}
> > > +
> > > +bool mte_save_tags_for_pfn(void *tags, unsigned long pfn)
> > > +{
> > > + void *entry;
> > > + int ret;
> > > +
> > > + ret = xa_reserve(&tags_by_pfn, pfn, GFP_KERNEL);
> >
> > copy_highpage can be called from an atomic context, so it isn't
> > currently valid to pass GFP_KERNEL here.
> >
> > To give one example of a possible atomic context call, copy_pte_range
> > will take a PTE spinlock and can call copy_present_pte, which can call
> > copy_present_page, which will call copy_user_highpage.
> >
> > To give another example, __buffer_migrate_folio can call
> > spin_lock(&mapping->private_lock), then call folio_migrate_copy, which
> > will call folio_copy.
>
> That is very unfortunate from my part. I distinctly remember looking
> precisely at copy_page_range() to double check that it doesn't call
> copy_*highpage() from an atomic context, I can only assume that I missed
> that it's called with the ptl lock held.
>
> With your two examples, and the khugepaged case in patch #31 ("khugepaged:
> arm64: Don't collapse MTE enabled VMAs"), it's crystal clear that the
> convention for copy_*highpage() is that the function cannot sleep.
>
> There are two issues here: allocating the buffer in memory where the tags
> will be copied, and xarray allocating memory for a new entry.
>
> One fix would be to allocate an entire page with __GFP_ATOMIC, and use that
> as a cache for tag buffers (storing the tags for a page uses 1/32th of a
> page). From what little I know about xarray, xarray stores would still have
> to be GFP_ATOMIC. This should fix the sleeping in atomic context bug. But
> the issue I see with this is that a memory allocation can fail, while
> copy_*highpage() cannot. Send a fatal signal to the process if memory
> allocation fails?
Right, I think I'd have stability concerns about an approach like this.
> Another approach would be to preallocate memory in a preemptible context,
> something like copy_*highpage_prepare(), but that would mean a lot more
> work: finding all the places where copy_*highpage is used and add
> copy_*highpage_prepare() outside the critical section, releasing the memory
> in case of failure (like in the copy_pte_range() case - maybe
> copy_*highpage_end()?). That's a pretty big maintenance burden for the MM
> code. Although maybe other architectures can find a use for it?
This one might not be too bad. There are only a handful of calls to
this function, so it might not be a major ongoing burden. We can
implement copy_highpage() like this:
copy_highpage(to, from) {
	might_sleep();
	copy_highpage_atomic(to, from);
}
then rename the existing implementations to copy_highpage_atomic() and
change atomic-context callers to call copy_highpage_atomic(). That way,
kernels with CONFIG_DEBUG_ATOMIC_SLEEP will detect errors on all
architectures. Then in a later patch, introduce copy_highpage_prepare() (or
whatever) and update the copy_highpage_atomic() callers.
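
The generic version in include/linux/highmem.h could then end up looking
roughly like this (just a sketch, leaving out the KMSAN hook):

	static inline void copy_highpage_atomic(struct page *to, struct page *from)
	{
		char *vfrom, *vto;

		vfrom = kmap_local_page(from);
		vto = kmap_local_page(to);
		copy_page(vto, vfrom);
		kunmap_local(vto);
		kunmap_local(vfrom);
	}

	static inline void copy_highpage(struct page *to, struct page *from)
	{
		might_sleep();
		copy_highpage_atomic(to, from);
	}
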
Peter
> And yet another approach is reserve the needed memory (for the buffer and
> in the xarray) when the page is allocated, if it doesn't have tag storage
> reserved, regardless of the page being allocated as tagged or not. Then in
> set_pte_at() free this memory if it's unused. But this would mean reserving
> memory for possibly all memory allocations in the system (including for tag
> storage pages) if userspace doesn't use tags at all, though not all pages
> in the system will have this memory reserved at the same time. Pretty big
> downside.
>
> Out of the three, I prefer the first, but it's definitely not perfect. I'll
> try to think of something else, maybe I can come up with something better.
>
> What are your thoughts?
>
> Thanks,
> Alex
>
> >
> > Peter
> >
> > > + if (ret)
> > > + return true;
> > > +
> > > + tags_by_pfn_lock();
> > > +
> > > + if (page_tag_storage_reserved(pfn_to_page(pfn))) {
> > > + xa_release(&tags_by_pfn, pfn);
> > > + tags_by_pfn_unlock();
> > > + return false;
> > > + }
> > > +
> > > + entry = __xa_store(&tags_by_pfn, pfn, tags, GFP_ATOMIC);
> > > + if (xa_is_err(entry)) {
> > > + xa_release(&tags_by_pfn, pfn);
> > > + goto out_unlock;
> > > + } else if (entry) {
> > > + mte_free_tag_buf(entry);
> > > + }
> > > +
> > > +out_unlock:
> > > + tags_by_pfn_unlock();
> > > + return true;
> > > +}
> > > +
> > > +void mte_restore_tags_for_pfn(unsigned long start_pfn, int order)
> > > +{
> > > + struct page *page = pfn_to_page(start_pfn);
> > > + unsigned long pfn;
> > > + void *tags;
> > > +
> > > + tags_by_pfn_lock();
> > > +
> > > + for (pfn = start_pfn; pfn < start_pfn + (1 << order); pfn++, page++) {
> > > + tags = mte_erase_tags_for_pfn(pfn);
> > > + if (unlikely(tags)) {
> > > + /*
> > > + * Mark the page as tagged so mte_sync_tags() doesn't
> > > + * clear the tags.
> > > + */
> > > + WARN_ON_ONCE(!try_page_mte_tagging(page));
> > > + mte_copy_page_tags_from_buf(page_address(page), tags);
> > > + set_page_mte_tagged(page);
> > > + mte_free_tag_buf(tags);
> > > + }
> > > + }
> > > +
> > > + tags_by_pfn_unlock();
> > > +}
> > > +
> > > +/*
> > > + * Note on locking: swap in/out is done with the folio locked, which eliminates
> > > + * races with mte_save/restore_page_tags_by_swp_entry.
> > > + */
> > > +vm_fault_t mte_try_transfer_swap_tags(swp_entry_t entry, struct page *page)
> > > +{
> > > + void *swap_tags, *pfn_tags;
> > > + bool saved;
> > > +
> > > + /*
> > > + * mte_restore_page_tags_by_swp_entry() will take care of copying the
> > > + * tags over.
> > > + */
> > > + if (likely(page_mte_tagged(page) || page_tag_storage_reserved(page)))
> > > + return 0;
> > > +
> > > + swap_tags = xa_load(&tags_by_swp_entry, entry.val);
> > > + if (!swap_tags)
> > > + return 0;
> > > +
> > > + pfn_tags = mte_allocate_tag_buf();
> > > + if (!pfn_tags)
> > > + return VM_FAULT_OOM;
> > > +
> > > + memcpy(pfn_tags, swap_tags, MTE_PAGE_TAG_STORAGE_SIZE);
> > > + saved = mte_save_tags_for_pfn(pfn_tags, page_to_pfn(page));
> > > + if (!saved)
> > > + mte_free_tag_buf(pfn_tags);
> > > +
> > > + return 0;
> > > +}
> > > +#endif
> > > +
> > > int mte_save_page_tags_by_swp_entry(struct page *page)
> > > {
> > > void *tags, *ret;
> > > @@ -54,6 +160,10 @@ void mte_restore_page_tags_by_swp_entry(swp_entry_t entry, struct page *page)
> > > if (!tags)
> > > return;
> > >
> > > + /* Tags will be restored when tag storage is reserved. */
> > > + if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page)))
> > > + return;
> > > +
> > > if (try_page_mte_tagging(page)) {
> > > mte_copy_page_tags_from_buf(page_address(page), tags);
> > > set_page_mte_tagged(page);
> > > --
> > > 2.43.0
> > >
Hi Evgenii,
On Fri, Feb 02, 2024 at 02:30:00PM -0800, Evgenii Stepanov wrote:
> On Thu, Jan 25, 2024 at 8:44 AM Alexandru Elisei
> <[email protected]> wrote:
> >
> > Before enabling MTE tag storage management, make sure that the CMA areas
> > have been successfully activated. If a CMA area fails activation, the pages
> > are kept as reserved. Reserved pages are never used by the page allocator.
> >
> > If this happens, the kernel would have to manage tag storage only for some
> > of the memory, but not for all memory, and that would make the code
> > unreasonably complicated.
> >
> > Choose to disable tag storage management altogether if a CMA area fails to
> > be activated.
> >
> > Signed-off-by: Alexandru Elisei <[email protected]>
> > ---
> >
> > Changes since v2:
> >
> > * New patch.
> >
> > arch/arm64/include/asm/mte_tag_storage.h | 12 ++++++
> > arch/arm64/kernel/mte_tag_storage.c | 50 ++++++++++++++++++++++++
> > 2 files changed, 62 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index 3c2cd29e053e..7b3f6bff8e6f 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -6,8 +6,20 @@
> > #define __ASM_MTE_TAG_STORAGE_H
> >
> > #ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +
> > +DECLARE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> > +
> > +static inline bool tag_storage_enabled(void)
> > +{
> > + return static_branch_likely(&tag_storage_enabled_key);
> > +}
> > +
> > void mte_init_tag_storage(void);
> > #else
> > +static inline bool tag_storage_enabled(void)
> > +{
> > + return false;
> > +}
> > static inline void mte_init_tag_storage(void)
> > {
> > }
> > diff --git a/arch/arm64/kernel/mte_tag_storage.c b/arch/arm64/kernel/mte_tag_storage.c
> > index 9a1a8a45171e..d58c68b4a849 100644
> > --- a/arch/arm64/kernel/mte_tag_storage.c
> > +++ b/arch/arm64/kernel/mte_tag_storage.c
> > @@ -19,6 +19,8 @@
> >
> > #include <asm/mte_tag_storage.h>
> >
> > +__ro_after_init DEFINE_STATIC_KEY_FALSE(tag_storage_enabled_key);
> > +
> > struct tag_region {
> > struct range mem_range; /* Memory associated with the tag storage, in PFNs. */
> > struct range tag_range; /* Tag storage memory, in PFNs. */
> > @@ -314,3 +316,51 @@ void __init mte_init_tag_storage(void)
> > num_tag_regions = 0;
> > pr_info("MTE tag storage region management disabled");
> > }
> > +
> > +static int __init mte_enable_tag_storage(void)
> > +{
> > + struct range *tag_range;
> > + struct cma *cma;
> > + int i, ret;
> > +
> > + if (num_tag_regions == 0)
> > + return 0;
> > +
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + cma = tag_regions[i].cma;
> > + /*
> > + * CMA will keep the pages as reserved when the region fails
> > + * activation.
> > + */
> > + if (PageReserved(pfn_to_page(tag_range->start)))
> > + goto out_disabled;
> > + }
> > +
> > + static_branch_enable(&tag_storage_enabled_key);
> > + pr_info("MTE tag storage region management enabled");
> > +
> > + return 0;
> > +
> > +out_disabled:
> > + for (i = 0; i < num_tag_regions; i++) {
> > + tag_range = &tag_regions[i].tag_range;
> > + cma = tag_regions[i].cma;
> > +
> > + if (PageReserved(pfn_to_page(tag_range->start)))
> > + continue;
> > +
> > + /* Try really hard to reserve the tag storage. */
> > + ret = cma_alloc(cma, range_len(tag_range), 8, true);
> > + /*
> > + * Tag storage is still in use for data, memory and/or tag
> > + * corruption will ensue.
> > + */
> > + WARN_ON_ONCE(ret);
>
> cma_alloc returns (page *), so this condition needs to be inverted,
> and the type of `ret` changed.
> Not sure how it slipped through, this is a compile error with clang.
Checked just now: it's only a warning with gcc, so I must have missed it.
Will fix.

Thanks,
Alex
>
> > + }
> > + num_tag_regions = 0;
> > + pr_info("MTE tag storage region management disabled");
> > +
> > + return -EINVAL;
> > +}
> > +arch_initcall(mte_enable_tag_storage);
> > --
> > 2.43.0
> >